A lot of behaviour in large distributed systems is emergent, and synthetic load tests alone often aren't enough to reveal what will happen at hundreds of thousands of QPS.
Metrics and tracing are how you get a handle on this and make fixes before emergent behaviour boils over into an outage.
Performance, round-trip times, requests processed per $timeunit, error rates for both the application in question AND every service it depends on, ... - the list is nearly endless. But for every time-series dimension you collect, you really also want its value distribution.
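As a minimal sketch of why the distribution matters (the latency data here is synthetic, generated for illustration), the mean alone hides tail behaviour that percentiles expose:

```python
import random
import statistics

# Hypothetical raw request latencies in milliseconds for one scrape interval.
random.seed(42)
latencies_ms = [random.lognormvariate(3, 0.5) for _ in range(10_000)]

# The mean alone hides the shape of the distribution...
mean = statistics.mean(latencies_ms)

# ...so record percentiles as well: p50, p95 and p99 reveal the tail.
# statistics.quantiles(n=100) returns the 99 percentile cut points.
cuts = statistics.quantiles(latencies_ms, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean:.1f}ms p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```

In a real system you'd export these from a histogram metric rather than raw samples, but the point stands: a healthy-looking mean can coexist with a p99 several times higher.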
Increased error rates or spiking tail latency are the first symptoms of an oncoming problem, and they tend to go hand in hand: error handling is by definition outside the happy path and as such often more expensive. Over longer timespans, 30-day, 60-day or even 90-day windows give very useful insights into peak resource use trends.
Spotting trends is important in capacity planning.
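A trend like that can be made visible with something as simple as a rolling peak over the window. A sketch, using made-up daily peak memory numbers (the data and the 30-day window size are assumptions, not from any real fleet):

```python
# Hypothetical daily peak memory samples (GB) over 90 days:
# slow linear growth plus a weekend traffic bump.
daily_peaks_gb = [40 + d * 0.1 + (5 if d % 7 in (5, 6) else 0) for d in range(90)]

# 30-day rolling maximum: the peak-usage figure capacity planning cares about.
window = 30
rolling_peak = [
    max(daily_peaks_gb[i - window:i])
    for i in range(window, len(daily_peaks_gb) + 1)
]

# A steadily rising rolling peak is the signal to provision ahead of demand.
growth = rolling_peak[-1] - rolling_peak[0]
print(f"30-day peak grew by {growth:.1f} GB over the period")
```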
Why did the last five-line code change increase GC time by 3%? Why is the correlation between traffic and memory .7 instead of the usual .5?
Why are 10% of the fleet's hosts logging more lines than the others during high network congestion events?
Questions like these lead to a much better understanding of how your systems work and how to improve them.