Good observability on operational metrics, as much automation as possible, clear processes and playbooks for escalations.
Most engineering orgs I’ve seen require Eng oncalls. The approach above is a simple, inexpensive way to minimize time spent on investigations, e.g., a 5% movement in a metric. Teams spend lots of time asking “is this a real-world effect, and if so, should we be worried?”
Yes, especially with something that hasn’t been tested out. You might see a 3% increase in, say, latency. How do you know if that is a telemetry bug, a temporary effect on actual user experience, or a permanent regression?
The options are: 1) prioritize it as a real regression (if we don’t fix this, big trouble because the UX is degraded); 2) wait a day or two for more data; or 3) debug the telemetry and pipelines.
You also need to provide guidelines for what is not normal variation (in cases 2 and 3 above). Is 3% ok? What is the effect on UX and/or business outcomes?
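As a rough illustration of that “normal variation” guideline, here’s a minimal Python sketch that flags a daily metric movement only when it sits well outside the metric’s recent day-over-day spread. The window, the 3-sigma cutoff, and the latency numbers are made-up assumptions for illustration, not a recommendation.

```python
from statistics import mean, stdev

def is_unusual_movement(history: list[float], today: float, sigmas: float = 3.0) -> bool:
    """history: recent daily values of the metric (e.g. p95 latency), oldest first."""
    if len(history) < 3:
        return False  # not enough data to estimate normal variation
    # Day-over-day relative changes observed in the recent past.
    changes = [(b - a) / a for a, b in zip(history, history[1:]) if a != 0]
    if len(changes) < 2:
        return False
    todays_change = (today - history[-1]) / history[-1]
    # Flag only if today's move is well outside the usual spread.
    return abs(todays_change - mean(changes)) > sigmas * stdev(changes)

# Hypothetical example: a ~3% jump on a metric that normally moves <0.5%/day gets flagged.
latency_p95 = [100.2, 100.5, 100.1, 100.4, 100.3, 100.6, 100.2,
               100.5, 100.4, 100.1, 100.3, 100.5, 100.2, 100.4]
print(is_unusual_movement(latency_p95, today=103.4))  # True
```

The statistics only tell you the movement is unusual; whether 3% is “ok” still has to come from the UX and business-impact guidelines.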
And playbooks - “I am on call but don’t have deep expertise in this metric.” When and how do you escalate? What is the oncall’s responsibility, and which situations require immediate escalation vs. continued monitoring? When do you declare a SEV?
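To make the playbook point concrete, here’s a hypothetical sketch of encoding that guidance as data, so an oncall without deep expertise in a metric still gets a consistent “monitor vs. escalate vs. SEV” recommendation. All metric names, owners, and thresholds below are invented for illustration, not a real policy.

```python
from dataclasses import dataclass

@dataclass
class Playbook:
    metric: str
    owner: str                 # who to escalate to for this metric (hypothetical)
    monitor_below_pct: float   # below this movement: keep monitoring
    sev_above_pct: float       # above this movement: declare a SEV now

PLAYBOOKS = {
    "p95_latency": Playbook("p95_latency", "perf-oncall@", monitor_below_pct=2.0, sev_above_pct=10.0),
    "checkout_success_rate": Playbook("checkout_success_rate", "payments-oncall@", 0.5, 2.0),
}

def recommend_action(metric: str, movement_pct: float, days_observed: int) -> str:
    pb = PLAYBOOKS.get(metric)
    if pb is None:
        return "No playbook: escalate to the metrics platform team to rule out a telemetry bug."
    if movement_pct >= pb.sev_above_pct:
        return f"Declare a SEV and page {pb.owner} immediately."
    if movement_pct < pb.monitor_below_pct and days_observed < 2:
        return "Within normal variation so far: keep monitoring and re-check tomorrow."
    return f"Escalate to {pb.owner} for a closer look (check telemetry and pipelines first)."

print(recommend_action("p95_latency", movement_pct=3.0, days_observed=1))
```

The point is less the code than the fact that the escalation thresholds and owners are written down somewhere the oncall can find them.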
Observability tools are only as good as the process behind them.