Good observability on operational metrics, as much automation as possible, clear processes and playbooks for escalations.
Most engineering orgs I’ve seen require Eng oncalls. The approach above is a simple, inexpensive way to minimize time spent on investigations, e.g., a 5% movement in a metric. Teams spend lots of time asking “is this a real-world effect, and if so, should we be worried?”
Yes, especially with something that hasn’t been tested out. You might see a 3% increase in, say, latency. How do you know if that is a telemetry bug, a temporary effect on actual user experience, or a permanent regression?
The options are: 1) prioritize it as a real regression (if we don’t fix this, big trouble because the UX is degraded); 2) wait a day or two for more data; or 3) debug the telemetry and pipelines.
You also need to provide guidelines for what is not normal variation (in cases 2 and 3 above). Is 3% ok? What is the effect on UX and/or business outcomes?
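As a rough illustration of that “normal variation” guideline, here’s a minimal Python sketch that flags a daily metric movement only when it sits well outside the metric’s recent day-over-day spread. The window, the 3-sigma cutoff, and the latency numbers are made-up assumptions for illustration, not a recommendation.

```python
from statistics import mean, stdev

def is_unusual_movement(history: list[float], today: float, sigmas: float = 3.0) -> bool:
    """history: recent daily values of the metric (e.g. p95 latency), oldest first."""
    if len(history) < 3:
        return False  # not enough data to estimate normal variation
    # Day-over-day relative changes observed in the recent past.
    changes = [(b - a) / a for a, b in zip(history, history[1:]) if a != 0]
    if len(changes) < 2:
        return False
    todays_change = (today - history[-1]) / history[-1]
    # Flag only if today's move is well outside the usual spread.
    return abs(todays_change - mean(changes)) > sigmas * stdev(changes)

# Hypothetical example: a ~3% jump on a metric that normally moves <0.5%/day gets flagged.
latency_p95 = [100.2, 100.5, 100.1, 100.4, 100.3, 100.6, 100.2,
               100.5, 100.4, 100.1, 100.3, 100.5, 100.2, 100.4]
print(is_unusual_movement(latency_p95, today=103.4))  # True
```

The statistics only tell you the movement is unusual; whether 3% is “ok” still has to come from the UX and business-impact guidelines.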
And playbooks - “I am on call but don’t have deep expertise in this metric.” When and how do you escalate? What is the oncall’s responsibility, and which situations require immediate escalation vs. continued monitoring? When do you declare a SEV?
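To make the playbook point concrete, here’s a hypothetical sketch of encoding that guidance as data, so an oncall without deep expertise in a metric still gets a consistent “monitor vs. escalate vs. SEV” recommendation. All metric names, owners, and thresholds below are invented for illustration, not a real policy.

```python
from dataclasses import dataclass

@dataclass
class Playbook:
    metric: str
    owner: str                 # who to escalate to for this metric (hypothetical)
    monitor_below_pct: float   # below this movement: keep monitoring
    sev_above_pct: float       # above this movement: declare a SEV now

PLAYBOOKS = {
    "p95_latency": Playbook("p95_latency", "perf-oncall@", monitor_below_pct=2.0, sev_above_pct=10.0),
    "checkout_success_rate": Playbook("checkout_success_rate", "payments-oncall@", 0.5, 2.0),
}

def recommend_action(metric: str, movement_pct: float, days_observed: int) -> str:
    pb = PLAYBOOKS.get(metric)
    if pb is None:
        return "No playbook: escalate to the metrics platform team to rule out a telemetry bug."
    if movement_pct >= pb.sev_above_pct:
        return f"Declare a SEV and page {pb.owner} immediately."
    if movement_pct < pb.monitor_below_pct and days_observed < 2:
        return "Within normal variation so far: keep monitoring and re-check tomorrow."
    return f"Escalate to {pb.owner} for a closer look (check telemetry and pipelines first)."

print(recommend_action("p95_latency", movement_pct=3.0, days_observed=1))
```

The point is less the code than the fact that the escalation thresholds and owners are written down somewhere the oncall can find them.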
Observability tools are only as good as the process behind them.