What metrics do you alert on? How do you distinguish between error due to faulty...

majewsky · 2024-08-26T09:48:12 1724665692

Taking my managed container image registry service as an example.

- The only critical alert that can actually page people fires when the blackbox test fails. Every 30 seconds, it downloads a test image; if the contents don't match the expectation, an alert is raised (with some delay).

- Warning alerts are mostly for any errors being returned from background tasks, but these are only monitored during business hours.
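
For illustration only, here is a minimal Python sketch of a blackbox check like the one described above. It is not the poster's actual code. The `BlackboxProbe` name, the SHA-256 comparison, and the "alert after N consecutive failures" rule are all assumptions standing in for "contents don't match" and "with some delay"; `fetch` is a hypothetical callable that downloads the test image.

```python
import hashlib
from typing import Callable


class BlackboxProbe:
    """Compares fetched content to a known digest and alerts after repeated failures."""

    def __init__(
        self,
        fetch: Callable[[], bytes],      # hypothetical: downloads the test image
        expected_sha256: str,            # digest of the known-good test image
        failures_before_alert: int = 3,  # assumption: models the alert delay
    ) -> None:
        self.fetch = fetch
        self.expected_sha256 = expected_sha256
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def run_once(self) -> bool:
        """Run one probe and return True if the critical alert should be firing."""
        try:
            digest = hashlib.sha256(self.fetch()).hexdigest()
            ok = digest == self.expected_sha256
        except Exception:
            # Any download error counts as a failed probe.
            ok = False
        # A single success resets the counter, so brief blips never page anyone.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failures_before_alert
```

A scheduler would call `run_once()` every 30 seconds and page only while it returns `True`.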

perfect_wave · 2024-08-26T14:39:02 1724683142

I don't see how that is separated from the underlying infra. If the network, a server, or some dependency goes down, the blackbox test will fail and you'll get paged.

silisili · 2024-08-26T19:04:13 1724699053

You can test for this. For example, we had routines that ran on repeated HTTP failures and fetched 5 or so of the top US websites. If those fail too, the failure is reclassified from an application error to an infra one.
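
A rough Python sketch of that idea, under stated assumptions: the function names are hypothetical, the reference URLs are supplied by the caller, and "those fail too" is interpreted as "every reference site is unreachable". Any HTTP response, even an error status, is treated as proof that the network path works.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL produced any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so the network is up.
        return True
    except OSError:
        # DNS failure, connection refused, timeout, and similar.
        return False


def classify_failure(
    reference_urls: Iterable[str],
    probe: Callable[[str], bool] = is_reachable,
) -> str:
    """Call after repeated failures of your own endpoint.

    Returns "infra" if every reference site is unreachable too,
    otherwise "application".
    """
    results = [probe(url) for url in reference_urls]
    if results and not any(results):
        return "infra"
    return "application"
```

Passing `probe` as a parameter keeps the classification logic testable without real network access.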

dullcrisp · 2024-08-26T09:30:25 1724664625

Define SLOs based on what can realistically be achieved with the underlying infrastructure, and only alert if those SLOs are breached?
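
As a minimal sketch of what "alert only on SLO breach" could look like, assuming a simple availability SLO measured as the success ratio over some window (the function name and the 99.9% target in the usage note are illustrative, not from the comment):

```python
def slo_breached(successes: int, total: int, target: float) -> bool:
    """Return True if the observed success ratio is below the SLO target.

    `target` is a fraction such as 0.999, chosen to reflect what the
    underlying infrastructure can realistically deliver.
    """
    if total == 0:
        # No traffic in the window: nothing to judge, so do not alert.
        return False
    return successes / total < target
```

With a 0.999 target, 995 successes out of 1000 requests is a breach, while 9995 out of 10000 is not.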

sgarland · 2024-08-26T10:42:49 1724668969

If your endpoint is failing, it might be you. If everyone’s endpoint is failing, it’s almost certainly not you.