
What metrics do you alert on? How do you distinguish between an error caused by a faulty database client and one caused by a database disk failure?



Taking my managed container image registry service as an example.

- The only critical alert that can actually page people is if the blackbox test fails. Every 30 seconds, it downloads a test image and if the contents don't match the expectation, an alert is raised (with some delay).

- Warning alerts are mostly for any errors being returned from background tasks, but these are only monitored during business hours.
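
A rough sketch of what such a blackbox check could look like, in case it helps. The class name, the `fetch` callable, and the `failures_before_alert` threshold are made up for illustration (the "some delay" above is modelled here as a number of consecutive failed runs); this is not the actual registry code.

```python
import hashlib
from typing import Callable


class BlackboxProbe:
    """Hypothetical probe: fetch a test artifact and compare it to a known digest."""

    def __init__(self, fetch: Callable[[], bytes], expected_sha256: str,
                 failures_before_alert: int = 3):
        self.fetch = fetch                        # assumption: returns the image bytes
        self.expected_sha256 = expected_sha256
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def run_once(self) -> bool:
        """Run one check; return True if the critical alert should fire."""
        try:
            digest = hashlib.sha256(self.fetch()).hexdigest()
            ok = digest == self.expected_sha256
        except Exception:
            # A failed download counts the same as wrong contents.
            ok = False
        # Reset the counter on success, otherwise count one more failure.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failures_before_alert
```

A scheduler would call `run_once()` every 30 seconds and page only when it returns True, so a single blip does not wake anyone up.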


I don't see how that is separated from the underlying infra. If the network, a server, or some dependency goes down, the blackbox test will fail and you'll get paged.


You can test for this. For example, we had routines that ran on repeated HTTP failures and fetched 5 or so of the top US websites. If those fetches failed too, the incident moved from an application error to an infra one.
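
Something along these lines, as a minimal sketch. The URL list, the function names, and the "all reference sites must fail" rule are assumptions for illustration, not what we actually ran.

```python
import urllib.request
from typing import Callable, Iterable

# Assumption: any handful of large, independent sites would do here.
REFERENCE_URLS = (
    "https://www.google.com",
    "https://www.amazon.com",
    "https://www.wikipedia.org",
)


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        # DNS failures, timeouts, and HTTP errors all count as "not ok".
        return False


def classify_failure(reference_urls: Iterable[str] = REFERENCE_URLS,
                     probe: Callable[[str], bool] = http_ok) -> str:
    """Call after repeated HTTP failures against your own service."""
    results = [probe(url) for url in reference_urls]
    # If every reference site is unreachable too, blame the infra.
    if results and not any(results):
        return "infra"
    return "application"
```

The `probe` parameter is only there so the classification logic can be exercised without touching the network.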


Define SLOs based on what can realistically be achieved with underlying infrastructure, only alert if those SLOs are breached?
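
One way that could look, as a hedged sketch: track the success ratio over the last N checks and alert only when it drops below the target. The class name, the window size, and the target value are illustrative assumptions; a real setup would more likely express this in the monitoring system's own rule language.

```python
from collections import deque


class SloWindow:
    """Hypothetical sliding window over the most recent check results."""

    def __init__(self, target: float, window_size: int):
        self.target = target                      # e.g. 0.8 means 80% of checks must pass
        self.results = deque(maxlen=window_size)  # oldest results fall off automatically

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def breached(self) -> bool:
        # Do not alert until the window is full, to avoid noise at startup.
        if len(self.results) < self.results.maxlen:
            return False
        return sum(self.results) / len(self.results) < self.target
```

The target should be set from what the underlying infrastructure can realistically deliver, so that expected dependency hiccups stay inside the budget and do not page anyone.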


If your endpoint is failing, it might be you. If everyone’s endpoint is failing, it’s almost certainly not you.



