
This is a concept I've had to explain to far too many teams over the years: 0.001% of requests failing as a (mostly) random sample of all requests is very different from a 0.001% subset of requests that will fail (nearly) every time until the underlying issue is mitigated. They look the same on a high-level dashboard, but they are completely different conditions in terms of how the customer will feel it, and knowing which kind of problem you have also guides the investigation and troubleshooting process.
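A rough way to tell the two cases apart is to group failures by some request key and check whether they pile up on a few keys. This is only a sketch: the `(key, succeeded)` log shape, the function name `failure_profile`, and both thresholds are my own assumptions, and the right key (user, tenant, payload hash, route) depends on your system.

```python
from collections import defaultdict


def failure_profile(requests, min_requests=5, sticky_threshold=0.9):
    """Summarize failures from an iterable of (key, succeeded) pairs.

    A key is called "sticky" when it has at least `min_requests` requests
    and at least `sticky_threshold` of them failed. Both cutoffs are
    illustrative placeholders, not recommended values.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for key, succeeded in requests:
        totals[key] += 1
        if not succeeded:
            failures[key] += 1

    total_requests = sum(totals.values())
    total_failures = sum(failures.values())

    # Keys that fail (nearly) every time they are tried.
    sticky_keys = sorted(
        key
        for key, count in totals.items()
        if count >= min_requests and failures[key] / count >= sticky_threshold
    )
    sticky_failures = sum(failures[key] for key in sticky_keys)

    return {
        "overall_failure_rate": (
            total_failures / total_requests if total_requests else 0.0
        ),
        "sticky_keys": sticky_keys,
        "sticky_share_of_failures": (
            sticky_failures / total_failures if total_failures else 0.0
        ),
    }
```

Two logs with the same overall failure rate can come back with very different `sticky_keys`: an empty list points toward scattered, retry-friendly failures, while one or two keys owning nearly all the failures points toward a specific broken customer or input.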



In addition, some requests are more important than others.

`/assets/app_bundle.js` failing will most likely be visible immediately and make everything else useless, unless you've been clever and only used JS to enhance the website/app experience rather than replace it.

`/metrics/user-activity` failing won't (or at least shouldn't) have any impact on the user experience.

`/stripe/payment-succeeded-callback` failing could have disastrous consequences for the user without being immediately visible while it's failing.
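One way to encode that difference is to give each endpoint its own alerting policy instead of one global error-rate threshold. The sketch below is hypothetical throughout: the `EndpointPolicy` fields, the `alert_action` function, and every threshold number are placeholders I made up to show the shape, and only the three paths come from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointPolicy:
    user_impact: str    # "high" or "low" (hypothetical labels)
    self_evident: bool  # would someone notice the failure without an alert?


POLICIES = {
    "/assets/app_bundle.js": EndpointPolicy("high", True),
    "/metrics/user-activity": EndpointPolicy("low", False),
    "/stripe/payment-succeeded-callback": EndpointPolicy("high", False),
}

# Placeholder policy for paths that have not been classified yet.
DEFAULT_POLICY = EndpointPolicy("low", False)


def alert_action(path, failure_rate):
    """Return "page", "ticket", or "none" for an observed failure rate.

    The thresholds are illustrative assumptions, not recommendations.
    """
    policy = POLICIES.get(path, DEFAULT_POLICY)
    if policy.user_impact == "high" and not policy.self_evident:
        # Silent and damaging: any failure at all is worth waking someone.
        return "page" if failure_rate > 0 else "none"
    if policy.user_impact == "high":
        return "page" if failure_rate > 0.001 else "none"
    return "ticket" if failure_rate > 0.05 else "none"
```

With these made-up numbers, a 1% failure rate on `/metrics/user-activity` produces nothing, while a single failed `/stripe/payment-succeeded-callback` pages someone, because that is the failure nobody will notice on their own.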



