
I'm trying to relate this to my experiences. The best I can make of it is that burnout comes from dealing with either the same types of problems, or new problems at a rate that's higher than old problems get resolved.

I've been in those situations. My solution was to ensure that enough effort went into systematically resolving long-known issues in a way that not only solves them but also reduces the number of similar new issues. If the strategy is instead predominantly firefighting, with 'no capacity' available for longer-term solutions, there is no end in sight unless/until you lose users or requests.

I am curious what the split is of problems being related to:

1. error rates, how many 9s per end-user-action, and per service endpoint

2. performance, request (and per-user-action) latency

3. incorrect responses, bugs/bad-data

4. incorrect responses, stale-data

5. any other categories
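To make category 1 concrete, error rates per endpoint can be converted into "number of nines" of availability. A minimal sketch (the endpoint names and counts below are hypothetical, purely for illustration):

```python
import math

def nines(success_ratio: float) -> float:
    """Express a success ratio (e.g. 0.999) as a 'number of nines'."""
    if success_ratio >= 1.0:
        return float("inf")
    return -math.log10(1.0 - success_ratio)

# Hypothetical per-endpoint counts: (total requests, errored requests)
counts = {
    "/checkout": (100_000, 52),
    "/search":   (500_000, 5_120),
}

for endpoint, (total, errors) in counts.items():
    ratio = (total - errors) / total
    print(f"{endpoint}: {ratio:.4%} success ~ {nines(ratio):.1f} nines")
```

Tracking this per end-user action rather than only per endpoint matters, because one user action often fans out to several endpoints and the failures compound.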

Another strategy that worked well was to fix not the problems reported but the problems known. This is like the physicist looking for keys under the streetlamp instead of where they were dropped. Tracing a bug report to a root cause and then fixing it is very time-consuming. That of course needs to continue, but if sufficient effort is put into resolving known issues, such as latency or error rates of key endpoints, it can have an overall lifting effect, reducing problems in general.

A specific example: performance effort had gone toward the average latency of the most frequently used endpoints. I redirected it toward reducing the p99 latency of the worst offenders instead. This made the system more reliable in general and paid off in a trend toward fewer problem reports, though it's hard to directly attribute one to the other.
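Finding the "worst offenders" by p99 is a simple ranking exercise. A dependency-free sketch using nearest-rank percentiles (endpoint names and latency distributions are made up to illustrate the idea of a fast average with a slow tail):

```python
import math
import random

def p99(samples):
    """Nearest-rank 99th percentile (no interpolation)."""
    s = sorted(samples)
    k = max(0, math.ceil(0.99 * len(s)) - 1)
    return s[k]

random.seed(0)
# Hypothetical latency samples in ms: /checkout has a lower mean
# than /search but a rare 500 ms tail, so it ranks worse by p99.
latencies = {
    "/search":   [random.expovariate(1 / 50) for _ in range(10_000)],
    "/checkout": [random.expovariate(1 / 30)
                  + (500 if random.random() < 0.02 else 0)
                  for _ in range(10_000)],
}

worst_first = sorted(latencies, key=lambda e: p99(latencies[e]), reverse=True)
for e in worst_first:
    print(f"{e}: p99 = {p99(latencies[e]):.0f} ms")
```

Ranking by p99 rather than mean surfaces exactly the endpoints whose tail behavior generates problem reports even when their averages look healthy.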



