
Debugging Incidents in Google's Distributed Systems - yarapavan
https://queue.acm.org/detail.cfm?id=3404974
======
wbsun
In my experience, correctly identifying the culprit is the most difficult part,
but during an incident, mitigation is what matters most. Most incidents I ran
into happened after a job rollout. To find and mitigate such an issue, a
distributed system needs monitoring, replicated drainable services, and canary
rollouts with rollback. With these in place, an SRE oncaller doesn't need as
much knowledge of the system as the devs do in order to handle an incident.
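The canary-then-rollback decision described above can be sketched as a simple error-rate comparison. This is a hypothetical illustration, not any real SRE tooling; the function name, thresholds, and signature are all made up for the example:

```python
def should_rollback(baseline_errors: int, baseline_total: int,
                    canary_errors: int, canary_total: int,
                    max_ratio: float = 2.0, min_rate: float = 0.01) -> bool:
    """Hypothetical canary check: roll back if the canary's error rate is
    both above an absolute floor (min_rate) and more than max_ratio times
    the baseline's error rate. Thresholds are illustrative only."""
    if canary_total == 0:
        return False  # no canary traffic yet, nothing to judge
    base_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Guard against a zero baseline rate with a tiny epsilon so the ratio
    # comparison still triggers on any real canary regression.
    return canary_rate >= min_rate and canary_rate > max_ratio * max(base_rate, 1e-9)
```

For example, a canary serving 1000 requests with 40 errors against a baseline of 5 errors in 10000 requests would trip the check, while a canary with 1 error in 1000 would not; this is the kind of mechanical signal that lets an oncaller roll back without deep knowledge of the service.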

------
tlarkworthy
This seems to miss step 1, which is verifying that there is an issue at all
(excluding false positives). Maybe the process is sometimes abandoned at step 3
(pinpointing the exact location of the issue). I guess only true positives make
it to the postmortem.

~~~
jedmeyers
The fact that the oncaller has been notified without an actual issue in the
production system is an issue in itself: incorrect monitoring/alerting. And it
will be addressed at the next production meeting.

