> We ended up embedding these dashboards within Confluence runbooks/playbooks, followed by diagnosing/triaging, resolving, and escalation information. We also associated these runbooks/playbooks with the alerts and had the links posted into the operational chat alongside the alert in question so people could easily follow them back.
When I worked at Amazon as a developer, I was required to write a playbook for every microservice I developed. The playbook had to be detailed enough that, in theory, any site reliability engineer with no knowledge of the service could read it and perform the following activities:
- Understand what the service does.
- Learn all the curl commands to run to test each service component in isolation and see which ones are not behaving as expected.
- Learn how to connect to the actual physical/virtual/cloud systems that keep the service running.
- Learn which log files to check for evidence of problems.
- Learn which configuration files to edit.
- Learn how to restart the service.
- Learn how to roll back the service to an earlier known good version.
- Learn the resolutions to common issues seen in the past.
- Work through a checklist of activities to verify that all components are in good health.
- Find out which development team of ours to page if the issue remains unresolved.
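The health-checklist step above can be sketched in a few lines. This is a hypothetical illustration, not Amazon's actual tooling: the component names, the curl probes, and their endpoints are all invented, and a real playbook would list the service's real commands.

```python
# Hypothetical sketch of a playbook "health checklist": run each
# component's check command (e.g. the curl probes documented in the
# playbook) and report which components fail. Names/endpoints invented.
import subprocess

CHECKS = [
    ("frontend", ["curl", "-sf", "http://localhost:8080/ping"]),
    ("cache",    ["curl", "-sf", "http://localhost:6379/health"]),
]

def run_checklist(checks, runner=subprocess.run):
    """Return the names of components whose check command failed."""
    failed = []
    for name, cmd in checks:
        if runner(cmd, capture_output=True).returncode != 0:
            failed.append(name)
    return failed
```

Injecting the `runner` makes the checklist itself testable without live endpoints, which is handy when the playbook is reviewed or rehearsed outside an incident.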
It took a lot of documentation and excellent organization of such documentation to keep the services up and running.
I think this was mostly pushed through by sysadmins annoyed at getting alerts from new applications that didn't mean anything to them.
> Subject: Disk usage high
>
> There is a problem in cluster ABC.
> Disk utilization above 90%.
Providing a link to a runbook makes resolving issues a lot faster. It's even better if the link is to a wiki page, so you can edit the runbook on the spot when it turns out to be out of date.
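The alert-plus-runbook-link pattern described above can be sketched as follows. This is a minimal assumption-laden example: the wiki URL, alert identifiers, and function name are invented placeholders, not any particular alerting system's API.

```python
# Sketch of attaching a runbook link to an alert before it is posted
# to operational chat. The wiki URL and alert names are placeholders.
RUNBOOKS = {
    "disk_usage_high": "https://wiki.example.com/runbooks/disk-usage-high",
}

def format_alert(alert_id, cluster, detail):
    """Build the chat message for an alert, including its runbook link."""
    runbook = RUNBOOKS.get(alert_id, "(no runbook linked yet)")
    return (
        f"Subject: {alert_id} in cluster {cluster}\n"
        f"{detail}\n"
        f"Runbook: {runbook}"
    )
```

The fallback text for an unmapped alert doubles as a nudge to write the missing runbook, which matches the culture point made below.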
Basically, you are saying you were required to be really diligent about the playbooks and put in the effort to get them right.
Did people really put that effort in? Was it worth it? If so, what elements of the culture/organisation/process made people do the right thing when it is so much easier for busy people to get sloppy?
Regarding the question about culture: yes, busy people often get sloppy. But when a P1 alert comes in because a site reliability engineer could not resolve the issue by following the playbook, it reflects badly on the team, and a lot of questions are asked by all affected stakeholders (when a service goes down at Amazon, it may affect multiple other teams) about why the playbook was deficient. Nobody wants to be in that situation. In fact, no developer wants to be woken up at 2 a.m. because a service went down and the on-call SRE could not fix the issue. So it is in their interest to write good, detailed playbooks.