> We ended up embedding these dashboards within Confluence runbooks/playbooks, followed by diagnosing/triaging, resolution, and escalation information. We also ended up associating these runbooks/playbooks with the alerts and had the links output into the operational chat along with the alert in question, so people could easily follow them back.
When I worked at Amazon as a developer, I was required to write a playbook for every microservice I developed. The playbook had to be so detailed that, in theory, any site reliability engineer with no knowledge of the service should be able to read it and perform the following activities:
- Understand what the service does.
- Learn all the curl commands to run to test each service component in isolation and see which ones are not behaving as expected.
- Learn how to connect to the actual physical/virtual/cloud systems that keep the service running.
- Learn which log files to check for evidence of problems.
- Learn which configuration files to edit.
- Learn how to restart the service.
- Learn how to roll back the service to an earlier known-good version.
- Learn resolutions to common issues seen earlier.
- Work through a checklist of activities to ensure all components are in good health.
- Find out which development team of ours to page if the issue remains unresolved.
It took a lot of documentation, and excellent organization of that documentation, to keep the services up and running.
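The "curl commands to test each component in isolation" part of such a playbook can be sketched as a small script rather than a list of commands to copy-paste. This is a minimal sketch, not how Amazon actually does it; the component names and health-check URLs are made up for illustration:

```python
# Minimal sketch of a playbook health checklist.
# Component names and URLs below are hypothetical examples.
from urllib.request import urlopen
from urllib.error import URLError

CHECKS = {
    "frontend": "http://localhost:8080/health",
    "orders-db": "http://localhost:8081/health",
}

def check(url, fetch=urlopen):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with fetch(url, timeout=5) as resp:
            return resp.status == 200
    except URLError:
        return False

def run_checklist(checks, fetch=urlopen):
    """Run every check; report which components look unhealthy."""
    return {name: check(url, fetch) for name, url in checks.items()}
```

An on-call engineer then sees at a glance which component is misbehaving instead of re-deriving the curl invocations under pressure.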
I think this was mostly pushed through by sysadmins annoyed at getting alerts from new applications that didn't mean anything to them.
    Subject: Disk usage high
    There is a problem in cluster ABC.
    Disk utilization above 90%.
Providing a link to a runbook makes resolving issues a lot faster. It's even better if the link is to a Wiki page, so you can edit it if the runbook isn't up to date.
Basically you are saying you were required to be really diligent about the playbooks and put effort in to get them right.
Did people really put that effort in? Was it worth it? If so, what elements of the culture/organisation/process made people do the right thing when it is so much easier for busy people to get sloppy?
Regarding the question about culture: yes, busy people often get sloppy. But when a P1 alert comes in because a site reliability engineer could not resolve the issue by following the playbook, it reflects badly on the team, and a lot of questions are asked by all affected stakeholders (when a service goes down at Amazon it may affect multiple other teams) about why the playbook was deficient. Nobody wants to be in that situation. In fact, no developer wants to be woken up at 2 a.m. because a service went down and the issue could not be fixed by the on-call SRE. So it is in their interest to write good, detailed playbooks.
The most efficient ticketing systems I have ever seen were heavily customized in-house. When they moved to a completely different product, productivity in addressing tickets plummeted. They stopped generating tickets to deal with it.
> After process, documentation is the most important thing, and the two are intimately related.
If you have two people who are constantly on call to address issues because nobody else knows how to deal with it, you are a victim of a lack of documentation. Even a monkey can repair a space shuttle if they have a good manual.
I partly rely on incident reports and issues as part of my documentation. Sometimes you will get an issue like "disk filling up", and maybe someone will troubleshoot it and resolve it with a summary comment of "cleaned up free space in X process". Instead of making that the end of it, create a new issue which describes the problem and steps to resolve in detail. Update the issue over time as necessary. Add a tag to the issue called 'runbook'. Then mark related issues as duplicates of this one issue. It's kind of horrible, but it seamlessly integrates runbooks with your issue tracking.
I would like to point out that the dependency chain for repairing the space shuttle (or worse: microservices) can turn the need to understand (or author) one document into understanding 12+ documents, or run the risk of turning a document into a "wall of text", a copy-paste hell, or something out of date.
Capturing the contextual knowledge required to make an administration task straightforward can easily turn the forest into the trees.
I would almost rather automate the troubleshooting steps than have to write sufficiently specific English to express what one should do in a given situation, with the caveat that such automation takes longer to write than the documentation would.
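For the "disk filling up" example mentioned earlier, that automation might look like the sketch below. The 90% threshold matches the example alert; the log directory, the `*.log` glob, and the delete-oldest-logs policy are assumptions for illustration, and `dry_run` defaults to True so the script only reports what it would remove:

```python
# Sketch: automate one troubleshooting step for a "disk filling up" alert.
# Paths and the cleanup policy below are hypothetical.
import shutil
from pathlib import Path

THRESHOLD = 0.90  # matches the "above 90%" alert in the example

def disk_usage_fraction(path="/"):
    """Fraction of the filesystem at `path` that is in use (0.0-1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def stale_files(log_dir, max_files=100):
    """Oldest log files first -- the candidates for cleanup."""
    files = sorted(Path(log_dir).glob("*.log"), key=lambda p: p.stat().st_mtime)
    return files[:max_files]

def remediate(path="/", log_dir="/var/log/myservice", dry_run=True):
    """If usage exceeds the threshold, list (or delete) the oldest logs."""
    if disk_usage_fraction(path) < THRESHOLD:
        return []
    victims = stale_files(log_dir)
    if not dry_run:
        for f in victims:
            f.unlink()
    return victims
```

Even a dry-run version like this is useful: it encodes the diagnosis ("which files are eating the disk?") that the English runbook would otherwise have to spell out.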
'docs disk filling up'
And in many cases because people thought they might be out of a job if they put their solutions in print. I'm guessing managers still need to counter those tendencies actively if they want documentation to happen. Plenty of good pointers in this article, I found.
This is the kicker - and the rarity. I don't think it's all trust, though. When your boss already knows going into Q1 that he's going to be fired in Q2 if he doesn't get 10 specific (and myopically short-term) agenda items addressed, it doesn't matter how much he trusts you, you're going to be focusing on only the things that have the appearance of ROI after a few hours of work, no matter how inefficient they are in the long term.