Hacker News

I really like the point about runbooks/playbooks.

> We ended up embedding these dashboard within Confluence runbooks/playbooks followed by diagnosing/triaging, resolving, and escalation information. We also ended up associating these runbooks/playbooks with the alerts and had the links outputted into the operational chat along with the alert in question so people could easily follow it back.

When I used to work at Amazon as a developer, I was required to write a playbook for every microservice I developed. The playbook had to be so detailed that, in theory, any site reliability engineer with no knowledge of the service should be able to read it and perform the following activities:

- Understand what the service does.

- Learn all the curl commands to run to test each service component in isolation and see which ones are not behaving as expected.

- Learn how to connect to the actual physical/virtual/cloud systems that keep the service running.

- Learn which log files to check for evidence of problems.

- Learn which configuration files to edit.

- Learn how to restart the service.

- Learn how to roll back the service to an earlier known-good version.

- Learn the resolutions to common issues seen before.

- Work through a checklist of activities to ensure all components are in good health.

- Find out which development team of ours to page if the issue remains unresolved.
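The "test each service component in isolation" step above can be sketched as a tiny health-check harness. This is a minimal illustration, not Amazon's tooling: the component names and probe functions are invented, and in a real playbook each probe would be a concrete curl command against a documented endpoint.

```python
# Minimal health-check harness: run an isolated probe per component
# and report which ones are unhealthy. All component names and probe
# bodies below are hypothetical, for illustration only.

def failing_components(probes):
    """Run each probe; collect names whose probe raised or returned False."""
    failures = []
    for name, probe in probes.items():
        try:
            ok = probe()
        except Exception:
            ok = False  # a crashing probe counts as a failing component
        if not ok:
            failures.append(name)
    return failures

# In practice each probe would wrap an HTTP check, e.g. the curl command
# from the playbook. Here we simulate outcomes with lambdas.
probes = {
    "frontend": lambda: True,       # healthy
    "order-queue": lambda: False,   # simulated bad response
    "billing-db": lambda: 1 / 0,    # simulated crash
}
print(failing_components(probes))   # ['order-queue', 'billing-db']
```

The point of the structure is that an on-call SRE can run the whole list and immediately narrow the problem to specific components instead of guessing.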

It took a lot of documentation and excellent organization of such documentation to keep the services up and running.

An old employer of mine from way back decided that their standard format for alerts, sent by applications to the central monitoring system, would include a field for a URL pointing to some relevant documentation.

I think this was mostly pushed through by sysadmins annoyed at getting alerts from new applications that didn't mean anything to them.

When you get an alert, you have to first understand the alert, and then you have to figure out what to do about it. The majority of alerts, when people don't craft them according to a standard/policy, look like this:

  Subject: Disk usage high
  Priority: High
    There is a problem in cluster ABC.
    Disk utilization above 90%.

It's a pain in the ass to go figure out what is actually affected, why it's happening, and track down some kind of runbook that describes how to fix this specific case (because it may vary from customer to customer, not to mention project to project). This is usually the state of alerts until a single person (who isn't a manager; managers hate cleaning up inefficiencies) gets so sick and tired of it that they take a weekend to overhaul one alert at a time to provide better insight into what is going on and how to fix it. After that, the docs for those alerts are never updated by anyone but this lone individual.

Providing a link to a runbook makes resolving issues a lot faster. It's even better if the link is to a wiki page, so you can edit it when the runbook isn't up to date.
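As a sketch of the difference, here is what an alert built to that kind of standard might look like, with the affected host, the measured value, and a runbook link carried in the alert itself. The field names, hostnames, and URL are all invented for the example, not any particular monitoring system's schema.

```python
# Build an alert body that carries enough context to act on:
# what is affected, the measured value versus its threshold, and a
# runbook URL. All concrete values here are hypothetical.

def format_alert(cluster, host, metric, value, threshold, runbook_url):
    return (
        f"Subject: {metric} at {value}% on {host} (threshold {threshold}%)\n"
        f"Priority: High\n"
        f"Cluster: {cluster}\n"
        f"Runbook: {runbook_url}\n"
    )

print(format_alert(
    cluster="ABC",
    host="abc-node-07",
    metric="Disk utilization",
    value=92,
    threshold=90,
    runbook_url="https://wiki.example.com/runbooks/disk-usage",
))
```

Compared with the "Disk usage high ... cluster ABC" alert above, the responder no longer has to hunt for which host is affected or where the fix is documented.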

So did the system work, and how did it work?

Basically you are saying you were required to be really diligent about the playbooks and put effort in to get them right.

Did people really put that effort in? Was it worth it? If so, what elements of the culture/organisation/process made people do the right thing when it's so much easier for busy people to get sloppy?

The answer is "Yes" to all of your questions.

Regarding the question about culture, yes, busy people often get sloppy. But when a P1 alert comes in because a site reliability engineer could not resolve the issue by following the playbook, it reflects badly on the team, and a lot of questions are asked by all affected stakeholders (when a service goes down at Amazon, it may affect multiple other teams) about why the playbook was deficient. Nobody wants to be in a situation like this. In fact, no developer wants to be woken up at 2 a.m. because a service went down and the issue could not be fixed by the on-call SRE. So it is in their interest to write good, detailed playbooks.

That sounds like a great process there. It staggers me how much people a) underestimate the investment required to maintain that kind of documentation, and b) underestimate how much value it brings. It's like brushing your teeth.
