Things I Learned Managing Site Reliability (2017) (zwischenzugs.com)
109 points by bshanks on Feb 26, 2018 | 13 comments

I really like the point about runbooks/playbooks.

> We ended up embedding these dashboard within Confluence runbooks/playbooks followed by diagnosing/triaging, resolving, and escalation information. We also ended up associating these runbooks/playbooks with the alerts and had the links outputted into the operational chat along with the alert in question so people could easily follow it back.

When I worked for Amazon as a developer, I was required to write a playbook for every microservice I developed. The playbook had to be so detailed that, in theory, any site reliability engineer with no knowledge of the service should be able to read the playbook and perform the following activities:

- Understand what the service does.

- Learn all the curl commands to run to test each service component in isolation and see which ones are not behaving as expected.

- Learn how to connect to the actual physical/virtual/cloud systems that keep the service running.

- Learn which log files to check for evidence of problems.

- Learn which configuration files to edit.

- Learn how to restart the service.

- Learn how to roll back the service to an earlier known good version.

- Learn the resolutions to common issues seen earlier.

- Work through a checklist of activities to ensure all components are in good health.

- Find out which development team of ours to page if the issue remains unresolved.

It took a lot of documentation and excellent organization of such documentation to keep the services up and running.
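To make the list above concrete, here is a minimal sketch of a completeness check that could run before a playbook is accepted. The section names and the `Playbook` type are my own paraphrase of the bullets, not any real Amazon format.

```python
from dataclasses import dataclass, field

# Hypothetical section names, paraphrasing the checklist above.
REQUIRED_SECTIONS = (
    "overview",
    "component_tests",
    "access",
    "log_files",
    "config_files",
    "restart",
    "rollback",
    "known_issues",
    "health_checklist",
    "escalation",
)


@dataclass
class Playbook:
    service: str
    # Maps a section name to its written content.
    sections: dict[str, str] = field(default_factory=dict)


def missing_sections(playbook: Playbook) -> list[str]:
    """Return required sections that are absent or blank, in checklist order."""
    return [
        name
        for name in REQUIRED_SECTIONS
        if not playbook.sections.get(name, "").strip()
    ]
```

A check like this only proves the sections exist; whether an SRE can actually follow them still needs a human review.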

A far-out old employer of mine decided that their standard format for alerts, sent by applications to the central monitoring system, would include a field for a URL pointing to some relevant documentation.

I think this was mostly pushed through by sysadmins annoyed at getting alerts from new applications that didn't mean anything to them.
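I don't have the original format to hand, so the following is only a sketch of the idea: the monitoring system refuses any alert that doesn't carry a usable documentation link. The field names (`doc_url` and friends) are assumptions for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Alert:
    source: str    # application that raised the alert
    subject: str
    priority: str
    doc_url: str   # hypothetical name for the documentation link field


def validate_alert(alert: Alert) -> list[str]:
    """Return a list of problems; an empty list means the alert is accepted."""
    problems = []
    parsed = urlparse(alert.doc_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("doc_url must be an absolute http(s) URL")
    if not alert.subject.strip():
        problems.append("subject must not be empty")
    return problems
```

Rejecting at the point of ingestion puts the cost on the team shipping the new application rather than on whoever gets paged.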

When you get an alert, you have to first understand the alert, and then you have to figure out what to do about it. The majority of alerts, when people don't craft them according to a standard/policy, look like this:

  Subject: Disk usage high
  Priority: High
    There is a problem in cluster ABC.
    Disk utilization above 90%.

It's a pain in the ass to go figure out what is actually affected, why it's happening, and track down some kind of runbook that describes how to fix this specific case (because it may vary from customer to customer, not to mention project to project). This is usually the state of alerts until a single person (who isn't a manager; managers hate cleaning up inefficiencies) gets so sick and tired of it that they take the weekend to overhaul one alert at a time to provide better insight as to what is going on and how to fix it. Any docs written for those alerts are never updated by anyone but this lone individual.

Providing a link to a runbook makes resolving issues a lot faster. It's even better if the link is to a Wiki page, so you can edit it if the runbook isn't up to date.
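As a rough sketch of what the overhauled version of that alert might look like, here is a formatter that forces the author to say what is affected and where the runbook lives. All the parameter names, the host, and the wiki URL in the usage example are made up.

```python
def render_alert(*, subject, priority, cluster, host, mount,
                 used_pct, threshold_pct, runbook_url):
    """Render an alert body that names the affected resource and links a runbook."""
    lines = [
        f"Subject: {subject}",
        f"Priority: {priority}",
        f"  Affected: {mount} on {host} (cluster {cluster})",
        f"  Observed: disk utilization {used_pct}% (threshold {threshold_pct}%)",
        f"  Runbook: {runbook_url}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_alert(
        subject="Disk usage high",
        priority="High",
        cluster="ABC",
        host="db-07",                      # hypothetical host
        mount="/var/lib/data",             # hypothetical mount point
        used_pct=93,
        threshold_pct=90,
        runbook_url="https://wiki.example.com/runbooks/disk-usage-high",
    ))
```

Making every argument keyword-only and required means an alert can't be defined without someone at least thinking about the runbook link.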

So did the system work, and how did it work?

Basically you are saying you were required to be really diligent about the playbooks and put effort in to get them right.

Did people really put that effort in? Was it worth it? If so, what elements of the culture/organisation/process made people do the right thing when it is so much easier for busy people to get sloppy?

The answer is "Yes" to all of your questions.

Regarding the question about culture, yes, busy people often get sloppy. But when a P1 alert comes in because a site reliability engineer could not resolve the issue by following the playbook, it reflects badly on the team, and all affected stakeholders ask a lot of questions about why the playbook was deficient (when a service goes down at Amazon it may affect multiple other teams). Nobody wants to be in a situation like this. In fact, no developer wants to be woken up at 2 a.m. because a service went down and the issue could not be fixed by the on-call SRE. So it is in their interest to write good and detailed playbooks.

That sounds like a great process there. It staggers me how much people a) underestimate the investment required to maintain that kind of documentation, and b) underestimate how much value it brings. It's like brushing your teeth.

> It’s far more important to have a ticketing system that functions reliably and supports your processes than the other way round.

The most efficient ticketing systems I have ever seen were heavily customized in-house. When they moved to a completely different product, productivity in addressing tickets plummeted. They stopped generating tickets to deal with it.

> After process, documentation is the most important thing, and the two are intimately related.

If you have two people who are constantly on call to address issues because nobody else knows how to deal with them, you are a victim of a lack of documentation. Even a monkey can repair a space shuttle if they have a good manual.

I partly rely on incident reports and issues as part of my documentation. Sometimes you will get an issue like "disk filling up", and maybe someone will troubleshoot it and resolve it with a summary comment of "cleaned up free space in X process". Instead of making that the end of it, create a new issue which describes the problem and steps to resolve in detail. Update the issue over time as necessary. Add a tag to the issue called 'runbook'. Then mark related issues as duplicates of this one issue. It's kind of horrible, but it seamlessly integrates runbooks with your issue tracking.
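Here's a toy model of that workflow, just to show the lookup it enables: from any incident, follow the duplicate links until you hit the issue tagged 'runbook'. The `Issue` type and its fields are invented for this sketch and don't correspond to any particular tracker's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Issue:
    key: str
    summary: str
    tags: set[str] = field(default_factory=set)
    duplicate_of: Optional[str] = None  # key of the issue this one duplicates


def find_runbook(issues: dict[str, Issue], key: str) -> Optional[Issue]:
    """Follow duplicate_of links from `key` until an issue tagged 'runbook' is found."""
    seen = set()
    current = issues.get(key)
    while current is not None and current.key not in seen:
        if "runbook" in current.tags:
            return current
        seen.add(current.key)  # guards against accidental duplicate cycles
        current = issues.get(current.duplicate_of) if current.duplicate_of else None
    return None
```

An incident that isn't linked to anything returns `None`, which is a decent signal that a runbook issue still needs writing.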

> Even a monkey can repair a space shuttle if they have a good manual

I would like to point out that the dependency chain for repairing the space shuttle (or worse: microservices) can turn the need for understanding (or authoring) one document into understanding 12+ documents, or risk turning that one document into a "wall of text," copy-paste hell, and/or something out of date.

Capturing the contextual knowledge required to make an administration task straightforward can easily mean losing the forest for the trees.

I would almost rather automate the troubleshooting steps than have to write sufficiently specific English to express what one should do in given situations, with the caveat that such automation takes longer to write than the documentation it replaces.
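For a sense of scale, this is about the smallest automated version of one troubleshooting step I can think of: the "is the disk nearly full?" check from a disk runbook. The 90% default and the returned field names are assumptions; `shutil.disk_usage` is the standard-library call doing the work.

```python
import shutil


def check_disk(path="/", threshold_pct=90.0):
    """Report how full the filesystem holding `path` is, and whether it crosses the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_pct = 100.0 * usage.used / usage.total
    return {
        "path": path,
        "used_pct": round(used_pct, 1),
        "over_threshold": used_pct >= threshold_pct,
    }


if __name__ == "__main__":
    print(check_disk("/"))
```

The check is the easy part. Encoding what to do next (which files are safe to delete, who to page) is where the English or the automation gets long.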

Yeah, that's exactly what we found - we created a JIRA project called 'DOCS', which made search trivial:

'docs disk filling up'

It's pretty much organizing 101: study situation, plan, track well, document well but in a practical sense (write docs that people will actually read), get feedback from everybody, learn from your mistakes, admit your mistakes, and make the system and process better going forward.

This was posted a while back, and you can see the original thread with more comments here https://news.ycombinator.com/item?id=14031180

I may be out of touch with current affairs, but I don't think I've encountered a single workplace where documentation has worked. Sometimes because people were only hired to put out fires, sometimes because there was no sufficiently customized ticketing system, sometimes because they simply didn't know how to abstract their tasks into well-written documents.

And in many cases because people thought they might be out of a job if they put their solutions in print. I'm guessing managers still need to counter those tendencies actively if they want documentation to happen. Plenty of good pointers in this article, I found.

> I was trusted (culture, again!)

This is the kicker - and the rarity. I don't think it's all trust, though. When your boss already knows going into Q1 that he's going to be fired in Q2 if he doesn't get 10 specific (and myopically short-term) agenda items addressed, it doesn't matter how much he trusts you, you're going to be focusing on only the things that have the appearance of ROI after a few hours of work, no matter how inefficient they are in the long term.
