
Things I Learned Managing Site Reliability (2017) - bshanks
https://zwischenzugs.com/2017/04/04/things-i-learned-managing-site-reliability-for-some-of-the-worlds-busiest-gambling-sites/
======
foo101
I really like the point about runbooks/playbooks.

> We ended up embedding these dashboards within Confluence runbooks/playbooks
> followed by diagnosing/triaging, resolving, and escalation information. We
> also ended up associating these runbooks/playbooks with the alerts and had
> the links outputted into the operational chat along with the alert in
> question so people could easily follow it back.

When I used to work for Amazon as a developer, I was required to write a
playbook for every microservice I developed. The playbook had to be so
detailed that, in theory, any site reliability engineer with no knowledge of
the service should be able to read the playbook and perform the following
activities (see the sketch after the list):

\- Understand what the service does.

\- Learn all the curl commands to run to test each service component in
isolation and see which ones are not behaving as expected.

\- Learn how to connect to the actual physical/virtual/cloud systems that keep
the service running.

\- Learn which log files to check for evidence of problems.

\- Learn which configuration files to edit.

\- Learn how to restart the service.

\- Learn how to roll back the service to an earlier known-good version.

\- Learn the resolutions to common issues seen earlier.

\- Work through a checklist of activities to ensure all components are in
good health.

\- Find out which development team of ours to page if the issue remains
unresolved.
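
A rough sketch of how such a requirement might be checked, assuming playbooks
are plain-text files. The section names below are my own labels for the
activities above, not Amazon's actual template:

    import sys

    # My own labels for the activities listed above (illustrative only).
    REQUIRED_SECTIONS = [
        "Service overview",
        "Component test commands",  # the curl commands, per component
        "Host access",
        "Log locations",
        "Configuration files",
        "Restart procedure",
        "Rollback procedure",
        "Common issues",
        "Health checklist",
        "Escalation",  # which development team to page
    ]

    def missing_sections(playbook_text):
        """Return the required sections that never appear in the text."""
        return [s for s in REQUIRED_SECTIONS if s not in playbook_text]

    if __name__ == "__main__":
        with open(sys.argv[1]) as f:
            missing = missing_sections(f.read())
        if missing:
            sys.exit("Playbook incomplete; missing: " + ", ".join(missing))
        print("Playbook covers all required sections.")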

It took a lot of documentation, and excellent organization of that
documentation, to keep the services up and running.

~~~
twic
A far-out old employer of mine decided that their standard format for alerts,
sent by applications to the central monitoring system, would include a field
for a URL pointing to some relevant documentation.

I think this was mostly pushed through by sysadmins annoyed at getting alerts
from new applications that didn't mean anything to them.
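
Something like this, in spirit (the endpoint, field names, and schema here
are invented for illustration, not that employer's actual format):

    import json
    from urllib.request import Request, urlopen

    # Hypothetical webhook endpoint for the central monitoring system.
    MONITORING_URL = "https://monitoring.example.com/alerts"

    def send_alert(subject, priority, message, runbook_url):
        payload = {
            "subject": subject,
            "priority": priority,
            "message": message,
            "runbook_url": runbook_url,  # the mandatory documentation field
        }
        req = Request(
            MONITORING_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urlopen(req)  # fire-and-forget; real code would check the response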

~~~
peterwwillis
When you get an alert, you have to first understand the alert, and then you
have to figure out what to do about it. The majority of alerts, when people
don't craft them according to a standard/policy, look like this:

      Subject: Disk usage high
      Priority: High
      Message:
        There is a problem in cluster ABC.
        Disk utilization above 90%.
        Host 1.2.3.4.

It's a pain in the ass to go figure out what is actually affected, why it's
happening, and track down some kind of runbook that describes how to fix this
specific case (because it may vary from customer to customer, not to mention
project to project). This is usually the state of alerts until a single person
(who isn't a manager; managers hate cleaning up inefficiencies) gets so sick
and tired of it that they take the weekend to overhaul one alert at a time to
provide better insight into what is going on and how to fix it. And the docs
for those alerts are never updated by anyone but this lone individual.

Providing a link to a runbook makes resolving issues a lot faster. It's even
better if the link is to a Wiki page, so you can edit it if the runbook isn't
up to date.
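
For contrast, the same alert crafted with a runbook link might look something
like this (the details are invented for illustration):

      Subject: [cluster ABC] Disk above 90% on 1.2.3.4 (/var)
      Priority: High
      Message:
        /var on host 1.2.3.4 in cluster ABC is at 92% and growing.
        Impact: service writes fail once the disk is full.
      Runbook: https://wiki.example.com/runbooks/abc-disk-usage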

------
peterwwillis
> It’s far more important to have a ticketing system that functions reliably
> and supports your processes than the other way round.

The most efficient ticketing systems I have ever seen were heavily customized
in-house. When those teams moved to a completely different product,
productivity in addressing tickets plummeted, and people coped by simply not
filing tickets anymore.

> After process, documentation is the most important thing, and the two are
> intimately related.

If you have two people who are constantly on call to address issues because
nobody else knows how to deal with them, you are a victim of a lack of
documentation. Even a monkey can repair a space shuttle if they have a good
manual.

I partly rely on incident reports and issues as part of my documentation.
Sometimes you will get an issue like "disk filling up", and maybe someone will
troubleshoot it and resolve it with a summary comment of "cleaned up free
space in X process". Instead of making that the end of it, create a new issue
which describes the problem and steps to resolve in detail. Update the issue
over time as necessary. Add a tag to the issue called 'runbook'. Then mark
related issues as duplicates of this one issue. It's kind of horrible, but it
seamlessly integrates runbooks with your issue tracking.
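
Concretely, the tagged issue might look something like this (the issue
numbers and fields are invented for illustration):

      Issue #1234  [tag: runbook]
      Title: Runbook: disk filling up on hosts running process X
      Body: symptom, how to confirm it (df -h), and the step-by-step
            cleanup procedure, updated whenever the procedure changes.
      Duplicates of this issue: #1101, #1156, #1203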

~~~
mdaniel
_Even a monkey can repair a space shuttle if they have a good manual_

I would like to point out that the dependency chain for repairing the space
shuttle (or worse: microservices) can turn the need to understand (or author)
one document into the need to understand 12+ documents, or risk turning a
document into a "wall of text", copy-paste hell, and/or something that is
perpetually out of date.

Capturing the contextual knowledge required to make an administration task
straightforward can easily lose the forest for the trees.

I would almost rather automate the troubleshooting steps than have to write
sufficiently specific English to express what one should do in a given
situation, with the caveat that such automation takes longer to write than
the equivalent English.
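
A sketch in that spirit: automating the "disk filling up" check from upthread
instead of writing it out in English. The path and threshold are assumptions
for illustration:

    import shutil
    from pathlib import Path

    def check_disk(path="/var", threshold=0.90):
        """Report disk usage and, if high, the likeliest offenders."""
        usage = shutil.disk_usage(path)
        used = usage.used / usage.total
        if used < threshold:
            print(f"{path}: {used:.0%} used, OK")
            return
        # Over threshold: size up the immediate subdirectories as leads.
        # (May need elevated permissions to stat everything under /var.)
        sizes = []
        for p in Path(path).iterdir():
            if p.is_dir():
                total = sum(f.stat().st_size
                            for f in p.rglob("*") if f.is_file())
                sizes.append((total, str(p)))
        print(f"{path}: {used:.0%} used, largest subdirectories:")
        for total, name in sorted(sizes, reverse=True)[:5]:
            print(f"  {total / 1e9:7.2f} GB  {name}")

    if __name__ == "__main__":
        check_disk()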

------
tabtab
It's pretty much Organizing 101: study the situation, plan, track well,
document well but in a practical sense (write docs that people will actually
read), get feedback from everybody, learn from your mistakes, admit your
mistakes, and make the system and process better going forward.

------
willejs
This was posted a while back, and you can see the original thread with more
comments here
[https://news.ycombinator.com/item?id=14031180](https://news.ycombinator.com/item?id=14031180)

------
stareatgoats
I may be out of touch with current affairs, but I don't think I've encountered
a single workplace where documentation has worked. Sometimes because people
were only hired to put out fires, sometimes because there was no sufficiently
customized ticketing system, sometimes because they simply didn't know how to
abstract their tasks into well-written documents.

And in many cases because people thought they might be out of a job if they
put their solutions in print. I'm guessing managers still need to counter
those tendencies actively if they want documentation to happen. Plenty of good
pointers in this article, I found.

------
commandlinefan
> I was trusted (culture, again!)

This is the kicker - and the rarity. I don't think it's all trust, though.
When your boss already knows going into Q1 that he's going to be fired in Q2
if he doesn't get 10 specific (and myopically short-term) agenda items
addressed, it doesn't matter how much he trusts you: you're going to be
focusing only on the things that have the appearance of ROI after a few hours
of work, no matter how inefficient they are in the long term.

