
Writing Runbook Documentation When You’re an SRE - mooreds
https://www.transposit.com/blog/2020.01.30-writing-runbook-documentation-when-youre-an-sre/
======
williamDafoe
This is not a very good article. Here are some improvements.

1\. Runbooks go out of date faster than anything. Therefore it is absolutely
crucial to keep the whole book on a single page and have each alert link to
the proper subsection. Also pepper each section with a good set of keywords.
This lets users and newbies easily search for procedures or related alerts
when links inevitably break.

2\. Group related alerts ("host xyz down") in a single section with many
subsection titles, one for each possible xyz.

3\. Go ahead and put command-line commands in the runbook, in shell-command or
code-highlighted boxes (NEVER inside sentences). User-defined fields should be
marked as $HOST etc. (never <host>), with sample values for the variables given
beforehand and sample output afterward. Never prefix shell commands with a
copyable $ prompt. This creates the best possible user experience for people
reusing your commands by copy and paste: they can more easily change values,
and they know what to expect.

4\. Link all relevant consoles in very small links (CPU usage in clusters: ym
qf ij), including links to historical console views that show how past
problems looked.

5\. Section templates might look like this:

1.8. Too many cacheservers are down

1.8.1 Definition. This means that 25% of the hosts (on average) have been down
over the past 10 minutes

1.8.2 Severity. Our load balancer will route all requests to surviving hosts,
and clients will retry on timeout, so normally this is not severe (there is
only a performance impact). However, it could cascade due to RAM exhaustion,
a query-of-death, or a config push of broken software, so assess the service
right away for these problems.

1.8.3 Remediation ... Rollback or resource scaling or bypassing the cache
service on the command line ...
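
As a rough illustration of the conventions in item 3, here is a minimal Python sketch of a checker for one runbook command block. It flags a leading `$ ` prompt, `<host>`-style placeholders, and shell variables with no sample value. The function name, the messages, and the hostname `cache-17.example.com` are all made up for this example; nothing here comes from the article or from any real tool.

```python
import re

# Hypothetical helper: checks one runbook command block against the
# conventions in item 3. Rule names and messages are illustrative only.
ANGLE_PLACEHOLDER = re.compile(r"<[A-Za-z_][A-Za-z0-9_-]*>")
PROMPT_PREFIX = re.compile(r"^\s*\$\s")
SHELL_VARIABLE = re.compile(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)\}?")


def lint_command_block(block, sample_values):
    """Return a list of human-readable problems found in a command block."""
    problems = []
    for number, line in enumerate(block.splitlines(), start=1):
        # A pasted "$ " prompt breaks copy and paste.
        if PROMPT_PREFIX.match(line):
            problems.append(f"line {number}: remove the leading '$ ' prompt")
        # <host> cannot be set from the shell; $HOST can.
        for placeholder in ANGLE_PLACEHOLDER.findall(line):
            problems.append(
                f"line {number}: write {placeholder} as a shell variable like $HOST"
            )
        # Every variable should come with a sample value given beforehand.
        for name in SHELL_VARIABLE.findall(line):
            if name not in sample_values:
                problems.append(f"line {number}: no sample value given for ${name}")
    return problems


# Assumed sample value; the hostname is invented for this sketch.
print(lint_command_block("$ ssh <host> uptime", {}))
print(lint_command_block('ssh "$HOST" uptime', {"HOST": "cache-17.example.com"}))
```

The regular expressions are deliberately simple, so a real runbook would likely need a more careful notion of what counts as a placeholder.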

\- Google search SRE

~~~
mooreds
Agree 100% on clearly delineating the commands. Do you have any suggestions
for keeping commands up to date?

I don't understand your comment #4: "CPU usage in clusters: ym qf ij"

Thanks for the great suggestions for improvement (disclosure, I work with the
author and posted the link).

~~~
joshuamorton
They're being Google-jargon-y.

Replace "ym qf ij" with "us-east-1 us-west-1 eu-2"; each one links to the
monitoring console for aggregate CPU usage for the job in that region.

------
kureikain
I have been using this at work internally.

Basically, I wrote an app that parses a directory of Markdown files. The twist
is that the code blocks are runnable.

A code block is organized like this:

```
input:
- name:
  description:
- othervar:
  description:
---
real shell code here
```

It adds a run button under the code block. When run, it parses the inputs to
generate a UI for entering these parameters. Then it spins up a pod in
Kubernetes and runs the script. We are already using Vault, so the script can
access Vault to get the secrets it needs.

It feels awesome because we put the link to the runbook in the PagerDuty
alert. It keeps the documentation and code in sync.

However, it's kind of like a backdoor :-(. It essentially gives shell access
to the entire infrastructure (since it has Vault access). I tried hard to
protect this tool but still feel uneasy about it. But without it, I don't know
how hard it would be to keep code and documentation in sync.

------
epicgiga
Runbooks are more of an anti-pattern than anything.

A list of commands you should run to accomplish X -> that's a script. Or it
should be. If accomplishing a task takes more than a single step, it means
there's insufficient automation or programmability, and that's the problem you
should fix, not teaching devs how to be better writers of prose.

Code can be tested quickly and relatively easily. Docs can't.

Docs should be as minimal as possible: "# Server ## Rebooting -
./scripts/server/reboot.sh".

Any docs written by a dev should be run by an editor, who should strip them
down to the bare minimum. Your typical readme is inundated with waffle, and
the actual meat in it is wrong well over half the time.

Automate everything.

~~~
mooreds
> Automate everything.

Whenever I see someone say something like this, I'm always curious about where
human judgement and/or investigation comes in.

Don't get me wrong, I'm happy to automate what can be automated if it makes
sense (that is, the action can be easily turned into a script and the cost to
create and maintain the automation is not higher than the value created) but
aren't there plenty of areas where some level of human judgement is needed and
you can't just automate everything?

Isn't that kinda what SREs do? Develop intuition for what can be automated,
automate what they can, set up process for what they can't?

~~~
rtpg
I'm basically of the same opinion as you, but recently read this article[0] on
"do nothing" scripts and really like it. The idea is to write out your process
_as a script_ that will at first just print what to do. Then you can use
_computer magic_ to improve things over time and automate steps in the middle.
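
A minimal sketch of what such a "do nothing" script might look like in Python. The step names and prompts are invented for illustration and are not taken from the linked article. Each step only prints an instruction and waits for the operator, so any single step can later be replaced by real automation without changing the rest.

```python
# Each step takes a shared context dict and an `ask` callable (input by default).
# The steps below are hypothetical examples for a response time investigation.
def step_find_suspects(context, ask):
    print("Open the latency dashboard and note which endpoints regressed.")
    context["endpoints"] = ask("Regressed endpoints (comma separated): ")


def step_check_recent_deploys(context, ask):
    print(f"Check recent deploys touching: {context['endpoints']}")
    ask("Press Enter when done: ")


STEPS = [step_find_suspects, step_check_recent_deploys]


def run(steps, ask=input):
    """Run each step in order, sharing state through a context dict."""
    context = {}
    for step in steps:
        step(context, ask)
    print("Done.")
    return context
```

Calling `run(STEPS)` walks an operator through the procedure interactively. Passing a different `ask` callable makes the script testable, and swapping a step's body for code that does the work is the gradual automation path.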

I generally agree that there are a lot of processes that are like "use your
brain and think at this step", which is hard to automate, really. Open-ended
workflow tools don't really exist for this kind of thing.

(Clarifying example: what script do you write to "diagnose a response time
spike"? You can definitely write up the usual suspects, but at some point you
don't have a script that gets you from beginning to end.)

[0]: [https://blog.danslimmon.com/2019/07/15/do-nothing-scripting-...](https://blog.danslimmon.com/2019/07/15/do-nothing-scripting-the-key-to-gradual-automation/)

~~~
mooreds
That's a great article. The HN discussion on it is worth reading too:
[https://news.ycombinator.com/item?id=20495739](https://news.ycombinator.com/item?id=20495739)

------
acvny
SRE is a scam

~~~
ryanklee
Care to qualify?

~~~
0x445442
I agree as well. Here's the start of a qualification:

[https://news.ycombinator.com/item?id=22210952](https://news.ycombinator.com/item?id=22210952)

------
andybak
SRE = Site Reliability Engineering apparently...

Sigh...

