Runbooks for better incident management (ashpatel.substack.com)
78 points by aceflux 10 days ago | 27 comments

Prevents an issue like this: "I recently ran into a situation where I spent 6 hours understanding how something works that would have taken 20 minutes if the relevant information was stored somewhere."

Recently I spent 10 hours on a usually routine task. At the end of the slog, my first thought wasn’t “I should spend more time writing that up!” The article was a good reminder that it’s worth scribbling something down. Even the basics of the “gotcha!” and snippets of code for debugging could save someone else (future me?) another 10 hours.

One thing I didn’t get from the article was how runbooks are created. It mentions the “sticky note on someone’s desk” approach and the “workflows for everything” approach. There’s a lot of ground in between. I guess people write lots of how-tos and eventually they’re turned into a runbook?

For my teams it's always been a cultural thing. Whenever something comes up, someone (at first, probably you!) has to ask: "would a runbook have helped here?". If so, give the person time to write a quality runbook. Also when things change, you have to remember to update your runbooks.

Also wanted to mention something which I always harp on when runbooks come up: the best runbook is no runbook. They're necessary for some break-the-glass things, but if the answer to "would a runbook have helped here?" is "yes", then the immediate follow-up question should be "should it be automated?". If the answer is "yes" then build it into your system, with tests and everything, and if that's not feasible then write a script for it and check it into VCS somewhere.

I don't write enough 'runbooks' (I call them playbooks). But I do write them, normally in four ways:

- every three months I spend a day updating my 'what can go wrong' documents. List of issues, explanation of effects, possible technical and human solutions. A lot of these link to one or more playbooks, but some do not.

- when a new system is created or a long process discussed I'll throw the links and thoughts in an untidy playbook or a nearby README, whichever feels more appropriate. I'll tidy it up more if it's going to be part of training.

- Randomly when I think about something.

- during an incident, or in the post-incident handling.


When actually writing the book I give an overall guide to what's going on in prose at the top, then several sections with relevant things. It might be a series of commands to run, a list of things to bear in mind, a series of links to relevant documentation/blogs/etc, or a mix of the lot. Again, as appropriate.

All the playbooks are in a repo of markdown files which GitLab renders, with support for crosslinking. I do use a Greasemonkey script to give copy+hide hover buttons on all scripts, as well as making large scripts scrollable; these are great ease-of-use features that I haven't seen any more concrete solution offer.

I'm finding that the best way for me to document is to realize that I'm going to go slower at everything but I'm going to document along the way.

I'm in Events, and typically in an event it's been low payoff to try and do anything _after_ the event. Each event is a burnout because you're always pushing yourself.

I assume it's the same in software too.

I can't find the original article, but I often find that keeping a running note or checklist or even a pseudo-script (written like a script but not executable) is really helpful.
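A pseudo-script of this kind might look like the Python sketch below: it prints each step and waits for the human to confirm, and it executes nothing itself. The step texts and the `run_checklist` name are hypothetical, made up for illustration.

```python
"""A 'pseudo-script' runbook: it only prints steps and waits; it runs nothing."""
from typing import Callable, Iterable, List

# Hypothetical steps, for illustration only.
STEPS = [
    "Check the dashboard for the failing service.",
    "Verify that the last deploy finished.",
    "Contact the on-call engineer if the error persists.",
]


def run_checklist(
    steps: Iterable[str],
    confirm: Callable[[str], str] = input,
    emit: Callable[[str], None] = print,
) -> List[str]:
    """Walk through the steps one at a time and return the ones marked done.

    The human does the work; typing 's' skips a step, anything else marks it done.
    """
    done = []
    for number, step in enumerate(steps, start=1):
        emit(f"Step {number}: {step}")
        answer = confirm("Press Enter when done, or type 's' to skip: ")
        if answer.strip().lower() != "s":
            done.append(step)
    return done
```

Calling `run_checklist(STEPS)` in a terminal walks through the list interactively. Individual steps can later be swapped for real function calls once they are well enough understood to automate.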

I see what you're saying though, my run book right now is just a collection of articles that I've written and occasionally refactored to make more sense as a whole.

GitLab's runbooks documentation is a great place to learn: https://docs.gitlab.com/ee/user/project/clusters/runbooks/

At decision-making points, automated runbooks won't help. The percentages are circumspect. Runbooks won't help where human sense is required, as they are not imbued with AI.

They don't claim to automate the decision points or human sense; on the contrary, they leave what and when to execute to the human.

And here are some actual runbooks which Societe Generale have donated to the community: https://github.com/certsocietegenerale/IRM/tree/master/EN

There are some good ideas in those docs; however, they are not written as procedural steps. For example, all steps should start with a verb:

* Check that...

* Verify that...

* Disconnect...

* Contact...

You shouldn't have to read a step to understand what you need to do - the step should explicitly tell you what to do.
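The verb-first rule is easy to check mechanically. A minimal sketch, assuming steps are plain strings and using a small hypothetical allow-list of verbs (the four above plus a few more); neither the list nor the function name comes from the linked documents.

```python
from typing import List

# Hypothetical allow-list; extend it to match your own house style.
ACTION_VERBS = {"check", "verify", "disconnect", "contact",
                "login", "run", "open", "ensure"}


def steps_missing_verb(steps: List[str]) -> List[str]:
    """Return the steps whose first word is not an approved action verb."""
    flagged = []
    for step in steps:
        # Drop leading bullet markers such as '*' or '-' before splitting.
        words = step.strip().lstrip("*- ").split()
        first = words[0].lower().rstrip(".,:") if words else ""
        if first not in ACTION_VERBS:
            flagged.append(step)
    return flagged
```

A check like this only catches the form of a step. Whether the step says how to do the task still needs a human reviewer.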

For example, in the Ransomware PDF, the step:

"Ensure that the endpoint and perimetric (email gateway, proxy caches) security products are up to date"

Doesn't tell you what "up to date" is, or how to actually perform the task/s. The phrase should be multiple steps - for example:

* Ensure the email gateway is up to date:

  - Login to the gateway server and run the command 'foo -version' and confirm that the version displayed is at least 3.01b4. If not do [this] [or see this document]

* Ensure that the proxy caches are up to date. The version numbers returned should be at least 1.04:

  - Login to server Turing, run the command 'bar -V' and check the result

  - Login to server Lovelace, run the command 'bar -V' and check the result

If any server is not at the correct version do [this] [or see this document]

* Ensure that the security products are up to date...

  - Login to .... and run the command .... Verify the version number is ....

  - Login to .... and run the command .... Verify the version number is ....

  - Open a Web browser and visit ..... Verify that the version number shown at the bottom of the login page is .....

If any product is not at the correct version do [this] [or see this document]
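The "at least version X" checks in steps like these can be made explicit in code. A minimal sketch, assuming purely numeric dotted versions such as the 1.04 in the proxy example; the server names come from that example, the reported values in the usage note are made up, and a version with a suffix like 3.01b4 would need a richer parser than this.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a numeric dotted version such as '1.04' into (1, 4)."""
    return tuple(int(part) for part in text.strip().split("."))


def servers_below_minimum(reported: Dict[str, str], minimum: str) -> List[str]:
    """Return the names of servers whose reported version is below the minimum."""
    floor = parse_version(minimum)
    return sorted(
        name for name, version in reported.items()
        if parse_version(version) < floor
    )
```

For example, `servers_below_minimum({"Turing": "1.04", "Lovelace": "1.03"}, "1.04")` returns `["Lovelace"]`. Note the assumption that each dotted component compares as an integer, so 1.10 counts as newer than 1.04.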

If you were doing this on a paper document, you would have room to tick off the steps when done and also to write the results. This all helps with later incident/problem activities and makes it less likely that something is missed.

This 'story arc' runbook also needs sub-plots...other documents...

* How do I login to a server?

* A list of servers by name with login process.

The trickle down is that competent and trained support staff can work from the head document because they know all the other stuff, but if they are not available, or need to delegate tasks, less-experienced people can resort to the secondary docs to fill their knowledge gaps.

Yes, this makes for wordy...but VERY easy to follow...documents. For newbies that have the basic system skills, these documents provide context for your setup and they will generally be able to follow much of the content (under loose supervision) even if they have not done any of the steps or worked on any of the mentioned resources before.

Yes, this approach needs regular document revisions as part of the patch management cycle (you do have a regular cycle for all this stuff, don't you?)

Yes, putting together the process docs (runbooks) like this is tedious work and many organisations don't allocate adequate time or resources for it.

A person a few steps back from the coal-face should oversee the docs, with the technical staff contributing to them. If the docs are written and managed and peer-reviewed by the coal-face technical staff then they tend to make assumptions about the knowledge level of the readers and don't fully document the 'obvious' steps.

Pet peeve: Any step written as 'Perform the usual troubleshooting sequence'!

Edit: Here's an example of a runbook/procedure for major incident management handling. The document was produced to assist new staff with the (then) horrendously complex procedure, which only seasoned staff could easily follow - and they kept the knowledge in their heads. The document is not complete (work in progress) and while it documented the 'here and now', the overall aim of the project was to simplify and automate many of the steps:


This is probably THE most common piece of feedback I give when people write up directions on how to do things.

“You said ‘if I need to do X then here are the steps’ but how would I decide if I need to do X?”


“It says check if X is Y but it doesn’t say how to check”

The way I always explain it is that if your steps are asking me to make a decision then they had better explain the decision making process/inputs as well.

I keep personal "runbooks" for a lot of the common work I deal with over time. Eventually, this stuff gets automated where possible, but taking the time to work through all of the problems, write it down, and do it in a way that I can show someone else has helped me make sure that when I sit down to automate something, I truly understand the "domain".

It also helps tremendously when you have someone reach out to you during off-hours when you can just look through some documentation you have on hand to blaze through a task that takes a lot less time than if you had to figure things out from scratch.

We use MS Word, one runbook per app for third-party applications. These have to be reviewed/updated/APPROVED at least once per year, not optional. It's this that adds the value, not how the info is stored.

Make sure whatever you keep them in has an offline option. I was handling an outage and Confluence blew up.

We're using Markdown in GitHub now, with a clone per SRE.

You can write markdown in Google Alerts. This is what we do.

I'm the founder of a startup in this area. Our product is NOT just the usual automated runbook approach.

If anyone has explored more sophisticated solutions than wiki pages, I would love to talk and learn from your experience.

Well Microsoft has. Take a look at Microsoft Sentinel.

Are you referring to the SOAR features?

"We have a major incident with connectivity to the building, login to the knowledge server and see what the runbook says...oh!"

For mission critical stuff, print out your Incident Management processes and have a physical 'Master Runbook' in a prominent place in your department/cubicle/office.

Also, printed procedures with numbered steps and checkboxes allow for a visual record of progress, plus there's room for notes when things deviate from the expected.

Each annotated runbook then becomes a reference for the Incident/Problem management/RCA write-up - unless, of course, you fancy following the runbook on one screen (if you can), while also updating the service management ticket on another (if you can) and getting out comms to senior stakeholders (if you can) and dealing with the sudden influx of tickets, phone calls and emails (if all the systems are still accessible).

Plus, if you are called into a meeting, or have to go check something, you can take the paperwork with you.
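A printable procedure with numbered steps, checkboxes, and room for notes can be generated from the same step list that is kept electronically. A minimal sketch; the `render_printable` name and the layout are assumptions for illustration, not something described in this thread.

```python
from typing import List


def render_printable(title: str, steps: List[str], note_lines: int = 1) -> str:
    """Render steps as a plain-text checklist with a tick box and note space."""
    lines = [title, "=" * len(title), ""]
    for number, step in enumerate(steps, start=1):
        lines.append(f"[ ] {number}. {step}")
        # Leave ruled space under each step for handwritten results.
        for _ in range(note_lines):
            lines.append("      Notes: " + "_" * 40)
        lines.append("")
    return "\n".join(lines)
```

Printing the result of `render_printable("Building connectivity outage", steps)` whenever the runbook changes keeps the paper copy in step with the electronic one.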

There are some cool ideas in here, but you can also just use a git repo that everyone has cloned.

Physical copies still have advantages over git in the event of:

- Power failure (if your UPS failed, backup laptop might not be charged up / left at home / ???)

- Power surge (if your laptops were plugged in, they could've been fried too)

- Disk encrypting ransomware (if you're unsure about the spread of the infection, booting a possibly compromised machine might cause more data loss)

- Theft (a casual thief might just grab your laptop, and leave the annoying-to-unplug PC tower alone, but that's not a guarantee - boring secretless paperwork is a lot less likely to be stolen)

- People forgetting to clone your git repository

To add to this, someone once suggested that the Incident Manager could just print off a copy of the process when needed.

Well, first, can you spare the time during a major incident to go print something, or arrange for someone else to do it?

Second, I have seen times where a lack of building connectivity stops centrally-audited printing from working. I daresay a fallback should have been configured, or someone (go find them!) has an override, but valuable time could be wasted sorting this out.

Third, if the process being followed on-screen needs revising, you shouldn't start doing that in the document 'on the fly' (where's your document change management now?), so out comes the pen and paper...or you could try firing up your favourite editing package to make notes while simultaneously following the electronic process, updating the ticket and chairing the incident response Teams/whatever meeting..or you could just write quick notes in the right place on the printed procedure so that they have context and you can come back to them later during the incident debrief.

I am sure there are pros and cons, and some people will be more comfortable having everything in 'electronic' format, because that's how their IT world has been from day 1, but having been in Technical Support and IT Management for some 30+ years (yeah, I'm an old fart), I know from bitter experience what works best under the vast majority of circumstances.

- Power failure where all of the above occur at once?

- Power surge big enough to take out all the breakers, UPS, servers, desktops, and laptops at once?

- Ransomware that had time to take control of the whole network including backups?

- Theft of a single laptop?

- People: yeah, I agree that people can be an issue.

And all of the above across multiple offices with WFH staff and on call staff?

Of all these catastrophic failures it's just as likely to be a fire, or earthquake, etc, where you should have multiple layers of redundancy in place anyway.

> Power failure where all of the above occur at once?

Sure, it's easy enough. I've had UPSes that only noticeably failed once the power failed, and those outages outlasted my fully charged laptop battery.

> Power surge big enough to take out all the breakers, UPS, servers, desktops, and laptops at once?

Sure! One lightning strike to an unprotected network link, for example, can threaten to do just that. https://www.youtube.com/watch?v=Ev0PL892zSE&t=354s (note: he even added some protection and another lightning storm still caused damage... so he resorted to fiber!)

Even if it doesn't destroy all of your gear, destroying all the main gear your IT crew has login passwords for can slow things way down as they figure out alternatives.

> Ransomware that had time to take control of the whole network including backups?

Even if it hasn't taken over the whole network, you might not be sure which nodes have been taken over (ransom messages might not appear until a lot of data has already been encrypted), and may wish to leave more sensitive nodes offline for forensics, or to preserve data that hasn't been backed up yet, or to avoid needing to resort to slower restoration from offline backups.

Hopefully you have offsite and cold nodes... good reason to have some bootable USBs with recovery images on 'em ready to go as well.

> Theft of a single laptop?

"but that's not a guarantee" was meant to point out that larger scale theft beyond the typical "just a single laptop" isn't unheard of.

> And all of the above across multiple offices with WFH staff and on call staff?

Runbooks can potentially be useful for single-office businesses, and mean fewer games of telephone relay even for multi-office businesses.

> Of all these catastrophic failures it's just as likely to be a fire, or earthquake, etc, where you should have multiple layers of redundancy in place anyway.

And runbooks can point you towards those redundancies, many of which may require manual intervention by design (e.g. restoring from cold offsite backups.) You'll note such things as evacuation plans are often printed and displayed prominently near emergency exits, not left in a git repository ;)

And, of course, just because you should have multiple layers of redundancy, doesn't mean you actually have multiple layers of redundancy, even if you think you do.

That’s true. If those are the types of incidents you are preparing for, it would make sense to have printed copies. You could also do both.

How do you ensure everyone has it cloned (and up to date)?

If you have any network connectivity or are physically together the repo could be shared. So as long as one person has it cloned you should be fine. It’s the same as making sure people have a printed copy. Depending on the run book, it may not need to be perfectly up to date, but you’re right that it may be challenging for team members to remember to update it. One thing that can help is putting run books in a primary docs repo. Then users are making changes and updating the repository more actively. In turn they would likely get updates to the run book just by writing and updating other docs.

Crontab a git fetch. You can then pull when needed (but won’t have a process overwriting your docs and scripts at the wrong time)
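A scheduled fetch could be sketched as a small Python helper that cron runs. The repository path in the usage note and the function names are hypothetical; `git -C <path> fetch` is standard git and only updates remote-tracking branches, so local docs and scripts are left alone.

```python
import subprocess
from typing import List


def build_fetch_command(repo_path: str) -> List[str]:
    """Build a 'git fetch' command that targets the given clone."""
    # 'git -C <path>' runs git as if it had been started in <path>.
    # 'fetch' downloads new commits without touching the working tree.
    return ["git", "-C", repo_path, "fetch", "--quiet", "--all"]


def fetch_runbooks(repo_path: str) -> bool:
    """Run the fetch and report whether git exited successfully."""
    result = subprocess.run(build_fetch_command(repo_path), check=False)
    return result.returncode == 0
```

A crontab entry that calls a script wrapping `fetch_runbooks("/srv/runbooks")` every half hour would keep the remote-tracking branches fresh. When the latest version is needed, a manual `git pull` brings the working copy up to date, assuming the last scheduled fetch succeeded.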
