Recently I spent 10 hours on a usually routine task. At the end of the slog, my first thought wasn’t “I should spend more time writing that up!” The article was a good reminder that it’s worth scribbling something down. Even the basics of the “gotcha!” and snippets of code for debugging could save someone else (future me?) another 10 hours.
One thing I didn’t get from the article was how runbooks are created. It mentions the “sticky note on someone’s desk” approach and the “workflows for everything” approach. There’s a lot of ground in between. I guess people write lots of how-tos and eventually they’re turned into a runbook?
Also wanted to mention something which I always harp on when runbooks come up: the best runbook is no runbook. They're necessary for some break-the-glass things, but if the answer to "would a runbook have helped here?" is "yes", then the immediate follow-up question should be "should it be automated?". If the answer is "yes" then build it into your system, with tests and everything, and if that's not feasible then write a script for it and check it into VCS somewhere.
- every three months I spend a day updating my 'what can go wrong' documents. List of issues, explanation of effects, possible technical and human solutions. A lot of these link to one or more playbooks, but some do not.
- when a new system is created or a long process discussed I'll throw the links and thoughts in an untidy playbook or a nearby README, whichever feels more appropriate. I'll tidy it up more if it's going to be part of training.
- Randomly when I think about something.
- during an incident, or in the post-incident handling.
When actually writing the book I give an overall guide to what's going on in prose at the top, then several sections with relevant things. It might be a series of commands to run, a list of things to bear in mind, a series of links to relevant documentation/blogs/etc, or a mix of the lot. Again, as appropriate.
All the playbooks are in a repo of markdown files which GitLab renders, with support for crosslinking. I do use a Greasemonkey script to give copy+hide hover buttons on all scripts, as well as making large scripts scrollable, which are great ease-of-use features I haven't seen any more concrete solution do.
I'm in Events, and typically in an event it's been low payoff to try and do anything _after_ the event. Each event is a burnout because you're always pushing yourself.
I assume it's the same in software too.
I can't find the original article, but I often find that keeping a running note or checklist or even a pseudo-script (written like a script but not executable) is really helpful.
I see what you're saying though, my run book right now is just a collection of articles that I've written and occasionally refactored to make more sense as a whole.
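For anyone wondering what a pseudo-script could look like, here is a minimal sketch in Python. It automates nothing: it just prints each step and waits for you to confirm, collecting any notes you type. The step text and the `run_checklist` name are made up for illustration, not taken from any article.

```python
from typing import Callable

# Hypothetical steps for illustration; replace with your own procedure.
STEPS = [
    ("Log in to the server", "Use the login process from your server list."),
    ("Check the service version", "Run the version command and note the output."),
]


def run_checklist(
    steps: list[tuple[str, str]],
    ask: Callable[[str], str] = input,
    show: Callable[[str], None] = print,
) -> list[str]:
    """Show each step, wait for confirmation, and return the notes typed."""
    notes = []
    for number, (title, detail) in enumerate(steps, start=1):
        show(f"Step {number}: {title}")
        show(f"    {detail}")
        # Nothing is executed here; the operator does the work by hand.
        notes.append(ask("    Press Enter when done (or type a note): ").strip())
    return notes
```

Calling `run_checklist(STEPS)` at a terminal walks through the steps one at a time. The nice part of this shape is that any single step can later be swapped for real automation without rewriting the rest.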
* Check that...
* Verify that...
You shouldn't have to interpret a step to understand what you need to do - the step should explicitly tell you what to do.
For example, in the Ransomware PDF, the step:
"Ensure that the endpoint and perimetric (email gateway, proxy caches) security products are up to date"
Doesn't tell you what "up to date" is, or how to actually perform the task/s. The phrase should be multiple steps - for example:
* Ensure the email gateway is up to date:
- Login to the gateway server and run the command 'foo -version' and confirm that the version displayed is at least 3.01b4. If not do [this] [or see this document]
* Ensure that the proxy caches are up to date. The version numbers returned should be at least 1.04:
- Login to server Turing, run the command 'bar -V' and check the result
- Login to server Lovelace, run the command 'bar -V' and check the result
* Ensure that the security products are up to date...
- Login to .... and run the command .... Verify the version number is ....
- Login to .... and run the command .... Verify the version number is ....
- Open a Web browser and visit ..... Verify that the version number shown at the bottom of the login page is .....
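Once the steps are that explicit, the "check the result" part is easy to sketch in code. The following is only an illustration in Python: the recorded outputs are invented, 'bar' is the placeholder command from the steps above, and I'm assuming plain dotted numeric versions whose components compare as integers (so it would not handle something like 3.01b4).

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted numeric version (e.g. '1.04') from some text."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def check_minimum(output: str, minimum: str) -> bool:
    """Return True if the version found in output is at least the minimum."""
    return parse_version(output) >= parse_version(minimum)


def out_of_date(results: dict[str, str], minimum: str) -> list[str]:
    """List the servers whose recorded output is below the minimum version."""
    return [name for name, out in results.items() if not check_minimum(out, minimum)]


# Hypothetical outputs of 'bar -V', as written down by whoever ran the step.
results = {
    "Turing": "bar version 1.04",
    "Lovelace": "bar version 1.03",
}

print(out_of_date(results, "1.04"))  # → ['Lovelace']
```

It deliberately works on recorded output rather than logging in anywhere, so it doubles as a check on what was written down during the incident.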
If you were doing this on a paper document, you would have room to tick off the steps when done and also to write the results - this all helps with later incident/problem activities and makes it less likely that something is missed.
This 'story arc' runbook also needs sub-plots...other documents...
* How do I login to a server?
* A list of servers by name with login process.
The trickle down is that competent and trained support staff can work from the head document because they know all the other stuff, but if they are not available, or need to delegate tasks, less-experienced people can resort to the secondary docs to fill their knowledge gaps.
Yes, this makes for wordy...but VERY easy to follow...documents. For newbies that have the basic system skills, these documents provide context for your setup and they will generally be able to follow much of the content (under loose supervision) even if they have not done any of the steps or worked on any of the mentioned resources before.
Yes, this approach needs regular document revisions as part of the patch management cycle (you do have a regular cycle for all this stuff, don't you?)
Yes, putting together the process docs (runbooks) like this is tedious work and many organisations don't allocate adequate time or resources for it.
A person a few steps back from the coal-face should oversee the docs, with the technical staff contributing to them. If the docs are written and managed and peer-reviewed by the coal-face technical staff then they tend to make assumptions about the knowledge level of the readers and don't fully document the 'obvious' steps.
Pet peeve: Any step written as 'Perform the usual troubleshooting sequence'!
Edit: Here's an example of a runbook/procedure for major incident management handling. The document was produced to assist new staff with the (then) horrendously complex procedure, which only seasoned staff could easily follow - and they kept the knowledge in their heads. The document is not complete (work in progress) and while it documented the 'here and now', the overall aim of the project was to simplify and automate many of the steps:
“You said ‘if I need to do X then here are the steps’ but how would I decide if I need to do X?”
“It says check if X is Y but it doesn’t say how to check”
The way I always explain it is that if your steps are asking me to make a decision then they had better explain the decision making process/inputs as well.
It also helps tremendously when someone reaches out to you during off-hours: you can look through the documentation you have on hand and blaze through a task in a lot less time than if you had to figure things out from scratch.
We're using Markdown in GitHub now, with a clone per SRE.
If anyone has explored more sophisticated solutions than wiki pages, I would love to talk and learn from your experience.
For mission critical stuff, print out your Incident Management processes and have a physical 'Master Runbook' in a prominent place in your department/cubicle/office.
Also, printed procedures with numbered steps and checkboxes allow for a visual record of progress, plus there's room for notes when things deviate from the expected.
Each annotated runbook then becomes a reference for the Incident/Problem management/RCA write-up - unless, of course, you fancy following the runbook on one screen (if you can), while also updating the service management ticket on another (if you can) and getting out comms to senior stakeholders (if you can) and dealing with the sudden influx of tickets, phone calls and emails (if all the systems are still accessible).
Plus, if you are called into a meeting, or have to go check something, you can take the paperwork with you.
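If the procedures already live in text files, producing the printable version can be a few lines of code. A rough sketch in Python, with a made-up `render_checklist` helper and made-up step text, that numbers the steps and leaves a checkbox plus space for notes under each one:

```python
def render_checklist(title: str, steps: list[str], note_lines: int = 2) -> str:
    """Render numbered steps with a tick box and blank lines for handwritten notes."""
    lines = [title, "=" * len(title), ""]
    for number, step in enumerate(steps, start=1):
        lines.append(f"[ ] {number}. {step}")
        lines.append("      Result: " + "_" * 40)
        # Extra ruled lines for notes when things deviate from the expected.
        lines.extend("      " + "_" * 48 for _ in range(note_lines))
        lines.append("")
    return "\n".join(lines)


# Hypothetical procedure, for illustration only.
page = render_checklist(
    "Major incident: first 15 minutes",
    ["Open the incident ticket", "Notify senior stakeholders"],
)
print(page)
```

Print the output ahead of time and file it in the Master Runbook, rather than relying on being able to print during the incident.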
- Power failure (if your UPS failed, backup laptop might not be charged up / left at home / ???)
- Power surge (if your laptops were plugged in, they could've been fried too)
- Disk encrypting ransomware (if you're unsure about the spread of the infection, booting a possibly compromised machine might cause more data loss)
- Theft (a casual thief might just grab your laptop, and leave the annoying-to-unplug PC tower alone, but that's not a guarantee - boring secretless paperwork is a lot less likely to be stolen)
- People forgetting to clone your git repository
Well, first, can you spare the time during a major incident to go print something, or arrange for someone else to do it?
Second, I have seen times where a lack of building connectivity stops centrally-audited printing from working. I daresay a fallback should have been configured, or someone (go find them!) has an override, but valuable time could be wasted sorting this out.
Third, if the process being followed on-screen needs revising, you shouldn't start doing that in the document 'on the fly' (where's your document change management now?), so out comes the pen and paper...or you could try firing up your favourite editing package to make notes while simultaneously following the electronic process, updating the ticket and chairing the incident response Teams/whatever meeting...or you could just write quick notes in the right place on the printed procedure so that they have context and you can come back to them later during the incident debrief.
I am sure there are pros and cons, and some people will be more comfortable having everything in 'electronic' format, because that's how their IT world has been from day 1, but having been in Technical Support and IT Management for some 30+ years (yeah, I'm an old fart), I know from bitter experience what works best under the vast majority of circumstances.
- Power surge big enough to take out all the breakers, UPS, servers, desktops, and laptops at once?
- Ransomware that had time to take control of the whole network including backups?
- Theft of a single laptop?
- People, yeah, I agree people can be an issue.
And all of the above across multiple offices with WFH staff and on call staff?
Of all these catastrophic failures it's just as likely to be a fire, or earthquake, etc, where you should have multiple layers of redundancy in place anyway.
Sure, it's easy enough. I've had UPSes that only noticeably failed once the power failed, and those outages outlasted my fully charged laptop battery.
> Power surge big enough to take out all the breakers, UPS, servers, desktops, and laptops at once?
Sure! One lightning strike to an unprotected network link, for example, can threaten to do just that. https://www.youtube.com/watch?v=Ev0PL892zSE&t=354s (note: he even added some protection and another lightning storm still caused damage... so he resorted to fiber!)
Even if it doesn't destroy all of your gear, destroying all the main gear your IT crew has login passwords for can slow things way down as they figure out alternatives.
> Ransomware that had time to take control of the whole network including backups?
Even if it hasn't taken over the whole network, you might not be sure which nodes have been taken over (ransom messages might not appear until a lot of data has already been encrypted), and may wish to leave more sensitive nodes offline for forensics, or to preserve data that hasn't been backed up yet, or to avoid needing to resort to slower restoration from offline backups.
Hopefully you have offsite and cold nodes... good reason to have some bootable USBs with recovery images on 'em ready to go as well.
> Theft of a single laptop?
"but that's not a guarantee" was meant to point out that larger scale theft beyond the typical "just a single laptop" isn't unheard of.
> And all of the above across multiple offices with WFH staff and on call staff?
Runbooks can potentially be useful for single-office businesses, and mean fewer games of telephone relay even for multi-office businesses.
> Of all these catastrophic failures it's just as likely to be a fire, or earthquake, etc, where you should have multiple layers of redundancy in place anyway.
And runbooks can point you towards those redundancies, many of which may require manual intervention by design (e.g. restoring from cold offsite backups.) You'll note such things as evacuation plans are often printed and displayed prominently near emergency exits, not left in a git repository ;)
And, of course, just because you should have multiple layers of redundancy, doesn't mean you actually have multiple layers of redundancy, even if you think you do.