You should never use a runbook -- instead, spend the time you were going to spend writing a runbook on writing code that executes the steps automatically. This reduces human error and makes things faster and more repeatable. Even better, have the person who wrote the code also write the automation that fixes it, so it stays up to date with changes in the code.
At Netflix we tried to avoid spending a lot of time on documentation because by the time the document was done, it was out of date. Almost inevitably any time you needed the documentation, it no longer applied.
I wish the author had spent more time talking about incident reviews. Those were our key to success. After every event, you review the event with everyone involved, including the developers, and then come up with an action plan that at a minimum prevents the same problem from happening again -- or better, prevents an entire class of problems from happening again. Then you have to follow through and make sure the changes actually get implemented.
I agree with the author on the point about culture. That was absolutely critical. You need a culture that isn't about placing blame but finding solutions. One where people feel comfortable, and even eager, to come out and say "It was my fault, here's the problem, and here's how I'm going to fix it!"
Also your Netflix example - how many people do you have there? Probably the smallest of teams is bigger than my company's whole engineering department. We're running a whole company and not a few services. (I'm absolutely not trying to discourage what you're doing - but I strongly feel it's a different ballgame.)
The smaller your team and the wider its responsibilities (not "make THIS service 99.9% available" but "make ALL services 90% available, then 95%, ..."), the more you only automate what happens frequently. And yes, when something happens we first try a basic runbook -- and only if the same thing happens repeatedly do we automate it.
I personally found that runbooks were even worse for small teams (like our four-person reddit team) because they would get out of date even quicker than at the bigger places due to the rapidly changing environment.
I wrote downthread that if all of your deployment is automated, then it is much easier to automate remediation, because you just change your deployment to fix the problem, as long as you can redeploy quickly.
On another level, there's stuff we see as core infrastructure (for example hardware, or even some parts of OpenStack running on that hardware). Of course there are also downtimes and emergencies and dumpster fires there, but they are pretty much unique little snowflakes, and the repeating ones happen a few times per year. There simply is nothing "to deploy". Maybe one can argue that "runbook" is not 100% correct; sometimes it's a runbook including debug info.
But it's not turtles all the way down, and I stand by my point: there's stuff where the cost-to-benefit ratio totally ends up at "automate it away", and there's other stuff where it's the opposite.
What advice would you give for an org where the engineers who build systems are not responsible for keeping them running, and everyone on a (much smaller comparatively) infrastructure team is (which is slowly turning into an SRE team by necessity)?
Anecdotally, I've found documentation to be useless; despite documentation being of a high quality, no one refers to it, even after iterating to continue to add information, make it more relevant, streamline, etc.
If you can make the company culture focus on uptime, or get engineers involved in remediation, then you'll be better off.
If you can't do that, try to at least push for the Google model: The engineers are responsible for uptime of their product until they can prove that it is stable and has sufficient monitoring and alerting, and then they can turn it over to SRE, with the caveat that it will go back to the engineers if it gets lower in quality.
I don't think that's a healthy or sustainable culture for a company, certainly not one that's expecting to grow.
It might be sufficient for a company that has a small technical team and isn't looking to grow (think: the "tech department" for a company in another industry), but not for a company where engineering or technology is the primary focus.
My experience is that you can get regularly updated runbooks, but they're at the wrong level of abstraction.
They'll discuss some odd one-off failure that happened to trigger a given alert, rather than the general class of problems that this alert is trying to catch.
These days I consider looking at a runbook to be an act of desperation, one that I'll only perform after attempting to debug from first principles.
Even at our super tiny 3-person operation this has saved a massive amount of time - it's even more valuable when it's panic time because a server went down due to disk or ISP failure and you know you have an ansible script that can get a new server up in 3 minutes flat while you grab the backup.
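For the curious, the "replacement server in minutes" path can be as small as a wrapper around an existing playbook. Here's a sketch, assuming an ansible setup and a restore script roughly like ours -- the playbook, inventory trick, and script names are made up for illustration:

    # Sketch only: provision a fresh host with the existing playbook, then
    # restore the latest backup. Names and paths are illustrative.
    import subprocess

    def rebuild_server(new_host_ip):
        # Trailing comma = ad-hoc single-host inventory for ansible.
        inventory = f"{new_host_ip},"
        subprocess.run(
            ["ansible-playbook", "-i", inventory, "site.yml"],
            check=True,  # stop loudly if provisioning fails
        )
        subprocess.run(
            ["ansible", "-i", inventory, "all", "-m", "script",
             "-a", "restore_latest_backup.sh"],
            check=True,
        )

    # rebuild_server("203.0.113.7")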
Not all processes can be automated today. Some can only be automated tomorrow. Some can only be automated after some blocking functionality is added. Documentation is how you plan your automation.
"Code as documentation" can be an okay answer if the question is "what is the behavior of this system"? But it's a bad answer for the question "what is the intended behavior of this system, and what assumptions does it run under?"
Looking back at a piece of code months (or years) later and not knowing if a particular edge case that you're running into was actually the intended function or not is not particularly fun.
The power of writing documentation is not just in the end product; it's that it serves as a forcing function for the developers to confront their own thought processes and make them explicit. It's possible to write code that makes all of its assumptions explicit and clearly states its contracts up-front, but in practice, it almost never happens without significant external documentation (whether that comes in the form of explicit code docs, whitepapers, or ad-hoc email threads and Slack conversations that need to be pieced together after-the-fact).
I mean, if your downstream partner calls up to say that the custom analytics feed from your xyz service is returning null data, but not erroring, and the guy that implemented that feed (with rolled eyes, thinking it was an inelegant hacky concession to a noisy customer) left in 2015, where do you even start? How much code from how many codebases and configuration management repos are you going to have to read through just to work out what kind of problem you've got?
Some type of high level documentation - what services / products exist, what infrastructure does each use, how is each one managed and tested, what is worrisome about it or what has tended to go wrong in the past - is going to help a lot.
I definitely agree with that, and it's partly a corollary of 'documentation is expensive and requires costly maintenance'.
Run books/checklists are mostly implemented really really badly.
Automation is the ideal, but is costly, and itself requires maintenance.
Most of the steps we had to perform did not lend themselves to automation, also.
I would contend that the cost of automation is about the same as the cost of documentation plus the cost of having to manually do the work over and over. It's just a cost borne up front instead of over time. But to your point in the article, you have to have a culture that supports bearing that up-front cost.
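As a back-of-the-envelope illustration of that trade-off (all numbers invented):

    # Hours spent, with made-up numbers: automation is a one-time cost,
    # documentation is a smaller cost plus manual executions forever.
    automation_cost = 16.0      # write + test the remediation code
    documentation_cost = 2.0    # write the runbook page
    doc_upkeep_per_year = 1.0   # keep the runbook current
    manual_run_cost = 0.5       # each manual execution

    def manual_total(runs_per_year, years):
        return documentation_cost + years * (doc_upkeep_per_year
                                             + runs_per_year * manual_run_cost)

    print(manual_total(runs_per_year=24, years=1))  # 15.0 -- breaks even within a year
    print(manual_total(runs_per_year=1, years=5))   # 9.5  -- rare task, never worth it
    print(automation_cost)                          # 16.0 -- paid up front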
> Most of the steps we had to perform did not lend themselves to automation, also.
I don't understand how that is possible? Could you give an example of a task that can't be automated?
Most attempts I've seen to automate tasks flounder for this reason.
Also, there's no point automating something that happens once a year -- the cost will exceed the benefit. Hence the comments about deciding which tasks to automate with a backlog and metrics.
Most anything _could_ be automated, e.g. 'Phone the customer and ask them to engage their network team', 'look for similar-looking strings', 'tcpdump the port and see if the output is irregular', but is it really worth the effort? That's where the backlog came in.
BTW, I wrote a whole framework for automation, and it's taught me a lot about the hidden costs of automation...
If everything in your infrastructure is deployed with code, then automation is simply the act of making sure the infrastructure matches the deployment described in the code. (It's true that automating remediation is just as difficult if deployments are manual.) Remediation then becomes changing the deployment code to fix a previously unknown problem, instead of manually fixing the problem, which gains you the advantage of the problem being fixed in perpetuity.
So they aren't in contradiction if your remediation and deployment are the same process, because then it by definition is always up to date.
I guess I should add the caveat that deployment should be quick enough to solve problems via redeployment.
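A minimal sketch of what "remediation is just redeployment" can look like -- a toy reconciliation pass, with entirely hypothetical state and functions:

    # Toy sketch: the code describes the desired deployment; remediation is
    # editing that description and letting the deploy step converge reality.
    def desired_state():
        # What the deployment code says should be running.
        return {"web": {"image": "web:1.42", "replicas": 6}}

    def observed_state():
        # What the infrastructure reports is actually running.
        return {"web": {"image": "web:1.41", "replicas": 4}}

    def converge(service, want):
        # Stand-in for the real deploy step (terraform apply, kubectl apply, ...).
        print(f"redeploying {service} -> {want}")

    def reconcile_once():
        want, have = desired_state(), observed_state()
        for service, spec in want.items():
            if have.get(service) != spec:
                converge(service, spec)

    reconcile_once()  # a bug fix or rollback is just an edit to desired_state()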
Shameless plug for people who are feeling inferior while reading this thread: as it happens, my consulting gigs revolve pretty heavily around doing exactly what 'jedberg has described, and it's super worth getting to that point. ;) Email's in the profile if you're looking to get there.
For instance, if you've got machines at customer sites where you manage the software, commercial filers or appliances where you don't really have a shell and you certainly can't spin up a VM, etc., your machines aren't fungible. You can't just redeploy a machine, and just about every problem on the machine is going to be a new one. You want the runbook so that it enumerates how to stop and start things around a human intervention (and yeah, as much of that as possible should be automated), but you can't automate the whole thing without changing your architecture.
If you have the option of starting with an architecture where you don't have these problems, by all means take that option! Maybe do cloud-hosted SaaS instead of maintaining on-prem software for customers, or use some fancy cloud storage API instead of a physical old-school filer. But people who have the old-school architectures need to keep things running smoothly, too.
OP said automating runbooks, and any fixes to production--so the code becomes the documentation of how to deploy, and what exactly was done to remedy an error. OP did not say automating failure responses.
Which brings up OP's proposed question...
>> I don't understand how that is possible? Could you give an example of a task that can't be automated?
Could you give a concrete example? There's a vague reference in the article about automating responses for encoding errors which sounds interesting. It sounds like the system generates lots of server errors, and it's easier/cheaper to communicate them to customers directly instead of making code fixes.
"Check the logs for service X (they're here <link>) and look for anything related to the issue"
"If the user impact is high, write an update to the status page detailing the impact and an estimated time to recovery"
The value of a runbook is that it can make use of human intelligence in its steps. No-one is arguing that you shouldn't be automating things like "if the CPU usage is > 90%, spin up another instance and configure it".
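For contrast, the kind of automation nobody objects to is roughly this shape -- get_cpu_utilization and launch_instance here are stand-ins for whatever your monitoring and provisioning APIs actually are:

    CPU_THRESHOLD = 90.0  # percent

    def get_cpu_utilization(group):
        # Stand-in for a real monitoring query (CloudWatch, Datadog, ...).
        return 93.5

    def launch_instance(group):
        # Stand-in for a real provisioning call (autoscaling API, etc.).
        print(f"launching one more instance in {group}")

    def scale_if_hot(group="frontend"):
        cpu = get_cpu_utilization(group)
        if cpu > CPU_THRESHOLD:
            launch_instance(group)
            return f"scaled {group}: cpu={cpu:.1f}%"
        return f"no action: cpu={cpu:.1f}%"

    print(scale_if_hot())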
I have a long missive about how logs are useless and shouldn't be kept, but that's for another time. I'll summarize by saying that if you have to look at logs, then your monitoring has failed you.
> "If the user impact is high, write an update to the status page detailing the impact and an estimated time to recovery"
I guess technically that would be a step in a runbook, that's fair. Although in my case that was left to PR to do based on updates to the trouble tickets. :)
> The value of a runbook is that it can make use of human intelligence in its steps
I'd rather human intelligence be spent on triage by reading the results of automated diagnosis and coding up remediation software than on repeating steps in a checklist.
Sure, there are uses for checklists of things to check, but even those should be automated through the ticket system at the very least. At that point I no longer consider it a runbook, though I guess some might still call it one.
> I have a long missive about how logs are useless and shouldn't be kept, but that's for another time. I'll summarize by saying that if you have to look at logs, then your monitoring has failed you.
I have to say, you have a tendency on HN to chime in from the peanut gallery and be a bit unrelenting and even combative because jedberg does things differently.
That's a fair critique, and thank you for pointing it out. I try to always back up what I say with the reasons for what I say, but sometimes I get lazy or don't have time to write it all out. I too worry about folks who speak in absolutes, although in this case I happen to actually believe it.
The medium isn't always the best way to have a deep technical discussion unfortunately.
How does that work?
If you emit metrics as necessary to a time series database, then you should be able to build alerting based on the time series metrics. Your monitoring systems should be good at building alerts based on a stream of metrics and visualizing the time series data.
Sometimes you might have to look at the visualizations to find something, but ideally you then set up an alert on the thing you looked at so you have the alert for the next time it happens. A great monitoring system lets you turn graphs into alerts right in the interface, so if you're looking at a useful graph you can make an alert out of it.
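In code terms, the "alert from the metric stream" idea is roughly this -- a toy sketch where the threshold, window, and paging hook are all placeholders:

    from collections import deque

    WINDOW = 5        # consecutive samples required before firing
    THRESHOLD = 0.02  # e.g. 2% error rate

    recent = deque(maxlen=WINDOW)

    def page_oncall(message):
        # Stand-in for PagerDuty / email / whatever your alerting actually uses.
        print("ALERT:", message)

    def observe(error_rate):
        # Called for each new sample from the time series database.
        recent.append(error_rate)
        if len(recent) == WINDOW and all(v > THRESHOLD for v in recent):
            page_oncall(f"error rate above {THRESHOLD:.0%} for {WINDOW} samples")

    for sample in (0.01, 0.03, 0.04, 0.05, 0.04, 0.06):
        observe(sample)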
Sometimes logs can be useful, but only after your monitoring system has told you which system is misbehaving; then you can turn on logs for that system until you've solved the problem. But you shouldn't need access to old logs, because if the problem was only in the past, then it's not really a problem anymore, right? If you have an ongoing problem, then maybe keep the logs on for that service while you're investigating it, but then turn them off again.
But having a ton of logs constantly being generated and stored tends to be fairly useless in practice when you have a good time series database at hand.
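For what it's worth, "turn logs on only while you're investigating" can be as small as flipping a logger level at runtime (stdlib logging; the logger name is made up):

    import logging

    logging.basicConfig(level=logging.WARNING)          # quiet by default
    payments_log = logging.getLogger("myapp.payments")  # hypothetical service logger

    def set_debug(enabled):
        payments_log.setLevel(logging.DEBUG if enabled else logging.WARNING)

    set_debug(True)   # monitoring flagged the payments service: logs on
    payments_log.debug("retrying charge id=1234, attempt 2")
    set_debug(False)  # problem solved: back to quiet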
Likewise, turning logs on only after you've seen a problem means you miss out on troubleshooting the root cause of it - if there was a spike of badness this morning but you don't have logs for it, you're missing out on diagnostic information that may have protected you from repeats of that spike in future.
I've also had business guys want to analyse things like access logs in ways that they didn't know previously. Logs provide a datastore of historical activity, which in smaller shops is a cheap data lake.
Perhaps the 'no logs' thing works for your setup, but I think it's bad general advice. And your position is not that logs are useless ("turn on logs for that system until you've solved the problem"), but that retaining logs is useless -- quite a significant difference between the two.
That's an important distinction, one that I agree with, and I should make clearer.
Logs do have a purpose, but I'm not sure that retaining them does.
Sure, for a very small shop, throw them on a disk, use awk, sed, grep, and perl to look through them, and call it a day. But once you get to the point of "spinning up a cluster of log servers" or something like it, I'd say you're probably better off investing in monitoring instead.
More than once I've run the entire corpus of requests to a system, ever, through a dummy rebuild as a pretty great integration test. It's a powerful SRE tool. Spelunking through all historical data is just icing on that cake, honestly. As the author says, SRE is basically just an information factory; I'd be haaaaaard pressed to agree with you on throwing away a lot of information -- you don't know what you don't know until you want to know it -- and betting all-in on monitoring. Retaining logs is not the hardest problem SRE deals with, either, but SREs turn around and force unrealistic latency requirements on the query side (I see a lot of ELK deploys running into this).
You have to look at it as an Oracle. Oh, great Oracle of a pile of meaningless logs, cook off this map/reduce and tell me an interesting number that I can put in a Keynote for executives. Definitely not dashboarding from logs data. That's an impedance mismatch that Google gets away with because of the nature of their logging.
Nonetheless, splunk is the most expensive software license on the planet. More than Oracle, yes.
Interestingly, we never used a cluster of log servers. I was always skeptical of their utility. It was grep, plus some hand-rolled utility scripts to interrogate. One was a thing of beauty I spent years on, which saved us a ton of time.
A monitoring system has a lower barrier to entry.
http://datadoghq.com/ => will do ALL of that and much more. You can deploy it in a few hours to thousands of hosts, no problem.
Direct competitor: http://signalfx.com/
No money to pay for a high-quality tool? graphite + statsd will do the trick for basic infrastructure. However it's single host, doesn't scale, and only basic ugly graphs are supported.
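For reference, getting data into statsd is about as simple as it gets -- it's a tiny text protocol over UDP, so a sketch needs nothing beyond the stdlib (host/port and metric names are whatever your setup uses):

    import socket, time

    STATSD = ("127.0.0.1", 8125)  # statsd's default UDP port
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(name, value=1):
        sock.sendto(f"{name}:{value}|c".encode(), STATSD)    # counter

    def timing(name, ms):
        sock.sendto(f"{name}:{int(ms)}|ms".encode(), STATSD)  # timer

    start = time.monotonic()
    # ... handle a request ...
    incr("myapp.requests")
    timing("myapp.request_time", (time.monotonic() - start) * 1000)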
That's what Grafana is for -- i.e. creating nicer displays for Graphite.
> However it's single host, doesn't scale
It may take some effort, but it can be done, and much of the heavy-lifting seems to have been done and been made available as open-source.
Here's a blog post from Jan. 2017 from a gambling site about scaling Graphite.
And here's a talk from Vladimir Smirnov at Booking.com from Feb. 2017 about scaling Graphite -- their solution is open-source (links in the talk and slides available at the link):
> This is our story of the challenges we’ve faced at Booking.com and how we made our Graphite system handle millions of metrics per second.
(And this is an older, but more comprehensive, look at various approaches to scaling Graphite from the Wikimedia people with the pros and cons listed).
It can be done but at what costs? Better get a tool that gets the job done out of the box and does it well.
For starters, if you're operating in the cloud, you cannot get servers with FusionIO drives and top notch SSD. That limits your ability to scale vertically.
Similarly, basic infrastructure health is not giving you the same sort of information (what the software is actually doing) that logging does. In order to do time-series monitoring of your software rather than your system, you need to spend time thinking about what metrics you need to track and how you're going to obtain them.
I run both an ELK stack and a Prometheus stack, and I find they're good for different things.
Since you mention Papertrail specifically in the context of costs - Papertrail is actually a bit pricey relative to the competition. For example, compare https://papertrailapp.com/plans to https://sematext.com/logsene/#plans-and-pricing . I think Sematext Logsene is 2-3 times cheaper than Papertrail.
Lastly, I was at a Cloud Native Conference in Berlin last week. A lot of people have the same setup as you - ELK for logs + Prometheus for metrics. We're running Sematext Cloud, where we ship both our metrics and our logs, so we can switch between metrics and logs much more easily, correlate, and troubleshoot faster. Seems a bit simpler than ELK+Prometheus...
Sumologic, papertrail, logentries (and many more) for cloud logs. Graylog or ELK or Splunk for self hosted logs.
However, logs should never be sent to the cloud; they contain information too sensitive to outsource. Server metrics + stats are more reasonable.
Agree, stats and logs cover different things. Need both.
Saying that logs should never be sent to the cloud is overly absolute. Some logs should indeed stay behind firewall, but lots of organizations have logs that can be shipped out to services whose features derive all kinds of interesting insights from logs.
Disclaimer: I work at Sumo Logic and enjoy it.
Also the cost of the infrastructure to search the logs and view the logs.
> Sometimes logs can be useful, but only after your monitoring system has told you which system is not behaving, and then you can turn on logs for that system until you've solved the problem, but you shouldn't need access to old logs, because if the problem was only in the past, then it's not really a problem anymore, right?
Some things happen rarely, but can still have large impact. E.g., imagine a once-a-day job of moving files which fails twice a month, rendering those files inaccessible.
Incident reviews. If something happened that wasn't covered, then it is added as an outcome of the incident review.
> Are there cases where an automated diagnosis could not be made for an incident and if so was manual recourse possible?
For sure. Manual recourse was to dig in and figure it out either with the command line or the monitoring system or whatever else.
> How would you retrain the 'diagnosing app' to handle the new case?
In most cases the "diagnosing app" was a dashboard on the monitoring system, with a set of relevant graphs, so you would add a new graph. There was also a tool that correlated graphs, so you could add a new graph and correlation.
I estimate there's maybe 25% of the industry that's ignorant of "best practices", 70% that follow them dogmatically, and 5% that use them as guidelines, evaluating each situation on its own merits and choosing what makes sense.
I feel like it's more frustrating to deal with the 70% than the 25%. For some people documentation can only sound like a virtuous thing. But documentation can do harm as soon as it gets out of sync with reality.
While I hesitantly agree despite it seeming counterintuitive (you make a good case), I'd contend the code can take a form that looks runbook-y. I've had success in my organizations with Jupyter notebooks with documentation mixed throughout. Sometimes you do need a human, and in those cases having the documentation update live with the state of the world was huge for comprehension, particularly when you're centrally executing the notebook. Each step is something like:
> 0) Blurb about what's going on, warnings, etc.
> 1) Call into your automation code with one well-named entry point, like reboot_all_frontend_servers().
> 2) Display the relevant results immediately under that cell.
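A single step in such a notebook might look roughly like this (names and output invented; the real entry point would live in your automation library):

    # --- markdown cell ----------------------------------------------------
    # Reboot every frontend, one at a time, waiting on the LB health check.
    # Safe to re-run: hosts that already came back healthy are skipped.

    # --- code cell --------------------------------------------------------
    def reboot_all_frontend_servers(dry_run=True):
        # Stand-in for the real automation entry point the notebook calls into.
        hosts = ["fe-01", "fe-02", "fe-03"]
        return {h: ("skipped (dry run)" if dry_run else "rebooted, healthy")
                for h in hosts}

    results = reboot_all_frontend_servers(dry_run=True)

    # --- code cell --------------------------------------------------------
    # Display the result immediately under the step, so the operator sees
    # live state instead of stale documentation.
    for host, status in results.items():
        print(f"{host:>6}  {status}")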
Then you can step, yadda yadda. Idempotency of the steps is key. I have a vision for the operations bible taking such a form, with each thing-ops-needs-to-do represented with a notebook, but that might be unattainable -- you might be correct about that. Even still, mixing documentation and code seems to potentially push that barrier just a little further back. As a few people told you elsewhere in this thread (and, I think you know), complete automation is a big ask outside of a subset of maybe a dozen valley companies, even at a small property level. Until then, giving humans the tools to reliably do their job, like mechanized dynamic (not static) documentation, might be useful enough to not discard entirely.
I find that while documentation suffers from 'detail rot' quite fast, it does help with higher-level stuff like the rough outline of how things are organised and where they live, or why decisions were made to do -foo-.
What does one do in that kind of situation where one has a novel problem and needs to do something about it?
Should the organisation keep all of the details of how to do things and how things work in their heads (and presumably ensure that enough people are available whenever you might have an issue), or work out how things work/what does what on the fly? (or maybe a mix of the two?)
But if you don't have that luxury, then you're hoping the documentation is up to date -- and chances are that if you don't have enough resources for on-call coverage from someone very familiar with the area that's having a problem, they also didn't have time to write or update documentation.
So ideally the person who wrote it has it in their head, or you're figuring it out on the fly and hoping you have good comments in the code and good metrics.
What if the code is 15 years old, and everyone involved with writing it has left?
To make the script less naive, you'd need a development environment with the same infrastructure components as production, and the freedom to make them fail so you can develop your script against their failure states.
Externalizing a process that previously existed only in someone's head is a win. Creating a checklist is a straightforward framework for such externalizing, and allows you to separate the question "how do I accomplish my goal?" from the question "how do I express this with code?" (like writing pseudocode before real code).
Whether a process is automated or manual, there's always room for error if the surrounding context changes. When an automated process is annotated _like_ a checklist, I find that I get the best of both worlds: minimal affordance for human error with a clearly described thought process to fall back on in the event of a problem.
(It's also not terribly uncommon to have steps that can't be fully automated, like authenticating to a VPN with 2FA under certain security frameworks...!)
I've had at least one checklist where, by following it religiously, human error was rare - and by paying proper attention, what errors did occur were noticeable and correctable.
Handed off the checklist to someone else (at management's request - it was eating up some of my time) and the human error rate went to nearly 100%. Great guy, good at his main job, just not fastidious about checklist discipline. I ended up creating an incredibly brittle, constantly breaking, broken-as-hell time sinking set of scripts to automate the process on our build servers, because that was less work. Sigh.
If SREs are too good, developers lose touch with production and get lazy.
Some of them flat out don't belong in Support. But poor management and poor metrics drive behavior in unhelpful directions, too.
This followed an attempt to run a more 'standard' support service (before my time) with specialist support staff. That failed badly, mostly because customers and devs hated it -- the specialists added little value to the process.
It was a long time before people were allowed to be recruited direct to support again, and by then the leaders were all former devs->tech leads.
"Youporn.com is now a 100% Redis Site" (126 comments, 5 years ago)
"How YouPorn Uses Redis: SFW Edition" (95 comments, 4 years ago)
"YouPorn: Symfony2, Redis, Varnish, HA Proxy... (Keynote at ConFoo 2012)" (49 comments, 5 years ago)
There's more in the HN search.
Don't you face a hard time recruiting top engineers? I always thought this would be a major concern.
Occasionally people expressed disquiet, but since the main alternative in London is working for banks, there wasn't a great deal of choice. Personally I don't see the excessive advertising we are subject to as much better for society than gambling being available, but hey.
Did both. Gambling is very similar to finance.
Turns out that finance pays more and treats their employees better.
It also probably helps that the major player is called "ladbrokes"!
Never mind that those people are literally the least qualified to control their gambling addiction.
If anyone's interested, it was called "Ka-ching! Pokie Nation" http://www.abc.net.au/tv/programs/kaching-pokie-nation/
Arguably worse than gambling or porn
I don't have a problem selling a good that is honest and that people are free not to consume. I also wouldn't have an issue with working for a company that provided porn or drugs, when they become legal.
I would have a problem working for an isp and selling peoples data.
Do you think that they have "questionable morality" by the business nature (gambling, porn, etc) or for how they run the business (ex. dark patterns) ?
morality is subjective
Such dilemmas leave me with the idea that it is very naive to think there is some kind of absolute morality.
So most things people consider morals cannot all be simultaneously morals. Without self-consistanty you will end up with paradoxes like you mention.
Let's just start at the top. Please give your absolute morality solution that has "self-consistanty" (sic) to "The trapped mining crew".
> Heather is part of a four-person mining expedition. There is a cave-in and the four of them are trapped in the mine. A rock has crushed the legs of one of her crew members and he will die without medical attention. She’s established radio contact with the rescue team and learned it will be 36 hours before the first drill can reach the space she is trapped in.
> She is able to calculate that this space has just enough oxygen for three people to survive for 36 hours, but definitely not enough for four people. The only way to save the other crew members is to refuse medical aid to the injured crew member so that there will be just enough oxygen for the rest of the crew to survive.
> Should Heather allow the injured crew member to die in order to save the lives of the remaining crew members?
Maybe if you point out what aspect of my claim you take issue with, I could answer it briefly.
You've just declined to offer any explanation of what your system would propose in that situation.
So I guess the aspect of your claim that I doubt is the mere existence of your self-consistent and absolute moral system at all?
1. The consistency: it's just one rule, so it should be able to drive decisions without much ambiguity.
2. The "absoluteness": if it's unique as the only workable self-consistent moral system, that should suffice. Perhaps there are others, though I suspect they have one of these features:
a. they drive equivalent decisions FAPP
b. they are unstable and eventually won't be able to continue; either the hosts die or move onto another system
I admit I haven't proven this to be the case, but I suspect it's true, or at least something along these lines. Certainly the opposite hasn't been proven.
Unfortunately, it's difficult for the audience to determine whether the author's success is attributable to the philosophy in the post, or the 10x growth in staff.
Whether that was due to me, the things we did, or other reasons is of course open to debate. This is just my view from where I sat.