> And that's why, even though it sounds crazy, the best way to avoid failure is to fail constantly.
This is my biggest concern with things like large nation-states, large banks, large reinsurance companies, large RAIDs, and large nuclear plants: we centralize resources into a larger resource pool in order to reduce the chances of failure, but in doing so we make the eventual failure more severe, and we reduce our experience in coping with it and our ability to estimate its probability. In fact, we may not even be reducing the chances of failure; we may just be fooling ourselves.
Consider the problem of replicating files around a network of servers. Perhaps you have a billion files and 200 single-disk servers with an MTBF of 10 years, and it takes you three days to replace a failed server.
One approach you can use is to pair up the servers into 100 mirrored pairs and put 10 million files on each pair. Now, about 20 servers will fail every year, leaving ten million files un-backed-up for three days. But the chance that the remaining server of that pair will fail during that time is 3/3650 = 0.08%. That will happen about once every 60 years, and so the expected lifetime of the average file on your system is about 6000 years.
So it's likely that your system will hum along for decades without any problems, giving you an enormous sense of confidence in its reliability. But if you divide the files that will be lost once every 60 years (ten million) by the 60 years, you get about 170 thousand files lost per year. The system is fooling you into thinking it's reliable.
Suppose, instead, that you replicate each file onto two servers, but those servers are chosen at random. (Without replacement.) When a server fails (remember, 20 times a year), there's about a one in six chance that another server will fail in the three days before it's replaced. When that happens, every three or four months, a random number of files will be lost --- about 10 million / 200, or about fifty thousand files, for a total data loss of about 170 thousand files a year. You will likely see this as a major problem, and you will undertake efforts to fix it, perhaps by storing each file on three or four servers instead of two.
This is despite the fact that this system loses data at the same average rate as the other one. In effect, instead of having 100 server pairs to store files on, you have 19,900 partition pairs, each partition consisting of 0.5% of a server. By making the independently failing unit much smaller, you've dramatically increased your visibility into its failure rate, and given yourself a lot of experience with coping with its failures.
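The arithmetic in the two scenarios above is easy to check with a few lines of back-of-the-envelope code (the constants are the ones assumed in this discussion, not measurements):

```python
# Compare expected data loss for the two replication schemes discussed
# above: 1e9 files, 200 servers, 10-year MTBF, 3-day replacement time.

FILES = 1_000_000_000
SERVERS = 200
MTBF_DAYS = 10 * 365
REPAIR_DAYS = 3

failures_per_year = SERVERS / 10                            # ~20 server deaths/year

# Scheme 1: 100 mirrored pairs, 10M files per pair.
p_partner_dies = REPAIR_DAYS / MTBF_DAYS                    # ~0.08%
pair_losses_per_year = failures_per_year * p_partner_dies   # ~once per 60 years
loss_scheme1 = pair_losses_per_year * FILES / 100           # files lost per year

# Scheme 2: each file on 2 servers chosen at random.
p_any_other_dies = (SERVERS - 1) * REPAIR_DAYS / MTBF_DAYS  # ~1/6
incidents_per_year = failures_per_year * p_any_other_dies   # ~3.3 per year
files_per_incident = (2 * FILES / SERVERS) / (SERVERS - 1)  # ~50,000 shared files
loss_scheme2 = incidents_per_year * files_per_incident

print(f"mirrored pairs: ~{loss_scheme1:,.0f} files/year")
print(f"random pairs:   ~{loss_scheme2:,.0f} files/year")
```

Both work out to about 170 thousand files a year; only the visibility of the losses differs.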
In this case, more or less by hypothesis, the failure rate is independent of the scale of the thing. That isn't generally the case. If we had a lot of half-megawatt nuclear reactors scattered around the landscape instead of a handful of ten-gigawatt reactors, it's likely that each reactor would receive a lot less human attention to keep it in good repair. When it threatened to melt down, there wouldn't be a team of 200 experienced guys onsite to fight the problem. There would be a lot more shipments of fuel, and therefore a lot more opportunities for shipments of fuel rods to crash or be hijacked. And so on.
But we might still be better off that way, because instead of having to extrapolate nuclear-reactor safety from a total of three meltdowns of production reactors --- TMI, Chernobyl, and Fukushima --- we'd have dozens, if not hundreds, of smaller accidents. And so we'd know which design elements were most likely to fail in practice, and how to do evacuation and decontamination most effectively. Instead of Chernobyl producing a huge cloud of radioactive smoke that killed thousands or tens of thousands of people, perhaps it would have killed 27, like the reactor failure in K-19.
With respect to nation-states, the issue is that strong nation-states are very effective at reducing the peacetime homicide rate, which gives them the appearance of substantially improving safety. Many citizens of strong nation-states in Europe have never lived through a war in their country, leading them to think of deaths by violence as a highly unusual phenomenon. But strong nation-states also create much bigger and more destructive wars. It is not clear that the citizens of, say, Germany are at less risk of death by violence than the citizens of much weaker states such as Micronesia or Brazil, where murder rates are higher.
Netflix is turning out to be my favourite tech company. Just a week ago, in an extensive interview, they mentioned that, to provide a consistent interface across so many platforms, they ended up porting their own version of WebKit. And now the Chaos Monkey. It's amazing how technically sound they are considering they were just an online DVD rental company at the start.
There's nothing "just" about it. From day 1, they were handling real people's money, and moving real physical items around. They were not a typical web company which is really just a website with Adwords.
The Guinness World Record for most steps in a Rube Goldberg device was just set at a competition at Purdue University. The device has 244 steps to water a flower! Now, if you saw the MythBusters' Christmas episode with the Rube Goldberg device, you know it's really hard to make all those steps go right. But in this one, the engineers used a "hammer test": at any point during the operation of the machine, an engineer could tap the side with a hammer. If it screwed up, that stage was redesigned. http://www.popularmechanics.com/technology/engineering/gonzo... The end result was the most complex machine of its kind, but it runs very reliably.
The Chaos Monkey reminds me of some papers I've read about "crash-only software" and "recovery-oriented computing". With this approach, server software is written assuming the only way it would shutdown is a crash, even for scheduled maintenance. The software must be designed to recover safely every time the service is started. Instead of exercising recovery code paths rarely, they are tested every day.
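A minimal sketch of the crash-only idea (the log format and file name here are invented for illustration): there is no clean-shutdown path at all, so every startup runs the same recovery code, and recovery gets exercised on every restart rather than only after rare crashes.

```python
import json
import os

LOG = "service.log"  # hypothetical append-only log file

def apply(state, entry):
    state[entry["key"]] = entry["value"]

def recover():
    """Rebuild state by replaying the log; tolerate a torn final line."""
    state = {}
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                try:
                    apply(state, json.loads(line))
                except json.JSONDecodeError:
                    break  # partial write from the crash; discard the tail
    return state

def put(state, key, value):
    # Every write is durable before it is acknowledged, because a crash
    # can happen at any moment -- there is no orderly shutdown to flush.
    with open(LOG, "a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())
    state[key] = value

# Startup is always "crash recovery":
state = recover()
```

The interesting property is that a kill -9, a power cut, and a scheduled restart all take exactly the same code path.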
When I first read about the Chaos Monkey, I had assumed it was used on their development/staging environment, but this article implies it is on their production system. Does anyone know which is correct?
Yeah, I've been playing around with the idea of Chaos Monkey at the code level, rather than at the systems level, but you can only truly do it with independent actors. I'm hoping to have something to show soon, probably on Akka/Scala.
There is a similarity with mutation testing, but mutation testing throws the failure too far up the chain; it wants your program to crash and die so that the test fails. Really, we want it the other way around: proof that the test would have failed, but with the program still running effectively.
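For contrast, here is a toy illustration of the mutation-testing idea (not a real framework; all names are invented): inject a small change, a "mutant", and check whether the tests notice it.

```python
# Mutation testing in miniature: the tool rewrites the code under test
# and expects the test suite to fail, "killing" the mutant.
def add(a, b):
    return a + b

def mutant_add(a, b):
    return a - b  # the injected mutation: '+' replaced with '-'

def suite_passes(fn):
    # A one-assertion stand-in for the test suite.
    return fn(2, 3) == 5

print(suite_passes(add))         # original: tests pass
print(suite_passes(mutant_add))  # mutant: tests fail, so the mutant is killed
```

The Chaos Monkey inverts this: it injects the fault and wants the system to keep passing.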
I've worked with runtime repair in the past, which is also sort of similar, but, IMHO, less effective than Erlang-style Let It Crash. 
I think we're going to be seeing a lot more of Chaos Monkey.
CM is a form of active TDD at the system architecture level. This might evolve into setting up partition tests as a prerequisite to instantiating the deployment model. (Translation: before you start putting something on a cloud instance, write code that turns the instance off and on from time to time.) This ensures that the requirements for survival are baked into the app and not something tacked on later after some public failure like the Amazonocolapse.
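The parenthetical "translation" above might look something like this sketch, with `stop_instance` standing in for a real cloud provider API call (the fleet names and function are invented for illustration):

```python
import random

INSTANCES = ["web-1", "web-2", "worker-1"]  # hypothetical fleet

def stop_instance(name):
    # Stand-in for a real API call (e.g. your provider's stop-instance
    # endpoint); here we just record which instance would have died.
    print(f"stopping {name}")
    return name

def chaos_round(rounds=3):
    """Pick a random instance and 'stop' it, `rounds` times."""
    return [stop_instance(random.choice(INSTANCES)) for _ in range(rounds)]
```

Running something like this on a schedule against staging, then production, is the whole trick: survival becomes a tested requirement instead of a hope.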
I was reading on HN the other day a guy talking about Google. He said he saw engineers pull the wires from dozens of routers handling GBs of data -- all without a hitch. The architecture was baked enough that failure was expected.
Many times failure modes like this are burned into hardware, but that kind of design is a long, long, long way from most people's systems.
As you say, Google was one of the pioneers of the Chaos Monkey concept; they simply run at a scale where the Chaos Monkey occurs through normal failure rates. For sufficiently large MapReduce jobs you can expect at least one of the compute nodes to fail during the task. If a MapReduce job restarted from scratch every time this occurred, large jobs would never actually complete!
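The reason jobs do complete is that the master re-queues individual failed tasks on other workers rather than restarting the whole job. A simplified sketch of that scheduling loop (not Google's actual code; the parameters are made up):

```python
import random

def run_job(tasks, p_worker_death=0.01, max_attempts=10):
    """Run every task to completion, rescheduling tasks whose worker dies."""
    results = {}
    attempts = {t: 0 for t in tasks}
    pending = list(tasks)
    while pending:
        task = pending.pop()
        attempts[task] += 1
        if random.random() < p_worker_death:
            # Worker died mid-task: put the task back on the queue
            # instead of abandoning the job.
            if attempts[task] >= max_attempts:
                raise RuntimeError(f"task {task} failed too many times")
            pending.append(task)
        else:
            results[task] = f"output-{task}"
    return results
```

Because only the lost task is retried, the probability the *job* fails stays tiny even when individual worker deaths are routine.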
As we're allowed to comment on anything public, I'll focus on a paper about disaster recovery at Google which focuses on the Perforce version control system. As Perforce is centralized and proprietary it even raises a few novel issues. As they can't modify the code themselves it's in fact one of the few vertically scaled pieces of software at Google (Perforce instances run on machines with 256GB of RAM).
Of particular interest to me is the Annual Disaster Recovery Test that Google runs. They assume that the admins/engineers at Mountain View are entirely unavailable and that the fail-overs happen with no advance notice. The idea is that during an actual disaster your staff won't have time to answer queries as to which folder that documentation was in or the order that commands need to be run.
 - This is in one of the official Google MapReduce papers, I'll try and hunt it down
Why are Google using a centralised and closed piece of software? Does it bring many benefits that haven’t been replicated in open alternatives? Or is it just that the cost of switching is high enough to become prohibitive?
The short answer is that when Google started, Perforce was the best kid on the block. Once you get as far down the road as Google, it's hard to change that incumbent, even if culture dictates the use of open-source software.
The other problem is that Google just has a whole lot of code. They've got engineers cranking out code all day all over the world. Working at that sort of scale rules out other alternatives (note that Google hired the Subversion guys, and Google isn't using Subversion... this should tell you something).
Building things this way strikes me as expensive. At Netflix's scale, it pays off, but for systems that don't serve as many requests I'm forced to wonder whether just avoiding the cloud might be more cost-effective.
...and then multiply that cost by the probability that this will happen to find the expected payoff. If the work costs substantially more than the payoff, don't bother. If substantially less, you're negligent if you don't do it. (The Learned Hand Rule http://aler.oxfordjournals.org/content/7/2/523.full)
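As arithmetic, the rule in the linked article reduces to a one-line comparison: spend on prevention when the burden B is less than the probability P times the loss L. The dollar figures below are invented for illustration.

```python
def worth_preventing(burden, probability, loss):
    """Learned Hand rule: prevention pays off when B < P * L."""
    return burden < probability * loss

# e.g. two engineers for a year (~$400k) against a 10% chance of a
# $10M outage: 400k < 0.1 * 10M, so do the work.
print(worth_preventing(400_000, 0.10, 10_000_000))  # True
```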
A lot of it is simply building in redundancy; it has little to do directly with the cloud. And yes, redundancy is always costly, especially if done right. You just need to decide whether your business plan will benefit from it, and which parts are necessary.
The rest of it is making sure your app or site fails gracefully; that is, that failure of one part doesn't bring down the whole. That can be expensive to retrofit, but it really should be designed in from the beginning, as it is a generally accepted part of good design for anything running over a network.
"Building things this way strikes me as expensive."
That is a qualitative statement. It implies a value proposition in your head between engineering effort involved in doing CM style disaster prep vs product benefit.
Operationally, not doing some level of CM is like paying for operations with "Lottery Checks". A Lottery Check has a payee and a nominal amount, but where the amount is actually printed, there is a scratch-off box. Sometimes when you scratch it off, it's for a lot more money than you intended to spend :-).
But it is very hard to talk rationally about "we're spending two engineers here to do nothing but try to randomly break the system and get bugs fixed that would cover for that problem." Because the problem is self-inflicted, it seems like a waste of money, and there is no guarantee that they will have found and fixed the problem which is going to kill you in the future. However, if you run an experiment enough times, you eventually converge on the solution. Think of it as the Monte Carlo method of systems test. It's a good thing, and it helps people sleep at night.
And when the world does go pear shaped like it did with AWS here you may find yourself yawning rather than panicking, and that feels very good indeed.
Sort of, although in the case of the OOM killer, the goal is actually to kill the "right" process. Of course choosing the "right" process in code turns out to be exceptionally hard over a lot of different workloads.
A decade or so ago, I heard computer programming described as a very good occupation for a person who had Asperger syndrome or perhaps limited social skills. I also recall reading then that some surveys of programmers working in that era suggested that those programmers were much more introverted than the general population. But I used to notice when I installed new programs on my Microsoft Windows computer, even after installing Windows 95, that sometimes installing one program would disable another program. That made me wonder if maybe social skills are an essential element of good programming skills. Now when software may have to run in the cloud, interacting with other software hosted on other hardware, all attempting to operate synchronously, wouldn't "software social skills" be rather important for any developer to understand?
The blog post seems to imply Stack Exchange is working with the Chaos Monkey when it really isn't. They didn't really build a system that randomly shuts down servers or services. The difference is subtle but important.
The point here is that when you know that you're living with the Chaos Monkey, your systems become very fault tolerant. Living with a monkey does that.
But even if you don't embrace the concept, the Chaos Monkey is likely going to become an uninvited house guest at some point. In StackOverflow's case, they bought a mainstream server with a mainstream OS, and discovered that the server came with a monkey.
Think of it another way. My brother and sister-in-law never worried about the failure characteristics of dinner plates, so they had lots of nice stuff. Then they had a baby. All of a sudden, falling plates and glasses became something that they had to think about.
Right, I agree that is the main point of the post. However, generalizing Netflix's name for their fault-inducing system to cover any unexpected failure dilutes the meaning of that name. Jeff's introduction to Netflix's system is what made this post interesting (if you disagree, try the mental exercise of rewriting the post in your mind without mentioning Netflix's system). There's a disconnect between the main point of the post and the most interesting point of the post.
I think the post makes the distinction very clearly. First we hear about Netflix doing this surprising thing on purpose. Then we hear about the Stack Exchange guys pulling out their hair to fix a recurring problem. But "even in our time of greatest frustration, I realized that there was a positive side to all this." Where's the ambiguity?
This is where the distinction is made: "Who in their right mind would willingly choose to work with a Chaos Monkey? Sometimes you don't get a choice; the Chaos Monkey chooses you." But, the final takeaway point of the post blurs that distinction: "the best way to avoid failure is to fail constantly." Stack Exchange didn't "choose" the "best way to avoid failure" -- the problem chose them. Netflix, on the other hand, did make that choice. The article is written as if Stack Exchange followed Netflix's path and chose to fail constantly when it didn't.