
Working with the Chaos Monkey - CWIZO
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
======
DanielBMarkham
I think we're going to be seeing a lot more of Chaos Monkey.

CM is a form of active TDD at the system architecture level. This might evolve
into setting up partition tests as a prerequisite to instantiating the
deployment model. (Translation: before you start putting something on a cloud
instance, write code that turns the instance off and on from time to time.)
This ensures that the requirements for survival are baked into the app and not
something tacked on later, after some public failure like the Amazonocalypse.
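
A rough sketch of what such a partition test could look like in, say,
distributed Erlang (the function name and interval here are made up for
illustration, not from any real tool):

    %% Sever the connection to one random peer node every so often,
    %% simulating the partitions the app must be able to survive.
    partition_test(IntervalMs) ->
        timer:sleep(IntervalMs),
        case nodes() of
            [] -> ok;
            Nodes ->
                Victim = lists:nth(random:uniform(length(Nodes)), Nodes),
                erlang:disconnect_node(Victim)
        end,
        partition_test(IntervalMs).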

I was reading a comment on HN the other day from a guy talking about Google.
He said he'd seen engineers pull the wires from dozens of routers handling GBs
of data -- all without a hitch. The architecture was baked enough that failure
was expected.

Many times failure modes like this are burned into hardware, but that kind of
design is a long, long, long way from most people's systems.

~~~
Smerity
As you say, Google were one of the pioneers of the Chaos Monkey concept; they
simply run at a scale where the Chaos Monkey occurs through normal failure
rates. For sufficiently large MapReduce jobs you can expect one of the compute
nodes to fail during the task. If the jobs were restarted every time this
occurred, they would never actually complete[1]!
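
To put rough illustrative numbers on it (mine, not from the paper): even a
tiny per-node failure probability compounds quickly across thousands of nodes.

    %% Probability that a job spanning N nodes sees zero failures, given
    %% a per-node failure probability P over the job's duration.
    %% e.g. no_failure_probability(2000, 0.001) is roughly 0.135, so a
    %% restart-from-scratch strategy would throw away most runs.
    no_failure_probability(N, P) ->
        math:pow(1 - P, N).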

Since we're allowed to comment on anything public, I'll focus on a paper about
disaster recovery at Google, one that centers on the Perforce version control
system[2]. Because Perforce is centralized and proprietary, it raises a few
novel issues. Since Google can't modify the code themselves, it's in fact one
of the few vertically scaled pieces of software at Google (Perforce instances
run on machines with 256GB of RAM[3]).

Of particular interest to me is the Annual Disaster Recovery Test that Google
runs. They assume that the admins/engineers at Mountain View are entirely
unavailable and that the fail-overs happen with no advance notice. The idea is
that during an actual disaster your staff won't have time to answer queries as
to which folder that documentation was in or the order that commands need to
be run.

[1] - This is in one of the official Google MapReduce papers; I'll try to
hunt it down

[2] - <http://www.perforce.com/perforce/conferences/us/2009/Presentations/Wright-Disaster_Recovery-paper.pdf>

[3] - <http://www.perforce.com/perforce/conferences/us/2009/Presentations/Wright-Disaster_Recovery-slides.pdf>

~~~
robin_reala
Why are Google using a centralised and closed piece of software? Does it bring
many benefits that haven’t been replicated in open alternatives? Or is it just
that the cost of switching is high enough to become prohibitive?

~~~
Lewisham
The short answer is that when Google started, Perforce was the best kid on the
block. Once you get as far down the road as Google, it's hard to change that
incumbent, even if culture dictates the use of open-source software.

The other problem is that Google just has a whole lot of code. They've got
engineers cranking out code all day all over the world. Working at that sort
of scale rules out other alternatives (note that Google hired the Subversion
guys, and Google isn't using Subversion... this should tell you something).

------
neebz
Netflix is turning out to be my favourite tech company. Just a week ago, in an
extensive interview, they mentioned that to provide a consistent interface
across so many platforms they ended up porting their own version of WebKit.
And now the Chaos Monkey. It's amazing how technically sound they are,
considering they were just an online DVD rental company at the start.

~~~
gaius
There's nothing "just" about it. From day 1, they were handling real people's
money, and moving real physical items around. They were not a typical web
company which is really just a website with Adwords.

------
sp332
The Guinness World Record for most steps in a Rube Goldberg device was just
set at a competition at Purdue University. The device takes 244 steps to water
a flower! Now, if you saw the Mythbusters' Christmas episode with the Rube
Goldberg device, you know it's really hard to make all those steps go right.
But in this one, the engineers used a "hammer test": at any point during the
operation of the machine, an engineer could tap the side with a hammer. If it
screwed up, that stage was redesigned.
<http://www.popularmechanics.com/technology/engineering/gonzo/the-worlds-most-complicated-rube-goldberg-machine>
The end result was the most complex machine of its kind, but it runs very
reliably.

------
pwim
When I first read about the Chaos Monkey, I had assumed it was used on their
development/staging environment, but this article implies it is on their
production system. Does anyone know which is correct?

~~~
dpritchett
I would think you'd have to run it on production for its results to be truly
worthwhile.

~~~
mey
Depends on how much risk/pain you are willing to accept while your systems are
being designed to adapt to it.

------
cpeterso
The Chaos Monkey reminds me of some papers I've read about "crash-only
software" and "recovery-oriented computing". With this approach, server
software is written assuming the only way it will ever shut down is a crash,
even for scheduled maintenance. The software must be designed to recover
safely every time the service is started. Instead of exercising recovery code
paths rarely, they are tested every day.

<http://www.armandofox.com/geek/past-projects/recovery-oriented-computing-roc/>

<http://www.usenix.org/events/hotos03/tech/candea.html>
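
As a tiny illustration of the principle (my sketch, not from the papers), in
Erlang you can make startup and crash recovery the same code path by opening a
disk table that repairs itself on open:

    %% Crash-only startup: there is no separate "clean shutdown" save
    %% step. dets notices an unclean close and repairs the file on open,
    %% so the recovery path is exercised on every single start.
    init([]) ->
        {ok, state} = dets:open_file(state, [{file, "state.dets"}]),
        {ok, state}.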

~~~
pnathan
That approach is very interesting.

The idea of constantly persisted state is something I think would be massively
innovative if applied right.

------
jamii
Here is an erlang version of the chaos monkey:

    
    
    %% A process is fair game unless it's an OTP system process or one
    %% of the protected Minions.
    potential_victim(Minions) ->
        fun (Pid) ->
            not pman_process:is_system_process(Pid)
                andalso not lists:member(Pid, Minions)
        end.

    %% Kill one random eligible process; grab its registered name first
    %% so the kill can be logged.
    death_from_above(Minions) ->
        Pids = lists:filter(potential_victim(Minions), erlang:processes()),
        case Pids of
            [] -> none;
            _ ->
                Victim = lists:nth(random:uniform(length(Pids)), Pids),
                Name = pman_process:pinfo(Victim, registered_name),
                exit(Victim, kill),
                {ok, Victim, Name}
        end.


The idea is to run it during load tests. Afterwards run your normal unit tests
to check that nothing got permanently broken. It's good for finding broken
supervisor trees.
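
For instance (my wrapper, not part of the snippet above), you can loop it on a
timer for the duration of the load test:

    %% Kill one random process every IntervalMs for the whole test run.
    rampage(Minions, IntervalMs) ->
        timer:sleep(IntervalMs),
        death_from_above(Minions),
        rampage(Minions, IntervalMs).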

~~~
Lewisham
Yeah, I've been playing around with the idea of Chaos Monkey at the code
level, rather than at the systems level, but you can only truly do it with
independent actors. I'm hoping to have something to show soon, probably on
Akka/Scala.

There is a similarity with mutation testing, but mutation testing is trying to
throw things too far up the chain; it wants your program to crash and die so
the test fails. Really, we want it the other way: proof that the test would
have failed, but the program is still running effectively.

I've worked with runtime repair in the past, which is also sort of similar,
but, IMHO, less effective than Erlang-style Let It Crash. [1]
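
In Erlang terms, Let It Crash just means the monkey's kills should be absorbed
by a supervisor; a minimal one_for_one spec (the worker module name here is
hypothetical) looks like:

    %% Restart the worker whenever it dies, up to 10 times per minute.
    init([]) ->
        {ok, {{one_for_one, 10, 60},
              [{my_worker, {my_worker, start_link, []},
                permanent, 5000, worker, [my_worker]}]}}.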

[1] <http://www.zenetproject.com/pages/lakitu>

------
augustl
Akin's Laws of Spacecraft Design [1], law no. 2:

> To design a spacecraft right takes an infinite amount of effort. This is why
> it's a good idea to design them to operate when some things are wrong.

[1] <http://spacecraft.ssl.umd.edu/old_site/academics/akins_laws.html>

------
guelo
Wow cool idea, but I don't think I'd be able to convince my company to do
this.

------
andrewcooke
argh! so why was the server crashing? you can't leave me in such suspense....!

~~~
urbanjunkie
<http://serverfault.com/questions/104791/windows-server-2008-r2-network-adapter-stops-working-requires-hard-reboot>

Broadcom networking card and Windows Server 2008

------
adamc
Building things this way strikes me as expensive. At Netflix's scale, it pays
off, but for systems that don't serve as many requests I'm forced to wonder
whether just avoiding the cloud might be more cost-effective.

~~~
larrik
Well, you really have to figure out how much it would cost your site/service
to be down for 2 days straight. Or maybe a week (PSN). Would this design pay
for itself in preventing that loss?

~~~
ludflu
...and then multiply that cost by the probability that this will happen to
find the expected payoff. If the work costs substantially more than the
payoff, don't bother. If substantially less, you're negligent if you don't do
it. (The Learned Hand Rule
<http://aler.oxfordjournals.org/content/7/2/523.full>)
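
In code form (trivial, but it keeps the terms straight):

    %% Learned Hand rule: take the precaution when its burden is less
    %% than the probability of the loss times the size of the loss.
    worth_doing(Burden, Probability, Loss) ->
        Burden < Probability * Loss.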

~~~
larrik
And 6 months ago, what WAS your perceived probability that an entire AWS
location would be down for several days?

------
ankimal
Even Linux has a chaos monkey of sorts. <http://linux-mm.org/OOM_Killer>

~~~
btmorex
Sort of, although in the case of the OOM killer, the goal is actually to kill
the "right" process. Of course, choosing the "right" process in code turns out
to be exceptionally hard across a lot of different workloads.

------
tokenadult
A decade or so ago, I heard computer programming described as a very good
occupation for a person who had Asperger syndrome or perhaps limited social
skills. I also recall reading then that some surveys of programmers working in
that era suggested that those programmers were much more introverted than the
general population. But I used to notice when I installed new programs on my
Microsoft Windows computer, even after installing Windows 95, that sometimes
installing one program would disable another program. That made me wonder if
maybe social skills are an essential element of good programming skills. Now
when software may have to run in the cloud, interacting with other software
hosted on other hardware, all attempting to operate synchronously, wouldn't
"software social skills" be rather important for any developer to understand?

------
kragen
> And that's why, even though it sounds crazy, the best way to avoid failure
> is to fail constantly.

This is my biggest concern with things like large nation-states, large banks,
large reinsurance companies, large RAIDs, and large nuclear plants: we
centralize resources into a larger resource pool in order to reduce the
chances of failure, but in doing so we make the eventual failure more severe,
and we reduce our experience in coping with it and our ability to estimate its
probability. In fact, we may not even be reducing the chances of failure; we
may just be fooling ourselves.

Consider the problem of replicating files around a network of servers. Perhaps
you have a billion files and 200 single-disk servers with an MTBF of 10 years,
and it takes you three days to replace a failed server.

One approach you can use is to pair up the servers into 100 mirrored pairs and
put 10 million files on each pair. Now, about 20 servers will fail every year,
leaving ten million files un-backed-up for three days. But the chance that the
remaining server of that pair will fail during that time is 3/3650 = 0.08%.
That will happen about once every 60 years, and so the expected lifetime of
the average file on your system is about 6000 years.

So it's likely that your system will hum along for decades without any
problems, giving you an enormous sense of confidence in its reliability. But
if you divide the files that will be lost once every 60 years (ten million) by
the 60 years, you get about 170 thousand files lost per year. The system is
fooling you into thinking it's reliable.

Suppose, instead, that you replicate each file onto two servers, but those
servers are chosen at random. (Without replacement.) When a server fails
(remember, 20 times a year), there's about a one in six chance that another
server will fail in the three days before it's replaced. When that happens,
every three or four months, a random number of files will be lost --- about 10
million / 200, or about fifty thousand files, for a total data loss of about
170 thousand files a year. You will likely see this as a major problem, and
you will undertake efforts to fix it, perhaps by storing each file on three or
four servers instead of two.

This is despite the fact that this system loses data at the same average rate
as the other one. In effect, instead of having 100 server pairs to store files
on, you have 19,900 partition pairs, each partition consisting of 0.5% of a
server. By making the independently failing unit much smaller, you've
dramatically increased your visibility into its failure rate, and given
yourself a lot of experience with coping with its failures.
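
Here is that arithmetic as a quick calculation (all the figures come from the
scenario above, and are approximate):

    %% Mirrored pairs: files lost per year, in rare, enormous events.
    mirrored_pairs() ->
        Failures = 200 / 10,              %% ~20 server failures a year
        PairDies = 3 / 3650,              %% partner dies in the 3-day window
        Failures * PairDies * 10000000.   %% ~170k files/year on average

    %% Random pairing: same average loss, in small, frequent events.
    random_pairs() ->
        Failures = 200 / 10,
        PeerDies = 199 * 3 / 3650,        %% ~1/6 chance some peer dies
        PerEvent = 10000000 / 199,        %% ~50k files shared with that peer
        Failures * PeerDies * PerEvent.   %% also ~170k files/year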

In this case, more or less by hypothesis, the failure rate is independent of
the scale of the thing. That isn't generally the case. If we had a lot of
half-megawatt nuclear reactors scattered around the landscape instead of a
handful of ten-gigawatt reactors, it's likely that each reactor would receive
a lot less human attention to keep it in good repair. When it threatened to
melt down, there wouldn't be a team of 200 experienced guys onsite to fight
the problem. There would be a lot more shipments of fuel, and therefore a lot
more opportunities for shipments of fuel rods to crash or be hijacked. And so
on.

But we might still be better off that way, because instead of having to
extrapolate nuclear-reactor safety from a total of three meltdowns of
production reactors --- TMI, Chernobyl, and Fukushima --- we'd have dozens,
if not hundreds, of smaller accidents. And so we'd know which design elements
were most likely to fail in practice, and how to do evacuation and
decontamination most effectively. Instead of Chernobyl having produced a huge
cloud of radioactive smoke that killed thousands or tens of thousands of
people, perhaps it would have killed 27, like the reactor failure in K-19.

With respect to nation-states, the issue is that strong nation-states are very
effective at reducing the peacetime homicide rate, which gives them the
appearance of substantially improving safety. Many citizens of strong nation-
states in Europe have never lived through a war in their country, leading them
to think of deaths by violence as a highly unusual phenomenon. But strong
nation-states also create much bigger and more destructive wars. It is not
clear that the citizens of, say, Germany are at less risk of death by violence
than the citizens of much weaker states such as Micronesia or Brazil, where
murder rates are higher.

------
martingordon
I would love to be a developer on a Chaos Monkey.

"Oh no, CM is down, everything's working!"

------
buddydvd
The blog post seems to imply Stack Exchange is working with the Chaos Monkey
when it really isn't. They didn't really build a system that randomly shuts
down servers or services. The difference is subtle but important.

~~~
Duff
The point here is that when you know that you're living with the Chaos Monkey,
your systems become very fault tolerant. Living with a monkey does that.

But even if you don't embrace the concept, the Chaos Monkey is likely going to
become an uninvited house guest at some point. In StackOverflow's case, they
bought a mainstream server with a mainstream OS, and discovered that the
server came with a monkey.

Think of it another way. My brother and sister-in-law never worried about the
failure characteristics of dinner plates, so they had lots of nice stuff. Then
they had a baby. All of a sudden, falling plates and glasses became something
they had to think about.

~~~
buddydvd
Right, I agree that is the main point of the post. However, generalizing the
name of Netflix's fault-inducing system to cover any unexpected failure
dilutes its meaning. Jeff's introduction to Netflix's system is what made this
post interesting (if you disagree, try the mental exercise of rewriting the
post in your mind without mentioning Netflix's system). There's a disconnect
between the main point of the post and its most interesting point.

~~~
MartinCron
It's perfectly OK for people to draw inspiration and find parallels from other
sources, especially when they are both inspiring and timely.

Jeff had something happen by accident that Netflix was smart enough to
engineer on purpose. Once again, _Jeff learns an important lesson_ and shares
it with everyone. No need to give him grief over that.

~~~
buddydvd
I'm not trying to give grief here. I respect Jeff highly and found this
article to be insightful. If people dislike my comment for its tone, I accept
it. It was probably a bit overly antagonistic.

