
Chaos Monkey released into the wild - timf
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
======
InclinedPlane
This reminds me of something I read the other day¹: the idea that complex
fault-tolerant systems tend to end up running with faults as a matter of
course (and sheer probability). This elevates that notion to another level:
get rid of the idea of operating without faults, and maintain a low level
of faultiness artificially to ensure that the system's resiliency to faults
is always working.

¹ [http://www.johndcook.com/blog/2012/07/13/fault-tolerant-
syst...](http://www.johndcook.com/blog/2012/07/13/fault-tolerant-systems-are-
faulty/)
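
A minimal sketch of that idea in Python (the instance list and termination
call are made up for illustration, not Netflix's actual API):

    import random
    import time

    INSTANCES = ["i-001", "i-002", "i-003", "i-004"]  # stand-in fleet

    def terminate(instance_id):
        # A real implementation would call the cloud provider's API here.
        print("terminating %s" % instance_id)

    def chaos_loop(kill_probability=0.1, interval_seconds=3600):
        """Keep a low level of artificial faultiness flowing, so fault
        handling is exercised constantly rather than only in disasters."""
        while True:
            if random.random() < kill_probability:
                terminate(random.choice(INSTANCES))
            time.sleep(interval_seconds)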

~~~
ryusage
Reminds me of biology. It's always intrigued me that we often want to make
computer systems as "good" as humans, and yet our acceptance of human faults
is far higher than what we accept from computers. Any small part of a body may
have some problems, but the body as a whole continues to function well. I
wonder if this kind of design will become a trend?

~~~
delinka
We tolerate fewer faults from our computers because the underlying system is
_not_ complex enough to prevent those faults from affecting us in large ways.
As you say, a small failing in the human body doesn't render the entire
organism inert and useless; no, we can adapt our behavior during healing and
move on. When a small fault occurs in your computer, it tends to have
disastrous effects.

We will tolerate faults in our computer systems when the faults don't cost
us our homework, our jobs, or our lives. And complexity is the only way to
make that happen.

I, for one, welcome our new chaos-wielding overlord!

------
technomancy
Having the code behind the chaos monkey is not nearly as valuable as having
the guts to run it in the first place.

~~~
jedberg
If this were reddit, I'd link to that meme about doing testing in production.

Since this is HN, I'll instead say that if you don't have the guts, you're
only lulling yourself into a false sense of security. AWS or not, your systems
_will_ fail.

By running the monkey, at least you can make it fail on a schedule that is
convenient to you, instead of having it happen when you're drunk on a
Saturday night (or whatever your vice of choice might be).
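
Something as simple as a time-window check keeps the failures on your
schedule (the 9-to-3 weekday window here is an assumption, not necessarily
Netflix's exact policy):

    from datetime import datetime

    def safe_to_unleash(now=None):
        # Only inject failures when engineers are at their desks to respond.
        now = now or datetime.now()
        return now.weekday() < 5 and 9 <= now.hour < 15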

~~~
freehunter
At my work (a large corporate office), we have random node outages. It's
not quite as in-depth as Chaos Monkey, but it serves the same purpose: just
pull the plug on the server. More than once, a random node outage has
caught a novice developer hardcoding links to individual nodes instead of
going through the load balancer. We also have random pen-tests designed to
DoS or otherwise disable services around the network. Controlled
destruction of your infrastructure is the quickest way to highlight any
faults.
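
A rough sketch of the anti-pattern those outages catch (hostnames invented
for illustration):

    # Breaks the moment node 07 gets unplugged:
    NODE_URL = "http://app-node-07.internal:8080"

    # Survives single-node outages; the balancer routes around dead nodes:
    SERVICE_URL = "http://app.internal:8080"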

But remember: what's the difference between hacking and pentesting?
Permission.

------
waivej
This is so cool, and I'm wondering if I can do something similar. It reminds
me of "bug seeding" where you purposely insert bugs into your product and
count how many are found through testing. (Of course, you track them so you
can take them out later.)

~~~
TimJRobinson
That's a really good idea. Do you know if there are any tools out there for
doing this? For something basic, you could make it change random variables
or operators in the code base, run the tests after each change, and report
whether they failed.
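
A rough sketch of what that might look like, assuming a pytest suite (the
operator table and helper name are made up):

    import pathlib
    import random
    import subprocess

    MUTATIONS = {" + ": " - ", " < ": " <= ", " == ": " != "}

    def mutate_and_test(source_file):
        """Flip one random operator, run the tests, report surviving mutants."""
        path = pathlib.Path(source_file)
        original = path.read_text()
        candidates = [op for op in MUTATIONS if op in original]
        if not candidates:
            return
        op = random.choice(candidates)
        path.write_text(original.replace(op, MUTATIONS[op], 1))
        try:
            result = subprocess.run(["pytest", "-q"], capture_output=True)
            if result.returncode == 0:
                print("mutant survived: %r -> %r" % (op, MUTATIONS[op]))
        finally:
            path.write_text(original)  # always restore the pristine source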

~~~
herge
That's mutation testing: an automated process comments out (or otherwise
alters) each line of your code one by one and runs your unit tests to see
which lines are actually tested.

~~~
DeepDuh
I wonder how well this would work for those Intuit guys with their
such-and-such millions of LOC ;). "Wait, what Big-O complexity does this
code have again, you say?"

------
coob
Very interesting. However, this won't catch bugs that don't cause instance
outages. I suppose the next step is to somehow simulate
misconfigurations/overloaded services/unexpected errors (I have no idea how
this would work).

~~~
Zombieball
Netflix actually has a whole Simian Army
(<http://techblog.netflix.com/2011/07/netflix-simian-army.html>) which, in
addition to Chaos Monkey, includes Latency Monkey, Chaos Gorilla,
Conformity Monkey, etc.
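
In the spirit of Latency Monkey, a toy latency injector is only a decorator
away (names and numbers here are illustrative, not Netflix's
implementation):

    import functools
    import random
    import time

    def inject_latency(probability=0.05, max_delay_seconds=2.0):
        """Randomly delay calls to simulate a slow or overloaded downstream."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if random.random() < probability:
                    time.sleep(random.uniform(0.0, max_delay_seconds))
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @inject_latency(probability=0.5)
    def fetch_recommendations(user_id):
        return ["title-%d" % n for n in range(3)]  # pretend service call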

------
wanderr
This is certainly a great idea in theory, but I wonder if it has caused
them to build things to be a little too fault tolerant, assuming there are
faults when things are merely slow. On Xbox, my recently-watched list fails
to appear about 30% of the time, even though it appears just fine on the
site. The only explanation I can think of is that it takes just a little
too long to return, so they assume failure and move on.
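
That failure mode is easy to reproduce: with an aggressive timeout plus a
fallback, a slow-but-healthy backend is indistinguishable from a dead one.
A sketch (the names and the 500 ms budget are made up):

    import concurrent.futures

    def fetch_with_fallback(fetch, timeout_seconds=0.5, fallback=None):
        # After the deadline we stop waiting and serve the fallback,
        # even if the backend would have answered a moment later.
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fetch)
        try:
            return future.result(timeout=timeout_seconds)
        except concurrent.futures.TimeoutError:
            return fallback
        finally:
            pool.shutdown(wait=False)  # don't block on the slow call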

~~~
samstave
The reason this is critical for Netflix is that all the infrastructure is
provided as a service and is thus largely out of their control.

By creating faults in the Netflix service, which they can build around to
stay optimally available, they nearly eliminate the risk of being affected
by faults in AWS services.

Now, clearly this has not been 100% true yet, because some major outages
have affected Netflix - though those were cases where the entire service
failed, as opposed to single-point-of-failure faults, which Chaos Monkey is
really designed to accommodate.

(disclaimer: this is my outside perception, based purely on an interest in
how Netflix has built itself. I could be incorrect.)

~~~
jedberg
You are mostly correct. The infrastructure isn't _completely_ out of our
control, but it is indeed provided by a 3rd party.

By creating faults it means we know we can handle those types of outages.

As we experience new types of faults, we build tools and systems to make us
resilient to those types of faults and, where possible, to similar faults
we haven't experienced yet.

~~~
samstave
Jedberg - what is your role at Netflix specifically?

I have a question. I asked this of Adrian, but he was of the opinion that
Netflix would likely never accommodate it:

How can a hospital have group Netflix streaming accounts such that patients
could stream Netflix in their rooms?

Can you make a commercial account / support this?

What about the following grey-area method for handling this: the hospital
pays for a group of individual streaming accounts. Their patient
entertainment system can then check out an account for use by a patient.
The patient watches what they want and, when done, the system checks the
account back in.

Would this be amenable to the TOS you may have?

The reason this is important is that cable companies are gouging hospitals
for cable TV service to their patient rooms. Netflix is a far better path
for the future.

Thanks

~~~
jedberg
I work on site reliability.

As for all the rest, I'm unfortunately not in a position to answer any of
that. Adrian's probably right though. I suspect our content licenses would
preclude that type of account.

------
oonny
Jeff Atwood had a good piece on this:
[http://www.codinghorror.com/blog/2011/04/working-with-the-
ch...](http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-
monkey.html)

Overall it's brilliant from an infrastructure standpoint, but it's up to
the developers to make the code monkey-proof!

~~~
sabat
It's up to the developers _and_ you, the operations/devops people, to work
together to make it chaos-monkey-proof. No engineering team is an island.

------
joelcox
Webpulp TV recently did an interview with Jeremy Edberg from Netflix in
which they discussed Chaos Monkey, albeit briefly.

[http://webpulp.tv/episodes/how-netflix-one-of-the-largest-
ec...](http://webpulp.tv/episodes/how-netflix-one-of-the-largest-
ec2-customers-embraces-the-chaos)

------
jrydberg
Using the monkey, have you found that some design patterns hold up better
than others? For example, queues vs. RPC?

------
photorized
AWS _is_ the chaos monkey.

Their outages are becoming more and more random.

~~~
acdha
> Their outages are becoming more and more random

… or more loudly reported. Very few of the reported outages affect people who
follow Amazon's guidelines for reliability - even the last one had most of the
impact on cheapskates who didn't pay for redundancy.

------
Zenst
I still think playing KerPlunk with processes/systems, whilst good, is akin
to pissing on your server and then seeing how quickly you can repair it.

I personally like to run stuff on a CPU/network-crippled test setup, as
that can expose cracks that normal testing fails to highlight, in a way
that is indicative of what can actually happen under load spikes.
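
On Linux, tc/netem can do exactly that kind of crippling (needs root; the
device name and the numbers are assumptions for your own test box):

    import subprocess

    DEV = "eth0"  # adjust to your test box's interface

    def throttle(delay_ms=200, loss_pct=1):
        # Add artificial latency and packet loss to the interface.
        subprocess.run(["tc", "qdisc", "add", "dev", DEV, "root", "netem",
                        "delay", "%dms" % delay_ms,
                        "loss", "%d%%" % loss_pct], check=True)

    def restore():
        # Remove the netem discipline, returning the interface to normal.
        subprocess.run(["tc", "qdisc", "del", "dev", DEV, "root", "netem"],
                       check=True)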

~~~
mikeash
If your goal is for your server to be urine-proof, then peeing on it from time
to time to make sure would be a good idea.

~~~
Zenst
Sadly, my analogy failed most people's IQ. :facepalm:

~~~
mikeash
Yes, we're all too dumb to understand your brilliance. Couldn't possibly be
the fault of your analogy.

~~~
Zenst
No, that would be you.

The point I was making is that pulling out processes/servers/cables is all
fine (aka pissing on your server), but it is not an ideal way to test. It
is better, IMHO, to test on what you could call weak hardware - really
throttled CPU/disk/network, etc. - and see where things break, as that
emulates the potential bottlenecks you will hit at peak. This approach is
better because you get to test a single-process system, as opposed to only
being able to test cloud-based approaches, as this does.

~~~
vidarh
Pulling out processes/servers/cables _is_ an ideal way to test what happens
when you pull out processes/servers/cables. Especially when you want to see
how it affects a system that is too large to realistically reproduce in a test
rig (though presumably you would still want to run it in your scaled down test
rig too).

Throttling CPU/disk/network is all well and fine, and necessary too, but
that's only one set of failure scenarios; it won't protect you against a
whole range of failures related to systems just disappearing as they crash,
failing to restart, or any number of other concerns.

