Hacker News
Chaos Monkey released into the wild (netflix.com)
235 points by timf on July 30, 2012 | 36 comments

This reminds me of something I read the other day¹: the idea that complex fault-tolerant systems tend to end up running with faults as a matter of course (sheer probability). This elevates that notion to another level: get rid of the idea of operating without faults, and instead artificially maintain a low level of faultiness, so that the system's resilience to faults is always being exercised.

¹ http://www.johndcook.com/blog/2012/07/13/fault-tolerant-syst...

Reminds me of biology. It's always intrigued me that we often want to make computer systems as "good" as humans, and yet our acceptance of human faults is far higher than what we accept from computers. Any small part of a body may have some problems, but the body as a whole continues to function well. I wonder if this kind of design will become a trend?

We tolerate fewer faults from our computers because the underlying system is not complex enough to prevent those faults from affecting us in large ways. As you say, a small failing in the human body doesn't render the entire organism inert and useless; no, we can adapt our behavior during healing and move on. When a small fault occurs in your computer, it tends to have disastrous effects.

We will tolerate faults in our computer systems when those faults don't cost us our homework, our jobs, or our lives. And complexity is the only way to make that happen.

I, for one, welcome our new chaos-wielding overlord!

Having the code behind the chaos monkey is not nearly as valuable as having the guts to run it in the first place.

If this were reddit, I'd link to that meme about doing testing in production.

Since this is HN, I'll instead say that if you don't have the guts, you're only lulling yourself into a false sense of security. AWS or not, your systems will fail.

By running the monkey, at least you can make it fail on a schedule that is convenient to you, instead of happening when you're drunk on a Saturday night (or whatever your vice of choice might be).

At my work (a large corporate office), we have random node outages. It's not quite as in-depth as Chaos Monkey, but it serves the same purpose: just pull the plug on the server. More than once, a random node outage has caught a novice developer making static links to nodes through the load balancer. We also have random pen tests designed to DoS or otherwise disable services around the network. Controlled destruction of your infrastructure is the quickest way to highlight any faults.
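
The scheduled, random fault injection described above can be sketched in a few lines. This is purely illustrative: the instance names are hypothetical, and a real tool would pull its inventory from the cloud provider's API and issue a real termination call instead of printing.

```python
import random

# Hypothetical inventory; in practice this would come from your cloud
# provider's API, not a hardcoded list.
instances = ["web-1", "web-2", "web-3", "cache-1", "worker-1"]

def pick_victim(pool, rng=random):
    """Choose one running instance to kill at random."""
    return rng.choice(pool)

def terminate(instance_id):
    """Stand-in for a real termination call; here we just report it."""
    print(f"terminating {instance_id}")
    return instance_id

if __name__ == "__main__":
    # Run this on a schedule that is convenient to you (e.g. business hours),
    # so failures happen when engineers are around to watch them.
    terminate(pick_victim(instances))
```

The point isn't the code, which is trivial; it's having the guts to point it at production.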

But remember: what's the difference between hacking and pentesting? Permission.

Without chaos monkey, instances will not randomly die anywhere near as frequently as with. (Otherwise, why run it at all?)

Increasing the amount of failures in order to increase the percentage of nice (not 3am) failures takes guts.

yes, it's like knowing a mad axeman and then letting said person go wild in your server room.

This is so cool, and I'm wondering if I can do something similar. It reminds me of "bug seeding" where you purposely insert bugs into your product and count how many are found through testing. (Of course, you track them so you can take them out later.)

That's a really good idea. Do you know if there are any tools out there for doing this? For something basic, you could make it change random variables or numbers in the code base, run the tests after each change, and report if they failed.

That's mutation testing (not fuzz testing): an automated process alters your code one change at a time, e.g. commenting out a line, and runs your unit tests to see which changes are actually caught.
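
A toy version of that loop might look like this. Everything here is illustrative: the "mutation" just flips the first `+` to a `-`, and the "test suite" is a single inline check.

```python
# Toy mutation tester: mutate a function's source, re-exec it, and see
# whether the test suite notices. All names are illustrative.

SOURCE = """
def add(a, b):
    return a + b
"""

def run_tests(namespace):
    """Stand-in test suite: True means all tests pass."""
    return namespace["add"](2, 3) == 5

def mutate(source):
    """Produce one mutant: replace the first '+' with '-'."""
    return source.replace("+", "-", 1)

def check(source):
    """Exec the (possibly mutated) source and run the tests against it."""
    ns = {}
    exec(source, ns)
    return run_tests(ns)

if __name__ == "__main__":
    assert check(SOURCE)  # the original code passes its tests
    if check(mutate(SOURCE)):
        print("mutant survived: your tests missed the change")
    else:
        print("mutant killed: your tests caught the change")
```

Real tools generate many mutants (swapped operators, flipped conditionals, deleted statements) and report the survival rate; a surviving mutant is a gap in your test coverage.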

I wonder how well this would work for those Intuit guys with such-and-such millions of LOC ;). "Wait, what Big-O complexity does this code have again, you say?"

There's Heckle for Ruby


Unfortunately it is only compatible with Ruby 1.8.

Very interesting. However, this won't catch bugs that don't cause instance outages. I suppose the next step is to somehow simulate misconfigurations/overloaded services/unexpected errors (I have no idea how this would work).

Netflix actually has a whole Simian Army (http://techblog.netflix.com/2011/07/netflix-simian-army.html) which, in addition to Chaos Monkey, includes Latency Monkey, Chaos Gorilla, Conformity Monkey, etc.

We actually have those too. Right now in fact we're doing a series of latency simulations.

This is certainly a great idea in theory, but I wonder if it has caused them to build things to be a little too fault tolerant, by assuming there are faults when maybe things are just slow. On Xbox, my recently watched fails to appear about 30% of the time even though it appears just fine on the site. The only thing I can think is that it must be taking just a little bit too long to return, so they assume failure and move on.
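
A sketch of the failure mode being speculated about: a client that treats anything slower than its timeout as an outage and falls back to hiding the row, even though the backend is healthy and would eventually return. All names and the timeout value are hypothetical; a real client would cancel the in-flight request rather than wait and then discard the result.

```python
import time

TIMEOUT_S = 0.2  # aggressive timeout: anything slower counts as a failure

def fetch_recently_watched(backend, timeout=TIMEOUT_S):
    """Treat a slow response exactly like an outage and fall back."""
    start = time.monotonic()
    result = backend()  # a real client would cancel at the deadline
    if time.monotonic() - start > timeout:
        return []       # fallback: show nothing rather than keep waiting
    return result

def slow_backend():
    time.sleep(0.3)     # healthy, just slow: still gets treated as failed
    return ["The Wire", "Breaking Bad"]

def fast_backend():
    return ["The Wire", "Breaking Bad"]
```

With this design, a backend that is merely slow looks identical to one that is down, which would explain a "recently watched" row that intermittently vanishes on one device while the site works fine.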

The reason why this is critical for Netflix is that all the infrastructure is provided as a service and is thus completely out of their control.

By creating faults in the Netflix service which they can build around to stay optimally available, they completely or nearly completely eliminate the risk of being affected by faults in AWS services.

Now, clearly this has not been 100% true yet, because some major outages have affected Netflix - though that's when the entire service fails, as opposed to single-point-of-failure types of faults, which Chaos Monkey is really designed to accommodate.

(disclaimer: this is my outside perception based on purely interest in how Netflix has built themselves. I could be incorrect)

You are mostly correct. The infrastructure isn't completely out of our control, but it is indeed provided by a 3rd party.

By creating faults it means we know we can handle those types of outages.

As we experience new types of faults, we build tools and systems to make us resilient to those types of faults and, where possible, to similar faults we haven't yet experienced.

Jedberg - what is your role at Netflix specifically?

I have a question, I asked this of Adrian, but he was of the opinion that Netflix would likely never accommodate:

How can a hospital have group Netflix streaming accounts such that users in their patient rooms could view/stream netflix to their rooms?

Can you make a commercial account / support this?

What about the following grey method for handling this: the hospital pays for a group of individual streaming accounts. Their patient entertainment system can then check out an account for use by the patient. The patient watches what they want; when done, the system checks the account back in.
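
The check-out/check-in scheme described above is essentially a resource pool. A minimal sketch (class and account names hypothetical, and ignoring concurrency, which a real system would need to handle):

```python
class AccountPool:
    """Pool of shared streaming accounts, handed out one patient at a time."""

    def __init__(self, accounts):
        self._free = list(accounts)
        self._in_use = set()

    def check_out(self):
        """Hand an idle account to a patient, or None if all are busy."""
        if not self._free:
            return None
        account = self._free.pop()
        self._in_use.add(account)
        return account

    def check_in(self, account):
        """Return the account to the pool when the patient is done."""
        self._in_use.discard(account)
        self._free.append(account)
```

Whether this would pass the terms of service is, of course, exactly the question being asked.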

Would this be amenable to the TOS you may have?

The reason this is important is that cable companies are gouging hospitals for providing cable TV service to their patient rooms. Netflix is a far better path for the future.


I work on site reliability.

As for all the rest, I'm unfortunately not in a position to answer any of that. Adrian's probably right though. I suspect our content licenses would preclude that type of account.

FYI - we have a Latency Monkey as well

Jeff Atwood had a good piece on this: http://www.codinghorror.com/blog/2011/04/working-with-the-ch...

Overall it's brilliant from an infrastructure standpoint, but it's up to the developers to make your code monkey-proof!

It's up to the developers and you, the operations/devops people, to work together to make it chaos-monkey-proof. No engineering team is an island.

Webpulp TV recently did an interview with Jeremy Edberg from Netflix in which they discussed Chaos Monkey, albeit briefly.


Using the monkey, have you found that some design patterns are better than others? For example, queues vs RPC?

AWS is the chaos monkey.

Their outages are becoming more and more random.

> Their outages are becoming more and more random

… or more loudly reported. Very few of the reported outages affect people who follow Amazon's guidelines for reliability - even the last one had most of the impact on cheapskates who didn't pay for redundancy.

Still think playing KerPlunk with processes/systems, whilst good, is akin to pissing on your server and then seeing how quickly you can repair it.

I personally like to run stuff on a CPU/network-crippled test setup, as that can expose cracks normal testing fails to highlight, in a way that is indicative of what can actually happen under some load spikes.

If your goal is for your server to be urine-proof, then peeing on it from time to time to make sure would be a good idea.

sadly my analogy failed most people's IQ :facepalm:

Yes, we're all too dumb to understand your brilliance. Couldn't possibly be the fault of your analogy.

no that would be you.

The point I was making is that pulling out processes/servers/cables is all fine (aka pissing on your server), but it is not an ideal way to test. It is better, IMHO, to test on what you could call weak hardware, as in really throttled CPU/disc/network etc., and see where things break, as these emulate the potential peak bottlenecks you will get. This approach is better because you get to test a single-process system, as opposed to only being able to test cloud-based approaches, as this does.

Pulling out processes/servers/cables is an ideal way to test what happens when you pull out processes/servers/cables. Especially when you want to see how it affects a system that is too large to realistically reproduce in a test rig (though presumably you would still want to run it in your scaled down test rig too).

Throttling CPU/disc/network is all well and fine and necessary too, but that's only one set of failure scenarios, and won't protect you against a whole range of failures related to systems just disappearing as they crash, or failing to restart, or any number of other concerns.

Chaos monkey isn't for testing peak bottlenecks, it's for testing random failures at the per instance level. For bottleneck testing I expect Netflix has another tool with a similarly simian name, but that's not the topic of discussion.
