Hacker News
The Netflix Simian Army (netflix.com)
105 points by abraham on July 19, 2011 | hide | past | favorite | 17 comments



If I had to give a name to this approach, I'd call it "adversarial debugging".

It's an excellent technique for software improvement, as it somewhat mirrors an evolutionary game. You have your prey (your software) and a predator (the chaos-like monkeys). When the predator is successful, you improve your prey until the predator is under control (with non-chaos monkeys). Then you improve your predator (chaos-like monkeys). This cycle of prey/predator improvement can be repeated as long as needed.

As saurik points out, this has the potential to lead to cascading failures. But that is true of any complex system with multiple levels of self-repair: repair mechanisms in biological organisms can likewise turn against the host.

I'd love to see commonly used software hardened in this way - Apache for example. Imagine a contest where the aim is to find creative ways to bring down sandboxed Apache servers (executed on the machine of the contest participant). You (the contestant) come up with an Apache killer and submit it to the contest website and get points based on how much damage your code can do. This gives the Apache developers an idea of where to harden Apache.

The obvious danger with such a system is that it's a treasure trove of DoS attacks against existing Apache installations. But the argument in favor is that some black-hats probably have similar code already, and they won't be publishing theirs. Also, the code makes a good test harness for verifying that major architectural changes (such as what would be needed to integrate Google's SPDY into Apache) don't make Apache vulnerable to previous attacks. And of course, other similar software (Nginx et al.) can also benefit from some of these test cases.


While Chaos Monkey sounded like an interesting way to force people to be prepared for failure, Doctor Monkey just seems downright dangerous: removing an unhealthy instance from service may correlate to causing other instances to become unhealthy, thereby removing them from service as well... Chaos Monkey might cause such behavior, but it will be random and transient, whereas with Doctor Monkey you would imagine a sudden health-collapse leading to all instances being terminated by our new friend, the good doctor.


Plague Monkey :-)

You make a good point, but in operations there are two states, 'degraded' and 'healthy', and basically the further into degraded mode you go, the more likely the next failure will take you completely off the air. So if I were deploying something like their automated 'treat unhealthy' it would simply be:

while (1) {
  if (unit_unhealthy() && state == HEALTHY) {
    remove_unhealthy();
    state = DEGRADED;
  }
}

Thus constraining the automated repairs from taking you further into a degraded state but opportunistically fixing things when you were nominally healthy. By managing the hysteresis gap between fully healthy and degraded as a percentage of overprovisioning, you can support automated repairs with minimal risk of them being the source of future downtime.


That's a very good point. Unfortunately it's not as simple as having discrete states like 'healthy' and 'degraded'. For example, a tier might be deployed to an auto-scaling group (ASG) running 100 instances. One of them is failing the health check. Common sense dictates you can take action (kill the offending instance). What's the overall state of the ASG after this? Are you allowed to kill another instance that's having issues? How many after that? In general you want to put some rules into place (percentage of the ASG that's safe to kill, latency boundaries for the service, time to wait before taking the next action, etc.).
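Purely as illustration (none of this is Netflix's actual tooling, and every name and threshold below is made up), the rules listed above might be combined into a single guard-rail check that has to pass before any instance gets killed:

```python
import time

# Hypothetical guard-rail check before terminating an unhealthy instance.
MAX_KILL_FRACTION = 0.05    # never terminate more than 5% of the ASG
MAX_P99_LATENCY_MS = 250    # don't act while the service is already slow
COOLDOWN_SECONDS = 600      # minimum wait between successive terminations

def safe_to_kill(asg_size, killed_recently, p99_latency_ms, last_action_ts, now=None):
    """Return True only if every rule permits terminating one more instance."""
    now = time.time() if now is None else now
    if (killed_recently + 1) / asg_size > MAX_KILL_FRACTION:
        return False              # would exceed the safe-to-kill percentage
    if p99_latency_ms > MAX_P99_LATENCY_MS:
        return False              # service is degraded; don't make it worse
    if now - last_action_ts < COOLDOWN_SECONDS:
        return False              # still in the cooldown window
    return True
```

The point is that the decision is about the group, not the single sick instance: each rule can independently veto the kill.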


Thus constraining the automated repairs from taking you further into a degraded state but opportunistically fixing things when you were nominally healthy. By managing the hysteresis gap between fully healthy and degraded as a percentage of overprovisioning, you can support automated repairs with minimal risk of them being the source of future downtime.

That reads almost like the marketing blurb on a can of "magic enterprise pixie dust". ;-)

the further into degraded mode you go, the more likely the next failure can take you completely off the air

This kind of measurement really only works for rather simplistic systems where all inter-dependencies, bottlenecks and failure modes are well understood, e.g. a disk array, a cluster of streaming servers, or a network cable.

The nasty stuff for complex applications is in unexpected/non-linear bottlenecks (RDBMS-monkey?), intermittent failures (packet-loss-monkey?), cascading failures (monkey tag-team?) and... human error (beer monkey, bug monkey?).


"That reads almost like the marketing blurb on a can of "magic enterprise pixie dust". ;-)"

Lol. So one of the things that Web 2.0 can teach is where to layer. It's something we do aggressively at Blekko, and if you read the OpenCompute stuff that Facebook has done, or the papers on AWS or S3, the fundamental theory is all very similar.

Basically you have a 'system' which is nominally some CPUs, some disk, some memory, maybe some flash, and a network connection. You create a pretty flat layer-2 'fog' out of them. You name the machines algorithmically so the ops guys can find one when it's broken, you assign everything fixed IP addresses by where it sits in the rack, and for some level of 'nominal' you run all of the base set of software on each 'droplet' in this fog.

Now you go a bit further and you separate things by which ethernet switch they are on, which power-line circuit, maybe even which 'phase' of the incoming power, etc. And you let all your droplets count modulo the number of failure domains, so if it's power/net/cooling, that would be three failure domains: 0, 1, and 2.

Now, as a scheduling exercise, let's say your application uses a soft storage layer to replicate by 3 or by N depending on your IO/sec requirement, and you need Y amount of compute power and Z amount of RAM to hold the working set so that you can deliver the right latency. You figure out how many 'drops' (you might call them machines) you think that is, and you install that many plus, say, 10% more. To put real numbers on it, let's say you have 100 drops (machines) and 10 spares. You spread those droplets through your three failure domains, roughly 37 per.
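A quick sketch of the modulo counting described above, using the numbers from this example (3 failure domains, 100 drops plus 10 spares); the function name is just for illustration:

```python
NUM_DOMAINS = 3  # e.g. power / net / cooling

def assign_domains(num_drops):
    """Assign each drop to a failure domain by counting modulo the domain count."""
    return [drop % NUM_DOMAINS for drop in range(num_drops)]

domains = assign_domains(110)  # 100 nominal drops + 10 spares
per_domain = [domains.count(d) for d in range((NUM_DOMAINS))]
print(per_domain)  # [37, 37, 36], i.e. roughly 37 per domain
```

Losing any one domain then costs you at most 37 of the 110 machines, which is what makes the spare math below work.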

Now when all 100 machines are healthy, you run along just fine, everyone humming. Plague monkey decides that machine #82 is 'unhealthy' (maybe its disk is spewing errors, maybe its network keeps dropping pings); the key is that it's kinda there and kinda not. For Map/Reduce or Hadoop type systems perhaps it is consistently the straggler in the reduce loop. Plague monkey comes along, shoots #82 in the head, and sends an email to ops that says machine R3-S4-N18 (rack 3, shelf 4, switch-port 18) is sick, it's been shot in the head, and spare machine R5-S1-N2 was tagged as being 'it'.

Now your system still has 100 machines running the application, but it's down to 9 spares and there is one machine on the sick list.

Now, you've designed your system to be resilient to failures (and you have, because you are using cheap whitebox servers which you know are going to fail randomly), so you know that you can deliver against your SLA with up to 15 machines down (85 out of a nominal 100 running).

Operations can pick a 'badness' number, say 20, which is to say once 20 machines are identified as needing to be 'fixed', the margin for error is down to 5 machines (of your 110, 20 are dead, so 90 are still kicking and all part of the 'active' pool). At that point even an unhealthy machine that is still providing some service is better than dropping below 85, which would take you from degraded (remember, at 90 you are off 10 machines) to dead.
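The 'badness' gate is simple enough to write down. This is an illustrative sketch using the numbers from this comment (110 installed machines, an SLA floor of 85, a badness cap of 20); the function and constant names are made up:

```python
INSTALLED = 110    # 100 nominal drops + 10 spares
SLA_FLOOR = 85     # minimum running machines needed to meet the SLA
BADNESS_CAP = 20   # ops-chosen limit on machines taken out of service

def may_shoot(currently_dead):
    """Only terminate another sick machine if the margin stays safe."""
    still_running = INSTALLED - currently_dead
    return (currently_dead + 1 <= BADNESS_CAP
            and still_running - 1 >= SLA_FLOOR)

print(may_shoot(10))   # True: 99 machines would remain, well above 85
print(may_shoot(20))   # False: the badness cap is already reached
```

Once the gate closes, a half-dead machine stays in service, because some service beats none.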

So in my shop at least, long before you get to 20 machines out of 110 dead, you're installing replacement machines. But if you were having a particularly bad long weekend (and the pagers had all mysteriously failed) then you can make things 'less worse' by not shooting any more machines in the head.

"The nasty stuff for complex applications is in unexpected/non-linear bottlenecks (RDBMS-monkey?), intermittent failures (packet-loss-monkey?), cascading failures (monkey tag-team?) and... human error (beer monkey, bug monkey?)."

There is an architectural invariant which avoids these, which is that you don't build systems which can't tolerate the sudden loss of the machine they are running on. It's something Google does masterfully and Oracle does not at all (AFAICT; it's been a while since I was near an Oracle implementation), which is why I think of it as the Web 2.0 way of doing things.

The AWS and S3 papers allude to some of these strategies as well, and of course Netflix has pretty much gone 'all in', as they say in the poker tournaments. And no, not much of this stuff is 'off the shelf' (yet? who knows). But from an operational investment standpoint it's nice to know that you can take a number of hits before you stumble. It really makes the economics work out much better.

It's not a free lunch, however: you really are running those 10% of spare machines and not generating any revenue from them. So when utilization determines profitability, you start balancing running on the edge against not having any on-call ops guys around the data center during the weekend. There are ways to mitigate that pain; you can use those machines for development or test, throwing them into the 'production' cluster on an as-needed basis. But at the end of the day, if you just have to run them for resiliency, you need to compare that cost to the cost of engineering more resilient equipment. But hey, if it was easy anyone could do it :-)


unexpected/non-linear bottlenecks, intermittent failures, cascading failures

There is an architectural invariant which avoids these, which is that you don't build systems which can't tolerate the sudden loss of the machine they are running on.

Actually that is not the invariant you were looking for.

I was pointing specifically at issues that have little to do with machine failure but are frequent root-causes for system failure.


"""Now if you've designed your system to be resilient to failures, and you have because you are using cheap whitebox servers which you know are going to fail randomly. You know that you can deliver against your service level SLA with up to 15 machines down (85 out of a nominal 100 running)."""

All of this is great information and certainly quite useful, but it is all strongly premised on that final "you know", which assumes a universe in which you are able to capacity plan pretty accurately ahead of time.

Imagine instead a world where it is entirely possible that, on the same day, and all over the United States, a new Harry Potter 7 trailer starts playing at the theaters, causing hundreds of thousands of families to suddenly decide that tonight is a great night to not only boot up Netflix, but to all watch the exact same movie.

(For the record, this sort of thing has totally happened to me a few times: suddenly you find out that jailbreaking the iPhone was mentioned on CNN, or a new tool is released from its secret development to that team's hundred-thousand-plus-follower Twitter feed, leading to a sudden multi-x spike in traffic.)

Now, not only do we suddenly have an unexpected spike in load, but that spike is all targeting the exact same data. It is in this kind of situation where, first off, if you really are sitting on racks at 85% capacity, you are probably screwed: no one is watching that movie tonight.

Luckily, we aren't talking about a world with a bunch of racks: we are talking about Amazon EC2, and we hopefully have an auto-scaling group set up to automatically increase the number of servers we have operating.

However, while that is spinning up, we may have a small set of nodes (and yes: these nodes may even be S3 nodes run by Amazon) that are /freaking out/ as they are the blessed 3-5 computers that are storing our mirrors of Harry Potter 6. 3-5 has, throughout the entire previous history of the service, been "safely more than adequate" to handle the load of any given movie (and even a random failure of one of the servers), but today it is running "a tad slow".

Now, even here, maybe we are set up to scale: maybe the system is designed to auto-scale the data as well, and is copying the movie to new servers as we contemplate this scenario (a situation that, unfortunately, puts a bit more load on these servers, although hopefully only marginally more).

However, as the service went from "safely over-provisioned" to "at capacity" within minutes (everyone sitting down at 7pm to watch a movie with their family as the game is over, or whatever), the new machines are still copying the data over while the old ones are now sufficiently slow that from raw number reports they look like "stragglers", with (due to the wonder of statistics) one of them randomly looking slightly worse than the others; maybe even sufficiently worse to hit some "unhealthy" threshold, and Doctor Monkey comes along and kills it.

This is the worst possible moment and the worst possible server to inflict that damage on. Even if Doctor Monkey marks the entire world as "degraded" at this point and refuses to do anything else until there is manual intervention, it has just drastically increased the load, possibly fatally (possibly so high that the copy operation spawning more of these servers grinds to a halt), on a small set of servers that was already "at capacity".

This is the specific kind of thing that causes cloud-collapse scenarios like the one that killed Amazon EBS a few months ago: due to a busy network, servers prematurely decided that their replica pairs were "offline", causing attempts to find new buddies, causing further network congestion, and leading to an even higher likelihood that servers would appear disconnected.

So yes: it may be possible to design a system that is able to tell "unhealthy" from "unpredictably mis-provisioned", but given how fine a line that is, I can totally understand why moe chose the terminology "magic enterprise pixie dust" to describe the solution that minded that particular gap.


Oh my gosh. If this isn't a prime example of the value of HN I don't know what is. I suspect this post will give me months of pointers on stuff to go learn (no experience in distributed computing). Thank you!


I wish they also had a monkey which smacked devs who tinker with a perfectly usable UI and replace it with something less functional. Discovering new interesting content seems much harder in the new UI in my usage: no sort by stars, I don't understand what the stars are, sorting by "default", no stars visible by default... scanning a list of stars is much easier on the eyes and much faster than looking at a grid of pictures and actively deciding if I want to watch each one. If someone from Netflix is reading this: please provide a classic mode until you bring the new UI up to speed.


I don't think it's fair to blame developers for UI redesigns like this ("tinker with a perfectly usable UI" sounds like some developer decided to change it because he/she was bored). I am sure that Netflix has product managers/UI designers like every place else who originate these types of changes.


That, and I'd like to be able to see what others rated rather than what they think I'll rate it.

The average rating is better info than their "best guess".


I guess this raises the question: was it one of their monkeys that caused their outage on Sunday night?


It wasn't one of the monkeys. Most monkeys only run when we have developers who could notice and fix problems. It also happens that our peak usage is while we're home and our quiet time is while we're at the office. That means, in general, the monkeys don't run on a Sunday night.

Unfortunately we're not running 100% on the cloud today. We're working on it, and we could use more help. The latest outage was caused by a component that still runs in our legacy infrastructure where we have no monkeys :)


I suspect they got hacked. Their site always responded, but every time I logged in I was logged out within seconds, and the only way I could log in again was to reset my password, after which I'd be logged out again within seconds, repeat five times over a couple of hours before I gave up.


I know you are wrong.

We've hired lots of folks from Yahoo to work on infrastructure and one guy from Reddit to be a cloud SRE, but we've yet to hire a security guy from Sony.


Downmodded into oblivion. How nice, for sharing my personal experience of a topic raised by someone else. I guess I should have known that a perfectly normal outage explained why I could log into the site, load pages successfully for thirty seconds, and then be logged out and be unable to log back in without resetting my password. Couldn't have been attackers DoSing Netflix users, or Netflix admins trying to squash an attack. The problem was just that their don't-constantly-log-users-out-and-deactivate-their-password service went down because it wasn't on the cloud, and I was dumb not to figure that out.



