

The Netflix Simian Army - abraham
http://techblog.netflix.com/2011/07/netflix-simian-army.html

======
wynand
If I had to give a name to this approach, I'd call it "adversarial debugging".

It's an excellent technique for software improvement, as it somewhat mirrors
an evolutionary game. You have your prey (your software) and a predator (the
chaos-like monkeys). When the predator is successful, you improve your prey
until the predator is under control (with non-chaos monkeys). Then you improve
your predator (chaos-like monkeys). This cycle of prey/predator improvement
can be repeated as long as needed.

As saurik points out, this has the potential to lead to cascading failures.
But this is true of any complex system that has multiple levels of self-repair
- repair systems in biological systems can also work against the host
organism.

I'd love to see commonly used software hardened in this way - Apache for
example. Imagine a contest where the aim is to find creative ways to bring
down sandboxed Apache servers (executed on the machine of the contest
participant). You (the contestant) come up with an Apache killer and submit it
to the contest website and get points based on how much damage your code can
do. This gives the Apache developers an idea of where to harden Apache.

The obvious danger with such a system is that it's a treasure trove of DoS
attacks against existing Apache installations. But the argument in favor is
that some black-hats might already have similar code anyway and they won't be
publishing their code. Also, the code is a good test harness that can be used
to verify that major architectural changes (such as what would be needed to
integrate Google's SPDY into Apache) don't make Apache vulnerable to previous
attacks. And of course, other similar software (Nginx et al.) can also benefit
from some of these test cases.

------
saurik
While Chaos Monkey sounded like an interesting way to force people to be
prepared for failure, Doctor Monkey just seems downright dangerous: removing
an unhealthy instance from service may itself cause other instances to become
unhealthy, and thereby get removed from service as well... Chaos Monkey
might cause such behavior, but it will be random and transient, whereas with
Doctor Monkey you would imagine a sudden health-collapse leading to all
instances being terminated by our new friend, the good doctor.
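
The feedback loop is easy to see with a toy model. A rough sketch (every
number here is invented, nothing Netflix-specific): a fixed total load is
spread over whatever instances remain in service, and an instance fails its
health check once its share passes a limit.

    # Toy cascade: removing one sick instance pushes the survivors over
    # their own health threshold, so the good doctor keeps culling.
    TOTAL_LOAD = 1000        # req/s, hypothetical
    PER_INSTANCE_LIMIT = 90  # above this an instance reads as "unhealthy"

    instances = 12           # 1000/12 ~ 83 req/s each: everyone healthy
    instances -= 1           # Doctor Monkey removes one genuinely sick instance

    while instances > 0 and TOTAL_LOAD / instances > PER_INSTANCE_LIMIT:
        print(f"{instances} in service, {TOTAL_LOAD / instances:.0f} req/s each -> unhealthy")
        instances -= 1

    print(f"{instances} instances left")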

~~~
ChuckMcM
Plague Monkey :-)

You make a good point, but in operations there are two states, 'degraded'
and 'healthy', and basically the further into degraded mode you go, the more
likely it is that the next failure takes you completely off the air. So if I
were deploying something like their automated 'treat unhealthy' it would be
simply:

    while (1) {
        if (unit_unhealthy && state == HEALTHY) {
            remove_unhealthy_unit();   /* repair only while nominally healthy */
            state = DEGRADED;
        }
    }

Thus constraining the automated repairs from taking you further into a
degraded state, but opportunistically fixing things when you were nominally
healthy. By managing the hysteresis gap between fully healthy and degraded as
a percentage of overprovisioning you can support automated repairs with
minimal risk of them being the source of future downtime.
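
Spelled out a bit (a sketch only; the capacity numbers are invented, and this
is just one way to read the overprovisioning idea): automated repair is
allowed only while removing one more machine still leaves you at or above
nominal capacity, i.e. while you are spending spares rather than eating into
what the service actually needs.

    NOMINAL = 100                # machines the service needs, hypothetical
    OVERPROVISION = 0.10         # 10% spares, also hypothetical
    INSTALLED = int(NOMINAL * (1 + OVERPROVISION))   # 110

    def may_remove_unhealthy(in_service: int) -> bool:
        # repair opportunistically only while still nominally healthy
        return in_service - 1 >= NOMINAL

    for n in (INSTALLED, 105, 101, NOMINAL):
        print(n, may_remove_unhealthy(n))   # True, True, True, False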

~~~
moe
_Thus constraining the automated repairs from taking you further into a
degraded state, but opportunistically fixing things when you were nominally
healthy. By managing the hysteresis gap between fully healthy and degraded as
a percentage of overprovisioning you can support automated repairs with
minimal risk of them being the source of future downtime._

That reads almost like the marketing blurb on a can of "magic enterprise pixie
dust". ;-)

 _the further into degraded mode you go, the more likely it is that the next
failure takes you completely off the air_

This kind of measurement really only works for rather simplistic systems where
all inter-dependencies, bottlenecks and failure-modes are well understood,
e.g. a disk array, a cluster of streaming servers, or a network cable.

The nasty stuff for complex applications is in unexpected/non-linear
bottlenecks (RDBMS-monkey?), intermittent failures (packet-loss-monkey?),
cascading failures (monkey tag-team?) and... human error (beer monkey, bug
monkey?).

~~~
ChuckMcM
"That reads almost like the marketing blurb on a can of "magic enterprise
pixie dust". ;-)"

Lol. So one of the things that Web 2.0 can teach is where to layer. It's
something we do aggressively at Blekko, and if you read the OpenCompute stuff
that Facebook has done, or the papers on AWS or S3, the fundamental theory is
all very similar.

Basically you have a 'system' which is nominally some CPUs, some disk, some
memory, maybe some Flash, and a network connection. You create a pretty flat
layer 2 'fog' out of them. You name them algorithmically so the ops guys can
find a machine when it's broken, you assign everything fixed IP addresses by
where it sits in the rack, and for some level of 'nominal' you run all of the
base set of software on each 'droplet' in this fog.

Now you go a bit further and you separate things by which ethernet switch they
are on, which powerline circuit, maybe even which 'phase' of the incoming
power, etc. And you let all your droplets count modulo the number of failure
domains, so if it's power/net/cooling that would be three failure domains 0, 1,
and 2.
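
Roughly like this (the R<rack>-S<shelf>-N<port> name format and the
power/net/cooling domains just follow the examples in this comment; none of
it is an actual naming convention):

    FAILURE_DOMAINS = ["power", "net", "cooling"]

    def droplet_name(rack: int, shelf: int, port: int) -> str:
        # algorithmic name from physical position, so ops can find the box
        return f"R{rack}-S{shelf}-N{port}"

    def failure_domain(droplet_index: int) -> str:
        # droplets "count modulo" the number of failure domains
        return FAILURE_DOMAINS[droplet_index % len(FAILURE_DOMAINS)]

    print(droplet_name(3, 4, 18), failure_domain(82))   # R3-S4-N18 net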

Now as a scheduling exercise, let's say your application uses a soft storage
layer to replicate by 3 or by N depending on your IO/sec requirement, and you
need Y amount of compute power and Z amount of RAM to hold the working set so
that you can deliver the right latency. You figure out how many 'drops' (you
might call them machines) you think that is, and you install that many plus
say 10% more. To put real numbers on it, let's say you have 100 drops
(machines) and 10 spares. You spread those droplets through your three failure
domains, roughly 37 per.

Now when all 100 machines are healthy, you run along just fine, everyone
humming. Plague monkey decides that machine #82 is 'unhealthy' (maybe its
disk is spewing errors, maybe its network keeps dropping pings); the key is
that it's kinda there and kinda not. For Map/Reduce or Hadoop type systems
perhaps it is consistently the straggler in the reduce loop. Plague monkey
comes along, shoots #82 in the head, sends an email to Ops that says machine
R3-S4-N18 (rack 3, shelf 4, switch-port 18) is sick, it's been shot in the
head, and spare machine R5-S1-N2 was tagged as being 'it'.

Now your system still has 100 machines running the application, but it's down
to 9 spares and there is one machine on the sick list.

Now, you've designed your system to be resilient to failures, because you are
using cheap whitebox servers which you know are going to fail randomly, so you
know that you can deliver against your SLA with up to 15 machines down (85 out
of a nominal 100 running).

Operations can pick a 'badness' number, say '20', which is to say once 20
machines are identified as needing to be 'fixed' the margin for error is down
to 5 machines (of your 110, 20 are dead, so 90 are still kicking and all part
of the 'active' pool). At that point even an unhealthy machine that is still
providing some service is better than dropping below 85, which would take you
from degraded (remember at 90 you are off 10 machines from your nominal 100)
to dead.
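
As a gate that policy is only a few lines (using the numbers from this
example; the SLA floor and the badness cap are whatever ops picks):

    INSTALLED = 110
    SLA_FLOOR = 85       # minimum machines needed to meet the SLA
    BADNESS_CAP = 20     # ops-chosen limit on machines tagged for repair

    def plague_monkey_may_shoot(down: int) -> bool:
        # stop automated head-shots once the repair backlog eats the margin;
        # a half-broken machine still serving beats dropping below the floor
        still_up = INSTALLED - down
        return down < BADNESS_CAP and still_up - 1 >= SLA_FLOOR

    print(plague_monkey_may_shoot(1))    # True  -- plenty of margin left
    print(plague_monkey_may_shoot(20))   # False -- leave the sick ones running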

So in my shop at least, long before you get to 20 machines out of 110 dead
you're installing replacement machines. But if you were having a particularly
bad long weekend (and the pagers had all mysteriously failed) then you can
make things 'less worse' by not shooting any more machines in the head.

"The nasty stuff for complex applications is in unexpected/non-linear
bottlenecks (RDBMS-monkey?), intermittent failures (packet-loss-monkey?),
cascading failures (monkey tag-team?) and... human error (beer monkey, bug
monkey?)."

There is an architectural invariant which avoids these, which is that you
don't build systems which can't tolerate the sudden loss of the machine they
are running on. It's something Google does masterfully and Oracle does not at
all (AFAICT; it's been a while since I was near an Oracle implementation),
which is why I think of it as the Web 2.0 way of doing things.

The AWS and S3 papers allude to some of these strategies as well, and of
course Netflix has pretty much gone 'all in', as they say in the poker
tournaments. And no, not much of this stuff is 'off the shelf' (yet? who
knows), but from an operational investment standpoint it's nice to know that
you can take a number of hits before you stumble. It really makes the
economics work out much better.

It's not a free lunch, however: you really are running those 10% of spare
machines and not generating any revenue from them. So when utilization
determines profitability you start balancing running on the edge vs. not
having any on-call ops guys around the data center during the weekend. There
are ways to mitigate that pain: you can use those machines for development or
test, throwing them into the 'production' cluster on an as-needed basis, but
at the end of the day if you just have to run them for resiliency you need to
compare that cost to the cost of engineering more resilient equipment. But
hey, if it was easy anyone could do it :-)

~~~
moe
_unexpected/non-linear bottlenecks, intermittent failures, cascading failures_

 _There is an architectural invariant which avoids these, which is that you
don't build systems which can't tolerate the sudden loss of the machine they
are running on._

Actually that is not the invariant you were looking for.

I was pointing specifically at issues that have little to do with machine
failure but are frequent root causes of _system failure_.

------
dman
I wish they also had a monkey which smacked devs who tinker with a perfectly
usable UI and replace it with something less functional. Discoverability of
new, interesting content seems much lower in the new UI in my usage. No sort
by stars, I don't understand what they're sorting by by default, no stars
visible by default; scanning a list of star ratings is much easier on the eyes
and much faster than looking at a grid of pictures and actively deciding if I
want to watch this... If someone from Netflix is reading this - please provide
a classic mode until you bring the new UI up to speed.

~~~
brown9-2
I don't think it's fair to blame developers for UI redesigns like this
("tinker with a perfectly usable UI" sounds like some developer decided to
change it because he/she was bored). I am sure that Netflix has product
managers/UI designers, like everywhere else, who originate these types of
changes.

------
commanda
I guess this begs the question - was it one of their monkeys that caused their
outage on Sunday night?

~~~
dkarl
I suspect they got hacked. Their site always responded, but every time I
logged in I was logged out within seconds, and the only way I could log in
again was to reset my password, after which I'd be logged out again within
seconds. Repeat five times over a couple of hours before I gave up.

~~~
elq
I know you are wrong.

We've hired lots of folks from Yahoo to work on infrastructure and one guy
from Reddit to be a cloud SRE. But we've yet to hire a security guy from Sony.

