
Netflix Chaos Monkey Upgraded - dustinmoris
http://techblog.netflix.com/2016/10/netflix-chaos-monkey-upgraded.html
======
yalooze
I wonder what the reasoning was for having version 2 only terminate instances
(vs burning up CPU, taking disks offline, etc.)? I assume it's something to do
with what Chaos Monkey is NOT trying to solve (i.e. eating up CPU is caught
elsewhere by another system and out of scope for Chaos Monkey now). Just
trying to think it through...

~~~
jsingleton
I would assume that terminating is easy via the AWS API, whereas some of the
other things need a process on the instance. You shouldn't really be
connecting to boxes directly over SSH if you do DevOps correctly, so maybe
they blocked port 22 to enforce this.
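
To illustrate why termination is the easy case, here is a minimal sketch. It
assumes a hypothetical `terminate` callable standing in for the cloud API call
(with boto3 that would be roughly `ec2.terminate_instances(InstanceIds=[...])`).
The helper names `pick_victim` and `unleash` are made up for this example and
are not Chaos Monkey's actual code.

```python
import random
from typing import Callable, Optional, Sequence


def pick_victim(instance_ids: Sequence[str], rng: random.Random) -> Optional[str]:
    """Choose one instance ID at random, or None if the group is empty."""
    if not instance_ids:
        return None
    return rng.choice(list(instance_ids))


def unleash(
    instance_ids: Sequence[str],
    terminate: Callable[[str], None],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Terminate one random instance through the injected API callable.

    `terminate` is a stand-in (assumption) for the real cloud API call,
    so nothing has to run on the instance itself.
    """
    rng = rng or random.Random()
    victim = pick_victim(instance_ids, rng)
    if victim is not None:
        terminate(victim)
    return victim
```

Everything here happens from outside the box. Burning CPU or taking a disk
offline would instead need an agent or a shell on the instance.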

~~~
benhoyt
What do you mean by "do devops correctly" to avoid SSH on boxes? (I'm a
developer, not devops.)

~~~
devonkim
"Devops" has bazillions of meanings, but avoiding (human) ssh to production
boxes is a generally sound principle these days because our infrastructures
are becoming harder to understand by poking at boxes one or two at a time now
even for forensic analysis.

~~~
dberg
So logging in to a server to check a logfile (assuming I don't or can't do
centralized logging) is considered anti-devops?

Edit: Sorry responded to wrong parent, sigh.

~~~
saryant
At this scale, you basically have to have centralized logging. When you have
thousands of parallel instances of a single application, searching logs box-
by-box just isn't practical.

Consider also that if you're elastically scaling EC2 instances and you need
logs off an instance that's since been terminated, too late! That disk is
gone. So again, you need a central log service.

~~~
vacri
Parsing logs isn't the only way to troubleshoot.

~~~
flukus
It's not even the best way, just sometimes it's the only way.

------
iagooar
I posted this piece of news to my team Slack at work, and a colleague of mine
wrote: "we don't need chaos monkey, we have developers for that".

While funny, it also holds a lot of truth. I guess that Netflix can hire
really top-notch devs who do not accidentally cause downtime in their
software.

~~~
jeremiep
The difference is scale. Only a handful of companies run at Netflix scale.

I wouldn't trust developers to do what Chaos Monkey does at such a scale, no
matter how good you think they are.

------
e1g
Another useful tool is
[https://github.com/gaia-adm/pumba](https://github.com/gaia-adm/pumba): like
Chaos Monkey, but just for Docker containers. The coolest part for us was
emulating networking problems between containers (packet loss, unavailability,
etc.).

~~~
brunoqc
What do you use for high availability with Docker?

~~~
e1g
HA containers for us means smart orchestration tools. We did not want to lock
ourselves into Docker-only infrastructure (even now rkt is a very compelling
alternative), and wanted an orchestrator/scheduler that is focused entirely on
that job. Outside of Swarm, Mesos & co appeared too intrusive, and Nomad is
quite narrow in what it does. So we picked Kubernetes and are very happy with
it.

------
tomcart
First, a shameless plug for an alternative implementation:
[https://github.com/BBC/chaos-lambda](https://github.com/BBC/chaos-lambda)

It seems unfortunate that it requires coupling with Spinnaker, although I
can see how it helps with the cluster definition features.

Edit: I'll add that we've been using the original chaos monkey and chaos
lambda extensively in production for some time with very few problems.

------
andrewguenther
Interesting that all of the resource burning features have been removed. I
wish they had expanded on the reasons why. I always found those to be the most
differentiating features of Chaos Monkey. Did they just not get a lot of use
internally at Netflix?

~~~
aaronblohowiak
Resource exhaustion manifests as latency or failure. We inject latency and
failure using FIT, so we can limit the "blast radius". When you are testing
these failure modes, you are more testing the interaction between micro
services, and this requires a bit more precision and sophistication.

Source: I'm on the Chaos team here at Netflix.
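
A rough sketch of the idea of scoping injected latency and failure to tagged
requests, so the blast radius stays small. This is not FIT's real API: the
names `RequestContext`, `FaultSpec`, and `call_with_faults` are all assumptions
made up for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


class InjectedFailure(Exception):
    """Raised when a fault is deliberately injected at a call point."""


@dataclass
class FaultSpec:
    delay_s: float = 0.0  # latency to add before the call
    fail: bool = False    # whether to fail the call outright


@dataclass
class RequestContext:
    # Only requests carrying a fault entry for a given injection point are
    # affected; every other request passes through untouched.
    faults: Dict[str, FaultSpec] = field(default_factory=dict)


def call_with_faults(
    ctx: RequestContext,
    point: str,
    fn: Callable[..., Any],
    *args: Any,
    sleep: Callable[[float], None] = time.sleep,
    **kwargs: Any,
) -> Any:
    """Call fn, first applying any fault this request carries for `point`."""
    spec = ctx.faults.get(point)
    if spec is not None:
        if spec.delay_s > 0:
            sleep(spec.delay_s)
        if spec.fail:
            raise InjectedFailure(point)
    return fn(*args, **kwargs)
```

Because the fault travels with the request rather than living on the machine,
a test can target one dependency for one slice of traffic instead of degrading
a whole instance.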

~~~
btown
For reference, FIT: [http://techblog.netflix.com/2014/10/fit-failure-
injection-te...](http://techblog.netflix.com/2014/10/fit-failure-injection-
testing.html)

------
moondev
This is awesome. Since Chaos Monkey now leverages Spinnaker, you can run it
against any Clouddriver provider. Looking forward to trying this out with
Kubernetes. I believe Spinnaker treats namespaces as regions. Eventually it
would be cool to simulate masters or other entire federated clusters going
down to test Kubernetes scheduling resilience.

------
Beltiras
Every time this is in the news I get a feeling of awe for operations teams
having confidence enough to deploy this. I've usually been in small teams with
the feature mill factor turned up way too high.

~~~
aaronblohowiak
If you don't use a tool like this, entropy will take care of taking your
machines down for you. Only then, it won't be a regularly rehearsed part of
"normal operations", so you might find yourself up a creek without a paddle.

~~~
tormeh
Yeah, sure, but then it won't be the fault of whoever (me?) thought using
Chaos Monkey in production was a good idea.

Is it good for the organization? Yes. Good for the guy pushing it? Very
possibly very not.

------
cpeterso
Beyond external termination services like Chaos Monkey, what are good examples
of software that purposely increase internal nondeterminism or failure
injection in production?

Go has a race detector mode, but it is an optional debug feature with a
performance cost. The Linux kernel's jiffy clock starts counting from -5
minutes so drivers must handle clock rollover correctly: it happens shortly
after every boot instead of being an uncommon "once every 48 days" event.
Firefox has a chaos debug mode that does things like randomize thread
priorities and simulate short socket reads, but that also has performance
costs.
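
To make the jiffies case concrete, here is a small sketch of wrap-safe tick
comparison. It assumes a 32-bit counter and a tick rate of HZ = 1000, and it
emulates in Python the signed-difference trick behind the kernel's
`time_after()` macro. It is an illustration, not kernel code.

```python
MASK32 = 0xFFFFFFFF
HZ = 1000  # assumed tick rate (ticks per second)

# Counter starts 5 minutes before wrapping, as an unsigned 32-bit value.
INITIAL_JIFFIES = (-300 * HZ) & MASK32


def add_ticks(now: int, ticks: int) -> int:
    """Advance a 32-bit tick counter, wrapping on overflow."""
    return (now + ticks) & MASK32


def time_after(a: int, b: int) -> bool:
    """True if tick value a is later than b, even across a wrap.

    Equivalent to checking that the signed 32-bit difference (b - a) is
    negative; valid while the two values are less than 2**31 ticks apart.
    """
    return ((b - a) & MASK32) >= 0x80000000
```

A deadline set 400 seconds after boot wraps past zero, so a naive
`deadline > now` check gives the wrong answer while `time_after(deadline, now)`
does not. Starting the counter near the wrap point surfaces that bug within
minutes.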

~~~
dsl
Go intentionally introduces randomness when reading maps so developers don't
write code dependent on order.
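
The same idea can be sketched in Python, whose dicts (unlike Go maps) keep
insertion order. The helper `chaotic_items` below is hypothetical: it shuffles
the items on every call, so any caller that silently depends on order breaks
quickly in tests.

```python
import random
from typing import Any, List, Mapping, Optional, Tuple


def chaotic_items(
    mapping: Mapping[str, Any], rng: Optional[random.Random] = None
) -> List[Tuple[str, Any]]:
    """Return the mapping's items in a deliberately random order."""
    rng = rng or random.Random()
    items = list(mapping.items())
    rng.shuffle(items)
    return items


def render(mapping: Mapping[str, Any], rng: Optional[random.Random] = None) -> str:
    """Order-independent consumer: sorts by key before formatting."""
    pairs = sorted(chaotic_items(mapping, rng), key=lambda kv: kv[0])
    return ",".join(f"{k}={v}" for k, v in pairs)
```

Code that needs a stable order has to ask for one explicitly, here by
sorting on the key.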

~~~
cpeterso
Ironically, some Go developers began depending on the randomized map ordering
and were surprised when it changed! :)

 _runtime: hashmap iterator start position not random enough #8688_

[https://github.com/golang/go/issues/8688](https://github.com/golang/go/issues/8688)

------
jsingleton
Has anyone else deployed a Chaos Monkey in _production_?

I can imagine it would be a tough sell to the CEO. :)

~~~
klapinat0r
How so? The benefits are worth it, and I doubt any CEO will argue against
having fault tolerant code :)

You catch bugs, and no one says you can't run Chaos Monkey in staging or a
similar environment if it really is a tough sell.

~~~
birdman3131
The drawbacks of potentially causing downtime and therefore having the
potential to drive away customers as well as obtain an image of unreliability
can be much more damaging than not using it in the first place. Customer image
means quite a bit.

------
rlau26
"Chaos Monkey even periodically terminates itself."

Whoa, that's meta.

~~~
brianwawok
Just wait till it starts killing developers to test your bus factor.

------
stgnet
Inspired by Chaos Monkey, I introduced a malloc chaos mechanism into our
codebase:
[https://reviewboard.asterisk.org/r/4463/](https://reviewboard.asterisk.org/r/4463/)

Although designed originally to catch places where malloc failure wasn't being
handled, it can also be used to randomly trigger other off-nominal portions of
the code that might not otherwise be tested.
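
For readers who have not seen the technique, a minimal sketch of random
allocation-failure injection follows. It is written in Python for brevity and
is not the Asterisk patch linked above: `ChaosAllocator` and `copy_payload` are
made-up names, and returning `None` stands in for `malloc` returning NULL.

```python
import random
from typing import Optional


class ChaosAllocator:
    """Allocator wrapper that fails a configurable fraction of requests."""

    def __init__(self, failure_rate: float, seed: Optional[int] = None) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)
        self.injected = 0  # number of failures injected so far

    def alloc(self, size: int) -> Optional[bytearray]:
        """Return a zeroed buffer, or None to simulate allocation failure."""
        if self._rng.random() < self.failure_rate:
            self.injected += 1
            return None
        return bytearray(size)


def copy_payload(allocator: ChaosAllocator, payload: bytes) -> Optional[bytes]:
    """Example caller that must handle the off-nominal (failed) path."""
    buf = allocator.alloc(len(payload))
    if buf is None:
        return None  # error path that rarely runs without injection
    buf[:] = payload
    return bytes(buf)
```

Running a test suite with a small non-zero failure rate exercises error
paths that would otherwise almost never execute.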

------
caf
I can see that Chaos Monkey adds selective pressure to ensure that systems
evolve into a state where they can handle unexpected server outages.

But isn't there a danger that it also encourages maladaptations that come to
rely on being regularly restarted by the Chaos Monkey? I'm particularly
thinking that you might evolve a lot of resource leaks that go unnoticed so
long as Chaos Monkey is on the job.

------
kpwagner
Holy fuck that is some small font size. And the paragraphs aren't in paragraph
tags... just hanging out between <div>'s with some <br>'s to keep them
company.

------
CorvusCrypto
Following in line with the jokes, it's like they looked at my company's
production environment and said "Let's make that a tool!"

------
ticklemyelmo
It looks like it's working too well. The site is unreachable for me.

~~~
Thaxll
Every time that URL comes up people try to access it from https but the site
is only available from http... Fix your Firefox, it's clearly at fault here.

~~~
scrollaway
"Every time"?

If one random guy's complaining about a URL being unreachable and you're
already seeing a pattern... is it at all possible the users aren't at fault?

~~~
corobo
To be fair it does happen frequently enough. I wouldn't say _every_ time but
often on Netflix posts. A bit of a sampling for you:

[https://news.ycombinator.com/item?id=12269411](https://news.ycombinator.com/item?id=12269411)

[https://news.ycombinator.com/item?id=12217900](https://news.ycombinator.com/item?id=12217900)

[https://news.ycombinator.com/item?id=12038367](https://news.ycombinator.com/item?id=12038367)

[https://news.ycombinator.com/item?id=11771714](https://news.ycombinator.com/item?id=11771714)

