
Introducing Chaos Engineering - tweakz
http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html
======
herge
Man, I really have the impression that my video-on-demand service has a better
understanding of ensuring availability and risk-management than either my bank
or any government service.

Could you imagine, say, a utility company creating a title called 'Chaos
Engineer'?

~~~
Kronopath
Can you imagine how difficult that would be to justify in a bank or government
service? It would be very hard to convince someone in a position of power at
an institution like that that _deliberately introducing random failures and
problems_ into your production services would be a good idea. Even if it is a
good idea, it would be hard to make that argument.

If Netflix's system fails, the worst thing that could happen is you don't get
to watch _Orange is the New Black_. If your bank's system fails, you may lose
payments, fail to pay your bills on time, or worse.

~~~
Spooky23
I work for a large institution. The thought process doesn't work that way --
the suggestion of introducing chaos would be a no-go not because of business
risk, but the risk of disrupting the layers of bullshit that people build
between them and whatever the actual business is.

Netflix's cost of failure is very high -- customers get real-time discovery of
your failures, and the service costs less than $10/mo, so they aren't deeply
invested in it and will drop you like a hot potato.

~~~
pgt
When people are only paying $10/month, failures are cheap. I imagine low-cost
customers tend to be more forgiving. I could be wrong on this.

~~~
Spooky23
The difference in many government and some banking situations is that you
don't have much recourse. If the DMV goes down, you still need plates. If it's
a really epic failure, there may be political fallout; otherwise, it's just
meeting low expectations.

Changing bank accounts is very difficult for many people, which is one of the
reasons that many banks are able to treat customers with something close to
contempt.

------
kperry
Netflix puts out some great articles about architecture in the cloud:
auto-scaling, Chaos Monkey, and how they handle 'steal-time.' Does anyone know of
any other company that publishes so much about cloud architecture? This is
great stuff!

~~~
frozenport
Most companies their size would use their own servers instead of the cloud.

~~~
kooshball
Using AWS does not magically give you an HA infrastructure when you have a
complicated service-oriented architecture like Netflix. All the stuff
mentioned here is still relevant even if they're running their own DC.

------
diminoten
I wonder if Netflix has ever come out with some kind of, "So you want to get
into chaos engineering, eh?" kind of article that explains the basics and some
pitfalls/things to look out for.

~~~
q_
I gave a talk at PagerDuty about how this has been done for the last few
years:
[https://blog.pagerduty.com/2014/03/injecting-failure-at-netf...](https://blog.pagerduty.com/2014/03/injecting-failure-at-netflix-staying-reliable-for-40-million-customers/)

~~~
kperry
Nice! Thanks for sharing man!

------
dirtyaura
I assume that other big internet companies also practice chaos engineering
under a different name, but having such a name for the job is awesome. It
highlights the difference from traditional stress testing. Names have
surprising power. Growth Hacker was a slightly annoying but very effective
title trend, and it helped communicate how the approach differs from
traditional marketing efforts. I think Chaos Engineer has the same potential.

~~~
numlocked
You're probably right re: other companies doing similar things. Certainly
Google does something similar via their "disaster days":
[http://queue.acm.org/detail.cfm?id=2371516](http://queue.acm.org/detail.cfm?id=2371516)

------
chton
It's great to see Netflix taking disaster recovery and chaos mitigation
seriously. Learning to work with constant failure is one of the biggest
challenges for anyone working with distributed systems at scale, and concepts
like the Chaos Monkey help enormously. I hope other companies follow suit, and
soon.
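
For anyone who hasn't seen the idea in practice, here is a rough sketch of what
"working with constant failure" can look like. This is not Netflix's actual
Chaos Monkey code; the `Instance` class, the function names, and the
`min_live` threshold are all made up for illustration. It kills one randomly
chosen instance in a simulated fleet and then checks whether enough instances
are left to keep serving.

```python
import random


class Instance:
    """Hypothetical stand-in for one server in a service group."""

    def __init__(self, name):
        self.name = name
        self.alive = True


def terminate_random_instance(instances, rng):
    """Mark one randomly chosen live instance as dead.

    Returns the terminated instance, or None if nothing was alive.
    """
    live = [i for i in instances if i.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def service_is_healthy(instances, min_live):
    """Assumed health rule: at least `min_live` instances must be alive."""
    return sum(1 for i in instances if i.alive) >= min_live


# Simulated three-node fleet; a seeded RNG keeps the run repeatable.
rng = random.Random(42)
fleet = [Instance(f"web-{n}") for n in range(3)]

victim = terminate_random_instance(fleet, rng)
healthy = service_is_healthy(fleet, min_live=2)
print(f"terminated {victim.name}; still healthy: {healthy}")
```

In a real setup the "terminate" step would go through whatever your
infrastructure provides, and the health check would be your actual monitoring.
The point is only that the failure is injected on purpose and the system is
expected to survive it.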

------
kylekampy
How does one go about becoming a chaos engineer? I imagine up to this point it
is a field one falls into accidentally and gains experience over time. I can
imagine it becoming a topic taught at a college level in the near future.

~~~
herge
Not really, you probably start with an entry-level sysadmin/devops position,
and work your way from firefighting incident to firefighting incident.

You just need an appreciation that any down-time in a system is a symptom of
larger problems, and the will to identify (and reproduce as chaos!) those
problems.

------
tomwilde
Someone calling themselves 'the chaos commander' who uses multi-regional
active-active jargon is looking for chaos engineers.

Nope.

------
ellysetaylor21
Lol at the monkey picture, full Rambo style :p

