
Principles of Chaos Engineering - F_J_H
http://principlesofchaos.org/
======
antoncohen
There is no mention of Netflix on the site, but the term Chaos Engineering,
and the popularization of the technique, seem to come from Netflix. The Chaos
Monkey README even links to this site.

[https://github.com/Netflix/chaosmonkey](https://github.com/Netflix/chaosmonkey)

[https://medium.com/netflix-techblog/chaos-engineering-upgraded-878d341f15fa](https://medium.com/netflix-techblog/chaos-engineering-upgraded-878d341f15fa)

[http://www.oreilly.com/webops-perf/free/chaos-engineering.csp](http://www.oreilly.com/webops-perf/free/chaos-engineering.csp)

[https://en.wikipedia.org/wiki/Chaos_Monkey](https://en.wikipedia.org/wiki/Chaos_Monkey)

~~~
rapnie
I think you are indeed correct in stating that. Chaos Monkey, cool naming :)

------
nbraga
A related Ask HN (from 2016) that I submitted:
[Ask HN: Who uses Chaos Monkey in production?](https://news.ycombinator.com/item?id=12749597).
A couple of interesting responses there.

------
taneq
This sounds like the kind of thing that would be very powerful if built in
from the start, but an absolute nightmare to try and retrofit to an existing
system.

It reminds me of the Amazon "everything must be a network service"
architecture. Painful to institute but with great payoffs later in terms of
robustness and scalability.

~~~
empath75
We started our current project running Chaos Monkey from almost day one, and a
few weeks ago, someone fucked up a tag enforcement script on our dev AWS
account and it started shutting down every single instance, while all of our
ASGs kept restarting them. So basically every single service, from Consul to
Kafka to our microservices, was down to a few or sometimes even no instances.

Because we had already been dealing with Chaos Monkey, everything recovered by
itself within a few minutes of the script stopping, and we barely had any
alerts from any services going down.
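
As a toy illustration of that kill-and-replace loop (not how Chaos Monkey or
AWS Auto Scaling are actually implemented), here is a minimal Python sketch.
`InstanceGroup`, `terminate_random` and `reconcile` are hypothetical names
made up for this example: one method removes a random instance, the other
launches replacements until the desired count is met again.

```python
import random


class InstanceGroup:
    """Toy stand-in for an auto scaling group (hypothetical, in-memory only)."""

    def __init__(self, desired: int):
        self.desired = desired
        self.instances = set(range(desired))  # instance ids currently running
        self._next_id = desired

    def terminate_random(self, rng: random.Random):
        """Remove one running instance at random; return its id, or None if empty."""
        if not self.instances:
            return None
        victim = rng.choice(sorted(self.instances))
        self.instances.remove(victim)
        return victim

    def reconcile(self) -> int:
        """Launch new instances until the desired count is reached; return how many."""
        launched = 0
        while len(self.instances) < self.desired:
            self.instances.add(self._next_id)
            self._next_id += 1
            launched += 1
        return launched


if __name__ == "__main__":
    rng = random.Random()
    group = InstanceGroup(desired=3)
    for _ in range(10):
        group.terminate_random(rng)  # the "chaos" step
        group.reconcile()            # the "ASG" step
    print(len(group.instances), "instances running")
```

The services running on those instances still have to tolerate the churn;
the sketch only shows the replacement side.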

Also, when AWS was pushing out the Meltdown patches, we just ignored all the
notifications because we didn’t care about instances being shut down, because
they go down randomly all the time.

It’s a pain in the ass at first to deal with, but it definitely forces you to
fix a lot of potential problems unless you want to spend all your time
manually fixing stuff every day.

Even just aside from Chaos Monkey, we build new base AMIs and push out rolling
updates _daily_. That’ll also quickly suss out problems with not having
version numbers locked, etc.

------
xg15
Very considerate that they mentioned negative impact for users in the last
point. I wonder though how well that is measured and controlled in real-world
applications of this design style.

~~~
kostarelo
I too wonder if any of these experiments actually affected real customers.

~~~
zzzcpan
Not affecting real customers is kind of the point of controlled experiments.

~~~
drdrey
You can very much impact real customers, but the idea is that you should be
able to only apply the experiment to a small subset of customers and abort it
in case something goes wrong.

For instance, you may want to check that the authorization service can
function without access to the A/B testing service, so you cut the network
connection between them. If authorization errors start rising, you stop the
experiment and investigate why this happened. You may also find that clients
did not retry properly on unexpected errors from the authorization service
(e.g. HTTP 500 error codes).
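
A rough sketch of that "small subset plus abort" idea in Python, under stated
assumptions: `in_experiment` and `ErrorRateGuard` are hypothetical names, the
hash-based bucketing is just one common way to pick a stable subset, and the
5-point margin and 100-request minimum are arbitrary illustrative thresholds,
not values from any real tool.

```python
import hashlib


def in_experiment(customer_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent` % of customers in the experiment."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100


class ErrorRateGuard:
    """Tracks error rates per group and says when the experiment should stop."""

    def __init__(self, margin: float = 0.05, min_requests: int = 100):
        self.margin = margin              # allowed excess over the control group
        self.min_requests = min_requests  # don't judge on too little data
        self.counts = {"control": [0, 0], "experiment": [0, 0]}  # [requests, errors]

    def record(self, group: str, is_error: bool) -> None:
        self.counts[group][0] += 1
        if is_error:
            self.counts[group][1] += 1

    def error_rate(self, group: str) -> float:
        requests, errors = self.counts[group]
        return errors / requests if requests else 0.0

    def should_abort(self) -> bool:
        """True once the experiment group errors clearly more than control."""
        if self.counts["experiment"][0] < self.min_requests:
            return False
        excess = self.error_rate("experiment") - self.error_rate("control")
        return excess > self.margin
```

In use, each request would be tagged with its group via `in_experiment`, its
outcome passed to `record`, and the fault injection (here, the cut network
connection) reverted as soon as `should_abort()` returns true.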

------
kostarelo

      improper fallback settings when a service is unavailable; 
      retry storms from improperly tuned timeouts; outages when a 
      downstream dependency receives too much traffic; cascading 
      failures when a single point of failure crashes;
    

I don't see how one couldn't replicate those in an environment other than
production. They all involve bringing down complete services and sending
unexpectedly high load to other services.
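
For reference on the "retry storms" item in the quote, one common mitigation
is capped exponential backoff with random jitter, so that clients do not all
retry at the same moment. The Python sketch below is only an illustration:
`TransientError`, `backoff_delays` and `call_with_retries` are hypothetical
names, and the base delay, cap and attempt count are arbitrary.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. an HTTP 500)."""


def backoff_delays(base=0.1, cap=5.0, attempts=4, rng=random):
    """Yield one delay per retry, uniform in [0, min(cap, base * 2**n)]."""
    for n in range(attempts):
        yield rng.uniform(0.0, min(cap, base * 2 ** n))


def call_with_retries(func, attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random):
    """Call func(); on TransientError wait a jittered delay and try again.

    Gives up and re-raises after `attempts` calls in total.
    """
    delays = backoff_delays(base, cap, attempts - 1, rng)
    for _ in range(attempts):
        try:
            return func()
        except TransientError:
            delay = next(delays, None)
            if delay is None:  # no retries left
                raise
            sleep(delay)
```

The `sleep` and `rng` parameters exist only so the behaviour can be exercised
without real waiting or real randomness.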

~~~
drdrey
You can start finding and fixing issues in non-prod environments, but
eventually you have to run this in prod if you want to find the good stuff.
Nothing is going to replicate the particular configuration, load, mix of
requests and client behavior that you have in prod.

------
vvdcect
So did this come about based off Chaos Theory? Then the main goal of chaos
engineering would be to "create order out of chaos"?

------
projectileboy
I’m not affiliated with Netflix, but the negative tone of some of the comments
here is hilarious. Would the critics care to share how _their_ traffic/uptime
ratio compares to Netflix’s? On weekends, Netflix is responsible for about 50%
of the packets on the Internet. On Earth. Netflix is down how often? But
please, do share how you once replicated a MySQL database from prod to a
staging environment.

------
marknadal
Chaos Monkey and Kyle Kingsbury's work on the Jensen tests inspired us to make
this ([https://github.com/gundb/panic-server](https://github.com/gundb/panic-server)),
which lets us simulate all sorts of failure cases. It should be
extremely useful to others if they enjoy stuff like Chaos Monkey, etc.!

~~~
beagle3
^Jensen^Jepsen

Kyle Kingsbury == Aphyr

[https://jepsen.io/](https://jepsen.io/)

~~~
marknadal
Sorry, thanks, my phone autocorrected!

------
voidmain
It says something about the engineering maturity of the industry that
destructive testing in production environments is among the _more_
sophisticated practices.

~~~
darkerside
Seems analogous to load testing a physical structure. At first, that probably
seemed unnecessarily risky, but in retrospect, you want to build robustly
enough that even fairly excessive/destructive loads have no real impact.

~~~
tormeh
Yeah, when hardware failures are expected, simulating hardware failures is no
different from simulating trucks driving over a bridge.

------
philipov
I don't like the name. It sounds too much like it's the Engineering of
Chaotic Systems, but there is no reference to Chaos Theory:
[https://en.wikipedia.org/wiki/Chaos_theory](https://en.wikipedia.org/wiki/Chaos_theory)

Are they using the formal definition of chaos, or just a colloquialism?

~~~
jimnotgym
I'm not sure I can agree.

> Are they using the formal definition of chaos, or just a colloquialism?

Is it a colloquialism to use a word in its normal English sense? Chaos as a
word precedes Chaos Theory by some thousands of years. If I write a theory and
take over an existing word in everyday use, it seems a bit much to accuse
every one else of colloquialism when they use an existing but less strict
definition?

~~~
letlambda
> If I write a theory and take over an existing word in everyday use, it seems
> a bit much to accuse every one else of colloquialism when they use an
> existing but less strict definition?

It does, but engineering is a technical profession and its practitioners are
likely familiar with the mathematical concept.

I've read about applications of chaos theory in system design, and I expected
'Principles of Chaos Engineering' to be about that topic.

~~~
philipov
The problem I have with the name is that it feels like their audience is
people not familiar with the mathematical concept. In other words, not
engineers; in other words, non-technical managers. And that marks it as the
latest marketing fluff being pitched by management consultants.

~~~
mcbits
People familiar with the mathematical concept will almost always be talking
about something like "nonlinear dynamics" instead, anyway.

~~~
philipov
Funny you say that. I had to restrain myself from using that term.
Technically, chaotic systems are only a subset of nonlinear systems. For
example, solitons are nonlinear, but they are not chaotic.

