
Principles of Chaos Engineering (2018) - archielc
http://principlesofchaos.org
======
KenanSulayman
We built a system called „friendly fire“ that nukes a server every 10 minutes.
It has changed the mindset of all engineers and made our infrastructure
missile-proof.

Funnily enough, it also improved our latencies a lot (which I guess is mostly
due to memory leaks and the like).
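
A "kill one server every N minutes" loop like that can be sketched in a few lines. This is a hypothetical sketch, not the actual "friendly fire" tool; `list_servers` and `terminate` are placeholders for whatever your orchestration API provides:

```python
import random
import time

def chaos_loop(list_servers, terminate, interval_s=600, rounds=None, rng=random):
    """Every `interval_s` seconds, pick one random live server and terminate it.

    `list_servers` returns current server IDs; `terminate` kills one of them.
    Both are placeholders for a real orchestration API. `rounds=None` runs
    forever; a finite value is handy for testing the loop itself.
    """
    n = 0
    while rounds is None or n < rounds:
        servers = list_servers()
        if servers:
            victim = rng.choice(servers)
            terminate(victim)
        time.sleep(interval_s)
        n += 1
```

The point is less the code than the schedule: because a kill is guaranteed every interval, engineers design for it instead of hoping it won't happen.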

~~~
jaggederest
We used to boot ~3x the servers we needed, run a hard load on them for a
while, performance test them, and kill the "weakest" 2/3rds. You can get a
bunch of nodes on iffy hardware or greedy neighbors (or all on the same
physical box) and see significant performance improvements that way.

Of course this was a decade ago, but I think the fundamentals are still sound,
as far as being skeptical about the quality and longevity of your nodes in a
virtual environment.
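
The overprovision-and-cull step described above (boot extra nodes, load-test, keep the best third) reduces to a ranking function. A minimal sketch, assuming `benchmark` returns a higher-is-better score such as requests/sec under hard load:

```python
def cull_weakest(nodes, benchmark, keep_fraction=1/3):
    """Benchmark every node and keep only the fastest fraction.

    `benchmark(node)` returns a higher-is-better score (e.g. requests/sec
    under a hard load). Returns (keep, kill) lists; with the default
    keep_fraction=1/3 this kills the "weakest" two thirds.
    """
    ranked = sorted(nodes, key=benchmark, reverse=True)
    n_keep = max(1, round(len(nodes) * keep_fraction))
    return ranked[:n_keep], ranked[n_keep:]
```

This weeds out nodes that landed on iffy hardware or next to greedy neighbors before they take production traffic.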

~~~
isbvhodnvemrwvn
At my previous job we did that for DB nodes on AWS. Still definitely a known
technique.

------
jinqueeny
The following link shows how we do Chaos Engineering in TiDB, an open source
distributed database:

[https://www.pingcap.com/blog/chaos-practice-in-tidb/](https://www.pingcap.com/blog/chaos-practice-in-tidb/)

Regarding the Fault Injection tools we are using:

\- Kernel Fault Injection, the fault injection framework included in the Linux
kernel, which can be used to implement simple fault injections to test device
drivers.

\- SystemTap, a scripting language and tool for diagnosing performance or
functional problems.

\- Fail: gofail for Go and fail-rs for Rust.

\- Namazu: a programmable fuzzy scheduler to test a distributed system.

We also built our own automatic chaos platform, Schrodinger, to automate all
these tests and improve both efficiency and coverage.

------
jtms
I have not used it, but I have heard this is a very useful tool
[https://github.com/Netflix/chaosmonkey](https://github.com/Netflix/chaosmonkey)

~~~
poooogles
Sure, at a very early stage the people running the test can just do this
manually. Don't be put off by having to set up the whole suite; a lot of the
value of chaos engineering can be had by randomly removing bits of
infrastructure by hand (a d6 and a lookup table work fine). The value comes
from what you learn when infrastructure gets terminated.
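
The "d6 and a lookup table" approach can be written down literally. The table entries below are hypothetical examples, not a recommended set; the mechanism is just a uniform roll over a fixed menu of manual experiments:

```python
import random

# Hypothetical lookup table: die face -> manual chaos action to perform.
GAMEDAY_TABLE = {
    1: "kill one app server",
    2: "kill the primary DB replica",
    3: "drop network between app and cache",
    4: "fill the disk on one node",
    5: "add latency to the DB connection",
    6: "do nothing (control run)",
}

def roll_experiment(rng=random):
    """Roll a d6 and return today's manual chaos experiment."""
    return GAMEDAY_TABLE[rng.randint(1, 6)]
```

Including a "do nothing" face is a cheap way to get a control run, so you can tell whether an incident was caused by the experiment or would have happened anyway.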

~~~
polskibus
What would be advised in the following situation if one wanted to follow chaos
engineering principles:

\- there's a service that needs config data from a DB on another node to
initialize itself and become useful. Should the service die if it can't reach
the DB on startup (so that the error propagates), or should it start and retry
indefinitely until the DB connection is established, returning an error code
to its consumers until then?
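
For what it's worth, the retry-until-ready option usually gets paired with exponential backoff so a down DB isn't hammered. A minimal sketch of that half of the trade-off; `fetch_config` is a hypothetical stand-in for the real DB call, and the injectable `sleep` is just for testability:

```python
import time

def load_config_with_retry(fetch_config, max_delay_s=30, attempts=None,
                           sleep=time.sleep):
    """Retry fetching startup config with exponential backoff.

    `fetch_config` is a placeholder for the real DB call and should raise on
    failure. While this loop runs, the service would answer its consumers
    with an error code (e.g. HTTP 503) rather than crash-looping.
    `attempts=None` retries forever.
    """
    delay = 1.0
    tried = 0
    while attempts is None or tried < attempts:
        try:
            return fetch_config()
        except Exception:
            sleep(min(delay, max_delay_s))
            delay *= 2
            tried += 1
    raise RuntimeError("config unavailable after retries")
```

The die-fast option is the same function with a small finite `attempts`, letting the process exit and the orchestrator's restart policy propagate the error.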

~~~
8note
I don't think deciding what to do during the failure case is part of chaos
engineering.

Identifying that the service needs to do something, and whether it actually
does it, is part of chaos engineering, e.g. by turning off the DB for a bit
and seeing what the service does.

------
azhenley
Other useful materials:

\- Chaos Monkey Guide for Engineers
[https://www.gremlin.com/chaos-monkey/](https://www.gremlin.com/chaos-monkey/)

\- Recent HN discussion on Resilience Engineering: Where do I start?
[https://news.ycombinator.com/item?id=19898645](https://news.ycombinator.com/item?id=19898645)

------
jorblumesea
If you've never run a chaos experiment, how do you square blast radius with
running in prod?

It seems like this setup works great if built in from the get-go, but
incredibly painful and possibly dangerous if you're starting with existing
applications.

~~~
andreareina
Set up a parallel deployment, run the experiments there. Document the failures
in as granular a way as possible, decree that future deployments aren't
allowed to add to the set of known failures. Assign e.g. 20% time to fixing
the known problems. When confidence passes a threshold, start running the
experiments in prod.

It's basically the strangler pattern[1]. It _is_ painful, but can be made
arbitrarily safe.

[1]
[https://news.ycombinator.com/item?id=19122973](https://news.ycombinator.com/item?id=19122973)
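
The "future deployments aren't allowed to add to the set of known failures" decree can be enforced mechanically in CI. A minimal sketch of such a gate, with hypothetical failure labels:

```python
def check_no_new_failures(known, observed):
    """Gate a deployment on chaos-experiment results.

    `known` is the documented set of failure labels already on the books;
    `observed` is what this run's experiments surfaced. Returns the set of
    newly introduced failures; an empty set means the deployment passes,
    even if some already-known failures reoccurred.
    """
    return set(observed) - set(known)
```

As 20%-time fixes land, entries get removed from `known`, ratcheting the system toward the confidence threshold where experiments can move to prod.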

------
dang
A thread from 2018:
[https://news.ycombinator.com/item?id=16244586](https://news.ycombinator.com/item?id=16244586)

------
agumonkey
I see no mention of AFL, which seems like a fitting tool for the topic.

Also the term 'antifragile' (lightly controversial) comes to mind.

