The Chaos Engineering Book (verica.io)
237 points by talonx 38 days ago | 32 comments

It's amazing to me that something I threw out in a meeting six years ago became an entire engineering culture.

It all started when we were talking about what to call the folks who were building things like Chaos Monkey/Gorilla/Kong and I said, "Let's call them Chaos Engineers, since they are engineering chaos". And so we adopted that title at Netflix, and now here we are.

I should also point out here how valueless what I did was -- I literally just came up with the name for what we were already doing inside Netflix. Everyone else actually wrote it down and spread it outside of Netflix. Their contributions to spreading the word are far more important than the name I came up with.

What's the difference between "Chaos Engineering" and fault injection, which has been around since roughly 1970?

Chaos Engineering is fault injection incorporated as a core aspect of a production system. So whereas in the 1970s fault injection was mostly used in the development and test phases, Chaos Engineering also includes a bunch of persistent services knocking out key aspects of your production system.

So basically you are writing tests that break things, and then keeping that code in the final compile?

You nailed it
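To make that concrete, here is a minimal sketch of "break-things code that ships with the service". It is an illustration only, not how Chaos Monkey or any Netflix tool works; `with_faults`, `InjectedFault`, and `call_with_fallback` are hypothetical names made up for this example.

```python
import random


class InjectedFault(Exception):
    """Raised by the injector in place of calling the real function."""


def with_faults(failure_rate, rng):
    """Wrap a function so that a fraction of calls fail on purpose.

    Hypothetical helper: failure_rate is a probability in [0, 1] and
    rng is a random.Random instance, passed in so runs can be seeded.
    """
    def decorate(fn):
        def wrapper(*args, **kwargs):
            # rng.random() returns a float in [0.0, 1.0).
            if rng.random() < failure_rate:
                raise InjectedFault(fn.__name__)
            return fn(*args, **kwargs)
        return wrapper
    return decorate


def call_with_fallback(fn, fallback, *args):
    """Caller-side resilience: return a default when the dependency fails."""
    try:
        return fn(*args)
    except InjectedFault:
        return fallback


rng = random.Random(42)


@with_faults(failure_rate=0.1, rng=rng)
def fetch_recommendations(user_id):
    # Stand-in for a call to a real downstream dependency.
    return ["title-%d" % user_id]


# Roughly one call in ten fails, and the caller degrades gracefully.
results = [call_with_fallback(fetch_recommendations, [], 7) for _ in range(20)]
```

The point of keeping the injector in the shipped code is that the fallback path gets exercised continuously instead of only during a real outage.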

Reminds me of these kinds of tables: https://www.google.com/search?q=broken+glass+table&rlz=1C5GC...

Wherein the table is made, then shattered, but keeps its functionality :)

Mostly better marketing... I'll bet "Chaos Engineers" get paid two or three times whatever the role that did "fault injection" got.

Fault injection is a technique. Chaos engineering is a discipline.

Yes; to expand on that a bit -- you can do fault injection in a chaos experiment. You can also do many other things within an experiment that would be difficult to classify as fault injection; for example, a sharp increase in the number of customers making requests to your service isn't really a fault. Chaos Engineering then wraps all of those techniques in a framework of experimentation and analysis. That's the discipline, which encourages you to proactively make improvements to Availability, and now Security as well.
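As a rough sketch of that experiment framing (check the steady state, perturb, observe, roll back), assuming a toy in-memory service: `Service`, `success_rate`, and `run_experiment` are hypothetical names for illustration and not taken from the book or from any chaos tool.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Toy service: replica name -> healthy flag."""
    replicas: dict = field(
        default_factory=lambda: {"a": True, "b": True, "c": True}
    )

    def handle(self, request_id):
        # A request succeeds as long as at least one replica is healthy.
        return any(self.replicas.values())


def success_rate(service, n=100):
    """Fraction of n requests that succeed."""
    return sum(service.handle(i) for i in range(n)) / n


def run_experiment(service, perturb, rollback, threshold=0.99):
    """Test the hypothesis 'success rate stays at or above threshold'."""
    baseline = success_rate(service)
    if baseline < threshold:
        # Do not perturb a system that is already unhealthy.
        return {"status": "aborted", "baseline": baseline}
    perturb(service)
    try:
        observed = success_rate(service)
    finally:
        rollback(service)  # always undo the perturbation
    status = "held" if observed >= threshold else "violated"
    return {"status": status, "baseline": baseline, "observed": observed}


svc = Service()
one_down = run_experiment(
    svc,
    perturb=lambda s: s.replicas.update(a=False),
    rollback=lambda s: s.replicas.update(a=True),
)
all_down = run_experiment(
    svc,
    perturb=lambda s: s.replicas.update(a=False, b=False, c=False),
    rollback=lambda s: s.replicas.update(a=True, b=True, c=True),
)
```

Here `perturb` is deliberately generic: it could kill a replica, as above, or do something that is not a fault at all, such as multiplying the request rate.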

Naming things is notoriously difficult and you nailed it. Bravo.

I clearly remember the Chaos Monkey posts and how they resonated with our team, as we were building an ambitious real-time project at the time. Thanks for the inspiration booster o/

At my old job, we were creating a new mobile app, and as a joke I suggested a name pretty close to grindr (it was a completely unrelated app). The CEO was oblivious to my sarcasm and decided it was a great name for the app, so that's what we ran with.

Thanks, Jeremy! Having a name for it made it a lot easier to write the principles: https://principlesofchaos.org/

And now we have a canonical book. ;-)

Thanks for the details, and the much appreciated humility.

Instead just deploy to AWS us-east-1 and the random failures will keep you on your toes.

AWS? You kids have it good. Try Oracle Cloud, let me show you... oh sorry, can't log in now

This made me laugh. I get shocked looks of disbelief and claims that AWS are basically gods when I tell people that we're occasionally losing packets in eu-west-1. The only thing we gained is having to argue with one vendor when that happens instead of two.

I wish AWS would just create a region where the failure rate is 10%. I bet there's demand for that.

Is it really that bad?

Not at all. I run production infrastructure in us-east-1. It has only gone down once due to an AWS issue in 5 years.

No, it really isn't.

I was hoping that this was going to teach me how to build an altar to Khorne.

I mean, you mostly just need blood and skulls, it doesn't need to be fancy - like any gift, it's the thought that counts. Excess bones and blood can be turned into glue [1] or filler material to increase the structural robustness.

[1] Glue made from animal bones is well known, but apparently you can also make glue from blood: https://www.warf.org/documents/technology-summary/P07418US.p...

^ this man has a plan

Blood for the blood god!

Reminds me of "crash-only software" as embraced by Ericsson/Erlang.


Reminds me of an anecdote about a CTO that pulled the plug on their main production database while giving a tour to investors. He had complete confidence that the solutions they had in place would deal with the problem gracefully.

That's the kind of confidence I'd like to have some day.

Has anybody tried this sort of break-stuff experiment in financial institutions (large banks)? Did the chaos you caused affect any customers and lead them to report problems?

Would buy but it’s expensive for me.

They are offering a free copy of the book at https://www.verica.io/book/

Except it is not free. I don't see a direct link, only a form that requires my information.

Your information, or SOME information?
