
The Discipline of Chaos Engineering - kiyanwang
https://blog.gremlininc.com/the-discipline-of-chaos-engineering-e39d2383c459
======
voltagex_
>Kolton is co-founder and CEO of Gremlin Inc. Previously he was a Chaos
Engineer at Netflix improving streaming reliability and operating the Edge
services.

Right - this makes sense. I thought this philosophy sounded Netflix-ish.
Interesting to see it "spun out" into a product.

------
coldcode
Watching this kind of complex distributed system in the throes of random
failure every day at work, I wish we actually did this instead of just
reacting to one nightmare after another.

------
westoque
>>> Chaos Engineering is the discipline of experimenting on a distributed
system in order to build confidence in the system’s capability to withstand
turbulent conditions in production.

I don't like how they define "Chaos Engineering" as being strictly related to
distributed systems.

~~~
Arcsech
Chaos Engineering seems most useful in the context of distributed systems
though, and I'm not entirely certain how you would implement it outside of
that context.

I would consider applying invalid inputs, etc. to non-distributed systems to
be more along the lines of traditional testing. Perhaps you could implement
chaos engineering principles in a non-distributed system by simulating the
failure of a CPU core or a region of memory? It seems less useful though, as
those things seem very difficult to effectively recover from.

How would you define "chaos engineering" to apply to non-distributed systems?

~~~
westoque
> Perhaps you could implement chaos engineering principles in a non-
> distributed system by simulating the failure of a CPU core or a region of
> memory?

Could be, in a general sense of chaos engineering.

For me, chaos engineering would mean continually generating "chaos" in a
system or module or any unit. Chaos in this sense would be breaking it, or
subjecting it to any other kind of maltreatment.
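A minimal sketch of that unit-level notion of chaos (all names here are hypothetical, not from any chaos-engineering library): wrap any callable so it randomly fails, so callers are forced to handle the "maltreatment".

```python
import random

def chaotic(fn, failure_rate=0.2):
    """Wrap a callable so it randomly raises, simulating chaos
    injected into a single module rather than a whole distributed
    system. failure_rate is the probability of an injected failure."""
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise RuntimeError("injected chaos failure")
        return fn(*args, **kwargs)
    return wrapper
```

Calling code then has to prove it can survive the wrapped unit misbehaving, which is the same confidence-building exercise, just at a smaller blast radius.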

------
nstart
I'm curious though. What does one do if the database goes away? Also, how does
one achieve high availability without doubling costs? Is this covered anywhere
as a topic area or book or dedicated blog? Quite curious how to get started
with something like this.

~~~
dastbe
you're increasing costs, but you're insuring against the lost profits from
your site being down as well as reduced consumer trust. there are ways to keep
the cost multiplier down (running many smaller instances vs a single large
instance, using containers and running multitenant machines), but when you
start thinking about all the layers you want to add redundancy to (load
balancing, application, database, storage, etc.) you're going to be spending
quite a bit of additional money for that insurance. which is why there are
good arguments that not every application needs to build in these kinds of
redundancy.

For your specific question about databases, you generally have clustering to
reduce the impact of any one database instance going down, caching of data to
guard against temporary db outages/network issues, and sharding of data across
multiple databases to reduce the blast radius of any one logical database
going away entirely.
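The caching point above can be sketched in a few lines (a hypothetical read-through cache, not code from any particular system): serve from the database when it's up, and fall back to the last cached value during a temporary outage.

```python
import time

class CachedReader:
    """Read-through cache that guards against temporary DB outages:
    on a successful read the value is cached; if the DB call raises,
    serve the stale cached copy rather than failing outright."""

    def __init__(self, db_get):
        self.db_get = db_get     # callable that may raise during an outage
        self.cache = {}          # key -> (value, fetched_at)

    def get(self, key):
        try:
            value = self.db_get(key)
            self.cache[key] = (value, time.time())
            return value
        except Exception:
            if key in self.cache:
                # stale data beats an error page during a short outage
                return self.cache[key][0]
            raise                # nothing cached: surface the outage
```

Stale reads are the trade-off here; whether that's acceptable depends on the data, which is why caching is only one layer alongside clustering and sharding.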

------
tehmillhouse
Isn't this basically what Google has been doing for years with their DiRT
exercises?

~~~
dastbe
1\. Netflix has been doing "chaos engineering" for years (the earliest public
reference I can find is 2011), especially w.r.t. the public cloud.

2\. Netflix has done a great job at publicizing their efforts and open
sourcing software that helps you do this kind of testing in a continuous,
automated fashion.

So I think the "basically what google has been doing" comment is reductive.

