
Chaos Monkey Guide for Engineers - vinnyglennon
https://www.gremlin.com/chaos-monkey/
======
yavor-atanasov
At the BBC a few years ago we completely replaced our Chaos Monkey setup with
a "chaos-lambda". It's extremely simple to set up and saves us a chunk of cash
in terms of compute and maintenance. I'm sure Chaos Monkey is more featureful,
but a lot of the time you don't really need those features. If you just need
something that ticks on a cron and kills instances at random across your
estate, and you have a good number of AWS accounts, have a look.

[https://github.com/bbc/chaos-lambda](https://github.com/bbc/chaos-lambda)
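
For anyone wondering what the core of a "ticks on a cron and kills instances at random" job might look like, here is a minimal sketch. It is not the chaos-lambda code. The function name `pick_victims`, the `groups` mapping, and the `probability` knob are all hypothetical, chosen only for illustration. It assumes you have already listed instance IDs per group, and it only selects victims. In a real Lambda you would pass the result to something like boto3's EC2 `terminate_instances(InstanceIds=...)`.

```python
import random
from typing import Dict, List, Optional


def pick_victims(
    groups: Dict[str, List[str]],
    probability: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick at most one instance ID per group.

    Each non-empty group is hit independently with the given probability.
    All names here are illustrative assumptions, not a real tool's API.
    """
    rng = rng or random.Random()
    victims = []
    # Iterate in sorted order so runs with a seeded rng are reproducible.
    for name in sorted(groups):
        instances = groups[name]
        if instances and rng.random() < probability:
            victims.append(rng.choice(instances))
    return victims


if __name__ == "__main__":
    # Hypothetical estate: group name -> instance IDs currently running.
    estate = {
        "web": ["i-aaa", "i-bbb"],
        "api": ["i-ccc"],
        "batch": [],
    }
    print(pick_victims(estate, probability=0.2))
```

Keeping the selection separate from the termination call makes it easy to dry-run the schedule and to unit test the randomness with a seeded `random.Random`.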

It's always good to see this "chaos culture" being promoted as it drives best
engineering practices in terms of architecting, testing and monitoring
resilient systems. However, it's also interesting to see how at times this
space gets inflated, packaged and sold as this big complicated thing that
requires "chaos engineers" to implement (almost like "agile" got inflated into
a standalone industry :)). It's just a set of good practices that engineers
can follow to improve critical systems.

------
expertentipp
Honestly, I've never worked in an environment with a stack mature and
scalable enough, and with roles within the team divided clearly enough, for
Chaos Monkey to be of any use. Hectic deployments, an "it's not my job"
attitude, feature creep, and a litany of minor errors and misconfigurations
took its place.

~~~
jedberg
Perhaps if you had introduced Chaos Monkey, those other problems would have
gotten solved as a solution to the Chaos (tm). :)

~~~
karlkatzke
No. That’s entirely not the way it works. Chaos Engineering works in companies
with significant scale and sufficient maturity, and the purpose is to discover
design or implementation failures in distributed systems before they happen
during big rushes where you’re hopefully making lots of profit.

If you lack sufficient maturity or you are not large enough, you will actually
lose benefits if you try to implement chaos engineering. It’s like teaching a
toddler to use a propane torch. You just don’t do it, or everything you love
in the world will burn.

~~~
karlkatzke
Self reply to realize that I shouldn’t lecture my grandma on how to suck eggs.

~~~
jedberg
Maybe I should have added a disclaimer with the :) at the end.

Thanks for the laugh though. Your phrasing was awesome.

------
empath75
Older versions of Chaos Monkey didn’t require Spinnaker and still work
perfectly fine.

Also, if you’re using this, the right time to start doing it is in
development, account-wide. It’s mildly obnoxious for devs to lose stuff like
Jenkins servers and bastions, but you really quickly find single points of
failure and start to engineer around them. Usually making your stuff resilient
isn’t that much work, but it’s hard to know what you need to do until you test
it.

If you’re randomly terminating everything in dev and testing and staging, you
should already be battle hardened by the time you get into production.

------
NHQ
I built a similar, much more generic thing to test networks with various
failure modes [0]. It creates proxies for TCP/HTTP/APIs and handles those
streams however you want: cutoff, slow down, dropout, or DIY handlers.
Everything can be configured with JSON, routed by API endpoints, and modulated
live through an admin API. Process clustering is supported.

0\. [https://github.com/NHQ/netmorphic-1](https://github.com/NHQ/netmorphic-1)
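
To make the three failure modes concrete, here is a rough sketch of a stream handler that applies cutoff, slow down, and dropout to a sequence of byte chunks. It is not netmorphic's API. The function `degrade` and its parameters (`delay_s`, `drop_rate`, `cutoff_after`) are hypothetical names made up for this example, and a real proxy would apply the same idea to socket reads rather than an in-memory iterable.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Optional


def degrade(
    chunks: Iterable[bytes],
    delay_s: float = 0.0,
    drop_rate: float = 0.0,
    cutoff_after: Optional[int] = None,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Forward chunks while injecting failures (illustrative names only).

    delay_s:      hold every forwarded chunk for this many seconds.
    drop_rate:    probability of silently losing a chunk.
    cutoff_after: stop forwarding once this many chunks have been sent.
    """
    rng = rng or random.Random()
    sent = 0
    for chunk in chunks:
        if cutoff_after is not None and sent >= cutoff_after:
            return  # cutoff: behave as if the connection died
        if rng.random() < drop_rate:
            continue  # dropout: this chunk never arrives
        if delay_s > 0:
            sleep(delay_s)  # slow down: delay before forwarding
        sent += 1
        yield chunk


if __name__ == "__main__":
    payload = [b"GET ", b"/health ", b"HTTP/1.1\r\n", b"\r\n"]
    for piece in degrade(payload, delay_s=0.1, cutoff_after=3):
        print(piece)
```

The `sleep` and `rng` parameters are injectable so the handler can be tested without real waiting or real randomness.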

------
tayo42
Something that I don't think gets talked about or addressed enough is partial
failures. A whole instance dying is pretty much a solved problem, and I think
most people are aware of it and how to deal with it. If it's not addressed,
people are just taking their chances at this point. A server being removed is
an on-and-off kind of thing. Something like sporadic latency or corrupted data
can cause some surprising and unpredictable issues.

~~~
jedberg
Chaos Monkey does that too. We used to have a separate monkey called Latency
Monkey that induced network latency, because as you aptly point out, detecting
if something is down is a lot easier than detecting if it is slow or
intermittently down. Netflix has since incorporated those failure modes into
Chaos Monkey:
[https://github.com/Netflix/SimianArmy/wiki/The-Chaos-Monkey-...](https://github.com/Netflix/SimianArmy/wiki/The-Chaos-Monkey-Army)
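
As a rough illustration of why "slow" and "intermittently down" are harder to detect than "down", here is a small sketch that classifies a window of probe results. It is not how Chaos Monkey or Latency Monkey work. The function `classify`, the labels, the latency budget, and the "more than half the probes are slow" rule are all assumptions picked for the example. A single up/down check only ever sees one sample, so it can only distinguish the first case.

```python
from typing import List, Optional


def classify(latencies_s: List[Optional[float]], latency_budget_s: float) -> str:
    """Classify a window of probe results.

    Each entry is the probe's latency in seconds, or None if the probe
    failed outright. Labels and thresholds are illustrative assumptions.
    """
    if not latencies_s:
        return "unknown"
    failures = sum(1 for x in latencies_s if x is None)
    if failures == len(latencies_s):
        return "down"  # every probe failed: the easy case
    if failures > 0:
        return "intermittent"  # some probes failed, some succeeded
    slow = sum(1 for x in latencies_s if x > latency_budget_s)
    if slow * 2 > len(latencies_s):
        return "slow"  # it answers, but mostly over budget
    return "healthy"


if __name__ == "__main__":
    print(classify([None, None, None], latency_budget_s=0.2))
    print(classify([0.05, None, 0.07], latency_budget_s=0.2))
    print(classify([0.5, 0.6, 0.1], latency_budget_s=0.2))
    print(classify([0.05, 0.06, 0.07], latency_budget_s=0.2))
```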

------
smartbit
What is a recommended Chaos Monkey guide/method for Kubernetes, in other
words _without_ running Spinnaker?

~~~
dankohn1
Here are several chaos engineering efforts that work with Kubernetes:
[https://landscape.cncf.io/category=chaos-engineering&format=...](https://landscape.cncf.io/category=chaos-engineering&format=card-mode&grouping=category)

