
Ask HN: Who uses Chaos Monkey in production? - nbraga
With Netflix announcing an upgrade [1] to Chaos Monkey today, I would be curious to know:

- Is your team using Chaos Monkey in your production/staging infrastructure?

- If not, do you use a variant tool or any interesting implementation of "Chaos Engineering" [2], and to what degree of success?

[1] https://news.ycombinator.com/item?id=12743693

[2] http://principlesofchaos.org

Previous discussion (2014): https://news.ycombinator.com/item?id=8713950
======
romanhn
I commented about Failure Fridays at PagerDuty in the older thread (not Chaos
Monkey precisely, but a similar concept). We still do that, with a few
modifications: 1) ChatOps is used to execute the commands from Slack in order
to preserve history and help interested parties follow along. 2) If we're not
testing a specific service/AZ/region, a "reboot roulette" bot is run that
reboots any host from our production infrastructure at random. Every single
production host is game. 3) This is now scheduled to run automatically at
certain times of the week.
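
To make the "reboot roulette" idea concrete, here is a minimal sketch, not PagerDuty's actual bot. The host list, the function names, and the `reboot` callable are all hypothetical; a real bot would pull hosts from its own inventory and trigger the reboot through whatever tooling the team already uses.

```python
import random

# Hypothetical inventory; a real bot would query its own source of truth.
HOSTS = ["web-1", "web-2", "db-1", "queue-1"]


def pick_victim(hosts, rng=random):
    """Choose one host uniformly at random; every host is eligible."""
    if not hosts:
        raise ValueError("no hosts to choose from")
    return rng.choice(hosts)


def reboot_roulette(hosts, reboot, rng=random):
    """Pick a random host, pass it to the supplied reboot callable, return it."""
    victim = pick_victim(hosts, rng)
    reboot(victim)
    return victim


if __name__ == "__main__":
    # Dry run: print the choice instead of rebooting anything.
    reboot_roulette(HOSTS, lambda host: print(f"would reboot {host}"))
```

Passing the reboot action in as a callable keeps the selection logic testable and lets the same function run in a dry-run mode before it is ever pointed at production.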

Many of the things we do regularly used to be terrifying. They no longer are
precisely because we do them regularly. That's the value of chaos engineering.

------
usgroup
I did something similar and had a test suite across a CoreOS cluster that
would fail machines in patterns for hours whilst checking service and data
integrity. One can run the suite against the replica cluster when new
deployment features are added or changed.
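
A rough sketch of that kind of suite, under heavy assumptions: `ToyCluster` is an in-memory stand-in invented for illustration (it is not CoreOS or any real cluster API), and the "patterns" here are simply every combination of a given number of failed nodes. After each pattern the harness checks that all written data is still readable.

```python
import itertools


class ToyCluster:
    """In-memory stand-in for a replicated cluster (hypothetical)."""

    def __init__(self, nodes, replicas=2):
        self.nodes = list(nodes)
        self.replicas = replicas
        self.down = set()
        self.store = {node: {} for node in self.nodes}

    def _owners(self, key):
        # Place each key on `replicas` consecutive nodes, deterministically.
        start = sum(key.encode()) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def put(self, key, value):
        for node in self._owners(key):
            self.store[node][key] = value

    def get(self, key):
        # Read from the first replica that is still up.
        for node in self._owners(key):
            if node not in self.down:
                return self.store[node].get(key)
        raise LookupError(f"all replicas of {key!r} are down")


def run_patterns(cluster, data, group_size):
    """Fail every combination of `group_size` nodes; return patterns that lost data."""
    failures = []
    for pattern in itertools.combinations(cluster.nodes, group_size):
        cluster.down = set(pattern)
        try:
            ok = all(cluster.get(k) == v for k, v in data.items())
        except LookupError:
            ok = False
        if not ok:
            failures.append(pattern)
        cluster.down = set()  # "restart" the machines before the next pattern
    return failures
```

With two replicas per key, failing any single node should report no failures, while some two-node patterns will. A real suite would replace the toy class with calls that actually stop machines and query the live services.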

Doing it on live wasn't possible for me because failover resulted in some
downtime, although I consider the replica-cluster testing to have been very
useful.

