
Chaos Engineering Upgraded - vquemener
http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html
======
romanhn
Great to see the principles codified. We practice them at PagerDuty with our
Failure Fridays -
[https://www.pagerduty.com/blog/failure-friday-at-pagerduty](https://www.pagerduty.com/blog/failure-friday-at-pagerduty).
Seeing
the production system handle a data center failure gracefully during practice
gives us real confidence that we can handle an unplanned region outage (and we
have, in fact).

~~~
roghummal
PagerDuty - YC S10

------
burnte
They practice something I grabbed onto early in my computing life: error
handling is critical. There's pretty much one way for an application to
succeed, and that's by doing what the user told it to do without error.
However, there are potentially an infinite number of ways to fail, and it's
important to think about that early on. I spent the better part of Wednesday
debugging what should have been very simple DB stuff because things were
failing silently, with no errors at all.

~~~
akavel
This interestingly leads, I believe, to the "let it crash" philosophy of Erlang
(and apparently Akka?) -- see e.g.:
[http://c2.com/cgi/wiki?LetItCrash](http://c2.com/cgi/wiki?LetItCrash),
[https://lwn.net/Articles/191059/](https://lwn.net/Articles/191059/).

I do believe it's a very good and important approach, in that I don't really
know of a better one. Still, I've learned that it seems to also have its own
weird and counterintuitive risks -- see Systemantics:
[https://en.wikipedia.org/wiki/Systemantics#System_failure](https://en.wikipedia.org/wiki/Systemantics#System_failure).
As I understand the premise: it's easy to over-rely on one's "fail-safe"
measures, to the point where the "regular" mode of operation is allowed to
deteriorate such that the fail-safes are actually running the system (or are
at least hit too often); then, failure of a fail-safe takes the system
down, and unexpectedly ( _"Why, when we had so many and such good
fail-safes!"_).

~~~
burnte
Oh, I'm all for the "let it crash" ideology, just make it crash in a loud and
verbose manner. Let me figure out what went wrong. :)

------
fahimulhaq
On the same subject, Facebook also turned off one of their data centers last
year. [http://bit.ly/1gX6E6h](http://bit.ly/1gX6E6h)

