

Inside Azure Search: Chaos Engineering - plurby
http://azure.microsoft.com/blog/2015/07/01/inside-azure-search-chaos-engineering

======
henakama
Hi, author of the post here. Happy to answer any questions, would love to hear
from any of you who have used similar approaches to managing failure in
distributed systems.

~~~
curiousDog
How do other Azure teams inside Microsoft do their testing (SQL Azure, Azure
DocDB, etc.)? I'd assume there's a common platform that allows you to just
throw stuff at it and inject failures? Further, in production, what percent of
bugs, if any, did you find were not caught by chaos engineering? Were they of
a particular class? Did you still see a necessity for functional tests that
test all permutations diligently?

~~~
henakama
Regarding testing, I think that static functional tests that verify
permutations of a specific set of inputs are still essential, in the same way
that unit tests are needed even when you have comprehensive integration test
coverage.

As the old axiom goes, the earlier you find a bug, the cheaper it is to fix. The
Search Chaos Monkey tests run against the current state of the repository, are
not very precise and are relatively expensive and time-consuming to run. I
only know when a test has failed because we start getting notifications that
the service is unavailable, or has degraded performance. I still need to go
dig up logs to see how the fault actually impacted the system, then go file a
bug against code that's already been committed. If I spend 15 minutes digging
through logs only to find that an error was caused by a bug in Azure Search
code that could have been caught at a lower level of testing before that code
was checked in, I won't be very happy.

The monkey is there to test how our code works as the underlying cluster
changes states and experiences failures. Before any code gets passed to the
monkey, we should be fairly confident through rigorous functional tests that
it performs correctly in most circumstances.
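To illustrate the kind of testing described above, here is a minimal sketch of a chaos-style fault-injection loop. This is not Azure Search's actual harness; the `Cluster` class, the quorum-based health check, and the kill/restart fault set are all hypothetical stand-ins, chosen only to show the shape of "inject a random fault, then check whether the service stayed healthy":

```python
# Minimal chaos-loop sketch: randomly kill/restart nodes in a toy cluster
# and record every round where a health check failed. All names here
# (Cluster, chaos_run) are hypothetical, not part of any real harness.
import random


class Cluster:
    """Toy N-node cluster; the service is healthy while a majority is up."""

    def __init__(self, nodes=3):
        self.up = {f"node-{i}": True for i in range(nodes)}

    def kill(self, node):
        self.up[node] = False

    def restart(self, node):
        self.up[node] = True

    def healthy(self):
        # Quorum check: degraded once a majority of nodes is down.
        return sum(self.up.values()) > len(self.up) / 2


def chaos_run(cluster, rounds=100, seed=7):
    """Inject random kill/restart faults; return rounds where health failed."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    failures = []
    for i in range(rounds):
        node = rng.choice(list(cluster.up))
        if rng.random() < 0.5:
            cluster.kill(node)
        else:
            cluster.restart(node)
        if not cluster.healthy():
            # In a real system this is where you'd snapshot logs for triage.
            failures.append((i, dict(cluster.up)))
    return failures


if __name__ == "__main__":
    failures = chaos_run(Cluster())
    print(f"{len(failures)} unhealthy rounds out of 100")
```

The seeded RNG reflects the point made above about triage cost: when a run turns up an unhealthy state, you want to be able to replay the exact fault sequence rather than spend time digging through logs to reconstruct it.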

