>Fortunately, as part of some unrelated work we'd done recently, we had a version of the cluster that we could run inside Docker containers. We used it to help us build a script that mimicked the failures we saw in production. Being able to rapidly turn clusters up and down let us iterate on that script quickly, until we found a combination of events that broke the cluster in just the right way.
This is the coolest part of this story. Any chance these scripts are open source?