That sounds fascinating! How often does your team have to leap into action?

aaronblohowiak · on June 3, 2019

We don’t usually discuss the frequency of unplanned failovers, but I will tell you that we do a planned failover at least every two weeks. The team also uses traffic shaping to run whole-system load tests with production traffic; those happen quarterly.

justinator · on June 3, 2019

Do you do any chaos testing? Seems like it would slot right in, there.

Zobat · on June 3, 2019

I'd say yes. I heard about this tool just a week ago at a developer conference.

https://github.com/Netflix/chaosmonkey

a_t48 · on June 3, 2019

Netflix was a pioneer of chaos testing, right? https://en.m.wikipedia.org/wiki/Chaos_engineering

aaronblohowiak · on June 3, 2019

https://www.oreilly.com/library/view/chaos-engineering/97814... ;)

arainwater · on June 3, 2019

They invented the term, so probably yes :)

azimuth11 · on June 3, 2019

I think some Google engineers published a free Meap book on service reliability and uptime guarantees. It seems counterintuitive, but scheduling downtime without other teams’ prior knowledge encourages those teams to handle outages properly and reduce single points of failure, among other things.

fnord123 · on June 3, 2019

Service Reliability Engineering is published by O'Reilly. It's a good book. Up there with ZeroMQ and Data Intensive Applications as maybe the three best books from O'Reilly in the past ten years.

fnord123 · on June 3, 2019

Derp, Site Reliability Engineering.

https://landing.google.com/sre/books/