
That sounds fascinating! How often does your team have to leap into action?



We don’t usually discuss the frequency of unplanned failovers, but I will tell you that we do a planned failover at least every two weeks. The team also uses traffic shaping to perform whole system load tests with production traffic, which happens quarterly.


Do you do any chaos testing? Seems like it would slot right in, there.


I'd say yes. I heard about this tool just a week ago at a developer conference.

https://github.com/Netflix/chaosmonkey


Netflix was a pioneer of chaos testing, right? https://en.m.wikipedia.org/wiki/Chaos_engineering



They invented the term, so probably yes :)


I think some Google engineers published a free MEAP book on service reliability and uptime guarantees. Counterintuitively, scheduling downtime without other teams’ prior knowledge encourages teams to handle outages properly and reduce single points of failure, among other things.


Service Reliability Engineering is published by O'Reilly. It's a good book. Up there with ZeroMQ and Data Intensive Applications as maybe the best three books from O'Reilly in the past ten years.


Derp, Site Reliability Engineering.

https://landing.google.com/sre/books/



