
Thanks, so a handful at most, and the "usual" ones. I always thought that those companies keep their machines connected in (redundant) "sets", and that a command affecting all of them at once was a case for "never" rather than "once in a while".



Google, at least, has a thing that is supposed to prevent widespread disruption at the machine level, called the "Safe Removal Service"[1]. This is a good idea that in practice isn't perfect. If you write a tool that does not consult SRS, or your service doesn't declare an SRS policy, there can be surprises.
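To make the two failure modes concrete, here is a minimal sketch of a removal gate. It is not Google's SRS or its API; `RemovalPolicy`, `RemovalGate`, `max_unavailable`, and the machine and service names are all made up for illustration. The point is only that a service with no declared policy gets no protection, and that a tool which never calls the gate bypasses it entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class RemovalPolicy:
    # Hypothetical policy: how many of a service's machines may be out at once.
    max_unavailable: int


@dataclass
class RemovalGate:
    # Hypothetical stand-in for a removal-approval service (not the real SRS).
    policies: Dict[str, RemovalPolicy]   # service name -> declared policy
    placement: Dict[str, Set[str]]       # service name -> machines it runs on
    removed: Set[str] = field(default_factory=set)

    def request_removal(self, machine: str) -> bool:
        """Approve the removal only if no declared policy would be violated."""
        if machine in self.removed:
            return True  # already out, nothing new to approve
        for service, machines in self.placement.items():
            if machine not in machines:
                continue
            policy = self.policies.get(service)
            if policy is None:
                # No declared policy: nothing protects this service.
                continue
            already_out = len(machines & self.removed)
            if already_out + 1 > policy.max_unavailable:
                return False
        self.removed.add(machine)
        return True


gate = RemovalGate(
    policies={"mail": RemovalPolicy(max_unavailable=1)},
    placement={"mail": {"m1", "m2", "m3"}, "batch": {"m4", "m5"}},
)
print(gate.request_removal("m1"))  # first "mail" machine fits the budget
print(gate.request_removal("m2"))  # second one would exceed it
print(gate.request_removal("m4"), gate.request_removal("m5"))  # "batch" has no policy
```

In this toy setup the second "mail" removal is refused, while every "batch" machine can be taken out because "batch" never declared a policy.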

A particular outage that I will never forget took out Gmail delivery worldwide in an instant, because the change was not expected to be disruptive and therefore did not integrate with SRS. As it turned out, the change disabled the machines where it was applied. The process of selecting a subset of machines to canary the change was also not independent of the way in which Gmail assigns services to machines, so in the space of a few seconds they created a global outage.
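A toy illustration of why that lack of independence matters. The placement rule and both canary selectors below are invented for the example and are not how Gmail or Google's rollout tooling actually work; the stride of 10 and the fleet of 1000 machines are arbitrary assumptions. If the canary picker happens to use the same rule as service placement, a "10% canary" can cover 100% of one service.

```python
import random


def place_service(machines, every):
    """Hypothetical placement rule: the service runs on every `every`-th machine."""
    return {m for i, m in enumerate(machines) if i % every == 0}


def canary_correlated(machines, every):
    """Canary picker that (accidentally) uses the same stride as placement."""
    return {m for i, m in enumerate(machines) if i % every == 0}


def canary_independent(machines, count, seed):
    """Canary picker that samples uniformly at random, ignoring placement."""
    rng = random.Random(seed)
    return set(rng.sample(machines, count))


def fraction_hit(service_machines, canaries):
    """Share of the service's machines that fall inside the canary set."""
    return len(service_machines & canaries) / len(service_machines)


machines = [f"m{i:03d}" for i in range(1000)]
service = place_service(machines, every=10)             # 100 of 1000 machines

correlated = canary_correlated(machines, every=10)      # also 100 machines
independent = canary_independent(machines, count=100, seed=0)

print("correlated canary hits:", fraction_hit(service, correlated))
print("independent canary hits:", fraction_hit(service, independent))
```

Both canary sets are 10% of the fleet. The correlated one takes out every machine the service runs on, while the random one is expected to hit roughly a tenth of them.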

[1] https://twitter.com/bgrant0607/status/1134536670504554496



