Thanks, so a handful at most, and the "usual" ones, I always thought that those ...

jeffbee · on Oct 28, 2020

Google, at least, has a thing that is supposed to prevent widespread disruption at the machine level, called the "Safe Removal Service"[1]. This is a good idea that in practice isn't perfect. If you write a tool that does not consult SRS, or your service doesn't declare an SRS policy, there can be surprises.

A particular outage that I will never forget took out Gmail delivery worldwide in an instant, because the change was not expected to be disruptive and therefore did not integrate with SRS. As it turned out, the change disabled the machines where it was applied. The process of selecting a subset of machines to canary the change was also not independent of the way in which Gmail assigns services to machines, so in the space of a few seconds they created a global outage.
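To make the "not independent" part concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the modulo placement rule, the fleet size, and the function names are assumptions, not how Gmail or SRS actually work. It only shows that a 10% canary chosen by the same rule that drives placement can land on every machine of one service, while a 10% canary chosen independently is very unlikely to.

```python
import random

NUM_MACHINES = 1000  # hypothetical fleet size


def hosts_mail(machine_id: int) -> bool:
    # Hypothetical placement rule: the mail service runs on every 10th machine.
    return machine_id % 10 == 0


def correlated_canary(machines):
    # Canary subset picked with the same modulo rule that drives placement,
    # so "10% of the fleet" is exactly the set of mail machines.
    return [m for m in machines if m % 10 == 0]


def independent_canary(machines, fraction, seed):
    # Canary subset drawn at random, independent of how services are placed.
    rng = random.Random(seed)
    return rng.sample(machines, int(len(machines) * fraction))


def fraction_of_mail_hit(canary):
    # Share of the mail service's machines that fall inside the canary set.
    mail = {m for m in range(NUM_MACHINES) if hosts_mail(m)}
    return len(mail & set(canary)) / len(mail)


machines = list(range(NUM_MACHINES))
correlated = fraction_of_mail_hit(correlated_canary(machines))
independent = fraction_of_mail_hit(independent_canary(machines, 0.10, seed=1))

print(f"correlated canary hits {correlated:.0%} of mail machines")
print(f"independent canary hits {independent:.0%} of mail machines")
```

Both canaries touch the same number of machines. Only the correlated one takes out the whole service if the change turns out to disable whatever it touches.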

[1] https://twitter.com/bgrant0607/status/1134536670504554496