
So in instances like this, where everything goes wrong, does Google have the equivalent of a revert button to undo whatever infrastructure changes were made?



(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google. My team is oncall for this service and I know exactly what happened here; I probably can't answer most questions you might have.)

Let's go with "yes", as the most accurate answer. As soon as I or whoever is oncall has figured out what change was responsible, we can usually revert it quickly and easily. Usually, if I'm oncall and I have reason to even suspect a recent change might be the cause, I'll revert it and see if the problem goes away.

The difficulty becomes more apparent when you realise the sheer number of infrastructure changes being made every hour, some of which will be fixes to other outages, and some of which will be things you can't revert because they are of the form "that location has fallen offline; probably lost networking" or "we are now at peak time and there are more users online". So if your question is "can we just roll the whole world back one day" - no, too much has changed in that time.
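To make the "revert it and see if the problem goes away" idea concrete, here is a minimal sketch in Python. It is purely illustrative and not a description of any real Google tooling: `Change`, `find_culprit`, and the `revert`, `reapply`, and `healthy` callbacks are all hypothetical names, and putting back a change that turned out to be innocent is an assumption of this sketch rather than something described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Change:
    name: str
    # False for events that cannot be undone, e.g. "a location fell offline"
    # or "it is now peak time" (hypothetical flag for this sketch).
    revertible: bool


def find_culprit(
    recent: List[Change],
    revert: Callable[[Change], None],
    reapply: Callable[[Change], None],
    healthy: Callable[[], bool],
) -> Optional[Change]:
    """Revert suspects newest-first; return the first one whose revert restores health."""
    for change in reversed(recent):
        if not change.revertible:
            continue  # nothing to roll back for this kind of event
        revert(change)
        if healthy():
            return change  # leave it reverted: the problem went away
        reapply(change)  # assumption: an innocent change gets put back
    return None


if __name__ == "__main__":
    # Toy system: the set of changes currently live; "binary-rollout" is the bad one.
    active = {"config-push", "binary-rollout", "quota-tweak"}
    recent = [
        Change("config-push", True),
        Change("site-offline", False),
        Change("binary-rollout", True),
        Change("quota-tweak", True),
    ]
    culprit = find_culprit(
        recent,
        revert=lambda c: active.discard(c.name),
        reapply=lambda c: active.add(c.name),
        healthy=lambda: "binary-rollout" not in active,
    )
    print(culprit.name if culprit else "no revertible culprit found")
```

The non-revertible entries are the reason a global "roll everything back a day" does not exist in this model: the loop can only skip them.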


I know it's late, but thanks for this. It kind of reinforces the amazing size of the system and the number of people making changes to it.



