
My burning question: what counts as a "relatively rare maintenance event type"?



My hunch: a more invasive one. Think of turning off all machines in a cluster for major power work or to replace the enclosures themselves. Maintenance on a single machine or rack, by contrast, happens all the time and requires little more scheduling work than draining a node or a group of nodes in Kubernetes. I used to have my share of "fun" at Google when clusters came back unclean from major maintenance. That usually had no customer-facing impact, because traffic had been routed somewhere else the entire time.
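
To make the "drain a node" comparison concrete, here is a minimal sketch of what a drain boils down to, using the official kubernetes Python client (recent versions). The node name is made up, and a real kubectl drain handles more cases (DaemonSets, grace periods, local storage):

    # Minimal sketch of a node drain with the kubernetes Python client.
    # The node name below is hypothetical.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    node = "rack-17-machine-03"  # made-up node under maintenance

    # 1. Cordon: mark the node unschedulable so nothing new lands on it.
    core.patch_node(node, {"spec": {"unschedulable": True}})

    # 2. Evict the pods currently running there; evictions respect any
    #    PodDisruptionBudgets, which is what keeps serving jobs healthy.
    pods = core.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node}")
    for pod in pods.items:
        core.create_namespaced_pod_eviction(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body=client.V1Eviction(
                metadata=client.V1ObjectMeta(
                    name=pod.metadata.name,
                    namespace=pod.metadata.namespace)))

The point being: for a single machine or rack this is routine and automated; for "turn off the whole cluster" there is no equivalent one-liner.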


That means a task which is only run every few years, so there's not much experience with it, and it's harder to test and predict.

You normally prepare for such a task for a month, and then you hope it will work. In my case (I brought down one of the core DNS servers in Austria for a few minutes, due to a very trivial oversight) everyone knew what was happening, and after the caches ran out we immediately restored from backup. We weren't on page one of the news the way Google is now.

In the Google case they had no idea of the root cause, so they had to run after the guy who caused it. Only after 4 hours did they find him and stop the job. Reminds me a bit of Chernobyl, where nobody told anybody.


I don't have inside knowledge of this outage, but there are some details in here. They say the job got descheduled due to misconfiguration. That implies the job could have been configured to serve through the maintenance event. It also implies there is a class of jobs that could not have been. Power must have been at least mostly available, so it implies there was going to be some kind of rolling outage within the data center, which certain workloads can tolerate and others cannot.
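
Purely as an analogy (the outage was on Google's internal scheduler, Borg, not Kubernetes), the "could have been configured to serve through the maintenance event" distinction maps to something like a PodDisruptionBudget: a knob that decides whether a workload rides out a rolling drain or simply gets descheduled. A sketch with the kubernetes Python client, all names hypothetical:

    # Analogy only: a PodDisruptionBudget that keeps a serving job mostly up
    # while nodes are drained one at a time. All names are hypothetical.
    from kubernetes import client, config

    config.load_kube_config()
    policy = client.PolicyV1Api()

    pdb = client.V1PodDisruptionBudget(
        metadata=client.V1ObjectMeta(name="serving-job-pdb"),
        spec=client.V1PodDisruptionBudgetSpec(
            min_available="80%",  # keep at least 80% of replicas serving
            selector=client.V1LabelSelector(
                match_labels={"app": "serving-job"})))

    policy.create_namespaced_pod_disruption_budget(
        namespace="default", body=pdb)

A batch job without anything like this, or with a misconfigured equivalent, just goes down when its machines do.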


I have no idea what this was. But power distribution in a data center is hierarchical, and as much as you want redundancy, some parts in the chain are very expensive and sometimes you have to turn them off for maintenance.

I never actually worked in a data center, so keep in mind I don't know what I'm talking about. Traditional DCs have UPSes all over the place, but a UPS only bridges a finite amount of time, and your maintenance might take longer than that.
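
Back-of-the-envelope version of the "finite amount of time" point; every number here is a made-up illustrative assumption:

    # Rough UPS bridge time at a given IT load. Illustrative numbers only.
    def ups_runtime_minutes(battery_kwh, load_kw, inverter_efficiency=0.9):
        return battery_kwh * inverter_efficiency / load_kw * 60

    # e.g. a 500 kWh battery string behind a 2 MW room buys ~13 minutes,
    # while swapping a switchgear section can easily take hours.
    print(ups_runtime_minutes(battery_kwh=500, load_kw=2000))  # ~13.5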


Total speculation and just my interpretation, of course.

What it means to me is that some unusually poor decisions were made early on that triggered an unfortunate and unavoidable chain of events. "Very rare" is a damage-control statement. There is a subtle tone of concern and a feeling of blame running through that entire postmortem. This will be buried, but if it were investigated thoroughly I wouldn't be surprised by some serious consequences.

Total speculation. I do not work for Google.



