
Google API infrastructure outage incident report - cleverjake
http://googledevelopers.blogspot.com/2013/05/google-api-infrastructure-outage_3.html
======
WestCoastJustin
Very professional -- this should be a model for how to handle and report such
outages. There is a lot of procedural machinery happening to even get to this
state, i.e. monitoring, a change management system (who did what, and when),
troubleshooting and escalation, deploying a fix and verifying that it worked.
They do not gloss over the fact that this jumped QA!

~~~
cheez
Yep, someone's performance review is not going to go very well.

~~~
packetslave
There's nothing in the incident report that says this was due to human action.
An automated process could just as easily have pushed configs to the wrong
environment (either through a bug or a mis-configuration of the process
itself). Not saying that's what happened, but the IR doesn't say.

In my experience (having personally been the trigger for a widespread outage
that got us in the news), Google takes a fairly sophisticated view of outages:
SREs as a rule care less about crucifying the engineer who did it, and much
more about questions like:

what was the REAL root cause that caused this to happen? why didn't our
processes/tools STOP it from happening? did our monitoring detect it? could we
have detected it faster? did we fix it or mitigate it fast enough? did things
like communication, escalations, handoffs between teams, etc. work
effectively? _what can we do better next time?_

When I first started, my director took a bunch of us out to a Noogler lunch.
We sat down with our plates, and he said "Ok, let's talk about how to get
fired." Basically, mistakes happen. Bugs happen. If you cause a huge outage,
that isn't necessarily a negative reflection on you, and shouldn't hurt your
perf. If you cause a huge outage because you willfully ignored procedure, went
around established safety controls, didn't monitor to make sure your changes
didn't turn all the pretty dashboards red, THEN you're going to have a
problem.

------
ConceitedCode
Great job, Google API Infrastructure team! Communication is key. Now if only
everyone did this...

------
tlogan
This is great - now if only they actually had a status page saying that the
API is not working, so we would not need to scramble through forums (which are
closed) and StackOverflow (which is open, but all questions regarding the
outage are immediately closed). During previous outages (the last one happened
on Mar 18), an email was actually sent to people subscribed to
google-apps-apis-downtime-notify.

~~~
thezilch
FTFA...

 _Develop better mechanism for quickly delivering status notifications during
incidents._

I can't be sure how the numerous references to _monitoring_ failures relate
to _status_ updates, but at some point, you have to assume your
"status page" will have "bugs" too.

~~~
tlogan
Shouldn't there be somebody with the title "Google Developer Relations" (or
something like that) to send an email to the mailing list?

------
staunch
Hopefully this is the first time that person screwed up like this so they
weren't summarily executed^Wfired.

------
ushi
_a configuration change was inadvertently released to our production
environment without first being released to the testing environment._

Google - Just humans, too.

