

Post-mortem analysis on Google App Engine outage - DocSavage
http://groups.google.com/group/google-appengine/msg/ba95ded980c8c179

======
sah
I think the duration of this outage was basically about waiting for the GFS
team to roll in. They knew at 6:44am that they had a crashing GFS Master on
their hands, but they didn't decide to try upgrading until 8am, and only
escalated to the GFS team _after_ the upgrade failed to help, at 9am. That's
the story of this outage, to me:

6:44 AM --- A GFS Site Reliability Engineer (SRE) reports that the GFS Master
in App Engine's primary data center is failing and continuously restarting.

8:00 AM --- The cause of the GFS Master failures has not yet been identified.
[...] the GFS SRE decides to commence the upgrade immediately in an attempt to
alleviate the problem.

9:00 AM --- The GFS upgrade [...] finishes, but the Master is still failing.
The GFS SRE escalates directly to the GFS engineering team, and [they begin]
live debugging of the failing software[...].

10:00 AM --- GFS SRE advises that the GFS engineering team has identified the
cause [...] GFS Master is no longer failing and GFS Chunkservers, which hold
the actual needed data, are starting to come back up by 10:30 AM.

------
tlrobinson
These sort of public post-mortems are a fascinating look inside an otherwise
opaque system.

~~~
rufo
Seriously - I don't have anything currently running on GAE, and I found it
incredibly interesting reading.

------
zmimon
> the primary engineer discovered that the isolated servers supporting the
> Status Site were running in the same data center as the primary App Engine
> serving cluster.

Hilarious - they were running the status site for App Engine on the App Engine
infrastructure, so when it went down they couldn't post its status as being
down. I'm sure people will be quite philosophical about this outage and how
difficult distributed systems are to manage, but this seems like kind of an
obvious mistake to make...

~~~
moe
Yes, that was an obvious mistake. A brown-paper-bag bug, so to speak.

The difference between Google and many other companies is that they are not
afraid to admit it, and I'm sure it will be fixed before the next outage.

Most other hosting providers I've dealt with would not even tell me the exact
cause for an outage. Or at most in blurry terms, preferably handing off the
blame to someone else. You wouldn't believe how often these pesky core routers
at my datacenters supposedly had a problem!

I also don't know many other providers with a status page as detailed as
Google's in the first place.

So, yes, this was a funny screw-up. But at a very high level. This is a bit
like mocking someone who drove his Ferrari into a wall - at least he _had_ a
Ferrari to drive into a wall... ;-)

------
ggruschow
Am I reading this right? I hope not.

 _Another user of GFS in the same primary datacenter as App Engine is issuing
a request to the GFS servers that reliably causes a crash._

A Google employee had a bug in their GFS-using code which ended up crashing
GFS for themselves and all GAE users?

 _The failover procedure [...] was not designed to handle failover during full
unavailability for a long period (greater than three hours)._

Wouldn't something like a fire or a significant code bug in GFS cause full
unavailability for hours?

~~~
robk
Yes, this is a bit curious. GFS cells are generally restricted to particular
eng groups depending on the need, so I assume this particular GFS cell was
fairly restricted, but it's still somewhat worrisome that unrelated, non-
production code could lead to an outage of this scale. I expect whoever wrote
the offending code is very aware of the repercussions of their work this week
:)

------
jacquesm
this is a puzzling bit:

"8:00 AM --- The cause of the GFS Master failures has not yet been identified.
However, a similar-looking issue that had been seen in a different data center
the week prior had been resolved by an upgrade to a newer version of the GFS
software. This upgrade was already planned for the App Engine primary data
center later in the week, so the GFS SRE decides to commence the upgrade
immediately in an attempt to alleviate the problem."

So, they let their old version of GFS continue to run in spite of knowing
that there was a critical bug, and then only when it crashed did they decide
to TRY the upgrade and see if it would cure the problem?

If that upgrade had been done earlier, they could have shaved an hour off
their outage. I'm sure it's easy to second-guess from the sidelines, but this
really does puzzle me.

Also, in the 'what did we do wrong' section they completely gloss over this
point.

~~~
boundlessdreamz
All upgrades should be rolled out gradually, so that if the upgrade triggers a
bug or has some other issue, not all data centers are affected at the same
time.
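
A minimal sketch of that idea (in Python; the upgrade/health_check hooks and
soak time are hypothetical placeholders, not Google's actual rollout tooling):

    import time

    def staged_rollout(datacenters, upgrade, health_check, soak_seconds=3600):
        """Upgrade one data center at a time, halting as soon as one looks unhealthy."""
        for dc in datacenters:
            upgrade(dc)                   # push the new version to this cell only
            time.sleep(soak_seconds)      # let it soak before touching the next cell
            if not health_check(dc):      # e.g. master up, chunkservers reporting in
                raise RuntimeError("rollout halted: %s unhealthy after upgrade" % dc)

Even a soak period of a few hours per cell keeps a bad version from reaching
every data center at once.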

~~~
rgoddard
Plus, the update was already planned to be installed later in the week, so
it's not like they were ignoring it.

------
mitchellh
Am I the only person who finds this slightly amusing? Just about a week ago I
read on HN about Google bragging about how, since all their sites run on the
same underlying architecture, one upgrade means all their sites get the
improvement (speed, scalability, etc.).

But it seems to have turned around in this case. One underlying bug brought
down their whole system (or at least the whole App Engine system).

Haha.

~~~
moe
I don't see what you're laughing at. Their systems seem to be sufficiently
isolated that the bug at least didn't affect other products (search, mail,
etc.).

That alone is already more than you could expect from many other sites.

Showstopper bugs happen in any system; Google is no exception here. More
important is their handling and communication of the issue. And that was
_stellar_, even despite the communication hole during the blackout.

This post-mortem makes up for that faux pas in my book. How often do you get
to read a detailed analysis like that from a company the size of Google? With
a clear admission of their faults and a detailed description of the steps
they're going to take to resolve them?

This is the google I want to see and it gives me good faith that they'll fix
their communication issues for the next downtime.

It also makes me wonder what there is to laugh about. As a customer, I cannot
complain much about the way this was handled (I have seen much, much worse).
As a competitor, I'd once again piss my pants over the sheer self-confidence
and professionalism Google displays at this scale.

Does _your_ company have equivalent monitoring and procedures in place to
detect, identify, and resolve a low-level bug like this in a comparable
timeframe?

So, no laughing here.

~~~
mitchellh
Of course I respect all that. I didn't mean I was sitting here laughing at all
their problems saying "haha, gotcha suckers."

I'm just saying the fact that it happened so close to when I read that is
_slightly_ amusing. It's a serious thing that happened, definitely; I'm just
taking a step back and enjoying a small part of it.

