I think the duration of this outage basically came down to waiting for the GFS team to roll in. They knew at 6:44am that they had a crashing GFS Master on their hands, but they didn't decide to try upgrading until 8am, and only escalated to the GFS team at 9am, after the upgrade failed to help. That's the story of this outage, to me:
6:44 AM --- A GFS Site Reliability Engineer (SRE) reports that the GFS Master in App Engine's primary data center is failing and continuously restarting.
8:00 AM --- The cause of the GFS Master failures has not yet been identified. [...] the GFS SRE decides to commence the upgrade immediately in an attempt to alleviate the problem.
9:00 AM --- The GFS upgrade [...] finishes, but the Master is still failing. The GFS SRE escalates directly to the GFS engineering team, and [they begin] live debugging of the failing software[...].
10:00 AM --- GFS SRE advises that the GFS engineering team has identified the cause [...] GFS Master is no longer failing and GFS Chunkservers, which hold the actual needed data, are starting to come back up by 10:30 AM.
> the primary engineer discovered that the isolated servers supporting the Status Site were running in the same data center as the primary App Engine serving cluster.
Hilarious - they were running the status site for App Engine on the App Engine infrastructure, so when it went down they couldn't post its status as being down. I'm sure people will be quite philosophical about this outage and how difficult distributed systems are to manage, but this seems like kind of an obvious mistake to make...
Yes, that was an obvious mistake. A brown-paper-bag bug, so to speak.
The difference between Google and many other companies is: they are not afraid to admit it, and I'm sure it will be fixed before the next outage.
Most other hosting providers I've dealt with would not even tell me the exact cause of an outage. Or at most in vague terms, preferably shifting the blame to someone else. You wouldn't believe how often these pesky core routers at my datacenters supposedly had a problem!
I also don't know many other providers with a status page as detailed as Google's in the first place.
So, yes, this was a funny screw-up. But at a very high level. This is a bit like mocking someone who drove his Ferrari into a wall - at least he had a Ferrari to drive into a wall... ;-)
Yes, this is a bit curious. GFS cells are generally restricted to particular eng groups depending on the need, so I assume this particular GFS cell was fairly restricted, but it's still somewhat worrisome that unrelated, non-production code could lead to an outage of this scale. I expect whoever wrote the offending code is very aware of the repercussions of their work this week :)
"8:00 AM --- The cause of the GFS Master failures has not yet been
identified. However, a similar-looking issue that had been seen in a
different data center the week prior had been resolved by an upgrade
to a newer version of the GFS software. This upgrade was already
planned for the App Engine primary data center later in the week, so
the GFS SRE decides to commence the upgrade immediately in an attempt
to alleviate the problem."
So, they let their old version of GFS continue to run despite knowing there was a critical bug, and only when it crashed did they decide to TRY the upgrade to see if it would cure the problem?
If that upgrade had been done earlier, they could have shaved an hour off their outage. I'm sure it's easy to second-guess from the sidelines, but this really does puzzle me.
Also, in the 'what did we do wrong' section they completely gloss over this point.
[Disclaimer: I work at Google, but my project is about as far away from App Engine and GFS as you could possibly get, so on this issue I have no more information than a layperson.]
It's possible that nobody knew or thought that it was a critical bug. Oftentimes, you'll see a little bug somewhere, think "Oh, we should fix that eventually", but the conditions that would make it into a really big bug are rare or unknown. It's only when those conditions actually happen that you think, "Well shit. I guess we should've fixed that last week."
If a bug isn't believed to be critical, then waiting to deploy the fix is absolutely the right decision. There are all sorts of things that can go wrong with an upgrade, and doing it on a schedule lets you go through a rigorous QA process and monitor the push as it happens.
[Disclaimer - I work on Windows Azure which could be considered a competitor to GAE]
You always have critical bugs in large systems. Unless one is causing an outage or impacting some critical scenario in a major way, you don't want to deviate from your usual process. You want to do orderly upgrades to all your environments/clusters/DCs and short-cut/hot-patch only in dire emergencies.
Also, it is hard to judge the severity of bugs, especially in large scale distributed systems.
All upgrades should be rolled out gradually, so that if an upgrade triggers a bug or has other issues, not all data centers are affected at the same time.
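To make that staged-rollout idea concrete, here is a minimal sketch in Python. Everything in it - the cluster names, the version label, and the health check - is a hypothetical placeholder, not anything described in the postmortem: one cluster is upgraded at a time, and the rollout halts as soon as the new version looks unhealthy, so a bad release never reaches every data center at once.

    import time

    # Hypothetical clusters; a canary goes first, then the rest one at a time.
    CLUSTERS = ["dc-canary", "dc-us-east", "dc-us-west", "dc-eu"]

    def upgrade(cluster: str, version: str) -> None:
        # Stand-in for pushing new binaries and restarting services in one cluster.
        print(f"upgrading {cluster} to {version}")

    def healthy(cluster: str, soak_seconds: int = 5) -> bool:
        # Stand-in for watching error rates / restart counts during a soak period.
        # A real soak would run far longer and read actual monitoring data.
        time.sleep(soak_seconds)
        return True

    def staged_rollout(version: str) -> None:
        # Upgrade one cluster at a time; halt on the first unhealthy result so
        # the remaining clusters keep running the old, known-good version.
        for cluster in CLUSTERS:
            upgrade(cluster, version)
            if not healthy(cluster):
                print(f"rollout halted: {cluster} unhealthy on {version}")
                return
        print(f"{version} rolled out to all clusters")

    if __name__ == "__main__":
        staged_rollout("storage-v2")  # hypothetical version label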
I don't think it's clear that the query-of-death bug was included/fixed in the 8am GFS upgrade - they're just saying that an issue with similar external symptoms was recently fixed by upgrading to this newer version.
Sounds like the fact that this newer version was available was completely unrelated to this specific bug.
Yes, I understood that from reading the article; the point is that they still wasted an hour trying to see whether it would fix things or not, because the symptoms were similar (a crash of the master GFS server).
Am I the only person who finds this slightly amusing? About a week ago I read on HN about Google bragging about how, since all their sites run on the same underlying architecture, one upgrade means all their sites get the improvement (speed, scalability, etc.).
But it seems to have turned around in this case. One underlying bug brought down their whole system (or their whole App Engine system).
I don't see what you're laughing at. Their systems seem to be sufficiently isolated that the bug at least didn't affect other products (search, mail, etc.).
That alone is already more than you could expect from many other sites.
Showstopper bugs happen in any system; Google is no exception here. More important is their handling and communication of the issue, and that was stellar, even despite the communication hole during the blackout.
This post-mortem makes up for that faux pas in my book. How often do you get to read a detailed analysis like that from a company the size of Google?
With clear admission of their faults and detailed description of the steps they're going to take to resolve them?
This is the Google I want to see, and it gives me faith that they'll fix their communication issues for the next downtime.
It also makes me wonder what there is to laugh about. As a customer I cannot complain much about the way this was handled (I have seen much, much worse). As a competitor I'd once again piss my pants over the sheer self-confidence and professionalism Google displays at this scale.
Does your company have equal monitoring and procedures in place to detect, identify and resolve a low-level bug like this in a comparable timeframe?
Of course I respect all that. I didn't mean I was sitting here laughing at all their problems saying "haha, gotcha suckers."
I'm just saying that the fact it happened so close to when I read that is _slightly_ amusing. It's a serious thing that happened, definitely; I'm just taking a step back and enjoying a small part of it.
You laugh, but you should be respecting the engineering vision and guts it took to do the Right Thing: dropping the kludgy "industry standards" and aspiring to the Ideal, instead of running yet another ASP/Java/PHP/Oracle farm like all their peers.