Am I the only person who finds this slightly amusing? About a week ago I read on HN that Google was bragging that, since all their sites run on the same underlying architecture, a single upgrade gives every site the improvement (speed, scalability, etc.).
But it seems to have turned around in this case. One underlying bug brought down their whole system (or whole app engine system).
I don't see what you're laughing at. Their systems seem to be sufficiently isolated that the bug at least didn't affect other products (search, mail, etc.).
That alone is already more than you could expect from many other sites.
Showstopper bugs happen in any system; Google is no exception here. More important is their handling and communication of the issue, and that was stellar, even despite the communication hole during the blackout.
This post mortem makes up for that faux pas in my book. How often do you get to read a detailed analysis like that from a company the size of Google?
With a clear admission of their faults and a detailed description of the steps they're going to take to resolve them?
This is the Google I want to see, and it gives me faith that they'll fix their communication issues for the next downtime.
It also makes me wonder what there is to laugh about. As a customer I cannot complain much about the way this was handled (I have seen much, much worse). As a competitor I'd piss my pants all over again at the sheer self-confidence and professionalism Google displays at this scale.
Does your company have equivalent monitoring and procedures in place to detect, identify, and resolve a low-level bug like this in a comparable timeframe?
Of course I respect all that. I didn't mean I was sitting here laughing at all their problems saying "haha, gotcha suckers."
I'm just saying the fact that it happened so close to when I read that is _slightly_ amusing. It's a serious thing that happened, definitely; I'm just taking a step back and enjoying a small part of it.
You laugh, but you should be respecting the engineering vision and guts it took to do the Right Thing: dropping the kludgy "industry standards" and aspiring to the Ideal, instead of running yet another ASP/Java/PHP/Oracle farm like all their peers.
But it seems to have turned around in this case. One underlying bug brought down their whole system (or whole app engine system).
Haha.