

Google App Engine Team's Post-mortem for February 24 Outage - anurag
https://groups.google.com/group/google-appengine/browse_thread/thread/a7640a2743922dcf

======
ww520
Good to see Google being open about the problem. A couple of things:

\- Doesn't each server have its own UPS? I thought someone from Google showed
off the machine design with a battery pack attached to each machine. Why did a
data center power outage crash 25% of the machines at the same time?

\- Shouldn't the servers have been restarted and recovered automatically? The
report said 25% of the servers didn't get UPS power in time and crashed, which
implies power was restored shortly afterward and the servers could simply have
been restarted. They're just dumb servers that can be shut down, restarted, and
recovered at any time. If enough of them had been restarted, BigTable/DataStore
would have had enough data nodes to continue, and the downtime would have been
a couple of minutes to tens of minutes.

\- Lack of precise monitoring. The first clues of the problem were a drop in
traffic and posts on outside discussion groups. A drop in traffic can be caused
by so many things. Shouldn't there be health-check monitors on the BigTable and
DataStore clusters? For example, if more than 10% (25% in this case) of the
nodes in a cluster are down, raise an alarm and page someone right away (a
rough sketch of such a check follows below).

\- Was there capacity planning and stress testing to determine what percentage
of capacity can still serve traffic in each DC and each cluster? Losing 25% of
the servers and taking down the whole BigTable cluster sounds like too thin a
safety margin.

I don't envy the on-call staff. The pressure must be tremendous.
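
A minimal, hypothetical sketch of that kind of check in Python (the alarm
threshold and the page_oncall callable are assumptions, not anything from the
post-mortem):

    # Hypothetical cluster health check: page the on-call engineer once
    # too many data nodes in a cluster are unhealthy.
    DOWN_FRACTION_ALARM = 0.10  # alarm once >10% of nodes are down

    def check_cluster(name, node_statuses, page_oncall):
        """node_statuses: list of booleans, True if that node is healthy.
        page_oncall: callable that actually sends the page (assumed)."""
        down = sum(1 for healthy in node_statuses if not healthy)
        frac = down / len(node_statuses)
        if frac > DOWN_FRACTION_ALARM:
            page_oncall("%s: %d/%d nodes down (%.0f%%)"
                        % (name, down, len(node_statuses), frac * 100))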

~~~
simonw
I seem to remember the on-board batteries only last a few minutes - long
enough to fail over to another datacenter, supposedly.

------
jeff18
A lot of companies issue nice official apologies, but this one really stands
out. Google is not just saying sorry; they are actually implementing serious
changes, which probably represent millions of dollars of development work, to
help make sure this doesn't happen again.

------
alain94040
This is the part that I find fascinating:

    
    
      7:48 AM - Internal monitoring graphs first begin to
      show that traffic has problems in our primary datacenter
      9:35 AM - An engineer with familiarity with the unplanned 
      failover procedure is reached
    

And the blog entry keeps making it sound like there was only one engineer
making decisions for the first two hours.

------
cellis
For a seriously detailed analysis of the challenges of provisioning App
Engine, take a look at Ryan Barrett's video:

[http://sites.google.com/site/io/under-the-covers-of-the-
goog...](http://sites.google.com/site/io/under-the-covers-of-the-google-app-
engine-datastore)

------
_delirium
A strangely prosaic failure. A large portion of the entries there seem to be
the on-call engineer simply trying to figure out what the failover procedure
actually is. Once he finally gets a copy of it, it takes 16 minutes to get
back up:

    
    
      9:53 AM - After engineering team consultation with the relevant
      engineers, now online, the correct unplanned failover procedure 
      operations document is confirmed, and is ready to be used by 
      the oncall engineer. The actual unplanned failover procedure for 
      reads and writes begins. 
      10:09 AM - The unplanned failover procedure completes, without 
      any problems. Traffic resumes serving normally, read and write. 
      App Engine is considered up at this time.

~~~
ww520
It's probably just a matter of configuring the routers to redirect traffic to
the backup data center. All the servers are already running on standby in the
backup data center.

The only tricky part is holding all writes on the backup and flushing all
pending updates from the primary to the backup. That's what they did: they put
the backup in read-only mode and slowly turned read/write back on after a
while (roughly the sequence sketched below).
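
As a rough, hypothetical sketch of that sequence (every function name here is
an assumed placeholder standing in for whatever traffic, replication, and
datastore controls actually exist, not a real Google API):

    # Hypothetical unplanned-failover sequence for a primary/backup pair.
    def unplanned_failover(primary, backup):
        backup.set_read_only()                 # hold writes while catching up
        redirect_traffic(backup)               # repoint routers / load balancers
        if primary.reachable():
            flush_pending_updates(primary, backup)  # drain unreplicated writes
        wait_until(lambda: backup.replication_lag() == 0)
        backup.set_read_write()                # backup becomes the new primary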

I don't understand why they lost so much data. 0.00002% is a lot when the
dataset is really large (0.00002% of a petabyte, say, is still about 200 MB).
Shouldn't there be replication from the primary to the backup? I assume Google
is rich enough to have multiple links and fat pipes.

------
pquerna
This is the most open I've seen Google be about this kind of operations issue.
Very helpful of them, and I hope they continue to do it!

~~~
eitally
They do this every time there's a system or process failure that affects
paying customers. They're also quite good about proactively crediting accounts
when SLAs are breached. Not quite as good as Netflix, but better than most.

------
richardw
_\- Implement a regular bi-monthly audit of our operations docs to ensure that
all needed procedures are properly findable, and all out-of-date docs are
properly marked "Deprecated."_

Surely that leaves a two-month window in which weird things can happen?

How about:

\- All features/changes that could affect the document set require a
documentation update, a documentation review and training of all relevant
staff before deployment.

That should ensure the document set is consistent and the staff is aware of
the changes. My thinking in general is to replace periodic reviews with
processes that ensure the reviews aren't necessary.

Any improvements/suggestions/reasons why it wouldn't be better?

~~~
sriramk
Sounds like a great way to drown the team in process. I don't mean to sound
snarky, but you often need to find the right balance between process and
making sure the team can write code and ship features without having to spin
up a ton of paperwork and training.

~~~
richardw
You don't have to use the same people to develop, document and train. Most of
the time it's a very bad idea.

Also, each change in the _dev_ environment doesn't kick off a bunch of admin.
However, each change in a _live_ environment with hundreds of thousands of
customers, who in turn have businesses with collectively millions of
customers, should be as close to perfect as you can get. You just can't get
that if you only document something up to two months later.

------
lennysan
I've been working on a standard guideline for postmortem communication, and
ran this post against that template:
[http://www.transparentuptime.com/2010/03/google-app-
engine-d...](http://www.transparentuptime.com/2010/03/google-app-engine-
downtime-postmortem.html)

------
ryan_b
Re the questions about replication across datacenters, see:

[http://code.google.com/events/io/sessions/TransactionsAcross...](http://code.google.com/events/io/sessions/TransactionsAcrossDatacenters.html)

It discusses both App Engine's approach and the underlying factors and
tradeoffs that apply to any similar system.

For the executive summary, see slide 33 from:

[http://snarfed.org/space/transactions_across_datacenters_io....](http://snarfed.org/space/transactions_across_datacenters_io.html)

------
jbyers
Power is the great uptime equalizer. Show me the most elegant HA design you
can dream up and chances are I'll show you a system that's one unexpected
power failure away from failure.

~~~
arethuza
Indeed, I can remember us all feeling quite smug about our racks having dual A
and B feeds from separate mega UPSs, generators, etc.

We didn't feel quite so smug when we found out that some of the racks actually
had both sides plugged into the A feed - probably confusion between us, the
data center owner, and the electrician installing the feeds.

