Post-Mortem of AWS Outage

dlgeek · on June 17, 2012

What I found most remarkable was how in-depth their backup systems were and how many things had to go wrong for this to become an event.

This happened because they lost primary power AND had a generator fail AND had a distribution breaker fail. I wonder how often any two of those happen without us ever knowing about it...

Rantenki · on June 17, 2012

The likelihood of these occurring in parallel is probably higher than you would expect. Backup systems are difficult to test fully since you can only do it via a full failover. Without running 100% of your systems via the backup distribution breaker, for example, you cannot verify that it can handle the load. You can only design for it. The full system was probably tested at initial install, but obviously more power was being consumed by live systems than by the original test. Entropy doesn't leave backup systems alone either, so over time they can fail even without use, and it is a surprise to everyone when the "working" backup systems fall over.

dholowiski · on June 18, 2012

A data center outage is (usually) like a plane crash. It doesn't happen just because of one thing going wrong. A series of things have to go wrong, including backups failing, backups of backups failing, sometimes human error, and often lots of bad luck. Just like flying on a plane, single things go wrong all the time. If you don't find out about it, it means they're doing their job.

mw642 · on June 17, 2012

Wow! A proper outage post-mortem that isn't babbling on about control rods. I love it.

fish2000 · on June 18, 2012

I don't get the reference -- what was the infamously control-rod-centric post-mortem?

cmwelsh · on June 18, 2012

That would be Heroku's response a couple weeks ago: https://status.heroku.com/incidents/372

gulfie · on June 20, 2012

Well that's suck. Reading between the lines a little.

0) Wire faults happen, stuff breaks, that isn't fixable. No problem.

0.5) A cable fault that can take out your redundant pair of substations... means you do not have a redundant pair of substations. (sometimes you have to go to war with the army you have, not the army you'd want).

1) Faulty generator fan? a) Was this generator periodically run with a dummy load? { okay maybe a chicken fell in the fan or a weasel ate some control wires or something... }

2) The power circuits were never tested after a breaker got installed/reconfigured, and manual validation of configuration failed to find a discrepancy between plan and as build.

3) some back end parts of the EBS are probably A/B pairs and subject to split brain issues.

This is why you always need to test what you run, and run what you test...

Kudos for the write up. I see lots of electrician overtime in the near future.

dholowiski · on June 18, 2012

If you haven't learned your lessen yet then you deserve what you get. If you need 100% availability the only way you can get close is multiple servers, in multiple data centres, from multiple companies.

You cannot count on somebody else to make sure your web site is up. It's your job.