

Post-Mortem of AWS Outage - dlgeek

[Self-post because it's not easily linkable]

We would like to share some detail about the Amazon Elastic Compute Cloud (EC2) service event last night when power was lost to some EC2 instances and Amazon Elastic Block Store (EBS) volumes in a single Availability Zone in the US East Region.

At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power. At 8:53PM PDT, one of the generators overheated and powered off because of a defective cooling fan. At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity). Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power. Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.

The generator fan was fixed and the generator was restarted at 10:19PM PDT. Once power was restored, affected instances and volumes began to recover, with the majority of instances recovering by 10:50PM PDT. For EBS volumes (including boot volumes) that had inflight writes at the time of the power loss, those volumes had the potential to be in an inconsistent state. Rather than return those volumes in a potentially inconsistent state, EBS brings them back online in an impaired state where all I/O on the volume is paused. Customers can then verify the volume is consistent and resume using it. By 1:05AM PDT, over 99% of affected volumes had been returned to customers with a state 'impaired' and paused I/O to the instance.

Separate from the impact to the instances and volumes, the EBS-related EC2 API calls were impaired from 8:57PM PDT until 10:40PM PDT. Specifically, during this time period, mutable EBS calls (e.g. create, delete) were failing. This also affected the ability for customers to launch new EBS-backed EC2 instances. The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. The EBS datastore is used to store metadata for resources such as volumes and snapshots. One of the primary EBS datastores lost power because of the event. The datastore that lost power did not fail cleanly, leaving the system unable to flip the datastore to its replicas in another Availability Zone. To protect against datastore corruption, the system automatically flipped to read-only mode until power was restored to the affected Availability Zone. Once power was restored, we were able to get back into a consistent state and returned the datastore to read-write mode, which enabled the mutable EBS calls to succeed. We will be implementing changes to our replication to ensure that our datastores are not able to get into the state that prevented rapid failover.

Utility power has since been restored and all instances and volumes are now running with full power redundancy. We have also completed an audit of all our back-up power distribution circuits. We found one additional breaker that needed corrective action. We've now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes.

We sincerely apologize for the inconvenience to those who were impacted by the event.
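
The read-only fallback described above for the EBS metadata datastore can be illustrated with a toy model. This is only a sketch under stated assumptions, not how EBS is actually implemented: the `MetadataStore` class, its method names, and the `failed_cleanly` flag are all hypothetical, chosen to show why reads kept working while mutable calls (create, delete) failed until consistency was restored.

```python
from enum import Enum


class Mode(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"


class ReadOnlyError(Exception):
    """Raised when a mutating call arrives while the store is read-only."""


class MetadataStore:
    """Toy stand-in for a replicated metadata datastore (hypothetical)."""

    def __init__(self):
        self.mode = Mode.READ_WRITE
        self.volumes = {}

    def on_primary_lost(self, failed_cleanly):
        # Assumption: a clean failure would allow promoting a replica and
        # staying read-write; an unclean one falls back to read-only to
        # protect against corruption.
        if not failed_cleanly:
            self.mode = Mode.READ_ONLY

    def on_consistency_restored(self):
        self.mode = Mode.READ_WRITE

    def create_volume(self, volume_id):
        # Mutating call: rejected while the store is read-only.
        if self.mode is Mode.READ_ONLY:
            raise ReadOnlyError("mutating calls are rejected in read-only mode")
        self.volumes[volume_id] = "available"

    def describe_volume(self, volume_id):
        # Read-only call: served in either mode.
        return self.volumes.get(volume_id)


if __name__ == "__main__":
    store = MetadataStore()
    store.create_volume("vol-1")
    store.on_primary_lost(failed_cleanly=False)
    print(store.mode.value, store.describe_volume("vol-1"))
    try:
        store.create_volume("vol-2")
    except ReadOnlyError as err:
        print("create failed:", err)
    store.on_consistency_restored()
    store.create_volume("vol-2")
    print(store.mode.value, sorted(store.volumes))
```

In this model the outage window for mutable calls is simply the time spent in `Mode.READ_ONLY`, which matches the post's description of create and delete failing until the datastore returned to read-write mode.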
======
dlgeek
What I found most remarkable was how in-depth their backup systems were and
how many things had to go wrong for this to become an event.

This happened because they lost primary power AND had a generator fail AND had
a distribution breaker fail. I wonder how often any two of those happen
without us ever knowing about it...

~~~
Rantenki
The likelihood of these occurring in parallel is probably higher than you
would expect. Backup systems are difficult to test fully since you can only do
it via a full failover. Without running 100% of your systems via the backup
distribution breaker, for example, you cannot verify that it can handle the
load. You can only design for it. The full system was probably tested at
initial install, but obviously more power was being consumed by live systems
than by the original test. Entropy doesn't leave backup systems alone either,
so over time they can fail even without use, and it is a surprise to everyone
when the "working" backup systems fall over.
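
A minimal sketch of the kind of configuration check this implies: compare each backup breaker's trip threshold against the load that would land on it in a full failover today, rather than trusting the test done at install time. The `Breaker` type, its field names, the `margin` safety factor, and every number below are hypothetical and purely illustrative; none of it reflects Amazon's actual tooling or ratings.

```python
from dataclasses import dataclass


@dataclass
class Breaker:
    """Hypothetical record for one backup distribution breaker."""
    name: str
    trip_threshold_kw: float  # load at which the breaker opens
    failover_load_kw: float   # load it must carry if the primary path fails


def find_undersized(breakers, margin=0.2):
    """Return names of breakers that would open under full failover load.

    `margin` is an assumed safety factor: the trip threshold must exceed
    the expected failover load by at least this fraction.
    """
    return [
        b.name
        for b in breakers
        if b.trip_threshold_kw < b.failover_load_kw * (1 + margin)
    ]


if __name__ == "__main__":
    # Illustrative numbers only.
    fleet = [
        Breaker("backup-a", trip_threshold_kw=500.0, failover_load_kw=300.0),
        Breaker("backup-b", trip_threshold_kw=250.0, failover_load_kw=300.0),
    ]
    print(find_undersized(fleet))
```

A check like this only catches a mismatch between configuration and expected load. It does not replace a real failover test, since it cannot detect a breaker or generator that has degraded physically.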

------
mw642
Wow! A proper outage post-mortem that isn't babbling on about control rods. I
love it.

~~~
fish2000
I don't get the reference -- what was the infamously control-rod-centric post-
mortem?

~~~
cmwelsh
That would be Heroku's response a couple weeks ago:
<https://status.heroku.com/incidents/372>

------
gulfie
Well, that sucks. Reading between the lines a little:

0) Wire faults happen, stuff breaks, that isn't fixable. No problem.

0.5) A cable fault that can take out your redundant pair of substations...
means you do not have a redundant pair of substations. (sometimes you have to
go to war with the army you have, not the army you'd want).

1) Faulty generator fan? Was this generator periodically run with a dummy
load? (Okay, maybe a chicken fell in the fan or a weasel ate some control
wires or something...)

2) The power circuits were never tested after a breaker got
installed/reconfigured, and manual validation of the configuration failed to
find a discrepancy between the plan and the as-built system.

3) Some back-end parts of EBS are probably A/B pairs and subject to
split-brain issues.

This is why you always need to test what you run, and run what you test...

Kudos for the write up. I see lots of electrician overtime in the near future.

------
dholowiski
If you haven't learned your lesson yet then you deserve what you get. If you
need 100% availability the only way you can get close is multiple servers, in
multiple data centres, from multiple companies.

__You cannot count on somebody else to make sure your web site is up. It's
your job.__

