

Cascading errors caused AWS to go down - Eliseann
http://status.aws.amazon.com/rss/ec2-us-east-1.rss

======
eli
In case your browser doesn't speak RSS:

 _Service is operating normally: Root cause for June 14 Service Event June 16,
2012 3:15 AM

We would like to share some detail about the Amazon Elastic Compute Cloud
(EC2) service event last night when power was lost to some EC2 instances and
Amazon Elastic Block Store (EBS) volumes in a single Availability Zone in the
US East Region.

At approximately 8:44PM PDT, there was a cable fault in the high voltage
Utility power distribution system. Two Utility substations that feed the
impacted Availability Zone went offline, causing the entire Availability Zone
to fail over to generator power. All EC2 instances and EBS volumes
successfully transferred to back-up generator power. At 8:53PM PDT, one of the
generators overheated and powered off because of a defective cooling fan. At
this point, the EC2 instances and EBS volumes supported by this generator
failed over to their secondary back-up power (which is provided by a
completely separate power distribution circuit complete with additional
generator capacity). Unfortunately, one of the breakers on this particular
back-up power distribution circuit was incorrectly configured to open at too
low a power threshold and opened when the load transferred to this circuit.
After this circuit breaker opened at 8:57PM PDT, the affected instances and
volumes were left without primary, back-up, or secondary back-up power. Those
customers with affected instances or volumes that were running in multi-
Availability Zone configurations avoided meaningful disruption to their
applications; however, those affected who were only running in this
Availability Zone, had to wait until the power was restored to be fully
functional.

The generator fan was fixed and the generator was restarted at 10:19PM PDT.
Once power was restored, affected instances and volumes began to recover, with
the majority of instances recovering by 10:50PM PDT. For EBS volumes
(including boot volumes) that had inflight writes at the time of the power
loss, those volumes had the potential to be in an inconsistent state. Rather
than return those volumes in a potentially inconsistent state, EBS brings them
back online in an impaired state where all I/O on the volume is paused.
Customers can then verify the volume is consistent and resume using it. By
1:05AM PDT, over 99% of affected volumes had been returned to customers with a
state 'impaired' and paused I/O to the instance.

Separate from the impact to the instances and volumes, the EBS-related EC2 API
calls were impaired from 8:57PM PDT until 10:40PM PDT. Specifically, during
this time period, mutable EBS calls (e.g. create, delete) were failing. This
also affected the ability for customers to launch new EBS-backed EC2
instances. The EC2 and EBS APIs are implemented on multi-Availability Zone
replicated datastores. The EBS datastore is used to store metadata for
resources such as volumes and snapshots. One of the primary EBS datastores
lost power because of the event. The datastore that lost power did not fail
cleanly, leaving the system unable to flip the datastore to its replicas in
another Availability Zone. To protect against datastore corruption, the system
automatically flipped to read-only mode until power was restored to the
affected Availability Zone. Once power was restored, we were able to get back
into a consistent state and returned the datastore to read-write mode, which
enabled the mutable EBS calls to succeed. We will be implementing changes to
our replication to ensure that our datastores are not able to get into the
state that prevented rapid failover.

Utility power has since been restored and all instances and volumes are now
running with full power redundancy. We have also completed an audit of all our
back-up power distribution circuits. We found one additional breaker that
needed corrective action. We've now validated that all breakers worldwide are
properly configured, and are incorporating these configuration checks into our
regular testing and audit processes.

We sincerely apologize for the inconvenience to those who were impacted by the
event._
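
To make the "verify the volume is consistent and resume using it" step concrete, here is a minimal sketch, assuming a Linux instance with the impaired volume attached and an ext filesystem on it. The device path is a placeholder, and actually re-enabling I/O on the paused volume happens through the EC2 API, which isn't shown here.

    import subprocess

    # Placeholder device path for the re-attached EBS volume; substitute your own.
    DEVICE = "/dev/xvdf"

    def volume_is_consistent(device):
        """Run a read-only filesystem check (no repairs) and report the result."""
        # fsck -n: check without modifying anything; assumes an ext2/3/4 filesystem.
        result = subprocess.run(["fsck", "-n", device],
                                capture_output=True, text=True)
        print(result.stdout)
        # Exit code 0 means no errors were found; anything else needs attention.
        return result.returncode == 0

    if __name__ == "__main__":
        if volume_is_consistent(DEVICE):
            print("Volume looks consistent; re-enable I/O and remount.")
        else:
            print("Filesystem errors found; repair (fsck without -n) or restore from a snapshot.")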

~~~
namidark
Sounds like Amazon is doing something wrong; shouldn't it fail over to battery,
then generator?

~~~
ericabiz
The batteries at colocation facilities are only designed to hold power long
enough to transfer to the generator. They're also a huge single point of
failure. A better design is a flywheel that generates enough power to carry the
load until the generator comes up. But datacenters are often hit with these
generator failures (in my experience, once every year or so).

Amazon had a correct setup--but not great testing.

By the way, these are great questions to ask of your datacenter provider: Are
there two completely redundant power systems up to and including the PDUs and
generators? How often are those tested? How do I set up my servers properly so
that if one circuit/PDU/generator fails, I don't lose power?

There is a "right way" to do this--multiple power supplies in every server
connected to 2 PDUs connected to 2 different generators--but it's expensive,
and many/most low-end hosting providers won't set this up due to the cost.

(I ran a colocation/dedicated server company from 2001-2007.)

~~~
rdl
The rotational UPSes are the cause of the majority of 365 Main's downtime, and
in general, horrible and must be destroyed with prejudice.

They're a nice idea in principle (and were the best option back in the
mainframe era), but power electronics have gotten better faster than datacenter
companies have gotten at maintaining rotational equipment. They also weren't
widely deployed enough to have a great support system, and it was
firmware/software that caused most of their outages.

Dual line cord for network devices, and then STSes per floor area, probably
make the most sense. Basically no commodity hosting provider uses dual line
cord servers on A and B buses. I love having dual line cord for anything
"real" (including virtualization servers for core infrastructure), but when
you're selling servers for $50/mo, you can't.

(there's the Google approach to put a small UPS in each server, too...)

~~~
ericabiz
(Person you replied to here.) "The rotational UPSes are the cause of the
majority of 365 Main's downtime, and in general, horrible and must be
destroyed with prejudice."

No. Incorrect. There is a reason I 100% refused to move my hosting company
there. I'm not going to say anything else publicly, but it wasn't the hardware
that caused repeated outages there. (I moved my hosting company from San
Francisco to San Jose, and lived in the Bay Area for 10 years. Everyone in the
hosting industry in the Bay Area knew each other. I also hosted for years in
AboveNet SJC3, which had the same flywheel setup.)

Note: I hope at this point they've fixed the issue. I've been out of the
industry for a few years. I wish them the best.

~~~
rdl
Yes, I almost took half a floor of 365 Main back in 2003-2004, and didn't due
to their (at the time) tenuous financial situation and thus being
underresourced on everything. That and there being ~no carriers in the
building at the time. For SF colo, 200 Paul remains a superior choice,
although some floors have had problems, and it's a totally conventional
design.

But the hitech UPS was a weak link. When they sold all their facilities to
someone else (DRT), that fixed most of the other issues.

------
jrockway
The RSS link was quite amusing. My Chrome instance downloaded the RSS file
without displaying it. Then I clicked it to open, and it opened Firefox.
Firefox showed its file download box, suggesting I open the RSS with Google
Chrome.

Deadlock detected.

~~~
tedunangst
On Linux, right? Firefox, or whatever gnome/dbus/opendesktop/gtk fuckery it
uses, has all sorts of strange notions about file types. When I download a
tar.gz file, it saves a copy to /tmp, then launches a new instance of Firefox
with a file:// url, which opens a save file dialog.

~~~
dfc
Respectfully, your setup is broken, not Linux...

~~~
TazeTSchnitzel
Yup, similar file association nonsense can happen on Windows too (and it is a
right PITA to fix)

------
forgotusername
In my time at larger companies, DC power seems to be one of the weakest links
in the reliability chain. Even planned maintenance often goes wrong ("well we
started the generator test and the lights went out, that wasn't supposed to
happen. Sorry your racks are dead").

Usually the root cause appears simple - a dead fan, a breaker set to the wrong
threshold, an alarm that didn't trigger, an incorrect component picked during
the design phase, or whatever else gets the blame - things that, to a software
guy, it would seem good processes could mitigate.

Can any electrical engineers elaborate on why power networks fail (in my
experience at least) so frequently? I guess failure modes (e.g. lightning
strike) are hard to test, but surely an industry this old has techniques. Is
it perhaps a cost issue?

~~~
mrkurt
It's really incredibly complicated, and difficult to test fully. The bits of
Amazon's DC that failed seem like stuff normal testing should catch, but the
DC power failures I've dealt with in the past always had some really precise
sequence of events that caused some strange failure no one expected.

As an example, Equinix in Chicago failed back in like 2005. Everything went
really well, except for a crossover cable between generators that helped them
balance load; it failed because of a nick in its insulation. This caused some
wonky failure cycle between the generators that seemed straight out of Murphy's
playbook.

They started doing IR scans of those cables regularly as part of their
disaster prep. It's crazy how much power is moving around in these data
centers; in a lot of ways they're in thoroughly uncharted territory.

~~~
rdl
The even crazier thing is big industrial plants where they are using tens or
hundreds of MW and have much lower margins than datacenter companies, so they
run with dual grid (HV, sometimes like 132kV) feeds and no onsite redundancy.
As in, when the grids flicker, they lose $20mm of in-progress work.

~~~
bigiain
I'd guess that's because "tens or hundreds of MW" of on-site backup power
would be _ludicrously_ expensive to own/maintain, and the tradeoff against the
risk of both ends of their dual grid flickering at once and trashing the
current batch is less expensive. (or maybe the power supply glitches are
insurable against, or have contract penalty clauses with the power companies?)

------
jluxenberg
_"Those customers with affected instances or volumes that were running in
multi-Availability Zone configurations avoided meaningful disruption to their
applications"_

"Meaningful disruption" is a bit of a weasel word; Amazon's own EBS API was
down for almost two hours[1] despite being designed to use multiple AZs

[1] _"the EBS-related EC2 API calls were impaired from 8:57PM PDT until
10:40PM PDT ... The EC2 and EBS APIs are implemented on multi-Availability
Zone replicated datastores"_

Guess the moral of the story is, if you require high availability then you
must test your system in the face of an availability zone outage.
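
For what it's worth, even a crude drill helps. A minimal sketch (Python; the
endpoint URLs and zone names are made up) of pretending one AZ is dark and
checking the service still answers from the others:

    import urllib.request

    # Hypothetical endpoint inventory, grouped by Availability Zone.
    ENDPOINTS = {
        "us-east-1a": ["http://app-1a.example.com/health"],
        "us-east-1b": ["http://app-1b.example.com/health"],
        "us-east-1c": ["http://app-1c.example.com/health"],
    }

    def drill(dark_zone, timeout=5):
        """Pretend one AZ is down: skip its endpoints, check the rest still answer."""
        survivors = [url for zone, urls in ENDPOINTS.items()
                     if zone != dark_zone for url in urls]
        healthy = 0
        for url in survivors:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    healthy += resp.status == 200
            except OSError:
                pass  # endpoint unreachable counts as unhealthy
        print(f"With {dark_zone} dark: {healthy}/{len(survivors)} remaining endpoints healthy")
        return healthy > 0

    if __name__ == "__main__":
        for zone in ENDPOINTS:
            assert drill(zone), f"service would not survive losing {zone}"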

------
damian2000
I love this sentence: _Those customers with affected instances or volumes that
were running in multi-Availability Zone configurations avoided meaningful
disruption to their applications; however, those affected who were only
running in this Availability Zone, had to wait until the power was restored to
be fully functional._

Translation: If you have a redundant (multiple-AZ) installation, then you were
ok, if not then your server died.

------
jtchang
Data Center Operator:

We've lost our main power. No problem though we have a backup generator so we
are good!

... 5 minutes later ...

Uhh boss, our backup generator's fan crapped out. But no worries, we have a
secondary generator just for this kind of scenario!

...10 minutes later and lights go out...

"Well damn...looks like we configured the breaker wrong. This is not a good
day."

~~~
lutorm
Things could have been worse -- it could have been a nuclear power plant.
oh wait...

------
tysont
On the plus side, the level of transparency that AWS displays and the detail
that they provide seems above and beyond the call of duty. I find it
refreshing and I hope that other companies follow suit so that customers can
understand the details of operational issues, calibrate expectations
appropriately, and make informed decisions.

~~~
rdl
They're less transparent and responsive than most datacenter or network
providers -- it's just that most of those providers hide their outage
information behind an NDA, so only customer contacts get it, vs. making it
public.

~~~
smackfu
Yeah, a good datacenter will have SLAs around the root-cause analysis document
for any failures. Like a preliminary report within a day and a final report
within 7 days.

------
mleonhard
The title is incorrect. It should say something more like "Cascading failures
cause part of AWS to go down."

------
mleonhard
I'm running <https://www.rootredirect.com/> and <http://www.restbackup.com/>
in us-east-1, in multiple availability zones. Both sites remained up with no
problems.

~~~
aparadja
Does your rootredirect service actually attract paying customers? I'm
genuinely interested to know.

~~~
joelcollinsdc
Also, how do you do this without having to handle the customers' DNS lookups as
well?

------
moe
Can someone translate that to control rods and manifolds?

------
tzury
Seems like deploying across two _physical_ regions (or more) is the best and
only proven approach.

That could be within the global AWS, or even say, one cluster at AWS and the
other at RackSpace/Linode, etc.

~~~
robryan
Then you just need to worry about your application consistency with the
replication lag. No silver bullet I guess.
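
Agreed. A toy sketch of the "know your lag before you rely on the replica" part
(Python; the in-memory dicts stand in for a real primary/replica pair, and the
delay is just simulated):

    import random
    import threading
    import time

    # Toy stand-ins for a primary and an asynchronously replicated read replica.
    primary, replica = {}, {}

    def async_replicate(key, delay):
        """Copy a primary write to the replica after a delay (the replication lag)."""
        time.sleep(delay)
        replica[key] = primary[key]

    def measure_lag(probe_id):
        """Write a timestamped marker to the primary, time how long until the replica sees it."""
        key = f"lag-probe-{probe_id}"
        primary[key] = time.time()
        threading.Thread(target=async_replicate,
                         args=(key, random.uniform(0.05, 0.5))).start()
        while key not in replica:      # a real probe would poll the replica with a timeout
            time.sleep(0.01)
        return time.time() - replica[key]

    if __name__ == "__main__":
        lags = [measure_lag(i) for i in range(5)]
        # Before failing traffic over, check the observed lag is within the
        # staleness your application can actually tolerate.
        print(f"observed replication lag, max: {max(lags):.2f}s")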

------
drags
Did anyone else run into issues with ELB during the outage? We're multi-AZ and
could access unaffected instances directly without a problem, but the load
balancer kept claiming they were unhealthy.
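
No ELB-specific answer from me, but threshold-style health checking can keep an
instance marked unhealthy for a surprisingly long time after a brief blip. A toy
model (Python; the thresholds and probe results are made up, not ELB's actual
settings):

    # Toy model of threshold-based health checking: an instance is marked
    # unhealthy after N consecutive failed probes and healthy again only after
    # M consecutive successes.

    UNHEALTHY_THRESHOLD = 2    # consecutive failures before marking unhealthy
    HEALTHY_THRESHOLD = 10     # consecutive successes before marking healthy again

    def replay(probe_results):
        state, fails, oks = "healthy", 0, 0
        for i, ok in enumerate(probe_results):
            fails = 0 if ok else fails + 1
            oks = oks + 1 if ok else 0
            if state == "healthy" and fails >= UNHEALTHY_THRESHOLD:
                state, oks = "unhealthy", 0
            elif state == "unhealthy" and oks >= HEALTHY_THRESHOLD:
                state, fails = "healthy", 0
            print(f"probe {i:2d}: {'ok  ' if ok else 'fail'} -> {state}")

    if __name__ == "__main__":
        # Two failed probes, then the instance answers again: it still spends
        # ten probe intervals marked unhealthy before traffic comes back.
        replay([True] * 3 + [False] * 2 + [True] * 12)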

------
gosub
Could it be possible to manage power the same way Erlang manages processes?
Instead of 2 or 3 enormous backup power units, hundreds of small ones that come
in and out of use "fluently".
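
Back-of-envelope on why the many-small-units idea is appealing (Python; the
failure probability and capacity numbers are made up, and the big caveat is that
it assumes units fail independently, which shared fuel, switchgear, and firmware
make hard to achieve in practice):

    from math import comb

    def p_enough_capacity(units, needed, p_fail):
        """Probability that at least `needed` of `units` independent units survive."""
        p_ok = 1 - p_fail
        return sum(comb(units, k) * p_ok**k * p_fail**(units - k)
                   for k in range(needed, units + 1))

    if __name__ == "__main__":
        # Toy numbers: the site needs roughly 2/3 of nominal backup capacity,
        # and each unit independently has a 5% chance of not starting on demand.
        print(f"3 big units, need 2:       {p_enough_capacity(3, 2, 0.05):.6f}")
        print(f"300 small units, need 200: {p_enough_capacity(300, 200, 0.05):.6f}")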

------
anaheim
TL;DR:

Shit happens. Don't use AWS as your only platform, you _will_ get burned
sometime. Guaranteed, you will also get burned if you try to host and run your
own stuff. How competent you are determines which way you get burned less.

~~~
starship
Actually, starting right now, AWS is probably your best bet.

Old story about Chuck Yeager from the 1950's: one time shortly after take-off,
Yeager's aircraft suffered an engine failure, and he had to do an emergency
semi-crash landing. When he realized that a mechanic had put the wrong type of
fuel in the plane, he went looking for the guy. The mechanic profusely
apologized, said he would resign and never work in aviation again. Yeager
replied something along the lines of "Nonsense. In fact, I need someone to
refuel my plane right now, and I want you to be the one to fuel it. That's
because of all the guys here, I know you'll be the one guy who'll be sure to
do it right."

Probably apocryphal, but the point has merit.

~~~
Monkeyget
This is mentioned in 'How to Win Friends and Influence People', where the
anecdote is about Bob Hoover and jet fuel put in a WW2 plane. It is used as an
example that it is easy to criticize and complain, but that it takes character
to be understanding.

------
heretohelp
1 in a (million/billion/trillion) I guess.

That'll make for a great horror story to tell though.

~~~
kbutler
Isn't the moral of the story, "Check your backups"? There was a defective fan
in one generator (sounds like it was findable via a test run?) and a
misconfigured circuit breaker (sounds like it was findable by a test run).

Redundancy is only helpful if the redundant systems are actually functional.

~~~
olefoo
Having been affected several times by colocation facilities bouncing the power
during a test of the failover system, I can tell you that such tests are not
without risk. Yes, you should test redundant systems, but how often, at what
cost, and what risks are you willing to run while doing so?

It's a fact of life that when dealing with complex, tightly coupled systems
with multiple interactions between subsystems, you will routinely see accidents
caused by improbable combinations of failures.

~~~
spartango
I wonder if it's better to create an accidental outage during a scheduled
test, or to have an outage completely out of the blue. Obviously mitigation is
tricky even during a scheduled test, but perhaps it's plausible?

~~~
dkulchenko
With a scheduled test, you have the benefit of having the main power actually
working if the backup being tested comes crashing down; seems to me that
mitigation would be much quicker in a scheduled test than in a real outage.

~~~
excuse-me
But would you rather risk a once-in-10-years real power failure testing your
backups, or would you pull the plug once per year to test it yourself?

~~~
richardw
I suspect a monthly test - if communicated to customers as well - would drive
better customer behaviour, e.g. multi-AZ usage, automated validation of EBS,
adding extra machines in another AZ automatically. Maybe start with one or two
AZs with an opt-in from customers.

It's the same as the advice to routinely restore your live data from backups.
It's not a real backup until you've tested that you can recover from it.
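
A bare-bones version of that restore test (Python; the paths are placeholders,
and it naively compares against the live tree, so it only makes sense for data
that hasn't changed since the backup was taken):

    import hashlib
    import tarfile
    import tempfile
    from pathlib import Path

    def sha256(path):
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def verify_backup(backup_tar, live_dir):
        """Restore a tar backup into a scratch directory and compare file hashes
        against the live copies; the backup only counts once this passes."""
        with tempfile.TemporaryDirectory() as scratch:
            with tarfile.open(backup_tar) as tar:
                # Assumes the archive stores paths relative to live_dir.
                tar.extractall(scratch)
            for live_file in Path(live_dir).rglob("*"):
                if live_file.is_file():
                    restored = Path(scratch) / live_file.relative_to(live_dir)
                    if not restored.exists() or sha256(restored) != sha256(live_file):
                        return False
        return True

    if __name__ == "__main__":
        # Placeholder paths; point these at a real backup and the tree it covers.
        ok = verify_backup("/backups/site.tar.gz", "/srv/site")
        print("backup verified" if ok else "backup failed verification -- do not trust it")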

~~~
excuse-me
It's an interesting and well studied area of statistics for medical tests.

eg. If you have a test that is 99% accurate and a treatment that harms 1% of
the patients and you do this screening of a million people - how common does
the disease have to be before you cure more people than you kill ?
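
Running those numbers (Python; I'm reading "99% accurate" as 99% sensitivity and
99% specificity, everyone who tests positive gets treated, and the treatment
kills 1% of everyone treated -- other readings give different break-even points):

    POPULATION = 1_000_000
    SENSITIVITY = SPECIFICITY = 0.99
    HARM_RATE = 0.01

    def cured_vs_killed(prevalence):
        sick = POPULATION * prevalence
        healthy = POPULATION - sick
        true_pos = SENSITIVITY * sick               # sick people the test catches
        false_pos = (1 - SPECIFICITY) * healthy     # healthy people flagged anyway
        treated = true_pos + false_pos
        killed = HARM_RATE * treated
        cured = true_pos - HARM_RATE * true_pos     # harmed true positives aren't cured
        return cured, killed

    if __name__ == "__main__":
        for prevalence in (0.00001, 0.0001, 0.001, 0.01):
            cured, killed = cured_vs_killed(prevalence)
            print(f"prevalence {prevalence:.5f}: cured {cured:8.0f}, killed {killed:6.0f}")

With those assumptions the break-even lands somewhere around a prevalence of 1
in 10,000.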

