
Amazon AWS had a power failure, their backup generators failed - mcenedella
https://www.twitter.com/PragmaticAndy/status/1168916144121634818
======
skywhopper
This seems to be getting slightly overblown in that thread. To be clear, this
impacted one datacenter out of ten that make up one availability zone out of
six in AWS’s us-east-1 region. So at most 2-3% of that region’s capacity was
impacted.

I haven’t seen a report yet on exactly why their generator failed, but from
what I’ve heard, the power failed, and the backup generator kicked in and ran
fine for over an hour, but then it failed. This sort of thing sucks to deal
with, but it's also inevitable. Across their hundreds of datacenters, a
mechanical failure is going to happen occasionally no matter how good their
maintenance plans are.

So the key when using cloud services like AWS is to plan for the possibility
of failure. EBS expects an annual failure rate of 0.1%, so one out of a
thousand EBS volumes will fail in a given year. If you operate at the scale of
thousands of servers in AWS, you see this sort of thing all the time. Luckily,
EBS also makes it trivial to take volume snapshots, which are stored in S3
with much, much higher reliability and durability. So if you have data in
EBS that needs to be kept safe, take regular snapshots. Here’s a doc that
explains how you can set up scheduled, auto-rotated snapshots:
[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/snapshot-lifecycle.html](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/snapshot-lifecycle.html)
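
As a minimal sketch of the same idea in boto3 (the volume ID and retention
count below are placeholders; the Data Lifecycle Manager doc above does this
for you without any code):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    VOLUME_ID = "vol-0123456789abcdef0"  # placeholder volume ID
    RETAIN = 7  # keep the 7 most recent snapshots

    # Take a new snapshot; EBS stores the snapshot data in S3.
    ec2.create_snapshot(VolumeId=VOLUME_ID, Description="nightly backup")

    # Rotate: delete everything beyond the newest RETAIN snapshots.
    snaps = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": [VOLUME_ID]}],
    )["Snapshots"]
    snaps.sort(key=lambda s: s["StartTime"], reverse=True)
    for old in snaps[RETAIN:]:
        ec2.delete_snapshot(SnapshotId=old["SnapshotId"])

Run something like this from cron and you get scheduled, auto-rotated
snapshots.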

~~~
hamiltont
Don't disagree with your point that this is overblown, but here's an important
related point

"Your nines are not my nines" \-
[https://rachelbythebay.com/w/2019/07/15/giant/](https://rachelbythebay.com/w/2019/07/15/giant/)

~~~
mk89
If I recall correctly, this is nicely covered in the book "Release It!",
which BTW I recommend.

------
carlsborg
This conclusion "The cloud is just a .. blah blah blah" is weak: Amazon offers
isolated availability zones within each region to mitigate this risk at the
system level, it gives you the ability to take EBS snapshots that you can back
up to S3 (with 11 9s of durability), and it offers scaling features you just
will not find on "just another computer." And you are meant to architect for
this with multi-AZ designs.

It's managed infrastructure, not some miraculous alternative universe where
probabilities do not apply to you.

From the docs:

"Amazon EBS volumes are designed for an annual failure rate (AFR) of between
0.1% - 0.2% ..."

~~~
vectorEQ
The conclusion is weak but true. People do tend to forget the cloud is also a
bunch of computers, and they might fail. However, it's not an argument to
avoid the cloud entirely, and in that sense it's weak as an argument against
the cloud.

~~~
philwelch
Agreed. If the cloud is somebody else's computer, I'm glad to let it be a
computer that belongs to an extremely specialized and demanding company that
focuses on providing that computer to me as a service.

------
linsomniac
We had a similar problem at our hosting facility last winter during that "Ice
Vortex" storm that was all over the news. The facilities guys had been very
proud of never having had a power outage; I've had servers with them for 15
years now.

The morning of the worst of the storm, we completely lost access to all
services at that facility. Super unusual; everything we have there is
redundant. So I did some minor investigation and got on the horn to them.

They were being super cagey. "Hey, we lost all access to our systems." "Ok,
I'll open a ticket and we will investigate." "Uhhh. It feels like it's a big
problem with the data center, are you guys having problems or is it just us?"
"I can't say anything more until we've completed an investigation." "I'm
trying to decide if we need to start failing over to our DR site, or if I need
to put chains on the truck to drive 50 miles to the data center in this storm.
Is anyone else having problems? Are fire alarms going off?" "We have received
multiple reports of problems."

Power was back on in less than half an hour, but they still weren't saying
anything for a few hours. Spent that time trying to figure out if we should
shut everything down and ride it out, or if they were back in business. We had
one system that suffered disk corruption, despite having a (according to the
weekly testing) correctly operating BBU on the RAID.

So what happened? It shouldn't have been possible: our cabinet was being fed
by two lines from independent PDUs. Each PDU is fed by 2 independent UPSes
(one shared between the two PDUs), and each UPS is fed by a dedicated
generator. It should have required 3 failures to bring us down.

They eventually revealed that they had had one of their UPSes down for weeks,
waiting for replacement parts. The other two UPSes had independent failures
(one was a controller board, one was battery related). They said they still
did quarterly full-load tests of the power systems, but reading between the
lines, I think they weren't testing these two UPSes because the other one was
not there to back them up.

Still, one power event in 15 years isn't too shabby.

~~~
jgalentine007
I had a similar situation in the early 2000s at NTT/Verio in Sterling, VA.
They lost power because someone dug through a utility line and introduced a
ground fault into their system. They switched to generators and then, against
protocol (basically human error), tried to force-override a transfer switch
onto their other utility, which ended up killing the generators. Eventually
the UPSes were all drained, but we had to shut down servers long before that
because, with no AC, they were overheating. People were taking servers out by
pickup truck to stand them up at their other offices and data centers. 48
hours of pain.

~~~
macintux
Power seems like the big wildcard in data center management. So tough to
properly test your failover preparations, and so many different ways things
can go wrong.

I know of a large company that had the data center emergency cutoff button
next to the automatic doors on the way out. Sure enough, a contractor hit it
one day thinking it was the way to open the doors.

------
lovetocode
So dude is mad because he didn’t have a redundancy plan? You can take
snapshots of EBS volumes, which back everything up to S3. They even tell you
in the documentation that EBS volumes can fail. But blaming someone else is
easier, I guess...

~~~
sheeshkebab
AWS needs to add a button/option to EBS to have volumes be automatically
backed up by AWS itself. Without this, few will do it, or will even be aware
that it’s possible.

It doesn’t help that EBS backup takes forever, especially initially.

~~~
ReidZB
They did add that:
[https://aws.amazon.com/backup/](https://aws.amazon.com/backup/)
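
For what it's worth, a rough boto3 sketch of what a tag-based plan might look
like (the plan name, schedule, vault name, and role ARN are all placeholders;
the console flow needs no code at all):

    import boto3

    backup = boto3.client("backup")

    # A plan: snapshot daily, keep each recovery point for 35 days.
    plan = backup.create_backup_plan(BackupPlan={
        "BackupPlanName": "daily-ebs",
        "Rules": [{
            "RuleName": "daily",
            "TargetBackupVaultName": "Default",
            "ScheduleExpression": "cron(0 5 * * ? *)",  # 05:00 UTC daily
            "Lifecycle": {"DeleteAfterDays": 35},
        }],
    })

    # A selection: any resource tagged backup=true is covered by the plan.
    backup.create_backup_selection(
        BackupPlanId=plan["BackupPlanId"],
        BackupSelection={
            "SelectionName": "tagged-volumes",
            "IamRoleArn": "arn:aws:iam::123456789012:role/BackupRole",
            "ListOfTags": [{
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "backup",
                "ConditionValue": "true",
            }],
        },
    )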

------
empath75
Not only should you be architecting your app to survive an AZ going down, you
should be planning on an entire region going down, and maybe even an entire
cloud provider.

It’s annoying when Amazon has outages, but they have local outages all the
time, and they give you all the tools you need to handle them.

~~~
sgt101
Money, money, money, though: all that time and effort (and fees) doesn't help
you lowball the next contract, cross your fingers, and hope not to get caught
out!

------
darkcha0s
This just sounds like a badly architected solution, nothing else. The same
problems that can happen in your own datacenter can happen in the cloud; it's
just not your responsibility to fix it. If you lack that knowledge as an
architect, rethink your title.

------
jrace
I think the real complaint is:

>>Then it took them four days to figure this out and tell us about it.

~~~
stickydink
I must have missed something entirely; did this all happen before the weekend?
Where is the 4 days coming from? Amazon's RSS feed hit our Slack on Saturday
morning with an explanation that the power had gone out and the backup
generators failed.

Is there some post mortem that just came out?

>>We want to give you more information on progress at this point, and what we
know about the event. At 4:33 AM PDT one of 10 datacenters in one of the 6
Availability Zones in the US-EAST-1 Region saw a failure of utility power.
Backup generators came online immediately, but for reasons we are still
investigating, began quickly failing at around 6:00 AM PDT. This resulted in
7.5% of all instances in that Availability Zone failing by 6:10 AM PDT. Over
the last few hours we have recovered most...

------
nerdbaggy
I remember somebody on here writing about how, when they worked at AWS, they
wrote custom firmware for their generators to get max performance.

~~~
MichaelApproved
I couldn't find the HN post but I found an article that talks about firmware
mods they do.

[https://www.datacenterknowledge.com/archives/2017/04/07/how-amazon-prevents-data-center-outages-like-deltas-150m-meltdown](https://www.datacenterknowledge.com/archives/2017/04/07/how-amazon-prevents-data-center-outages-like-deltas-150m-meltdown)

> _The piece of technology Amazon designed to avoid this type of outage is the
> firmware that decides what electrical switchgear should do when a data
> center loses utility power. Typical vendor firmware prioritizes preventing
> damage to expensive backup generators over preventing a full data center
> outage, according to Hamilton. Amazon (and probably most other large-scale
> data center operators) prefers risking the loss of a sub-$1 million piece of
> equipment rather than risking widespread application downtime._

> _When everything happens as expected during a utility outage (which is the
> case most of the time), the switchgear waits a few seconds in case utility
> power comes back (also the most common scenario) and if it doesn’t, the
> switchgear fires up generators, while the data center runs on energy stored
> by UPS systems. Once the generators are stabilized, the switchgear makes
> them the primary source of power to the IT systems._

> _Last year’s Delta data center outage was attributed to switchgear “locking
> out” the generators at the airline’s facility in Atlanta. That’s what most
> switchgear is designed to do when it senses a major voltage anomaly either
> in the data center or on the incoming utility feed. Plugging a live
> generator into a shorted circuit will usually fry the generator, and
> switchgear locks generators out to avoid that._
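
Not AWS's actual firmware, obviously; just a toy sketch of the decision logic
the article describes, with made-up timings and a hypothetical switchgear
interface:

    import time

    GRACE_SECONDS = 5  # wait briefly in case utility power returns

    def on_utility_power_lost(switchgear):
        # UPS batteries carry the load while we decide what to do.
        time.sleep(GRACE_SECONDS)
        if switchgear.utility_power_ok():
            return  # most common case: a blip, stay on utility
        switchgear.start_generators()
        while not switchgear.generators_stable():
            time.sleep(1)  # still running on stored UPS energy here
        # The Amazon tweak per the article: prefer keeping the load up
        # over protecting the generators, so don't lock them out on a
        # voltage anomaly the way stock firmware would.
        switchgear.transfer_load_to_generators()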

~~~
generatorguy
Diesel generators at hospitals and diesel motors running pumps for fire
suppression systems are normally set up to keep running closer to the line of
risking damage to the generator and engine.

------
linsomniac
People tend to make a habit of treating AWS like a VPS provider, in my
experience. And you can skate by on this for a while, but it really isn't
designed for that. And that will, eventually, lead to pain and suffering.

Sometimes instances just go out to lunch. Sometimes an AZ goes down. Chaos
Monkey isn't just a good idea, it's required for reliable operation.

But please, please, if you are going to treat AWS like a VPS, at least don't
do it in us-east-1! It seems to have more outages.

Our setup related to instances and EBS includes: At least 2 instances in
different AZs, an ELB in front of them, a backup running at our hosting
facility (though this could just be a different AWS zone, or different
provider), and DNS with full-paper-path health checks that switch DNS over to
the colo servers if any component of the primary fails.
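
For the DNS piece, a rough boto3 sketch (the zone ID, hostnames, and IPs are
placeholders, and a record pointing at an ELB would normally be an alias
record rather than a plain A record):

    import boto3

    r53 = boto3.client("route53")

    # Health check that exercises the full path through the primary stack.
    hc = r53.create_health_check(
        CallerReference="primary-hc-1",
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "app.example.com",
            "ResourcePath": "/health",
            "Port": 443,
        },
    )

    def failover_record(set_id, role, ip, health_check_id=None):
        rec = {
            "Name": "app.example.com.",
            "Type": "A",
            "SetIdentifier": set_id,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            rec["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rec}

    # Primary serves while healthy; otherwise DNS fails over to the colo.
    r53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={"Changes": [
            failover_record("aws", "PRIMARY", "203.0.113.10",
                            hc["HealthCheck"]["Id"]),
            failover_record("colo", "SECONDARY", "198.51.100.20"),
        ]},
    )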

------
serkanh
Although this was not reported on the status dashboard, this also affected
ElastiCache. It was acknowledged by the rep on the phone, and the issue got
resolved via email. We weren't able to start/reboot any Redis instances in
us-east-1a, so we had to launch in us-east-1c.

------
therealx
Is there a reason it often seems like backup generators fail? Is it that we
don't hear the success stories of all the times they don't fail? Just due to
not being tested often?

~~~
toast0
We generally only hear about the failures, but it's also a tricky thing to
test. Simple setups won't put load on the generator during the periodic tests,
which can result in outages if the generator will start, but can't run the
load for whatever reason (ex: mechanical problems, or load size grew beyond
capacity). More complicated setups may be able to switch the load to the
generator, but not switch back to utility fast enough in case the generator
under test fails during the test. The transfer switches themselves are prone
to failures and hard to make redundant.

It seems that it's pretty hard to get this right from the beginning, too;
every large datacenter ends up learning this again after a sequence of power
incidents. That said, a successful switch to generator for 1 hour followed by
a generator failure is not a terrible outcome; if there was a notification,
that's enough time to evacuate critical systems (assuming you have a plan).

------
joncrane
Is there a source for this other than an angry Twitter user?

~~~
random_visitor
I came across this news article:
[https://www.theregister.co.uk/2019/09/04/aws_power_outage_da...](https://www.theregister.co.uk/2019/09/04/aws_power_outage_data_loss/)

The headline sounded so clickbaity that I ignored it before seeing this
thread.

~~~
hexteria
The Register often uses snarky or tongue-in-cheek headlines, I like it tbh

------
ksec
Forgive me for what may be a stupid question.

As far as I can tell, the number of mechanical/hardware failures is far
higher than software failures. And it is always power, UPS, generator, BBU,
RAID card failure, etc.

Why is it that we keep hearing about failures in this segment? And it doesn't
seem anything has been done. Is there any innovation happening in this space?

------
staticvar
This tweet might be in response to the AWS post-event summary from August 23,
2019:

> We’d like to give you some additional information about the service
> disruption that occurred in the Tokyo (AP-NORTHEAST-1) Region on August 23,
> 2019. Beginning at 12:36 PM JST, a small percentage of EC2 servers in a
> single Availability Zone in the Tokyo (AP-NORTHEAST-1) Region shut down due
> to overheating.

[https://aws.amazon.com/message/56489/](https://aws.amazon.com/message/56489/)

~~~
joncrane
But the guy's tweet specifically mentions Reston (Virginia). This is in the
vicinity of us-east-1.

By the way, I'm pretty sure none of the actual AWS datacenters are in Reston
proper. They are in Ashburn and other more sparse suburbs.

Source: I live in the DC area and regularly visit Reston and Herndon. There
are large AWS offices in Herndon but not so many datacenters. Real estate in
Reston is pretty expensive.

~~~
jsjohnst
> By the way, I'm pretty sure none of the actual AWS datacenters are in Reston
> proper.

Pretty sure you are right. The physical AWS data centers I know of in the
Reston area are:

4 DCs on Smith Switch Rd in Ashburn, VA

2 DCs (IAD54 and IAD67) elsewhere in Ashburn, VA

3 DCs on West Severn Way in Sterling, VA

3 DCs on Dulles Summit Ct in Sterling, VA

2 DCs on Prologis Dr in Sterling, VA

2 DCs on Relocation Dr in Sterling, VA

2 DCs (IAD69 and IAD76) elsewhere in Sterling, VA

1 DC in South Riding, VA

4 DCs on Westfax Drive in Chantilly, VA

3 DCs on Mason King Ct in Manassas, VA

2 DCs elsewhere in Manassas, VA

------
dgritsko
mLab was affected by this, and based on their own status page it looks like
some volumes are permanently unrecoverable:
[https://status.mlab.com/](https://status.mlab.com/)

~~~
ReidZB
I can confirm that. We had a single-AZ RDS instance whose underlying storage
(a "magnetic" EBS volume) unrecoverably failed, according to AWS. It had to be
restored from a backup. Fortunately, "point-in-time recovery" meant there was
very little data loss, just some downtime.

(Not that data loss or downtime mattered for this instance, which was just
used for internal testing.)
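
Roughly what the restore amounts to in boto3 (the identifiers and timestamp
below are placeholders, not our actual ones; RDS replays the backup plus
transaction logs up to the requested time):

    import boto3
    from datetime import datetime, timezone

    rds = boto3.client("rds", region_name="us-east-1")

    rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier="internal-testing-db",
        TargetDBInstanceIdentifier="internal-testing-db-restored",
        RestoreTime=datetime(2019, 8, 31, 10, 30, tzinfo=timezone.utc),
    )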

Power-up is very stressful for hard drives, so it's not too surprising that
some failed when the power turned back on. EBS does offer spinning rust
storage options, so maybe mLab was using those for some of those failed
volumes. I don't know if the same is true for SSDs or not.

------
xxxpupugo
EBS is persistent, right?

I mean, backup failure is not a surprise. Entropy is just everywhere.

------
durpleDrank
Is this North Virginia AGAIN???? How ironic that a company named "Amazon"
cannot keep its servers up whenever there is rain. This has happened
practically every time hurricane season appears. Obviously I'm being a bit
hard on them (for humor), but come on guys, get a giant umbrella or something.

------
the_70x
I wonder if they perform routine tests on their support infra: power, cooling,
et al

~~~
darkcha0s
No, they just let it ride and hope nothing breaks.

~~~
chasd00
I know you're being sarcastic, but you're probably more right than you
realize. Many, many "redundant" systems turn out to not be so redundant when
it counts.

Talk to some datacenter admins and you'll learn there's a lot more baling
wire and hope-for-the-best out there than you would think.

------
Polyisoprene
Scary that AWS can’t restore EBS volumes properly after a power failure.
Snapshots are not a solution to this in a live system.

~~~
Quarrel
If you lose power mid-write to an HDD, of course you can lose data.

This guy sounds like, if he'd self-hosted, he'd be complaining about an HDD
failure. It happens; you need to design around it. Luckily, EBS volumes,
snapshots, and AZs make all of this pretty straightforward.

~~~
Polyisoprene
Data, sure. Lose the volume, no.

~~~
cthalupa
In my 15 years of experience as a sysadmin and architect, hard drives are far
and away the most frequent hardware casualty of power failures.

The EBS documentation states that there's an expected AFR of 1 to 2 per
thousand volumes, so you should plan accordingly. Replicate to other sources
any data whose loss would harm your business. Keep backups.
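
Back-of-the-envelope, with a made-up fleet size:

    volumes = 5000                # say you run five thousand EBS volumes
    for afr in (0.001, 0.002):    # documented 0.1%-0.2% annual failure rate
        print(f"AFR {afr:.1%}: expect ~{volumes * afr:.0f} failures/year")
    # AFR 0.1%: expect ~5 failures/year
    # AFR 0.2%: expect ~10 failures/year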

------
notyourday
He is just having a temper tantrum. Everything can fail, and everything does
fail.

There was an outage a few years ago at a well known colo/IP/managed services
provider where the feeder line from the power company failed and the ATS
flipped to the backup power, which had a limited run time. Then, due to one of
those 1/1000 events that should never happen (because that ATS should flip
maybe 5-6 times a year), it fused in the new position. And it happened in a
place where the DC operator would cut off the service on the second line to
ensure they could safely work on removing the affected ATS. So the redundant
power lines plus backup power did not work.

If you happened to be in that specific area of the building and happened to
know the building engineers and data center engineers and power company
engineers, you would have heard what actually happened. Otherwise you just got
an "Imminent power failure" notification. Hopefully you knew that it meant you
should shut down all your workloads remotely and send someone you have on call
who could reach the data center in 10-15 minutes to physically disconnect your
PDUs from the incoming lines, just in case someone messed up while playing at
fixing the power, so you didn't blow 10-30% of your PDUs.

That's the reality of the life in a data center. So yeah, either accept that
stuff like this happens or build for stuff like this happening. Engineering
around physical problems in the cloud environment is far easier than in the
data center environment.

~~~
acdha
I’ve seen at least 3 variations of the problem you mention, where power
failover caused protracted downtime requiring rush delivery of niche
replacement hardware. (That last part is big: I’ve seen 8-figure enterprise
hardware spends down for a week because fixing them required flying someone
in, whereas AWS/Google has 24x7 staffing along with redundancy.)

Anyone thinking this doesn’t happen with private data centers is either very
green or selectively excusing problems.

