
AWS Service Disruption Post Mortem - teoruiz
http://aws.amazon.com/message/65648/
======
Smerity
> The nodes in an EBS cluster are connected to each other via two networks.
> The primary network is a high bandwidth network... The secondary network,
> the replication network, is a lower capacity network used as a back-up
> network... This network is not designed to handle all traffic from the
> primary network but rather provide highly-reliable connectivity between EBS
> nodes inside of an EBS cluster.

During maintenance, instead of traffic being shifted off of one of the redundant
routers, it was routed onto the lower-capacity network. Human error was
involved, but the network issue only provoked latent bugs in the system that
should have been caught during disaster-recovery testing.

Automatic recovery that isn't properly tested is a dangerous beast; it can
cause problems faster and more broadly than any team of humans is capable of
handling.

------
charper
Seems there is always this issue. System fails. Systems try to repair
themselves. Systems saturate _something_, which stops them from repairing.
Systems all loop aggressively, bringing it all down.

------
thebootstrapper
Reminds me again: distributed systems are hard, and the first fallacy of
distributed computing is "The network is reliable."

------
mcpherrinm
There's a quote I found interesting that hasn't been noted here yet:

"This required the time-consuming process of physically relocating excess
server capacity from across the US East Region and installing that capacity
into the degraded EBS cluster."

And if I read this description of the re-mirror storm correctly, I think that
implies Amazon had to increase the size of its EBS cluster in the affected
zone by 13%, which, considering the timeline, seems fairly impressive.

------
senthilnayagam
Before the incident, AWS was numero uno in terms of customer visibility and
had the image of a pathbreaking cloud service.

The lack of transparency in reaching out to customers was the biggest mistake
AWS made. They will learn from their mistakes; their servers and networks
will be more reliable than ever.

This incident has given people a reason to look at multi-cloud operation
capability, for disaster-recovery and backup reasons. AWS's monopoly will be
gone, and many new standards will be proposed to bring in interoperability
and allow migration between clouds.

------
rdl
I still don't see a good justification for keeping the EBS control plane
exposed to failure across multiple availability zones in a region. Until that
is fixed, I would not depend on AZs for real fault tolerance.

------
moe
Now _that's_ what I call a post mortem. Kudos to the author.

------
wanderr
I highly recommend that anyone who was surprised by this outage, or by the
description of the chain reaction of failures that led to it, read
Systemantics. It is a dry but amusing exploration of the seemingly universal
fact that every complex system is always operating in a state of failure, but
the complexity, failovers, and multiple layers can hide this, until the last
link in the chain finally breaks, usually with catastrophic results.

~~~
gruseom
_read Systemantics_

Oh yes. It's a classic that deserves to be much better known. Anybody engaged
with complex systems - such as software or software projects - will find all
kinds of suggestive things in there. As for "dry"... come now, it's hilarious
and has cartoons.

Basically, just get it. Here, I'll help:

[http://www.amazon.com/Systems-Bible-Beginners-Guide-Large/dp...](http://www.amazon.com/Systems-Bible-Beginners-Guide-Large/dp/0961825170)

(They ruined the title but it's the same book.)

------
assiotis
I find it surprising that they did not and do not plan to employ any sort of
interlocks/padded walls. What I mean is, if the system is exhibiting some very
abnormal state (e.g. #remirror_events above a fixed threshold, or more than x
standard deviations above average), then automated repair actions should
probably stop and the issue should be escalated to a human.
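
For illustration, a minimal sketch of such an interlock (the class name,
thresholds, and paging hook are all invented here, not anything AWS has
described):

```java
// Halt automated repair and page a human when the re-mirror rate looks
// statistically abnormal. Thresholds and names are illustrative only.
public final class RemirrorInterlock {
    private final double fixedThreshold; // absolute events/sec ceiling
    private final double maxSigma;       // allowed deviations above the mean
    private double mean = 0.0, m2 = 0.0; // Welford's running mean/variance
    private long n = 0;

    public RemirrorInterlock(double fixedThreshold, double maxSigma) {
        this.fixedThreshold = fixedThreshold;
        this.maxSigma = maxSigma;
    }

    /** Returns true if automated re-mirroring may proceed for this sample. */
    public synchronized boolean permit(double eventsPerSecond) {
        n++;
        double delta = eventsPerSecond - mean;
        mean += delta / n;
        m2 += delta * (eventsPerSecond - mean);
        double stddev = n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;

        boolean abnormal = eventsPerSecond > fixedThreshold
                || (n > 30 && stddev > 0
                    && eventsPerSecond - mean > maxSigma * stddev);
        if (abnormal) {
            System.err.printf(
                "PAGE: re-mirror rate %.1f/s (mean %.1f, stddev %.1f); automation halted%n",
                eventsPerSecond, mean, stddev);
            return false; // stop the automation and wait for an operator
        }
        return true;
    }
}
```

The statistics are the least important part; the point is that the automation
has a circuit breaker that fails toward human judgment instead of toward more
automated repair.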

~~~
neuroelectronic
They will probably do that now. They will probably also make sure they have a
powerful SOP for network upgrades as well.

------
johndbritton
"We will look to provide customers with better tools to create multi-AZ
applications that can support the loss of an entire Availability Zone without
impacting application availability. We know we need to help customers design
their application logic using common design patterns. In this event, some
customers were seriously impacted, and yet others had resources that were
impacted but saw nearly no impact on their applications."

------
leoc
Compare to the 2008 post-mortem:
<http://status.aws.amazon.com/s3-20080720.html> Messaging infrastructure as
single point of failure? Check. <http://news.ycombinator.com/item?id=2472227>

------
thebootstrapper
One of the main causes of the "re-mirroring storm" was nodes not backing off
when searching for a replica.

Here's Twitter's backoff decider implementation (Java):

[https://github.com/twitter/commons/blob/master/src/java/com/...](https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/util/BackoffDecider.java)

When I last looked at this I was a little clueless about it. Now I see its use.
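
For anyone else who found it opaque: the core idea is exponential backoff with
jitter. Here's a minimal sketch of the general technique (this is not
Twitter's actual BackoffDecider, which also tracks success/failure rates per
group):

```java
import java.util.Random;

// Each failed attempt doubles the ceiling on the wait, and the actual wait
// is drawn at random below that ceiling ("full jitter") so that thousands
// of nodes that failed together don't all retry together.
public final class ExponentialBackoff {
    private final long baseMillis;
    private final long maxMillis;
    private final Random random = new Random();
    private int failures = 0;

    public ExponentialBackoff(long baseMillis, long maxMillis) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Call after a failed attempt; returns how long to sleep before retrying. */
    public long nextDelayMillis() {
        long ceiling = Math.min(maxMillis, baseMillis << Math.min(failures, 20));
        failures++;
        return (long) (random.nextDouble() * ceiling);
    }

    /** Call after a success to reset the schedule. */
    public void reset() {
        failures = 0;
    }
}
```

The jitter matters as much as the exponent: without it, nodes that failed at
the same moment retry at the same moment, and the synchronized retry wave is
itself the storm.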

~~~
biot
Actual URL with Libya dependency removed:
[https://github.com/twitter/commons/blob/master/src/java/com/...](https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/util/BackoffDecider.java)

HN doesn't have a 140 character limit, so there's no need to post an
obfuscated shortened link.

~~~
thebootstrapper
My fault. Edited the link. Thanks.

------
pwzeus
I, for one, just want to say _claps_ to them for figuring this out and
nailing down a fix in just a few days. After reading this, it feels like an
issue at such a massive level could take a huge amount of time to fix.

------
mauricio
It's strange we haven't heard more from users of the 0.07% of EBS volumes that
were corrupted and unrecoverable during the outage. I had just assumed there
was no data loss as a result of the outage.

------
VladRussian
Interesting: several weeks ago someone (reddit?) had already hit problems
with EBS availability. Did Amazon pay attention and analyze the problem back
then? Or just let it pass?

------
mikiem
The whole thing is just too complicated to be highly available. There will be
more problems, but I wish them luck.

~~~
ra
That's a bit defeatist. In my first year of uni, one of our lecturers drummed
something into us:

 _Q. How do you eat an elephant?_

 _A. One bite at a time._

------
epi0Bauqu
They should also allow one-time moves of reserved instances between
availability zones.

~~~
amock
What would the purpose of that be?

~~~
epi0Bauqu
They want to help people make better use of multiple availability zones.
People may have reserved a bunch of instances, but would be better off
distributing those more effectively across zones.

~~~
amock
That makes sense. I wonder if something like the free realm transfers in World
of Warcraft would work here. Maybe the AZ mapping randomization keeps things
balanced, but encouraging multi-AZ deployments seems like a good idea.

------
AdamGibbins
I found this rather entertaining: <http://intraspirit.net/images/aws-explained.png>

~~~
fawxtin
I'm getting a 404; is this the same? (Amazon service disruption "explained" by
an employee)

[http://lgv.s3.amazonaws.com/AmazonFail_explainedByEmployee.j...](http://lgv.s3.amazonaws.com/AmazonFail_explainedByEmployee.jpg)

------
gojomo
I doubt this is the last time we'll hear of a "re-mirroring storm" in an
oversaturated cloud.

~~~
sliverstorm
Oversaturated? How do you figure? 13% were unable to re-mirror, which means
87% were able to. In short, nearly 40% of the 'cloud' was free space.

~~~
gojomo
The 're-mirroring storm' occurred when all free space was exhausted. At that
point, the EBS storage resources were oversaturated.

Then the 'EBS control plane' started to fail because 'slow API calls began to
back up and resulted in thread starvation'. At that point, the EBS processing
resources were oversaturated.

Then other nearby systems got wet.
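
To sketch that second stage (a toy illustration, not AWS's actual
architecture): one shared pool serves every API call, the slow calls tied to
the degraded cluster occupy all the workers, and fast calls for healthy zones
queue behind them.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadStarvationDemo {
    public static void main(String[] args) throws InterruptedException {
        // One pool shared by API calls for every availability zone.
        ExecutorService sharedApiPool = Executors.newFixedThreadPool(4);

        // Slow calls against the degraded cluster grab all four workers.
        for (int i = 0; i < 4; i++) {
            sharedApiPool.submit(() -> {
                try {
                    TimeUnit.SECONDS.sleep(60); // stuck waiting on the sick cluster
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // A healthy-zone call that should take milliseconds now waits a minute.
        long queuedAt = System.nanoTime();
        sharedApiPool.submit(() ->
            System.out.printf("healthy-zone call waited %d s in the queue%n",
                TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - queuedAt)));

        sharedApiPool.shutdown();
        sharedApiPool.awaitTermination(2, TimeUnit.MINUTES);
    }
}
```

That's how one zone's trouble surfaces as region-wide API latency: the
starvation lives in the shared control plane, not in the healthy zones.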

------
thehodge
An automatic 100% credit for 10 days' usage; that's pretty good IMO.

~~~
tomjen3
Well yes, except that that is usually peanuts compared to the lost income from
your service being down.

Really the only purpose of a SLA penalty is to incentivize the provider to
keep the network reliable.

~~~
com
I totally agree with your comment about the SLA penalty as an incentive to the
provider to take reasonable measures to ensure service.

But that's just in general.

When negotiating bespoke SLA penalty clauses, it can be very illuminating for
both sides to discuss lost profit + lost confidence + additional costs to the
customer and suggest that these be factored into the penalty clause.

My experience: both the customer and supplier tend to take a deep breath to
evaluate whether this deal is a good one for either of them and begin to
reassess their level of risk.

With an off-the-shelf service like Amazon, you as a customer are welcome to
suggest a change of penalty to your Amazon account manager, and unless you're
something like the US government, you will probably be directed to other cloud
providers or your own internal IT organisation!

~~~
cosmicray
> With an off-the-shelf service like Amazon, you as a customer are welcome to
> suggest a change of penalty to your Amazon account manager, and unless
> you're something like the US government, you will probably be directed to
> other cloud providers or your own internal IT organisation!

What that suggests to me is that the time has arrived for an external
organization, one that sells loss-of-business protection against such
failures, to become involved. Such an organization, should enough cloud
customers subscribe to it, would become an influence upon services like AWS.
I'm not sure I 'like' this idea, but the premise that a customer is using the
cloud service at the whim of whatever the provider decides is best practice
needs to be revisited.

------
mml
Did I read this correctly in paragraph 2: "For two periods during the first
day of the issue, the degraded EBS cluster affected the EBS APIs and caused
high error rates and latencies for EBS calls to these APIs across the entire
US East Region."

Their "control plane" network for the EBS clusters spans availability zones
in a region? If so, this would be the fatal flaw.

~~~
jsdalton
I may have read it incorrectly myself, but I interpreted this as meaning the
control plane was balanced across availability zones in order to provide
durability in the face of a failure of one of the zones. In other words,
Amazon ensures their control plane is operational at all times.

The API failures were ultimately tied to the network problems that occurred,
not to a failure of the control plane.

EDIT: I should finish reading before I reply. :) It would appear that the
network issue in the one availability zone was so severe that the control
plane ran out of threads to service API requests to _any_ of the availability
zones.

So while it's true the underlying problem was a network issue, the fact that
the control plane is spread across availability zones was responsible for the
part of the outage that occurred across the whole region.

My totally unqualified assessment of this aspect of the outage is that, while
it might make sense to have a control plane spread across availability zones,
they presumably need isolated control planes for each zone, instead of
the shared plane they seemingly have now.
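
One common way to get part of that isolation, short of fully separate control
planes, is the bulkhead pattern: give each zone its own bounded worker pool,
so stuck calls against one zone can exhaust only that zone's threads. A
hedged sketch (the zone names and pool sizes are invented):

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerZoneBulkhead {
    // Separate, bounded pool per availability zone.
    private final Map<String, ExecutorService> poolsByZone = Map.of(
        "us-east-1a", Executors.newFixedThreadPool(16),
        "us-east-1b", Executors.newFixedThreadPool(16),
        "us-east-1c", Executors.newFixedThreadPool(16));

    /** Run a control-plane request on the pool owned by its target zone. */
    public void submit(String zone, Runnable apiCall) {
        ExecutorService pool = poolsByZone.get(zone);
        if (pool == null) {
            throw new IllegalArgumentException("unknown zone: " + zone);
        }
        pool.submit(apiCall); // a sick us-east-1a can no longer starve 1b or 1c
    }
}
```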

~~~
evangineer
They seem to have settled on a halfway house, pushing more of the control
plane functionality down into the EBS clusters and making the remaining shared
control plane more robust to the sort of failures that arose this time.

------
bretthopper
I've been noticing a trend recently when reading about large-scale failures of
any system: it's never just one thing.

AWS EBS outage, Fukushima, Chernobyl, even the Great Chicago Fire (forgive me
for comparing AWS to those events).

Sure, there's always a "root" cause, but more importantly, it's the related
events that keep adding up that make the failure even worse. I can only
imagine how many minor failures happen worldwide on a daily basis where
there's only a root cause and no further chain of events.

Once a system is sufficiently complex, I'm not sure it's possible to make it
completely fault-tolerant. I'm starting to believe that there's always some
chain of events which would lead to a massive failure. And the more complex a
system is, the more "chains of failure" exist. It would also become
increasingly difficult to plan around failures.

edit: The Logic of Failure is recommended to anyone wanting to know more about
this subject: [http://www.amazon.com/Logic-Failure-Recognizing-Avoiding-Sit...](http://www.amazon.com/Logic-Failure-Recognizing-Avoiding-Situations/dp/0201479486)

~~~
akuchling
A similar point is made in Gene Weingarten's "Fatal Distraction"
(<http://www.pulitzer.org/works/2010-Feature-Writing>), which was about
parents who forget a child in the car. Excerpt: "[British psychologist James
Reason] likens the layers to slices of Swiss cheese, piled upon each other,
five or six deep. The holes represent small, potentially insignificant
weaknesses. Things will totally collapse only rarely, he says, but when they
do, it is by coincidence -- when all the holes happen to align so that there
is a breach through the entire system."

------
gord
This article reads like nonsense - but this is not a criticism of AWS.

The real problem is there is no good mathematical model of distributed
behaviour, from which statistical guarantees can be made.

I think we're at the limit of what the smartest people can achieve with
hand-crafted code.

Most likely new math will give rise to new tools and languages, in which the
next generation of reliable distributed systems will be written.

Without this advance we will have storage networks that aren't reliable, an
internet that can be taken down by one organization, botnets that are
unkillable and patchy network security.

~~~
brown9-2
I'm curious - what about this post-mortem "reads like nonsense"?

~~~
gord
All of it. The general approach is wrong - so the nonsense is the part where
you believe your current set of abstractions about distributed networks is
adequate.

Also the part where you reapply those same abstractions to fix the hole,
without realizing that the problem is you simply don't yet have tools capable
of building a robust system, despite believing otherwise.

If a day-long outage of this scale is not enough to make us rethink
distributed systems, what is?

------
hobbes
>...one of the standard steps is to shift traffic off of one of the redundant
routers in the primary EBS network to allow the upgrade to happen. The traffic
shift was executed incorrectly...

This supports the theory that between 50% and 80% of outages are caused by
human error, regardless of the resilience of the underlying infrastructure.

~~~
tomjen3
Which leaves a question: why not engineer around humans, such that they are
never needed in the day-to-day running of the systems?

~~~
henrikschroder
"The trigger for this event was a network configuration change. We will audit
our change process and increase the automation to prevent this mistake from
happening in the future."

~~~
tomjen3
Sure now, but I wanted to know why they didn't do that from the beginning.

~~~
michael_dorfman
Why they didn't do what? "Increase the automation"? I suspect that they _have_
been doing that from the beginning. It's an ongoing process.

I hate to quote Rumsfeld, but there are known unknowns, and unknown unknowns.
_Of course_ you want to eliminate the latter-- but there's (necessarily) no
way you can ever know that you've done so.

------
nodata
tl;dr version?

~~~
chrisboesing
An EBS node in an EBS cluster is connected to two networks. One is used for
traffic to and from the EBS volumes (the primary network); the other is used
to replicate the EBS volumes on one node to a different EBS node (the
secondary network). Amazon wanted to upgrade the capacity of the primary
network. Their standard procedure for doing this is to shift the traffic to a
redundant router. This step was executed incorrectly, which resulted in the
traffic being routed not to the primary network but instead to the secondary
network, which has less capacity. All this traffic saturated the secondary
network and resulted in the EBS volumes becoming "stuck". When the traffic got
routed the right way, all the EBS volumes tried to re-mirror. Part of the
re-mirroring process is that the EBS volumes search the cluster for free space
to re-mirror to. The EBS cluster couldn't handle this load, and new capacity
was needed for the cluster.
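
To make the storm mechanics concrete, here's an entirely hypothetical sketch
of what each stuck volume was effectively doing: a full-cluster scan for free
space, retried immediately on failure, with no backoff and no cap on
concurrent searches.

```java
import java.util.List;

// Hypothetical sketch of the re-mirror search loop; the StorageNode
// interface and its method names are invented for illustration.
public class RemirrorStorm {
    interface StorageNode {
        boolean hasFreeSpaceFor(long bytes);
        boolean reserveReplica(long bytes);
    }

    // What each of the thousands of stuck volumes was effectively doing:
    static void remirror(List<StorageNode> cluster, long volumeBytes) {
        while (true) {
            for (StorageNode node : cluster) {            // full-cluster scan
                if (node.hasFreeSpaceFor(volumeBytes)
                        && node.reserveReplica(volumeBytes)) {
                    return;                                // found a new mirror
                }
            }
            // No free space anywhere: the loop spins and scans again
            // immediately, adding search load to an exhausted cluster.
            // A fix would back off here, e.g.:
            // Thread.sleep(backoff.nextDelayMillis());
        }
    }
}
```

Multiply that loop by thousands of stuck volumes and the search traffic alone
can keep the cluster pinned even after the network is fixed, which is why new
physical capacity was the way out.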

Amazon is offering a 10-day credit equal to 100% of their usage of EBS
volumes, EC2 instances, and RDS database instances. The credit will be
automatically applied to the next bill.

------
nicpottier
tldr: ""The trigger for this event was a network configuration change. We will
audit our change process and increase the automation to prevent this mistake
from happening in the future."

AMZN has gotten a lot of flak over this outage, and rightly so. But I do want
to dissuade anyone from thinking anybody else could do much better. I worked
there 10 years ago, when they were closer to 200 engineers, and the caliber of
people there at that point was insane. By far the smartest bunch I've ever
worked with, and a place where I learned habits that serve me well to this
day.

I know the guys that started the AWS group, and they were the best of that
already insanely selective group. It is easy to be an armchair coach and
scream that the network changes should have been automated in the first place,
or that they should have predicted this storm, but that ignores just how
fantastically hard what they are doing is and how fantastically well it works
99(how many 9's now?)% of the time.

In short, take my word for it, the people working on this are smarter than you
and me, by an order of magnitude. There is no way you could do better, and it
is unlikely that if you are building anything that needs more than a handful
of servers you could build anything more reliable.

~~~
ekidd
Given a choice between hosting servers on AWS, and trying to build my own
reliable infrastructure with a single sysadmin, I'll take AWS in a heartbeat.
But I do want to quibble with one of your points:

 _It is easy to be an arm chair coach and scream that... they should have
predicted this storm_

I'm not as smart as the AWS developers, and I have a lot less experience with
large-scale distributed systems.

But thanks to my own cluelessness, I've blown up smaller distributed systems,
and I've learned one important lesson: Almost _nobody_ is smart enough to
understand automatic error-recovery code. Features like automated volume
remirroring or multi-AZ failover increase the load on an already stressed
system, and they often cause this kind of "storm."

So I've learned to distrust intelligence in these matters. If you want to
understand how your system reacts when things start going wrong, you have to
find a way to simulate (or cause) large-scale failures:

 _This is something that Google does really really well by the way, I've
watched them turn off 25 core routers simultaneously carrying hundreds of
gigabits worth of data, just to verify that what they think will happen, does
happen._ <http://news.ycombinator.com/item?id=2475112>

You also need to pay particular attention to components with substantial,
ongoing problems, and make sure you don't let known issues linger:

 _I work at Amazon EC2 and I can tell you what's going on (thanks to this
handy throwaway account). What's happening is the EBS team gets inundated with
support tickets due to their half-assed product. Here's the hilarious part:
whenever we've asked them why they don't fix the main issue, they keep telling
us that they're too busy with tickets. What they don't seem to realize is that
if they fixed the core issue, the tickets would go away._
[http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...](http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l7vy1)

Now, I'm not saying I could have done any better than Amazon (evidence
suggests otherwise). But I do know that I'm not smart enough to understand
these systems without testing them to destruction, and aggressively fixing the
root causes of known problems.

~~~
eddieplan9
_But thanks to my own cluelessness, I've blown up smaller distributed systems,
and I've learned one important lesson: Almost nobody is smart enough to
understand automatic error-recovery code. Features like automated volume
remirroring or multi-AZ failover increase the load on an already stressed
system, and they often cause this kind of "storm."_

It's basically Test-Driven Development: if you cannot test it, don't write it.

~~~
adpowers
It is hard to test emergent behavior in large distributed systems; you pretty
much have to run the tests live to see what is going to happen and whether it
aligns with your predictions.

