

Amazon EC2 outage: summary and lessons learned - sarahbacon
http://blog.rightscale.com/2011/04/25/amazon-ec2-outage-summary-and-lessons-learned/

======
AndyNemmity
I got the directive this weekend to transfer stuff off of the cloud
immediately. We lost a weekend of work for a lot of people, and nothing I can
say as a tech will account for that. Upper management wants us the hell off.

I'd argue that the overall cost is much less than having all of these services
in house, and in house services go down too.

But I think for the moment they want someone to yell at, and Amazon's
communication has been an unhelpful near-silence, with not even a rough ETA,
and that's unacceptable.

~~~
ssmoot
I think you'd be surprised at how much better/cheaper you could have done it
in house.

Given proper motivation (say 20c bonus on every dollar saved) I think we'd see
this argument vanish pretty quickly.

If you listen to the wrong people (sales guys from "Enterprise Grade"
vendors), or pinch too many pennies, it can easily be a disaster. It's
dangerous water to tread on your own, for sure.

~~~
alecco
Beware of wrong incentives. This is exactly how people are tempted to improve
the average case at the expense of the worst case. Dollars saved in the
short term can cause catastrophic problems later.

------
credo
Amazon has promoted RightScale in the past (and presumably the two continue to
have a close relationship). So it seems understandable that RightScale would
want to adopt a diplomatic tone.

However, imo an executive summary that starts with _"The Amazon cloud proved
itself in that sufficient resources were available world-wide such that many
well-prepared users could continue operating with relatively little downtime.
But because Amazon’s reliability has been incredible, many users were not
well-prepared leading to widespread outages. Additionally, some users got
caught by unforeseen failure modes rendering their failure plans ineffective."_
seems a little too supportive of Amazon.

~~~
RyanMcGreal
That's like saying, _The commuter rail service proved itself in that customers
who also owned cars were able to drive to work when the train stopped
running._

~~~
aaronblohowiak
No, it isn't. It is like saying the highway system proved itself when the 101
was closed because people could take 280 instead. If, for some reason, you had
only ever planned to take 101 and weren't ready to take an alternate route,
then yes, you got screwed, but that was kind of your own lack of planning for
this particular failure mode. (Stretched metaphor.)

~~~
neuroelectronic
The metaphor works if you pretend 101 is on the east coast and 280 is on the
west coast. :D

~~~
stingraycharles
The metaphors are still flawed, since it's both the route _and_ the
destination that changes, which makes things a lot more complex than just
taking a different route.

------
pumpmylemma
<http://ee.lbl.gov/papers/sync_94.pdf>

I posted this yesterday, with the conjecture that it may have been a sudden
sync problem. It's a good read.

~~~
tptacek
This is a great paper. If you haven't read it, it describes a common scenario
in which endemic network delays tend to nudge all participants in a periodic
broadcast protocol toward sending their broadcasts at the same time, so that
some hours after you start all the participants, everyone has synchronized and,
on every timer tick, saturates the network with updates.

The solution (I didn't reread so this is from memory) is to add random jitter
to each participant's timer.
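
Roughly, a minimal sketch of the idea (from memory; the names and numbers are
made up, not the paper's or Amazon's actual mechanism):

    import random
    import time

    PERIOD = 30.0   # nominal seconds between periodic broadcasts
    JITTER = 15.0   # per the paper, the spread needs to be a sizable
                    # fraction of the period, not just a few milliseconds

    def broadcast_status():
        pass  # placeholder: send the periodic status update here

    while True:
        broadcast_status()
        # Randomize each interval so independent senders can't drift
        # into lockstep and all flood the network on the same tick.
        time.sleep(PERIOD + random.uniform(-JITTER, JITTER))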

However, is there evidence to suggest that's what happened to Amazon? I can
see this being a big issue in '93, when high-latency, low-bandwidth links were
commonplace. But do we really think Amazon wasn't engineered well enough to
deal with multiple-orders-of-magnitude spikes in C&C traffic?

Thank you, though, for posting a (much needed) technical comment to this
discussion.

~~~
pumpmylemma
I don't think it was a symptom of routing synchronization specifically, but
I'd be curious to know whether it was a case of unexpected and undesired
synchronization. (E.g., an independent and random cluster of blocks suddenly
updated; the network became saturated; that pulled in more updates; ...)

And yes, the paper talked about randomization. It also pointed out that the
magnitude of randomization required was larger than expected.

~~~
pandakar
Has there been an official explanation?

~~~
pumpmylemma
As far as I'm aware, no. That's why RightAWS said they get an F for
communication.

------
capstone
For those of us waiting to learn what happened, the title is baity and
misleading. A more accurate headline would be: "Rightscale outage: some
speculation and customer service suggestions for Amazon".

 _At the time of writing Amazon has not yet posted a root cause analysis. I
will update this section when they do. Until then, I have to make some
educated guesses._

That pretty much sums it up. Well, that plus some contradictory lessons
learned, such as _The biggest problem was that more than one availability zone
was affected_, followed by _must have live replication across multiple
availability zones_.

~~~
power78
I agree. This is just another article of someone speculating about what went
wrong, possibly in order to get hits on his or her blog.

------
jsprinkles
The author suggests that service providers should make predictions, which is
exactly what status updates _aren't_ supposed to do.

Amazon's communication during this was on point. There's a line between _tell
me what's wrong_ and _fix what's wrong_, and all of the author's suggestions on
how to fix the "communication problem" are on the wrong side of that line.

~~~
tve
Mhh, this is weird. I didn't suggest that they tell me what to do, but that
they tell me about the derivative. From the status messages it appeared that
things were getting better, albeit slowly, when in fact they kept getting
worse and were affecting more machines 12 hours after the initial problem.

