

Applying 5 Whys To Amazon EC2 Outage - somic
http://www.somic.org/2012/07/05/applying-5-whys-to-amazon-ec2-outage/

======
burke
" _To me, this outage is the most worrisome of all AWS service disruptions
that I know about. In a nutshell:

AWS effectively lost its control plane for entire region as a result of a
failure within a single AZ. This was not supposed to be possible._"

I find it funny how we have this assumption that if we don't architect across
multiple AZs or regions we shouldn't be surprised when our service goes down
because of an AWS failure, but that if we do, we're "pretty safe" -- and then
Amazon itself experiences failure spanning AZs from a single-AZ failure.

~~~
Zombieball
Correct me if I am wrong, but I believe if your application was designed to
operate across multiple 'regions' your app would have indeed been safe from
this failure.

~~~
mokeefe
Sure, but then your app needs to communicate across the Internet if you share
data across regions. This can be expensive and/or slow and/or unreliable.
[http://aws.amazon.com/ec2/faqs/#how_will_I_be_charged_for_da...](http://aws.amazon.com/ec2/faqs/#how_will_I_be_charged_for_data_transfer)

~~~
count
Or not as much: <http://aws.amazon.com/directconnect/>

------
crazygringo
The first why is actually the what, and the last why is unanswered, so there
are only 3 whys... kind of disappointing based on the title :(

~~~
pbreit
There was 1 what and 5 whys. The last why being unanswered is the whole point
of the post.

------
ibejoeb
> this outage is the most worrisome of all AWS service disruptions that I know
> about

I don't think anyone was especially happy with it. I think AWS, as an entity,
is probably just as unhappy as its customers.

I was happy with their response, and I was happy with it during the outage
last year, too. They're adapting, and I believe they're constantly getting
better. It's still a pretty new thing, this utility computing service. You
can't reasonably expect them to expect the unexpected. I'm quite sure that
even if they didn't apply the "5 whys," or make them publicly available, they
are doing something to address the control plane issue.

I'm confident that the service will improve. Some things just need to be
battle hardened.

~~~
nowarninglabel
The problem I have with the whole thing is that if you look at their status
page, they still don't report having had any outage. The worst you will ever
see is a yellow triangle with a message about "connectivity issues". Amazon is
pathologically obsessed with denying that any outages occur, which is
understandable given their business model, but since they do actually have
outages, it makes them look scummy.

~~~
ibejoeb
I agree. I think there's a policy problem. If a problem is isolated to a
single AZ, you should see a hazard triangle. If the region is affected, it
needs to be classified as a service disruption.

The whole problem with the AZ thing is that they're geographically co-located.
Major weather events are pretty likely to mess you up. Remember the disk
latency spikes from that little earthquake?

It costs more, but the fact that they operate properties all over the world,
and that those properties work under a common paradigm (for most services), is
the truly compelling feature of AWS. I keep my major operation in the east and
fail over to the west if there's a significant event. It's a little more labor
intensive, but it works.
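The east-primary / west-standby pattern described above can be sketched roughly as below. The endpoint URLs and the consecutive-failure threshold are hypothetical placeholders; in practice this decision usually lives at the DNS layer (e.g. Route 53 health checks with failover routing) rather than in application code.

```python
# Hypothetical endpoints for an east-primary / west-standby setup.
PRIMARY = "https://app.us-east-1.example.com"  # main operation (east)
STANDBY = "https://app.us-west-2.example.com"  # failover target (west)

def choose_endpoint(recent_checks, threshold=3):
    """Return the endpoint traffic should go to.

    recent_checks: list of booleans, oldest first, where True means the
    primary passed a health check. We fail over only after `threshold`
    consecutive failures, so a single transient blip doesn't flap traffic
    between regions.
    """
    tail = recent_checks[-threshold:]
    if len(tail) == threshold and not any(tail):
        return STANDBY
    return PRIMARY
```

For example, `choose_endpoint([True, False, False])` still returns the primary, while three failures in a row trigger the failover.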

------
catshirt
_"From 8:04pm PDT to 9:10pm PDT, customers were not able to launch new EC2
instances, create EBS volumes, or attach volumes in any Availability Zone in
the US-East-1 Region."_

 _"The control planes for EC2 and EBS were significantly impacted by the power
failure in a single AZ."_

neither of these is a reason for the disruption; they're side effects of it.
not much "why" happening in the article altogether.

