
Amazon’s EC2 Service Suffers Outage - peter123
http://gigaom.com/2009/06/10/amazons-ec2-service-suffers-outage/
======
lrm242
IMO this is an example of why AWS is so valuable. Not only has Amazon designed
their cloud to allow for redundancy (availability zones), but when something
does fail they have smart people working on it immediately. Whenever I have
doubts about using AWS in a project I always remind myself that the cost of
the instance includes much more than the compute, storage, and bandwidth
resources. It's all the other stuff as well: smart people watching over my
bits, policies & procedures to ensure proper handling of failures, etc.

------
nethergoat
"Today’s incident also shows the fragility of the 'cloud' as it can be knocked
out a single lightening strike."

How is this different from any other data center? Note too that on EC2, your
instances will be spread randomly throughout the facility - you'd actually be
better off having this happen on EC2 vs. in a traditional DC. Furthermore,
this was a localized incident inside of a single availability zone - the other
four (now five, actually) AZs were completely unaffected.

The author clearly does not know what he is talking about.

I'm glad one commenter on the article had the sense to call him out. Too bad
he was followed by the usual set of trolls.

~~~
mdasen
From what I've heard, this is EC2's only outage in the past year. Amazon's SLA
guarantees 99.95% availability for a region (meaning that if you have 2
instances running in different availability zones, Amazon guarantees that at
least one of the instances will be up 99.95% of the time). However, with a 3
hour downtime being their only downtime (so far as I'm aware) in the past
year, they're hitting 99.95% uptime _within_ an availability zone.
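To make the arithmetic behind that claim concrete, here is a minimal sketch. It assumes a non-leap year of 8,760 hours and a single 3 hour outage as the only downtime; the function names are made up for illustration and are not part of any Amazon tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def uptime_percent(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Percentage of the period during which the service was up."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def allowed_downtime_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime in the period that still meets the given SLA."""
    return period_hours * (1.0 - sla_percent / 100.0)


observed = uptime_percent(3)            # one 3 hour outage in a year
budget = allowed_downtime_hours(99.95)  # downtime a 99.95% SLA tolerates

print(f"observed uptime: {observed:.3f}%")
print(f"downtime allowed at 99.95%: {budget:.2f} hours")
```

Under those assumptions a 3 hour outage works out to roughly 99.966% uptime, and a 99.95% target leaves about 4.38 hours of downtime per year, so a single zone with one such outage would still be inside the figure the SLA states for a whole region.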

It should be noted that the 365 Main datacenter that hosts (hosted?)
Craigslist, Six Apart, Technorati, and a number of high profile web companies
had a total failure a while back
(http://radar.oreilly.com/archives/2007/07/365-main-datace.html). Similarly,
Rackspace (whose business is not having to worry about your servers) had 3
outages in 2 days
(http://gawker.com/tech/followup/rackspace-outage-was-third-in-two-days-321909.php).

So, Amazon's 3 hour outage that affected a minority of their customers doesn't
seem that outrageous. You're totally right. The author was probably writing a
bit of link-bait to get read. No one reads "Amazon EC2 has minor outage that
compares favorably against their competitors and affected a minority of their
customers." That's just business-as-usual talk. However, if you question the
entire viability of cloud computing because of a 3 hour outage, well, that
deserves reading.

------
dylanz
Lightning storm FTL.

Once power was back, however, so were our instances, which was a pleasant
surprise.

------
ShabbyDoo
So, let me make sure I understand... If I had deployed my app on EC2 in two or
more availability zones AND had tested it to ensure that it would continue to
work if one zone became unavailable, then my users wouldn't have noticed,
right?

If this is the case, I'd interpret this outage as a testament to how good AWS
is!
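As a rough illustration of the setup described here, the sketch below routes traffic to the first zone that is still up. It is a toy model, not how EC2 or any load balancer actually does this: `pick_zone`, the health map, and the zone names are all hypothetical, and real health checking is assumed to happen elsewhere.

```python
from typing import Dict, List, Optional


def pick_zone(zone_is_up: Dict[str, bool], preference: List[str]) -> Optional[str]:
    """Return the first zone in preference order that is currently up, or None."""
    for zone in preference:
        # A zone missing from the health map is treated as down.
        if zone_is_up.get(zone, False):
            return zone
    return None


# Hypothetical deployment: the same app runs in two zones.
preference = ["us-east-1a", "us-east-1b"]

normal = pick_zone({"us-east-1a": True, "us-east-1b": True}, preference)
during_outage = pick_zone({"us-east-1a": False, "us-east-1b": True}, preference)
both_down = pick_zone({"us-east-1a": False, "us-east-1b": False}, preference)
```

With one zone down, requests fall through to the second zone, which is the "users wouldn't have noticed" case. Only when every zone is down does `pick_zone` return `None`.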

------
timf
It is nice that they have availability zones so you can get around these kinds
of problems if you have the money. Their outage report confirms the problem
was isolated to one zone.

The report also states it was a problem isolated to a group of racks, a
failing power distribution unit which I take it was on the "wrong side" of the
UPS. A lightning storm affected something on the "inside" of the UPS, but
nothing else in the datacenter? I'm curious how that would even happen (and
shouldn't a lightning rod be in use?).

Also, it seems like Google's per-motherboard UPS system would have been the
setup to have in order to avoid this problem in the first place.

------
tybris
If you're building a mission-critical service, always make sure you have
servers in multiple availability zones (regardless of whether you use EC2). If
your service is too small to rent multiple servers, consider a shared
approach like Azure or GAE.

------
cmer
Does anyone know which availability zone was affected?

~~~
mdasen
Short answer: no.

Long answer: there's no way to know. The names that Amazon gives a specific
location aren't something you can rely on. So, your us-east-1a might be a
different availability zone from my us-east-1a. I'm guessing Amazon did this
to keep everyone from "going with the default" and launching in us-east-1a,
which would leave far more demand for instances there than in their other
availability zones. But it does mean that there is no way of identifying by
name which availability zone went down, since the only names we have identify
different places depending on our account.
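One way to picture a per-account naming scheme like this is a shuffle seeded by the account ID. This is purely a guess at the idea, not Amazon's actual mechanism: the physical zone labels, the hashing step, and `zone_mapping` are all invented for the example.

```python
import hashlib
import random
from typing import Dict

# Hypothetical physical locations; the real ones are not public.
PHYSICAL_ZONES = ["zone-A", "zone-B", "zone-C", "zone-D"]
# The names a customer sees in their own account.
ZONE_NAMES = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]


def zone_mapping(account_id: str) -> Dict[str, str]:
    """Map customer-visible zone names to physical zones for one account."""
    # Derive a stable seed from the account ID so the mapping never changes.
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    shuffled = PHYSICAL_ZONES[:]
    rng.shuffle(shuffled)
    return dict(zip(ZONE_NAMES, shuffled))


for account in ("account-1", "account-2"):
    print(account, zone_mapping(account))
```

Each account gets a stable permutation of its own, so "us-east-1a" in two accounts may or may not point at the same place, and a zone name alone cannot identify which physical zone lost power.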

