
AWS is down, but here's why the sky is falling - justinsb
http://justinsb.posterous.com/aws-down-why-the-sky-is-falling
======
akashs
Amazon makes it pretty clear that Availability Zones within the same Region
can fail simultaneously. In fact, according to the SLA, a Region being down is
defined as multiple AZs within that Region being down. And since the 99.95%
promise applies to Regions and not to individual AZs, multiple AZs within the
same Region going down together will be fairly common.

Edit: One more point. In the SLA, you'll find the following: _“Region
Unavailable” and “Region Unavailability” means that more than one Availability
Zone in which you are running an instance, within the same Region, is
“Unavailable” to you._ What it implies is that if you do not spread across
multiple Availability Zones, you should expect less than 99.95% uptime. So
spreading across AZs should still reduce your downtime, just not beyond that
99.95%.

<http://aws.amazon.com/ec2-sla/>

~~~
justinsb
I have to disagree with you. The SLA is just a legal agreement that really
serves to limit AWS's liability. Here's what the main EC2 page says:

"Availability Zones are distinct locations that are engineered to be insulated
from failures in other Availability Zones and provide inexpensive, low latency
network connectivity to other Availability Zones in the same Region. By
launching instances in separate Availability Zones, you can protect your
applications from failure of a single location."

<http://aws.amazon.com/ec2/>

That's the spec that everyone was building to, but it isn't what happened. Of
course you're right that multiple AZs can fail at the same time, but I read
the above as saying that they should fail independently, only coinciding by
chance (until the entire Region fails).

~~~
bphogan
We always, always use the SLA offered by a vendor as the basis for our
information. We trust it more than any marketing page, sales pitch, tech
support FAQ, or anything else. That's what they'll hide behind, so that's what
I'll have in mind when I design my setup.

~~~
justinsb
I think it's great to check the SLA. However, there's enough wiggle room in
the AWS SLA that I think this outage could continue for the rest of the month,
and Amazon would still not owe a penny. I don't even know that the SLA covers
this outage, because network connectivity isn't affected.

Even if Amazon breach their SLA, I think they only have to credit 10% of one
month's bill per year - i.e. less than a 1% discount on the annual bill. I
suspect they'd make a good profit even if they paid out a full 10% refund
every month.

Unless an SLA is accelerated - i.e. >100% refund - I don't think it's worth
taking particularly seriously.

Of course if an SLA only guarantees 95% uptime, that's probably a big hint to
design for failure!

~~~
bphogan
Yeah, but I don't care about getting my money back as much as I care about how
much downtime they're willing to put their name to.

It's like the hard disk maker that gives you a 1 year warranty vs a 5 year
warranty... which one believes in their product more? :)

~~~
justinsb
It's a good analogy and I certainly accept your point. It could just be a
marketing thing though:

Suppose it's the same hard disk with a black sticker instead of a blue
sticker. Drive with a 1-year warranty @ $100, 5-year warranty @ $150, 20%
additional failure rate over the extra 4 years, 50% redemption rate on failed
drives. Expected replacement cost = 20% * 50% * ($100 + $30 processing costs)
= $13, leaving $37 profit on the extra $50.

Totally fictitious numbers to try to prove my point, of course :-) But as the
SLA becomes increasingly low in value, the signalling value decreases in my
book.

(Edit - fixed my math!)
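
For anyone who wants to poke at those toy numbers, here's the same back-of-
the-envelope calculation as a tiny Python script (all figures are the
fictitious ones above):

    # Back-of-the-envelope check of the (entirely fictitious) numbers above.
    base_price = 100.0           # drive with a 1-year warranty
    premium_price = 150.0        # same drive with a 5-year warranty
    extra_failure_rate = 0.20    # additional failures over the extra 4 years
    redemption_rate = 0.50       # fraction of failed drives actually returned
    replacement_cost = base_price + 30.0   # replacement drive + processing

    expected_cost = extra_failure_rate * redemption_rate * replacement_cost
    extra_profit = (premium_price - base_price) - expected_cost
    print("expected warranty cost per drive: $%.2f" % expected_cost)  # $13.00
    print("extra profit on the 5-year SKU:   $%.2f" % extra_profit)   # $37.00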

------
mdasen
Amazon has probably correctly designed core infrastructure so that these
things shouldn't happen if you're in multiple Availability Zones. I'm guessing
that means different power sources, backup generators, network hookups, etc.
for the different Availability Zones. However, there's also the issue of
Amazon's management software. In this case, it seems that some network issues
triggered a huge reorganization of their EBS storage, which would mean moving
a lot of stored data over the network, bringing many more EBS hosts online,
and a classic stampede problem.
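
To be clear about that failure mode (a generic sketch, not Amazon's actual
internals): when thousands of nodes lose their mirrors and all re-mirror and
retry at once, the usual mitigation is randomized exponential backoff, along
these lines:

    import random
    import time

    def retry_with_backoff(operation, max_attempts=8, base=0.5, cap=60.0):
        """Generic stampede mitigation: exponential backoff with full jitter."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Sleep a random amount up to an exponentially growing cap so
                # that thousands of clients don't all retry in lock-step.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))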

I've argued vigorously (in previous comments) for using cloud servers like
EC2 over dedicated hosting like SoftLayer. I'm less sure about that now. The
issue is that EC2 is still beholden to the traditional points of failure
(power, cooling, network issues). However, EC2 has the additional problem of
Amazon's management software. I don't want to sound too down on Amazon's
ability to make good software. However, Amazon's status site shows that EBS
and EC2 also had issues on March 17th for about 2.5 hours each (at different
times). Reddit has also just been experiencing trouble on EC2/EBS. I don't
want this to sound like "Amazon is unreliable", but it does seem more
hiccup-y.

The question I'm left with is what one actually gains from the management
layer Amazon introduces. Well, you can launch a new box in minutes rather than
a couple of hours; you can dynamically expand a storage volume rather than
dealing with the size of physical disks; you can template a server so that you
don't have to set it up from scratch when you want a new one. But if you're a
site with 5 boxes, does that give you much help? SoftLayer's pricing is
competitive against EC2's 1-year reserved instances, and SoftLayer throws in
several TB of bandwidth and persistent storage. Even if you have to over-buy
on storage because you can't just dynamically expand volumes, it's still
competitively priced. And if you're only running 5 boxes - say 3 app servers
and a replicated database over two boxes - the server templates are of hardly
any help.
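
For context, "launch a new box in minutes" really does boil down to one API
call plus boot time; a minimal sketch using the boto3 SDK (the AMI ID and
instance type below are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch one instance from a pre-built server template (an AMI).
    # The AMI ID and instance type are placeholders.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])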

I'm still a huge fan of S3. Building a replicated storage system is a pain
when you need to store huge volumes of assets. Likewise, if you need 50 boxes
for 24 hours at a time, EC2 is awesome. I'm less smitten with it for general
purpose web app hosting where the fancy footwork done to make it possible to
launch 100 boxes for a short time doesn't really help you if you're looking to
just have 5 instances keep running all the time.

Maybe it's just bad timing that I suggested we look at Amazon's new live
streaming and a day later EC2 is suffering a half-day outage.

~~~
dasil003
What about hardware failure? On AWS you just commission a new instance and
your downtime is minutes rather than hours, plus you don't have to keep extra
hardware on hand just to avoid downtime of days. There are also smaller more
localized issues like network switch failure and other things that you
probably never even notice on Amazon, but might be more likely to bite you on
a dedicated host.

If an AWS data center goes down it gets a lot of press, but does it actually
outweigh the sum of all dedicated/shared/vps hosting issues on the equivalent
volume?

~~~
gtuhl
There are some nice middle options out there. I'll use Softlayer as an example
as I have provisioned a lot of machinery over there.

I can order machines online and SSH in 3-4 hours later. Even exotic stuff they
turn around just as fast - we saw that speed on a quad octo-core box with a
RAID 10 array of Intel SSDs.

That's real metal too, with real IO (most of my work is IO-bound, so VMs and
the cloud are not options). You get to pick the exact CPUs, disks, etc., and
they slot them into solid Super Micro boards with good Adaptec disk
controllers. You pay monthly and can spin down a box at any time (though you
must pay for full months; there's no hourly pricing like AWS).

That's on the dedicated hardware side; you can also spin up compute instances,
and those can be cloned and fired up in bulk. But they also have the IO
problems that all other VMs have.

In any case, just wanted to mention they are a decent middle ground. Not as
automated and polished as Amazon on the VM side but you can spin up mixtures
of metal and VMs to get combinations that make sense - pushing compute or RAM-
only stuff to VMs and keeping DBs and persistence layers on real metal. They
have a few different datacenters too so you can spread gear around physical
locations.

------
justinsb
A quick tl;dr: Availability Zones within a Region are supposed to fail
independently (until the entire Region fails catastrophically). Any site
designed to that 'contract' was broken by this morning's incident, because
multiple AZs failed simultaneously.

I've seen a lot of misinformation about this, with people suggesting that the
sites (reddit/foursquare/heroku/quora) are to blame. I believe that the sites
were designed to AWS's contract/specs, and AWS broke that contract.

~~~
js2
The contract to which you refer is entirely inferred, is it not? Amazon claims
the AZs should be independent[1]:

 _Each availability zone runs on its own physically distinct, independent
infrastructure, and is engineered to be highly reliable. Common points of
failures like generators and cooling equipment are not shared across
Availability Zones. Additionally, they are physically separate, such that even
extremely uncommon disasters such as fires, tornados or flooding would only
affect a single Availability Zone._

Yet what Amazon guarantees, by way of their SLA, is only 99.95% for a
region[2,3]:

 _The Amazon EC2 SLA guarantees 99.95% availability of the service within a
Region over a trailing 365 day period._

[1]
[http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availabilit...](http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availability_Zones_from_one_another)

[2]
[http://aws.amazon.com/ec2/faqs/#What_does_your_Amazon_EC2_Se...](http://aws.amazon.com/ec2/faqs/#What_does_your_Amazon_EC2_Service_Level_Agreement_guarantee)

[3] Of course, they're not even meeting that right now. :-(
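
For a sense of scale (my arithmetic, not anything Amazon publishes), those
availability figures translate into surprisingly small downtime budgets over a
trailing 365 days:

    # Downtime allowed per year at various availability targets.
    HOURS_PER_YEAR = 365 * 24

    for availability in (0.9995, 0.999, 0.99):
        allowed = (1 - availability) * HOURS_PER_YEAR
        print("%.2f%% uptime allows ~%.1f hours of downtime per year"
              % (availability * 100, allowed))

    # 99.95% -> ~4.4 hours/year
    # 99.90% -> ~8.8 hours/year
    # 99.00% -> ~87.6 hours/year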

~~~
justinsb
Ah - sorry! I don't mean a legal contract, I mean more of a technical
contract. e.g. "I won't pass a null pointer" style contract.

In fact, the first bit you quoted provides an even stricter technical contract
than the one on the main EC2 page - it states some degree of natural disaster
tolerance, heavily suggesting separate datacenters (not just different
floors). Thanks for pointing that out.

Whatever the common point of failure turns out to be, it does seem to have
been shared across AZs, in violation of their FAQ.

------
jpdoctor
Every time someone bitched at me for not having a "cloud-based strategy", I
kept asking how many 9s of reliability they thought the cloud would deliver.

We're down to 3 nines so far. A few more hours to 2 nines.

The cloud is not for all businesses.

~~~
rufo
"The cloud", as I understand it, is the ubiquitous, cheap and near-
instantaneous availability of computing power; as in minutes instead of hours
or days for new servers.

"The cloud" is not (and never has been) a cure-all for reliability issues.
It's just as easy to have single points of failure as any other hosting
strategy, and is just as easy (or difficult) to plan for. Companies that have
planned for high availability with multi-region or multi-provider strategies
will continue to be available, regardless of whether or not they are using
"the cloud".

~~~
jpdoctor
> near-instantaneous availability

That implies something about reliability. The downtime today is real data
about that availability.

~~~
rufo
That's an issue with one service in one region offered by Amazon Web Services,
not "the cloud" as a concept.

Use this as an example of the reliability of EBS (or if you want to broaden
the scope, Amazon Web Services) all you want, but this says nothing about "the
cloud" as a concept.

------
risotto
These outages are very rough. Clearly a lot of the Internet is building out on
AWS, and not using multiple zones correctly in the first place. But AWS can
have multi-zone problems too, as we see here. Nobody is perfect.

But what people forget is: AWS has a world-class team of engineers, first
fixing the problem, and second making sure it never happens again. Same with
Heroku, EngineYard, etc.

Host stuff on dedicated boxes racked up somewhere and you will not go down
with everyone else. But my dedicated boxes on ServerBeach go down for the same
reasons: hard drive failure, power outages, hurricanes, etc. And I don't have
anyone to help me bring them back up, nor the interest or capacity to build
out redundant services myself.

My Heroku apps are down, but I can rest easy knowing that they will bring them
back up without any action on my part.

The cloud might not be perfect but the baseline is already very good and
should only get better. All without you changing your business applications.
Economy of scale is what the cloud is about.

~~~
ANH
_The cloud might not be perfect but the baseline is already very good and
should only get better._

Do we have reason to believe that it will only get better? I think it's
possible the complexity of the systems we are building and the traffic they
encounter will outpace our ability to manage them. Not saying I think it's the
most likely outcome, but I don't feel as confident as you.

~~~
risotto
Food for thought for sure. True, nothing can get better forever...

But do we believe in "economy of scale" for computer and Internet systems in
this age? Google, Amazon, Facebook, etc. have already proven to me that they
have enough human and financial capital to architect and run systems that show
economies of scale.

It's a bit scary to think about what it will mean when this runs out, but for
now I personally feel confident that things are getting much better, and will
continue to do so.

------
weswinham
I'd say your choice between Quora's engineers being incompetent and AWS being
dishonest/incompetent is a completely false dichotomy. Anyone who has been
around AWS (or basically any technology) will agree that the things that can
really hurt you are not always the things you considered in your design. I
just can't believe that many of the people who grok the cloud were running
production sites under the assumption that there was no cross-AZ risk. The AZs
share the same API endpoints, auth, etc., so it's obvious they're integrated
at some level.

Perhaps for Quora and the like, engineering for the amount of availability
needed to withstand this kind of event was simply not cost effective, but I
seriously doubt the possibility didn't occur to them. It's not even obvious to
me that there are many people who did follow the contract you reference who
had serious downtime. All of the cases I've read about so far have been
architectures that were not robust to a single AZ failure.

As for multi-AZ RDS, it's synchronous MySQL replication on what smells like
standard EC2 instances, probably backed by EBS. Our multi-AZ failover actually
worked fine this morning, but I'm curious how typical that was.
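
For reference, multi-AZ is basically a flag you set at creation time; a
minimal boto3 sketch (all identifiers and credentials below are placeholders):

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Create a MySQL instance with a synchronous standby in another AZ.
    # Identifier, instance class, and credentials are placeholders.
    rds.create_db_instance(
        DBInstanceIdentifier="example-db",
        Engine="mysql",
        DBInstanceClass="db.t3.medium",
        AllocatedStorage=20,
        MasterUsername="admin",
        MasterUserPassword="change-me-please",
        MultiAZ=True,
    )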

------
endergen
Read how @learnboost, who run on AWS, were not affected by the outage thanks
to their architecture:
<http://blog.learnboost.com/blog/availability-redundancy-and-failover-at-learnboost/>

~~~
nulljangles
Very interesting. If I'm reading this correctly, though, had all 4
Availability Zones that they're replicated across gone down, they would've
been in the same boat.

------
EGreg
This is, again, the problem with centralized vs. distributed services, not
just Amazon's infrastructure.

<http://myownstream.com/blog#2010-05-21> :)

~~~
billswift
Good essay. Coming up with a workable, decentralized alternative to domain
name registrars is an even bigger task than decentralized social apps, though.

------
cafebabe
It's all relative. From the viewpoint of a non-cloud-user, this is a pretty
normal situation. Systems fail. Maybe we should think of the cloud as a
service that is managed somewhat differently (to enable easier access to our
wallets and budgets) but eventually fails in the same way standard services
do. That's how I saw it when the first headline about cloud services appeared
in front of me a couple of years ago.

------
grandalf
It's pretty wild that this stuff happens. Similar to today's nasty outage,
Google has had some massive problems with its App Engine datastore...

I'm curious if anyone has any predictions about what the landscape will be
like in a few years? Will these be solved problems? Will cloud services lose
favor? Will everything just be designed more conservatively? Will engineers
finally learn to RTFSLA?

~~~
dendory
The benefits of the cloud are just too great; we won't go back. Except that in
a few years, when something goes down, instead of it being some random site
that's down, it's going to be the 20,000 sites hosted on that hardware.

~~~
TheAmazingIdiot
Right now, different cloud providers "speak" different languages. But I can
see the cloud speaking a similar set of languages in 5 or so years. One could
use a storage cloud from one provider and a CPU cloud from another.

I could eventually see, with help from functional languages like Lisp or
Erlang, an intra-company cloud running on and between networks. CPU could be
bought from 3 providers and storage from 4, with GPU acceleration clusters for
when big data needs to be crunched quickly.

Or, right now, companies can make their own clouds via Eucalyptus. Don't want
Amazon to hold your keys? Load-balance between your cloud and Amazon's.

------
ww520
One data point: I have one of my clients' servers in the us-east-1d
availability zone (east coast region, zone d). So far things are holding up -
no crash, no slowdown. Fingers crossed.

~~~
shykes
Note: your "zone d" is not my "zone d". AWS shuffles zone IDs across users.
See <http://alestic.com/2009/07/ec2-availability-zones>
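
You can see your own account's mapping with a describe call; newer versions of
the API also return a ZoneId that is stable across accounts, whereas the zone
letters are per-account labels. A small boto3 sketch:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Zone names (us-east-1a, -1b, ...) are per-account labels; the newer
    # ZoneId field identifies the same physical zone for every account.
    for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(zone["ZoneName"], zone.get("ZoneId"), zone["State"])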

------
wslh
I use DreamHost and have never had a failure like the Amazon one.

It's ironic.

~~~
mkramlich
I had one client project on DreamHost, and never again. My experience with
them was that even when it was "up" it could be down. Lots of mysterious
glitches and weirdness, stomping, restarts. Not 100% sure it was their fault.
But didn't see any definite evidence it was ours either. In comparison,
WebFaction and Linode have been great. Though I settled on Linode for all new
projects for several reasons that I felt made them better in the general case.

------
KeyBoardG
The ending of this article came off as very slanderous rather than just a
report of why the problem occurred. Keep it.

------
delvan07
Crazy how that crashed and brought other sites like Reddit, Quora, etc. down.

~~~
ceejayoz
Given that those sites are hosted on EC2, it's no crazier than blowing up a
car and killing its passengers.

