

AWS Post-Mortem - zeit_geist
http://aws.amazon.com/message/2329B7/

======
rdl
The thing I wonder about is why they didn't manually switch to generator when
their automatic controls failed. They presumably had ~5 minutes of UPS; it
took them 40 minutes to do this. This probably isn't directly Amazon's fault,
but rather the fault of whatever contract datacenter they are using in Europe
(probably a PTT, or possibly an international carrier; I'm really curious
which facility it was).

I'm wary of using more than one generator to back up a load, since that
requires syncing the generators during a backup event anyway -- I'm much more
comfortable with splitting the load up by room and having one generator per
room, with some kind of switch to allow pulling generators out for
maintenance. That pretty much limits you to 2-3MW per room (the largest
economical diesel gensets), but that's not horrible.

Really high reliability sites actually run onsite generation as PRIMARY (since
it's less reliable to start), and then utility as backup. With the right
onsite generation equipment, it can be cheaper/more efficient than the grid,
too (by using combined cycle; use heat output to run cooling directly).

Still, the 365 Main power outages take the cake; they used rotational UPSes
(generators with huge flywheels) which had software bugs such that if input
power got turned off and on several times (a common utility failure mode), the
unit shut itself off entirely. Doh.

~~~
jwatte
They explained that a ground fault prevented generators from delivering power.
Manual start doesn't help in that case.

~~~
rdl
From what I read, they said the ground fault confused their PLCs (the synchro
gear for paralleling multiple generators). That shouldn't affect the
generator set itself (engine and generator) outputting power.

Electronics are much more sensitive to ground faults, etc. than mechanical and
electrical devices.

A big manual transfer switch (as backup), which is presumably what they ended
up using, is fairly bulletproof.

------
marcamillion
It seems to me that Amazon Web Services will never truly be VERY stable.

Not because I am being cynical, but just based on the nature of what they are
doing.

They are the biggest provider of large scale cloud-based computing services.
They are pushing the boundaries. They are bound to always come upon problems
that no one has ever seen before (including themselves) just based on the very
nature of their business.

So if you are looking for 'rock-solid reliability', maybe it is better to wait
for another big company (Google, Apple, etc.) to come behind and fix all the
mistakes that Amazon made the first time.

That being said, I use AWS and I love it. Granted, I don't use EBS directly
(only via Heroku), and yes, I have encountered downtime recently, but it's
not that big of a deal. I know they aren't messing around, and they are in
uncharted territory.

I can't reasonably expect them to have the best uptime for a platform that no
one has ever built before, on their first time around the block. That would
be very unreasonable.

That being said, I will continue using them from now until I outgrow them or
the economics become painful, because the value I get from paying only for
what I use far outweighs 24-48 hours of downtime per year.

------
o1iver
There seems to be a pretty simple solution to these problems: diversification.
Like most things in life, putting all your eggs in one basket is not the right
choice.

The people who use only AWS or only RackSpace or only 1&1 are equally wrong.

What you have to do is diversify. Run a ghost of your production site on some
other platform (software/hardware bugs, ...), run by some other provider
(bankruptcy, theft, ...), in another country (power cuts, earthquakes, ...).
As soon as the primary goes down, you switch on the secondary. Assuming the
two setups fail independently, the probability of a total blackout is then
squared: 10^-3 * 10^-3 = 10^-6.

The great thing about these "cloud" platforms is that your secondary system
can even "go to sleep", saving you money, and then spin up instances as soon
as the primary goes down. This, by the way, is how banks, airport systems,
and probably the NSA do it!
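
The switch-on-failure logic itself can be tiny. Here is a minimal sketch,
assuming the standby sits as stopped EC2 instances in a second region and
polling a hypothetical health-check URL (the instance IDs are placeholders,
and the DNS or load-balancer cutover a real failover also needs is elided):

    import time
    import urllib.request

    import boto3  # assumed: credentials for the standby account in the environment

    PRIMARY_HEALTH_URL = "https://primary.example.com/health"  # hypothetical endpoint
    STANDBY_INSTANCE_IDS = ["i-0123456789abcdef0"]             # hypothetical standby
    STANDBY_REGION = "eu-west-1"                               # hypothetical region
    FAILURES_BEFORE_FAILOVER = 3

    def primary_is_up(timeout=5):
        """True if the primary answers its health check within the timeout."""
        try:
            with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    def wake_standby():
        """Start the stopped standby instances in the secondary region."""
        ec2 = boto3.client("ec2", region_name=STANDBY_REGION)
        ec2.start_instances(InstanceIds=STANDBY_INSTANCE_IDS)
        # Pointing traffic at the standby (DNS, load balancer) would go here;
        # it is elided because it depends entirely on the setup.

    failures = 0
    while True:
        failures = 0 if primary_is_up() else failures + 1
        if failures >= FAILURES_BEFORE_FAILOVER:
            wake_standby()
            break
        time.sleep(30)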

------
larrycatinspace
I'm thinking AWS needs to implement two new Availability Zones: AZ-ChaosMonkey
and AZ-ChaosApe. A dedicated playground for breaking things would let them
observe how this complex system reacts to simple failures and gaps in
assumptions.
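
Customers can already run a small-scale version of this themselves, Netflix
Chaos Monkey style: randomly kill instances in a tagged test pool and watch
what breaks. A minimal sketch using boto3, assuming a hypothetical
chaos=playground tag marking the sacrificial instances:

    import random

    import boto3  # assumed: standard AWS credentials in the environment

    def kill_one_chaos_instance(region="us-east-1"):
        """Terminate one random running instance tagged chaos=playground.

        The tag is a made-up convention for marking instances that are fair
        game; anything untagged is never touched.
        """
        ec2 = boto3.client("ec2", region_name=region)
        resp = ec2.describe_instances(Filters=[
            {"Name": "tag:chaos", "Values": ["playground"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ])
        candidates = [
            inst["InstanceId"]
            for reservation in resp["Reservations"]
            for inst in reservation["Instances"]
        ]
        if not candidates:
            return None
        victim = random.choice(candidates)
        ec2.terminate_instances(InstanceIds=[victim])
        return victim

    if __name__ == "__main__":
        print("terminated:", kill_one_chaos_instance())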

~~~
harshaw
Sure. Presumably Amazon has a test lab that replicates multiple zones :)
Perhaps your point is that Amazon should make this test lab public so people
can contribute to the QA effort?

IIRC many of these datacenter failures start with a utility company power
outage followed by a failure of the secondary power systems (I'm thinking of
some past failures at softlayer and other providers). I wonder if it is
prohibitively expensive to do a real life system test on a big data center (or
prohibitively expensive once the data center is online). For example, how
often do they turn off one of the mains (unexpectedly) to see what happens
with the backup system?

~~~
cperciva
_I wonder if it is prohibitively expensive to do a real life system test on a
big data center_

It's probably prohibitively dangerous. Backup power systems don't have many-
nines of reliability; generators which are reliable enough for the once-a-
decade event when a car crash knocks out your utility power aren't anywhere
near the reliability needed to run your datacentre for an hour every month as
a test.

~~~
emaste
Actually, if you don't test-run your generator regularly, it's very unlikely
to work when you do need it.

Here's a doc from Cummins, a generator manufacturer:
[http://www.cumminspower.com/www/literature/technicalpapers/P...](http://www.cumminspower.com/www/literature/technicalpapers/PT-7004-Maintenance-
en.pdf)

It claims that the generator should be run for 30 minutes every month, loaded
to at least one third of the rated capacity. So testing every month is exactly
what you want to do.

~~~
rdl
Right, but the thing you don't test is the transfer switch/sync gear.

Powering up the generator and dumping the output as heat weekly is pretty
standard practice.

~~~
ams6110
Also don't forget to check the fuel tanks. With the rise in fuel prices the
past couple of years, theft of diesel from backup generators has become more
common.

------
mtkd
It's a good communication from Amazon, though maybe a little too long; it
could use a summary block at the top.

The compensation looks generous too.

~~~
grourk
There's a summary block at the bottom -- but it's not a summary.

------
gfodor
For all those complaining about AWS, I think it's important not to fall into
the trap of throwing all of Amazon's services into the same bucket. EBS (and
hence RDS) have proven time and again to be the most complex offerings and
the most prone to failure.

Generally speaking, at least for now, the parts of your system built on top of
EBS should be carefully architected to survive erratic EBS latency, data
corruption, or even outright downtime. (All of which are within the standard
AWS contract, but happen much more often in practice than you would expect if
you're used to the mean time between failures of a hard disk sitting in a
cage.)
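
In practice "carefully architected" mostly means aggressive retries with
backoff plus application-level checksums, rather than trusting the volume. A
minimal sketch of that defensive posture (the checksum is a hypothetical
value stored elsewhere by the application; a read that hangs indefinitely
still needs a timeout at a higher layer):

    import hashlib
    import time

    def read_with_retries(path, expected_sha256=None, attempts=4, base_delay=0.5):
        """Read a file on an EBS-backed filesystem, retrying transient errors.

        expected_sha256 is an application-level checksum recorded elsewhere;
        pass None to skip the corruption check.
        """
        last_error = None
        for attempt in range(attempts):
            try:
                with open(path, "rb") as f:
                    data = f.read()
                if expected_sha256 and hashlib.sha256(data).hexdigest() != expected_sha256:
                    # Treat silent corruption like any other transient failure.
                    raise IOError("checksum mismatch on %s" % path)
                return data
            except (IOError, OSError) as err:
                last_error = err
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        raise last_error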

This pattern leads me to believe that services such as VoltDB that do not
directly rely upon attached storage will prove to be the paradigm necessary to
get reliable cloud computing, at least in the AWS ecosystem. On-demand
provisioning of disk is an extraordinarily hard problem, and a world where
local ephemeral storage provides durability through redundancy across nodes
and AZs is probably where we are headed.

------
jwatte
I thought best practice backup power was to use large flywheels for re-
generation, and spin up diesel engines to power the wheel in the event of a
loss. That way, there is no phase synchronization issue, just a mechanical
clutch. Seems like this outage could have been prevented with better gear?

------
robryan
This seems to share a lot of parallels with the last big outage, in terms of
the API request overload and the EBS replication. It seems like the system
needs to be better able to distinguish between a single node going down and
requiring a remirror, and most of an availability zone going down.

------
saturn
As someone who has put a considerable amount of resources into moving things
to cloud computing, I wanted to believe. But I have changed my mind.

Cloud computing scales the efficiencies, yes. It also scales the problems. And
because of this, AWS is _by several orders of magnitude_ the worst of my
current hosts.

I have dedicated servers. No downtime in the past year. I have a couple of
cloud servers with Rackspace. No downtime (although I don't recommend them).
I have some VPSes with local providers. No downtime.

AWS? _More than 24hrs downtime in the last year._ Seriously, for someone
trying to run web sites reliably - screw that. I'm not using AWS any more.

And don't even get me started on the apologists. "EBS slow as treacle? Well
you should have been running a multi zone raid-20 redundant array! Duh!". "EC2
instances dying at random? Well you should architect and implement a multi-
master failover intelligent grid!"

I used to be under some kind of crazy delusional spell that the above was
correct and it was somehow my fault that I wasn't correctly adapting to AWS's
numerous failings. Well, no more. Now I realise that I should just stick with
the super reliable service I know and love from traditional operators. You
need to programmatically grow and shrink your app server flock? Great, use
AWS. For the other 99.999% of us - stick with what you were using before.

~~~
quanticle
Is there any reason in particular why you wouldn't recommend Rackspace?

~~~
saturn
Hm. Well, I don't like them. It's subjective, you might disagree. But off the
top of my head:

1\. Contracts. They want 1-year minimum contracts for any dedicated servers.
For truly gargantuan orders I could understand this, but for one puny server?
Never.

2\. Their definition of "cloud" is different from mine. To use their "cloud"
services your servers need to be public-facing, i.e. on public IPs. Want them
on your own VPN? You can still get the cloud prices but not the API; you
create and cancel servers via tickets. This is different from a VPS how?

3\. Sloooow provisioning: even if you are able to use their "public cloud"
API to provision a server, prepare to wait _hours_ for it to be done, leading
me to suspect it does nothing more than email a tech to provision a VPS and
hook it up somehow. Oh, you can't pause them to save money either, again
making me think these "cloud" servers are nothing more than slicehosts with
an extra layer of abstraction.

Is that enough? I could go on.

~~~
robszumski
It sounds like you're not a fan of Rackspace (and that's fine), but you can't
honestly believe that an API sends an email to a tech. That's just a
completely false statement that no one in their right mind should believe.

~~~
saturn
I came up with that theory because I couldn't think of anything else that
explained my 2 hour waits to get a new instance. Not the other way round.

