

AWS outage summary - tch
https://aws.amazon.com/message/680342/

======
seldo
I dunno about you, but I could use a TL;DR for this:

1. They fucked up an internal DNS change and didn't notice

2. Internal systems on EBS hosts piled up with messages trying to get to the
non-existent domain

3. Eventually the messages used up all the memory on the EBS hosts, and
thousands of EBS hosts began to die simultaneously

4. Meanwhile, panicked operators trying to slow down this tidal wave hit the
Throttle Everything button

5. The throttling was so aggressive that even normal levels of operation
became impossible

6. The incident was a single AZ, but the throttling was across the whole
region, which spread the pain further

[Everybody who got throttled gets a 3-hour refund]

7. Any single-AZ RDS instance on a dead EBS host was fucked

8. _Multi_-AZ RDS instances ran into two separate bugs, and either became
stuck or hit a replication race condition and shut down

[Everybody whose multi-AZ RDS didn't fail over gets 10 days free credit]

9. Single-AZ ELB instances in the broken AZ failed because they use EBS too

10. Because everybody was freaking out and trying to fix their ELBs, the ELB
service ran out of IP addresses and locked up

11. Multi-AZ ELB instances took too long to notice EBS was broken and then
hit a bug and didn't fail over properly anyway

[ELB users get no refund, which seems harsh]

For those keeping score, that's 1 human error, 2 dependency chains, 3 design
flaws, 3 instances of inadequate monitoring, and 5 brand-new internal bugs.
From the length and groveling tone of the report, I can only assume that a big
chunk of customers are very, VERY angry at them.

~~~
cperciva
I'm looking at this a bit differently. My reading of this is "a series of
subtle and bizarre failures combined in a way which nobody could ever have
anticipated". I think I'm a pretty good architect and coder, but I would never
claim that I could design a system which couldn't fail in this sort of way --
in fact, "a background task is unable to complete, resulting in it gradually
increasing its memory usage, ultimately causing a system to fail" is the one-
line description of an outage Tarsnap had in December of last year.
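
A minimal sketch of that failure mode (hypothetical code, not Tarsnap's or
AWS's): a reporting task keeps queueing retries for a destination that no
longer resolves. With no bound on the backlog, memory grows until the host
dies; capping the queue and counting drops is the boring mitigation.

    # Hypothetical sketch of an unbounded retry backlog vs. a bounded one.
    # "collector.internal.example" stands in for the internal DNS name that
    # stopped resolving; nothing here is real AWS code.
    import collections
    import socket

    MAX_BACKLOG = 10000
    backlog = collections.deque(maxlen=MAX_BACKLOG)  # bounded: oldest entries fall off
    dropped = 0

    def send_report(msg, host="collector.internal.example"):
        global dropped
        try:
            socket.getaddrinfo(host, 443)   # fails while the DNS record is gone
        except socket.gaierror:
            if len(backlog) == MAX_BACKLOG:
                dropped += 1                # shed load instead of leaking memory
            backlog.append(msg)
            return False
        # ...actually transmit msg and drain the backlog here...
        return True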

~~~
antirez
That's the problem. As a designer your goal is not to claim that you can
design an unflawed system. Instead it is to use all your humility (_and_
skills) to design things simple enough that they are unlikely to fail,
because beyond a certain complexity level you lose the ability to prevent
failures and to analyze the failure modes.

I would like to know how much of the design in places like AWS is made more
complex by _the requirements of HA itself_, but my guess is: a lot.

------
helper
> We are already in the process of making a few changes to reduce the
> interdependency between ELB and EBS to avoid correlated failure in future
> events and allow ELB recovery even when there are EBS issues within an
> Availability Zone.

This is music to my ears. We switched away from ELBs because of this
dependency. Hopefully this statement means Amazon is working on completely
removing any use of EBS from ELBs.

We came to the conclusion a year and a half ago that EBS has had too many
cascading failures to be trustworthy for our production systems. We now run
everything on ephemeral drives and use Cassandra distributed across multiple
AZs and multiple regions for data persistence.

I highly recommend getting as many servers as you can off EBS.
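
For anyone wondering what that looks like in practice, here is a rough sketch
with the Python cassandra-driver -- the seed IPs, keyspace name and replica
counts are made up, but NetworkTopologyStrategy is the standard way to spread
replicas across AZs and regions:

    # Rough sketch, not our actual config: a keyspace replicated across two
    # regions so losing an AZ (or a whole region) doesn't lose the data.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.1.10", "10.0.2.10"])   # seed nodes in different AZs
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE appdata WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'us-east': 3,
            'us-west-2': 2
        }
    """)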

~~~
signifiers
I could not agree more. Since the fleet-wide reboot last December I have had
_zero_ downtime across 20+ instance-backed VMs in US-East. Joe Stump of
SimpleGeo and Sprint.ly and many others have come to the same conclusion.

While we have a couple of RDS instances, nothing is production critical. And
this: "the root cause of the Multi-AZ [MySql|Oracle|SqlServer] failures we
observed during this event will be addressed" only confirms what the RSS
history in the dashboard already shows: in nearly every major EBS-related
"service event" (including the ones that happen every few weeks and never get
this level of post-mortem), the managed databases, load balancers and config
management (Beanstalk) services go down too.

When you move from AWS' basic EC2 IaaS VMs with instance (ephemeral|local)
storage to EBS-backed (basically vSAN) storage, your multi-month uptime odds
go down considerably. But when you step up to the PaaS of managed DBs, load
balancing, Dynamo, etc., yes, they offload a ton of management drudgery, but
they're an order of magnitude more fragile.
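
The arithmetic behind that is worth sketching: the availability of a serial
dependency chain is the product of its parts, so every managed layer you put
in the request path eats into your uptime. The numbers below are made up for
illustration.

    # Back-of-the-envelope sketch with made-up availability numbers.
    def serial_availability(*parts):
        total = 1.0
        for p in parts:
            total *= p
        return total

    ec2_only  = serial_availability(0.999)                       # ~8.8 hours/year down
    with_paas = serial_availability(0.999, 0.995, 0.995, 0.999)  # + EBS, ELB, RDS
    print(ec2_only, with_paas)   # 0.999 vs ~0.988, i.e. roughly 10x the downtime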

For me, the unpredictable performance, network contention and instability of
EBS just aren't worth it compared to the relatively smaller risk of hardware
disk failure I take on with instance-backed nodes. Yes, I know, disks fail -
but EBS disks fail _a lot_, and when they do, good luck fighting the herd to
spin up more -- or, crap, now even getting web console access to understand
what the hell is happening. That's the irony here - API access (including
issuing more IPs!) is "throttled" at precisely the time when you need it most.

My advice? Instance-backed instances (large or bigger), and roll your own
failover/DR/load balancing. Go ahead and "plan for failure" - but do it old
school: plan for the more likely case of simple h/w failure, not for the EBS
control plane and everything that depends on it.

~~~
moe
If you eschew the AWS EBS-backed services, which is pretty much all of them,
why are you on EC2 to begin with?

When you operate at a scale where the above matters, and unless you need
enormous elasticity, running Eucalyptus/SolusVM on your own gear is
significantly cheaper.

------
Trufa
I really love it when companies take the time to explain to their customers
what happened, especially in such detail.

It's clearly a very complicated setup, and this kind of post makes me trust
them more. Don't get me wrong, an outage is an outage, but knowing that they
are in control and take the time to explain shows respect and the right
attitude towards a mistake.

Good for them!

------
lukev
I am always astonished by how many layers these bugs actually have. It's easy
to start out blaming AWS, but if anyone can realistically say they could have
anticipated this type of issue at a system level, they're deluding themselves.

~~~
bcantrill
Full disclosure: I work for an AWS competitor.

While none of the specific AWS systemic failures may themselves be
foreseeable, it is not true that issues of this nature cannot be anticipated:
the architecture of their system (and in particular, their insistence on
network storage for local data) allows for cascading failure modes in which
single failures blossom to systemic ones. AWS is not the only entity to have
made this mistake with respect to network storage in the cloud; I, too, was
fooled.[1]

We have learned this lesson the hard way, many times over: local storage
should be local, even in a distributed system. So while we cannot predict the
specifics of the next EBS failure, we can say with absolute certainty that
there will be a next failure -- and that it will be one in which the magnitude
of the system failure is far greater than the initial failing component or
subsystem. With respect to network storage in the cloud, the only way to win
is not to play.

[1] <http://joyent.com/blog/network-storage-in-the-cloud-delicious-but-deadly>

~~~
bigiain
FWIW, once Amazon decided that an availability zone is an unreliable unit
(and they tell you this up front, and strongly suggest running multi-AZ
architectures for anything where you require reliability), any cascading
failure mode within a single AZ is not something you'd expect them to spend
too much time protecting against. Sure, the cascade from EBS faults to RDS
and ELB meant this affected more of their single-AZ customers than it would
have otherwise, but anyone using a single AZ knew up front that Amazon
advised against it, and Amazon never claimed to provide high availability
within a single AZ.

So yeah, you're right - these systemic failures can be anticipated, and
Amazon's advice (to spread your important infrastructure across multiple AZs)
would have protected users from most of this. (I feel quite a lot of sympathy
for the engineers involved in the cross-AZ failures this incident revealed -
the multi-AZ RDS and ELB failures bit customers who were doing everything
they were told, and those customers are probably rightly annoyed...)

~~~
seldo
While Amazon do indeed say that multi-AZ is the way to go, their last 3 major
incidents (including last year's cloudpocalypse) have all been full-region
incidents.

IMHO, their biggest design problem is that they build their systems on top of
each other (e.g. ELB is built on EBS and EIP). So when one system goes down,
it takes down half a dozen others -- this is especially true of EBS, and
especially dangerous because the services it takes down, like ELB, are the
services people are supposed to be using to route around EBS failures.
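
You can see why that's dangerous with a toy dependency walk -- the edges
below are guesses pieced together from this thread and the post-mortem, not
an official AWS diagram:

    # Toy illustration: propagate a single failure through a guessed-at
    # service dependency graph.
    DEPENDS_ON = {
        "ELB": {"EBS", "EIP"},
        "RDS": {"EBS"},
        "Beanstalk": {"ELB", "RDS"},
    }

    def impacted(failed_service):
        down = {failed_service}
        changed = True
        while changed:
            changed = False
            for svc, deps in DEPENDS_ON.items():
                if svc not in down and deps & down:
                    down.add(svc)
                    changed = True
        return down

    print(impacted("EBS"))   # EBS takes ELB, RDS and Beanstalk with it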

~~~
acdha
They've had exactly one true region-wide failure - that 17-minute routing
failure earlier this year. Running a truly multi-AZ setup has avoided every
other outage popularly reported as “the cloud is falling”.

Some services - e.g. Heroku - have lots of impacted customers but that's due
to their architecture, not the underlying AWS.

~~~
seldo
I'm sorry, but that's just not true. Prior to last week's incident:

June 28th 2012 - region-wide failure due to power outage and EBS dependency
issues hitting ELB and RDS: <http://aws.amazon.com/message/67457/>

March 15th 2012 - 22-minute, region-wide networking interruption (no status
report available)

April 2011 - cloudpocalypse, also a cascading EBS failure:
<http://aws.amazon.com/message/65648/>

~~~
acdha
The only region-wide outage in your links is the network one I mentioned -
they called it 22 minutes, I measured it as 17 on my systems. June 28th was
indeed not region-wide - I have systems in every AZ and lost exactly one of
them.

The other ones say “one of our Availability Zones” - which was rather my
point: if you followed long-accepted redundancy practices, you had far less
- if any - downtime than people who put everything in one AZ or relied
heavily on EBS volumes not failing.

------
teraflop
> Multi Availability Zone (Multi-AZ), where two database instances are
> synchronously operated in two different Availability Zones.

> The second group of Multi-AZ instances did not failover automatically
> because the master database instances were disconnected from their standby
> for a brief time interval immediately before these master database
> instances’ volumes became stuck. Normally these events are simultaneous.
> Between the period of time the masters were disconnected from their standbys
> and the point where volumes became stuck, the masters continued to process
> transactions without being able to replicate to their standbys.

Can someone explain this? I thought the entire point of synchronous
replication was that the master doesn't acknowledge that a transaction is
committed until the data reaches the slave. That's how it's described in the
RDS FAQ: <http://aws.amazon.com/rds/faqs/#36>

~~~
mcpherrinm
This just sounds like a race condition gone weird.

I assume from this description that the update protocol looks something like
this:

    process-request:
        update local state
        push to sync buddy
        wait for sync buddy success
        reply with status
        mark state as synced

If the function does the local update and then gets stuck in the state waiting
for the buddy to reply, one could imagine the failover daemon not handling
that case very well. So while the master might not have acknowledged the
transaction, the pair might get jammed trying to complete it.
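
Here's a toy version of that window (hypothetical names, obviously not how
RDS is actually implemented):

    # Toy model: the master applies a write locally, loses its standby before
    # the ack round-trip completes, and only then does its own volume freeze --
    # a state the failover logic apparently wasn't written to expect.
    class ReplicationError(Exception):
        pass

    class Master:
        def __init__(self, standby):
            self.standby = standby
            self.applied = []   # locally applied, possibly un-replicated writes

        def process_request(self, txn):
            self.applied.append(txn)            # 1. update locally
            try:
                self.standby.replicate(txn)     # 2. push to the sync buddy
            except ReplicationError:
                # 3. Standby unreachable: txn is applied here but never acked
                #    to the client. If this node's volume now gets stuck, the
                #    pair is "disconnected, then stuck" rather than both stuck
                #    at once.
                return "unacknowledged"
            return "committed"                  # 4. ack only after standby success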

EBS is a particularly complicated piece of software, and RDS is another layer
of complication built on top of that. Bugs clearly happen, and it's an
unfortunate state of affairs.

------
ndcrandall
Every time there is a service outage it makes me feel better about using them
in the future. Every outage actually makes the service more reliable, since
some issues will only manifest in production. I believe they have a great
team that's very knowledgeable.

~~~
blaines
I don't necessarily agree with your first two sentences, but I definitely
agree with the last one. I know they're smart.

I do have some concerns that they're having too much downtime. If there's one
small flaw in the system it seems that the whole thing begins to fail.

If they fix the problem, and it impacts the larger system in some other
unknown way, a different equally crippling issue could present itself in the
future. I'd like to be sure they're putting a huge effort into making sure
these problems don't happen, and I don't have those assurances at the moment.

First things first, a better status dashboard that actually reflects how
issues impact customers is needed. I'd rather have everything working fine and
the status be 'red' than have servers down, support tickets, calls, emails,
etc., and see a 'green' on the dashboard.

~~~
sokoloff
_If there's one small flaw in the system it seems that the whole thing begins
to fail._

I wonder if there is selection bias underlying that judgment? Reading all of
Amazon's post-mortems of big events, it does seem that the whole thing is
fragile. What I suspect is more likely true is that AWS suffers thousands of
small failures every month and most are contained as designed, with no (or
minuscule) customer impact. It's the ones that turn into highly visible
failures that we read about.

That said, I agree with you that EBS in particular seems to have more downtime
than I'd expect. (And the fact that other services like ELB depend on it makes
failures cascade in a way that's hard to design highly available systems
around.)

~~~
Firehed
> What I suspect is more likely true is that AWS suffers thousands of small
> failures every month and most are contained as designed, with no (or
> minuscule) customer impact.

Isn't that the whole point of moving to The Cloud? There's supposed to be some
magical system in place such that hardware failures are routed around and
don't interrupt service. Of course you can roll this yourself with your own
hardware, but this is done for you.

It should come as no surprise that a system complicated enough to appear
magical has some crazy complexity behind the scenes, and that accidental
dependencies can result in catastrophic failure.

------
mrkurt
AWS sure does put out amazing post mortems. If only they'd make their status
page more useful ...

~~~
taligent
How is the status page not useful? It's so simple to understand.

Green if everything is fine. Green if there are intermittent problems. Green
if nothing works.

~~~
signifiers
Too true. Today when Google's App Engine failed hard, they flat out said: "App
Engine is currently experiencing serving issues – Python, Java, Go … Oct 26
2012, 07:30 AM-11:59 PM … Current Availability: 55.90%" and showed read/write
I/O metrics were spiking through the roof. Their dashboard clearly labeled the
relevant icon as the most severe "Service Interruption".

I admire that level of transparency.

~~~
kordless
Yeah, was thinking the same thing while looking at all the cool graphs.

------
Karunamon
So how reliable is AWS in comparison to some of its competitors? I think there
might be a slight bias whenever AWS has a problem because so many big name
sites rely on them, and they're the proverbial 800lb gorilla of the cloud
computing space.

How many of these massive outages are affecting its competitors that we never
hear about?

~~~
IheartApplesDix
Well, there are two problems with your question. One is that there is no other
massive cloud service to compare with AWS. Every other large hosting solution
is more classical, with managed hardware and maybe virts as the most high-
level offering from most competitors. Nobody important uses MS or Google's
cloud, and if they do, they don't like talking about it because they probably
feel it gives them a competitive advantage, and nobody notices when they go
down anyway.

------
jwr
I avoid EBS because I think it is very complex, hard to do right, and has
nasty failure modes if you use it within a UNIX environment (your code
basically hangs, with no warning).

Now I learned that ELB uses EBS internally. I consider this very bad news, as
I inadvertently became dependent on EBS. I intend to stop using ELB.

~~~
oijaf888
How do you do databases? Just keep everything in instance storage, back up to
S3 frequently, and accept that there's possible data loss between backups?

~~~
jwr
I don't do databases, for the most part. I am lucky enough to have an
application that can do with very little transactional state, which we keep in
redis (using instance storage), replicated to another instance and backed up
to S3.
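
For the curious, roughly what that backup loop looks like with redis-py and
boto (bucket name and paths are placeholders; pick your snapshot/replication
settings based on how much data loss you can tolerate):

    # Rough sketch: snapshot redis to disk, then copy the RDB file to S3.
    # Replication to a second instance covers losing a box; this covers
    # losing both.
    import time
    import boto
    import redis

    r = redis.StrictRedis(host="localhost")
    before = r.lastsave()
    r.bgsave()                       # ask redis to write dump.rdb in the background
    while r.lastsave() == before:    # wait until the snapshot has landed on disk
        time.sleep(1)

    bucket = boto.connect_s3().get_bucket("my-backup-bucket")
    key = bucket.new_key("redis/dump-%d.rdb" % int(time.time()))
    key.set_contents_from_filename("/var/lib/redis/dump.rdb")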

------
papercruncher
I know there are lots of smart people working there but just look at the sheer
amount of AWS offerings. Amazon certainly gets credit for quickly putting out
new features and services but it makes me wonder if their pace has resulted in
way too many moving parts with an intractable number of dependencies.

~~~
nolok
Actually, it always seems to come down to EBS failure, which is an "old"
feature.

------
filvdg
Everything is a Freaking DNS problem :)

~~~
jlgreco
Well, naming things _is_ one of the 2 hard problems in computer science. ;)

~~~
erichocean
"There are only two hard problems in computer science: cache invalidation,
naming things, and off-by-one errors."

------
michaelkscott
For anyone interested, here are the comments on the App engine outage:
<http://news.ycombinator.com/item?id=4704973>

------
krosaen
Bugs happen, and the effects of cascading failures are very hard to
anticipate. But it seems like the AWS team hadn't fully tested the effects of
an EBS outage, which could have uncovered the RDS Multi-AZ failover bug and
perhaps the ELB failover bug ahead of time.
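
That kind of test is something customers can at least crudely rehearse for
their own stacks too -- e.g. a "game day" script that force-detaches EBS
volumes from a staging instance and checks whether failover actually fires.
The instance ID and region below are placeholders; don't point this at
production.

    # Crude fault-injection sketch (staging only, placeholder IDs): force-detach
    # every EBS volume attached to an instance and watch what breaks.
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")
    volumes = conn.get_all_volumes(filters={"attachment.instance-id": "i-0123abcd"})
    for vol in volumes:
        print("force-detaching %s" % vol.id)
        conn.detach_volume(vol.id, force=True)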

------
wglb
This is some degree of complexity.

What does it suggest when they say they "learned about new failure modes"?
Presumably that there are more failure modes not yet learned.

One wonders if somewhere internally they have a dynamic model of how all this
works. If not, this might be a good time to build one.

------
Randgalt
FYI - how Netflix handled the outage:
<http://techblog.netflix.com/2012/10/post-mortem-of-october-222012-aws.html>

