AWS outage summary (amazon.com)
161 points by tch on Oct 27, 2012 | 65 comments

I dunno about you, but I could use a TL;DR for this:

1. They fucked up an internal DNS change and didn't notice

2. Internal systems on EBS hosts piled up with messages trying to get to the non-existent domain

3. Eventually the messages used up all the memory on the EBS hosts, and thousands of EBS hosts began to die simultaneously

4. Meanwhile, panicked operators trying to slow down this tidal wave hit the Throttle Everything button

5. The throttling was so aggressive that even normal levels of operation became impossible

6. The incident was a single AZ, but the throttling was across the whole region, which spread the pain further

[Everybody who got throttled gets a 3-hour refund]

7. Any single-AZ RDS instance on a dead EBS host was fucked

8. Multi-AZ RDS instances ran into two separate bugs, and either became stuck or hit a replication race condition and shut down

[Everybody whose multi-AZ RDS didn't fail over gets 10 days free credit]

9. Single-AZ ELB instances in the broken AZ failed because they use EBS too

10. Because everybody was freaking out and trying to fix their ELBs, the ELB service ran out of IP addresses and locked up

11. Multi-AZ ELB instances took too long to notice EBS was broken and then hit a bug and didn't fail over properly anyway

[ELB users get no refund, which seems harsh]

For those keeping score, that's 1 human error, 2 dependency chains, 3 design flaws, 3 instances of inadequate monitoring, and 5 brand-new internal bugs. From the length and groveling tone of the report, I can only assume that a big chunk of customers are very, VERY angry at them.
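
On point 6, here is a rough sketch of what zone-scoped throttling could look like, as opposed to one region-wide clamp. To be clear, this is not how the AWS API throttle works (the post-mortem doesn't describe it at that level); `TokenBucket`, `ZoneScopedThrottle`, the zone names and the rates are all made up for illustration.

```python
class TokenBucket:
    """Plain token bucket; the caller passes in the current time in seconds."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at `burst`.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class ZoneScopedThrottle:
    """Hypothetical: one bucket per zone, so clamping one zone leaves the rest alone."""

    def __init__(self, rate_per_sec, burst):
        self.default = (rate_per_sec, burst)
        self.buckets = {}

    def clamp(self, zone, rate_per_sec, burst):
        # Operator action: replace a single zone's bucket with a much tighter one.
        self.buckets[zone] = TokenBucket(rate_per_sec, burst)

    def allow(self, zone, now):
        bucket = self.buckets.setdefault(zone, TokenBucket(*self.default))
        return bucket.allow(now)


throttle = ZoneScopedThrottle(rate_per_sec=10, burst=5)
throttle.clamp("zone-a", rate_per_sec=0.1, burst=1)   # only the troubled zone is clamped

zone_a = [throttle.allow("zone-a", now=0.0) for _ in range(3)]
zone_b = [throttle.allow("zone-b", now=0.0) for _ in range(3)]
print(zone_a, zone_b)
```

With these made-up numbers the clamped zone gets one call through and the healthy zone is unaffected, which is roughly the behavior people in the other AZs would have wanted.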

I'm looking at this a bit differently. My reading of this is "a series of subtle and bizarre failures combined in a way which nobody could ever have anticipated". I think I'm a pretty good architect and coder, but I would never claim that I could design a system which couldn't fail in this sort of way -- in fact, "a background task is unable to complete, resulting in it gradually increasing its memory usage, ultimately causing a system to fail" is the one-line description of an outage Tarsnap had in December of last year.
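
For what it's worth, the usual guard against "a background task can't complete and slowly eats all the memory" is to cap the backlog and count what you throw away. A minimal sketch, assuming the messages are safe to drop oldest-first (that may well not be true for the EBS agent); `BoundedBacklog` and the `report-N` messages are hypothetical names:

```python
from collections import deque


class BoundedBacklog:
    """Holds at most `capacity` pending messages; the oldest are dropped first."""

    def __init__(self, capacity):
        self.pending = deque(maxlen=capacity)
        self.dropped = 0

    def add(self, message):
        if len(self.pending) == self.pending.maxlen:
            self.dropped += 1  # the deque evicts the oldest entry on append
        self.pending.append(message)


backlog = BoundedBacklog(capacity=3)
for i in range(10):  # the destination never comes back, so nothing drains
    backlog.add(f"report-{i}")

print(list(backlog.pending), backlog.dropped)
```

Memory stays flat no matter how long the destination is unreachable, and the `dropped` counter is something you can alarm on, which covers the "didn't notice" part too.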

That's the problem. As a designer, your goal is not to claim that you can design an unflawed system. Instead, it is to use all your humility (and skills) to design things that are simple enough that they are unlikely to fail because their complexity exceeds your ability to prevent and analyze the failure modes.

I would like to know how much of the design in places like AWS is made more complex by the requirements of HA itself, but my guess is, a lot.

I certainly didn't mean to imply that they should have predicted this -- my reason for scoring the number of simultaneous issues is to indicate what a shitstorm this was.

That said, there are some genuine deep-rooted design flaws at work here, as others have pointed out, primarily Amazon's use of EBS for critical services in their own cloud.

That sounds a lot like people using cron for complicated tasks that repeat every 5 minutes (db queries for example). Before you know it, the jobs pile on top of each other locking the DB and spiraling out of control.
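
The standard fix for the cron pile-up is a non-blocking lock, so a run that starts while the previous one is still going just exits. A minimal sketch using `fcntl.flock` (Unix only); `run_exclusively` and the `report_job.lock` file name are made up for the example:

```python
import fcntl
import os
import tempfile


def run_exclusively(lock_path, job):
    """Run job() only if nobody else holds the lock; return True if it ran."""
    lock_file = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of queueing behind the holder.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return False  # previous run is still going, skip this tick
    try:
        job()
        return True
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()


lock_path = os.path.join(tempfile.mkdtemp(), "report_job.lock")
nested = []


def slow_job():
    # Simulate the next cron tick firing while this run is still in progress.
    nested.append(run_exclusively(lock_path, lambda: None))


ran = run_exclusively(lock_path, slow_job)
print(ran, nested)
```

The overlapping run is skipped rather than queued, so a slow query costs you a missed tick instead of a locked-up DB.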

My general take on their writeup was that they have too many services that depend on EBS working. They should try to find ways to decouple critical services like loadbalancers and Amazon RDS from EBS.

Databases (RDS) do need storage and EBS is their tool for that. If something else was better, it should replace EBS across products.

To your point, w/o knowing architecture, seems like ELB run state could likely be on ephemeral storage (if ELBs are EC2 instances) backed by configs on S3 unless run state is crucial across resets. If not instances, maybe use S3 directly, or ElastiCache.

Heterogeneity aids resilience because it reduces the risk that the same mode of failure occurs simultaneously. It's not clear to me that there should be one and only one mechanism for storage, because variety has a value in and of itself, apart from relative merits of different kinds of storage.

They have 3 mechanisms for storage, all can be accessed from an EC2 server: ephemeral disk, EBS, and S3.

That said, I agree with you. I commented similarly during the outage, noting that AWS may have too many interdependencies: http://news.ycombinator.com/item?id=4685571

Would be very cool to have S3 snapshots available for ephemeral instances, but I am guessing that is not very practical.

Sounds like a perfect storm [of imperfection].

> We are already in the process of making a few changes to reduce the interdependency between ELB and EBS to avoid correlated failure in future events and allow ELB recovery even when there are EBS issues within an Availability Zone.

This is music to my ears. We switched away from ELBs because of this dependency. Hopefully this statement means Amazon is working on completely removing any use of EBS from ELBs.

We came to the conclusion a year and a half ago that EBS has had too many cascading failures to be trustworthy for our production systems. We now run everything on ephemeral drives and use Cassandra distributed across multiple AZs and multiple regions for data persistence.

I highly recommend getting as many servers as you can off EBS.

I could not agree more, and have had zero downtime since the fleet-wide reboot last December from any of 20+ instance-backed VMs in US-East. Joe Stump of SimpleGeo and Sprint.ly and many others have come to the same conclusion.

While we have a couple of RDS instances, nothing is production critical. And this: "the root cause of the Multi-AZ [MySql|Oracle|SqlServer] failures we observed during this event will be addressed" only confirms my observations from the RSS history in the dashboard, that in nearly every major EBS-related "service event" (including the ones that happen every few weeks and never get this level of post-mortem), the managed databases, load balancers and config management (beanstalk) services go down too.

When you move from AWS' basic EC2 IaaS VMs with instance (ephemeral|local) storage to EBS-backed (basically vSAN) storage, your multi-month uptime odds go down considerably. But when you step up to the PaaS of managed DBs, load balancing, dynamo, etc., yes, they offload a ton of management drudge, but it's an order of magnitude more fragile.

The unpredictability of performance, network contention and stability with EBS, for me, just doesn't outweigh the relatively smaller risk of hardware disk failure I take on from instance-backed nodes. Yes, I know, disks fail - but EBS disks fail a lot, and when they do, good luck fighting the herd to spin up more -- or crap, now, even getting web console access to understand what the hell is happening. That's the irony here - API access (including issuing more IPs!) is "throttled" at precisely the time when you need it most.

My advice? Instance-backed >= large, and roll your own failover/DR/load balancing. Go ahead and "plan for failure" - but do it old school: plan for the more likely case of simple h/w failure, not the EBS control plane and everything that depends on it.

If you eschew the AWS EBS-backed services, which is pretty much all of them, why are you on EC2 to begin with?

When you operate at a scale where the above matters, and unless you need enormous elasticity, running Eucalyptus/SolusVM on your own gear is significantly cheaper.

Enthusiastically seconded. We stopped using EBS for anything 12 months ago and got through this outage with very little pain (previous AWS incidents had us cleaning up for weeks afterwards).

I really love it when companies take the time to explain to their customers what happened, especially in such detail.

It's clearly a very complicated setting, and this type of post makes me trust them more. Don't get me wrong, an outage is an outage, but knowing that they are in control and take the time to explain shows respect and the correct attitude towards a mistake.

Good for them!

I am always astonished by how many layers these bugs actually have. It's easy to start out blaming AWS, but if anyone can realistically say they could have anticipated this type of issue at a system level, they're deluding themselves.

Full disclosure: I work for an AWS competitor.

While none of the specific AWS systemic failures may themselves be foreseeable, it is not true that issues of this nature cannot be anticipated: the architecture of their system (and in particular, their insistence on network storage for local data) allows for cascading failure modes in which single failures blossom to systemic ones. AWS is not the only entity to have made this mistake with respect to network storage in the cloud; I, too, was fooled.[1]

We have learned this lesson the hard way, many times over: local storage should be local, even in a distributed system. So while we cannot predict the specifics of the next EBS failure, we can say with absolute certainty that there will be a next failure -- and that it will be one in which the magnitude of the system failure is far greater than the initial failing component or subsystem. With respect to network storage in the cloud, the only way to win is not to play.

[1] http://joyent.com/blog/network-storage-in-the-cloud-deliciou...

FWIW, once Amazon decided that an availability zone is an unreliable unit (and they tell you this up front, and strongly suggest running multiple AZ architectures for anything where you require reliability), then any cascading failure mode in a single AZ is not something you'd expect them to spend too much time protecting against. Sure, the cascade from EBS faults to RDS and ELB meant this affected more of their single AZ customers than it would have otherwise, but anyone using a single AZ knew upfront that Amazon advised against that, and never claimed they intended to provide high-availability in single AZs.

So yeah, you're right - these systemic failures can be anticipated, and Amazon's advice (to spread your important infrastructure across multiple AZs) would have protected users from most of this. (I feel quite a lot of sympathy for the engineers involved in the cross-AZ failures this incident revealed - the multi-AZ RDS and ELB failures are things customers doing everything they were told got bitten by anyway, and are probably rightly annoyed...)

While Amazon do indeed say that multi-AZ is the way to go, their last 3 major incidents (including last year's cloudpocalypse) have all been full-region incidents.

IMHO, their biggest design problem is that they build their systems on top of each other (e.g. ELB is built on EBS and EIP). So when one system goes down, it takes down half a dozen others -- this is especially true of EBS, and especially dangerous because the services it takes down, like ELB, are the services people are supposed to be using to route around EBS failures.

They've had exactly one true region-wide failure - that 17 minute routing failure earlier this year. Running a truly multi-AZ setup has avoided every other outage popularly reported as “the cloud is falling”.

Some services - e.g. Heroku - have lots of impacted customers but that's due to their architecture, not the underlying AWS.

I'm sorry, but that's just not true. Prior to last week's incident:

June 28th 2012 - region-wide failure due to power outage and EBS dependency issues hitting ELB and RDS: http://aws.amazon.com/message/67457/

March 15th 2012 - 22-minute, region-wide networking interruption (no status report available)

April 2011 - cloudpocalypse, also a cascading EBS failure: http://aws.amazon.com/message/65648/

The only region-wide outage in your links is the network one I mentioned - they called it 22 minutes, I measured it as 17 on my systems. June 28th was indeed not region-wide - I have systems in every AZ and lost exactly one of them.

The other ones say “one of our Availability Zones” - which was rather my point: if you follow long-time accepted redundancy practices, you had far less - if any - downtime than people who put everything in one AZ or rely heavily on EBS volumes not failing.

Mostly, but not entirely. Under the "Impact on Amazon Relational Database Service (RDS)" section they state that due to two other bugs, some Multi-AZ RDS instances did NOT fail over, because their systems knew that the masters had stopped replicating to the standbys, but had continued to process some transactions (albeit more slowly, as the failures cascaded and amplified), and thus prevented the automatic promotion of a standby, as the standbys were now out of date.

I agree that's a problem but calling it a region-wide failure is an unhelpful exaggeration, particularly in the case of a SQL database where there's always a tradeoff between availability and consistency.

Very wise comment IMHO. More complex = more complex failure modes. The only way to improve is to be less complex. Local storage probably also has a number of other advantages from a performance point of view.

I don't think that there is an obvious way to make it less complex. But complexity is not inherently bad. The Internet is very complex, yet it is robust.

What feels wrong here is that Amazon systems are too trusting of other Amazon internal systems.

Moving storage locally seems like a good move. When complexity is still too high and there are no ways to make it simpler (I don't think that's the case, btw) relax your requirements.

To be honest, EBS is the special sauce that makes me stick with AWS. There are so many little things that become so much easier by having network storage. To me, the cost of upgrading or migrating machines using local storage is just not worth the perceived reliability gain.

It's not just reliability though. The quality of I/O is also better. You don't see such slow I/O, with crazy fluctuations, with local storage.

I believe the damage of cascading failures was also masterfully exposed during your talk at QCon last year: http://www.infoq.com/presentations/Debugging-Production-Syst...

Start at ~10:20 for the cascading failures.

Very entertaining IMO. But I would suggest the speaker cut the off-hand ostracism of large groups of people.

> Multi Availability Zone (Multi-AZ), where two database instances are synchronously operated in two different Availability Zones.

> The second group of Multi-AZ instances did not failover automatically because the master database instances were disconnected from their standby for a brief time interval immediately before these master database instances’ volumes became stuck. Normally these events are simultaneous. Between the period of time the masters were disconnected from their standbys and the point where volumes became stuck, the masters continued to process transactions without being able to replicate to their standbys.

Can someone explain this? I thought the entire point of synchronous replication was that the master doesn't acknowledge that a transaction is committed until the data reaches the slave. That's how it's described in the RDS FAQ: http://aws.amazon.com/rds/faqs/#36

This just sounds like a race condition gone weird.

I assume from this description that the update protocol looks something like this:

    push to sync buddy
    wait for sync buddy success
    mark state as synced

If the function does the local update and then gets stuck in the state waiting for the buddy to reply, one could imagine the failover daemon not handling that case very well. So while the master might not have acknowledged the transaction, the pair might get jammed trying to complete it.
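
To make that guess concrete, here is a toy version of the three steps with the local update done first. This is only a sketch of the guessed protocol, not how RDS actually works; `Master`, `Standby` and the `reachable` flag (standing in for a timeout on the replication link) are all hypothetical.

```python
class Standby:
    def __init__(self):
        self.log = []
        self.reachable = True

    def replicate(self, txn):
        if not self.reachable:
            return False  # stands in for a timeout waiting on the standby
        self.log.append(txn)
        return True


class Master:
    def __init__(self, standby):
        self.standby = standby
        self.log = []    # transactions applied locally
        self.acked = []  # transactions acknowledged to the client

    def commit(self, txn):
        self.log.append(txn)                 # local update happens first
        if not self.standby.replicate(txn):  # push to sync buddy, wait for success
            return False                     # never acknowledged, but already applied locally
        self.acked.append(txn)               # mark state as synced
        return True


standby = Standby()
master = Master(standby)
first = master.commit("t1")
standby.reachable = False  # link drops just before the volume gets stuck
second = master.commit("t2")

print(master.log, standby.log, master.acked)
```

The client never got an ack for `t2`, so synchronous replication technically kept its promise, but the master's local state is now ahead of the standby. A failover daemon that compares the two would be right to hesitate before promoting the standby.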

EBS is a particularly complicated piece of software, and RDS is another layer of complication built on top of that. Bugs clearly happen, and it's an unfortunate state of affairs.

I think AWS uses DRBD (http://www.drbd.org/) for replicating the writes to the standby. Among the DRBD replication modes described here: http://www.drbd.org/users-guide-emb/s-replication-protocols.... I suspect they probably use B and landed in a state where packets got delivered but weren't written on the standby.

Every time there is a service outage it makes me feel better about using them in the future. Every outage is actually making the project more reliable since some issues will only manifest in production. I believe they have a great team that's very knowledgeable.

I don't necessarily agree with your first two sentences, but I definitely agree with the last one. I know they're smart.

I do have some concerns that they're having too much downtime. If there's one small flaw in the system it seems that the whole thing begins to fail.

If they fix the problem, and it impacts the larger system in some other unknown way, a different equally crippling issue could present itself in the future. I'd like to be sure they're putting a huge effort into making sure these problems don't happen, and I don't have those assurances at the moment.

First things first, a better status dashboard that actually reflects how issues impact customers is needed. I'd rather have everything working fine and the status be 'red' than have servers down, support tickets, calls, emails, etc and see a 'green' on the dashboard.

> If there's one small flaw in the system it seems that the whole thing begins to fail.

I wonder if there is selection bias underlying that judgment? Reading all of Amazon's post-mortems of big events, it does seem that the whole thing is fragile. What I suspect is more likely true is that AWS suffers thousands of small failures every month and most are contained as designed, with no (or minuscule) customer impact. It's the ones that turn into highly visible failures that we read about.

That said, I agree with you that EBS in particular seems to have more downtime than I'd expect. (And the fact that other services like ELB depend on it makes failures cascade in a way that makes it hard to design highly-available systems.)

> What I suspect is more likely true is that AWS suffers thousands of small failures every month and most are contained as designed, with no (or minuscule) customer impact.

Isn't that the whole point of moving to The Cloud? There's supposed to be some magical system in place such that hardware failures are routed around and don't interrupt service. Of course you can roll this yourself with your own hardware, but this is done for you.

It should be no small surprise that a system complicated enough to appear magical has some crazy complexity behind the scenes, and accidental dependencies can result in catastrophic failure.

> I do have some concerns that they're having too much downtime. If there's one small flaw in the system it seems that the whole thing begins to fail.

I think this is a case of selection bias. Most of the time, when there is a small flaw, the system transparently bypasses the flaw and the service continues uninterrupted while they fix the initial problem. Because of this, the only failures that people see are the ones where the bypassing process fails, in which case the problem affects many people.

From the point of view of a single service running on AWS, this is a much more stable system, because it will keep your service up much more than a system without the auto-bypass infrastructure. From an end-user point of view, this benefit is not so clear, because while any given service is more reliable, they tend to go down at the same time.

Do you think maybe this is just a focusing illusion?[1] And therefore the utility you associate with the service is not correctly attributed.

[1] http://en.wikipedia.org/wiki/Anchoring

I like this idea. But which aspect of his belief do you think might be an illusion? I'm not clear precisely what you're referring to.

AWS sure does put out amazing post mortems. If only they'd make their status page more useful ...

How is the status page not useful? It's so simple to understand.

Green if everything is fine. Green if there are intermittent problems. Green if nothing works.

Too true. Today when Google's App Engine failed hard, they flat out said: "App Engine is currently experiencing serving issues – Python, Java, Go … Oct 26 2012, 07:30 AM-11:59 PM … Current Availability: 55.90%" and showed read/write I/O metrics were spiking through the roof. Their dashboard clearly labeled the relevant icon as the most severe "Service Interruption".

I admire that level of transparency.

Yeah, was thinking the same thing while looking at all the cool graphs.

And if things are really broken, green with a little "i" icon next to it.

You left out "Green if everything is working - except the Chaos Monkey. All systems are working properly, but none of them failing randomly 'for your convenience'"

Haha. I'd love someone to release a greasemonkey/userscript for that status page to make degraded service more obvious.

Lol, thanks for a good laugh.

So how reliable is AWS in comparison to some of its competitors? I think there might be a slight bias whenever AWS has a problem because so many big name sites rely on them, and they're the proverbial 800lb gorilla of the cloud computing space.

How many of these massive outages are affecting its competitors that we never hear about?

Well there's two problems with your question. One is that there is no other massive cloud service to compare with AWS. Every other large hosting solution is more classical with managed hardware and maybe virts being the most high-level offering for most competitors. Nobody important uses MS or Google's cloud, and if they do, they don't like talking about it because they probably feel it gives them a competitive advantage and nobody notices when they go down anyway.

I avoid EBS because I think it is very complex, hard to do right, and has nasty failure modes if you use it within a UNIX environment (your code basically hangs, with no warning).

Now I learned that ELB uses EBS internally. I consider this very bad news, as I inadvertently became dependent on EBS. I intend to stop using ELB.

How do you do databases? Just keep everything in instance storage and back up to S3 frequently and accept that there's possible data loss in between backup times?

I don't do databases, for the most part. I am lucky enough to have an application that can do with very little transactional state, which we keep in redis (using instance storage), replicated to another instance and backed up to S3.

I know there are lots of smart people working there but just look at the sheer amount of AWS offerings. Amazon certainly gets credit for quickly putting out new features and services but it makes me wonder if their pace has resulted in way too many moving parts with an intractable number of dependencies.

Actually, it always seems to come down to EBS failure, which is an "old" feature.

Everything is a Freaking DNS problem :)

Well, naming things is one of the 2 hard problems in computer science. ;)

"There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors."

DNS is the other problem too.

For anyone interested, here are the comments on the App engine outage: http://news.ycombinator.com/item?id=4704973

Bugs happen, and the effects of cascading failures are very hard to anticipate. But it seems like the AWS team hadn't fully tested the effects of an EBS outage, which seems like it could have uncovered the RDS Multi-AZ failover bug and perhaps the ELB failover bug ahead of time.

This is some degree of complexity.

What does it suggest when they say they "learned about new failure modes"? That there are new ones not yet learned.

One wonders if somewhere internally they have a dynamic model of how all this works. If not, might be a good time to build one.
