1. They fucked up an internal DNS change and didn't notice
2. Internal systems on EBS hosts piled up with messages trying to get to the non-existent domain
3. Eventually the messages used up all the memory on the EBS hosts, and thousands of EBS hosts began to die simultaneously
4. Meanwhile, panicked operators trying to slow down this tidal wave hit the Throttle Everything button
5. The throttling was so aggressive that even normal levels of operation became impossible
6. The incident was confined to a single AZ, but the throttling applied across the whole region, which spread the pain further
[Everybody who got throttled gets a 3-hour refund]
7. Any single-AZ RDS instance on a dead EBS host was fucked
8. Multi-AZ RDS instances ran into two separate bugs, and either became stuck or hit a replication race condition and shut down
[Everybody whose multi-AZ RDS didn't fail over gets 10 days free credit]
9. Single-AZ ELB instances in the broken AZ failed because they use EBS too
10. Because everybody was freaking out and trying to fix their ELBs, the ELB service ran out of IP addresses and locked up
11. Multi-AZ ELB instances took too long to notice EBS was broken and then hit a bug and didn't fail over properly anyway
[ELB users get no refund, which seems harsh]
For those keeping score, that's 1 human error, 2 dependency chains, 3 design flaws, 3 instances of inadequate monitoring, and 5 brand-new internal bugs. From the length and groveling tone of the report, I can only assume that a big chunk of customers are very, VERY angry at them.
I would like to know how much of the design in places like AWS is made more complex by the requirements of HA itself, but my guess is, a lot.
That said, there are some genuine deep-rooted design flaws at work here, as others have pointed out, primarily Amazon's use of EBS for critical services in their own cloud.
To your point, without knowing the architecture, it seems like ELB run state could live on ephemeral storage (if ELBs are EC2 instances), backed by configs on S3, unless run state is crucial across resets - rough sketch below. If they're not instances, maybe use S3 directly, or ElastiCache.
That said, I agree with you. I commented similarly during the outage, noting that AWS may have too many interdependencies: http://news.ycombinator.com/item?id=4685571
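A rough sketch of what that S3-backed approach could look like - purely hypothetical, since none of us know ELB's actual internals; the bucket, key and file paths below are made up, and boto3 just stands in for whatever client they'd use:

    # Hypothetical sketch, not Amazon's actual ELB design: keep only the durable
    # load-balancer config in S3 and all run state on local ephemeral disk, so a
    # loss of network storage (EBS) can't take the node down.
    import json
    import boto3

    S3_BUCKET = "example-elb-config"           # assumed bucket name
    S3_KEY = "lb/config.json"                  # assumed key
    EPHEMERAL_STATE = "/mnt/ephemeral/lb-state.json"

    def load_config():
        """Pull the durable config from S3 when the node boots."""
        obj = boto3.client("s3").get_object(Bucket=S3_BUCKET, Key=S3_KEY)
        return json.loads(obj["Body"].read())

    def save_run_state(state: dict):
        """Run state lives on ephemeral storage; it is simply rebuilt after a reset."""
        with open(EPHEMERAL_STATE, "w") as f:
            json.dump(state, f)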
This is music to my ears. We switched away from ELBs because of this dependency. Hopefully this statement means Amazon is working on completely removing any use of EBS from ELBs.
We came to the conclusion a year and a half ago that EBS has had too many cascading failures to be trustworthy for our production systems. We now run everything on ephemeral drives and use Cassandra distributed across multiple AZs and multiple regions for data persistence.
I highly recommend getting as many servers as you can off EBS.
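For what it's worth, the persistence side of that setup can be surprisingly small. A minimal sketch with the DataStax Python driver - keyspace name, contact points and datacenter names are illustrative, and the DC names have to match whatever snitch you actually run:

    # Sketch: a keyspace replicated across two datacenters (which can map to AWS
    # regions or AZs depending on the snitch), so no single AZ or region holds
    # the only copy of the data. All names are placeholders.
    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1", "10.1.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS appdata
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'us-east': 3,
            'us-west': 3
        }
    """)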
While we have a couple of RDS instances, nothing is production critical. And this: "the root cause of the Multi-AZ [MySql|Oracle|SqlServer] failures we observed during this event will be addressed" only confirms my observations from the RSS history in the dashboard, that in nearly every major EBS-related "service event" (including the ones that happen every few weeks and never get this level of post-mortem), the managed databases, load balancers and config management (beanstalk) services go down too.
When you move from AWS' basic EC2 IaaS VMs with instance (ephemeral|local) storage to EBS-backed (basically vSAN) storage, your multi-month uptime odds go down considerably. But when you step up to the PaaS of managed DBs, load balancing, dynamo, etc., yes, they offload a ton of management drudge, but it's an order of magnitude more fragile.
For me, the unpredictable performance, network contention and instability of EBS just aren't worth it compared to the relatively smaller risk of hardware disk failure I take on from instance-backed nodes. Yes, I know, disks fail - but EBS disks fail a lot, and when they do, good luck fighting the herd to spin up more -- or crap, now, even getting web console access to understand what the hell is happening. That's the irony here - API access (including issuing more IPs!) is "throttled" at precisely the time when you need it most.
My advice? Instance-backed >= large, and roll your own failover/DR/load balancing. Go ahead and "plan for failure" - but do it old school: plan for the more likely case of simple h/w failure, not the EBS control plane and everything that depends on it.
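To make "roll your own failover" concrete, here's a toy watchdog in the spirit of that advice - not production code, and the health endpoint, instance ID and Elastic IP allocation ID are all placeholders:

    # Old-school failover sketch: health-check the primary and, after a few
    # consecutive failures, move an Elastic IP to a warm standby instance.
    import time
    import urllib.request
    import boto3

    PRIMARY_URL = "http://10.0.0.10/health"      # assumed health endpoint
    STANDBY_INSTANCE_ID = "i-0123456789abcdef0"  # placeholder
    EIP_ALLOCATION_ID = "eipalloc-0abc123"       # placeholder

    def primary_healthy(timeout=3):
        try:
            return urllib.request.urlopen(PRIMARY_URL, timeout=timeout).status == 200
        except Exception:
            return False

    ec2 = boto3.client("ec2")
    failures = 0
    while True:
        failures = 0 if primary_healthy() else failures + 1
        if failures >= 3:  # require consecutive failures before acting
            ec2.associate_address(InstanceId=STANDBY_INSTANCE_ID,
                                  AllocationId=EIP_ALLOCATION_ID)
            break
        time.sleep(10)

Of course, that EIP reassociation goes through the same API control plane that gets throttled during these events, which is exactly the catch the parent comment points at.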
When you operate at a scale where the above matters, and unless you need enormous elasticity, running Eucalyptus/SolusVM on your own gear is significantly cheaper.
It's clearly a very complicated setup, and this type of post makes me trust them more. Don't get me wrong, an outage is an outage, but knowing that they are in control and take the time to explain shows respect and the right attitude towards a mistake.
Good for them!
While none of the specific AWS systemic failures may themselves be foreseeable, it is not true that issues of this nature cannot be anticipated: the architecture of their system (and in particular, their insistence on network storage for local data) allows for cascading failure modes in which single failures blossom to systemic ones. AWS is not the only entity to have made this mistake with respect to network storage in the cloud; I, too, was fooled.
We have learned this lesson the hard way, many times over: local storage should be local, even in a distributed system. So while we cannot predict the specifics of the next EBS failure, we can say with absolute certainty that there will be a next failure -- and that it will be one in which the magnitude of the system failure is far greater than the initial failing component or subsystem. With respect to network storage in the cloud, the only way to win is not to play.
So yeah, you're right - these systemic failures can be anticipated, and Amazon's advice (to spread your important infrastructure across multiple AZs) would have protected users from most of this. (I feel quite a lot of sympathy for the engineers involved in the cross-AZ failures this incident revealed - the multi-AZ RDS and ELB failures are things customers doing everything they were told got bitten by anyway, and are probably rightly annoyed...)
IMHO, their biggest design problem is that they build their systems on top of each other (e.g. ELB is built on EBS and EIP). So when one system goes down, it takes down half a dozen others -- this is especially true of EBS, and especially dangerous because the services it takes down, like ELB, are the services people are supposed to be using to route around EBS failures.
Some services - e.g. Heroku - have lots of impacted customers but that's due to their architecture, not the underlying AWS.
June 28th 2012 - region-wide failure due to power outage and EBS dependency issues hitting ELB and RDS: http://aws.amazon.com/message/67457/
March 15th 2012 - 22-minute, region-wide networking interruption (no status report available)
April 2011 - cloudpocalypse, also a cascading EBS failure: http://aws.amazon.com/message/65648/
The other ones say “one of our Availability Zones” - which was rather my point: if you follow long-accepted redundancy practices, you had far less - if any - downtime than people who put everything in one AZ or rely heavily on EBS volumes not failing.
What feels wrong here is that Amazon systems are too trusting of other Amazon internal systems.
Very entertaining IMO. But I would suggest the speaker cut the off-hand ostracism of large groups of people.
> The second group of Multi-AZ instances did not failover automatically because the master database instances were disconnected from their standby for a brief time interval immediately before these master database instances’ volumes became stuck. Normally these events are simultaneous. Between the period of time the masters were disconnected from their standbys and the point where volumes became stuck, the masters continued to process transactions without being able to replicate to their standbys.
Can someone explain this? I thought the entire point of synchronous replication was that the master doesn't acknowledge that a transaction is committed until the data reaches the slave. That's how it's described in the RDS FAQ: http://aws.amazon.com/rds/faqs/#36
I assume from this description that the update protocol looks something like this:
push to sync buddy
wait for sync buddy success
mark state as synced
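If that's roughly the protocol, one guess at where the unreplicated window comes from (this is speculation, not documented RDS behavior, and every name below is hypothetical): once the standby link is considered down, commits degrade to local-only, so anything committed between "standby disconnected" and "master volume stuck" exists only on the master, and automatic failover can't proceed without losing it:

    # Toy model of the guessed protocol above, plus a degraded local-only path.
    class Master:
        def __init__(self):
            self.local_log = []
            self.standby_log = []
            self.standby_connected = True

        def commit(self, txn):
            self.local_log.append(txn)
            if self.standby_connected:
                self.standby_log.append(txn)   # push to sync buddy, wait for success
                return "committed-replicated"  # mark state as synced
            # Degraded path: master keeps accepting writes it cannot replicate.
            return "committed-unreplicated"

    m = Master()
    m.commit("t1")                 # replicated normally
    m.standby_connected = False    # standby link drops first...
    m.commit("t2")                 # ...but the master still commits locally
    # ...then the master's own volume gets stuck: "t2" exists nowhere else, so
    # failing over to the standby would silently lose it - hence masters ending
    # up "stuck" rather than failing over automatically.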
EBS is a particularly complicated piece of software, and RDS is another layer of complication built on top of that.
Bugs clearly happen, and it's an unfortunate state of affairs.
I do have some concerns that they're having too much downtime. If there's one small flaw in the system, it seems the whole thing begins to fail.
If they fix the problem, and it impacts the larger system in some other unknown way, a different, equally crippling issue could present itself in the future. I'd like to be sure they're putting a huge effort into making sure these problems don't happen, and I don't have those assurances at the moment.
First things first, a better status dashboard that actually reflects how issues impact customers is needed. I'd rather have everything working fine and the status be 'red' than have servers down, support tickets, calls, emails, etc and see a 'green' on the dashboard.
I wonder if there is selection bias underlying that judgment? Reading all of Amazon's post-mortems of big events, it does seem that the whole thing is fragile. What I suspect is more likely true is that AWS suffers thousands of small failures every month and most are contained as designed, with no (or minuscule) customer impact. It's the ones that turn into highly visible failures that we read about.
That said, I agree with you that EBS in particular seems to have more downtime than I'd expect. (And the fact that other services like ELB depend on it makes failures cascade in a way that makes it hard to design highly-available systems.)
Isn't that the whole point of moving to The Cloud? There's supposed to be some magical system in place such that hardware failures are routed around and don't interrupt service. Of course you can roll this yourself with your own hardware, but this is done for you.
It should come as no surprise that a system complicated enough to appear magical has some crazy complexity behind the scenes, and that accidental dependencies can result in catastrophic failure.
I think this is a case of selection bias. Most of the time, when there is a small flaw, the system transparently bypasses the flaw and the service continues uninterrupted while they fix the initial problem. Because of this, the only failures that people see are the ones where the bypassing process fails, in which case the problem affects many people.
From the point of view of a single service running on AWS, this is a much more stable system, because it will keep your service up far more often than a system without the auto-bypass infrastructure. From an end-user point of view, this benefit is not so clear, because while any given service is more reliable, services tend to go down at the same time.
Green if everything is fine. Green if there are intermittent problems. Green if nothing works.
I admire that level of transparency.
How many of these massive outages are affecting its competitors that we never hear about?
Now I learned that ELB uses EBS internally. I consider this very bad news, as I inadvertently became dependent on EBS. I intend to stop using ELB.
What does it suggest when they say they "learned about new failure modes"? It suggests there are more failure modes not yet learned.
One wonders if somewhere internally they have a dynamic model of how all this works. If not, might be a good time to build one.