AMZN has gotten a lot of flak over this outage, and rightly so. But I do want to dissuade anyone from thinking anybody else could do much better. I worked there 10 years ago, when they were closer to 200 engineers, and the caliber of people there at that point was insane. By far the smartest bunch I've ever worked with, and a place where I learned habits that serve me well to this day.
I know the guys who started the AWS group, and they were the best of that already insanely selective group. It is easy to be an armchair coach and scream that the network changes should have been automated in the first place, or that they should have predicted this storm, but that ignores just how fantastically hard what they are doing is and how fantastically well it works 99(how many 9's now?)% of the time.
In short, take my word for it, the people working on this are smarter than you and me, by an order of magnitude. There is no way you could do better, and it is unlikely that if you are building anything that needs more than a handful of servers you could build anything more reliable.
It is easy to be an armchair coach and scream that... they should have predicted this storm
I'm not as smart as the AWS developers, and I have a lot less experience with large-scale distributed systems.
But thanks to my own cluelessness, I've blown up smaller distributed systems, and I've learned one important lesson: Almost nobody is smart enough to understand automatic error-recovery code. Features like automated volume remirroring or multi-AZ failover increase the load on an already stressed system, and they often cause this kind of "storm."
So I've learned to distrust intelligence in these matters. If you want to understand how your system reacts when things start going wrong, you have to find a way to simulate (or cause) large-scale failures:
This is something that Google does really, really well, by the way. I've watched them turn off 25 core routers simultaneously, carrying hundreds of gigabits worth of data, just to verify that what they think will happen does happen. http://news.ycombinator.com/item?id=2475112
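For concreteness, here's a toy sketch of that kind of deliberate mass-failure test. Everything in it is invented for illustration; the Cluster interface is a stand-in for whatever knobs your system actually exposes:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Toy "game day" harness: kill a large fraction of nodes at once and
    // verify the system still converges. Illustrative only.
    final class GameDay {
        interface Cluster {
            List<String> nodes();
            void kill(String node);              // simulate abrupt failure
            boolean healthyWithin(long millis);  // did recovery converge in time?
        }

        static void assertSurvivesMassFailure(Cluster cluster, double fraction) {
            List<String> victims = new ArrayList<>(cluster.nodes());
            Collections.shuffle(victims);
            int toKill = (int) (victims.size() * fraction);
            // Kill simultaneously, not one at a time: single-node failure is
            // the easy case; correlated failure is what triggers recovery storms.
            for (String node : victims.subList(0, toKill)) {
                cluster.kill(node);
            }
            if (!cluster.healthyWithin(10 * 60 * 1000L)) {
                throw new AssertionError("did not recover from " + toKill
                    + "/" + victims.size() + " simultaneous failures");
            }
        }
    }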
You also need to pay particular attention to components with substantial, ongoing problems, and make sure you don't let known issues linger:
I work at Amazon EC2 and I can tell you what's going on (thanks to this handy throwaway account). What's happening is the EBS team gets inundated with support tickets due to their half-assed product. Here's the hilarious part: whenever we've asked them why they don't fix the main issue, they keep telling us that they're too busy with tickets. What they don't seem to realize is that if they fixed the core issue the tickets would go away. http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...
Now, I'm not saying I could have done any better than Amazon (evidence suggests otherwise). But I do know that I'm not smart enough to understand these systems without testing them to destruction, and aggressively fixing the root causes of known problems.
It's basically Test-Driven Development: if you cannot test it, don't write it.
I can't disagree, but there is one key benefit to not using the cloud for some services.
When your company is working on an important deadline, your sysadmins could choose not to implement that pending network configuration change during that crucial period. You can control your own at-risk times, which you can't generally do with IaaS.
As with everything, it's a trade-off.
Just look at the pattern emerging from these kinds of incidents. There's an automatic cluster recovery mechanism that works for individual node failures but makes matters worse once a larger number of nodes fail.
I wonder whether they did extensive testing or simulation of that scenario. The initial trigger is probably unpredictable, because there are many possible triggers, but what follows it is not.
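A back-of-the-envelope sketch of why that part is predictable, with numbers I've made up (Amazon hasn't published theirs): re-mirroring demand grows with the number of failed replicas, while spare capacity stays fixed.

    // Invented numbers, not Amazon's. The point: recovery demand scales
    // with the size of the failure, spare capacity doesn't.
    final class RemirrorStorm {
        public static void main(String[] args) {
            int volumes = 10_000;          // volumes in the cluster (assumed)
            int spareSlots = volumes / 10; // 10% spare capacity (assumed)

            for (int pctFailed = 1; pctFailed <= 20; pctFailed++) {
                int wantMirrors = volumes * pctFailed / 100;
                // One node down: demand << spare, recovery is invisible.
                // Many nodes down: demand > spare, volumes hunt for capacity
                // and the recovery traffic itself becomes the outage.
                System.out.printf("%2d%% failed: %5d volumes vs %5d spare slots%s%n",
                    pctFailed, wantMirrors, spareSlots,
                    wantMirrors > spareSlots ? "  <- storm" : "");
            }
        }
    }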
I'm not ready to concede that because they are such an insanely smart elite group of people we just have to live with week long outages.
In the grand scheme of things, it's still day one for computing services like AWS.
Your start-up is probably not dealing with setting up & maintaining 200K servers like Amazon is, so up to a limit you can actually do better on your own.
But even that is pretty darn hard. You still have to deal with split-brain situations, do all the right things in those cases, etc. If you are a tiny startup that isn't going to hire a dedicated sysadmin (very few of whom can build that type of thing without buying some expensive hardware), then EC2 is probably a better choice.
Even just the hardware costs make it a pretty braindead decision. We just shut down the dedicated three-system setup I built because it didn't make sense financially, and because, despite my spending a month learning the intricacies of Heartbeat, it still had odd failure scenarios that we don't experience with EC2. Again, I'm just not that smart. :)
I think that's everything. It just goes to show that most disasters in very well-engineered systems are generally the result of a series of things all going wrong at once, not of individual failures...
Heck, users of municipal water systems still experience outages and that technology is arguably mature.
The really encouraging thing is that the Amazon post-mortem writeup indicates they're taking it very seriously.
The trigger was a bad and unexpected network configuration change, but the error was that the attempted recovery by the stuck volumes was uncontrolled.
I don't think that anyone is knocking the intelligence of the AWS engineers, or saying that anyone else could do it better. Just as with NASA engineers and scientists, who are incredibly intelligent and good at what they do, systems can become complicated enough that unexpected errors creep into the system as a whole, and not into any particular component.
In either case you can always overlook, or fail to predict, even the easily foreseeable. And that happens for many reasons: plain human error, or even overconfidence, which is sometimes the case with the best.
The best way out of this problem is what Jeff Atwood blogged about some days back: keep failing, and keep failing in different ways. Each failure needs to be translated into lessons of some sort, and its solution into a best practice. Even the Netflix model of failing on purpose will do.
There is no way the best can be flawless. Nothing is flawless, as long as it is done by humans.
If AWS, with some of the smartest engineers, can be down for that long, do you think that our crappy service will be up 100% of the time?
>In short, take my word for it, the people working on this are smarter than you and me, by an order of magnitude. There is no way you could do better, and it is unlikely that if you are building anything that needs more than a handful of servers you could build anything more reliable.
Ever since the AWS outage, I've seen a number of these "the AWS guys are so smart, I've met them" type comments. And then, paraphrased: "There sort of can't be that much to blame on them because of how smart they are, and they're so smart anyway, who could do better?" That's not a valid argument. Not everyone is equally impressed by an individual's intelligence; perhaps your assessment is wrong. And even if someone is insanely smart, they can still commit practical errors, which indicates they are smart but still flawed in their understanding of engineering in significant ways. Perhaps AWS simply does need a higher caliber of engineer, one that wouldn't miss these dead-simple safeguards that would have prevented this outage.
By doing what they do they create the expectation that they _are_ doing better than everybody else.
In the last two years my workplace has gotten pretty good at handling SAN outages (due to terrible Oracle equipment).
Put simply, this set of scenarios can't happen at my workplace. We don't have that level of automation, there's only a pair of SAN systems in the mirrors, and there's no "hunting for capacity".
I'd suggest most businesses are closer to this than AWS.
There's a lot of sugarcoating going around saying "You couldn't build a better space shuttle", and that's probably true. But if I only need an extremely reliable bicycle, that's a false argument. The Simplest Thing That Could Possibly Work doesn't apply just to programming.
AWS EBS outage, Fukushima, Chernobyl, even the great Chicago Fire (forgive me for comparing AWS to those events).
Sure there's always a "root" cause, but more importantly, it's the related events that keep adding up to make the failure even worse. I can only imagine how many minor failures happen world wide on a daily basis where there's only a root cause and no further chain of events.
Once a system is sufficiently complex, I'm not sure it's possible to make it completely fault-tolerant. I'm starting to believe that there's always some chain of events which would lead to a massive failure. And the more complex a system is, the more "chains of failure" exist. It would also become increasingly difficult to plan around failures.
edit: The Logic of Failure is recommended to anyone wanting to know more about this subject: http://www.amazon.com/Logic-Failure-Recognizing-Avoiding-Sit...
"The genius of a construction lies in its simplicity. Everybody can build complicated things."
Also, a couple of other complex systems for your trend: financial markets and commercial jets.
The examples he draws from are nuclear power plant failures (TMI in particular), civil aviation, and oil transport. But the basics will be recognizable to anyone who has dealt with large computing installations: interactive complexity, tight coupling, and cascading failures.
It is not a reassuring book; you won't be able to look at any complex system without asking yourself what sequence of simple, predictable failures of widely separated parts could tip it into a catastrophic failure mode.
During maintenance, instead of shifting traffic off of one of the redundant routers, the traffic was routed onto the lower-capacity network. There was human error involved, but the network issue only provoked latent bugs in the system that should have been caught during disaster-recovery testing.
Automatic recovery that isn't properly tested is a dangerous beast; it can cause problems faster and more broadly than any team of humans is capable of handling.
Here's Twitter's backoff decider implementation (Java).
When I last looked at it I was a little clueless about it; now I see what it's for.
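The idea, roughly, is exponential backoff with jitter. A minimal sketch of my own (not Twitter's actual code):

    import java.util.Random;

    // Exponential backoff with full jitter. Illustrative only; this is
    // not Twitter's actual BackoffDecider.
    final class Backoff {
        private final long baseMillis;
        private final long capMillis;
        private final Random random = new Random();
        private int attempt = 0;

        Backoff(long baseMillis, long capMillis) {
            this.baseMillis = baseMillis;
            this.capMillis = capMillis;
        }

        // Call after a failure: how long to wait before retrying.
        long nextDelayMillis() {
            long exp = Math.min(capMillis, baseMillis << Math.min(attempt, 20));
            attempt++;
            // The jitter is the important part: without it, every failed
            // client retries at the same instant and re-creates the very
            // spike that caused the failure.
            return (long) (random.nextDouble() * exp);
        }

        // Call after a success: reset the schedule.
        void reset() { attempt = 0; }
    }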
HN doesn't have a 140 character limit, so there's no need to post an obfuscated shortened link.
This supports the theory that between 50% and 80% of outages are caused by human error, regardless of the resilience of the underlying infrastructure.
Not quite - in this case, a single human error then triggered a series of latent and undiscovered bugs in the system itself. It's a confluence of small events that makes for a large-scale problem like this.
If not, then with a little hard work and smart work here and there, anybody can beat these 'best' during non-crisis times. And during a crisis, everyone is the same anyway.
That's probably why there are a lot of successful companies even with people of average talent.
I hate to quote Rumsfeld, but there are known unknowns, and unknown unknowns. Of course you want to eliminate the latter, but there's (necessarily) no way you can ever know that you've done so.
Oh yes. It's a classic that deserves to be much better known. Anybody engaged with complex systems - such as software or software projects - will find all kinds of suggestive things in there. As for "dry"... come now, it's hilarious and has cartoons.
Basically, just get it. Here, I'll help:
(They ruined the title but it's the same book.)
Lack of transparency in reaching out to customers was the biggest mistake AWS made. They will learn from their mistakes, and their servers and networks will be more reliable than ever.
This incident has given people a reason to look at multi-cloud operation, for disaster-recovery and backup reasons. AWS's monopoly would be gone; many new standards would be proposed to bring in interoperability and to allow migrations between clouds.
"This required the time-consuming process of physically relocating excess server capacity from across the US East Region and installing that capacity into the degraded EBS cluster."
And if I read this description of the re-mirror storm correctly, I think that implies Amazon had to increase the size of its EBS cluster in the affected zone by 13%, which, considering the timeline, seems fairly impressive.
Really the only purpose of an SLA penalty is to incentivize the provider to keep the network reliable.
But that's just in general.
When negotiating bespoke SLA penalty clauses, it can be very illuminating for both sides to discuss lost profit + lost confidence + additional costs to the customer and suggest that these be factored in to the penalty clause.
My experience: both the customer and supplier tend to take a deep breath to evaluate whether this deal is a good one for either of them and begin to reassess their level of risk.
In an off-the-shelf service like Amazon's, you as a customer are welcome to suggest a change of penalty to your Amazon account manager, and unless you're something like the US government, you will probably be directed to other cloud providers or your own internal IT organisation!
What that suggests to me is that the time has arrived for an external organization, one that sells loss-of-business protection against such failures, to become involved. Such an organization, should enough cloud customers subscribe to it, would become an influence upon services like AWS. I'm not sure I 'like' this idea, but the premise that a customer is using the cloud service at the whim of whatever the provider decides is best practice needs to be revisited.
Their "control plane" network for the EBS clusters span availability zones in a region? If so, this would be the fatal flaw.
The API failures were ultimately tied to the network problems that occurred, not to a failure of the control plane.
EDIT: I should finish reading before I reply. :) It would appear that the network issue in the one availability zone was so severe that the control plane ran out of threads to service API requests to any of the availability zones.
So while it's true the underlying problem was a network issue, the fact that the control plane is spread across availability zones was responsible for part of the outage that occurred across the whole region.
My totally unqualified assessment of this aspect of the outage is that, while it might make sense to have a control plane spread across availability zones, they presumably need to have isolated control planes for each zone, instead of a shared plane as they seemingly have now.
'shared nothing' is the only way to islandize failures.
The following is from the AWS web site:
> Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location.
No mention of tradeoffs.
The quoted statement doesn't say that isolation is 100% or that multiple AZs can't ever ever fail at the same time. It says that if only one AZ goes down and you have servers in another, then those servers will still be up, which should be obvious. Insulated doesn't even mean the same thing as isolated.
Even the name 'Availability Zone' implies that it is isolated from other 'Availability Zones' in the same region. And the text I quoted does nothing but substantiate that inference.
I just think that Amazon are misleading here. Maybe they shouldn't call it an Availability Zone.
Which is probably far more difficult to do properly than I can imagine.
They set up a separate instance of it to help with API calls in the affected region, but it still sounds like it functions across AZs and is still vulnerable overall.
"There are three things we will do to prevent a single Availability Zone from impacting the EBS control plane across multiple Availability Zones. The first is that we will immediately improve our timeout logic to prevent thread exhaustion when a single Availability Zone cluster is taking too long to process requests. … To address the cause of the second API impact, we will also add the ability for our EBS control plane to be more Availability Zone aware and shed load intelligently when it is over capacity. … Additionally, we also see an opportunity to push more of our EBS control plane into per-EBS cluster services. By moving more functionality out of the EBS control plane and creating per-EBS cluster deployments of these services (which run in the same Availability Zone as the EBS cluster they are supporting), we can provide even better Availability Zone isolation for the EBS control plane"
Then the 'EBS control plane' started to fail because 'slow API calls began to back up and resulted in thread starvation'. At that point, the EBS processing resources were oversaturated.
Then other nearby systems got wet.
Q. How do you eat an elephant?
A. One bite at a time
Amazon offers a 10-day credit equal to 100% of usage of EBS volumes, EC2 instances, and RDS database instances.
This credit will be automatically applied to the next bill.
Only a fool would try to run their business out of "the cloud".
The real problem is there is no good mathematical model of distributed behaviour, from which statistical guarantees can be made.
I think we're at the limit of what the smartest people can achieve with hand crafted code.
Most likely new math will give rise to new tools and languages, in which the next generation of reliable distributed systems will be written.
Without this advance we will have storage networks that aren't reliable, an internet that can be taken down by one organization, botnets that are unkillable and patchy network security.
Also the part where you reapply those same abstractions to fix the hole, without realizing that the problem is that you simply don't yet have tools capable of building a robust system - despite the evidence to the contrary.
If a day long outage of this scale is not enough to make us rethink distributed systems, what is?