1) When was the problem detected?
2) How was the problem detected: automated monitoring, operator vigilance, or customer complaints?
3) When were customers first notified of the outage?
This post-mortem starts out with the root cause, but that was likely not known until after the outage had ended, so it's important to know how much time passed between the root-cause event and when the on-call engineer(s) started treating the outage as such.
I'd also be interested in hearing about the technical measures they will use to prevent this in the future. It sounds like their mitigation measures are all social, not technical. For example, do their configuration tools warn them when they are about to push out a huge delta? Do they have a review process that forces two people to look at a change before it's put into action? Would these have prevented the outage?
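The "warn on a huge delta" idea is easy to sketch. This is a hypothetical pre-push sanity check, not Amazon's actual tooling; all names and the 10% threshold are invented for illustration:

```python
# Hypothetical pre-push guard: block a config push whose delta touches
# a suspiciously large fraction of the existing config. Threshold and
# names are illustrative assumptions, not any real tool's behavior.

def config_delta(old: dict, new: dict) -> set:
    """Keys added, removed, or changed between two config snapshots."""
    changed = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    return (old.keys() ^ new.keys()) | changed

def check_push(old: dict, new: dict, max_fraction: float = 0.1) -> bool:
    """Return True if the push looks safe; False if it needs human review."""
    if not old:
        return True  # first push: nothing to compare against
    return len(config_delta(old, new)) / len(old) <= max_fraction

old = {f"lb-{i}": "cfg" for i in range(100)}
small = dict(old, **{"lb-0": "new-cfg"})    # 1% of entries changed
huge = {k: "cfg" for k in list(old)[:10]}   # 90% of entries deleted!

assert check_push(old, small)      # small delta: allow
assert not check_push(old, huge)   # huge delta: stop and ask a second pair of eyes
```

A check like this would have flagged a maintenance run that deleted most of the ELB state data before it went anywhere.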
Finally, I think it's appalling (but not surprising) that it took many hours to revert their configs. It sounds like the tools they are using are way too slow. Any laptop can do an enormous amount of work in an hour; the computing resources available to Amazon can perform truly insane feats in one minute. Slow configuration tools are easy to write but prolong outages. They should be designed to the same standards as production applications.
Finally, I don't think that the computational part of the data recovery is the time-intensive one. A laptop may churn through gigabytes and gigabytes a minute, but a developer can't. In this case the more time-intensive part is probably figuring out which parts of the data got lost and how to get them back, then writing the tools to automate the process, testing that the data fits, and then putting it safely back in production. I understand that it's annoying, but there's only so many flops a human brain can run.
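The kind of one-off recovery tooling described above might look like this. A purely hypothetical sketch, assuming the data can be modeled as key-value records and that a known-good snapshot exists; the dry-run step is the "test that the data fits" part:

```python
# Illustrative reconciliation sketch: diff a known-good snapshot against
# the live store to find lost records, then restore only those.
# All names and the dict-based data model are assumptions for the example.

def find_missing(snapshot: dict, live: dict) -> dict:
    """Records present in the snapshot but absent from the live store."""
    return {k: v for k, v in snapshot.items() if k not in live}

def restore(live: dict, missing: dict, dry_run: bool = True) -> dict:
    """Merge lost records back in; dry_run lets a human verify the plan first."""
    if dry_run:
        print(f"would restore {len(missing)} records")
        return live
    return {**live, **missing}

snapshot = {"a": 1, "b": 2, "c": 3}
live = {"a": 1}                      # "b" and "c" were lost
missing = find_missing(snapshot, live)
restore(live, missing)               # inspect the plan first...
live = restore(live, missing, dry_run=False)  # ...then actually merge
```

The code itself is trivial; the slow part, as the comment says, is knowing that this diff is the right one to take.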
Most companies would have said "the outage was caused by a system being accidentally misconfigured" or even less; Amazon, in contrast, admitted (a) that a specific person was identified who made the mistake, (b) that he had access to the systems in question because a process was being run manually which should have been automated, and (c) that if there hadn't been an error in the access control rules, he wouldn't have been able to make that mistake.
I often wish that Amazon was more open about their internal systems; post-mortems are the one time when I'm never wondering what they're not telling me.
Of course, I am small potatoes compared to Netflix being down (while Amazon Instant Video wasn't).
Amazon is populated by jackasses apparently and Heroku isn't much better. Why the f*(k doesn't Heroku have some kind of failover or something to protect clients from the inevitable AWS-East failures?
Buh, bye Heroku and AWS. Hello Bluebox. I run other apps on Bluebox and have never had any downtime except for some mild nonsense due to my own damned ignorance. Luckily the Blueboxers saved me from myself. AWS needs to get their shit together.
Make smarter decisions. If being up on December 24 was really that important to you, you'd have had backup hosting in place with the ability to quickly fail over to it. You'll start being a better engineer when you quit blaming some poor bastard's bad luck for your failures and learn the real lesson from your downtime: you fucked up by not having high-availability hosting to meet your claimed high-availability needs.
The worst thing Amazon could do is fire this guy. They've just paid a lot of money to have him learn from his mistake. They'd be fools to throw that hard-earned experience away.
People can't own up to the fact that customers don't really care if it's Amazon AWS or Heroku or some other platform that failed -- if your service is down, it's your fault!
I don't, necessarily, have a problem with saying, without animosity, "we're down because our hosting provider is down". For many small businesses, being down when your hosting provider is down is an acceptable tradeoff -- your business simply isn't critical enough for it to be worth the costs of maintaining fail-over redundancy in case of primary hosting loss.
That's totally fine. Most people probably don't want to pay the higher cost of having every service they use be able to survive its main host going down. A couple 9s of uptime is more than enough for the vast majority of businesses.
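For concreteness, here is what "a couple 9s" actually buys in allowed downtime per year; simple arithmetic, no assumptions beyond a 365-day year:

```python
# Allowed downtime per year for common uptime targets.
HOURS_PER_YEAR = 365 * 24  # 8760

for label, uptime in [("two nines", 0.99),
                      ("three nines", 0.999),
                      ("four nines", 0.9999)]:
    downtime_h = HOURS_PER_YEAR * (1 - uptime)
    print(f"{label} ({uptime:.2%}): {downtime_h:.2f} hours/year")
```

Two nines allows roughly 87.6 hours of downtime a year; even three nines allows almost nine hours, which is in the same ballpark as this outage.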
What I do have a problem with is people who behaved as though they were fine with that tradeoff screaming about how they needed to be up when it comes time to pay in a little downtime for those cheaper hosting costs. Grow up and accept the tradeoffs you choose to make.
I'm not sure who is to blame at this point -- is it the fault of the individual for not understanding uptime? Is it the fault of the service for not articulating clearly what it means? Is it the fault of the industry for inculcating an unjustified sense of entitlement?
I'm inclined to lean towards the third option. It ought to be obvious that there is no such thing as a perfect host, that in the limit your probability of being down at some point goes to 100% as your time hosted with a company increases.
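That limiting argument is easy to put in numbers. Assuming, purely for illustration, that each month carries an independent 1% chance of an outage:

```python
# The "goes to 100%" claim in numbers: with an assumed independent 1%
# monthly outage probability, the chance of at least one outage
# approaches certainty as the months of hosting add up.

def p_at_least_one_outage(p_monthly: float, months: int) -> float:
    return 1 - (1 - p_monthly) ** months

for months in (1, 12, 60, 240):
    p = p_at_least_one_outage(0.01, months)
    print(f"{months:4d} months: {p:.1%} chance of at least one outage")
```

After a year you're already above a 10% chance; after twenty years, above 90%. The exact monthly probability is a made-up number, but the shape of the curve is the point.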
There's just a lot of entitlement and sloppy thinking out there. People believe that their $30 a month hosting ought to be magically invulnerable in ways that, by the simple laws of physics, it cannot be. If you didn't explicitly choose to have multiple geographically separate hosts with a fail-over plan between them, then you were at the very least implicitly accepting that you were going to be down some of the time when (not if) your only host goes down. That's a fine and okay thing to choose, but I don't think you get to be mad at your host because you didn't clue in to the choice you were making.
That said, I understand it's pretty damn frustrating having downtime during peak periods :)
1. Demonstrate understanding of the event.
2. Communicate what steps are being taken to reduce the likelihood of a similar event in the future.
Not necessarily in that order. Well done.
There is always one takeaway from these that I can use to better my own infrastructure management activities - even when most of my infrastructure runs on AWS. :)
If Amazon can have an issue of this severity in ELB, one of their core services, during peak traffic season for their #1 highest-profile customer, every one of their services should be viewed with suspicion, and you really need to have non-Amazon backup systems in place in case they have an incident that affects you. Yes, it is painful to leave the comfortable AWS womb, but it's time to grow up and either start managing multiple infrastructure providers or build your own.
A maintenance process whose purpose in life is to delete data from the ELB backend database (if that were not the case, you'd see "maint process didn't work right"), in such a weird way that it would cause such chaos? Why on earth would such a maint process exist in the first place? I can imagine some possibilities here, of course, but it feels to me there is more to it than what they've chosen to disclose.
Next: so they lose config data, but the data path is for now not impacted. Fine, makes sense. But then the backend, with only partial data, attempts to reconfigure running LBs and doesn't fail completely (as in, it was able to connect and do at least some, but not all, of the actions it was supposed to), thus forcing otherwise good but impaired LBs into a completely bad state. Sounds suspicious to me.
And then the biggest question -- why did they choose to attempt to restore the entire backend database when only 6.8% of LBs were impacted?
I also have no idea how a CM process can protect against making a mistake -- mistakes happen when somebody is at the controls, with or without prior coordination.
All in all, their backend systems are so sophisticated and precisely engineered that any unforeseen or unexpected abnormality caused by manual intervention (be it an inadvertent run of a maint script or fat-fingered traffic re-routing from the primary to the backup network) inevitably leads to an overwhelming reaction from their automation that makes the problem even worse and extremely hard to recover from.
Very tough position to be in -- during these outages, they are essentially fighting the Skynet that they themselves created, and at their scale there is no way around it.
So hats off to those who've been working on this and good luck taming the beast.
You could be standing next to Jeff Bezos, putting books in boxes. The irony for me was that it was one of the few times I could expect to see him, despite my (small slice of an) office being two doors down from his.
Now that there are multiple, largely automated, DCs, I'm not sure if this still happens. Amazon isn't really "the little guy" anymore, at least not in the retail sense.
Sometimes pedantry gets in the way of appreciating something as small as this.
Eventually the numbers with nothing "special" about them begin to look like the minority: http://www2.stetson.edu/~efriedma/numbers.html
Two provisos: first, it's really late, so there's probably something wrong with the math up there; second, that wouldn't count other significant patterns (like 01/02 03:04 or something) because that's kinda difficult to quantify.
And no, those are not the odds.