Post-mortem of Christmas Eve Amazon ELB outage (amazon.com)
90 points by cperciva 1812 days ago | hide | past | web | favorite | 38 comments

The post-mortem is missing important pieces.

1) When was the problem detected? 2) How was the problem detected: automated monitoring, operator vigilance, or customer complaints? 3) When were customers first notified of the outage?

This post-mortem starts out with the root cause, but it is likely that was not known until after the outage had ended, so it's important to know how much time passed between the root cause event and when the on-call engineer(s) started treating the outage as such.

I'd also be interested in hearing about technical measures they will use to prevent this in the future. It sounds like their mitigation measures are all social, not technical. For example, do their configuration tools warn them when they are about to push out a huge delta? Do they have a review process that forces two people to look at the change before it's put into action? Would these have prevented the outage?
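To make the suggestion concrete, here's a minimal sketch of what such a pre-push guard might look like. This is purely illustrative; the function names, the dict-based config model, and the 10% threshold are my own assumptions, not anything Amazon has described:

```python
# Hypothetical pre-push safety check: flag a config change whose delta
# is suspiciously large and require a second reviewer instead of
# applying it automatically. Threshold and structure are illustrative.

def config_delta_fraction(current: dict, proposed: dict) -> float:
    """Fraction of keys added, removed, or changed between two configs."""
    keys = set(current) | set(proposed)
    if not keys:
        return 0.0
    changed = sum(1 for k in keys if current.get(k) != proposed.get(k))
    return changed / len(keys)

def check_push(current: dict, proposed: dict, max_delta: float = 0.10) -> bool:
    """Return True if the push is small enough to apply automatically."""
    delta = config_delta_fraction(current, proposed)
    if delta > max_delta:
        print(f"WARNING: change touches {delta:.0%} of config; "
              "requires a second reviewer")
        return False
    return True
```

A maintenance job that accidentally deleted most of the ELB state data would show up here as a near-100% delta and get stopped at the gate, whereas routine single-LB changes would sail through.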

Finally, I think it's appalling (but not surprising) that it took many hours to revert their configs. It sounds like the tools they are using are way too slow. Any laptop can do an enormous amount of work in an hour; the computing resources available to Amazon can perform truly insane feats in one minute. Slow configuration tools are easy to write but prolong outages. They should be designed to the same standards as production applications.

I think the measures they implement to prevent this failure are adequate. Changing access control patterns so that not a single person but at least two persons must allow the job to run are sufficient as a quick fix. They also mention changing the future architecture to automate such maintenance jobs and to make data recovery faster. So it is indeed a mix of process and technical measures, much as the problem was caused by more of a process problem rather than a technical problem.

Finally, I don't think that the computational part of the data recovery is the time-intensive one. A laptop may churn through gigabytes and gigabytes a minute, but a developer can't. In this case the more time-intensive part is probably figuring out which parts of the data got lost and how to get them back, then writing the tools to automate the process, testing that the data fits, and then putting it safely back into production. I understand that it's annoying, but there's only so many flops a human brain can run.

"The data was deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment."

This is why I love AWS post-mortems: They don't hide things.

Most companies would have said "the outage was caused by a system being accidentally misconfigured" or even less; Amazon, in contrast, admitted (a) that a specific person was identified who made the mistake, (b) that he had access to the systems in question because a process was being run manually which should have been automated, and (c) that if there hadn't been an error in the access control rules, he wouldn't have been able to make that mistake.

I often wish that Amazon was more open about their internal systems; post-mortems are the one time when I'm never wondering what they're not telling me.

Sadly, it's only the postmortems. Their entire status gameplan involves being very vague.

And the developer ought to be fired. He cost me a considerable amount of money since my Heroku site with SSL endpoint was down over 12 hours during one of our biggest days of the year. Of course, Heroku still (over)bills me for those hours for Dynos and Workers.

Of course, I am small potatoes compared to Netflix being down (while Amazon Instant Video wasn't.)

Amazon is populated by jackasses apparently and Heroku isn't much better. Why the f*(k doesn't Heroku have some kind of fall-over or something to protect clients from the inevitable AWS-East failures?

Buh, bye Heroku and AWS. Hello Bluebox. I run other apps on Bluebox and have never had any downtime except for some mild nonsense due to my own damned ignorance. Luckily the Blueboxers saved me from myself. AWS needs to get their shit together.

You weren't down because an Amazon engineer fat-fingered a maintenance command, you were down because you chose to put all of your eggs in one basket. The basket you chose failed, and you had no redundancy basket to fail over to. Choosing a new basket to put all your eggs in isn't going to fix anything just because the new basket hasn't failed on you yet.

Make smarter decisions. If being up on December 24 was really that important to you, you'd have had backup hosting in place with the ability to quickly fail over to it. You'll start being a better engineer when you quit blaming some poor bastard's bad luck for your failures and learn the real lesson from your downtime: you fucked up by not having high-availability hosting to meet your claimed high-availability needs.

The worst thing Amazon could do is fire this guy. They've just paid a lot of money to have him learn from his mistake. They'd be fools to throw that hard-earned experience away.

Interestingly enough a similar article on the front page scapegoated Windows Azure.

People can't own up to the fact that customers don't really care if it's Amazon AWS or Heroku or some other platform that failed -- if your service is down, it's your fault!

> People can't own up to the fact that customers don't really care if it's Amazon AWS or Heroku or some other platform that failed -- if your service is down, it's your fault!

I don't, necessarily, have a problem with saying, without animosity, "we're down because our hosting provider is down". For many small businesses, being down when your hosting provider is down is an acceptable tradeoff -- your business simply isn't critical enough for it to be worth the costs of maintaining fail-over redundancy in case of primary hosting loss.

That's totally fine. Most people probably don't want to pay the higher cost of having every service they use having the ability to survive their main host going down. A couple 9s of uptime is more than enough for the vast majority of businesses.

What I have a problem with is people who were behaving as though they were fine with that tradeoff screaming about how they needed to be up when it comes time to pay in a little downtime for those cheaper hosting costs. Grow up and accept the tradeoffs you choose to make.

"people who were behaving as though they were fine with that tradeoff"

I'm not sure who is to blame at this point -- is it the fault of the individual for not understanding uptime? is it the fault of the service for not articulating clearly what it means? Is it the fault of the industry for inculcating an unjustified sense of entitlement?

> I'm not sure who is to blame at this point -- is it the fault of the individual for not understanding uptime? is it the fault of the service for not articulating clearly what it means? Is it the fault of the industry for inculcating an unjustified sense of entitlement?

I'm inclined to lean towards the third option. It ought to be obvious that there is no such thing as a perfect host, that in the limit your probability of being down at some point goes to 100% as your time hosted with a company increases.

There's just a lot of entitlement and sloppy thinking out there. People believe that their $30-a-month hosting ought to be magically invulnerable in ways that, by the simple laws of physics, it cannot be. If you didn't explicitly choose to have multiple geographically separate hosts with a fail-over plan between them, then you were at the very least implicitly accepting that you were going to be down some of the time when (not if) your only host goes down. That's a fine and ok thing to choose, but I don't think you get to be mad at your host because you didn't clue in to the choice you were making.

In my experience, most really great developers have made similar mistakes at some point in their pasts. Often these mistakes are what catalyze growth and create the scar tissue that helps them (and organizations) to not repeat the same classes of mistakes. You don't fire people when they make mistakes like these, you ask them what they will do to make sure that nobody in the organization is likely to make a similar one. Seems like that is generally Amazon's reaction too.

That said, I understand it's pretty damn frustrating having downtime during peak periods :)

Judging by your two comments in this thread you should probably study statistics for a short while, even a minimal amount of knowledge in that field would have saved you from two mistakes!

If you can't engineer your way around downtime, the fault is yours.

Good work from Amazon putting together the post-mortem. Whether you agree or not with their remediation plans, they hit the important parts of a public post-mortem very well:

1. Demonstrate understanding of the event.

2. Communicate what steps are being taken to reduce the likelihood of a similar event in the future.

3. Apologize.

Not necessarily in that order. Well done.

Want to learn great infrastructure management tips? Read and digest these post-mortems.... Regardless of who the provider is.

There is always one takeaway from these that I can use to better my own infrastructure management activities - even when most of my infrastructure runs on AWS. :)

Wouldn't it be a better idea to learn from a provider that isn't constantly screwing things up? Amazon's systems clearly aren't working reliably.

If Amazon can have an issue of this severity in ELB, one of their core services, during peak traffic season for their #1 highest-profile customer, every one of their services should be viewed with suspicion, and you really need to have non-Amazon backup systems in place in case they have an incident that affects you. Yes, it is painful to leave the comfortable AWS womb, but it's time to grow up and start either managing multiple infrastructure providers or building your own.

Or make a calculated cost/risk assessment, and act accordingly.

As usual, their summaries leave me with more questions than answers.

A maintenance process whose purpose in life is to delete data from the ELB backend database (if that were not the case, you'd see "maint process didn't work right") in such a weird way that it would cause such chaos? Why on earth would such a maint process exist in the first place? I can imagine some possibilities here, of course, but it feels to me there is more to it than what they've chosen to disclose.

Next. So they lose config data, but the data path is for now not impacted. Fine, makes sense. But then the backend, with only partial data, attempts to reconfig running LBs, and doesn't fail completely (as in, it was able to connect and do at least something but not all the actions it was supposed to do), thus forcing otherwise good but impaired LBs into a completely bad state. Sounds suspicious to me.

And then the biggest question - why did they choose to attempt to restore entire backend database when only 6.8% of LBs were impacted?

I also have no idea how a CM process can protect against making a mistake - mistakes happen when somebody is at the controls with or without prior coordination.

All in all, their backend systems are so sophisticated and precisely engineered that any unforeseen/unexpected abnormality caused by manual intervention (be it an inadvertent run of a maint script or fat-fingered traffic re-routing from primary to backup network) inevitably leads to an overwhelming reaction of their automation that makes the problem even worse and extremely hard to recover from.

Very tough position to be in - during these outages, they are essentially fighting the skynet that they themselves created and at their scale there is no way around it.

So hats off to those who've been working on this and good luck taming the beast.

It's amazing that the team worked through what is likely the least fun night of the year to be working to fix this issue.

I do not know if it is still the case, but when I worked at Amazon in the late 1990s, one of the aspects of the "frugal" culture involved all of us working in the Seattle DC during the run-up to the xmas holidays.

You could be standing next to Jeff Bezos, putting books in boxes. The irony for me was that it was one of the few times I could expect to see him, despite my (small slice of an) office being two doors down from his.

Now that there are multiple, largely automated, DCs, I'm not sure if this still happens. Amazon isn't really "the little guy" anymore, at least not in the retail sense.

I wonder who gets stuck working those shifts -- would it just be following the normal schedule, in the name of fairness? Do more senior people get to claim the day of vacation? Do parents get priority over single people?

Most teams I've seen have at least one person willing to bite the bullet voluntarily. Perhaps they don't celebrate Christmas, or they want to do a favor to their team, or they'd prefer to take vacation at a different time. I have volunteered to be on call during Christmas in the past specifically because it is Christmas: other than rare events like this, it's the quietest time of the year. I get a relaxed week of work while everyone else is out, and can then use my vacation on a normal workweek instead. Some managers also give makeup vacation time for people who go on call during the holidays.

Yeah, I've volunteered for such shifts before. Usually there's some "unofficial" extra compensation, extra time off or something similar.

I don't mind being on-call for my service over holidays, because at least for my service there's significantly less traffic, but mainly because all the developers are on vacation, and most outages are caused by new software or modified configs. That was true in this case with AWS. Whoever was in there on Dec 24th modifying load balancer configs should have just taken the day off instead.

They might've had employees that don't celebrate Christmas.

Any large technology company will have lots of employees from cultures that don't celebrate Christmas. One of the nice little upsides of globalization.

But not necessarily a lot of people on each of the relevant teams.

Thanks to AWS for the full disclosure here... I always jump at the chance to read the post-mortems. However, I don't understand why this took Netflix down. This affected ELB only in us-east-1, but surely Netflix is multi-region, at least for their frontend?

12:24 on 12/24. Interesting coincidence. What are the odds?

Just as likely as 12:25 on 12/24.

Though technically correct, this (seemingly) common response to highlighted dates belies the fact that no, 12:24 12/24 is in fact more rare than an arbitrary other minute because of the pattern in its configuration.

Sometimes pedantry gets in the way of appreciating something as small as this.

There are any number of ways a number could be considered "special". It could be prime, it could have no repeated digits, it could have sequential digits, it could be a friendly number, it could be a palindrome, it could be a perfect number, ... I could go on...

Eventually the numbers with nothing "special" about them begin to appear the minority: http://www2.stetson.edu/~efriedma/numbers.html

But it's much more likely for it to land on "any other time" than "on an interestingly repetitive time."

100% That's the beauty of probability after the fact.

If you're using a 24 hour clock, then there should be ~365 of those events per year (can only have one per day by definition, every day has a corresponding day/month to hour:minute pair) -- since we're looking at events that can occur on a minutely basis, that means (days in year)/(minutes in year) so wolfram alpha says 1/1440.

Two provisos: first, it's really late so there's probably something wrong with the math up there; second, that wouldn't count other significant patterns (like, 01/02 03:04 or something) because that's kinda difficult to quantify.
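The estimate above can be sketched in a couple of lines. The assumption (mine, following the parent's reasoning) is that each day has exactly one "matching" minute where hour:minute mirrors month/day, so a uniformly random minute matches with probability 1 in 1440:

```python
# One matching minute per day (e.g. 12:24 on 12/24), out of the
# 24 * 60 minutes in a day, gives the parent's 1/1440 figure.
minutes_per_day = 24 * 60          # 1440
matching_minutes_per_day = 1
probability = matching_minutes_per_day / minutes_per_day
print(f"{probability:.6f}")        # about 0.000694, i.e. 1/1440
```

Equivalently, (days in year)/(minutes in year) = 365/525600 = 1/1440, which is where the Wolfram Alpha figure comes from.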

My brain is having trouble parsing this as something other than "1:2 or 1/2".

And no, those are not the odds.
