

Post-mortem of Christmas Eve Amazon ELB outage - cperciva
http://aws.amazon.com/message/680587/

======
thrownaway2424
The post-mortem is missing important pieces.

1) When was the problem detected? 2) How was the problem detected: automated
monitoring, operator vigilance, or customer complaints? 3) When were customers
first notified of the outage?

This post-mortem starts out with the root cause, but it is likely that was not
known until after the outage had ended, so it's important to know how much
time passed between the root cause event and when the on-call engineer(s)
started treating the outage as such.

I'd also be interested in hearing about technical measures they will use to
prevent this in the future. It sounds like their mitigation measures are all
social, not technical. For example, do their configuration tools warn them
when they are about to push out a huge delta? Do they have a review process
that forces two people to look at the change before it's put into action?
Would these have prevented the outage?
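
To make that concrete, here is a rough sketch (hypothetical names, nothing to
do with Amazon's actual tooling) of the kind of delta guard I mean:

    # Hypothetical pre-push guard: refuse to apply a config change that
    # touches a large fraction of existing records unless two distinct
    # people have approved it.
    def check_delta(current, proposed, max_fraction=0.05, approvers=()):
        removed = current.keys() - proposed.keys()
        changed = {k for k in current.keys() & proposed.keys()
                   if current[k] != proposed[k]}
        delta = (len(removed) + len(changed)) / max(len(current), 1)
        if delta > max_fraction and len(set(approvers)) < 2:
            raise RuntimeError(
                "Change touches %.0f%% of records; two approvers required."
                % (delta * 100))

Something like check_delta(live_state, new_state, approvers=["alice"]) would
then have flagged an oversized, single-operator deletion before it went out.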

Finally, I think it's appalling (but not surprising) that it took many hours
to revert their configs. It sounds like the tools they are using are way too
slow. Any laptop can do an enormous amount of work in an hour; the computing
resources available to Amazon can perform truly insane feats in one minute.
Slow configuration tools are easy to write but prolong outages. They should be
designed to the same standards as production applications.

~~~
Xylakant
I think the measures they implement to prevent this failure are adequate.
Changing the access controls so that at least two people, rather than a single
person, must approve the job before it runs is a sufficient quick fix. They
also mention changing the future architecture to automate such maintenance
jobs and to make data recovery faster. So it is indeed a mix of process and
technical measures, much as the problem itself was more a process failure than
a technical one.

Finally, I don't think that the computational part of the data recovery is the
time-intensive one. A laptop may churn through gigabytes and gigabytes a
minute, but a developer can't. In this case the more time-intensive part is
probably figuring out which parts of the data got lost and how to get them
back, then writing the tools to automate the process, testing that the data
fits, and then putting it safely back into production. I understand that it's
annoying, but there's only so many flops a human brain can run.

------
mmastrac
"The data was deleted by a maintenance process that was inadvertently run
against the production ELB state data. This process was run by one of a very
small number of developers who have access to this production environment."

~~~
briandear
And the developer ought to be fired. He cost me a considerable amount of money
since my Heroku site with SSL endpoint was down over 12 hours during one of
our biggest days of the year. Of course, Heroku still (over)bills me for those
hours of Dynos and Workers.

Of course, I am small potatoes compared to Netflix being down (while Amazon
Instant Video wasn't).

Amazon is populated by jackasses apparently and Heroku isn't much better. Why
the f*(k doesn't Heroku have some kind of failover or something to protect
clients from the inevitable AWS-East failures?

Buh, bye Heroku and AWS. Hello Bluebox. I run other apps on Bluebox and have
never had any downtime except for some mild nonsense due to my own damned
ignorance. Luckily the Blueboxers saved me from myself. AWS needs to get their
shit together.

~~~
msbarnett
You weren't down because an Amazon engineer fat-fingered a maintenance
command, you were down because _you_ chose to put all of your eggs in one
basket. The basket you chose failed, and you had no redundancy basket to fail
over to. Choosing a new basket to put all your eggs in isn't going to fix
anything just because the new basket hasn't failed on you yet.

Make smarter decisions. If being up on December 24 was really that important
to you, you'd have had backup hosting in place with the ability to quickly
fail over to it. You'll start being a better engineer when you quit blaming
some poor bastard's bad luck for your failures and learn the real lesson from
your downtime: you fucked up by not having high-availability hosting to meet
your claimed high-availability needs.

The worst thing Amazon could do is fire this guy. They've just paid a lot of
money to have him learn from his mistake. They'd be fools to throw that hard-
earned experience away.

~~~
niggler
Interestingly enough, a similar article on the front page scapegoated Windows
Azure.

People can't own up to the fact that customers don't really care if it's
Amazon AWS or Heroku or some other platform that failed -- if your service is
down, it's your fault!

~~~
msbarnett
> People can't own up to the fact that customers don't really care if it's
> Amazon AWS or Heroku or some other platform that failed -- if your service
> is down, it's your fault!

I don't, necessarily, have a problem with saying, without animosity, "we're
down because our hosting provider is down". For many small businesses, being
down when your hosting provider is down is an acceptable tradeoff -- your
business simply isn't critical enough for it to be worth the costs of
maintaining fail-over redundancy in case of primary hosting loss.

That's totally fine. Most people probably don't want to pay the higher cost of
having every service they use be able to survive its main host going down. A
couple 9s of uptime is more than enough for the vast majority of businesses.

What I do have a problem with is people who were behaving as though they were
fine with that tradeoff, then screaming about how they needed to be up when it
comes time to pay in a little downtime for those cheaper hosting costs. Grow
up and accept the tradeoffs you choose to make.

~~~
niggler
"people who were behaving as though they were fine with that tradeoff"

I'm not sure who is to blame at this point -- is it the fault of the
individual for not understanding uptime? is it the fault of the service for
not articulating clearly what it means? Is it the fault of the industry for
inculcating an unjustified sense of entitlement?

~~~
msbarnett
> I'm not sure who is to blame at this point -- is it the fault of the
> individual for not understanding uptime? is it the fault of the service for
> not articulating clearly what it means? Is it the fault of the industry for
> inculcating an unjustified sense of entitlement?

I'm inclined to lean towards the third option. It ought to be obvious that
there is no such thing as a perfect host, and that the probability of being
down at some point approaches 100% as your time hosted with any one company
increases.

There's just a lot of entitlement and sloppy thinking out there. People
believe that their $30 a month hosting ought to be magically invulnerable in
ways that, by the simple laws of physics, it cannot be. If you didn't explicitly
choose to have multiple geographically separate hosts with a fail-over plan
between them, then you were at the very least implicitly accepting that you
were going to be down some of the time when (not if) your only host goes down.
That's a fine and ok thing to choose, but I don't think you get to be mad at
your host because you didn't clue in to the choice you were making.

------
imbriaco
Good work from Amazon putting together the post-mortem. Whether or not you
agree with their remediation plans, they hit the important parts of a public
post-mortem very well:

1. Demonstrate understanding of the event.

2. Communicate what steps are being taken to reduce the likelihood of a
similar event in the future.

3. Apologize.

Not necessarily in that order. Well done.

------
blantonl
Want to learn great infrastructure management tips? Read and digest these
post-mortems, regardless of who the provider is.

There is always one takeaway from these that I can use to better my own
infrastructure management activities - even when most of my infrastructure
runs on AWS. :)

~~~
23david
Wouldn't it be a better idea to learn from a provider that isn't constantly
screwing things up? Amazon's systems clearly aren't working reliably.

If Amazon can have an issue of this severity in ELB, one of their core
services, during peak traffic season, for their #1 highest-profile customer,
then every one of their services should be viewed with suspicion, and you
really need to have non-Amazon backup systems in place in case they have an
incident that affects you. Yes, it is painful to leave the comfortable AWS
womb, but it's time to grow up and start either managing multiple
infrastructure providers or building your own.

~~~
krisoft
Or make a calculated cost/risk assessment, and act accordingly.

------
somic
As usual, their summaries leave me with more questions than answers.

A maintenance process whose purpose in life is to delete data from the ELB
backend database (if that were not the case, you'd see "the maintenance
process didn't work right") ran in such a weird way that it caused this much
chaos? Why on earth would such a maintenance process exist in the first place?
I can imagine some possibilities here, of course, but it feels to me there is
more to it than what they've chosen to disclose.

Next: so they lose config data, but the data path is, for now, not impacted.
Fine, makes sense. But then the backend, with only partial data, attempts to
reconfigure running LBs and doesn't fail completely (as in, it was able to
connect and do at least something, but not all the actions it was supposed to
do), thus forcing otherwise good but impaired LBs into a completely bad state.
Sounds suspicious to me.
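
Purely as an illustration (I obviously don't know what their control plane
actually does), this is the kind of sanity check I'd expect before a backend
pushes a derived config onto a healthy LB:

    # Illustrative only: refuse to apply a derived config to a running LB
    # when the control-plane snapshot looks suspiciously incomplete compared
    # to what the LB is already serving. Both arguments are sets of backend
    # instance IDs.
    def should_apply(running_backends, derived_backends, min_ratio=0.5):
        if not derived_backends:
            return False  # never replace a working config with an empty one
        overlap = len(derived_backends & running_backends)
        return overlap / max(len(running_backends), 1) >= min_ratio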

And then the biggest question - why did they choose to attempt to restore
entire backend database when only 6.8% of LBs were impacted?

I also have no idea how a CM process can protect against making a mistake -
mistakes happen when somebody is at the controls with or without prior
coordination.

All in all, their backend systems are so sophisticated and precisely
engineered that any unforeseen/unexpected abnormality caused by manual
intervention (be it an inadvertent run of a maintenance script or fat-fingered
traffic re-routing from the primary to the backup network) inevitably leads to
an overwhelming reaction from their automation that makes the problem even
worse and extremely hard to recover from.

Very tough position to be in - during these outages, they are essentially
fighting the Skynet that they themselves created, and at their scale there is
no way around it.

So hats off to those who've been working on this and good luck taming the
beast.

------
______
It's amazing that the team worked through what is likely the least fun working
night of the year to fix this issue.

~~~
cperciva
I wonder who gets stuck working those shifts -- would it just be following the
normal schedule, in the name of fairness? Do more senior people get to claim
the day of vacation? Do parents get priority over single people?

~~~
peripitea
Most teams I've seen have at least one person willing to bite the bullet
voluntarily. Perhaps they don't celebrate Christmas, or they want to do their
team a favor, or they'd prefer to take vacation at a different time. I
have volunteered to be on call during Christmas in the past specifically
because it is Christmas: other than rare events like this, it's the quietest
time of the year. I get a relaxed week of work while everyone else is out, and
can then use my vacation on a normal workweek instead. Some managers also give
makeup vacation time for people who go on call during the holidays.

~~~
waterlesscloud
Yeah, I've volunteered for such shifts before. Usually there's some
"unofficial" extra compensation, extra time off or something similar.

------
tedchs
Thanks to AWS for the full disclosure here... I always jump at the chance to
read the post-mortems. However, I don't understand why this took Netflix down.
This affected ELB only in us-east-1, but surely Netflix is multi-region, at
least for their frontend?

------
briandear
12:24 on 12/24. Interesting coincidence. What are the odds?

~~~
ryanpetrich
Just as likely as 12:25 on 12/24.
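
Back-of-the-envelope, assuming an outage is equally likely to start at any
minute of the day, every specific minute (12:24 included) comes out the same:

    # chance that a uniformly random minute of the day is exactly 12:24
    minutes_per_day = 24 * 60
    print(1 / minutes_per_day)  # ~0.000694, about 0.07% -- same for 12:25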

~~~
bigdubs
Though technically correct, this (seemingly) common response to highlighted
dates ignores the fact that no, 12:24 on 12/24 is in fact rarer than an
arbitrary other minute because of the pattern in its configuration.

Sometimes pedantry gets in the way of appreciating something as small as this.

~~~
jlgreco
There are any number of ways a number could be considered "special". It could
be prime, it could have no repeated digits, it could have sequential digits,
it could be a friendly number, it could be a palindrome, it could be a perfect
number, ... I could go on...

Eventually the numbers with nothing "special" about them begin to look like
the minority: <http://www2.stetson.edu/~efriedma/numbers.html>

