
Netflix: Post-mortem of 22 Oct AWS degradation - jedberg
http://techblog.netflix.com/2012/10/post-mortem-of-october-222012-aws.html
======
dusing
"On Monday, just after 8:30am, we noticed that a couple of large websites that
are hosted on Amazon were having problems and displaying errors."

Our sys admins couldn't get to reddit.

~~~
jedberg
Actually, that is completely true. I was trying to post a link to my Airbnb
talk, noticed reddit was down, and then noticed Airbnb was down too.

One thing we might start doing is actually having alarms when two or more
major AWS sites go down.
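A minimal sketch of that idea: probe a handful of major AWS-hosted sites and fire an alarm only when two or more are down at once, which suggests a provider-wide problem rather than a single-site outage. The site list and function names are illustrative, not anyone's real monitoring config.

```python
import urllib.request
import urllib.error

# Illustrative endpoints only, not a real monitoring config.
SITES = [
    "https://www.netflix.com",
    "https://www.reddit.com",
    "https://www.airbnb.com",
]

def is_down(url, timeout=5):
    """Probe one site: connection failures and 5xx responses count as down."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return False
    except urllib.error.HTTPError as e:
        return e.code >= 500          # server errors count as down
    except (urllib.error.URLError, OSError):
        return True                   # timeouts / connection failures

def aws_wide_alarm(statuses, threshold=2):
    """Fire only when `threshold` or more monitored sites are down at once.

    statuses: mapping of site -> bool (True means the probe failed).
    """
    down = [site for site, failed in statuses.items() if failed]
    return len(down) >= threshold, down

# Wire-up: alarm, down_sites = aws_wide_alarm({s: is_down(s) for s in SITES})
```

Keeping the probe (`is_down`) separate from the decision (`aws_wide_alarm`) makes the correlation logic testable without hitting the network.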

~~~
samstave
Where is the link to said AirBnB talk?

~~~
jedberg
<https://www.airbnb.com/techtalks>

------
snprbob86
> If you like thinking about high availability and how to build more resilient
> systems, we have many openings throughout the company

Money can't buy recruiting opportunities like these. This is exemplary
engineering and marketing.

------
ericcholis
I enjoy reading these Sysops articles from Netflix. They provide a pretty good
blueprint for working inside the cloud.

~~~
3amOpsGuy
They are good. I'd enjoy reading more from their ops guys - likely others
wouldn't though :-( a significant portion of ops concerns would probably
appear unsexy to non ops people, but horses for courses, whatever floats your
boat and all that. I love it.

~~~
waven
3amOpsGuy, is there any chance I could email/contact you somehow regarding
some ops advice? thanks!

~~~
3amOpsGuy
3amopsguy@gmail.com

------
confluence
I saw a comment a few weeks ago where a fellow HNer ran his entire startup
on AWS spot instance pricing - so that he was forced to program in a state of
continuous chaos as his spot instances popped randomly and continuously into
and out of existence while his service was running. It's like programming on
quicksand.

This is probably a step too far - but maybe it is a natural extension of
Netflix's Chaos Monkey.

If you want ~100% uptime with no QoS degradation - your system must be
constantly under catastrophic attack.
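Surviving the "quicksand" in practice means reacting to the spot termination warning. A sketch: poll the EC2 instance metadata service for the spot termination notice and drain work cleanly when it appears. The metadata path below is the documented spot-termination endpoint as I understand it; treat the exact details as an assumption to verify, and the loop structure as illustrative.

```python
import urllib.request
import urllib.error

# Assumed metadata endpoint for the two-minute spot termination warning.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def termination_imminent(timeout=1):
    """True once EC2 has scheduled this spot instance for termination."""
    try:
        urllib.request.urlopen(TERMINATION_URL, timeout=timeout)
        return True                      # 200 with a timestamp body
    except urllib.error.HTTPError:
        return False                     # 404: no termination notice yet
    except (urllib.error.URLError, OSError):
        return False                     # not on EC2 / metadata unreachable

def work_loop(do_one_unit, drain, check=termination_imminent):
    """Process small units of work; hand off cleanly when the notice appears."""
    while not check():
        do_one_unit()
    drain()                              # checkpoint state / requeue jobs
```

Injecting the `check` function keeps the loop testable off-EC2; real code would also pace the polling rather than check before every unit.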

This is probably a major reason why volatility-based risk models in finance
are completely pointless.

Value at risk of any investable security (including cash) is 100% all the
time - any other number is bullshit. All volatility-based risk models are
useful if you like watching squiggly lines or pricing options, but they are
essentially a random anchor that helps us sleep at night (anchoring bias).

------
signifiers
Key line of the article: “Since Netflix focuses on making sure services can
handle individual instance failure and since _we avoid using EBS for data
persistence_ , we still did not see any impact to our service.”

As an architectural choice, avoiding EBS is hotly debated, though many
high-profile systems besides Netflix (SimpleGeo, Sprint.ly) have moved almost
exclusively to EC2 instances backed by local (ephemeral) disk and, as a
result, avoided the pain of the last three major AWS outages.

------
EzGraphs
Netflix has spent a lot of time and energy devising solutions that minimize
disruptions. Are any of you Cloud-Savvy folks using Netflix's tools on Amazon?
If so, are you using them off-the-shelf or did you need to customize them to
suit your site? In particular:

 _Asgard_ is cited in the post as making a "zone evacuation" relatively
straightforward.

 _Astyanax_ (their Cassandra client) is designed with "smarts" that allow it
to choose from available nodes should one or more be unavailable.
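Astyanax itself is a Java library; the following is a language-agnostic sketch (written in Python, with invented names) of the failover behavior described above: try the preferred node, and fall back through the remaining candidates instead of failing the whole request.

```python
class NodeUnavailable(Exception):
    """Raised when a single node cannot serve the request."""

def execute_with_failover(nodes, request):
    """Try each candidate node in order; raise only if every node fails.

    nodes:   ordered list of candidate node names
    request: callable taking a node, returning a result or raising
             NodeUnavailable
    """
    errors = {}
    for node in nodes:
        try:
            return request(node)
        except NodeUnavailable as e:
            errors[node] = e          # remember the failure, try the next node
    raise NodeUnavailable(f"all nodes failed: {errors}")
```

The real client adds more on top of this (token awareness, latency scoring, connection pooling), but the core "smarts" reduce to exactly this retry-on-the-next-replica loop.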

These (and several other tools) are available from Netflix at Github:

<https://github.com/Netflix>

------
shuw
If this practice were widely adopted, I wonder if AWS would experience a bank
run/DoS in the affected and neighboring zones.

~~~
jedberg
It would certainly make reservations a lot more important.

------
smoyer
My impression is that the Netflix team understands AWS better than Amazon does
... But certainly better than most other AWS customers. Kudos

~~~
ams6110
OTOH, most AWS customers are not Netflix and couldn't afford the sort of high
availability architecture Netflix has.

~~~
smoyer
Not "OTOH" ... what you've said is the truth. Some of the methods Netflix uses
still apply, but others are indeed financially non-viable. I'd love to have a
Simian Army of my own, and I think that would translate to any PaaS provider.

------
crb
One of the more interesting findings from the most recent AWS outages is that
Elastic Load Balancing (ELB), the best-practice way to handle multi-AZ
deployment, has a dependency on EBS. I know from various talks that Netflix
try not to use EBS these days, but I wonder if you had any ELB problems, and
how you might have coped if you had evacuated a zone but your LBs were out?

