Hacker News
Netflix: Post-mortem of 22 Oct AWS degradation (netflix.com)
97 points by jedberg 533 days ago | comments


dusing 533 days ago | link

"On Monday, just after 8:30am, we noticed that a couple of large websites that are hosted on Amazon were having problems and displaying errors."

Our sys admins couldn't get to reddit.

-----

jedberg 533 days ago | link

Actually, that is completely true. I was trying to post a link to my Airbnb talk, noticed reddit was down, and then noticed Airbnb was down too.

One thing we might start doing is actually having alarms when two or more major AWS sites go down.
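
A minimal sketch of that alarm idea, assuming some other component already produces an up/down result per site. The function names, the site labels, and the threshold of two are illustrative assumptions, not anything Netflix or reddit is known to run:

```python
from typing import List, Mapping


def down_sites(site_up: Mapping[str, bool]) -> List[str]:
    """Return the names of monitored sites currently reported as down."""
    return sorted(name for name, up in site_up.items() if not up)


def should_alarm(site_up: Mapping[str, bool], threshold: int = 2) -> bool:
    """Fire when at least `threshold` monitored sites are down at once.

    A single site failing is usually that site's own problem; several
    failing together hints at a shared cause such as the hosting provider.
    """
    return len(down_sites(site_up)) >= threshold


# Hypothetical probe results; a real check would come from HTTP health probes.
status = {"reddit": False, "airbnb": False, "other-site": True}
if should_alarm(status):
    print("Possible AWS-wide problem; down:", ", ".join(down_sites(status)))
```

Keeping the decision separate from the probing makes the threshold easy to tune and test without touching the network.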

-----

azylman 533 days ago | link

This is completely unrelated to the post, but I was at the Airbnb tech talk and it was extremely interesting - thanks for putting that on!

-----

samstave 533 days ago | link

Where is the link to said Airbnb talk?

-----

jedberg 533 days ago | link

https://www.airbnb.com/techtalks

-----

snprbob86 533 days ago | link

> If you like thinking about high availability and how to build more resilient systems, we have many openings throughout the company

Money can't buy recruiting opportunities like these. This is exemplary engineering and marketing.

-----

signifiers 533 days ago | link

Key line of the article: “Since Netflix focuses on making sure services can handle individual instance failure and since we avoid using EBS for data persistence, we still did not see any impact to our service.”

As an architecture decision, the choice to avoid EBS is hotly debated, though many high-profile systems besides Netflix (SimpleGeo, Sprint.ly) have moved almost exclusively to EC2 instance store-backed (local disk) VMs and, as a result, avoided the pain of the last 3 major AWS outages.

-----

ericcholis 533 days ago | link

I enjoy reading these Sysops articles from Netflix. They provide a pretty good blueprint for working inside the cloud.

-----

3amOpsGuy 533 days ago | link

They are good. I'd enjoy reading more from their ops guys, though others likely wouldn't :-( A significant portion of ops concerns would probably appear unsexy to non-ops people, but horses for courses, whatever floats your boat and all that. I love it.

-----

waven 533 days ago | link

3amOpsGuy, is there any chance I could email/contact you somehow regarding some ops advice? thanks!

-----

3amOpsGuy 533 days ago | link

3amopsguy@gmail.com

-----

shuw 533 days ago | link

If this practice were widely adopted, I wonder if AWS would experience a bank run/DoS in the affected and neighboring zones.

-----

jedberg 533 days ago | link

It would certainly make reservations a lot more important.

-----

Terretta 532 days ago | link

You don't have to wonder. Every single AZ AWS postmortem says that's exactly what happens.

-----

EzGraphs 533 days ago | link

Netflix has spent a lot of time and energy devising solutions that minimize disruptions. Are any of you Cloud-Savvy folks using Netflix's tools on Amazon? If so, are you using them off-the-shelf or did you need to customize them to suit your site? In particular:

Asgard is cited in the post as making a "zone evacuation" relatively straightforward.

Astyanax (their Cassandra client) is designed with "smarts" that allow it to choose from available nodes should one or more be unavailable.

These (and several other tools) are available from Netflix at Github:

https://github.com/Netflix

-----

smoyer 533 days ago | link

My impression is that the Netflix team understands AWS better than Amazon does ... but certainly better than most other AWS customers. Kudos.

-----

ams6110 533 days ago | link

OTOH, most AWS customers are not Netflix and couldn't afford the sort of high availability architecture Netflix has.

-----

smoyer 532 days ago | link

Not "OTOH" ... what you've said is the truth. Some of the methods Netflix uses still apply but others are indeed financially non-viable. I'd love to have a Simian Army of my own and I think that would translate to any PaaS provider.

-----

crb 533 days ago | link

One of the more interesting findings from the most recent AWS outages is that Elastic Load Balancing (ELB), the best-practice way to handle multi-AZ deployment, has a dependency on EBS. I know from various talks that Netflix try not to use EBS these days, but I wonder if you had any ELB problems, and how you might have coped if you had evacuated a zone but your LBs were out?

-----

confluence 533 days ago | link

I saw a comment a few weeks ago where one fellow HNer ran his entire startup on AWS spot instance pricing - so that he was forced to program in a state of continuous chaos as his demand/spot instances popped randomly and continuously into and out of existence while his service was running. It's like programming on quicksand.

This is probably a step too far - but maybe it is a natural extension of NFLX's Chaos Monkey.

If you want ~100% uptime with no QoS degradation - your system must be constantly under catastrophic attack.

This is probably a major reason why vol based risk models in finance are completely pointless.

Value at risk of any investable securities (including cash) is 100% all the time - any other number is bullshit. All volatility based risk models are useful if you like watching squiggly lines or pricing options, but they are essentially a random anchor that helps us sleep at night (anchoring bias).

-----



