Key line of the article: “Since Netflix focuses on making sure services can handle individual instance failure and since we avoid using EBS for data persistence, we still did not see any impact to our service.”
As an architectural choice, avoiding EBS is hotly debated, but many high-profile systems besides Netflix (SimpleGeo, Sprint.ly) have moved almost exclusively to EC2 instance-store-backed (local disk) VMs and, as a result, avoided the pain of the last three major AWS outages.
They are good. I'd enjoy reading more from their ops folks; likely others wouldn't, though :-( A significant portion of ops concerns would probably appear unsexy to non-ops people, but horses for courses, whatever floats your boat, and all that. I love it.
Netflix has spent a lot of time and energy devising solutions that minimize disruptions. Are any of you Cloud-Savvy folks using Netflix's tools on Amazon? If so, are you using them off-the-shelf or did you need to customize them to suit your site? In particular:
Asgard is cited in the post as making a "zone evacuation" relatively straightforward.
Astyanax (their Cassandra client) is designed with "smarts" that let it route requests to the remaining healthy nodes when one or more become unavailable.
These (and several other tools) are available from Netflix on GitHub.
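The client-side failover idea behind Astyanax can be sketched very loosely. This is Python rather than Astyanax's actual Java API, and every name here (`query_with_failover`, `is_up`, `request`) is a hypothetical stand-in, not anything from the real library:

```python
import random

class NodeUnavailable(Exception):
    """Raised when a node cannot serve the request."""

def query_with_failover(nodes, request, is_up):
    # Skip nodes already known to be down, then spread load
    # naively by shuffling the remaining candidates.
    candidates = [n for n in nodes if is_up(n)]
    random.shuffle(candidates)
    for node in candidates:
        try:
            return request(node)
        except NodeUnavailable:
            continue  # fall through to the next replica
    raise NodeUnavailable("no replica could serve the request")
```

The real client is considerably smarter (token-aware routing, latency scoring), but the core property is the same: one dead node costs you a retry, not an outage.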
Not "OTOH" ... what you've said is the truth. Some of the methods Netflix uses still apply, but others are indeed financially non-viable. I'd love to have a Simian Army of my own, and I think that would translate to any PaaS provider.
One of the more interesting findings from the most recent AWS outages is that Elastic Load Balancing (ELB), the best-practice way to handle multi-AZ deployment, has a dependency on EBS. I know from various talks that Netflix tries not to use EBS these days, but I wonder whether you had any ELB problems, and how you would have coped if you had evacuated a zone but your load balancers were out?
I saw a comment a few weeks ago where a fellow HNer ran his entire startup on AWS spot pricing, so that he was forced to program in a state of continuous chaos as his spot instances popped randomly and continuously into and out of existence while his service was running. It's like programming on quicksand.
This is probably a step too far - but maybe it is a natural extension of NFLX's Chaos monkey.
If you want ~100% uptime with no QoS degradation, your system must be constantly under catastrophic attack.
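Surviving that kind of quicksand mostly comes down to making work resumable and idempotent, so that any instance can die mid-task and a replacement picks up where it left off. A minimal sketch, where `run_jobs`, `checkpoint_path`, and `handle` are all illustrative names and nothing here is a real AWS or Netflix API:

```python
import json
import os

def run_jobs(jobs, checkpoint_path, handle):
    """Resumable worker loop: after each job, write a cursor to
    durable storage, so losing the instance loses at most the
    single in-flight job (which must therefore be idempotent)."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]
    for i in range(done, len(jobs)):
        handle(jobs[i])  # may re-run once after a crash
        with open(checkpoint_path, "w") as f:
            json.dump({"done": i + 1}, f)  # checkpoint survives the instance
```

In a real spot-instance setup the checkpoint would live somewhere durable (S3, a queue, a database) rather than on the instance's own disk, for exactly the reasons discussed above.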
This is probably a major reason why vol-based risk models in finance are completely pointless.
Value at risk for any investable security (including cash) is 100% all the time; any other number is bullshit. All volatility-based risk models are useful if you like watching squiggly lines or pricing options, but they are essentially a random anchor that helps us sleep at night (anchoring bias).