Our sys admins couldn't get to reddit.
One thing we might start doing is setting up alarms for when two or more major AWS-hosted sites go down.
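A minimal sketch of that alarm rule, assuming some separate health checker already produces an up/down flag per monitored site. The function name `should_alarm`, the `threshold` parameter, and the site names are all made up for illustration:

```python
from typing import Mapping


def should_alarm(site_status: Mapping[str, bool], threshold: int = 2) -> bool:
    """Return True when at least `threshold` monitored sites are down.

    `site_status` maps a site name to True (reachable) or False (down).
    How reachability is measured is out of scope for this sketch.
    """
    down = [site for site, is_up in site_status.items() if not is_up]
    return len(down) >= threshold


# Hypothetical status snapshot: two of three monitored sites are down.
snapshot = {"site-a": False, "site-b": False, "site-c": True}
if should_alarm(snapshot):
    print("ALARM: multiple AWS-hosted sites appear to be down")
```

Requiring two or more sites keeps a single unrelated outage from paging anyone; the threshold is a guess and would need tuning.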
Money can't buy recruiting opportunities like these. This is exemplary engineering and marketing.
This is probably a step too far, but maybe it is a natural extension of Netflix's Chaos Monkey.
If you want ~100% uptime with no QoS degradation, your system must be constantly under catastrophic attack.
This is probably a major reason why volatility-based risk models in finance are completely pointless.
The value at risk of any investable security (including cash) is 100% all the time; any other number is bullshit. Volatility-based risk models are useful if you like watching squiggly lines or pricing options, but otherwise they are essentially a random anchor that helps us sleep at night (anchoring bias).
As an architectural decision, the choice to avoid EBS is hotly debated, though many high-profile systems besides Netflix (SimpleGeo, Sprint.ly) have moved almost exclusively to EC2 instance-store-backed (local disk) VMs and, as a result, avoided the pain of the last 3 major AWS outages.
Asgard is cited in the post as making a "zone evacuation" relatively straightforward.
Astyanax (their Cassandra client) is designed with "smarts" that allow it to choose from available nodes should one or more be unavailable.
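To illustrate the general idea of a client that routes around unavailable nodes, here is a rough sketch in Python. This is not Astyanax's actual API or algorithm (Astyanax is a Java library); `pick_node`, the `is_available` callback, and the node names are hypothetical:

```python
import random
from typing import Callable, Optional, Sequence


def pick_node(
    nodes: Sequence[str],
    is_available: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick a random node among those currently reported as available.

    Returns None when every node is unavailable, so the caller can
    decide whether to retry, back off, or fail the request.
    """
    candidates = [node for node in nodes if is_available(node)]
    if not candidates:
        return None
    chooser = rng if rng is not None else random
    return chooser.choice(candidates)


# Hypothetical cluster in which one node is marked as down.
down_nodes = {"node-b"}
chosen = pick_node(["node-a", "node-b", "node-c"], lambda n: n not in down_nodes)
print(chosen)  # one of node-a or node-c, never node-b
```

A real client would also track latency, retry failed requests on another node, and periodically re-check nodes it had marked as down.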
These (and several other tools) are available from Netflix at GitHub: