A) We've seen downtime in multiple zones, not just one. The big problem seems like it might be localized to one zone, but there are spillover effects. See below.
B) We have extensive EBS snapshots, thank you very much, and, yes, they span availability zones. Unfortunately, they can't be reliably reinstantiated as volumes in any zone in US-East.
If one is going to chase ambulances, one could at least provide advice that is actually topical: say, an article on the tradeoffs and design choices needed to run a service that spans regions or even companies, not just zones.
Sorry to be so testy, but it's not a good day for platitudes.
I must admit that it's very much an idiot's guide aimed at smaller setups or people less familiar with web operations.
I can assure you there's no copy/paste and I'm surprised at the interest it's had.
The author recommends using EBS volumes to provide backups and snapshots. However, Amazon's EBS system is one of the more failure-prone components of the AWS infrastructure, and it lies at the heart of this morning's outage. Any steps you can take to reduce your dependence on a service that is both critical to operation and failure-prone will limit the surface of your vulnerability to such outages. While the snapshotting ability of EBS is nice, waking up to a buzzing pager to find that half of the EBS volumes in your cluster have dropped out, hosing each of the striped RAID arrays you'd set up to achieve reasonable IO throughput, is not. Instead, consider using the ephemeral drives of your EC2 instances, switching to a non-snapshot-based backup strategy, and replicating data to other instances and AZs to improve resilience.
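The replication idea above can be sketched roughly as follows; this is a minimal illustration rather than a complete backup strategy, and the hostnames and paths are entirely hypothetical:

```python
#!/usr/bin/env python
# Sketch: push data to standby instances in other AZs via rsync,
# rather than depending on EBS snapshots. All names are made up.
import subprocess

# Standby instances in *different* availability zones (hypothetical).
REPLICAS = [
    "standby-us-east-1b.example.com",
    "standby-us-east-1c.example.com",
]

def rsync_command(src, replica, dest="/data/backup/"):
    """Build the rsync invocation for one replica."""
    return ["rsync", "-az", "--delete", src, f"{replica}:{dest}"]

def replicate(src="/data/"):
    """Push src to every replica; one failure shouldn't stop the rest."""
    failed = []
    for replica in REPLICAS:
        if subprocess.call(rsync_command(src, replica)) != 0:
            failed.append(replica)
    return failed  # non-empty list means some replicas missed this run
```

The point is that the backup lives on independent instance storage in another zone, so losing EBS in one AZ doesn't take your data and your restore path down together.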
The author also recommends Elastic Load Balancers to distribute load across services in multiple availability zones. Load balancing across availability zones is excellent advice in principle, but it still succumbs to the problem above when EBS is unavailable: ELB instances are themselves backed by Amazon's EBS infrastructure. ELBs can be excellent day-to-day and provide some great monitoring and introspection. However, having a quick Chef script to spin up an Nginx or HAProxy balancer, then flipping DNS, could save your bacon in the event of an outage that also affects ELBs, like today's.
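As an illustration of that fallback-balancer idea, a minimal HAProxy config might look like the following; the backend addresses and health-check path are invented for the example:

```
# Minimal fallback HAProxy config (sketch; hosts/paths are hypothetical).
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:80
    default_backend app

backend app
    balance roundrobin
    option httpchk GET /health
    # App instances in two different availability zones
    server app1 10.0.1.10:8080 check
    server app2 10.1.1.10:8080 check
```

With something like this baked into a Chef recipe, the remaining manual step during an ELB outage is pointing DNS at the new balancer.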
With each service provider incident, you learn more about your availability, dependencies, and assumptions, along with what must improve. Proportional investment following each incident should reduce the impact of subsequent provider issues. Naming and shaming providers in angry Twitter posts will not solve your problem, and it most certainly won't solve your users' problem. Owning your availability by taking concrete steps following each outage to analyze what went down and why, mitigating your exposure to these factors, and measuring your progress during the next incident will. It is exciting to see these investments pay off.
Some of those concrete steps:
– Painfully thorough monitoring of every subsystem of every component of your infrastructure. When you get paged, it's good to know exactly what's having issues rather than checking each manually in blind suspicion.
– Threshold-based alerting.
– Keeping failover for all systems as automated, quick, and transparent as is reasonably possible.
– Spreading your systems across multiple availability zones and regions, with the ideal goal of being able to lose an entire AZ/region without a complete production outage.
– Team operational reviews and incident analysis that expose the root cause of an issue, but also spider out across your system's dependencies to preemptively identify other components which are vulnerable to the same sort of problem.
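The threshold-based alerting point above can be sketched in a few lines of Python; the metric names and threshold values here are illustrative, not recommendations:

```python
# Sketch of threshold-based alerting. Each metric maps to a
# (warn, page) pair; crossing "warn" notifies, crossing "page" wakes
# someone up. Metric names and numbers are made up for illustration.
THRESHOLDS = {
    "disk_used_pct":     (80, 95),
    "replication_lag_s": (30, 120),
    "error_rate_pct":    (1, 5),
}

def evaluate(metrics):
    """Return an alert level ('ok', 'warn', or 'page') per metric."""
    alerts = {}
    for name, value in metrics.items():
        warn, page = THRESHOLDS[name]
        if value >= page:
            alerts[name] = "page"
        elif value >= warn:
            alerts[name] = "warn"
        else:
            alerts[name] = "ok"
    return alerts
```

The value of the warn/page split is that you hear about a metric drifting toward trouble before it becomes the kind of incident that pages you at 4am.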
See the response from AWS in the first reply here: https://forums.aws.amazon.com/thread.jspa?messageID=239106...