Using only AWS services, what do you put in place to help prevent disruptions when a single availability zone goes down?
The simplest approach would be to run your instances in multiple AZs and configure the ELB to round-robin requests across them, taking an instance out of rotation when its health check fails.
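To make that concrete, here is a toy sketch of round-robin routing that skips unhealthy instances. It is not ELB's actual algorithm or API; the function name `route_requests`, the instance names, and the `healthy` map are all made up for illustration.

```python
from itertools import cycle


def route_requests(instances, healthy, n_requests):
    """Assign n_requests round-robin over instances, skipping unhealthy ones.

    instances: list of instance names (hypothetical, e.g. one per AZ)
    healthy:   dict mapping instance name -> bool (last health check result)
    """
    if not any(healthy[name] for name in instances):
        raise RuntimeError("no healthy instances to route to")

    order = cycle(instances)
    assigned = []
    while len(assigned) < n_requests:
        candidate = next(order)
        # An instance that failed its health check is simply passed over.
        if healthy[candidate]:
            assigned.append(candidate)
    return assigned


if __name__ == "__main__":
    # Pretend the instance in zone "b" went down with its AZ.
    zones = ["a", "b", "c"]
    status = {"a": True, "b": False, "c": True}
    print(route_requests(zones, status, 4))
```

The point is only that traffic keeps flowing to the surviving zones; a real load balancer also handles connection draining, re-checking failed instances, and capacity in the remaining AZs.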
Any other thoughts?
I'm in one of the us-east zones and I haven't had a failure in at least a year. They retired one machine I was using, and dealing with that was as simple as stopping and starting the instance -- at a time I chose.
With five zones in U.S. East, and assuming a single zone fails, the probability of that failure affecting a single-zone system is 1 in 5.
If you're a busybody who spreads your system across five zones, the probability of a failure affecting you becomes 1.
You're spending more money, and dealing with a lot more complexity, all to increase the probability that hardware failures will affect you.
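The arithmetic behind this can be sketched in a few lines. It assumes exactly one of the five zones fails and that each zone is equally likely to be the one; `p_touched` is a hypothetical helper name, and "touched" only means the failure hits at least one of your zones, not that your system goes down.

```python
from fractions import Fraction


def p_touched(zones_used, total_zones=5):
    """Probability that a single-zone failure hits a zone you run in.

    Assumes exactly one zone fails, chosen uniformly at random.
    """
    if not 1 <= zones_used <= total_zones:
        raise ValueError("zones_used must be between 1 and total_zones")
    return Fraction(zones_used, total_zones)


if __name__ == "__main__":
    for k in range(1, 6):
        print(f"{k} zone(s): touched with probability {p_touched(k)}")
```

Under those assumptions, a one-zone system is untouched with probability `1 - p_touched(1)`, and a five-zone system is always touched.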
Now, you're hoping that a zone-distributed system will be able to recover from failures, but that's tricky to do, and it's quite unlikely to work if you haven't tested it. Add the fact that all the other "cool kids" will be trying to recover their systems at the same time, which can take AMZN's control plane down with them.
In the meantime, with probability 4/5 I'm sleeping through the disaster, and the first time I hear about it is on Hacker News.