1. "We use a multi-AZ strategy!" - This outage affected multiple AZ's concurrently. If you did not see downtime, this means you were fortunate to have at least one unaffected AZ. This is pure luck however, many sites with the same level of preparation had significant downtime. (Note: A multi-AZ strategy is sage and would have minimized your downtime, but does not warrant a survival claim in this case.)
2. "We aren't using EBS!" - Not a single article I've seen has claimed that they weren't using EBS because they feared a multi-day/multi-AZ outage. They weren't using it because it lacks predictable I/O performance in comparison to S3. You can't retroactively claim wisdom in the category of availability for this choice.
3. "We don't host component <X> on AWS!" - Taking this argument to it's logical end, any service that doesn't host on AWS could write one of these articles e.g. "We host on Rackspace so we didn't go down!"
In short, if you don't have a complete multi-region strategy (including your relational data-store) implemented purely on AWS, your blog post is decreasing the signal-to-noise ratio on this issue.
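(To make that concrete, here's a minimal sketch of the relational-data-store half of a multi-region setup, using boto3 with made-up instance identifiers and regions. A real deployment would also need DNS failover and application-level reconnection; this only shows replica creation and promotion.)

    # Hedged sketch: cross-region failover for the relational data-store.
    # All identifiers and regions below are hypothetical.
    import boto3

    STANDBY_REGION = "us-west-2"     # assumption: standby lives here
    REPLICA_ID = "app-db-standby"    # hypothetical replica name

    def create_standby_replica(source_arn):
        """Create a read replica of the primary DB in another region.

        Cross-region replicas require the source to be given as an ARN.
        """
        rds = boto3.client("rds", region_name=STANDBY_REGION)
        rds.create_db_instance_read_replica(
            DBInstanceIdentifier=REPLICA_ID,
            SourceDBInstanceIdentifier=source_arn,
        )

    def fail_over_to_standby():
        """Promote the standby replica to a standalone primary during a
        regional outage; the application must then repoint its writes."""
        rds = boto3.client("rds", region_name=STANDBY_REGION)
        rds.promote_read_replica(DBInstanceIdentifier=REPLICA_ID)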
It sounds stupid, but if you really do have a resilient and redundant infrastructure it shouldn't matter. If you fear someone randomly unplugging things, then you have work to do ;-)
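(In the spirit of Netflix's Chaos Monkey, "randomly unplugging things" can even be a scheduled drill. A rough sketch, assuming boto3 and a hypothetical "chaos-eligible" tag marking instances that are safe to kill:)

    # Hedged sketch: terminate a random tagged instance to test redundancy.
    # Region and tag name are assumptions for illustration.
    import random
    import boto3

    def unplug_random_instance(region="us-east-1"):
        ec2 = boto3.client("ec2", region_name=region)
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "instance-state-name", "Values": ["running"]},
                {"Name": "tag:chaos-eligible", "Values": ["true"]},
            ]
        )
        instances = [i["InstanceId"]
                     for r in resp["Reservations"]
                     for i in r["Instances"]]
        if instances:
            victim = random.choice(instances)
            ec2.terminate_instances(InstanceIds=[victim])
            print("Terminated", victim, "- did anyone notice?")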
Essentially I think we're going to be in an 80/20-ish cloud/colo sweet spot situation for years to come.
Perhaps you should diversify into cardiac monitoring!
Certainly 10% of round-trips taking 20ms or more is a little troubling, but if this only applies to writes (i.e. reads come from a slave in the same AZ), you are probably OK.
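(Roughly what that split looks like in application code, as a sketch: the hostnames are placeholders and pymysql is just one client choice. Only statements routed to the master pay the cross-AZ round-trip.)

    # Sketch of a read/write split: reads hit a slave in the local AZ,
    # writes cross the AZ boundary to the master. Hosts are hypothetical.
    import pymysql

    MASTER = dict(host="master.other-az.internal", user="app",
                  password="secret", db="app")      # placeholder creds
    LOCAL_SLAVE = dict(host="slave.local-az.internal", user="app",
                       password="secret", db="app")

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def run(sql, params=()):
        # Crude routing: only writes pay the ~20ms cross-AZ penalty.
        is_write = sql.lstrip().upper().startswith(WRITE_VERBS)
        conn = pymysql.connect(**(MASTER if is_write else LOCAL_SLAVE))
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                if is_write:
                    conn.commit()
                return cur.fetchall()
        finally:
            conn.close()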
Here's a traceroute between an EC2 instance in us-east-1a and rackspace.com (which resolved to one of their VA datacenters): http://pastebin.com/RF5VrTic
Sub-2ms. It also looks like us-east-1a is peered directly with whichever Rackspace datacenter served the request.
1) Long before the EBS API was restored, AWS adjusted the "Amazon Elastic Compute Cloud (N. Virginia)" status for 24 April to show operational (green). This has since been corrected in the "Amazon EC2 (N. Virginia)" Status History.
2) Their own RDS service, which runs on EBS-backed instances, remained unavailable to its users, proving that #1 was false. If they couldn't normally operate a service (RDS) built on their own platform (EC2), the underlying service (EC2) should not have been shown as operational on the status page.
3) At present, the icon for "Amazon Elastic Compute Cloud (N. Virginia)" is green for "Service is operating normally" instead of yellow for "Performance issues", even though the text description is not "Service is operating normally." but "Instance connectivity, latency and error rates."
4) Anecdotal observation suggests they treat the status page as at best a "median status", or perhaps closer to a "20th percentile status", meaning more than 80% of something can be down before it toggles to "Service Disruption".
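(Given that, it seems safer to watch the per-service RSS feed than the summary icon. A sketch using only Python's standard library; the feed URL follows the dashboard's published pattern, but treat it as an assumption:)

    # Sketch: poll a status-page RSS feed and surface every new item,
    # instead of waiting for the icon to turn yellow or red.
    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED = "http://status.aws.amazon.com/rss/ec2-us-east-1.rss"  # assumed URL

    def latest_items(url=FEED):
        with urllib.request.urlopen(url) as resp:
            root = ET.fromstring(resp.read())
        return [(item.findtext("pubDate"), item.findtext("title"))
                for item in root.iter("item")]

    seen = set()
    while True:
        for stamp, title in latest_items():
            if (stamp, title) not in seen:
                seen.add((stamp, title))
                print(stamp, "-", title)  # page yourself here
        time.sleep(300)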
>we don’t use Elastic Block Storage (EBS), which is the main component that failed last week.
Not using EBS wasn't luck; it was a conscious decision.
SmugMug got lucky in their choice. If EBS had offered consistent performance, they would have used it and most likely gone down like so many others.