I've seen more failures that take out multiple AZs than failures that take out only a single AZ. So a prudent person would split their application across regions (which are relatively shared-nothing, except for admin/account-level stuff), but Amazon goes out of its way not to make that easy: you're routed over the public Internet, you pay higher costs, etc.
The right choice is probably EC2 plus a non-EC2 provider (your own hosted stuff, another cloud (?), etc.), with failover in case either goes down. But that's a lot of work, and if you're on a PaaS like Heroku, which is 100% exposed to EC2, you can't do it.
Kind of sucks.
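To make the failover piece concrete, here's a minimal sketch, assuming a primary on EC2, a warm standby at a second provider, and a DNS update API stubbed out below (the hostnames, IPs, and the update_record helper are all hypothetical):

    # Poll the EC2 primary's health endpoint; after several consecutive
    # failures, repoint DNS at the non-EC2 standby. update_record() is a
    # stand-in for whatever API your DNS provider actually exposes.
    import time
    import urllib.request

    PRIMARY = "https://app.ec2.example.com/health"   # hypothetical endpoint
    STANDBY_IP = "203.0.113.10"                      # non-EC2 provider

    def healthy(url, timeout=5):
        try:
            return urllib.request.urlopen(url, timeout=timeout).status == 200
        except Exception:
            return False

    def update_record(name, ip):
        print("would point %s at %s" % (name, ip))   # hypothetical DNS API call

    failures = 0
    while True:
        failures = 0 if healthy(PRIMARY) else failures + 1
        if failures >= 3:            # require consecutive failures, to avoid flapping
            update_record("app.example.com", STANDBY_IP)
            break
        time.sleep(30)

The usual catch is DNS TTLs: they have to be set low before the outage, not during it.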
For my benefit, what are some others?
I haven't seen a lot of people using both EC2 and Terremark for the same app -- kind of different markets. Not technically unreasonable, but Terremark seems to be more enterprise IT outsourcing, while EC2 (followed at a very far remove by the other clouds, including Rackspace) serves Internet-delivered consumer apps, or at least larger-scale public services.
Even the major day-long outage last year only hit us because we had (at the time) not really spread ALL our core systems across multiple AZs. We just re-launched those systems in another AZ and everything was up and running again.
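For anyone wondering what "re-launch in another AZ" amounts to, here's a sketch in today's boto3 terms, assuming a pre-baked AMI (the image, instance type, and AZ are placeholders):

    # Launch a replacement instance in a healthy AZ from a pre-baked AMI.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",               # placeholder AMI
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": "us-east-1b"},  # the unaffected AZ
    )

The hard part isn't this call; it's having the AMI and data replication already in place so the new instance is useful.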
Which outages have affected multiple AZs?
It's sad how people knew how to do this stuff ~2002-2006 and then forgot it all (or just stopped caring) once the delicious cake of cloud appeared.
So: separate datacenters, separate admin layers, separate providers, and, just as important, separate billing.
Running a highly available cluster in this setup isn't trivial, though, mostly due to network splits. It works quite well for specific purposes where availability is more important than data integrity (remote monitoring, in this case).
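That trade-off is easy to state in code. A sketch, assuming each node can probe its peers (addresses and ports are placeholders): with quorum you protect integrity; without it you protect availability, which is the right call for monitoring.

    # Decide whether a partitioned node keeps serving. Availability-first
    # (monitoring) serves regardless; integrity-first requires a majority,
    # so two halves of a split can't both accept writes (split-brain).
    import socket

    PEERS = [("10.0.1.2", 5432), ("10.0.2.2", 5432)]  # placeholder peers

    def reachable(host, port, timeout=2):
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False

    def may_serve(prefer_availability):
        alive = 1 + sum(reachable(h, p) for h, p in PEERS)  # self + visible peers
        has_quorum = alive > (len(PEERS) + 1) // 2
        return prefer_availability or has_quorum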
*Downvote if you have a legitimate technical reason I'm wrong, not just because you're throwing a hissy fit that your technology of the week isn't all that and a bag of chips.
(We're back now. Kudos to Pingdom for noticing and alerting.)
I second the comment above suggesting a "crowdsourced" status app that monitors Twitter. Although it's no consolation for service interruptions, it does at least keep you sane knowing the problem is elsewhere.
Plus, who said we were sleeping?
It's worse if you're tied into the AWS ecosystem: you can't move out, or even keep backup servers elsewhere.
Migrating the servers themselves shouldn't be too hard if you have a fairly sane build process. However, if you don't, and I suspect a lot of people don't... it's not going to be fun.
Probably the sane way is to special-case some subset of your functionality so it works regardless, and then gracefully scale your app up or down (performance, scope of features, etc.) based on system health, as sketched below. This is a lot more complex, and really hard to retrofit.
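One way that gating might look (the probe targets here are hypothetical): tie each non-essential feature to a cheap health probe, so a failing dependency turns off a feature rather than the site.

    # Gate non-essential features on dependency probes; the special-cased
    # core stays on unconditionally. Hostnames/ports are placeholders.
    import socket

    def probe(host, port, timeout=1):
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False

    FEATURES = {
        "core_reads": lambda: True,                            # always on
        "search":     lambda: probe("search.internal", 9200),  # hypothetical
        "recs":       lambda: probe("recs.internal", 6379),    # hypothetical
    }

    def enabled(name):
        try:
            return FEATURES.get(name, lambda: False)()
        except Exception:
            return False  # a broken probe disables the feature, not the app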
From the very beginning I've steered well clear of anything that locks us into AWS. We've made no use of anything that couldn't be picked up and moved elsewhere, which is exactly why we never adopted auto scaling.
This held us back a bit at first; tools like SES, for instance, initially only offered Amazon's proprietary API. Now that SES supports standard SMTP connections, we decided there was no harm in using it, since we could easily switch providers with no code changes.
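For reference, SES over plain SMTP is just the standard library pointed at the regional endpoint (the region, addresses, and credentials below are placeholders); swapping providers later is a one-line host/login change.

    # Send through SES's standard SMTP interface (STARTTLS on port 587).
    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["From"] = "us@example.com"        # placeholder addresses
    msg["To"] = "you@example.com"
    msg["Subject"] = "hello"
    msg.set_content("Delivered via SES over standard SMTP.")

    with smtplib.SMTP("email-smtp.us-east-1.amazonaws.com", 587) as s:
        s.starttls()
        s.login("SMTP_USERNAME", "SMTP_PASSWORD")   # SES SMTP credentials
        s.send_message(msg)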
Does anyone know if ops teams generally have to clear this sort of announcement with PR/comms? I assume they would, given how these get reported on.
A lot of providers try to NDA their ops-to-customer outage notifications, but most customers flagrantly violate those NDAs. Status dashboards are supposed to update automatically; ops teams often add short statements (especially estimated time to fix and any interim way to mitigate the outage).
Definitive statements after the outage are run past PR (and generally announced by someone senior to ops), but service notifications of outages (as opposed to causes, compensation, and long-term corrective actions) are not.
2:40 AM PDT We are investigating connectivity issues for EC2 in the US-EAST-1 region.
3:03 AM PDT Between 2:22 AM and 2:43 AM PDT internet connectivity was impaired in the US-EAST-1 region. Full connectivity has been restored. The service is operating normally.
6:09 PM PDT We want to provide some additional information on the Internet connectivity interruption that impacted our US-East Region last night. A networking router bug caused a defective route to the Internet to be advertised within the network. This resulted in a 22 minute Internet connectivity interruption for instances in the region. During this time, connectivity between instances in the region and to other AWS services was not interrupted. Given the extensive experience that we have running this router in this configuration, we know this bug is rare and unlikely to reoccur. That said, we have identified and are in the process of deploying a mitigation that will prevent a reoccurrence of this bug from affecting network connectivity.
We understand that when networking events affect instances in multiple Availability Zones it causes our customers serious operational issues that are difficult to architect around. We have been using and refining our Availability Zone architecture for over 10 years at Amazon to provide highly reliable services. Availability Zones provide a high degree of isolation including physical separation, independent power distribution, independent cooling and mechanical systems, and multiple physical links to the Internet through multiple transit providers and peering connections. All of our regions have exceeded 99.99% availability over the last several years. We are also continually investing in improving our architecture as we learn more. In addition to the remediation discussed above which addresses the specific bug we saw last night, we are currently in the later stages of refining the way that we do route advertisement within a region. These changes will isolate any bad route information to inside a single Availability Zone while maintaining the performance characteristics of our current inter-Availability Zone network design. We have been deploying these changes carefully to avoid impact to customers, but we expect these changes to be complete within the next several weeks. We are confident these changes will protect us from multi-Availability Zone impact for the sort of bug we saw last night.
Nothing critical, but certainly a warning shot across the bows.
Yes, a multi-cloud strategy... I know a good company for this ;-)
There's also the obvious risk even with a single PaaS running on multiple IaaS clouds: if your account with the PaaS gets hacked, or they get acqui-hired, or whatever, you can be screwed too.
Figuring out exactly where to have redundancy in your business is hard, especially because building something to be redundant imposes costs (more expense, slower development) and is sometimes itself the cause of outages (lots of hilarious failover-related failures have taken down sites).
AWS status dashboard: -1, HN: +1