Amazon Availability Zones are such a fucking lie. ("Shared nothing, and an outage affecting one will not affect other Zones in the Region.")
I've seen more failures which take out multiple AZs than which take out only a single AZ. So, a prudent person would split their application across regions (which are relatively shared nothing, except for admin/account level stuff), but Amazon goes out of its way to not make that easy -- you're using the public Internet, pay higher costs, etc.
The right choice is probably EC2 plus a non-EC2 provider (your own hosted stuff, another cloud (?), etc.), with protection in case either goes down. But that is a relatively large amount of work, and if you're on a PaaS like Heroku, which is 100% dependent on EC2, you can't do it.
You should at least know there is an outage to have something to tell your downstream customers. It is really embarrassing to have a customer (or your boss) call to report an outage you don't yet know about, even if there is fuck all you can do to resolve it. Basic principle of ops.
If my case helps: my company uses a service from one company to load-balance traffic across multiple CDNs/clouds. We are no longer affected when individual providers fail. You can read this http://tinyurl.com/7pwfza7 (I'm a user, not a vendor)
Do you work for a DNS provider or CDN or something (so as to see this in near realtime)? Envy.
I haven't seen a lot of people using both EC2 and Terremark for the same app -- kind of different markets. Not technically unreasonable, but Terremark seems to be more enterprise IT outsourcing, while EC2 (followed at a very far remove by the other clouds, including Rackspace) is more Internet-delivered consumer apps, etc., or at least larger-scale public services.
Here's an idea I've thought about but don't have time to do anything with: a peer-to-peer monitoring network, so each new server on each new network makes it more robust. No idea how the details would work out.
That gets done for network/application performance monitoring (alternatives to Keynote, Gomez, etc.; it's also how some of their own products work). It's kind of overkill for basic application-level monitoring -- there's a tradeoff between the number of endpoints checking and the frequency of checks. I guess you could round-robin checks across a larger number of end nodes, too, to get both.
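A minimal sketch of the round-robin idea, assuming a fixed list of checker nodes and targets. The function name, node names and target names are made up for illustration. Each target is still checked once per tick, but the node doing the check rotates, so no single node carries the whole load.

```python
def round_robin_schedule(nodes, targets, ticks):
    """Return (tick, node, target) assignments.

    Every target is checked once per tick. The checking node for a given
    target advances by one position each tick, so checks are spread across
    all nodes over time.
    """
    schedule = []
    for tick in range(ticks):
        for offset, target in enumerate(targets):
            # Rotate by tick; offset keeps targets on different nodes.
            node = nodes[(tick + offset) % len(nodes)]
            schedule.append((tick, node, target))
    return schedule


# Hypothetical names, purely for illustration.
plan = round_robin_schedule(["node-a", "node-b", "node-c"], ["web", "api"], 3)
for tick, node, target in plan:
    print(tick, node, "checks", target)
```

With three nodes and two targets, each target gets seen from all three vantage points within three ticks, while each node runs at most one check per tick.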
We're set up across multiple AZs in the affected region, and all we had was a few minutes of failed requests to one AZ until our systems automatically shifted all the traffic to another.
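For anyone wondering what "automatically shifted" can look like, here is a rough sketch of one way to do it, not a description of the setup above. It assumes something else is already running health checks and reporting results; the class name, zone names and the three-failure threshold are all assumptions.

```python
class ZoneRouter:
    """Route to the first zone that has not failed too many checks in a row."""

    def __init__(self, zones, max_failures=3):
        self.zones = list(zones)          # in order of preference
        self.max_failures = max_failures  # consecutive failures before eviction
        self.failures = {zone: 0 for zone in self.zones}

    def record(self, zone, ok):
        # A success resets the counter; a failure increments it.
        self.failures[zone] = 0 if ok else self.failures[zone] + 1

    def healthy_zones(self):
        return [z for z in self.zones if self.failures[z] < self.max_failures]

    def pick(self):
        healthy = self.healthy_zones()
        if not healthy:
            raise RuntimeError("no healthy zone available")
        return healthy[0]


router = ZoneRouter(["zone-a", "zone-b"])  # hypothetical zone names
for _ in range(3):
    router.record("zone-a", ok=False)      # three failed checks in a row
print(router.pick())                       # traffic now goes to zone-b
```

Requiring several consecutive failures before evicting a zone trades a few minutes of failed requests for fewer false failovers, which roughly matches the "few minutes" described above.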
Even the major day-long outage last year only hurt us because we had (at the time) not really spread ALL our core systems across multiple AZs. We just re-launched those systems on another AZ and everything was up and running again.
AZs are supposed to be distinct datacenters within a single region. If all of your customers are in (e.g.) APAC, it's not unreasonable to put all your online processing within APAC, with high bandwidth connectivity between them and from each to customers. You might not be able to do master-master over extremely long distances for performance reasons under normal conditions, but you'd keep warm or cold backups totally out of the area. There are a lot of factors which go into the decision, but there are definitely times when 2 datacenters (often run by separate providers) with independent connectivity, but both within a specific distance, makes more sense than extreme separation.
It's sad how people knew how to do this stuff ~2002-2006 and then forgot it all (or just stopped caring) once the delicious cake of cloud appeared.
You missed my point: this is not a cloud problem except to vendors looking to sell non-cloud hosting. Any region is vulnerable - some clown with a backhoe, congestion / DDoS, routing screwups, etc. have taken out data centers in entire areas (Los Angeles, SF, NY, etc.) even when providers thought they had more redundancy. If you really need it, you spend the money on wide geographic separation.
For this reason I'm using a set of different VPS servers running on both Linode (UK datacenter) and Slicehost (US datacenter).
So: separate datacenters, admin layers, providers and, also important, billing.
Running a highly available cluster in this setup isn't trivial though, mostly due to network splits. It works quite well for specific purposes where availability is more important than data integrity (remote monitoring, in this case).
That's my plan too. By using dual clouds (again in UK and US), we're getting the highest failover protection we can afford. I can't afford our e-commerce platform to be down, and the evidence shows that a single cloud isn't robust enough. We call it "Cloud Docking" :)
What's funny is that this is what I find myself doing instinctively when I encounter an outage or high latencies on ANY service or site. Heroku (recent process startup woes), Google Apps (slowdowns, specifically Gmail), Amazon (when it's hammered by traffic to big deals), etc.
I second the comment above suggesting a "crowdsourced" status app monitoring Twitter. Although it's no consolation for service interruptions, it does at least keep you sane knowing the problem is elsewhere.
What does it matter if Twitter goes down? The odds of it happening for an extended time right at the beginning of an EC2 outage are rather small, and even in that worst-case scenario, it doesn't really put us in any worse of a position than we're in now.
Apparently it was caused by a network problem above AWS:
2:40 AM PDT We are investigating connectivity issues for EC2 in the US-EAST-1 region.
3:03 AM PDT Between 2:22 AM and 2:43 AM PDT internet connectivity was impaired in the US-EAST-1 region. Full connectivity has been restored. The service is operating normally.
6:09 PM PDT We want to provide some additional information on the Internet connectivity interruption that impacted our US-East Region last night. A networking router bug caused a defective route to the Internet to be advertised within the network. This resulted in a 22 minute Internet connectivity interruption for instances in the region. During this time, connectivity between instances in the region and to other AWS services was not interrupted. Given the extensive experience that we have running this router in this configuration, we know this bug is rare and unlikely to reoccur. That said, we have identified and are in the process of deploying a mitigation that will prevent a reoccurrence of this bug from affecting network connectivity.
We understand that when networking events affect instances in multiple Availability Zones it causes our customers serious operational issues that are difficult to architect around. We have been using and refining our Availability Zone architecture for over 10 years at Amazon to provide highly reliable services. Availability Zones provide a high degree of isolation including physical separation, independent power distribution, independent cooling and mechanical systems, and multiple physical links to the Internet through multiple transit providers and peering connections. All of our regions have exceeded 99.99% availability over the last several years. We are also continually investing in improving our architecture as we learn more. In addition to the remediation discussed above which addresses the specific bug we saw last night, we are currently in the later stages of refining the way that we do route advertisement within a region. These changes will isolate any bad route information to inside a single Availability Zone while maintaining the performance characteristics of our current inter-Availability Zone network design. We have been deploying these changes carefully to avoid impact to customers, but we expect these changes to be complete within the next several weeks. We are confident these changes will protect us from multi-Availability Zone impact for the sort of bug we saw last night.
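For a rough sense of scale, here is the arithmetic behind the 99.99% figure, assuming it is measured over a 365-day year (the statement does not say how it is measured). The function name is made up for illustration.

```python
def downtime_budget_minutes(availability, days=365):
    """Minutes of downtime allowed over `days` at the given availability."""
    return (1.0 - availability) * days * 24 * 60


budget = downtime_budget_minutes(0.9999)  # minutes per 365-day year
outage_share = 22 / budget                # the 22 minute interruption
print(f"budget: {budget:.1f} min/year, outage used {outage_share:.0%} of it")
```

Under that assumption the budget is about 52.6 minutes a year, so a single 22 minute interruption uses roughly 42% of it.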
It's more that if you make use of the Amazon APIs for autoscaling other services, you can't just directly translate that to a more static managed hosting environment.
Probably the sane way is to special case some subset of your functionality so it works regardless, and then gracefully scale up/down your app (performance, scope of features, etc.) based on system health. This is a lot more complex, and really hard to retrofit.
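A very rough sketch of that idea, assuming you can boil system health down to a single score between 0.0 and 1.0. The feature names and thresholds are invented for illustration; a real app would have its own list and a much messier notion of health.

```python
# Hypothetical feature table: feature name -> minimum health score (0.0 to 1.0)
# needed to keep it switched on. A threshold of 0.0 means "always on".
FEATURES = {
    "checkout": 0.0,          # the special-cased core that must always work
    "search": 0.3,
    "recommendations": 0.6,
    "analytics_export": 0.9,
}


def enabled_features(health, features=FEATURES):
    """Return the sorted names of features allowed at this health score."""
    return sorted(name for name, needed in features.items() if health >= needed)


for score in (1.0, 0.5, 0.0):
    print(score, enabled_features(score))
```

The hard part is not this lookup but making every code path tolerate a feature being switched off, which is why it is so painful to retrofit.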
From the very beginning I've always stayed far away from anything that locks us into AWS. We've made no use of anything that couldn't be picked up and moved elsewhere, which is exactly why we never decided to use auto scaling.
This held us back a bit at first: even tools like SES initially only had an API provided by Amazon. Now it supports standard SMTP connections, so we decided there was no harm in using it, as we could easily switch with no code changes.
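To illustrate why plain SMTP keeps a provider swappable, here is a minimal sketch using Python's standard library. The host, port, credentials and addresses are placeholders, not real endpoints, and it assumes the provider accepts STARTTLS with a username and password. Switching providers then means changing the connection settings, not the code.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build a plain-text email; nothing here is provider-specific."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send(msg, host, port, user, password):
    """Send over standard SMTP; only the settings differ between providers."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()             # assumes the provider supports STARTTLS
        server.login(user, password)
        server.send_message(msg)


msg = build_message("ops@example.com", "customer@example.com",
                    "Service status", "We are investigating an outage.")
# Placeholder settings; in practice these would come from config:
# send(msg, "smtp.example.com", 587, "user", "secret")
```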
At a competent company, no. There are policies in place before the outage, but having your PR people in the loop slows things down to the point where you're worthless to your customers. The exception is you loop in legal, PR, etc. if someone is actually injured/dies, or if crimes are involved.
A lot of providers try to NDA their "ops to customer" service outage notifications, but most customers flagrantly violate those NDAs. Automated service dashboards are supposed to be automatic; ops teams often put in short statements (especially time to fix and any interim way to mitigate the outage).
Definitive statements after the outage are run by PR (and generally announced by someone senior to ops), but service notifications of outages (vs. causes, compensation, and long-term corrective actions) are not.
I suspect there is some human admin level disconnect between their network ops/routing and the AWS team itself. A connectivity outage wouldn't necessarily get detected within AWS, and they probably don't have good monitoring within the AWS product to detect problems like that. The network team presumably doesn't have a good way to push status updates to the AWS dashboard automatically (and it's kind of a grey area what is a "network outage" -- if you lost routes to just Pakistan, that's not really a big deal for most AWS customers. If you lose routes to everyone, yes, that's a big deal.)
And yet you have no contact info in your profile...
There's also the obvious risk with even using a single PaaS running on multiple IaaS clouds. If your account with the PaaS gets hacked, or they get acquihired, or whatever, you can be screwed too.
Figuring out exactly where to have redundancy in your business is hard. Especially because building something to be redundant imposes costs (more expensive, slower development) and sometimes itself is the cause of outages (lots of hilarious failover-related failures have taken down sites).