Newest update: 10:29 PM PDT We can confirm a portion of a single Availability Zone in the US-EAST-1 Region lost power. We are actively restoring power to the affected EC2 instances and EBS volumes. We are continuing to see increased API errors. Customers might see increased errors trying to launch new instances in the Region.
AWS hides the real availability zone names from you. From the docs:
Can I assume that my Availability Zone us-east-1a is the same location as someone else's Availability Zone us-east-1a?
No. Currently, we do not support cross-account proximity. Each account's Availability Zones are independent. For example, the us-east-1a Availability Zone for one account might be in a different location than for another account.
Everyone's "A" zone is different. So I could say A is down while someone else says B, and we could be talking about the exact same zone. That makes it difficult to tell whether the outage is widespread or not.
My us-east-1a is the affected zone, which is 3a98bf7d-126d-411a-a612-3a57a62dc688 using the incantation on the site.
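(If anyone wants to compare notes across accounts today: AWS has since exposed an account-independent zone ID directly in the EC2 API, so you no longer need any incantation. A minimal boto3 sketch, assuming credentials are already configured; the UUID-style ID above comes from a different method.)

    # Map this account's AZ names to the account-independent zone IDs,
    # so outage reports can be compared across accounts.
    # Assumes boto3 is installed and AWS credentials are configured.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        # ZoneId (e.g. "use1-az4") is the same for every account;
        # ZoneName (e.g. "us-east-1a") is shuffled per account.
        print(az["ZoneName"], "->", az["ZoneId"])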
(Oh, and to note: my us-east-1a was also the affected zone during the massive outage last year, and I believe I remember another outage sometime between then and now. I almost feel like every Amazon outage affects my zone. I kind of wonder if that availability zone just sucks ;P.)
Maybe I'm assuming wrong, but in your example I take it zone A and zone B are physically the same, and the different zone names users see don't represent different ways of spreading resources. If so, why aren't they named consistently? If not, are there any details on how and why they've set up their zones (or am I overlooking another assumption I've made)?
Due to a quirk of both human nature and copy/paste example code, if the availability zone names were mapped the same for all users, then 90% of requests would go to the zone named "A". To ensure the zones get even usage, Amazon shuffles the A-D ordering of the availability zones for each new account as it is created.
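(Purely as an illustration of that idea, not Amazon's actual scheme: a stable per-account shuffle of the letter-to-zone mapping could be as simple as seeding a PRNG with the account ID. All names here are made up.)

    # Hypothetical sketch of per-account AZ label shuffling -- NOT how
    # Amazon actually does it, just the mechanism described above.
    import hashlib
    import random

    PHYSICAL_ZONES = ["dc-1", "dc-2", "dc-3", "dc-4"]  # made-up zone names

    def az_mapping(account_id: str) -> dict:
        # Seed the shuffle from the account ID so the mapping is stable
        # for one account but differs between accounts.
        seed = int(hashlib.sha256(account_id.encode()).hexdigest(), 16)
        zones = PHYSICAL_ZONES[:]
        random.Random(seed).shuffle(zones)
        return {f"us-east-1{letter}": zone for letter, zone in zip("abcd", zones)}

    print(az_mapping("111111111111"))  # one account's view of a-d
    print(az_mapping("222222222222"))  # another account's different view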
Adding to the "why it's scrambled" comments: because data transfer pricing differs for intra- vs. inter-AZ traffic, it is possible to coordinate with other entities if you need to be in the same AZ as them.
But, yes, generally: Amazon randomly allocates AZ labels to given customers to avoid preferential clustering on any given first choice.
My co-location provider's data center has had about four hours of downtime in ten years. It turns out a backup diesel generator, redundant connectivity, and a good network/sysadmin are all anyone needs.
People are now working out how to fail over from AWS to Rackspace, and that is infuriating to me. You... you need redundant clouds now? That can't be right!
You're comparing different things: most of the complaints about AWS have been due to servers (EC2) or network storage (EBS) failing. The outage in March was networking-related, though, so that one would count.
Why do you assume Amazon doesn't have "a backup diesel generator, redundant connectivity, and a good network/sysadmin"? And if they do have all that, why is your situation different?
You tell me, man. I guess the difference is that I can install many times more server horsepower for the $500/month rental of a half rack at my colo, and the result will be better uptime than EC2. I just can't install it immediately.
Really? I've noticed an increase in maintenance with Linode recently, but they've always been really great with notice, so much so that I am consistently surprised. Whenever there is planned downtime, I've gotten an email at least a week in advance, and it gives me the option to migrate my Linode to another server at my own convenience if their schedule isn't compatible with my needs. Do you not get these emails and options?
For the scheduled ones, I think I did receive some prior notice; I might have gotten particularly unlucky with a bunch of emergency/network/etc. maintenance affecting many of my servers simultaneously.
My problem with them is they don't provide RFOs (Reason For Outage reports) because they can't give away any of the "proprietary secrets" about their setup. I'd like to know what happened when it affects me. Also, when it comes to replacing hardware or performing maintenance, their priorities won't always line up with mine.
Interesting. We moved away from VPS.net for a similar reason. There were multiple occasions where hosts got shut down and support couldn't tell us the reason for it. Since we're using hand-rolled virtualization on top of a few rented servers, we're basically problem-free.
If you use DNS round robin, then you pretty much guarantee that everyone will be affected if just one hosting company is down. DNS round robin is not the tool for this job.
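(To make that concrete: round-robin DNS hands every client the full record set in rotating order, with no health checking, so a naive client will happily connect to a dead host. A rough Python sketch with a stand-in hostname; real failover has to live in the client or a load balancer.)

    # Why round-robin DNS isn't failover: the resolver returns ALL
    # addresses regardless of health, and a naive client takes the first.
    import socket

    def naive_connect(host: str, port: int = 80) -> socket.socket:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        # Round robin only rotates the order of these records; if the
        # first one points at the downed host, this client just fails.
        family, socktype, proto, _, addr = infos[0]
        sock = socket.socket(family, socktype, proto)
        sock.connect(addr)  # no retry against the remaining records
        return sock

    def connect_with_failover(host: str, port: int = 80, timeout: float = 3.0):
        # A client that actually fails over must iterate and time out.
        last_err = None
        for *_, addr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
            try:
                return socket.create_connection(addr[:2], timeout=timeout)
            except OSError as err:
                last_err = err  # dead host: try the next record
        raise last_err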
N. Virginia in my experience is by far the least reliable region on EC2/EBS... Fortunately our app servers are spread across two zones in the region... but our db server is just a lone master... Our slave is down... Very nervous.
I run a master and a "hot" standby master, each in a different AZ, with slaves of each in their respective AZs for days like today. Expensive, but it makes it easy to sleep at night.
The slaves have their EBS disks snapshotted every 30 minutes, the master every 24 hours.
A "full-spec" machine, another X-Large with 4 EBS volumes that I can fail over to. Its in circular replication with the other "active" master (only one is receiving writes at a time). These instances are only snapshotted once a day to keep them as fast as possible.
N. Virginia is by far the most used region. There's more stuff to fail and any failure will affect more people. I don't think there's anything inherently less reliable about the geographical location of the building.
I've been wondering this myself lately. It seems that every major EC2 outage hits US-East. By comparison, my US-West instances have way better uptime (granted, over a shorter observation period). I've never tried the Europe or Asia regions, but I'm tempted to now.
On the other hand, they don't have any prices listed, and their blog is down with "Error establishing a database connection". That doesn't exactly inspire confidence in going to them for hosting.
Our instances are still down.
The AWS service health dashboard says:
9:27 PM PDT We continue to investigate this issue. We can confirm that there is both impact to volumes and instances in a single AZ in US-EAST-1 Region. We are also experiencing increased error rates and latencies on the EC2 APIs in the US-EAST-1 Region.
The update I heard was (essentially) 'Another update from Amazon: Looks like it was a power issue for one facility that services a particular AZ in us-east-1, flipped to generator, now back on power and in recovery mode.'
So, Amazon has said since the introduction of EC2 that, to ensure really high uptimes, customers should use multiple availability zones and architect their applications to survive an outage in a single availability zone. While I would question Amazon's competence if outages of any sort were overly frequent, Amazon has not had many at all, and no recent cross-AZ ones. [This is correct, right?] I recognize that architecting applications to be performant across datacenters (tolerant of relatively high-latency replication) is hard, but Amazon seems to be a poster child for keeping its promises w.r.t. availability. Is my take on this incorrect?
Still down here, over 12 hours at this point. This is probably the second time we've been hit with something on AWS in the last three months -- and you have to pay them to talk to someone about it. We're definitely moving to Linode ASAP.
If your application needs to be up constantly, then it should probably be at least multi-AZ scaled, if not multi-Region. Multi-AZ applications are not affected by this outage, and multi-AZ events are very rare. Living out of a single AZ is very risky.
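(At the launch level, "multi-AZ scaled" just means pinning capacity to different zones so a single-AZ event like tonight's only costs you a fraction of it. A minimal boto3 sketch; the AMI ID and instance type are placeholders, not real values.)

    # Spread one instance into every available AZ in the region so a
    # single-AZ outage takes out only a fraction of capacity.
    # "ami-12345678" and the instance type are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    zones = [
        az["ZoneName"]
        for az in ec2.describe_availability_zones()["AvailabilityZones"]
        if az["State"] == "available"
    ]

    for zone in zones:
        ec2.run_instances(
            ImageId="ami-12345678",   # placeholder AMI
            InstanceType="m3.large",  # placeholder type
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},  # pin to a specific AZ
        )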
Does anyone else find it strange that two Heroku posts made the frontpage considerably (in relative terms, obviously) earlier than "EC2 down"? I would think EC2 is a more common denominator for people, but maybe other hosts have better redundancy and thus there wasn't an immediate awareness?
Or am I just overly curious and it's really just that some Heroku clients happened to notice before an at-large EC2 customer?
edit: I don't mean to imply a conspiracy of some sort, upon a reread. I'm merely curious whether there are just that many Heroku users in particular on HN, or somesuch.
It is probably because of the large Heroku outage and post here just the other day; people are trying to point out that they are down again, as that is more dramatic than a normal AWS disruption.
A PaaS like Heroku ends up being "front line support" for AWS. If you use Heroku and your apps fail, you don't care that it's Amazon's fault; you blame Heroku.
Source: http://status.aws.amazon.com/?rf Or: http://status.aws.amazon.com/rss/ec2-us-east-1.rss