Newest update: 10:29 PM PDT We can confirm a portion of a single Availability Zone in the US-EAST-1 Region lost power. We are actively restoring power to the affected EC2 instances and EBS volumes. We are continuing to see increased API errors. Customers might see increased errors trying to launch new instances in the Region.
AWS hides the real availability zone names from you. From the docs:
Can I assume that my Availability Zone us-east-1a is the same location as someone else's Availability Zone us-east-1a?
No. Currently, we do not support cross-account proximity. Each account's Availability Zones are independent. For example, the us-east-1a Availability Zone for one account might be in a different location than for another account.
Everyone's "A" zone is different. So I could say A is down while someone else says B, and we could be talking about the exact same zone. That makes it difficult to tell whether the outage is widespread or not.
My us-east-1a is the affected zone, which is 3a98bf7d-126d-411a-a612-3a57a62dc688 using the incantation on the site.
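(If anyone wants to compare notes across accounts today: AWS has since exposed an account-independent zone ID directly in the EC2 API, so you no longer need any incantation. A minimal boto3 sketch, assuming credentials are already configured; the UUID-style ID above comes from a different method.)

    # Map this account's AZ names to the account-independent zone IDs,
    # so outage reports can be compared across accounts.
    # Assumes boto3 is installed and AWS credentials are configured.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        # ZoneId (e.g. "use1-az4") is the same for every account;
        # ZoneName (e.g. "us-east-1a") is shuffled per account.
        print(az["ZoneName"], "->", az["ZoneId"])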
(Oh, and to note: my us-east-1a was also the affected zone during the massive outage last year, and I believe I remember another outage sometime between then and now. I almost feel like every Amazon outage affects my zone. I kind of wonder if that availability zone just sucks ;P.)
Maybe I'm assuming wrong, but in your example I take it zone A and zone B are physically the same, and the different zone names users see don't represent different ways of spreading resources. If so, why aren't they named consistently? If not, are there any details on how and why they've set up their zones (or am I overlooking another assumption I've made)?
Due to a quirk of both human nature and copy/paste example code, if the availability zone names were mapped the same for all users, then 90% of requests would go to the zone named "A". To ensure the zones get even usage, Amazon shuffles the A-D ordering of the availability zones for each new account as it is created.
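(Purely as an illustration of that idea, not Amazon's actual scheme: a stable per-account shuffle of the letter-to-zone mapping could be as simple as seeding a PRNG with the account ID. All names here are made up.)

    # Hypothetical sketch of per-account AZ label shuffling -- NOT how
    # Amazon actually does it, just the mechanism described above.
    import hashlib
    import random

    PHYSICAL_ZONES = ["dc-1", "dc-2", "dc-3", "dc-4"]  # made-up zone names

    def az_mapping(account_id: str) -> dict:
        # Seed the shuffle from the account ID so the mapping is stable
        # for one account but differs between accounts.
        seed = int(hashlib.sha256(account_id.encode()).hexdigest(), 16)
        zones = PHYSICAL_ZONES[:]
        random.Random(seed).shuffle(zones)
        return {f"us-east-1{letter}": zone for letter, zone in zip("abcd", zones)}

    print(az_mapping("111111111111"))  # one account's view of a-d
    print(az_mapping("222222222222"))  # another account's different view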
Adding to the "why it's scrambled" comments: because data transfer pricing differs for intra- vs. inter-AZ traffic, it is possible to coordinate with other entities if you need to be in the same AZ as them.
But, yes, generally: Amazon randomly allocates AZ labels to given customers to avoid preferential clustering on any given first choice.
My co-location provider's data center has had about four hours of downtime in ten years. It turns out a backup diesel generator, redundant connectivity, and a good network/sysadmin are all anyone needs.
People are now working out how to fail over from AWS to Rackspace, and that is infuriating to me. You... you need redundant clouds now? That can't be right!
You're comparing different things: most of the complaints about AWS have been due to servers (EC2) or network storage (EBS) failing. The outage in March was networking-related, though, so that one would count.
Why do you assume Amazon doesn't have "a backup diesel generator, redundant connectivity, and a good network/sysadmin"? And if they do have all that, why is your situation different?
You tell me, man. I guess the difference is that I can install many times more server horsepower for the $500/month rental of a half rack at my colo, and the result will be better uptime than EC2. I just can't install it immediately.
Really? I've noticed an increase in maintenance with Linode recently, but they've always been really great with notice, so much so that I am consistently surprised. Whenever there is planned downtime, I've gotten an email at least a week in advance, and it gives me the option to migrate my Linode to another server at my own convenience if their schedule isn't compatible with my needs. Do you not get these emails and options?
For the scheduled ones, I think I did receive some prior notice; I might have gotten particularly unlucky with a bunch of emergency/network/etc. maintenance affecting many of my servers simultaneously.
My problem with them is they don't provide RFOs (Reason For Outage reports) because they can't give away any of the "proprietary secrets" about their setup. I'd like to know what happened when it affects me. Also, when it comes to replacing hardware or performing maintenance, their priorities won't always line up with mine.
Interesting. We moved away from VPS.net for a similar reason. There were multiple occasions where hosts got shut down and support couldn't tell us the reason for it. Since we're using hand-rolled virtualization on top of a few rented servers, we're basically problem-free.
If you use DNS round robin, then you pretty much guarantee that everyone will be affected if just one hosting company is down. DNS round robin is not the tool for this job.
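(To make that concrete: round-robin DNS hands every client the full record set in rotating order, with no health checking, so a naive client will happily connect to a dead host. A rough Python sketch with a stand-in hostname; real failover has to live in the client or a load balancer.)

    # Why round-robin DNS isn't failover: the resolver returns ALL
    # addresses regardless of health, and a naive client takes the first.
    import socket

    def naive_connect(host: str, port: int = 80) -> socket.socket:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        # Round robin only rotates the order of these records; if the
        # first one points at the downed host, this client just fails.
        family, socktype, proto, _, addr = infos[0]
        sock = socket.socket(family, socktype, proto)
        sock.connect(addr)  # no retry against the remaining records
        return sock

    def connect_with_failover(host: str, port: int = 80, timeout: float = 3.0):
        # A client that actually fails over must iterate and time out.
        last_err = None
        for *_, addr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
            try:
                return socket.create_connection(addr[:2], timeout=timeout)
            except OSError as err:
                last_err = err  # dead host: try the next record
        raise last_err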
N. Virginia in my experience is by far the least reliable region on EC2/EBS... Fortunately our app servers are spread across two zones in the region... but our db server is just a lone master... Our slave is down... Very nervous.
I run a master and a "hot" standby master, each in a different AZ, with slaves of each in their respective AZs for days like today. Expensive, but it makes it easy to sleep at night.
The slaves have their EBS disks snapshotted every 30 minutes, the master every 24 hours.
A "full-spec" machine, another X-Large with 4 EBS volumes that I can fail over to. Its in circular replication with the other "active" master (only one is receiving writes at a time). These instances are only snapshotted once a day to keep them as fast as possible.
N. Virginia is by far the most used region. There's more stuff to fail and any failure will affect more people. I don't think there's anything inherently less reliable about the geographical location of the building.
I've been wondering this myself lately. It seems that every major EC2 outage hits US-East. By comparison, my US-West instances have way better uptime (granted, over a shorter observation period). I've never tried the Europe or Asia regions, but I'm tempted to now.
On the other hand, they don't have any prices listed, and their blog is down with "Error establishing a database connection". That doesn't exactly inspire confidence in going to them for hosting.
Our instances are still down.
The AWS service health dashboard says:
9:27 PM PDT We continue to investigate this issue. We can confirm that there is both impact to volumes and instances in a single AZ in US-EAST-1 Region. We are also experiencing increased error rates and latencies on the EC2 APIs in the US-EAST-1 Region.
The update I heard was (essentially) 'Another update from Amazon: Looks like it was a power issue for one facility that services a particular AZ in us-east-1, flipped to generator, now back on power and in recovery mode.'
So, Amazon has said since the introduction of EC2 that, to ensure really high uptimes, customers should use multiple availability zones and architect their applications to survive an outage in a single availability zone. While I would question Amazon's competence if outages of any sort were overly frequent, Amazon has not had many at all, and no recent cross-AZ ones. [This is correct, right?] I recognize that architecting applications to be performant across datacenters (tolerant of relatively high-latency replication) is hard, but Amazon seems to be a poster child for keeping its promises w.r.t. availability. Is my take on this incorrect?
Still down here, over 12 hours at this point. This is probably the second time we've been hit with something on AWS in the last three months -- and you have to pay them to talk to someone about it. We're definitely moving to Linode ASAP.
If your application needs to be up constantly, then it should probably be at least multi-AZ scaled, if not multi-Region. Multi-AZ applications are not affected by this outage, and multi-AZ events are very rare. Living out of a single AZ is very risky.
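(At the launch level, "multi-AZ scaled" just means pinning capacity to different zones so a single-AZ event like tonight's only costs you a fraction of it. A minimal boto3 sketch; the AMI ID and instance type are placeholders, not real values.)

    # Spread one instance into every available AZ in the region so a
    # single-AZ outage takes out only a fraction of capacity.
    # "ami-12345678" and the instance type are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    zones = [
        az["ZoneName"]
        for az in ec2.describe_availability_zones()["AvailabilityZones"]
        if az["State"] == "available"
    ]

    for zone in zones:
        ec2.run_instances(
            ImageId="ami-12345678",   # placeholder AMI
            InstanceType="m3.large",  # placeholder type
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},  # pin to a specific AZ
        )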
Does anyone else find it strange that two Heroku posts made the frontpage considerably (in relative terms, obviously) earlier than "EC2 down"? I would think EC2 is a more common denominator for people, but maybe other hosts have better redundancy and thus there wasn't an immediate awareness?
Or am I just overly curious and it's really just that some Heroku clients happened to notice before an at-large EC2 customer?
edit: I don't mean to imply a conspiracy of some sort, upon a reread. I'm merely curious whether there are just that many Heroku users in particular on HN, or somesuch.
It is probably because of the large Heroku outage and post here just the other day; people are trying to point out that they are down again, as that is more dramatic than a normal AWS disruption.
A PaaS like Heroku ends up being "front line support" for AWS. If you use Heroku and your apps fail, you don't care that it's Amazon's fault; you blame Heroku.
Source: http://status.aws.amazon.com/?rf Or: http://status.aws.amazon.com/rss/ec2-us-east-1.rss