Another EC2 outage (yet AWS dashboard says no error)
120 points by rdl 1145 days ago | 76 comments
EC2 appears to be down (not Heroku, specifically, but EC2). AWS dashboard doesn't yet show this.



Amazon Availability Zones are such a fucking lie. ("Shared nothing, and an outage affecting one will not affect other Zones in the Region.")

I've seen more failures which take out multiple AZs than which take out only a single AZ. So, a prudent person would split their application across regions (which are relatively shared nothing, except for admin/account level stuff), but Amazon goes out of its way to not make that easy -- you're using the public Internet, pay higher costs, etc.

The right choice is probably EC2 plus a non-EC2 provider (your own hosted stuff, another cloud (?), etc.), with protection in case either goes down. But that is a lot of work, and if you're on a PaaS like Heroku, which is 100% exposed to EC2, you can't do it.
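
Roughly, I mean something like an external watchdog (run somewhere that is neither provider) that health-checks the same app on EC2 and on the second provider, and flips DNS to whichever is alive. A minimal sketch, assuming Python; the hostnames and the switch_dns() helper are made up for illustration:

    import time
    import urllib.request

    ENDPOINTS = {
        "ec2":   "https://ec2.example.com/healthz",     # hypothetical
        "other": "https://backup.example.com/healthz",  # hypothetical
    }

    def healthy(url, timeout=5):
        # True if the endpoint answers 200 within the timeout
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    def switch_dns(provider):
        # Placeholder: in practice this calls your DNS provider's API,
        # or you let a DNS-level failover service handle it.
        print("pointing traffic at " + provider)

    active = "ec2"
    while True:
        if not healthy(ENDPOINTS[active]):
            standby = "other" if active == "ec2" else "ec2"
            if healthy(ENDPOINTS[standby]):
                active = standby
                switch_dns(active)
        time.sleep(30)

The real work is in everything the sketch hides: replicating data between the two sides, and making sure the watchdog itself isn't hosted on either of them.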

Kind of sucks.

-----


My Nagios-awoken 3 AM brain finds no fault in your logic.

-----


Even funnier are people doing server monitoring of (things in EC2) from within EC2. When the EC2 outage happens, there's obviously no problem because no alerts get sent...

Doh!

-----


For some people, that might be fine. If you don't have plans for how to rapidly move out of EC2, you might as well just sleep through an all-of-EC2-goes-down outage for all you can do about it.

-----


You should at least know there is an outage to have something to tell your downstream customers. It is really embarrassing to have a customer (or your boss) call to report an outage you don't yet know about, even if there is fuck all you can do to resolve it. Basic principle of ops.

-----


I wasn't being entirely serious. :-)

-----


> Basic principle of ops

For my benefit, what are some others?

-----


This would actually be an interesting blog post.

-----


This is why my sleepy 3AM brain was awoken by Pingdom. Hooray for having just enough redundancy to tell you that it's not quite enough.

Good night.

-----


Me, I use dedicated load balancing for my traffic when a cloud outage is detected. And I sleep perfectly ;-)

-----


Could you give a little more detail on your setup? I'm curious how others are designing around these issues.

-----


If my case can help you: my company uses a service from one company to load-balance traffic across multiple CDNs/clouds. We are no longer impacted when a single provider fails. You can read this: http://tinyurl.com/7pwfza7 (I'm a user, not the vendor)

-----


I can't figure out why you people are using URL shorteners on HN, but I believe it is not looked upon well. So, for others, these links are as follows:

http://www.theregister.co.uk/2012/02/17/cedexis_and_the_open...

http://translate.google.fr/translate?hl=fr&sl=fr&tl=...

-----


Very interesting, flojibi. Another one about multi-cloud: http://bit.ly/zg37FQ

-----


Even funnier than that is watching the latency hit at Rackspace Cloud and Terremark as some non-trivial number of customers fail over.

-----


Do you work for a DNS provider or CDN or something (so as to see this in near realtime)? Envy.

I haven't seen a lot of people using both EC2 and Terremark for the same app -- kind of different markets. Not technically unreasonable, but Terremark seems to be more enterprise IT outsourcing, while EC2 (followed at a very far remove by the other clouds, including Rackspace) serves Internet-delivered consumer apps, or at least larger-scale public services.

-----


Here's an idea I've thought about but don't have time to do anything with: a peer-to-peer monitoring network, so each new server on each new network makes it more robust. No idea how the details would work out.

-----


That gets done for network/application performance monitoring (as an alternative to Keynote, Gomez, etc.; it's also how some of their own products work). It's kind of overkill for basic application-level monitoring -- there's a tradeoff between the number of endpoints checking and the frequency of checks. I guess you could round-robin checks across a larger number of end nodes, too, to get both.
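
The round-robin version is simple enough to sketch. Assuming Python, with the node names and target URL made up for illustration: each probe node only checks on its own "turn", so a large pool still gives you a high aggregate check frequency while each individual node probes only rarely.

    import itertools
    import time
    import urllib.request

    NODES = ["probe-us-east", "probe-eu-west", "probe-ap-south"]  # hypothetical
    TARGET = "https://app.example.com/healthz"                    # hypothetical
    INTERVAL = 60  # seconds for one full rotation through the pool

    def check(url, timeout=5):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    # Each node runs this loop and acts only when the rotating turn is its own,
    # so the target sees one check every INTERVAL / len(NODES) seconds overall.
    # A real system would coordinate turns via a queue or shared clock rather
    # than trusting every node to stay in lockstep.
    MY_NAME = "probe-us-east"
    for turn in itertools.cycle(NODES):
        if turn == MY_NAME:
            print(MY_NAME + ": " + ("up" if check(TARGET) else "DOWN"))
        time.sleep(INTERVAL / len(NODES))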

-----


We're set up across multiple AZs in the affected region, and all we had was a few minutes of failed requests to one AZ until our systems automatically shifted all the traffic to another.

Even the major day-long outage last year only hurt us because we had (at the time) not really spread ALL our core systems across multiple AZs. We just re-launched those systems in another AZ and everything was up and running again.

Which outages have affected multiple AZs?

-----


The most recent one (I think it may have been ELB-specific; I don't have a huge sample set), and the big EBS outage (which affected multiple AZs to some extent).

-----


You need different regions for DR in any case. It's hardly unprecedented for network issues like this to take down multiple data centers in an area even when they're not part of the same provider.

-----


AZs are supposed to be distinct datacenters within a single region. If all of your customers are in (e.g.) APAC, it's not unreasonable to put all your online processing within APAC, with high bandwidth connectivity between them and from each to customers. You might not be able to do master-master over extremely long distances for performance reasons under normal conditions, but you'd keep warm or cold backups totally out of the area. There are a lot of factors which go into the decision, but there are definitely times when 2 datacenters (often run by separate providers) with independent connectivity, but both within a specific distance, makes more sense than extreme separation.

It's sad how people knew how to do this stuff ~2002-2006 and then forgot it all (or just stopped caring) once the delicious cake of cloud appeared.

-----


You missed my point: this is not a cloud problem except to vendors looking to sell non-cloud hosting. Any region is vulnerable - some clown with a backhoe, congestion / DDoS, routing screwups, etc. have taken out data centers in entire areas (Los Angeles, SF, NY, etc.) even when providers thought they had more redundancy. If you really need it, you spend the money on wide geographic separation.

-----


For this reason I'm using a set of different VPS servers running on both Linode (UK datacenter) and Slicehost (US datacenter).

So: separate datacenters, admin layers, providers, and (also important) billing.

Running a highly available cluster in this setup isn't trivial though, mostly due to network splits. It works quite well for specific purposes where availability is more important than data integrity (remote monitoring, in this case).

-----


That's my plan too. By using dual clouds (again in the UK and US), we're getting the highest failover protection we can afford. I can't afford for our e-commerce platform to be down, and the evidence shows that a single cloud isn't robust enough. We call it "Cloud Docking" :)

-----


The Cloud isn't infallible; it doesn't solve all the problems like everyone says; news at 11.

*Downvote if you have a legitimate technical reason I'm wrong, not just because you're throwing a hissy fit that your technology of the week isn't all that and a bag of chips.

-----


This does make me tempted to just put everything in one zone (less latency) and have a backup in another region entirely. Clearly, backups in different AZs aren't the best plan.

-----


http://www.quickmeme.com/meme/3obpuo/

-----


Waking up at 2.30am to a total service outage sucks, but at least I know all you bastards are awake and screaming too. That makes me feel slightly better.

(We're back now. Kudos to Pingdom for noticing and alerting)

-----


Ah, Pingdom failed to alert us via push; we only got the email alert (luckily it was exactly when the team was warmed up and ready to hit the keyboard :) 11 AM in Israel).

-----


I have an IFTTT recipe to get a push notification if you have a server outage: http://ifttt.com/recipes/25122

-----


Cool! Thanks, adding it now (IFTTT rocks).

-----


For a good 20 minutes I didn't know it was an outage and panicked thinking my server mysteriously stopped working, and couldn't even SSH in.

-----


I've learned that when my SSH session dies, the first thing to do is go to twitter and search for "EC2". People were complaining there within about 60 seconds of the outage starting.

-----


Alternative monitoring opportunity? Grep a twitter stream for "EC2" and "down" and raise an alert if you get enough hits in 5 min?
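
Something like this, maybe. A rough sketch, assuming Python; fetch_recent_tweets() is a stand-in for whatever Twitter search/stream client you have, and the threshold is pulled out of thin air:

    import time
    from collections import deque

    WINDOW = 5 * 60    # sliding window, seconds
    THRESHOLD = 20     # matching tweets in the window before alerting

    def fetch_recent_tweets(query):
        # Hypothetical stand-in: return a list of (tweet_id, timestamp, text).
        return []

    seen = set()       # tweet ids already counted
    hits = deque()     # timestamps of matching tweets, oldest first

    while True:
        now = time.time()
        for tweet_id, ts, text in fetch_recent_tweets('"EC2" "down"'):
            lowered = text.lower()
            if tweet_id not in seen and "ec2" in lowered and "down" in lowered:
                seen.add(tweet_id)
                hits.append(ts)
        while hits and hits[0] < now - WINDOW:   # expire old hits
            hits.popleft()
        if len(hits) >= THRESHOLD:
            print("ALERT: the crowd thinks EC2 is down")
        time.sleep(30)

You'd want to tune the threshold against the normal background chatter, since people grumble about EC2 even when it's up.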

-----


Looks like the entire US-East-1 region had connectivity issues. We're a monitoring company, so we're watching this very carefully... there is a write-up on our blog:

http://www.verelo.com/blog/2012/03/15/aws-entire-an-region-d...

-----


What's funny is that this is what I find myself doing instinctively when I encounter an outage or high latencies on ANY service or site: Heroku (recent process startup woes), Google Apps (slowdowns, specifically Gmail), Amazon (when it's hammered by traffic to big deals), etc.

I second the comment above suggesting a "crowdsourced" status app monitoring twitter. Although it's no consolation for service interruptions, it does at least keep you sane knowing the problem is elsewhere.

-----


Yes, because Twitter never goes down.

-----


What does it matter if Twitter goes down? The odds of it happening for an extended time right at the beginning of an EC2 outage are rather small, and even in that worst-case scenario, it doesn't really put us in any worse of a position than we're in now.

-----


Touché.

-----


One advantage to being on the east coast vs. the rest of you silicon valley types - I got awakened at 5:30am instead of 2:30am! :)

-----


true, but that still doesn't make up for the weather difference :)

Plus, who said we were sleeping?

-----


fair enough. And I'm sure the next outage will happen around 11:30p PT. :)

-----


Leaving 30 minutes before the bars in NYC close. Just saying.

-----


Last call in NYC is usually 4AM.

-----


He's talking about the next outage :)

-----


No, actually I incorrectly added 3h to 2330 and got 0330, mostly because I'm sleep deprived after this outage. Or I could pretend it was "an hour to resolve the outage, then back to the bars".

-----


Apparently it was caused by a network problem above AWS:

2:40 AM PDT We are investigating connectivity issues for EC2 in the US-EAST-1 region.

3:03 AM PDT Between 2:22 AM and 2:43 AM PDT internet connectivity was impaired in the US-EAST-1 region. Full connectivity has been restored. The service is operating normally.

6:09 PM PDT We want to provide some additional information on the Internet connectivity interruption that impacted our US-East Region last night. A networking router bug caused a defective route to the Internet to be advertised within the network. This resulted in a 22 minute Internet connectivity interruption for instances in the region. During this time, connectivity between instances in the region and to other AWS services was not interrupted. Given the extensive experience that we have running this router in this configuration, we know this bug is rare and unlikely to reoccur. That said, we have identified and are in the process of deploying a mitigation that will prevent a reoccurrence of this bug from affecting network connectivity.

We understand that when networking events affect instances in multiple Availability Zones it causes our customers serious operational issues that are difficult to architect around. We have been using and refining our Availability Zone architecture for over 10 years at Amazon to provide highly reliable services. Availability Zones provide a high degree of isolation including physical separation, independent power distribution, independent cooling and mechanical systems, and multiple physical links to the Internet through multiple transit providers and peering connections. All of our regions have exceeded 99.99% availability over the last several years. We are also continually investing in improving our architecture as we learn more. In addition to the remediation discussed above which addresses the specific bug we saw last night, we are currently in the later stages of refining the way that we do route advertisement within a region. These changes will isolate any bad route information to inside a single Availability Zone while maintaining the performance characteristics of our current inter-Availability Zone network design. We have been deploying these changes carefully to avoid impact to customers, but we expect these changes to be complete within the next several weeks. We are confident these changes will protect us from multi-Availability Zone impact for the sort of bug we saw last night.

-----


Their dashboard not showing this and Twitter being more reliable as a way of showing downtime is a sad tale.

-----


That dashboard sucks, but Twitter being a faster, more reliable source of news is true for most things now.

-----


Back now. My logs show it was down from 2:23 to 2:41.

-----


Painless server management is slowly becoming painful again :(

It's worse if you're tied into the AWS ecosystem: you can't move out, or even keep backup servers somewhere else.

-----


The largest issue I see with this is RDS. If you use RDS... I don't know, but migrating away is not going to be easy.

Migrating the servers themselves, however, shouldn't be too hard if you have a fairly sane build process. If you don't, and I suspect a lot of people don't... it's not going to be fun.

-----


It's more that if you make use of the Amazon APIs for autoscaling and other services, you can't just directly translate that to a more static managed hosting environment.

Probably the sane way is to special case some subset of your functionality so it works regardless, and then gracefully scale up/down your app (performance, scope of features, etc.) based on system health. This is a lot more complex, and really hard to retrofit.
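
Concretely, I mean something like a priority list of features that gets truncated as measured health degrades. A toy sketch, assuming Python; the feature names and thresholds are invented:

    # Rank features by how essential they are; shed the optional ones first.
    FEATURES_BY_PRIORITY = [
        ("checkout",        0),   # must always work
        ("search",          1),
        ("recommendations", 2),
        ("analytics",       3),   # first thing to drop
    ]

    def enabled_features(health):
        # health: 0.0 (everything on fire) .. 1.0 (all good)
        if health > 0.9:
            max_priority = 3
        elif health > 0.6:
            max_priority = 2
        elif health > 0.3:
            max_priority = 1
        else:
            max_priority = 0      # core functionality only
        return {name for name, prio in FEATURES_BY_PRIORITY
                if prio <= max_priority}

    print(enabled_features(0.5))  # -> only 'checkout' and 'search' survive

The hard part is that every request path has to tolerate its optional dependencies being switched off, which is exactly the retrofitting problem.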

-----


Yeah, that's a fair point.

From the very beginning I've stayed far away from anything that locks us into AWS, and we've made no use of anything that couldn't be picked up and moved elsewhere, so auto scaling was never something we chose to utilize, for exactly that reason.

This held us back a bit at first; even tools like SES initially offered only an Amazon-specific API. Now SES supports standard SMTP connections, so we decided there was no harm in using it, since we could easily switch away with no code changes.
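
To illustrate what the standard protocol buys: the same smtplib code talks to SES or any other relay, and only the connection constants change. A sketch, assuming Python; the us-east-1 SES SMTP endpoint shown is the commonly documented one, and the credentials are obviously placeholders:

    import smtplib
    from email.message import EmailMessage

    SMTP_HOST = "email-smtp.us-east-1.amazonaws.com"  # swap for any SMTP relay
    SMTP_PORT = 587
    SMTP_USER = "YOUR_SMTP_USERNAME"                  # placeholder
    SMTP_PASS = "YOUR_SMTP_PASSWORD"                  # placeholder

    msg = EmailMessage()
    msg["From"] = "alerts@example.com"
    msg["To"] = "oncall@example.com"
    msg["Subject"] = "test via SMTP relay"
    msg.set_content("If the provider changes, only the constants above do.")

    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as smtp:
        smtp.starttls()
        smtp.login(SMTP_USER, SMTP_PASS)
        smtp.send_message(msg)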

-----


RackSpace has a fairly sane API for building new servers. If you're not averse to using Ruby, the fog gem is pretty awesome for building servers on both Amazon's and Rackspace's clouds.

-----


Related: http://deltacloud.apache.org/

-----


Someone please apply to YCombinator with this idea.

-----


Did a quick write-up on this here: http://webdev360.com/users-report-major-amazon-ec2-outage-bu...

Does anyone know if ops teams generally have to clear this sort of announcement with PR/comms? I assume they would, given how they get reported on.

-----


At a competent company, no. There are policies in place before the outage, but having your PR people in the loop slows things down to the point where you're worthless to your customers. The exception is you loop in legal, PR, etc. if someone is actually injured/dies, or if crimes are involved.

A lot of providers try to NDA their "ops to customer" service outage notifications, but most customers flagrantly violate those NDAs. Automated service dashboards are supposed to be automatic; ops teams often put in short statements (especially time to fix and any interim way to mitigate the outage).

Definitive statements after the outage are run by PR (and generally announced senior to ops), but service notifications of outages (vs. causes, compensation, and long term corrective actions) are not.

-----


Thanks -- that's what I assumed initially, but had second thoughts after seeing how delayed the announcements were.

-----


I suspect there is some human admin level disconnect between their network ops/routing and the AWS team itself. A connectivity outage wouldn't necessarily get detected within AWS, and they probably don't have good monitoring within the AWS product to detect problems like that. The network team presumably doesn't have a good way to push status updates to the AWS dashboard automatically (and it's kind of a grey area what is a "network outage" -- if you lost routes to just Pakistan, that's not really a big deal for most AWS customers. If you lose routes to everyone, yes, that's a big deal.)

-----


Somewhat ironically, I can't get to that site.

-----


I think our servers were down for a few minutes earlier -- should work now.

-----


Just updated - looks like a network connectivity issue. It looks like they have had issues every day for the last 3 days - first time I have seen it down, though.

-----


For those of you relieved it happened in the middle of the night: hi from mid-morning London, where externally hosted services, websites included, mysteriously disappeared for half an hour.

Nothing critical, but certainly a warning shot across the bows.

-----


"The right choice is probably EC2 plus a non-EC2 provider (your own hosted stuff, another cloud (?), etc.), with protection in case either goes down"

Yes, a multi-cloud strategy... I know a good company for this ;-)

-----


And yet you have no contact info in your profile...

There's also the obvious risk with even using a single PaaS running on multiple IaaS clouds. If your account with the PaaS gets hacked, or they get acquihired, or whatever, you can be screwed too.

Figuring out exactly where to have redundancy in your business is hard. Especially because building something to be redundant imposes costs (more expensive, slower development) and sometimes itself is the cause of outages (lots of hilarious failover-related failures have taken down sites).

-----


Keeping an eye on HN all the time pays off :) I too was wondering what the hell is going on with my EC2 instances when I saw this thread.

AWS status dashboard: -1, HN: +1

-----


It's incredible how much of the internet depends on Amazon.

-----


Happened with many of our sites a few moments ago. Went to their status history and no sign of failure. Shame on you, AWS.

-----


I have a dotCloud app (EC2-based, AFAIK) down too.

-----


So this is why wildfireapp.com was down.

-----


Seeing this too. The status dashboard shows connectivity problems for US-EAST-1 now.

-----


At least I wasn't imagining it.

-----


back up

-----



