Hacker News
Heroku is down again (heroku.com)
203 points by ardakara on June 30, 2012 | 126 comments

MASSIVE storms in the VA area, where us-east-1 is. 326,000 customers are without power already, and it's the worst lightning I have seen in my 20 years of life. The sky is an intense blue/green/purple. This is most likely what the issue is.

No matter how powerful we become as a species with our technology, we are still at the mercy of the clouds. Pretty cool if you think about it.

The cloud is no match for the cloud!


I'm completely ignorant here. But aren't these outages usually solved by having backup servers in different locations? As many datacenters do, and as I imagine something as huge as heroku would?

Or if we just built our power grid underground like rational people.

Underground cables have more expensive set up costs, lower lifetime, and higher maintenance costs. The price you pay for electricity doesn't even come close to justifying burying power lines. There's also the ecological stuff if you find that a reasonable argument. Bottom line, burying power cables just so you don't have to light a candle for a night isn't worth it.

Until earthquakes.

Depending on where you are in the world, earthquakes are much rarer than insane storms. I'm speaking as a Floridian. I'm fairly ignorant on this issue, but would it be that difficult to use one or the other depending on which natural occurrence is more likely? Or is this also a cost issue?

Massive cost issue, plus some technical issues.

According to this document [1]:

"The North Carolina Utilities Commission studied the cost of placing Duke Power’s distribution facilities underground and found it would cost more than $41 billion, resulting in a 125 percent increase in customer rates."

[1] http://www.sceg.com/NR/rdonlyres/465E6534-2FFB-4069-BF84-814...,

Underground cables have many problems like rats (and other vermin) or people harvesting copper/metals.

Do they? Here in Germany the entire cabling within cities is underground, only the high voltage long distance lines are above ground. I've never heard a story about people stealing underground cables (they do steal e.g. train track above ground cabling). That also wouldn't make sense, digging up those cables is much more effort than taking them down from a post.

I've also never heard stories about issues with rats.

Power outages still happen, but they are quite rare - in 30 years I can only remember twoish.

Rats are a common menace with all sorts of cabling. Large parts of Scotland recently lost broadband due to rats eating cables (http://www.theregister.co.uk/2011/10/12/dirty_rat_downs_virg...).

Apparently it's the insulation on the wires that they like.

Really? Rats and copper harvesters?

Like it's gonna be some unprotected plastic cables 1 foot under the ground?

I don't know, but I've heard stories. Stealing underground cables is not common, but it has happened. And rats and groundwater are quite a problem for underground (copper) cables.

Sometimes they don't even steal the cables... This guy just liked the sparks. http://www.westvalleyview.com/main.asp?Search=1&ArticleI...

Not cool at all.

All it means is that humans are not yet powerful enough to make the environment work as it should (i.e. serve humans).

Well, until we figure out plausible ways to control weather reliably on a large enough scale, at least. Without killing the atmosphere or our species or anything like that.

Or each other...

I see what you did there.


this is pretty bad, really bad ...

Hopefully one day it will be feasible to host our computing architecture in space and avoid all these terrestrial obstructions.

Until a solar storm hits, and our servers are no longer under the protection of the Earth's magnetic field.

How do satellites deal with solar storms currently? Surely they could put a ton of iron around the servers to protect them.

I have a feeling the electromagnetic conditions are much more stable on earth than in space. The magnetosphere and the atmosphere deflect a great deal of energy.

What about when a booster goes off by mistake and sends your data flying into the sun?

The sun is far away, so you'd still get the data back before it hits. But if the booster made the satellite hit another satellite or descend to Earth, that would be a worse situation.

Or as the religious people call it, "The Act of God". No seriously, people actually write that shit in their terms. Hilarious.

It's a legal term with a specific meaning.


Which is why I brought it up, it's hilarious. I thought people were just trolling at first but man, the first time I saw it, it made my day. Relating something like "God" with natural disasters. I love how people come up with that kind of stuff.

Religious people be mad!

Presumably in this case - Thor ?

That, or Ramuh? Or Quezacotl?

Well it's hitting N America and his movie didn't do that well - so he might be a bit annoyed


Seems pretty severe actually. Washington Post has a live blog going on about it:


In other words, Amazon's power backups failed again. On the bright side at least they are not running a nuclear reactor.

Was watching a movie in a big 20-screen theater in Richmond, and they told everybody to just leave (incidentally, not through the emergency exits, instead they funneled 100s of people into the lobby all at once :/)

I live in DC. It was an amazing storm. A transformer in my area went down fairly quickly. Fortunately, I live near a large hospital.

Saw this post here on HN, pulled up www.chart.state.md.us to watch the live traffic cams in the area. Clicked through a couple, some of which showed heavy rain, wind & lightning. Then the stream froze and now the site is completely unresponsive.

EC2 status:

8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.

8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.

8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.

8:49 PM PDT Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.

9:20 PM PDT We are continuing to work to bring the instances and volumes back online. In addition, EC2 and EBS APIs are currently experiencing elevated error rates.

How many times does this have to happen before heroku spreads across multiple regions?

It's a hard thing to engineer, especially after the fact, and especially when you are trying to hide it behind an abstraction layer. (Which is to say: You can't expect your customers to engineer their apps with multiregion in mind, or to take it kindly when you raise rates to support additional redundant hardware and bandwidth.)

e.g. because region-to-region data transfer is not free, and trans-region latency is ugly, you can't just relaunch half your instance farm in another region and expect happiness. There are also routing issues: Internal IPs don't work across regions, elastic IPs don't transfer across regions...

Even if they can't do it right away, they should communicate a plan for how they are going to tackle this recurring issue. That's the whole point behind status.heroku.com / trust.salesforce.com. They are part of a publicly traded corporation with a lot of resources.

Extremely nerve-wracking for new startups like ours.

I guess it is always a good plan to have an instance deployable on Linode.

If you have that set up, does Heroku give you any extra value?

^^ This.

You say this like they can just snap their fingers and provide regional services.

Well, nothing of scale happens overnight.

But anyone deploying a critical application to AWS makes a point of cross-region data replication. Heroku have long known that they lose potential customers to, say, Engine Yard as a result of only hosting at US-East.

One can only conclude that this is a clear business decision on their part. I can hardly believe that Heroku's engineers are incapable of it. Indeed I would be very surprised to learn that they haven't brought up an instance of their platform at, say, US-West, for testing or proof-of-concept purposes.

Of course, productising that is a different matter. Extending the control plane, front end, and pricing/billing systems might have considerable associated project cost. Perhaps they have concluded that the costs outweigh the additional revenue. Or, just haven't got around to it yet.

If not, why pay them rather than go to amazon directly?

Like http://appfog.com/ has?

Can we update the title to something like "AWS US-east-1 is down" instead of just Heroku?

> informative titles get changed

> uninformative titles left intact

hn 2012

Appropriate title might be that "Heroku is down due to AWS outage which is down due to power failure which happened due to storms caused by moist winds colliding with hot air that was heated over the continent by sun that....". It really doesn't matter. Heroku is down. Customers don't care.

Then you'd have to change the link too?

Why does the AWS dashboard show all green when that is most definitely not the case?


I strongly doubt that's the case; at PagerDuty we're seeing ~100x the regular traffic. I think it's been having issues for at least 15 minutes.

The red check marks are in VA and can't be displayed.

If only they did multi-AZ hosting like they keep telling us to do when there's outages :)

Even now that it is updated it has yellow triangles for "performance issues" instead of red circle for service disruption. Seems like they are in denial.

This was the disappointing thing for me as well. Our connectivity died around 8PM EST-ish, and I immediately went to status.aws and it said everything was normal. I then proceeded to waste half my night looking at our internal infrastructure trusting that page was accurate.

I've learned my lesson.
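The lesson here is to not rely only on the provider's dashboard and to run your own checks from outside. Below is a minimal sketch of such a probe, assuming a hypothetical `check` callable that you supply (for example, an HTTP GET against your own health URL with a short timeout). It flips a target to DOWN only after several consecutive failures so that one dropped request doesn't page anyone. The names `Probe`, `check`, and `threshold` are illustrative and don't come from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Probe:
    """Tracks one endpoint; it is marked DOWN after `threshold` straight failures."""
    check: Callable[[], bool]  # hypothetical: returns True if your endpoint answered OK
    threshold: int = 3
    failures: int = 0
    down: bool = False

    def run_once(self) -> bool:
        """Run one check and return True if the up/down state just changed."""
        try:
            ok = self.check()
        except Exception:
            # A timeout or connection error counts as a failed check.
            ok = False
        was_down = self.down
        if ok:
            self.failures = 0
            self.down = False
        else:
            self.failures += 1
            self.down = self.failures >= self.threshold
        return self.down != was_down
```

You would call `run_once()` on a timer from a machine outside the affected region and alert whenever it returns True. The threshold trades detection speed for fewer false alarms.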

Power went out = service disruption.

It's underhanded to call it a "performance issue," if not an outright lie, albeit a small one.

Yup, a lot of services served by AWS are having issues. We're seeing a huge spike in incidents being triggered in PagerDuty.

(fyi: our customers are still being alerted)

What infrastructure do you use for alerting your customers?

My app with 2 dynos is down.

Their status site is running fine, although it's not reporting errors: https://status.heroku.com/

Their Helpdesk is down: https://api.heroku.com/helpdesk/login?timestamp=1341025835&#...

Devcenter is down: http://devcenter.heroku.com

AWS isn't reporting any errors: http://status.aws.amazon.com

That's standard. You'd expect them to run everything they have on their own service except for the status site.

At 11:25 Eastern, https://status.heroku.com/incidents/386 was posted: "We're currently experiencing a widespread application outage. We've disabled API access while engineers work on resolving the issues."

Amazon posted an update:

8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.

Please link to https://status.heroku.com/, pointing to a broken URL is pointless.

I think Netflix is down too. So much for Chaos Monkey?

Yup, I can confirm from here in Montana that both Netflix and Heroku are down.

However, I have 20 instances on us-east. And haven't seen any problems, even during yesterday's outage on AWS.

Edit: that doesn't mean this isn't an AWS outage.... It almost certainly is.

Yeah, my ~20 instances are okay. I think we had a hiccup with our (multi-az, whooo!) RDS though.

No wonder my wife was just asking me why Netflix is down on Xbox live here in Seattle.

I think Netflix's CEO just signed up for a Rackspace account.

Rackspace got hit by a truck last year and went down for a while too. Not cloud != perfectly reliable.

Slightly different scenario, however: the power was shut off by the fire marshal, if I recollect correctly.

Rackspace (and many, many other co's) tend to have functional UPS units & generators. Amazon tends to choose the cheapest datacenter facility imaginable and then these sort of failures occur.

Given their size, they'll inevitably fix the power issues though; they've got the finances and they're capable of adding a few levels of redundancy.

I found the reports about the outage - it was 2007 (so obviously much more than a year ago) but very similar to one of Amazon's recent outages - the truck took out a transformer, Rackspace fired up backup power, but cooling failed to start so Rackspace had to shut it all down to avoid melting everything.

Looks like Amazon wasn't the only one with inadequate testing of their continuity plan. And I don't think Rackspace offered alternate Availability Zones at that point.

You're joking, but I bet Netflix's Asgard system gets support for OpenStack pretty quickly now...

Netflix has confirmed that they would like the option to migrate away from Amazon. https://www.computerworld.com/s/article/9222294/Netflix_open...

Interesting link - thanks.

I think Netflix are expecting another cloud to offer the same model and API as Amazon though, which isn't likely to happen - everyone else is learning from AWS's mistakes!

Even if it did, many of the features they're waiting for (like auto-scaling groups) probably wouldn't be as useful in a multi-cloud environment, and would therefore have to be built into Asgard.

Confirmed in California as well.

Confirmed in Missouri.

There are US-East problems, looks like:

  - One AZ is down
  - API commands are spotty and may return incorrect results
  - ELB looks screwed
  - IP reassignments don't seem to be working
  - Who knows what the fuck else is broken

Don't think it was just heroku, lots of other sites were down as well, netflix.com, etc. most likely another AWS issue.

I can't get to netflix.com. That is no good. Luckily all of my 75+ east servers seem to be ok.

netflix.com is down here as well

Reddit also seems to be experiencing some difficulties. Are they still on AWS?

We're in AWS East and definitely fighting some issues here, though we're trying to understand what is happening.

We have dozens of servers that are unavailable at the moment (US-East). Obviously AWS is having major issues.

It's been fifteen minutes since our site in US-East went down and AWS Status hasn't said anything yet.

Ugh, the EC2 administration console is down. Being in other availability zones won't save you.

As of 10:16 CDT, I can't reach Netflix or Heroku, although AWS status (http://status.aws.amazon.com/) is not yet reporting any current outages.

AWS East connectivity issues, 8:21pm:


This is the motivation I needed to spread my EC2 instances across multiple availability zones. When the power comes back.

(fandalism is down)

Don't assume that'll save you.
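Spreading instances across availability zones, as suggested above, limits how much capacity a single-AZ power loss can take down. Here is a minimal round-robin sketch. The zone names are illustrative; AWS maps AZ letters to physical zones per account, so you should use the zones your own account reports. The helper names are hypothetical and this does not call any AWS API.

```python
from collections import Counter
from itertools import cycle

# Illustrative zone names only; list the zones your account actually has.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]


def spread_across_zones(instance_ids, zones=ZONES):
    """Assign instance IDs to zones round-robin; zone counts differ by at most one."""
    return dict(zip(instance_ids, cycle(zones)))


def worst_case_loss(assignment):
    """Fraction of instances lost if the single busiest zone loses power."""
    counts = Counter(assignment.values())
    return max(counts.values()) / len(assignment)
```

With six instances over three zones, losing one zone costs a third of capacity instead of all of it. As the reply warns, this doesn't help when the trouble is region-wide, such as API errors or the ELB problems listed above.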

Just started trying to use filepicker.io, and it seems to be down too.

Beginning to feel pretty lucky though -- this is at least the 4th AWS-East outage that has made enough of a splash to notice but missed my instances. Upgrading to multiple availability zones was scheduled for Monday anyway.

A simple solution to this is to have a backup or failover to a non-AWS datacenter too (e.g. MS Azure/Google/Rackspace); basically, don't depend on just one datacenter. This not only spreads your risk but keeps your customers happy.

There's nothing simple about that. :)

Yea, but it should be done by providers like Heroku who can talk to multiple datacenters :)
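The request-routing half of a failover between providers can be sketched in a few lines. This sketch assumes two hypothetical deployments (the `example.com`/`example.net` hosts are placeholders) and tries them in order. It uses only the standard library's `urllib.request.urlopen`. As the reply points out, it is not simple: this sketch does nothing about the hard part, which is keeping the standby's data replicated and consistent. Real setups usually fail over at the DNS or load-balancer level rather than in each client.

```python
import urllib.request
from typing import Callable, Sequence

# Hypothetical endpoints: a primary deployment plus a standby at another provider.
ENDPOINTS = [
    "https://app.primary.example.com/health",
    "https://app.standby.example.net/health",
]


def http_fetch(url: str, timeout: float = 3.0) -> bytes:
    """Fetch a URL with the standard library; raises OSError on network/HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_failover(
    endpoints: Sequence[str],
    fetch: Callable[[str], bytes] = http_fetch,
) -> tuple[str, bytes]:
    """Try each endpoint in order and return (url, body) from the first that answers."""
    errors = []
    for url in endpoints:
        try:
            return url, fetch(url)
        except OSError as exc:  # URLError, HTTPError, and timeouts all subclass OSError
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all endpoints failed:\n" + "\n".join(errors))
```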

Google App Engine's just fine. I don't know how many AZs I'm on and don't want to know :)) The more I see of Amazon's failures, the more I think VMs are just not high enough in the abstraction layer for me...

I have 3 customers -- what are they going to do, those poor souls!!

Comcast's login server looks like it's down too: http://login.comcast.net. Prevents me from logging into HBO Go. No Netflix either. :(

Everything still down for me. Would have expected some redundancy...

Apparently it's AWS East.

Ahh, I was just using CrunchBase and it's now down; must be related.

I just lost a potential hire because of this, was demoing my app to someone and it wasn't working, she thought it was because of the product. Damn you heroku!

Come on, not another power issue. What happened to the generators... and the backup generators that they fixed a few weeks back?

The little red ribbon that you pull to get the AA batteries out is stuck underneath; they are looking for a pen to flick the battery out, but since everyone switched over to Fire tablets, there aren't any pens.

Idea for Heroku: allow customers to host a "my app is down for blah blah reason" page wherever Heroku hosts its status page (Rackspace, I guess?). Who thinks this would be useful? My users see a blank page right now when they go to ZeTrip; I'd rather show them a static page saying: "Our site is down due to Amazon's lack of redundancy."

Cloudflare lets you do this afaik. I'm not sure I'd trust a service to show a proper 'this site is temporarily down' page when something very bad has happened.
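The fallback-page idea above comes down to one decision at a proxy or status host that lives outside the failing region. Here is a minimal sketch, assuming you already know whether the app is up and have a hypothetical `proxy_to_app` callable that forwards the request. It returns HTTP 503 with a `Retry-After` header, which is the standard way to signal a temporary outage to browsers and crawlers. The function names and the 300-second value are illustrative.

```python
from html import escape

MAINTENANCE_HTML = """<!doctype html>
<html><head><meta charset="utf-8"><title>{name} is temporarily down</title></head>
<body><h1>{name} is temporarily unavailable</h1><p>{reason}</p></body></html>"""


def render_fallback(app_name: str, reason: str) -> tuple[int, dict, str]:
    """Build a 503 response that a static status host can serve while the app is down."""
    body = MAINTENANCE_HTML.format(name=escape(app_name), reason=escape(reason))
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Retry-After": "300",  # tells clients/crawlers to try again in 5 minutes
    }
    return 503, headers, body


def respond(app_is_up: bool, proxy_to_app, app_name: str, reason: str):
    """Forward to the real app when it is up; otherwise serve the static page."""
    if app_is_up:
        return proxy_to_app()  # hypothetical forwarder supplied by your proxy
    return render_fallback(app_name, reason)
```

The reply's caveat still applies: this only helps if whatever runs `respond` is not itself hosted in the region that just lost power.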

It's not Heroku that is down, AWS is down.

We use Heroku for our event tools, thankfully we have nothing live this weekend or this would be a disaster.

Loadbalancers are down for me: Getting 'Response contains invalid JSON' upon attempted termination.

"heroku status" command returns : All Systems Go: No known issues at this time.

Cloudflare Always-Up isn't showing on my page - is Cloudflare affected too?

Heroku's uptime for June is going to be.. not so hot.

For shissle.

Had to move off Heroku for my latest app. That amount of downtime would put me out of business.

Too bad. I really like the Heroku platform.

Did the datacenter get flooded or what? This is just a "major" outage.

It's likely related to the power outages across VA

AWS should call them Unavailability Zones.

I had the same thought, but it’s not HN material.

what a winner! And they charge!! FUX.

I can't believe the guys at Heroku are not ready for such situations!

They rely ONLY on instances in Virginia because it's the cheapest, without caring about customers... or thinking of replicating their services in multiple locations for such issues!
