Tell HN: Heroku is Down (update: recovering as of 10PM PST)
100 points by timr on June 15, 2012 | 110 comments

The AWS status page[1] is showing problems for EC2 East as of a few minutes ago. This <strike>might be</strike> is a more widespread issue.

EDIT: Various non-Heroku EC2-East-based sites (e.g. Quora) seem to be down as well, lending more evidence to this being an EC2/EBS outage.

1: http://status.aws.amazon.com/

Half the startup world is offline right now, yet the AWS status page is all greens with one "info" notice about EBS (which I doubt is the issue here). I'm glad to have moved off a fully AWS stack back to my own servers over a year ago. My uptime, bills, and stress level are all much improved.

Sounds like half of the startup world hasn't spent the time or the money to design their service to be redundant on top of AWS.

Fortunately for most, they don't have customers yet.

That made me laugh. Am I a bad person?

The sales pitch is kind of that it's already redundant. You know, RAID and load balancers and all.

That's hardly surprising. The groupthink on cloud services is that they magically make everything scalable, cheap and easy. Making redundant software is hard work.

Definitely an EBS issue. Our (EBS backed, Heroku hosted) DB was returning queries more and more slowly leading up to the outage.

They posted an update on amazon: We continue to investigate this issue. We can confirm that there is both impact to volumes and instances in a single AZ in US-EAST-1 Region. We are also experiencing increased error rates and latencies on the EC2 APIs in the US-EAST-1 Region.

Also, Amazon Relational Database Service (N. Virginia) is unavailable.

Seems like it's snowballing.

RDS actually uses EBS so it is most likely the same underlying issue.

Out of curiosity do we know if http://status.aws.amazon.com/ is hosted on AWS?

It's a key design constraint of status.aws.amazon.com that it not depend on any AWS services. (Or so I've heard.)

status.aws.amazon.com resolves to for me.

That IP isn't in the EC2 IP ranges (https://forums.aws.amazon.com/ann.jspa?annID=1528), so at the very least it's not hosted on EC2.
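
For anyone who wants to repeat that check, here is a minimal sketch of testing an address against a list of CIDR blocks with Python's standard `ipaddress` module. The ranges below are reserved documentation blocks used as stand-ins, not the real EC2 ranges from the linked announcement, and `in_any_range` is a hypothetical helper name.

```python
import ipaddress

# Hypothetical CIDR blocks standing in for the published EC2 ranges.
# These are reserved documentation ranges, NOT real AWS ranges.
EXAMPLE_RANGES = ["198.51.100.0/24", "203.0.113.0/24"]


def in_any_range(ip, cidr_blocks):
    """Return True if ip falls inside at least one of the CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(block) for block in cidr_blocks)


if __name__ == "__main__":
    for candidate in ("203.0.113.42", "192.0.2.10"):
        print(candidate, in_any_range(candidate, EXAMPLE_RANGES))
```

An address outside every listed block only shows it is not in those ranges; it says nothing about where the host actually lives.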

I thought I read somewhere that it was on Rackspace

Dear Heroku -- I know it's my job to make sure my site is available (/thread). However, I think I speak for most enterprise customers when I say I will throw money at your company the second you come up with a multi-zone/highly-available offering.

I spoke to one of the guys from Heroku recently and he said they are working on it but he couldn't give me a date.

It couldn't come soon enough for us, but also for Heroku, as AppFog seems to have its foundations built on a multi-zone/region/provider architecture.

Everyone should take a page from the book of Netflix right now. It's pretty embarrassing to be anyone that's entirely down and can't do a thing about it due to an EC2 outage.

How do you explain to your customers/users/etc that you were down and have absolutely no control of when you will be back online? How can you explain it to yourself?

"We're sorry about the current downtime. We know some of you are frustrated, so we thought we'd take a moment to explain why this happened.

"Running a web server is very expensive. After we've built the site, if we want to keep it running, someone needs to be on-call 24 hours a day. That means at least one full-time staff member who does nothing else-- more if we want them to stay sane.

"To save us and you some money, we've contracted maintenance of our servers out to a third-party service. This is great for us, since they run it more reliably than we could, and it's great for you, because it costs a lot less. But the downside is that things still break sometimes, and when they do, it's completely out of our hands. We're left waiting for things to get fixed just like you are.

"So we understand your frustration; we're frustrated too. But unfortunately, downtimes do happen. Guaranteeing our service 100% of the time would cost hundreds of thousands of extra dollars per year, and for most of our users, that's simply not worth the cost. Our provider guarantees 99.[nines]% reliability for much less money, and this is the 0.01%.

"If you have something that absolutely must get done, shoot us an email right now and we'll take care of it for you as soon as the site is active again.

"Although this is technically out of our hands, we aren't trying to shift the blame; we made this decision with open eyes, and we stand by that decision. Again, we sincerely apologize for the temporary inconvenience. We hope we can make it up to you with some new features we'll be rolling out this month :)"

It doesn't have to be that hard.

Yes, it's actually easier to apologize to your customers when you have AWS to blame for the outage. After all, you're in good company ("even Heroku is down - what do you want from us?")

You're lucky if your average customer knows what AWS is, let alone Heroku.

I'm sure your average customer has heard of Amazon :)

What are you telling me? Our servers are hosted by a book store? You must be joking!

Because this shit is really hard especially when you're trying to build a product at the same time.

It's not like you can just wake up one day and say "I'm gonna go build a fully fault tolerant distributed system that works across multiple data centers!" and then you're done by the time you go to sleep.

Go actually talk to some Netflix engineers. They'll tell you the same thing.

Yes, you're absolutely right. However, Netflix is distributed across multiple AZ, while Heroku has spent the last two years after their $212MM acquisition in the same AZ.

That makes it sound like Netflix has a more reliable platform than the PaaS company.

That's my point exactly. Everyone relies entirely (almost) on EC2 for mission critical business, and then they're left there with no outs as soon as a big outage occurs.

this. stop talking about how great an MVP is and then complaining when people haven't built multi-region distributed services that are fault tolerant to major platform outages.

Netflix's "Lessons Learned" from the April 2011 AWS outage: http://techblog.netflix.com/2011/04/lessons-netflix-learned-...

Though they're not 100% degradation proof today either: http://i.imgur.com/MJfqj.png

Maybe you tell them that they won't be able to trade cat pictures for some time. And that the outage happened because you didn't overinvest in infrastructure so that you could keep prices down and roll out some new features faster.

Even if you aren't a cat picture site, many startups can deal with some downtime occasionally and it's the right tradeoff to make.

Their uptime percentage is 99.97%, but I'm having a hard time fighting the recency effect that is telling me to get off the platform ASAP.

That was their uptime for May, but looking at June it's going to be a worse picture. They're already down to 99.63% if you only include "red" incidents and down to 99.25% if you include "amber" incidents as well (as of 39m of downtime for this latest incident and assuming they don't have any more downtime for the month).
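
A rough way to sanity-check those figures is to convert between downtime minutes and an uptime percentage. This is only a sketch: it assumes the status page measures June as 30 full days, and the function names are made up for illustration. The comment does not list the individual incidents, so nothing here reconstructs them.

```python
MINUTES_IN_JUNE = 30 * 24 * 60  # 43,200 minutes, assuming the whole month counts


def uptime_percent(down_minutes, period_minutes=MINUTES_IN_JUNE):
    """Uptime percentage for a period with the given total downtime."""
    return 100.0 * (1 - down_minutes / period_minutes)


def implied_downtime_minutes(uptime_pct, period_minutes=MINUTES_IN_JUNE):
    """Total downtime, in minutes, implied by an uptime percentage."""
    return period_minutes * (1 - uptime_pct / 100.0)


if __name__ == "__main__":
    # The two percentages quoted in the comment above.
    for pct in (99.63, 99.25):
        print(f"{pct}% -> {implied_downtime_minutes(pct):.0f} minutes down")
    # The 39 minutes of the latest incident, taken on its own.
    print(f"39 minutes -> {uptime_percent(39):.2f}% uptime")
```

Under that 30-day assumption, 99.63% works out to roughly 160 minutes of downtime for the month and 99.25% to 324 minutes.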

That's 99.97 for the last month.

Which is ridiculous, 9s for most services use years as standard. Of course if heroku did that they wouldn't look so good.

if you have a year of 99.97 months, don't you have a 99.97 year? how does having a shorter time make the numbers look better?

[edit: i realise they could be hiding worse months, but i don't think that's what the post i am replying to meant. perhaps i am reading it wrong.]

They could have done it to the extreme -- show the numbers for the past hour. Then they could almost always report 100% uptime. And if they ever went down, wait an hour, then go back to reporting 100% uptime again.

They could have terrible uptime 2 months ago and you wouldn't see it in their status page, because it only showed last month's.

Parent post was saying that, without taking into account the entire year, you can have one month that is terrible and the rest quite good.

The measurement in question is supposed to be about consistency.

Depends on the month...

31 versus 28 days.

It doesn't matter whether gauged by the month or year. It's a percentage. If they have 99.97% uptime every month for a year, they'll still have 99.97% uptime for the year.

Or if they have 100% for six months and then 99.97% for one month the long-term average would be better.
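
To make the arithmetic in this subthread concrete: the uptime for a long period is the time-weighted average of the uptimes of its parts, which is also why month length (31 versus 28 days) matters slightly. A minimal sketch, with a hypothetical helper name and illustrative numbers rather than Heroku's real figures:

```python
def overall_uptime(periods):
    """Time-weighted uptime percentage.

    periods is a list of (days, uptime_percent) pairs, one per month.
    """
    total_days = sum(days for days, _ in periods)
    return sum(days * pct for days, pct in periods) / total_days


if __name__ == "__main__":
    # Twelve identical months: the yearly figure equals the monthly one.
    steady = [(30, 99.97)] * 12
    print(f"steady year:    {overall_uptime(steady):.4f}%")

    # Eleven perfect months and one bad one (illustrative values only).
    one_bad_month = [(30, 100.0)] * 11 + [(30, 99.0)]
    print(f"one bad month:  {overall_uptime(one_bad_month):.4f}%")
```

A single monthly number can therefore sit well above or below the yearly figure, which is the "hiding worse months" concern raised above.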

I just got up (it's morning in Israel), and a client of mine in the US with a major, mission-critical application was screaming (rightly so) that things are down.

We're already looking into alternatives -- perhaps not leaving Heroku altogether, but certainly not depending on them 100 percent. There's no way that we can entrust the business to something that can just catastrophically fail at any moment. I've been running my own servers for years, and they've never had such unpredictable issues.

I increasingly have to think that a few servers, on different providers, with the application deployed via Capistrano, will be more fault-tolerant than Heroku. At least, it seems that way right now.

> There's no way that we can entrust the business to something that can just catastrophically fail at any moment.

Anything, including service providers, can catastrophically fail at any moment. Fault-tolerant architectures are based on redundancy (including infrastructure provider redundancy, as you mention), not on "guaranteed" SLAs.
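
As a very rough illustration of redundancy rather than reliance on an SLA, here is a sketch that picks the first backend passing a health check. The backend names and the health-check function are hypothetical; a real setup would also need data replication, DNS or load-balancer changes, and much more than this.

```python
from typing import Callable, Iterable, Optional


def first_healthy(backends: Iterable[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first backend whose health check passes, or None."""
    for backend in backends:
        try:
            if is_healthy(backend):
                return backend
        except Exception:
            # A health check that raises is treated the same as a failed one.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical deployments on two different providers.
    backends = ["app.provider-a.example", "app.provider-b.example"]
    currently_down = {"app.provider-a.example"}  # simulated outage
    print(first_healthy(backends, lambda b: b not in currently_down))
```

The health check is passed in as a function so the selection logic can be exercised without any network access.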

Provider redundancy goes against the concept of PaaS IMO (ignoring the sci-fi future where there are multiple 100% compatible providers). Heroku needs to become internally redundant to really live up to its promise.

Don't overreact... Heroku is using AWS, which is down for many other sites right now besides Heroku. This is an EBS issue.

And my guess is his client doesn't give two excrements what the reason is. It's boolean to them; it's up or it's down.

I know AWS EC2 does support multiple global regions and within each region multiple availability zones. You're supposed to host your application across more than one AZ if you want good fault tolerance. Not sure if Heroku uses this though.

I've had so much more downtime with heroku/AWS than I ever had back when I was running my sites on slicehost.

I also feel like I've let my admin skills deteriorate because I've been dependent on heroku. Back when I was running everything myself, worst case scenario I could set up a new VPS from a backup in another datacenter. Now if heroku goes out I just have to twiddle my thumbs while I wait for updates.

...or you can keep on working knowing that your services will be magically back online without you lifting a finger.

except for any potential data loss that just happened

I understand what you are saying. I host most of my projects on Rackspace. I've been intending to convert current ideas to Heroku because learning SysAdmin stuff doesn't seem to have much upside potential. However, these multiple outages while Rackspace (seemingly) keeps chugging along are discouraging. That being said, every Heroku outage makes front page of HN, and there could have been Rackspace downtimes much more often that I didn't notice.

Eh, I'll probably continue converting to Heroku and not look back.

I haven't seen any RS-wide outages in my datacenter (IAD) for many years (I vaguely remember some networking issue, but that was something like 5 years back).

So they have their infrastructure (network, power) working very well.

The downtime we've had were our own unique hardware/software issues that come with a complex bare metal installation.

Here's the Heroku status site: https://status.heroku.com/

Seems like someone posted a link to a project that made self-hosting a previously Heroku hosted site simple, but I can't find it now...

...would be cool if there was a Linux package (or distro) that you could boot-up and then just change your git remote to and have your app up-and-running on your own hardware.

This was in the recent Heroku down thread. You might be interested in Stackato (http://www.activestate.com/stackato). It is based on Cloud Foundry, with numerous enhancements, including support for Heroku buildpacks (http://docs.stackato.com/languages/buildpack.html). Heroku-in-a-box - give it a try.

BTW, while the point is to enable private paas, you won't get around the issues that hit sites like this without heeding all the warnings and recommendations about building in redundancy for high availability.

This was noted well in this post: http://www.newvem.com/blog/main/2012/06/aws-cloud-best-pract...

"""It is a lot cheaper to add 1% uptime to a 95% SLA than it is to add 0.09% to a 99.9% SLA. Cloud application vendors (SaaS) need to pay very close attention to the additional resources that are invested in order to support a 99.9XX…% uptime SLA, and perhaps build it into their pricing plans."""

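To put numbers on the SLA quote above, here is a small sketch that converts an SLA percentage into the downtime it permits per year. It assumes a 365-day year, and the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent):
    """Hours of downtime per year that an SLA of sla_percent still permits."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100.0)


if __name__ == "__main__":
    for sla in (95.0, 96.0, 99.9, 99.99):
        print(f"{sla}% -> {allowed_downtime_hours(sla):.2f} hours/year")
```

On this arithmetic a 95% SLA permits 438 hours of downtime a year and a 99.9% SLA about 8.76 hours, so each further fraction of a percent has to be won from a much smaller remaining budget.
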
Yeah I'm just evaluating options, if nothing else it would be excellent to have the equivalent of a "donut spare" on private hardware that we could throw on when events like this occur (although having done plenty of "self-hosted" sites during the first boom, I'm always running the numbers for either option).

I made Dokuen a few weeks ago, maybe that's what you're thinking of? If you're willing to live with a few warts, it's working pretty well for my personal use. My blog is still on Heroku so you can't really read about it now, but you can check out the code.


I saw this a while ago and will actually be using it very shortly to deploy 3-4 internal apps on our own mini cloud. I love Heroku and can't stand all of the open source "alternatives" like Cloud Foundry. Yours is exactly what I wanted and I can't wait to really start using it and contributing via github! Thanks again!

This looks very interesting; thanks for the link!

This is what I was thinking of (I think):


If Heroku moved off of AWS they'd have better uptime and lower prices.

I'm sure that some Heroku customers already depend on the fact they're hosted on EC2 for accessing other instances or AWS services.

That'd depend where they moved to, wouldn't it?

At their size they need to be running and managing their own hardware. I'd use them more if it wasn't hundreds of dollars a month to host a few apps that might not be up when my customers are.

Can we get this title renamed to AWS outage as the problem is not Heroku.

It may be an AWS outage as the cause, but ultimately it's Heroku's problem. They're the ones touting:

  "Erosion-resistant architecture.

   Heroku takes full responsibility for your app's health,
   keeping it up and running through thick and thin..."

The thick happened today, something eroded, and people are holding them to their word: full responsibility for their app's health. Heroku could do load balancing between multiple independent providers rather than be solely dependent on [one region of?] AWS.

Another update from Amazon: "9:55 PM PDT We have identified the issue and are currently working to bring effected instances and volumes in the impacted Availability Zone back online. We continue to see increased API error rates and latencies in the US-East-1 Region." Been thinking that maybe most startups are seeing that cloud computing is the most reliable way to go, but today I'm reconsidering having another type of backup server. Just hope there is no data loss in the apps.

Pocket also tweeted that they are having issues due to Amazon's issues.


Was at a Heroku "crash course" last week where they claimed they learned from their major outage in March/April. Here is the link to the videos from the conference: http://zurichtechtalks.tumblr.com/post/24670375315/heroku-in...

One of mine went down, and I'm seeing folks on Twitter saying the same. Definitely something going on.

*edit: Came up 12:42 AM ET.

A few of my us-east-1d machines are down.

One of mine is as well. The load balancers also seem to be haywire.

Related: Right now when I try to cat /proc/mdstat or use mdadm to look at my RAID status, it just hangs. Seems it's trying to contact the EBS volumes and it's just failing. Any way to actually view my RAID status?

It's probably best not to muck around with the RAID right now when the drives it thinks are there aren't actually there. If it were me, I wouldn't touch anything until Amazon fixes itself.

It's more a theoretical question than anything else. Say this outage had gone on for days: I'd have needed to be able to see which volumes have failed and drop them from the array. How can I do that when I can't view the RAID status? I have these problems even if I purposely detach a volume to test.

And yet we continue to throw everything on AWS...

You know there are OTHER data centers, right?

Other datacenters can go down, too. In many cases, the complexity of running an application across different platforms (say, AWS + Rackspace Cloud) might not be worth it.

cloud.engineyard.com isn’t loading either.

edit: I meant literally the EngineYard website at that address. Some EngineYard websites were up and some were down, no doubt based on region.

I'm hosted on EngineYard and my app is up and running just fine.

are they on us-east-1?

Anyone know if Amazon Fresh is hosted on EC2? I had the worst connectivity and performance issues with their site earlier today...and now all of my sites are down.

So is Parse.com, they just announced it is AWS related.

At this point, the internet may as well be dead to me.

My AWS instances on us-east are unreachable :(

It seems to be the elastic load balancers on AWS, can't blame Heroku this time.

My love hate relationship with Heroku continues...

It's not just ELB.

It looks like Amazon are worse at reporting their outages than Heroku...

During the previous outage (which wasn't AWS related), Heroku's status page was down entirely (among other things, it relied on static assets from heroku.com), so I can't say I agree with that.

Yep, but they saw that flaw and addressed it, and it's OK for the time being. Amazon's has been and still is crappy.

knocked out some other stuff? gothamist/chicagoist/laist/all those other blogs are out.

We have an unreachable instance in us-east-1b but others in that region are reachable

FYI: Your 1b is not the same as other people's 1b: http://aws.amazon.com/ec2/faqs/#How_can_I_make_sure_that_I_a...

Various software used to hardcode 1a, so 1a received disproportionate load. Now, everyone's a-e is randomized among the "true" a-e, meaning that even if everyone hardcodes 1a, the load will still be evenly distributed.
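
A toy model of that idea, for illustration only: give each account its own stable shuffle of the zone labels. This is not AWS's actual algorithm, and the physical zone names, account IDs, and function name below are all invented.

```python
import random

# Hypothetical physical zones; the real identifiers are not public in this thread.
PHYSICAL_ZONES = ["zone-1", "zone-2", "zone-3", "zone-4", "zone-5"]
LABELS = ["a", "b", "c", "d", "e"]


def label_map(account_id):
    """Map the labels a-e to physical zones, shuffled per account.

    Seeding the generator with the account ID keeps the mapping stable
    for that account across calls.
    """
    rng = random.Random(account_id)
    zones = PHYSICAL_ZONES[:]
    rng.shuffle(zones)
    return dict(zip(LABELS, zones))


if __name__ == "__main__":
    for account in ("account-1", "account-2"):
        print(account, label_map(account))
```

With a mapping like this, two accounts that both hardcode "a" are likely, though not guaranteed, to land in different physical zones.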

Keep in mind that not everybody sees the same names for the same availability zones - your us-east-1b might be my us-east-1d

I want a credit on my Heroku account. Paying $71/month for shit like this is stupid.

Have you tried asking for one?

Maybe for the database service, but dynos and workers are paid for by the hour, aren't they?

All services are billed based on time usage (it's not hourly, it's much more granular), including the database.

That said, an outage is still an outage.

Not sure if it's related, but www.pythonanywhere.com is also down.

Doesn't look like it, their IP is owned by a German company.

google searches for "migrating off heroku tutorial" just spiked.

I'm getting a "Request limit exceeded" error in my EC2 panel.

parse.com also down

I confirm this. Was working on an app prototype and when I reloaded the page it couldn't find it.

quora.com is down

This is why http://AppFog.com/ is investing in multiple IaaS and is not being hit nearly as hard.

42floors.com is down.


...the sequel.


You can serve your own error pages instead of Heroku's and show your boss whatever you want. They're specified as URLs, so they can be hosted on some other platform. I don't know if that feature works when AWS is failing Heroku, but if Heroku's up enough to serve the error page, maybe it's up enough to serve the error page you configured instead of its own.


Solid, thanks for the heads up!

Isn't it your fault you didn't build a fault tolerant application? The first rule of building services is assume everything is broken.

Honestly, I haven't the resources to guarantee site uptime, and have accepted this will happen as a result; my complaint was more targeted at the default error messages.

It seems this can be changed, though, fortunately for Heroku.

I guess my real point is your customer, 99% of the time, doesn't give a shit WHY your site is down. It just is.
