[dead]
on Feb 25, 2012


I lose money when Heroku is down, and let me say this: Heroku is a f$(king miracle, and you'll pry our apps' cold, dead fingers off of their stack.

They screwed up. They'll fix it. Any ops team and hosting solution I could hire for a comparable amount of money wouldn't even come close.


Just out of interest, what are people doing as a backup to Heroku, and is failover automatic or do you have to make DNS changes?


DNS changes do not provide high availability at all. Even with a TTL of 60 seconds, if a cache decides to override your TTL, you're screwed.

Real high availability comes with a price: a redundant datacenter with BGP failover.


While BGP is certainly better, the vast majority of DNS servers honor TTL nicely these days. In the old days, there used to be many problems, but I have been shocked at how well DNS failover has worked in the past few years.
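To make the TTL point concrete, here is a minimal Python sketch of the worst case being argued about. It is a toy model, not how any particular resolver is implemented; the function names and the `cache_min_ttl` parameter are assumptions made up for illustration.

```python
def effective_ttl(record_ttl: int, cache_min_ttl: int = 0) -> int:
    """Seconds a cache keeps a record before asking again.

    A well-behaved cache uses the TTL published with the record.
    A cache that overrides low TTLs raises it to its own floor
    (cache_min_ttl, a hypothetical setting for this sketch).
    """
    return max(record_ttl, cache_min_ttl)


def still_serving_old_address(seconds_since_cached: int,
                              record_ttl: int,
                              cache_min_ttl: int = 0) -> bool:
    """True if clients behind this cache still get the pre-failover address."""
    return seconds_since_cached < effective_ttl(record_ttl, cache_min_ttl)


# Two minutes after the record was cached, with a 60 second TTL:
honoring = still_serving_old_address(120, record_ttl=60)
overriding = still_serving_old_address(120, record_ttl=60, cache_min_ttl=3600)
print(honoring, overriding)
```

With a cache that honors the TTL, the old address is gone after a minute; with one that enforces a one hour floor, users behind it keep hitting the dead address no matter what you publish.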


How does that work? If you have a link explaining it a bit that would be great! (a Google search shows me all sorts of solutions for ISPs, but not for application hosting).


Got to love their last two updates. Spot the difference :).

UPDATE: Process management error rates remain high. Engineers are continuing to investigate and work towards a resolution. FEB 25, 2012 – 20:27 UTC – 10 MINUTES AGO

UPDATE: Process management error rates remain high. Engineers are continuing to investigate and resolve the issue. FEB 25, 2012 – 20:11 UTC – 26 MINUTES AGO


Sounds normal. They have nothing to add, but they have to say something.


I believe what he meant was that the first update implied that they had identified the problem and were fixing it ("continuing to investigate and resolve the issue.") - the second implied they were still looking for the problem ("continuing to investigate and work towards a resolution.").

Subtle but important difference.


Having once been one of the engineers in charge of fixing problems very similar to these... don't read too much into the subtleties of the phrasing of these customer notices. ;)

They all translate to: "The engineers' hair is still on fire and they continue to have no time to tell me exactly why."


Yet another reminder that while services like Heroku are great, it's up to you to make sure you have the appropriate redundancy in place so your critical app doesn't go down. If you host anything critical on Heroku, you _need_ a backup server elsewhere (not EC2!)


But doesn't that defeat the whole point of Heroku? If your app is running redundantly elsewhere, that means that you've already done all the work to set up the stuff that Heroku normally provides. And if you've already done that work, why use Heroku at all?


This is true. Heroku needs to be redundant. This shouldn't happen on Heroku.


I agree, but we'll need to know more about what caused this. Hopefully they'll be a little more transparent than usual given the severity of this failure. In the meantime, I have a new startup idea: Heroku failover cloud hosting.


I am probably biased, since I wrote many of them, but I like to think Heroku has a pretty good reputation for writing meaningful public post-mortems when there are large scale outages. It's safe to give them the benefit of the doubt that they'll continue.


Isn't Heroku based on EC2? They should be able to offer some type of redundant package for customers willing to pay for it. I mean, let's be serious, people WILL pay for it, assuming it works.


Having this problem too. The thing that I'm worried about is that my site just keeps trying to load without any error message or 404 page appearing. Is there a way to get something to show up to inform users of the downtime?

(Or ideally, a way to point towards another instance of the site quickly. I'm worried the DNS wouldn't propagate fast enough)
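One way to approach this, assuming you control something in front of the app (a small proxy or a script that updates DNS), is a health check with a short timeout that falls back to a static "we're down" page. The sketch below is only an illustration under that assumption; `pick_origin`, `http_probe`, and the example.com URLs are hypothetical names, not anything Heroku provides.

```python
import urllib.request
from typing import Callable, Sequence


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False


def pick_origin(origins: Sequence[str],
                is_healthy: Callable[[str], bool],
                fallback: str) -> str:
    """Return the first healthy origin, or the fallback if none respond."""
    for origin in origins:
        try:
            if is_healthy(origin):
                return origin
        except Exception:
            # A probe that blows up counts as unhealthy.
            continue
    return fallback


# Hypothetical addresses, for illustration only.
ORIGINS = ["https://app.example.com/", "https://backup.example.com/"]
DOWN_PAGE = "https://static.example.com/down.html"

if __name__ == "__main__":
    print(pick_origin(ORIGINS, http_probe, DOWN_PAGE))
```

The short timeout is the important part: without it the check hangs just like the site does, and users never see the fallback page.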


Mine says:

Application Error An error occurred in the application and your page could not be served. Please try again in a few moments.

If you are the application owner, check your logs for details.


Mine does too...eventually. It looks like it is taking ~15 seconds to bring that page up.


When Heroku goes down, 1M+ apps go down.


Now's the time to launch!


Heroku should know better than to host their status page on their own servers, since it becomes inaccessible when most needed. This problem started last night and they detected it this morning: monitoring system FAIL.


It appears to be hosted on Rackspace (right now at least). But it doesn't look automatic, so perhaps the people were not available to update it until this morning.


The Heroku status site is indeed hosted outside of both the Heroku platform and EC2. It has been that way for at least a couple of years that I have direct knowledge of, and probably longer.


The page wasn't available earlier, so something from their stack is probably shared. Maybe the problem is in how they route the traffic to this status page. Bottom line, this page should stay up, and the only way to almost guarantee that is to make sure nothing is shared between Heroku's infrastructure and where/how the status page is hosted/served.


It's entirely possible that the status site had an unrelated issue. It happens. That said, I ran their stack until 2 weeks ago, so I can speak authoritatively on this: Nothing from their stack is shared with the status site. Period.


If that's the case (I trust your authority in this), then maybe the status page went down because of the spike in traffic.



That comment is made from sour grapes.

If some bits of Heroku seem to be up, but not others, it could be because those bits are running older versions of the infrastructure. Or are being used as test beds for newer versions of the infrastructure. Or are deliberately running on a separate platform so that, when the main Heroku infrastructure starts having problems, other bits of Heroku's domain are still around to dispense advice on how to work around those problems.

(The most extreme example is a company's "status" subdomain, which ideally should be hosted on a completely different server, in a completely different datacenter, on a different continent located on a distant planet with different DNS regulations.)


The devcenter site is absolutely hosted on the Heroku platform.

I don't have first-hand knowledge of this specific incident, but I do have a very deep understanding of how the Heroku platform as a whole works. It is incredibly likely for many service disruptions to affect only a subset of applications, and not the platform as a whole. The system is, in fact, specifically designed to isolate failures of individual components to as small a failure domain as possible. That doesn't always work, but implying that there's something nefarious going on or that Heroku is not eating their own dogfood with the devcenter site is dead wrong.


It is. Heroku.com is down.


  % host devcenter.heroku.com
  devcenter.heroku.com is an alias for iwate-88.herokussl.com.
  iwate-88.herokussl.com is an alias for elb002001-124710749.us-east-1.elb.amazonaws.com.
  elb002001-124710749.us-east-1.elb.amazonaws.com has address 107.20.144.56
  elb002001-124710749.us-east-1.elb.amazonaws.com has address 23.21.215.77
  elb002001-124710749.us-east-1.elb.amazonaws.com has address 107.21.227.222

It's 100% on Heroku, it's just also using the ssl:hostname add-on (https://addons.heroku.com/ssl) which requires different DNS settings (http://devcenter.heroku.com/articles/ssl#hostname_based_ssl).
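If you want to walk that alias chain mechanically rather than by eye, here is a small Python sketch that parses the `host` output pasted above. The parsing is an assumption based only on the line shapes shown in this thread ("is an alias for" and "has address"); the function names are made up for illustration.

```python
import re

# The output quoted in this thread, copied verbatim.
HOST_OUTPUT = """\
devcenter.heroku.com is an alias for iwate-88.herokussl.com.
iwate-88.herokussl.com is an alias for elb002001-124710749.us-east-1.elb.amazonaws.com.
elb002001-124710749.us-east-1.elb.amazonaws.com has address 107.20.144.56
elb002001-124710749.us-east-1.elb.amazonaws.com has address 23.21.215.77
elb002001-124710749.us-east-1.elb.amazonaws.com has address 107.21.227.222
"""

ALIAS = re.compile(r"^(\S+) is an alias for (\S+)$")
ADDRESS = re.compile(r"^(\S+) has address (\S+)$")


def parse_host_output(text):
    """Split `host` output into {name: alias target} and {name: [addresses]}."""
    aliases, addresses = {}, {}
    for line in text.splitlines():
        line = line.strip()
        alias = ALIAS.match(line)
        address = ADDRESS.match(line)
        if alias:
            # Drop the trailing dot that marks a fully qualified name.
            aliases[alias.group(1)] = alias.group(2).rstrip(".")
        elif address:
            addresses.setdefault(address.group(1), []).append(address.group(2))
    return aliases, addresses


def follow_chain(name, aliases):
    """Follow aliases from name until a name with no alias (or a loop)."""
    chain = [name]
    while chain[-1] in aliases and aliases[chain[-1]] not in chain:
        chain.append(aliases[chain[-1]])
    return chain


aliases, addresses = parse_host_output(HOST_OUTPUT)
chain = follow_chain("devcenter.heroku.com", aliases)
print(" -> ".join(chain))
print(addresses[chain[-1]])
```

The chain ends at an ELB hostname in us-east-1, which matches the point being made: the intermediate herokussl.com name is just the SSL add-on's extra hop.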


Our app is back up.



