I work on a fairly large Heroku app running Node on ~50-100 web dynos with another 20-50 backends. Here are the problems as I see them:
We get H12s (request timeouts) all the time. Randomly. The only suggestion we get from Heroku is to make the requests process as fast as possible. So we've spent a considerable amount of time going through everything we can possibly do to make all requests respond as fast as possible. I've given up. I see this as a fundamental issue with the routing system. If you are going to use Heroku for a large production deployment, H12s (and your users getting dropped connections) will become a fact of life.
There is no autoscaling. We have no idea how many dynos we actually need. So we overdo it in order to handle peak traffic times. This must be a great money maker for Heroku. There is no incentive for them to build autoscaling into their system because that would mean they wouldn't make as much money. Yes, autoscaling is a hard problem to solve, but there should at least be a plan to start on it, and there is none that I have found.
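For anyone trying to size this by hand, a back-of-the-envelope starting point is Little's law: average requests in flight = arrival rate x average response time. A minimal sketch, assuming made-up traffic numbers, a hypothetical per-dyno concurrency, and a target utilization well below 100% for headroom. None of these figures come from our app or from Heroku, and the function name is invented for illustration:

```python
import math

def estimate_dynos(requests_per_sec, avg_response_sec,
                   concurrency_per_dyno=1, target_utilization=0.5):
    """Rough dyno count from Little's law (in-flight = rate * time).

    All parameters are assumptions the caller has to measure or guess.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if concurrency_per_dyno < 1:
        raise ValueError("concurrency_per_dyno must be at least 1")
    # Average number of requests being handled at any instant.
    in_flight = requests_per_sec * avg_response_sec
    # Requests one dyno should hold on average if we want headroom.
    capacity_per_dyno = concurrency_per_dyno * target_utilization
    return max(1, math.ceil(in_flight / capacity_per_dyno))

# Hypothetical peak: 200 req/s at 250 ms each, 4 concurrent requests
# per dyno, aiming for 50% utilization.
print(estimate_dynos(200, 0.25, concurrency_per_dyno=4, target_utilization=0.5))
```

This only gives an average-case number. It says nothing about how unevenly requests are spread across dynos, which is exactly where the H12s come from, so treat it as a floor rather than an answer.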
Up until someone bitched loudly, nothing was happening to fix any of this. We have an expensive paid support contract with Heroku, and before this whole routing issue blew up in public, their only recommendation was to tune the app more and buy into NewRelic for ~$8k/month. We did both and found that NewRelic gave us no relevant information. We did a NodeTime trial for ~$49/mo and that actually helped a lot in identifying slow spots in our app. We fixed all of them and still see an endless stream of H12s. Regardless, it shouldn't take a public bitch slapping for a company to listen to its customers.
You log into a dyno and see a load average of 30+. Who knows if that number is accurate or how big the underlying box really is, but regardless, I can't imagine that number being good. Am I getting H12s because I'm on an overloaded box, or is it because the routing system is fundamentally broken? I don't know and nobody can tell me. This is not a good position to be in.
I have heard from several sources that Heroku isn't happy being on AWS and has been wanting to migrate off AWS for a while now. If your hosting provider isn't happy on their own hosting provider, there must be a reason for that, and in the bigger picture you, the customer, are getting screwed.
Given these things, I will never recommend that a company use Heroku. It is great if you know you are never going to have more than one dyno, but if you think you are going to go into a large production system with it, it is far better to find something else. Which brings me to another rant... how come none of these other PaaS solutions are as easy as Heroku? The git deploy is seriously the one thing they got mostly right. I'd love to see someone build a layer on top of all the PaaS solutions so that I can just deploy my code to any one of them (or even multiple).
We're aware of the random H12 problem. Some apps are affected pretty badly, others not at all, and we're not sure why yet. Sorry that you've had such a bad experience with this. We're continuing to investigate. If we're not able to find a solution in a timely fashion, I'll completely understand if you no longer want to use our product for this app.
Knowing how many dynos you need is definitely a problem. We have implemented autoscaling in the past... but it always sucked. It's hard to find a one-size-fits-all solution. Rather than ship something sub-par, we chose not to ship anything at all.
I understand a lot of people do well with autoscaling libraries and 3rd party add-ons. Would be curious to hear your experience with any of these.
I completely agree that it shouldn't take complaining in public to get a company to listen to its customers. That was our biggest mistake in all of this, IMO: not listening.
I gladly accept your compliment that our git deploy remains the best on the market. :)
I'm sorry we haven't been able to serve you better. Let me know if you'd be willing to talk via Skype sometime. Even if you end up leaving the platform (or already have), I'd like to understand in more depth where we went wrong so we can do better in the future.
Your response only reinforces my hard-won opinion that Heroku should never be used for a production environment by any business that is trying to be successful and popular. Admitting that you have no idea why critical areas of your infrastructure are causing issues, while at the same time charging people an arm and a leg for services (we pay ~$4k/mo), feels like theft to me. I've built solutions for a large porn company that run on significantly less infrastructure than what we are running on Heroku and handle 100x more traffic. Something is wrong here with the dyno/router model, and maybe it is that you guys are just oversubscribed and not admitting to it in public.
Yes, autoscaling is hard. I have apps on Google AppEngine and see their issues as well. That said, at least they are trying. Maybe even take one of those 3rd party libraries and try to harden and adapt it into a real solution? I think the real problem, though, lies in the fact that there aren't any good metrics for what dynos are doing, so there is no way to tell when something is too busy. Yes, log-runtime-metrics puts out some numbers, but those numbers are meaningless when all I have is a slider to change the amount of money we are paying you.
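For readers who haven't looked at those numbers, here is a rough sketch of pulling them out of a log line and turning them into a crude "too busy" flag. It assumes the lines carry `sample#name=value` pairs such as `sample#load_avg_1m=2.46`; the sample line, the helper names, and the 4.0 threshold are all made up for illustration and are not anything Heroku recommends:

```python
import re

# Assumed shape of a metric pair: sample#<name>=<number><optional unit>
SAMPLE_RE = re.compile(r"sample#(\w+)=([0-9]+(?:\.[0-9]+)?)")

def parse_samples(line):
    """Return {metric_name: float_value} for every sample# pair in a line."""
    return {name: float(value) for name, value in SAMPLE_RE.findall(line)}

def looks_overloaded(samples, threshold=4.0):
    """Crude signal: 1-minute load average above a hypothetical threshold."""
    return samples.get("load_avg_1m", 0.0) > threshold

# Hypothetical log line, not copied from a real dyno.
line = ("source=web.1 dyno=heroku.123.abc "
        "sample#load_avg_1m=2.46 sample#load_avg_5m=1.06 sample#load_avg_15m=0.99")
samples = parse_samples(line)
print(samples, looks_overloaded(samples))
```

Even with a flag like this, the point above stands: a load average from a shared box doesn't say whether adding dynos would help, so this is a way to watch the numbers, not a scaling policy.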
I should qualify that git deploy compliment because there are issues with that as well. For example, why do you have to rebuild the npm modules from scratch each time? Why not have a directory full of pre-built modules for your dynos that are just copied into my slug? This relatively simple change would speed up deployments greatly. Never mind that deployments aren't reliable and fail randomly. At least it is easy to just try again.
With that kind of money invested in hosting and scaling, why not get a dedicated professional to handle devops on your team, and go with more traditional hosting solutions? I'm interested in hearing why people still use Heroku at the scale you're describing.
That's a really good question. Heroku is offering to handle the hosting, devops and scaling issues for us so that we can focus our energy on building a killer product. When considering the costs of hiring a devops team and someone to wear a pager 24/7 when servers go down, paying the premium for using Heroku becomes a lot more attractive.
'why not get a dedicated professional to handle devops on your team'
It sure is easy to write that, but the reality isn't as rosy. I've gone through the process of trying to do that at two companies, interviewing ~50-80 people, and it has been a nightmare. It is really, really difficult to find quality devops people. Again, this makes platforms like AWS, Heroku and AppEngine a lot more attractive. They are betting their entire business on being able to hire good devops people, so they tend to attract better talent.
Elastic Beanstalk just recently added support for Node. It's definitely on the list as the next PaaS to try out. AppFog is another one that we've done a JVM deployment to and like a lot, except that it feels very alpha quality. Their website is painfully slow and under-documented, and the 'af' command isn't nearly as cool as just doing git push heroku master.
I should also add that one thing that Heroku did get 100% correct is the heroku logs -t command (aka: tailing the logs). Nobody else does that one quite as well.