We get H12's all the time. Randomly. The only suggestion we get from Heroku is to make the requests process as fast as possible, so we've spent a considerable amount of time going through everything we can possibly do to make all requests respond as fast as possible. I've given up. I see this as a fundamental issue with the routing system. If you are going to use Heroku for a large production deployment, H12's (and your users getting dropped connections) will become a fact of life.
There is no autoscaling. We have no idea how many dynos we actually need, so we overdo it in order to handle peak traffic times. This must be a great money maker for Heroku. There is no incentive for them to build autoscaling into their system because that would mean they wouldn't make as much money. Yes, autoscaling is a hard problem to solve, but there should at least be a plan to start on it, and there is none that I have found.
Up until someone bitched loudly, nothing was happening to fix any of this. We have an expensive paid support contract with Heroku, and before this whole routing issue blew up in public, their only recommendation was to tune the app more and buy into NewRelic for ~$8k/month. We did both and found that NewRelic didn't give us any relevant information. We did a NodeTime trial for ~$49/mo and that actually helped a lot in identifying slow spots in our app. We fixed all the slow spots and still see an endless stream of H12's. Regardless, it shouldn't take a public bitch slapping for a company to listen to its customers.
You log into a dyno and see a load average of 30+. Who knows if that number is accurate or how big the underlying box really is, but regardless, I can't imagine that number being good. Am I getting H12's because I'm on an overloaded box or is it because the routing system is fundamentally broken? I don't know and nobody can tell me. This is not a good position to be in.
I have heard from several sources that Heroku isn't happy being on AWS and has wanted to migrate off it for a while now. If your hosting provider isn't happy with their hosting provider, there must be a reason for that, and in the bigger picture you, the customer, are getting screwed.
Given these things, I will never recommend that a company use Heroku. It is great if you know you are never going to have more than one dyno, but if you think you are going to build a large production system on it, you are far better off finding something else. Which brings me to another rant... how come none of these other PaaS solutions are as easy as Heroku? The git deploy is seriously the one thing they got mostly right. I'd love to see someone build a layer on top of all the PaaS solutions so that I can just deploy my code to any one of them (or even multiple).
Knowing how many dynos you need is definitely a problem. We have implemented autoscaling in the past... but it always sucked. It's hard to find a one-size-fits-all solution. Rather than ship something sub-par, we chose not to ship anything at all.
I understand a lot of people do well with autoscaling libraries and 3rd-party add-ons. I'd be curious to hear your experience with any of these.
I completely agree that it shouldn't take complaining in public to get a company to listen to its customers. That was our biggest mistake in all of this, IMO: not listening.
For dyno load, have you tried log-runtime-metrics? https://devcenter.heroku.com/articles/log-runtime-metrics It provides per-dyno load averages.
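For what it's worth, those numbers are easier to act on if you pull them out of the log stream rather than eyeballing them. A minimal sketch, assuming the space-separated key=value line format the Dev Center article describes (the sample line below is illustrative, not captured from a real app):

```python
# Parse per-dyno load averages out of a log-runtime-metrics line.
# Assumes the space-separated key=value format from the Heroku docs;
# the sample line below is illustrative.

def parse_runtime_metrics(line):
    """Return a dict of the sample#* metrics found in one log line."""
    metrics = {}
    for token in line.split():
        if token.startswith("sample#") and "=" in token:
            key, _, value = token.partition("=")
            metrics[key.replace("sample#", "")] = value
    return metrics

line = ("source=web.1 dyno=heroku.12345.example "
        "sample#load_avg_1m=2.46 sample#load_avg_5m=1.06 "
        "sample#load_avg_15m=0.99")
print(parse_runtime_metrics(line)["load_avg_1m"])  # -> 2.46
```

Feed your drain through something like this and you can at least alert when a dyno's load average crosses a threshold, instead of discovering it by logging in.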
I gladly accept your compliment that our git deploy remains the best on the market. :)
I'm sorry we haven't been able to serve you better. Let me know if you'd be willing to talk via Skype sometime; even if you end up leaving the platform (or already have), I'd like to understand in more depth where we went wrong so we can do better in the future.
Yes, autoscaling is hard. I have apps on Google AppEngine and see their issues as well. That said, at least they are trying. Maybe even take one of those 3rd-party libraries, harden and adapt it, and make it a real solution? I think the real problem, though, is that there aren't any good metrics for what dynos are doing, so there is no signal for when something is too busy or not. Yes, log-runtime-metrics puts out some numbers, but those numbers are meaningless when all I have is a slider to change the amount of money we are paying you.
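To make the metric-to-slider gap concrete: given a usable busyness signal, the scaling decision itself is small. A purely hypothetical sketch (the p95 latency input, the thresholds, and the growth rule are all made up for illustration, not anything Heroku provides):

```python
# Hypothetical scaling rule: map a busyness metric (here, p95 request
# latency in ms -- an assumed input, not something Heroku exposes) to a
# dyno count. Thresholds and growth/shrink rates are illustrative.

def desired_dynos(current, p95_ms, target_ms=500, min_dynos=2, max_dynos=50):
    """Scale up when latency is over target, down when well under it."""
    if p95_ms > target_ms:
        proposed = current + max(1, current // 4)   # grow ~25% per step
    elif p95_ms < target_ms * 0.5:
        proposed = current - 1                      # shrink slowly
    else:
        proposed = current                          # in the comfort zone
    return max(min_dynos, min(max_dynos, proposed))

print(desired_dynos(8, 900))   # -> 10 (over target: scale up)
print(desired_dynos(8, 100))   # -> 7  (well under target: scale down)
print(desired_dynos(8, 400))   # -> 8  (hold steady)
```

The hard part, as the thread says, isn't this function; it's getting a trustworthy input signal per app, which is exactly what's missing.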
I should qualify that git deploy compliment, because there are issues there as well. For example, why do you have to rebuild the npm modules from scratch each time? Why not have a directory full of pre-built modules for your dynos that are just copied into my slug? This relatively simple change would speed up deployments greatly. Never mind that deployments aren't reliable and fail randomly. At least it is easy to just try again.
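The pre-built-modules idea could be sketched as: key a cache of built dependencies on a checksum of package.json and only rebuild when that checksum changes. Purely illustrative, and not how Heroku's buildpack is actually implemented:

```python
# Illustrative sketch of the "cache built modules" idea: key the cache
# on a hash of package.json and skip the npm rebuild when the manifest
# hasn't changed. Hypothetical -- not Heroku's actual buildpack logic.

import hashlib

def cache_key(package_json_text):
    """Stable key for a dependency set: sha256 of package.json."""
    return hashlib.sha256(package_json_text.encode("utf-8")).hexdigest()

def needs_rebuild(package_json_text, cached_key):
    """Rebuild only when the dependency manifest has changed."""
    return cache_key(package_json_text) != cached_key

pkg = '{"dependencies": {"express": "3.x"}}'
key = cache_key(pkg)
print(needs_rebuild(pkg, key))                        # -> False (reuse cache)
print(needs_rebuild(pkg.replace("3.x", "4.x"), key))  # -> True  (rebuild)
```

The trade-off is stale-cache bugs (native extensions built against a different stack image, lockfile drift), which is presumably part of why a naive copy isn't a free win.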
Would love the chance to win back your trust and hang onto your business. Let me know your app name (in email if you prefer) and I can see if there's anything we can do for you in the near term.
'why not get a dedicated professional to handle devops on your team'
It sure is easy to write that, but the reality isn't as rosy. I've gone through that process at two companies, interviewing ~50-80 people, and it has been a nightmare. It is really, really difficult to find quality devops people. Again, this makes platforms like AWS, Heroku and AppEngine a lot more attractive: they are betting their entire business on being able to hire good devops people, so they tend to attract better talent.
Any thoughts on that? It offers a Heroku'esque deploy.
I should also add that one thing that Heroku did get 100% correct is the heroku logs -t command (aka: tailing the logs). Nobody else does that one quite as well.