Your response only reinforces my hard-won opinion that Heroku should never be used as a production environment by any business that is trying to be successful and popular. Admitting that you have no idea why critical areas of your infrastructure are causing issues, while at the same time charging people an arm and a leg for services (we pay ~$4k/mo), feels like theft to me. I've built solutions for a large porn company that runs on significantly less infrastructure than what we are running on Heroku and handles 100x more traffic. Something is wrong here with the dyno/router model, and maybe it is that you guys are just oversubscribed and not admitting to it in public.
Yes, autoscaling is hard. I have apps on Google App Engine and see their issues as well. That said, at least they are trying. Maybe even take one of those third-party libraries, harden and adapt it, and make it a real solution? I think the real problem, though, is that there aren't any good metrics for what dynos are doing, so there is no way to tell when something is too busy. Yes, log-runtime-metrics puts out some numbers, but those numbers are meaningless when all I have is a slider to change the amount of money we are paying you.
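To make the metrics point concrete, here is a rough sketch in Python of turning runtime numbers into a scaling decision instead of a manual slider. It assumes the log lines carry `sample#name=value` pairs (for example a 1-minute load average); the function names, the thresholds, and the one-dyno-at-a-time rule are all hypothetical choices for illustration, not anything Heroku provides.

```python
import re

# Assumed log shape: space-separated key=value pairs, with metric keys
# prefixed by "sample#". Units after the number (e.g. "MB") are ignored.
SAMPLE_RE = re.compile(r"sample#(\w+)=([0-9]+(?:\.[0-9]+)?)")


def parse_samples(line):
    """Return the numeric sample#... values found in one log line."""
    return {name: float(value) for name, value in SAMPLE_RE.findall(line)}


def desired_dynos(current, load_values, high=1.5, low=0.5, max_dynos=20):
    """Hypothetical rule: step up when mean load exceeds `high`,
    step down when it falls below `low`, otherwise hold steady."""
    if not load_values:
        return current  # no data, so make no change
    mean_load = sum(load_values) / len(load_values)
    if mean_load > high:
        return min(current + 1, max_dynos)
    if mean_load < low:
        return max(current - 1, 1)  # never scale below one dyno
    return current
```

Even a crude rule like this only works if the numbers reflect how busy a dyno really is, which is exactly the part that is missing.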
I should qualify that git deploy compliment because there are issues with that as well. For example, why do you have to rebuild the npm modules from scratch each time? Why not have a directory full of pre-built modules for your dynos that are just copied into my slug? This relatively simple change would greatly speed up deployments. Never mind that deployments aren't reliable and fail randomly. At least it is easy to just try again.
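The pre-built module idea could look roughly like the sketch below: a cache directory keyed by module name and exact version, with a build only on a cache miss. This is written in Python purely for illustration; `install_modules`, the cache layout, and the `build` callback are hypothetical and say nothing about how Heroku's build process actually works.

```python
import shutil
from pathlib import Path


def install_modules(dependencies, cache_dir, slug_dir, build):
    """Copy each dependency from a prebuilt cache, building only on a miss.

    dependencies: dict of module name -> exact version
    build: hypothetical callable (name, version, dest_path) that builds
           the module into dest_path
    Returns the names of the modules that had to be built.
    """
    cache_dir, slug_dir = Path(cache_dir), Path(slug_dir)
    built = []
    for name, version in sorted(dependencies.items()):
        cached = cache_dir / f"{name}@{version}"
        if not cached.exists():
            # Cache miss: build once, then reuse on every later deploy.
            cached.mkdir(parents=True)
            build(name, version, cached)
            built.append(name)
        # Cache hit or fresh build: copy the prebuilt module into the slug.
        target = slug_dir / "node_modules" / name
        shutil.copytree(cached, target, dirs_exist_ok=True)
    return built
```

For modules with native code, the cache key would presumably also need the platform and Node version, but the copy-instead-of-rebuild idea stays the same.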