I hear you. Heroku's value proposition is that we abstract away infrastructure tasks and let you focus on your app. Keeping you from needing to deal with load balancers is one part of that. If you're worried about how routing works then you're in some way dealing with load balancing.
However, if someone chose us purely because of a routing algorithm, we probably weren't a great fit for them to begin with. We're not selling algorithms, we're selling abstraction, convenience, elasticity, and productivity.
I do see that part of the reason this topic has been very emotional (for us inside Heroku as well as our customers and the community) is that the Heroku router has historically been seen as a sort of magic black box. This matter required peeking inside that black box, which I think has created a sense of some of that magic being dispelled.
You sound like a politician talking to someone of the opposite party, in that you say "I hear you", but then completely fail to address anyone's concerns. Selling a "magic black box" that guarantees certain properties, changes them, then lies about having changed them presents a liability for users who want to do serious work.
A major selling point of Heroku was that scaling dynos wouldn't be a risk. That guarantee is now gone, and it's not coming back soon even if the routing behavior is reverted, because users value good communication and trust with their providers. Heroku's responses have been blithe non-acknowledgement acknowledgements of this problem.
>A lot of Heroku's apparent value came from the intelligent routing feature. Everybody knew that it was harder to implement than random routing, that's why they were willing to pay Heroku for it.
is being addressed by Adam in this comment:
>Heroku's value proposition is that we abstract away infrastructure tasks and let you focus on your app. Keeping you from needing to deal with load balancers is one part of that. If you're worried about how routing works then you're in some way dealing with load balancing.
I think Adam is getting at a really fair point here, which is that nobody really minds which particular algorithm is used. If A-Company uses "intelligent routing" and B-Company uses "random routing," but B-Company has better performance and shorter queue times, who are you going to choose? You're going to choose B-Company.
At the end of the day, "intelligent routing" is really nothing more than a feather in your cap. People care about performance. That's what started this whole thing - lousy performance. Better performance is what makes it go away, not "intelligent routing."
Intelligent routing and random routing have different Big O properties. For someone familiar with routing, or someone who's looked into the algorithmic properties, "intelligent routing" gives one high-level picture of what the performance will be like (good with sufficient capacity), whereas random routing gives a different one (deep queues at load factors where you wouldn't expect deep queues).
This is why it was good marketing for Heroku to advertise intelligent routing, instead of just saying 'oh it's a black box, trust us'. You need to know, at the very least, the asymptotic performance behavior of the black box.
And that's why the change had consequences. In particular, RapGenius designed their software to fit intelligent routing. For their design, the number of dynos needed to guarantee good near-worst-case performance grows with the square of the load, and my back-of-the-envelope math suggests the average case grows as O(n^1.5).
The alleged fix, "switch to a concurrent back-end", is hardly trivial and doesn't solve the underlying problem of maldistribution and underutilization. Maybe intelligent routing doesn't scale but 1) there are efficient non-deterministic algorithms that have the desired properties and 2) it appears the old "doesn't scale" algorithm actually worked better at scale, at least for RapGenius.
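To make the maldistribution point concrete, here's a toy balls-into-bins sketch in plain Python (this is not Heroku's router; the dyno and request counts are made up) showing what random dispatch does to queue depth even when capacity exactly matches load:

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

def route_requests(n_requests, n_dynos, pick):
    """Assign each request to a dyno chosen by `pick`; return queue depths."""
    queues = [0] * n_dynos
    for _ in range(n_requests):
        queues[pick(queues)] += 1
    return queues

# Random routing: each request lands on a uniformly random dyno.
random_q = route_requests(1000, 1000, lambda q: random.randrange(len(q)))

# "Intelligent" (least-busy) routing: each request goes to the shortest queue.
smart_q = route_requests(1000, 1000, lambda q: q.index(min(q)))

print("random routing, deepest queue:    ", max(random_q))
print("least-busy routing, deepest queue:", max(smart_q))
print("idle dynos under random routing:  ", random_q.count(0))
```

With single-threaded dynos, least-busy routing leaves every queue at depth one, while random routing leaves a sizable fraction of dynos idle and piles several requests onto others — deep queues at a load factor where you wouldn't expect any.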
Very small customer here. I don't know much about the abstraction you provide us and I don't want to know as long as things go well.
From my point of view, the routing is "random" and thus kind of unpredictable. If scaling becomes more of an issue for my business, the last thing I want is random scaling issues that I can't do anything about because the load balancer is queuing requests to my dynos at random.
I want my business to be predictable and if I'm not able to have it I'm going to pack my stuff and move somewhere else.
For now, I'm happy with you except for your damn customer service. They take way too long to answer our questions!
Absolutely right, I totally agree. Random scaling issues that you can neither see nor control are exactly the opposite of what we want to be providing.
Can you email me (adam at heroku dot com) some links to your past support tickets? I'd like to investigate.
Thanks for running your app with us. Naturally, I expect you to move elsewhere if we can't provide you the service you need. That's the beauty of contract-free SaaS and writing your app with open standards: we have to do a good job serving you, or you'll leave. I wouldn't want it any other way.
Fire-fighting during the scaling phase is a problem that every fast-growing software-as-a-service business will probably have to face. I think Heroku makes it easier, perhaps way easier; but I hope our marketing materials etc have not implied a scaling panacea. Such a thing doesn't exist and is most likely impossible, in my view.
I doubt anyone chose Heroku solely because of the routing algorithm, but the sales point of intelligent routing was certainly an appealing one.
Developers don't want to pay for abstraction just for abstraction's sake, they want to pay for abstraction of difficult things. Intelligent routing is one of those difficult things. Random routing is easy, which is of course why it's also more reliable, but also why you're seeing people feeling like they didn't get what they paid for.
I should be clear: this doesn't affect me personally, but I am totally sympathetic with the customers who are bent out of shape about this. I still see a divide between your response and why people are upset, and I'm trying to help you bridge that divide.
It's interesting — very few customers are actually bent out of shape about this. (A few are, for sure.) It's more non-customers who are watching from the sidelines that seem to be upset. I do want to try to explain ourselves to the community in general, and that's what this post was for. But my first loyalty is to serving our current customers well.
What about potential customers? I've been evaluating your platform and am completely frustrated after wasting three days trying to solve these performance issues.
I'm using gunicorn with Python, and if I use the sync worker the request queue easily hits 10 seconds and nothing works; if I switch to gevent or eventlet, New Relic tells me the same amount of time is now spent blocked in Redis getting a value. This is the same code that works just fine with eventlet, and scales well, on my current provider.
To add insult to injury, adding dynos actually degrades performance.
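For reference, the gevent setup I'm describing boils down to a config file like this (a sketch; the numbers are placeholders I've been experimenting with, not tuned values):

```python
# gunicorn.conf.py -- placeholder values, tune for your own app
worker_class = "gevent"   # cooperative workers: one slow request no longer blocks the dyno
workers = 4               # worker processes per dyno
worker_connections = 50   # greenlets (concurrent requests) per worker
timeout = 30              # restart workers stuck longer than this
```

Launched with `gunicorn -c gunicorn.conf.py app:app` (where `app:app` stands in for your WSGI module). As I understand it, gunicorn's gevent worker monkey-patches the standard library so socket calls like Redis reads yield to other greenlets instead of blocking, which is why the same code behaves so differently under the sync worker.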
I'd be really interested to see some public information resulting from debugging Python apps. We're holding pretty steady, but see a fairly constant stream of timeouts due, apparently, to variance in response times. To be sure, we're working on that. But, in the meantime, our experiments with gevent and Heroku have been less than inspiring.
"It's interesting — very few customers are actually bent out of shape about this"
Seems to me you don't get it. Sure there are some very vocal non-customers but you also have a lot of potential customers and users (spinning up free instances) evaluating your product and hoping for a better product. I agree that your true value is the abstraction you provide. Some of these potential customers want to ensure Heroku is as good an abstraction as promised to justify the cost and commitment.
Fair enough. I think the best thing we can do for those potential future customers is be really clear about what the product does and give them good tools and guidance to evaluate whether it fits their needs.
I'd argue that we dropped the ball on that before (on web performance, at least), and are rectifying it now.