A lot of Heroku's apparent value came from the intelligent routing feature. Everybody knew that it was harder to implement than random routing, that's why they were willing to pay Heroku for it.
Nobody's arguing random routing isn't easier and more stable; of course it is.
The problem is that by switching over to it, Heroku gave up a major selling point of their platform. Are they really blind enough not to know this? I have a hard time believing that.
It seems to me the real way to make people happy is to discount the "base" products which come with random routing and make intelligent routing available as a premium feature. Of course, people who thought they were getting intelligent routing should be credited.
However, if someone chose us purely because of a routing algorithm, we probably weren't a great fit for them to begin with. We're not selling algorithms, we're selling abstraction, convenience, elasticity, and productivity.
I do see that part of the reason this topic has been very emotional (for us inside Heroku as well as our customers and the community) is that the Heroku router has historically been seen as a sort of magic black box. This matter required peeking inside that black box, which I think has created a sense of some of that magic being dispelled.
Developers don't want to pay for abstraction just for abstraction's sake, they want to pay for abstraction of difficult things. Intelligent routing is one of those difficult things. Random routing is easy, which is of course why it's also more reliable, but also why you're seeing people feeling like they didn't get what they paid for.
I should be clear; this doesn't affect me personally but I am totally sympathetic with the customers who are bent out of shape about this and I still see a divide between your response and why people are upset, and I'm trying to help you bridge that divide.
It's interesting — very few customers are actually bent out of shape about this. (A few are, for sure.) It's more non-customers who are watching from the sidelines that seem to be upset. I do want to try to explain ourselves to the community in general, and that's what this post was for. But my first loyalty is to serving our current customers well.
I'm using gunicorn with python, and if I use the sync worker the request queue easily hits 10 seconds and nothing works; if I switch to gevent or eventlet, new-relic tells me that redis is taking the same time stuck getting a value. This is using the same code in my current provider that works just fine with eventlet and scales well.
To add insult to injury, adding dynos actually degrades performance.
Can you email me at adam at heroku dot com and I'll connect you with one of our Python experts? I can't promise we'll solve it, but I'd like to take a look.
Would you be willing to pair up with one of our developers on your app's performance? If so email me (adam at heroku dot com).
Fwiw, I've found the addon / db connection limits to be the primary blocker when load testing so far.
Seems to me you don't get it. Sure there are some very vocal non-customers but you also have a lot of potential customers and users (spinning up free instances) evaluating your product and hoping for a better product. I agree that your true value is the abstraction you provide. Some of these potential customers want to ensure Heroku is as good an abstraction as promised to justify the cost and commitment.
I'd argue that we dropped the ball on that before (on web performance, at least), and are rectifying it now.
A major selling point of Heroku is that scaling dynos wouldn't be a risk. This guarantee is now gone and it's not coming back soon even if routing behavior is reverted, because users prefer good communication and trust with their providers. The responses of Heroku are blithe non-acknowledgement acknowledgements of this problem.
>A lot of Heroku's apparent value came from the intelligent routing feature. Everybody knew that it was harder to implement than random routing, that's why they were willing to pay Heroku for it.
is being addressed by Adam in this comment:
>Heroku's value proposition is that we abstract away infrastructure tasks and let you focus on your app. Keeping you from needing to deal with load balancers is one part of that. If you're worried about how routing works then you're in some way dealing with load balancing.
I think Adam is getting to a really fair point here, which is that nobody really minds whether the particular algorithm is used. If A-Company is using "intelligent routing" and B-Company uses "random routing," but B-Company has better performance and slower queue times, who are you going to choose? You're going to choose B-Company.
At the end of the day, "intelligent routing" is really nothing more than a feather in your cap. People care about performance. That's what started this whole thing - lousy performance. Better performance is what makes it go away, not "intelligent routing."
This is why it was good marketing for Heroku to advertise intelligent routing, instead of just saying 'oh it's a black box, trust us'. You need to know, at the very least, the asymptotic performance beavhior of the black box.
And that's why the change had consequences. In particular, RapGenius designed their software to fit intelligent routing. For their design the dynos needed to guarantee good near-worst-case performance increased with the square of the load, and my back-of-the-envelope math suggests the average case increases by O(n^1.5).
The original RapGenius post documents them here: http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics
The alleged fix, "switch to a concurrent back-end", is hardly trivial and doesn't solve the underlying problem of maldistribution and underutilization. Maybe intelligent routing doesn't scale but 1) there are efficient non-deterministic algorithms that have the desired properties and 2) it appears the old "doesn't scale" algorithm actually worked better at scale, at least for RapGenius.
That's not the point.
As I think of it, "performance" is an observation of a specific case while intelligent/random can be used to predict performance across all cases.
"A major selling point of Heroku is that scaling wouldn't be a risk"
This is interesting, especially the word "risk." Can you expand on this?
To my point of view, the routing is "random" thus kind of unpredictable. If scaling becomes more of an issue with my business, the last thing I want is to have random scaling issues that I can not do anything about because the load balancer is queuing requests randomly to my dynos.
I want my business to be predictable and if I'm not able to have it I'm going to pack my stuff and move somewhere else.
For now, I'm happy with you except for your damn customer service. They take way too long to answer our questions!
Can you email me (adam at heroku dot com) some links to your past support tickets? I'd like to investigate.
Thanks for running your app with us. Naturally, I expect you to move elsewhere if we can't provide you the service you need. That's the beauty of contract-free SaaS and writing your app with open standards: we have to do a good job serving you, or you'll leave. I wouldn't want it any other way.
Fire-fighting during the scaling phase is a problem that every fast-growing software-as-a-service business will probably have to face. I think Heroku makes it easier, perhaps way easier; but I hope our marketing materials etc have not implied a scaling panacea. Such a thing doesn't exist and is most likely impossible, in my view.
As I understand it, Heroku (on the Bamboo stack) didn't up and decide "Hey, we're gonna switch from intelligent to random routing." The routers are still (individually) intelligent. It's just that there are more of them now, and they were never designed to distribute their internal queue state across the cluster. The system as a whole behaves more and more like a random router as the number of intelligent bamboo routing nodes increases.
For Cedar, we did make an explicit decision. People on the leading edge of web development were running concurrent backends and long-running requests. Our experimental support for Node.js in 2010 was a big driver here, but also people who wanted to use concurrency in Ruby, like Ilya Grigorick's Goliath and Rails becoming threadsafe. These folks complained that the single-request-per-backend algorithm was actually in their way.
This plus horizontal scaling / reliability demands caused us to make an explicit product decision for Cedar.
that makes sense, i was wondering how intelligent routing was implemented in the first place.
In the early days, Heroku only had a single routing node that sat out front. So it wasn't a distributed systems problem at that point. You could argue that Heroku circa 2009 was more of a prototype or a toy than a scalable piece of infrastructure. You couldn't run background workers, or large databases. We weren't even charging money yet.
Implementing a single global queue in a single node is trivial. In fact, this is what Unicorn (and other concurrent backends) do: put a queue within a single node, in this case a dyno. That's how we implemented it in the Heroku router (written in Erlang).
Later on, we scaled out to a few nodes, which meant a few queues. This was close enough to a single queue that it didn't matter much in terms of customer impact.
In late 2010 and early 2011 our growth started to really take off, and that's when we scaled out our router nodes to far more than a handful. And that's when the global queue effectively ceased to exist, even though we hadn't changed a line of code in the router.
The problem with this, of course, is that we didn't give it much attention because we had just launched a new product which made the explicit choice to leave out global queueing. It's this failure to continue full stewardship of our existing product that's the mistake that really hurt customers.
So to answer your question, there was never some crazy-awesome implementation of a distributed global queue that we got rid of. It was a single node's queue, a page of code in Erlang which is not too different from the code that you'll find in Unicorn, Puma, GUnicorn, Jetty, etc.
This sentence is really good, and I would humbly suggest you consider hammering it home even more than you have.
I gathered early on that there were inherent scaling issues with the initial router (which makes sense intuitively if you think about Heroku's architecture for more than 10 seconds), but I feel like most of the articles I've seen the past few weeks have this "Heroku took away our shiny toys because they could!" vibe. (Alternative ending: "Heroku took away our shiny toys to expand their market to nodeJS!")
Anyway, that's my take.
So it was an oversimplified system that worked great but wasn't scalable and was at some point going to completely fall over under increasing load.
IMExp, this is not a wrong thing to build initially and it's not wrong to replace it either. But the replacement is going to have a hard time being as simple or predictable. :-)
Sorry, correct me if I missed something, but I believe that as the overall volume of system transactions increases (thus necessitating more "intelligent" nodes) the volume of random dispersal from the core routers increases as well, which can create situations like what we saw with RapGenius where some requests are 100ms and others are 6500ms (the reason being that random routing is not intelligent and can assign jobs to a node that's completely saturated). Adding more and more intelligent nodes doesn't solve the crux of the issue, which is the random assignment of jobs in the core routers to the "intelligent" routers/nodes.
This whole situation boils down to this: "Intelligent routing is hard, so fuck it" and that's why everyone is pissed off. Heroku could've said "hey Intelligent routing is hard so we're not doing that anymore" but instead they just silently deprecated the service. It's a textbook example of how to be a bad service provider.
Ok, let's really dig in on this. Is this truly a case of us being lazy? We just can't be bothered to implement something that would make our customers' lives better?
The answer to these questions is no.
Single global request queues have trade-offs. One of those tradeoffs is more latency per request. Another is availability on the app. Despite the sentiment here on Hacker News, most of our customers tell us that they're not willing to trade lower availability and higher latency per request for the efficiencies of a global request queue.
Are there other routing implementations that would be a happy medium between pure random (routers have no knowledge of what dynos are up to) and perfect, single global queue (routers have complete, 100% up-to-date knowledge of what dynos are up to)? Yes. We're experimenting with those; so far none have proven to be overwhelmingly good.
In the meantime, concurrent backends give the ability to run apps at scale on Heroku today; and offer other benefits, like cost efficiencies. That's why we're leaning on this area in the near term.
What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application?
Also why would that mean "lower availability" or "higher latency"? Did you look into zookeeper?
This is how it works. Dynos register their presence into a dyno manager which publishes the results into a feed, and then all the routing nodes subscribe to that feed.
But dyno presence is not the rapidly-changing data which is subject to CAP constraints; it's dyno activity, which changes every few milliseconds (e.g. whenever a request begins or ends). Any implementation that tracks that data will be subject to CAP, and this is where you make your choice on tradeoffs.
> why would that mean "lower availability" or "higher latency"?
I'll direct you back to the same resources we've referenced before:
> Did you look into zookeeper?
This is the best question ever. Not only did we look into it, we actually invested several man-years of engineering into building our own Zookeeper-like datastore:
Zookeeper and Doozerd make almost the opposite trade-off as what's needed in the router: they are both really slow, in exchange for high availability and perfect consistency. Useful for many things but not tracking fast-changing data like web requests.
If that's what you actually mean then I'd ask: Can't the dynos reject requests when they're busy ("back pressure")?
AFAIK that's the traditional solution to distributing the "leastconn" constraint.
In practice we've implemented this either with the iptables maxconn rule (reject if count >= worker_threads), or by having the server immediately close the connection.
What happens is that when a loadbalancer hits an overloaded dyno the connection is rejected and it immediately retries the request on a different backend.
Consequently the affected request incurs an additional roundtrip per overloaded dyno, but that is normally much less of an issue than queueing up requests on a busy backend (~20ms retry vs potentially a multi-second wait).
PS: Do you seriously consider Zookeeper "really slow"?! http://zookeeper.apache.org/doc/r3.1.2/zookeeperOver.html#Pe...
> What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application
I suspect this is a consequence of the CAP theorem. You'll end up with every loadbalancer needing a near-instantaneous perception of every server's queue state and then updating that state atomically when routing a request. Now consider the failure modes that such a system can enter and how they affect latency. Best not to go there.
My understanding is that Apache Zookeeper is designed for slowly-changing data.
But that's not true. Only the loadbalancers concerned with a given application need to share that state amongst one another. And the number of loadbalancers per application is usually very small. I.e. the number is <1 for >99% of sites and you need quite a popular site to push it into the double digits (a single haproxy instance can sustain >5k connect/sec).
Assigning pooled loadbalancers to apps while ensuring HA is not trivial, but it's also not rocket science. I'm a little surprised by the heroku response here, hence my question which constraint I might have missed.
My understanding is that Apache Zookeeper is designed for slowly-changing data.
Dyno-presence per application is very slowly-changing data by zookeeper standards.
> the number of loadbalancers per application is usually very small. I.e. the number is <1 for >99% of sites and you need quite a popular site to push it into the double digits (a single haproxy instance can sustain >5k connect/sec).
So most Heroku sites have only a single frontend loadbalancer doing their routing, and even these cases are getting random routed with suboptimal results?
Or is the latency issue mainly with respect to exactly those popular sites that end up using a distributed array of loadbalancers?
> Assigning pooled loadbalancers to apps while ensuring HA is not trivial, but it's also not rocket science.
To me the short history of "cloud-scale" (sorry) app proxy load balancing shows that very well-resourced and well-engineered systems often work great and scale great, that is until some weird failure mode unbalances the whole system and response time goes all hockey stick.
> Dyno-presence per application is very slowly-changing data by zookeeper standards.
OK, but instantaneous queue depth for each and every server? (within a given app)
The system as a whole behaves more and more like a random router as the number of intelligent bamboo routing nodes increases
The point is that this process was gradual and implicit, so there's no point at which the intelligence was explicitly "deprecated". That doesn't excuse how things ended up, but it does explain it to some degree.
But that's the service people thought they were getting and what they wanted.
If Heroku prices out the intelligent routing and says; "Ok you can have intelligent routing with your current backend stack, but it's going to cost you $25/mo for evert 10 dynos, or you can switch your stack and use randomized routing for free." Then they are empowering their customers to make the choice rather than dictating to them what they should do.
Aside from that, I am extremely sympathetic to Heroku's engineering point here --- it's obviously hard for HN to extract the engineering from the drama in this case! Randomized dispatch seems like an eminently sound engineering solution to the request routing problem, and the problems actually implementing it in production seem traceable almost entirely to††† the ways Rails managed to set back scalable web request dispatch by roughly a decade††††.
††† IT IS ALL LOVE WITH ME AND THIS POINT COMING UP HERE...
†††† ...it was probably worth it!
The solution is to combine the two approaches. You split the 100 nodes into 10 groups of 10, you route randomly to one of the groups, and then within a group you route intelligently. This works really well. The probability of one of the request queues filling up is astronomically small, because for a request queue to fill up, all 10 request queues in a group have to fill up simultaneously (and as we know from math, the chance that an event with probability p occurs at n places simultaneously is exponentially small in n). Even if you route randomly to 50 groups of 2, that works a lot better than routing randomly to 100 groups of 1 (though obviously not as well as 10 groups of 10). There is a paper about this: http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...
This is essentially what they are suggesting: run multiple concurrent processes on one dyno. Then the requests are routed randomly to a dyno, but within a dyno the requests are routed intelligently to the concurrent processes running on that dyno. There are two problems with this: (1) dynos have ridiculously low memory so you may not be able to run many (if any) concurrent processes on a single dyno (2) if you have contention for a shared resource on a dyno (e.g. the hard disk) you're back to the old situation. They are partially addressing point (1) by providing dynos with 2x the memory of a normal dyno, which given a Rails app's memory requirements is still very low (you probably have to look hard to find a dedicated server that doesn't have at least 20x as much memory).
They could be providing intelligent routing within groups of dynos (say groups of 10) and random routing to each group, but apparently they have decided that this is not worth the effort. Another thing is that apparently their routing is centralized for all their customers. Rapgenius did have what, 150 requests per second? Surely that can even be handled by a single intelligent router if they had a dedicated router per customer that's above a certain size (of course you still have to go to the groups of dynos model once a single customer grows beyond the size that a single intelligent router can handle).
There's a tradeoff between:
* a well-engineered request handler (a solved problem more than a decade ago) and
* an efficient development environment (arguably a nearly-unsolved problem before the Rails era)
And I feel like mostly the Heroku drama is a result of Rails developers not grokking which end of that tradeoff they've selected.
Of course Heroku is under no obligation to do anything, but its customers have to justify its cost and low performance relative to a dedicated server. And most applications run just fine on a single or at most a couple dedicated servers, which means you don't have routing problems at all, whereas to get reasonable throughput on Heroku you have to get many Dynos, plus a database server. A database server with 64GB ram costs $6400 per month. You can get a dedicated server with that much ram for $100 per month. Heroku is supposed to be worth that premium because it is convenient to deploy on and scale. Because of these routing problems which may require a lot of engineering effort in your application it's not even clear that Heroku is more convenient (e.g. making it use less memory so that you can run many concurrent request handlers on a single Dyno).
I'm not sure there are such providers, and if there aren't, I think it's safe to point the finger towards Rails.
As a system for efficiently handling database-backed web requests, Rails is archaic. Not just because of its memory use requirements! It is simultaneously difficult to thread and difficult to run as asynchronous deferrable state machines.
These are problems that Schmidt and the ACE team wrote textbooks about more than 10 years ago.
(Again, Rails has a lot of compensating virtues; I like Rails.)
> I'm not sure there are such providers, and if there aren't, I think it's safe to point the finger towards Rails.
This is not sound logic. I described above two methods for solving the problem: (1) increase the memory per Dyno (see below: they're doing this, going from 512MB to 1GB per Dyno IIRC, which although still low will be a great improvement if that means that your app can now run 2 concurrent processes per Dyno instead of 1), or (2) do intelligent routing for small groups of Dynos. Do you understand the problem with random routing, and why either of these two would solve it? If not you might find the paper I linked to previously very interesting:
"To motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that n balls are thrown into n bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately log n / log log n with high probability. Now suppose instead that the balls are placed sequentially, and each ball is placed in the least loaded of d >= 2 bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this case, the maximum load is log log n / log d + Θ(1) with high probability [ABKU99].
The important implication of this result is that even a small amount of choice can lead to drastically different results in load balancing. Indeed, having just two random choices (i.e., d = 2) yields a large reduction in the maximum load over having one choice, while each additional choice beyond two decreases the maximum load by just a constant factor."
Most things are inferior to other substitutable things! :)
And here we've re-invented the airport passport checking queue - everybody hops onto the end of a big long single queue, then near the front you get to choose the shortest of the dozen or two individual counter queues
I wonder what the hybrid intelligent/random queue analogues of the in-queue intelligence gathering and decision making you caan do at the airport might be? "Hmmm, a family with small children, I'll avoid their counter queue even if it's shortest", "a group of experienced-looking business travellers, they'll probably blow through the paperwork quickly, I'll queue behind them". I wonder if it's possible/profitable to characterize requests in the queue in those kinds of ways?
The difference, of course, is that ELBs are single-tenant. So a big app might only end up with half a dozen nodes, instead of the much larger number in Heroku's router fleet.
Offering some kind of single-tenant router is one possibility we've considered. Partitioning the router fleet, homing... all are ideas we've experimented with and continue to explore. If one of these produces conclusive evidence that it provides a better product for our customers and in keeping with the Heroku approach to product design, obviously we'll do it.
My hypothesis is that tenant-specific intelligent load balancers would be plausible; i would guess that you would never need more than a handful of HAProxy or nginx-type balancers to front even a large application. Your main challenge would then be routing requests to the right load balancer cluster. If you had your own hardware, LVS could handle that (i believe that Wikipedia in 2010 ran all page text requests through a single LVS in each datacentre), but i'm not sure what you do on EC2.
However, "hypothesis" is just a fancy way of saying "guess", which is why your findings from actual experiments would be so interesting.
There are screenshots of the website where it said that in multiple places.