"the switch from intelligent to randomized routing"
As I understand it, Heroku (on the Bamboo stack) didn't up and decide "Hey, we're gonna switch from intelligent to random routing." The routers are still (individually) intelligent. It's just that there are more of them now, and they were never designed to distribute their internal queue state across the cluster. The system as a whole behaves more and more like a random router as the number of intelligent bamboo routing nodes increases.
True, although it's more confusing than this. We didn't make an explicit decision for Bamboo, and thus the big problem — docs fell out of date, at a minimum.
For Cedar, we did make an explicit decision. People on the leading edge of web development were running concurrent backends and long-running requests. Our experimental support for Node.js in 2010 was a big driver here, but also people who wanted to use concurrency in Ruby, like Ilya Grigorick's Goliath and Rails becoming threadsafe. These folks complained that the single-request-per-backend algorithm was actually in their way.
This plus horizontal scaling / reliability demands caused us to make an explicit product decision for Cedar.
In the early days, Heroku only had a single routing node that sat out front. So it wasn't a distributed systems problem at that point. You could argue that Heroku circa 2009 was more of a prototype or a toy than a scalable piece of infrastructure. You couldn't run background workers, or large databases. We weren't even charging money yet.
Implementing a single global queue in a single node is trivial. In fact, this is what Unicorn (and other concurrent backends) do: put a queue within a single node, in this case a dyno. That's how we implemented it in the Heroku router (written in Erlang).
Later on, we scaled out to a few nodes, which meant a few queues. This was close enough to a single queue that it didn't matter much in terms of customer impact.
In late 2010 and early 2011 our growth started to really take off, and that's when we scaled out our router nodes to far more than a handful. And that's when the global queue effectively ceased to exist, even though we hadn't changed a line of code in the router.
The problem with this, of course, is that we didn't give it much attention because we had just launched a new product which made the explicit choice to leave out global queueing. It's this failure to continue full stewardship of our existing product that's the mistake that really hurt customers.
So to answer your question, there was never some crazy-awesome implementation of a distributed global queue that we got rid of. It was a single node's queue, a page of code in Erlang which is not too different from the code that you'll find in Unicorn, Puma, GUnicorn, Jetty, etc.
> So to answer your question, there was never some crazy-awesome implementation of a distributed global queue that we got rid of.
This sentence is really good, and I would humbly suggest you consider hammering it home even more than you have.
I gathered early on that there were inherent scaling issues with the initial router (which makes sense intuitively if you think about Heroku's architecture for more than 10 seconds), but I feel like most of the articles I've seen the past few weeks have this "Heroku took away our shiny toys because they could!" vibe. (Alternative ending: "Heroku took away our shiny toys to expand their market to nodeJS!")
I don't understand how the system behaves more intelligently when the edge routers increase. The core routers random behavior gets worse with the larger load, and increasing the number of intelligent routers doesn't help to solve that problem in any meaningful way.
Sorry, correct me if I missed something, but I believe that as the overall volume of system transactions increases (thus necessitating more "intelligent" nodes) the volume of random dispersal from the core routers increases as well, which can create situations like what we saw with RapGenius where some requests are 100ms and others are 6500ms (the reason being that random routing is not intelligent and can assign jobs to a node that's completely saturated). Adding more and more intelligent nodes doesn't solve the crux of the issue, which is the random assignment of jobs in the core routers to the "intelligent" routers/nodes.
This whole situation boils down to this: "Intelligent routing is hard, so fuck it" and that's why everyone is pissed off. Heroku could've said "hey Intelligent routing is hard so we're not doing that anymore" but instead they just silently deprecated the service. It's a textbook example of how to be a bad service provider.
Ok, let's really dig in on this. Is this truly a case of us being lazy? We just can't be bothered to implement something that would make our customers' lives better?
The answer to these questions is no.
Single global request queues have trade-offs. One of those tradeoffs is more latency per request. Another is availability on the app. Despite the sentiment here on Hacker News, most of our customers tell us that they're not willing to trade lower availability and higher latency per request for the efficiencies of a global request queue.
Are there other routing implementations that would be a happy medium between pure random (routers have no knowledge of what dynos are up to) and perfect, single global queue (routers have complete, 100% up-to-date knowledge of what dynos are up to)? Yes. We're experimenting with those; so far none have proven to be overwhelmingly good.
In the meantime, concurrent backends give the ability to run apps at scale on Heroku today; and offer other benefits, like cost efficiencies. That's why we're leaning on this area in the near term.
> What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application?
This is how it works. Dynos register their presence into a dyno manager which publishes the results into a feed, and then all the routing nodes subscribe to that feed.
But dyno presence is not the rapidly-changing data which is subject to CAP constraints; it's dyno activity, which changes every few milliseconds (e.g. whenever a request begins or ends). Any implementation that tracks that data will be subject to CAP, and this is where you make your choice on tradeoffs.
> why would that mean "lower availability" or "higher latency"?
I'll direct you back to the same resources we've referenced before:
Zookeeper and Doozerd make almost the opposite trade-off as what's needed in the router: they are both really slow, in exchange for high availability and perfect consistency. Useful for many things but not tracking fast-changing data like web requests.
Hm. Until now I thought dyno-presence is your issue, but now I realize you're talking about the actual "leastconn" part, i.e. the requests queueing up on the dynos itself?
If that's what you actually mean then I'd ask: Can't the dynos reject requests when they're busy ("back pressure")?
AFAIK that's the traditional solution to distributing the "leastconn" constraint.
In practice we've implemented this either with the iptables maxconn rule (reject if count >= worker_threads), or by having the server immediately close the connection.
What happens is that when a loadbalancer hits an overloaded dyno the connection is rejected and it immediately retries the request on a different backend.
Consequently the affected request incurs an additional roundtrip per overloaded dyno, but that is normally much less of an issue than queueing up requests on a busy backend (~20ms retry vs potentially a multi-second wait).
> What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application
I suspect this is a consequence of the CAP theorem. You'll end up with every loadbalancer needing a near-instantaneous perception of every server's queue state and then updating that state atomically when routing a request. Now consider the failure modes that such a system can enter and how they affect latency. Best not to go there.
My understanding is that Apache Zookeeper is designed for slowly-changing data.
You'll end up with every loadbalancer needing a near-instantaneous perception of every server's queue
But that's not true. Only the loadbalancers concerned with a given application need to share that state amongst one another. And the number of loadbalancers per application is usually very small. I.e. the number is <1 for >99% of sites and you need quite a popular site to push it into the double digits (a single haproxy instance can sustain >5k connect/sec).
Assigning pooled loadbalancers to apps while ensuring HA is not trivial, but it's also not rocket science. I'm a little surprised by the heroku response here, hence my question which constraint I might have missed.
My understanding is that Apache Zookeeper is designed for slowly-changing data.
Dyno-presence per application is very slowly-changing data by zookeeper standards.
Again, I'm no expert on Heroku's architectre. Just thinking out loud here,
and feel free to tell me to RTFA. :-)
> the number of loadbalancers per application is usually very small. I.e. the number is <1 for >99% of sites and you need quite a popular site to push it into the double digits (a single haproxy instance can sustain >5k connect/sec).
So most Heroku sites have only a single frontend loadbalancer doing their routing, and even these cases are getting random routed with suboptimal results?
Or is the latency issue mainly with respect to exactly those popular sites that end up using a distributed array of loadbalancers?
> Assigning pooled loadbalancers to apps while ensuring HA is not trivial, but it's also not rocket science.
To me the short history of "cloud-scale" (sorry) app proxy load balancing shows that very well-resourced and well-engineered systems often work great and scale great, that is until some weird failure mode unbalances the whole system and response time goes all hockey stick.
> Dyno-presence per application is very slowly-changing data by zookeeper standards.
OK, but instantaneous queue depth for each and every server? (within a given app)
You seem to have misread the above comment. Here's what it actually said:
The system as a whole behaves more and more like a random router as the number of intelligent bamboo routing nodes increases
The point is that this process was gradual and implicit, so there's no point at which the intelligence was explicitly "deprecated". That doesn't excuse how things ended up, but it does explain it to some degree.
Er, I wasn't trying to claim that the system behaves more intelligently. The performance degradation is just an emergent consequence of gradually adding more nodes to the routing tier. This post might be useful for some context: http://aphyr.com/posts/277-timelike-a-network-simulator