Hacker News new | comments | show | ask | jobs | submit login

Is anybody operating at Heroku's scale offering centralized request routing queues? At what price?

Not that I know of but that's why I'm saying it would be a premium product. Likely pricing would have to scale with the number of dynos running behind the router.

But that's the service people thought they were getting and what they wanted.

If Heroku prices out the intelligent routing and says; "Ok you can have intelligent routing with your current backend stack, but it's going to cost you $25/mo for evert 10 dynos, or you can switch your stack and use randomized routing for free." Then they are empowering their customers to make the choice rather than dictating to them what they should do.

If it's truly impossible to get centralized request routing queues at Heroku's scale in any other product offering, that is evidence that a demand that Heroku provide it might be unreasonable.

Aside from that, I am extremely sympathetic to Heroku's engineering point here --- it's obviously hard for HN to extract the engineering from the drama in this case! Randomized dispatch seems like an eminently sound engineering solution to the request routing problem, and the problems actually implementing it in production seem traceable almost entirely to††† the ways Rails managed to set back scalable web request dispatch by roughly a decade††††.


†††† ...it was probably worth it!

Random routing vs fully centralized request routing is a false dichtonomy. Suppose you have 100 nodes, and you have a router that routes randomly to one of those 100 nodes. This works very poorly. Now suppose you have 100 nodes, and you have a router that routes intelligently too one of those 100 nodes, e.g. to the one with the smallest request queue. From a theoretical perspective this works really well but it may be impossible to implement efficiently.

The solution is to combine the two approaches. You split the 100 nodes into 10 groups of 10, you route randomly to one of the groups, and then within a group you route intelligently. This works really well. The probability of one of the request queues filling up is astronomically small, because for a request queue to fill up, all 10 request queues in a group have to fill up simultaneously (and as we know from math, the chance that an event with probability p occurs at n places simultaneously is exponentially small in n). Even if you route randomly to 50 groups of 2, that works a lot better than routing randomly to 100 groups of 1 (though obviously not as well as 10 groups of 10). There is a paper about this: http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...

This is essentially what they are suggesting: run multiple concurrent processes on one dyno. Then the requests are routed randomly to a dyno, but within a dyno the requests are routed intelligently to the concurrent processes running on that dyno. There are two problems with this: (1) dynos have ridiculously low memory so you may not be able to run many (if any) concurrent processes on a single dyno (2) if you have contention for a shared resource on a dyno (e.g. the hard disk) you're back to the old situation. They are partially addressing point (1) by providing dynos with 2x the memory of a normal dyno, which given a Rails app's memory requirements is still very low (you probably have to look hard to find a dedicated server that doesn't have at least 20x as much memory).

They could be providing intelligent routing within groups of dynos (say groups of 10) and random routing to each group, but apparently they have decided that this is not worth the effort. Another thing is that apparently their routing is centralized for all their customers. Rapgenius did have what, 150 requests per second? Surely that can even be handled by a single intelligent router if they had a dedicated router per customer that's above a certain size (of course you still have to go to the groups of dynos model once a single customer grows beyond the size that a single intelligent router can handle).

I understand and don't disagree with everything you are saying, but the focus of my attention is on what you're talking about in your 3rd graf. When you talk about your example problems (1) and (2) with routing to concurrent systems on large number of dynos, what you're really discussing is an engineering flaw in the typical Rails stack.

There's a tradeoff between:

* a well-engineered request handler (a solved problem more than a decade ago) and

* an efficient development environment (arguably a nearly-unsolved problem before the Rails era)

And I feel like mostly the Heroku drama is a result of Rails developers not grokking which end of that tradeoff they've selected.

I'm not sure I agree. Yes, it's a Rails problem that it is using large amounts of memory (on the other hand (2) isn't Rails specific at all, it applies equally to e.g. Node). But it's a Heroku problem that it gives Dynos just 512MB of memory. It's a Heroku problem that it doesn't have a good load balancer. Heroku is in the business of providing painless app hosting, and part of that is painless request routing. These problems may not be completely trivial to solve, but they're not rocket science either. Servers these days can hold hundreds of gigabytes of memory, the 512MB limitation is completely artificial on Heroku's part. Intelligent routing in groups is also very much achievable. Sure, it requires engineering effort, but that's the business Heroku is in.

Of course Heroku is under no obligation to do anything, but its customers have to justify its cost and low performance relative to a dedicated server. And most applications run just fine on a single or at most a couple dedicated servers, which means you don't have routing problems at all, whereas to get reasonable throughput on Heroku you have to get many Dynos, plus a database server. A database server with 64GB ram costs $6400 per month. You can get a dedicated server with that much ram for $100 per month. Heroku is supposed to be worth that premium because it is convenient to deploy on and scale. Because of these routing problems which may require a lot of engineering effort in your application it's not even clear that Heroku is more convenient (e.g. making it use less memory so that you can run many concurrent request handlers on a single Dyno).

If there is another provider that seamlessly operates at Heroku's scale (ie, that can handle arbitrarily busy Rails apps) at a reasonable price that has better request dispatching, I think it's very easy to show that you're right.

I'm not sure there are such providers, and if there aren't, I think it's safe to point the finger towards Rails.

As a system for efficiently handling database-backed web requests, Rails is archaic. Not just because of its memory use requirements! It is simultaneously difficult to thread and difficult to run as asynchronous deferrable state machines.

These are problems that Schmidt and the ACE team wrote textbooks about more than 10 years ago.

(Again, Rails has a lot of compensating virtues; I like Rails.)

I certainly already agreed that Rails' architecture is bad (though the reason that it has this problem is its memory usage, and not any of the other reasons you mention). Herokus architecture is bad as well. It's the combination of these that causes the problem. But that does not mean that it's impossible, or even hard, to solve the problem at Herokus end.

> I'm not sure there are such providers, and if there aren't, I think it's safe to point the finger towards Rails.

This is not sound logic. I described above two methods for solving the problem: (1) increase the memory per Dyno (see below: they're doing this, going from 512MB to 1GB per Dyno IIRC, which although still low will be a great improvement if that means that your app can now run 2 concurrent processes per Dyno instead of 1), or (2) do intelligent routing for small groups of Dynos. Do you understand the problem with random routing, and why either of these two would solve it? If not you might find the paper I linked to previously very interesting:

"To motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that n balls are thrown into n bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately log n / log log n with high probability. Now suppose instead that the balls are placed sequentially, and each ball is placed in the least loaded of d >= 2 bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this case, the maximum load is log log n / log d + Θ(1) with high probability [ABKU99].

The important implication of this result is that even a small amount of choice can lead to drastically different results in load balancing. Indeed, having just two random choices (i.e., d = 2) yields a large reduction in the maximum load over having one choice, while each additional choice beyond two decreases the maximum load by just a constant factor."

-- http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...

I understand that one approach to dispatching requests at the load balancer is superior to the other, just as I understand that one way of absorbing requests at the app server is better than the other.

Most things are inferior to other substitutable things! :)

That's a mild way of putting it. With the current way of dispatching requests you need exponentially many servers to handle the same load at the same queuing time, if your application uses too much memory to run multiple instances concurrently on a single server.

I work at Heroku. To address you concerns about memory limitations, know that we're fast-tracking 2X dynos (this is also mentioned in the FAQ blog post). Extra memory will make it easier to get more concurrency out of each dyno.

Yes, that will be a huge improvement!

"You split the 100 nodes into 10 groups of 10, you route randomly to one of the groups, and then within a group you route intelligently."

And here we've re-invented the airport passport checking queue - everybody hops onto the end of a big long single queue, then near the front you get to choose the shortest of the dozen or two individual counter queues

I wonder what the hybrid intelligent/random queue analogues of the in-queue intelligence gathering and decision making you caan do at the airport might be? "Hmmm, a family with small children, I'll avoid their counter queue even if it's shortest", "a group of experienced-looking business travellers, they'll probably blow through the paperwork quickly, I'll queue behind them". I wonder if it's possible/profitable to characterize requests in the queue in those kinds of ways?

$25 a month? Did you forget a few zeroes?

It was just a placeholder price. :)

Amazon ELB? It does cost significantly more than Heroku AFAIK.

My understanding is that ELBs are HAProxy, and they may be set to use the leastconn algorithm (a global request queue that is friendly to concurrent backends). However, once you get any amount of traffic they start to scale out the nodes in the ELB, which produces essentially the same results as the degradation of the Bamboo router that we've documented.

The difference, of course, is that ELBs are single-tenant. So a big app might only end up with half a dozen nodes, instead of the much larger number in Heroku's router fleet.

Offering some kind of single-tenant router is one possibility we've considered. Partitioning the router fleet, homing... all are ideas we've experimented with and continue to explore. If one of these produces conclusive evidence that it provides a better product for our customers and in keeping with the Heroku approach to product design, obviously we'll do it.

I hope you'll be able to share your findings with us, even if they're negative. As someone who has no stake in Heroku, i have the luxury of finding this problem simply interesting!

My hypothesis is that tenant-specific intelligent load balancers would be plausible; i would guess that you would never need more than a handful of HAProxy or nginx-type balancers to front even a large application. Your main challenge would then be routing requests to the right load balancer cluster. If you had your own hardware, LVS could handle that (i believe that Wikipedia in 2010 ran all page text requests through a single LVS in each datacentre), but i'm not sure what you do on EC2.

However, "hypothesis" is just a fancy way of saying "guess", which is why your findings from actual experiments would be so interesting.

ELBs have a least-conn per node routing behavior. If your ELB is present in more than 1 AZ, then you have more than one node. If you have any non-trivial amount of scale, then you probably have well more than 1 node.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact