Ruby on Rails, in its default configuration, serves one request per process at a time. There is no cooperative switching (as in Node.js) and no (nearly) preemptive switching (as in Erlang, Haskell, Go, ...).
The routing infrastructure at Heroku is distributed. There are several routers; on the Bamboo stack a router will queue at most one message per back-end dyno, while on the Cedar stack it routes randomly. If two front-end routers route messages to the same dyno, a queue forms, which happens more often on a large router mesh.
Setting aside who is right and who is wrong, there are a couple of points worth making, in my opinion.
The RoR model is very weak. You need to handle more than one connection concurrently, because under high load queueing will eventually happen. If one expensive request goes into the queue, then everyone further down the queue waits. In a more modern system like Node.js you can manually break up the expensive request and thus serve other requests in the queue while the back-end works on the expensive one. In the stronger models of Haskell, Go and Erlang, this break-up is usually automatic, and preemption ensures it is not a problem. If you have one 5000ms job A and ten 50ms jobs, then after 1ms job A is preempted and the 50ms jobs get service. Thus an expensive job doesn't clog the queue, and random queueing in these models is often a very sensible choice.
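To make the 5000ms-vs-50ms example concrete, here is a toy back-of-the-envelope simulation (not a benchmark; the 1ms quantum and zero switch cost are idealized assumptions) comparing run-to-completion FIFO against preemptive round-robin:

```ruby
# One 5000ms job followed by ten 50ms jobs, all times in milliseconds.
jobs = [5000] + [50] * 10

# FIFO, run to completion: each job waits for everything ahead of it.
fifo_finish = []
clock = 0
jobs.each do |cost|
  clock += cost
  fifo_finish << clock
end

# Idealized preemption: round-robin with a 1ms quantum and free switches.
quantum   = 1
remaining = jobs.dup
finish    = Array.new(jobs.size)
clock     = 0
until remaining.all? { |r| r <= 0 }
  remaining.each_with_index do |r, i|
    next if r <= 0
    slice = [quantum, r].min
    clock += slice
    remaining[i] -= slice
    finish[i] = clock if remaining[i] <= 0
  end
end

puts "FIFO mean completion:        #{fifo_finish.sum / fifo_finish.size}ms"
puts "Round-robin mean completion: #{finish.sum / finish.size}ms"
```

Under FIFO every short job finishes after 5000ms+; under round-robin the short jobs all finish within roughly 550ms while job A still finishes last.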
Note that Heroku is doing distributed routing, so the statistical model Rap Genius has made is wrong. One, requests do not arrive in a Poisson process: one page load usually gives rise to several further calls to the back-end, which makes requests dependent on each other. Two, there is not a single queue and router, but multiple of each. This means:
* You need to take care of state between the queues - if they are to share information. This has overhead. Often considerable overhead.
* You need to handle failures of queues dynamically. A single queue is easy to handle, but it is also a single point of failure and a performance bottleneck.
* You have very little knowledge of what kind of system is handling requests.
Three, nobody is discussing how to handle the overload situation. Forgetting about routing for a moment: what if your dynos can take 2000 req/s but the current arrival rate is 3000 req/s? How do you choose which requests to drop? Because drop them you must.
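One overload policy is to shed load at the front door instead of queueing indefinitely: anything that has already waited longer than a deadline gets a fast rejection rather than a slow timeout. A minimal sketch (the class name `ShedQueue` and the threshold are made up for illustration):

```ruby
# Bounded-wait queue: requests that have sat in the queue longer than
# max_wait_ms are dropped at dequeue time instead of being served late.
class ShedQueue
  def initialize(max_wait_ms)
    @max_wait_ms = max_wait_ms
    @queue = [] # entries are [enqueue_time_ms, request]
  end

  def push(request, now_ms)
    @queue << [now_ms, request]
  end

  # Returns the next request that is still fresh enough to serve,
  # silently shedding stale ones (a real server would answer 503 here).
  def pop(now_ms)
    while (entry = @queue.shift)
      enq_at, request = entry
      return request if now_ms - enq_at <= @max_wait_ms
    end
    nil
  end
end
```

Usage: with a 100ms deadline, a request popped after 50ms in the queue is served, while one popped after 200ms is shed.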
If you want to solve this going forward, you probably need dyno queue feedback. Rap Genius uses the length of the queue in their test, but this too is wrong. They should use the sojourn time spent in the queue, which indicates how long a request waits in the queue before being given service. According to Rap Genius, they have a distribution where requests usually take 46ms (median) but the maximum is above 2000ms. Roughly, then, a queue of length 43 (of median requests) and a queue of length 1 (of one worst-case request) have the same sojourn time. Given this, you can feed back to the routers how long a request will usually stay in queue.
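The arithmetic behind "queue length 43 ≈ queue length 1" is simple, using the numbers Rap Genius published (46ms median, ~2000ms worst case):

```ruby
median_ms = 46
slow_ms   = 2000

wait_behind_43_medians = 43 * median_ms  # => 1978ms
wait_behind_one_slow   = 1 * slow_ms     # => 2000ms
```

Two queues with wildly different lengths impose nearly the same wait, which is why queue length alone is a poor routing signal.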
But again, this assumes the routers are not distributed. The problem is much harder to solve when they are.
(edit for clarity in bullet list)
The right vs wrong issue here is not the proper way to architect a router. The issue is that Heroku glossed over an extremely important aspect of their engineering documentation, because it painted their platform in a bad light. This is particularly damning because, as an engineer working on their platform, I could have designed around their shortcomings had they not hidden them from me.
Furthermore, I believe one could argue that Heroku intentionally misled clients (both in documentation and in their support responses) as to how their router worked.
Personally, I think it is incredibly naive to build an application around a framework with no built-in concurrency. The main reason is that the queue that builds up in front of it is outside your reach, yet you are the one who has to sustain it.
It is also naive to think that your cooked-up statistical model resembles reality in any way. Routing is a hard problem, so it is entirely plausible that your model does not hold up in practice. Besides, the time it takes to construct those R models is a missed opportunity for improving the back-end you have. And the R models don't say a lot, sorry. At best they just stir up the storm --- and boy, did they succeed.
I agree it is unfortunate that Heroku's documentation isn't better and that New Relic doesn't provide accurate latency statistics. But to claim that this is entirely Heroku's fault is, frankly, naive as well.
I'm not discussing any cooked-up statistical models. I'm discussing the real-world experience I had scaling an application on Heroku.
"mud-slinging from one side upon the other"
You imply that I was uninvolved before Rap Genius's exposé. I assure you that is not the case: I chose a side in this argument well before Rap Genius went public.
The hard part of being a customer of Heroku is that the problem might be on the other side of the fence. And how do you communicate that in a diplomatic way?
Personally, my opinion is something along the lines of "if you use Ruby + Rails in that configuration, then you deserve the problem".
What I feel is also missing from most comments about the technical aspects is the effect of running more than one process on each node (which is possible with Rails using e.g. Unicorn). Being able to process more than one request at a time on each node should alleviate wait times and bottlenecks, even with a simple routing layer such as round-robin. It might not be the absolute optimum, but it could still provide a pretty good, or good enough, balance.
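For reference, multi-process serving with Unicorn is a few lines of configuration; this is a minimal sketch, and the worker count here is illustrative (in practice you tune it to the dyno's memory):

```ruby
# config/unicorn.rb -- serve several requests concurrently via forked workers
worker_processes 4   # illustrative; tune to available memory
timeout 30           # kill workers stuck on a request for 30s
preload_app true     # load the app once, then fork (saves memory via CoW)
```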
As a side note about the overload scenario you mentioned: it is very interesting to consider, but we can easily get dragged into denial-of-service territory, and designing against DoS is an even harder problem in my opinion (at least until you have handled 'normal' load).
You need queueing to a certain extent, since a queue will absorb spikes in load and smooth out the sudden arrival of a large number of requests in a short time. But excessive queueing leads to latency.
One of the problems with "intelligent" routing is that we have to consider more than queue length. We would need to know, prior to running it, how expensive a query is going to be; otherwise we may queue a request behind the expensive query while a neighboring dyno could have served it quickly. But you generally can't know this. Under load, such a system "amplifies" expensive queries: every query queued behind the expensive one becomes expensive as well.
This is why I would advise people to move to a model where concurrency happens "in the process" as well. It is actually easier to dequeue the work off the routing layer as fast as possible and then interleave expensive and cheap work in the dyno.
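A minimal sketch of that idea in Ruby, using a thread pool that drains an in-process queue so cheap work can overtake expensive work (the pool size and sleep duration are illustrative; with MRI's GVL this mainly helps IO-bound work):

```ruby
require "thread"

inbox   = Queue.new  # work dequeued off the routing layer as fast as possible
results = Queue.new

# A small pool of threads interleaves whatever is in the inbox.
workers = 4.times.map do
  Thread.new do
    while (job = inbox.pop)  # nil is the shutdown signal
      results << job.call
    end
  end
end

# One expensive job arrives first, then several cheap ones.
inbox << -> { sleep 0.2; :slow }
5.times { inbox << -> { :fast } }

# The cheap jobs complete without waiting 200ms behind the slow one.
collected = 6.times.map { results.pop }

workers.size.times { inbox << nil }  # shut the pool down
workers.each(&:join)
```

With a single run-to-completion process, all five cheap jobs would have waited out the full 200ms; here they finish almost immediately while the slow job completes last.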
(disclaimer: I do Erlang for a living and often operate in highly concurrent settings.)
Context switches are expensive. Very expensive. So your system pays overhead in the operating system, and you need to spawn a process per request.
Pooling processes is a problem as well. If we only have 4 workers in the pool, then we have to queue requests on one of the 4 workers. But we don't know, a priori, how expensive those requests are to serve. Even knowing the queue length or the queue sojourn time only helps a little; it won't divulge this information to us. More workers just push the problem further out.
If you want to be fast, you need:
* The ability to switch between work quickly, in the same process.
* The ability to interleave expensive work with cheap work.
The two main solutions are evented servers (Node.js, Twisted, Tornado --- the latter two in Python) and preemptive runtimes (Go, Haskell, Erlang, of which Erlang is the only truly preemptive one). I much prefer the preemptive solution because it is automatic and you don't have to code anything yourself.
There is a strong similarity with cooperative and preemptive multitasking in operating systems, by the way. Events are cooperative. Do note that no cooperative operating system you would use on a daily basis remains around :)
Spawn a worker pool if you want to serve more than one request at a time.
It's not "the RoR model". Rails is just an actor in the Rack space; how requests are fed to Rails is entirely up to the Rack adapter.
Rails itself is thread-safe and can handle requests concurrently with multiple threads; it can also run in a multi-process configuration.
The setup described in the article focuses on Thin, a web server that is primarily single-threaded. These days, many people use Unicorn, which uses a forking model to serve requests from concurrent processes.
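This separation is visible in the Rack interface itself: an application is just a callable returning a status/headers/body triplet, and the serving model (single-threaded Thin, forking Unicorn, multi-threaded servers) is chosen entirely outside it. A minimal illustration:

```ruby
# The smallest possible Rack application: any object responding to
# #call(env) with [status, headers, body]. Which server runs it --
# and with how much concurrency -- is decided by the Rack adapter,
# not by the application or framework.
app = lambda do |env|
  [200, { "Content-Type" => "text/plain" }, ["ok\n"]]
end
```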