The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).
1) Once per minute (or less often if you have a zillion dynos), each dyno tells the router the maximum number of requests it had queued at any time over the past minute.
2) Using that information, the router recalculates a threshold once a minute that defines how many queued requests is "too many" (e.g. maybe if you have n dynos, you take the log(n)th-busiest-dyno's load as the threshold -- you want the threshold to only catch the tail).
3) When each request is sent to a dyno, a header fields is added that tells the dyno the current 'too many' threshold.
4) If the receiving dyno has too many, it passes the request back to the router, telling the router that it's busy ( http://news.ycombinator.com/item?id=5217157 ). The 'busy' dyno remembers that the router thinks it is 'busy'. The next time its queue is empty, it tells the router "i'm not busy anymore" (and repeats this message once per minute until it receives another request, at which point it assumes the router 'heard').
5) When a receiving dyno tells the router that it is busy, the router remembers this and stops giving requests to that dyno until the dyno tells it that it is not busy anymore.
I haven't worked on stuff like this myself, do you think that would work?
I imagine the long tail disappears in a similar way that a traffic jam is prevented by lowering the speed limit.
There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first request times fails to respond after a relatively short period of time (say, 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically since it's very unlikely that the second request will hit the same overloaded server.
Even ignoring the POST requests problem (yup, it tried to replay those) properly cancelling a request on all levels of a multi-level rails stack is very hard/not possible in practice. So you end up DOSing the hard to scale lower levels of the stack (e.g. database) at the expense of the easy to scale LB.
ha-proxy is a lot better than nginx + more flexible if you want to introduce non-http to your stack.
Shouldn't the request be canceled on all levels if you cut the HTTP connection to the frontend?
Alternately, heroku can introduce a third layer between the mesh routers and the inbound random load balancer. This layer consistently hashes (http://en.wikipedia.org/wiki/Consistent_hashing) the api-key/primary key of your app, and sends you to a single mesh router for all of your requests. Mesh routers are/should be blazing fast relative to rails dynos, so that this isn't really a bottleneck for your app. Since the one mesh router can maintain connection state for your app, heroku can implement a least-conn strategy. If the mesh router dies, another router can be automatically chosen.
The 'tied request' idea from the Dean paper is neat, too, and Heroku could possibly implement that, and give dyno request-handlers the ability to check, "did I win the race to handle this, or can this request be dropped?"
Your solution doesn't work if requests aren't idempotent.
For mutating requests, there's a solution as well, but it involves checksumming the request and passing the checksum along so that the database layer knows to discard duplicate requests that it's already handled. You need this anyway if there's any sort of retry logic in your application, though.
Yeah, it's a lot more practical than implementing QoS, isn't it?
As for the intelligent routing, could you explain the problem? The goal isn't to predict which request will take a long time, the goal is to not give more work to dynos that already have work. Remember that in the "intelligent" model it's okay to have requests spend a little time in the global queue, a few ms mean across all requests, even when there are free dynos.
Isn't it as simple as just having the dynos pull jobs from the queue? The dynos waste a little time idle-spinning until the central queue hands them their next job, but that tax would be pretty small, right? Factor of two, tops? (Supposing that the time for the dyno-initiated give-me-work request is equal to the mean handling time of a request.) And if your central queue can only handle distributing to say 100 dynos, I can think of relatively simple workarounds that add another 10ms of lag every factor-of-100 growth, which would be a hell of a lot better than this naive routing.
What am I missing?
Your solution would likely work if you had some higher level (application level? not real up on Heroku) at which you could specify a push vs. pull mechanism for request routing.
Given that, according to TFA (and it's consistent with some other things I've read) Heroku's bread and butter is Rails apps, and given that, according to TFA, Rails is single-threaded, that (valid) point about concurrency in a single dyno is perhaps not that relevant? You'd think that Heroku would continue to support the routing model that almost all of their marketing and documentation advertises, right? Even if it's a configurable option, and it only works usefully with single-threaded servers?
And if you did do it pull-based, it wouldn't be Heroku's problem to decide how many concurrent requests to send. Leave it to the application (or whatever you call the thing you run on a dyno).
And it doesn't need to be pull-based, if the router can detect HTTP connections closing in dynos, or whatever.
But the idea of pull-based work distribution is pretty straightforward. It's called a message queue.
Animations and results are in the explanation at http://rapgenius.com/1502046
It's not an insurmountable problem by any measure, and it's definitely worth it.
I'm not sure this applies to the OP. His in-app measurements were showing all requests being handled very fast by the app itself; the variability in total response time was entirely due to the random routing.
Even if you work on narrowing the fat tails, shouldn't you still need to be upfront and clear about how adding a new dyno only gives you an increased chance of better request handling times as you scale?
The Golang runtime uses non-blocking I/O to get around this problem.
You could write a pthreads-compliant threading library without using threads at all, just epoll.