
So the issue here is two-fold:

- It's very hard to do 'intelligent routing' at scale.

- Random routing plays poorly with request times that have a really bad tail (median is 50 ms, 99th is 3 seconds).

The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).




I do perf work at Facebook, and over time I've become more and more convinced that the most crucial metric is the width of the latency histogram. Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny.


I can chime in here that I have had similar experiences at another large-scale place :). Some requests would take a second or more to complete, with the vast majority finishing in under 100 ms. A solution was put in place that added about 5 ms to the average request, but also crushed the long tail (it just doesn't even exist anymore), and everything is hugely more stable and responsive.


Let's assume that it is unacceptable to have each dyno tell the router each time it finishes a request ( http://news.ycombinator.com/item?id=5217157 ). And also that the goal is to reduce worst-case latency. And also that we don't know a priori how many requests each dyno should queue before it is 'full' and rejects further requests ( http://news.ycombinator.com/item?id=5216771 ).

Proposal:

1) Once per minute (or less often if you have a zillion dynos), each dyno tells the router the maximum number of requests it had queued at any time over the past minute.

2) Using that information, the router recalculates a threshold once a minute that defines how many queued requests is "too many" (e.g. maybe if you have n dynos, you take the log(n)th-busiest-dyno's load as the threshold -- you want the threshold to only catch the tail).

3) When each request is sent to a dyno, a header field is added that tells the dyno the current 'too many' threshold.

4) If the receiving dyno has too many, it passes the request back to the router, telling the router that it's busy ( http://news.ycombinator.com/item?id=5217157 ). The 'busy' dyno remembers that the router thinks it is 'busy'. The next time its queue is empty, it tells the router "i'm not busy anymore" (and repeats this message once per minute until it receives another request, at which point it assumes the router 'heard').

5) When a receiving dyno tells the router that it is busy, the router remembers this and stops giving requests to that dyno until the dyno tells it that it is not busy anymore.
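
Hand-waving a lot, the dyno side of steps 4 and 5 might look roughly like this Go sketch. The header name, the 503 signalling, and the busy bookkeeping are all invented for illustration, not anything Heroku actually exposes, and the periodic "I'm not busy anymore" message back to the router is omitted:

    package main

    import (
        "net/http"
        "strconv"
        "sync/atomic"
    )

    var queued int64 // requests currently queued/being handled on this dyno

    func handler(w http.ResponseWriter, r *http.Request) {
        depth := atomic.AddInt64(&queued, 1)
        defer atomic.AddInt64(&queued, -1)

        // Step 3: the router tells us the current 'too many' threshold.
        threshold, _ := strconv.ParseInt(r.Header.Get("X-Queue-Threshold"), 10, 64)

        // Step 4: if we're over it, bounce the request back to the router,
        // which should mark us busy and retry the request elsewhere.
        if threshold > 0 && depth > threshold {
            w.Header().Set("X-Dyno-Busy", "1")
            w.WriteHeader(http.StatusServiceUnavailable)
            return
        }

        // ... handle the request normally ...
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.HandleFunc("/", handler)
        http.ListenAndServe(":8080", nil)
    }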

I haven't worked on stuff like this myself; do you think that would work?


How was 5 ms added? Multiple sleep states per request?

I imagine the long tail disappears in a similar way that a traffic jam is prevented by lowering the speed limit.


I think you misunderstood: they optimized the long-running requests, and the optimization incurred a 5 ms performance loss for short requests. It is not that the additional 5 ms solved the problem.


I seem to recall Google mentioning on some blog several years ago that high variance in response latency degrades user experience much more than slightly higher average request times. I can't find the link though; if anyone has it, I'd be grateful.


Jeff Dean wrote a paper on it for CACM:

http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scal...

There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first fails to respond after a relatively short period of time (say, the 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically, since it's very unlikely that the second request will hit the same overloaded server.
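
Something like this Go sketch, assuming idempotent GETs (the backend URLs and the 95th-percentile hedge delay are placeholders, not anything Heroku-specific):

    package main

    import (
        "context"
        "io"
        "net/http"
        "time"
    )

    // hedgedGet fires the primary request, starts a backup against a second
    // backend if the primary hasn't answered within hedgeAfter, returns the
    // first successful response, and cancels whichever attempt is still in
    // flight.
    func hedgedGet(ctx context.Context, primary, backup string, hedgeAfter time.Duration) ([]byte, error) {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel() // kills the losing request when we return

        bodies := make(chan []byte, 2)
        errs := make(chan error, 2)

        fire := func(url string) {
            req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
            if err != nil {
                errs <- err
                return
            }
            resp, err := http.DefaultClient.Do(req)
            if err != nil {
                errs <- err
                return
            }
            defer resp.Body.Close()
            body, err := io.ReadAll(resp.Body)
            if err != nil {
                errs <- err
                return
            }
            bodies <- body
        }

        go fire(primary)
        hedge := time.AfterFunc(hedgeAfter, func() { fire(backup) }) // backup only if primary is slow
        defer hedge.Stop()

        var lastErr error
        for i := 0; i < 2; i++ {
            select {
            case body := <-bodies:
                return body, nil // first response wins
            case lastErr = <-errs:
                // one attempt failed; keep waiting for the other
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }
        return nil, lastErr
    }

    func main() {
        body, err := hedgedGet(context.Background(),
            "http://backend-1.internal/some/path", // placeholder backends
            "http://backend-2.internal/some/path",
            200*time.Millisecond) // stand-in for the 95th percentile latency
        if err != nil {
            panic(err)
        }
        _ = body
    }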


From experience, this is an incredibly effective way to DoS yourself. It was the default behaviour of nginx LB ages ago. Maybe only on EngineYard. Doesn't really matter as nobody uses nginx LB anymore.

Even ignoring the POST requests problem (yup, it tried to replay those), properly cancelling a request on all levels of a multi-level Rails stack is very hard/not possible in practice. So you end up DoSing the hard-to-scale lower levels of the stack (e.g. the database) at the expense of the easy-to-scale LB.


Nginx introduced least_conn lb method in 1.3.1 which makes it a bit better. http://nginx.org/en/docs/http/ngx_http_upstream_module.html#...

HAProxy is a lot better than nginx, and more flexible if you want to introduce non-HTTP traffic to your stack.

Shouldn't the request be canceled on all levels if you cut the HTTP connection to the frontend?


It's a capacity/latency tradeoff. Perhaps I'm biased by working at Google, where capacity is cheap and latency will kill you, but IIUC Heroku runs off of AWS and their database tier needs to scale horizontally anyway, so reserving sufficient overflow capacity should simply be a matter of throwing money at the problem.


Right now, Heroku has one inbound load balancer that's out of their control (probably ELB(s)). This load balancer hits another layer of mesh routers that Heroku does control, and that perform all of Heroku's magic. In order for "intelligent routing", more commonly known as "least-conn" routing, to work amongst the mesh layer, all of the mesh routers would have to share state with each other in real time, which makes this a hard problem.

Alternatively, Heroku can introduce a third layer between the mesh routers and the inbound random load balancer. This layer consistently hashes (http://en.wikipedia.org/wiki/Consistent_hashing) the api-key/primary key of your app and sends you to a single mesh router for all of your requests. Mesh routers are/should be blazing fast relative to Rails dynos, so this isn't really a bottleneck for your app. Since the one mesh router can maintain connection state for your app, Heroku can implement a least-conn strategy. If the mesh router dies, another router can be automatically chosen.
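
A toy version of that consistent-hashing layer in Go, just to make the idea concrete (router names, the replica count, and the use of FNV are arbitrary choices here, not a claim about how Heroku's mesh actually works):

    package main

    import (
        "fmt"
        "hash/fnv"
        "sort"
    )

    type Ring struct {
        hashes  []uint32          // sorted positions on the ring
        routers map[uint32]string // position -> mesh router name
    }

    func hashKey(s string) uint32 {
        h := fnv.New32a()
        h.Write([]byte(s))
        return h.Sum32()
    }

    func NewRing(routers []string, replicas int) *Ring {
        r := &Ring{routers: make(map[uint32]string)}
        for _, name := range routers {
            for i := 0; i < replicas; i++ { // virtual nodes smooth out the distribution
                h := hashKey(fmt.Sprintf("%s#%d", name, i))
                r.hashes = append(r.hashes, h)
                r.routers[h] = name
            }
        }
        sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
        return r
    }

    // Lookup maps an app's key to the one mesh router that owns it, so that
    // router alone can keep per-dyno connection state for the app.
    func (r *Ring) Lookup(appKey string) string {
        h := hashKey(appKey)
        i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
        if i == len(r.hashes) {
            i = 0 // wrap around the ring
        }
        return r.routers[r.hashes[i]]
    }

    func main() {
        ring := NewRing([]string{"mesh-1", "mesh-2", "mesh-3"}, 100)
        fmt.Println(ring.Lookup("some-app-api-key")) // same key -> same router, every time
    }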


A relatively easy fix, for read-only or idempotent requests. Also, if long-tail latency requests wind up being run twice, this technique might accelerate tip-over saturation. Still, this 'hedged request' idea is good to keep in mind, thanks for the pointer.

The 'tied request' idea from the Dean paper is neat, too, and Heroku could possibly implement that, and give dyno request-handlers the ability to check, "did I win the race to handle this, or can this request be dropped?"


> There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first fails to respond after a relatively short period of time (say, the 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically, since it's very unlikely that the second request will hit the same overloaded server.

Your solution doesn't work if requests aren't idempotent.


Yeah, I know. I figure that for incoming HTTP traffic it's relatively easy to balance the GET requests, and if they're doing anything remotely sane with HTTP those ought to be idempotent (if they're not, Googlebot will come along and delete their site ;-)).

For mutating requests, there's a solution as well, but it involves checksumming the request and passing the checksum along so that the database layer knows to discard duplicate requests that it's already handled. You need this anyway if there's any sort of retry logic in your application, though.
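
Roughly like this, in Go (the in-memory map is a stand-in for whatever the database layer would really use, and failure/rollback handling is ignored entirely):

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "sync"
    )

    // Deduper remembers which request bodies it has already applied, keyed
    // by checksum, so retried or duplicated mutations become no-ops.
    type Deduper struct {
        mu   sync.Mutex
        seen map[string]bool
    }

    func NewDeduper() *Deduper { return &Deduper{seen: make(map[string]bool)} }

    func (d *Deduper) Apply(body []byte, mutate func() error) error {
        sum := sha256.Sum256(body)
        key := hex.EncodeToString(sum[:])

        d.mu.Lock()
        already := d.seen[key]
        d.seen[key] = true
        d.mu.Unlock()

        if already {
            return nil // this exact request was handled before; drop the duplicate
        }
        return mutate()
    }

    func main() {
        d := NewDeduper()
        req := []byte(`{"action":"charge","amount":100}`)
        d.Apply(req, func() error { fmt.Println("applied"); return nil })
        d.Apply(req, func() error { fmt.Println("applied"); return nil }) // duplicate: skipped
    }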


Or use a parameter that's effectively a nonce.


This is something I've read in networked game literature: players react far better to consistent and high latency than to inconsistent and low latency, even if the averages are lower in the latter case. (It might even have been a John Carmack article).



That's probably true, but the value that Heroku is selling (and they charge a lot for it!) is that you _don't_ need to deal with this - that they will balance that out for you.


Correct, this matches my observations as well. I'd trade an increase in mean latency for a decrease in worst-case latency anytime. It makes it so much easier to reason about how many resources are needed for a given workload when your latency is bounded.


> Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny.

Yeah, it's a lot more practical than implementing QoS, isn't it?


Re the distribution, absolutely. That "FIFTY TIMES" is totally due to the width of the distribution. Although, you know, even if their app was written such that every single request took exactly 100ms of dyno time, this random routing would create the problem all over again, to some degree.

As for the intelligent routing, could you explain the problem? The goal isn't to predict which request will take a long time, the goal is to not give more work to dynos that already have work. Remember that in the "intelligent" model it's okay to have requests spend a little time in the global queue, a few ms mean across all requests, even when there are free dynos.

Isn't it as simple as just having the dynos pull jobs from the queue? The dynos waste a little time idle-spinning until the central queue hands them their next job, but that tax would be pretty small, right? Factor of two, tops? (Supposing that the time for the dyno-initiated give-me-work request is equal to the mean handling time of a request.) And if your central queue can only handle distributing to say 100 dynos, I can think of relatively simple workarounds that add another 10ms of lag every factor-of-100 growth, which would be a hell of a lot better than this naive routing.

What am I missing?


I think the problem is that any servers which can handle concurrent requests now need to decide how many requests they can handle. Since most application servers seem to have concurrency values of "1, ever" or "I dunno, lots" this is a hard problem.

Your solution would likely work if you had some higher level (application level? not real up on Heroku) at which you could specify a push vs. pull mechanism for request routing.


Yeah, I dunno squit about Heroku either.

Given that, according to TFA (and it's consistent with some other things I've read) Heroku's bread and butter is Rails apps, and given that, according to TFA, Rails is single-threaded, that (valid) point about concurrency in a single dyno is perhaps not that relevant? You'd think that Heroku would continue to support the routing model that almost all of their marketing and documentation advertises, right? Even if it's a configurable option, and it only works usefully with single-threaded servers?

And if you did do it pull-based, it wouldn't be Heroku's problem to decide how many concurrent requests to send. Leave it to the application (or whatever you call the thing you run on a dyno).

And it doesn't need to be pull-based, if the router can detect HTTP connections closing in dynos, or whatever.

But the idea of pull-based work distribution is pretty straightforward. It's called a message queue.
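
In miniature, with goroutines standing in for dynos and a channel standing in for the message queue (obviously not how Heroku's mesh is built, just the shape of the idea):

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func dyno(id int, queue <-chan int, wg *sync.WaitGroup) {
        defer wg.Done()
        for req := range queue {
            // Simulated work: a slow request only ties up this dyno; nothing
            // sitting in the shared queue is stuck behind it.
            time.Sleep(time.Duration(req%7) * 10 * time.Millisecond)
            fmt.Printf("dyno %d finished request %d\n", id, req)
        }
    }

    func main() {
        queue := make(chan int) // unbuffered: work is handed over only when a dyno asks for it
        var wg sync.WaitGroup

        for id := 0; id < 4; id++ {
            wg.Add(1)
            go dyno(id, queue, &wg)
        }

        for req := 0; req < 20; req++ {
            queue <- req // the "router" just enqueues; whichever free dyno pulls first gets it
        }
        close(queue)
        wg.Wait()
    }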


Simulation author here with some additional analysis using a faster distribution of request times. If you use a distribution with median 50 ms, 90th percentile 225 ms, and 99.9th percentile 898 ms, then you need 30 intelligent dynos to handle 9000 requests/minute without queueing. In the same scenario with 30 naive dynos, 44% of requests get queued.

Animations and results are in the explanation at http://rapgenius.com/1502046
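
Not the author's code, but here's a toy Go version of the same comparison, with 1% of requests taking 1-3 s and the rest 10-100 ms (all numbers illustrative). Random routing queues each request behind a blindly chosen dyno, while the shared queue hands it to whichever dyno frees up first:

    package main

    import (
        "fmt"
        "math/rand"
    )

    func serviceTime(r *rand.Rand) float64 {
        if r.Float64() < 0.01 {
            return 1000 + r.Float64()*2000 // the slow 1%: 1-3 s
        }
        return 10 + r.Float64()*90 // everything else: 10-100 ms
    }

    func maxf(a, b float64) float64 {
        if a > b {
            return a
        }
        return b
    }

    func main() {
        r := rand.New(rand.NewSource(1))
        const dynos = 30
        const requests = 9000
        const gap = 60000.0 / requests // ms between arrivals over one minute

        randomFree := make([]float64, dynos) // when each dyno frees up, random routing
        sharedFree := make([]float64, dynos) // same, shared-queue routing
        var randomWait, sharedWait float64

        for i := 0; i < requests; i++ {
            arrive := float64(i) * gap
            svc := serviceTime(r)

            // Random routing: pick a dyno blindly and queue behind whatever it has.
            d := r.Intn(dynos)
            start := maxf(arrive, randomFree[d])
            randomWait += start - arrive
            randomFree[d] = start + svc

            // Shared queue: the request goes to whichever dyno frees up first.
            best := 0
            for j := 1; j < dynos; j++ {
                if sharedFree[j] < sharedFree[best] {
                    best = j
                }
            }
            start = maxf(arrive, sharedFree[best])
            sharedWait += start - arrive
            sharedFree[best] = start + svc
        }

        fmt.Printf("mean queue time, random routing: %.1f ms\n", randomWait/requests)
        fmt.Printf("mean queue time, shared queue:   %.1f ms\n", sharedWait/requests)
    }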


Yes, it is very hard to do it at scale, but so what? I mean, isn't the whole premise of their company to do intelligent things at scale so you don't have to?

It's not an insurmountable problem by any measure, and it's definitely worth it.


> The solution here is to figure out why your 99th is 3 seconds.

I'm not sure this applies to the OP. His in-app measurements were showing all requests being handled very fast by the app itself; the variability in total response time was entirely due to the random routing.


Wouldn't the bad tail of random routing be an unpredictably long length of time, since long-running request times are unpredictable?

Even if you work on narrowing the fat tail, shouldn't you still be upfront and clear about how adding a new dyno only gives you an increased chance of better request-handling times as you scale?


That isn't the only problem with random routing - the problems aren't as pronounced with uniform response speeds, but you still get a significant difference in net effective queue time, especially if you're operating close to your throughput limit.


> The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).

The Golang runtime uses non-blocking I/O to get around this problem.


Wow, it's annoying to get modded down for saying something correct. Don't believe me? Run a golang program with strace -f sometime.

You could write a pthreads-compliant threading library without using threads at all, just epoll.



