OP is a friend of mine, and when I first heard of his problem I wondered if there might be an analytical solution to quantify the difference between intelligent vs naive routing. I took this problem as an opportunity to teach myself a bit of Queueing Theory, which is a fascinating topic! I'm still very much a beginner, so bear with me and I'd love to get any feedback or suggestions for further study.
For this example, let's assume our queueing environment is a grocery store checkout line: our customers enter, line up in order, and are checked out by one or more registers. The basic way to think about these problems is to classify them across three parameters:
- arrival time: do customers enter the line in a way that is Deterministic (events happen over fixed intervals), randoM (events are distributed exponentially and described by Poisson process), or General (events fall from an arbitrary probability distribution)?
- checkout time: same question for customers getting checked out, is that process D or M or G?
- N = # of registers
So the simplest example would be D/D/1, where - for example - every 3 seconds a customer enters the line and every 1.5 seconds a customer is checked out by a single register. Not very exciting. At a higher level of complexity, M/M/1, we have a single register where customers arrive at rate _L and are checked out at rate _U (in units of # per time interval), where both _L and _U obey Poisson distributions. (You can also model this as an infinite Markov chain where your current node is the # of people in the queue, you transition to a higher node with rate _L and to a lower node with rate _U.) For this system, a customer's average total time spent in the queue is 1/(_U - _L) - 1/_U.
The intelligent routing system routes each customer to the next available checkout counter; equivalently, each checkout counter grabs the first person in line as soon as it frees up. So we have a system of type M/G/R, where our checkout time is Generally distributed and we have R>1 servers. Unfortunately, this type of problem is analytically intractable, as of now. There are approximations for waiting times, but they depend on all sorts of thorny higher moments of the general distribution of checkout times. But if instead we assume the checkout times are randomly distributed, we have a M/M/R system. In this system, the total time spent in queue per customer is C(R, _L/_U)/(R_U - _L), where C(a,b) is an involved function called the Erlang C formula .
How can we use our framework to analyze the naive routing system? I think the naive system is equivalent to an M/M/1 case with arrival rate _L_dumb = _L/R. The insight here is that in a system where customers are instantaneously and randomly assigned to one of R registers, each register should have the same queue characteristics and wait times as the system as a whole. And each register has an arrival rate of 1/R times the global arrival rate. So our average queue time per customer in the dumb routing system is 1/(_U - _L/R) - 1/_U.
In OP's example, we have on average 9000 customers arriving per minute, or _L = 150 customers/second. Our mean checkout time is 306ms, or _U ~= 3. Evaluating for different R values gives the following queue times (in ms):
which are reasonably close to the simulated values. In fact, we would expect the dumb router to be comparatively even worse for the longer-tailed Weibull distribution they use to model request times, because you make bad outcomes (e.g. where two consecutive requests at 99% request times are routed to the same register) even more costly. This observation seems to agree with some of the comments as well .
As someone building a Heroku Cedar-esque PaaS, here's the problem with your analogy: back in Aspen/Bamboo days (and what a lot of people still think of as "PaaS"), Heroku was like this (i.e your app was the one-a-time cashier, and Heroku's "routing mesh" setup checkout lanes and routed customers to your cashiers intelligently).
Now however, Heroku lets you build your own checkout lane, so you can run apps with single-response thread Rails, multi-response thread(e.g. unicorn) Rails, and async long-polling/SSE etc apps w/ ruby/node.js/scala/go/erlang/etc that can handle huge numbers of simultaneous connections. Throw websockets into the mix here too (we do). And you can even mix & match within an app, distributing requests to different stacks of code based on URL or the time of day, which may have different internal response/queuing characteristics (e.g. we have a Rails app w/ a Grape API w/ a handful of URLs mapped in via Rack::Stream middleware rather than going through Rails).
So to get back to your analogy, Heroku is automating the setup of the "lanes", but each supermarket is allowed to use its own blueprint and cashier and checkout process, and basically just do whatever they want within a lane. Maybe some "lanes" are more like a restaurant where 25 customers spend an average of 45 minutes at a time juggled between 6 waiters while others are still bottlenecked supermarket checkouts, with everything in between. Maybe one type of customer ties up the cashier/waiter so much that he can only handle 10 others instead of 100 normally. And it could all change every time the store opens (a deployment with new code occurs), or based on what the specific customers are buying.
The point is simply that there's not a "next available checkout counter" in this situation, because all apps are not single-threaded Rails apps anymore. Which doesn't mean there aren't better solutions than dumb routing, but it does get a bit more complicated than the supermarket checkout.
We're discussing Rails on Heroku specifically which, non-unicorn, should be a "next available checkout counter" situation. Ideally it should be possible to make this an optional behavior that you can choose to turn on for Rails apps.
I agree there should be a better way - it's just important to understand than Rails doesn't get any special treatment on a PaaS done correctly, so it's important to come up with a generic solution.
I think part of the solution would be customizable option(i.e.. how many requests can each dyno handle simultaneously), probably combined with intelligently monitoring/balancing proxy load so new requests always go to the least-loaded dyno.
Buildpacks could probably be used to parse of Gemfile/etc, see if you're using what mix of webrick/unicorn/rails/sinatra/rack-stream/goliath etc, and set an semi-intelligent default. But apps are increasingly unlike a checkout line. Apps are more like the supermarket, which is harder.
Rails doesn't need to be treated specially. All that is needed is a "maximum number of simultaneous connections to pass to this backend" setting coupled with load balancing by available slots rather than purely randomly.
The issue here isn't that Rails needs to be treated specially - this problem applies to various extent in any type of backend where some types of requests might turn out to be computationally heavy or require lots of IO. You can't magic away that: A request that takes 8 CPU seconds will take 8 CPU seconds. If you start piling more requests onto that server, response times will increase, even if some will keep responding, and if another 8 CPU second request hits too soon, chances increase that a third one will, and a fourth, and before you know it you might have a pileup where available resources for new requests on a specific instance are rapidly diminishing and response times shoot through the roof.
Pure random distribution is horrible for that reason pretty much regardless.
Now, doing "intelligent" routing is a lot easier for servers with some concurrency, as you can "just" have check requests and measure latency for the response and pick servers based on current low latency and get 90% there and that will be enough for most applications. Sure, the lower the concurrency, the more you risk having multiple heavy queries hit the same server and slow things down, and this request grows dramatically with the number of load balancers randomly receiving inbound requests to pass on, but at least you escape the total pileup more often.
But that's also a clue to one possible approach for non-concurrent servers: group them into buckets handled by a single active load balancer at a time and have front ends that identifies the right second layer load balancers. Shared state is now reduced to having the front end load balancers know which second layer load balancers are the currently active ones for each type of backend. It costs you an extra load balancer layer with according overhead. But don't you think OP would prefer an extra 10ms per request over the behaviour he's seen?
I'm sure OP could prefer the extra 10ms, but then everyone else who can deal with random dispatching right now has to pay a 10ms penalty because OP built his stuff on a technology that can deal with only one request at a time on a server, which boggles the mind to begin with.
Why? The system could easily be built so that it by default only aggregates those services where the configuration indicates they can handle a concurrency below a certain level, and does random balancing of everything else.
The "everyone else who can deal with random dispatching right now" is a much smaller group than you think. Anyone who has long running requests that grind the CPU or disk when running, will be at high risk of seeing horribly nasty effects from random dispatching, no matter whether their stack in ideal conditions have no problem handling concurrent requests.
It's just less immediately apparent, as any dynos that start aggregating multiple long running requests will "just" get slower and slower instead of blocking normally low-latency requests totally.
"The system could easily be built so that it by default only aggregates those services where the configuration indicates they can handle a concurrency below a certain level, and does random balancing of everything else."
I've built fairly large haproxy based infrastructures, thank you very much. Doing this is not particularly challenging.
Actually what I'd probably do for a setup like this would be to balance by the Host: header, and simply have the second layer be a suitable set of haproxy instances balancing each by least connections.
Haproxy doesn't support dynamic configurations as far as I know, which is a serious problem if you're letting lots of people add/change domains and scale backends up/down dynamically. A Heroku haproxy would probably need to be restarted multiple times a second due to config changes. Nginx can do dynamic backends with lua & redis, but it can't use the built-in upstream backend balancing/failover logic if you do.
While it doesn't support dynamic configurations, it does support hot reconfiguration (the new daemon signals the old processes to gracefully finish up and shut down), and reconfigures very rapidly. You still don't want to restart it multiple times a second, but you don't need to:
A two layer approach largely prevents this from being a problem. You can afford total overkill in terms of the number of haproxies as they're so lightweight - running a few hundred individual haproxy instances with separate configs even on a single box is no big deal.
The primaries would rarely need to change configs. You can route sets customers to specific sets of second layer backends with ACL's on short substrings of the hostname (e.g. two letter combinations), so that you know which set of backends each hostname you handle maps to, and then further balance on the full host header within that set to enable the second layer to balance on least-connections to get the desired effect.
That lets you "just" rewrite the configs and hot-reconfigure the subset of second layer proxies handling customers that falls in the same set on modifications. If your customer set is large enough, you "just" break out the frontend into a larger number of backends.
Frankly, part of the beauty of haproxy is that it is so light that you could probably afford a third layer - a static primary layer grouping customers into buckets, a dynamic second layer routing individual hostnames (requiring reconfiguration when adding/removing customers in that bucket) to a third layer of individual customer-specific haproxies.
So while you would restart some haproxy multiple times a second, the restarts could trivially be spread out over a large pool of individual instances.
Alternatively, "throwing together" a second or third layer using iptables either directly or via keepalived - which does let you do dynamic reconfiguration trivially, and also supportes least-connections load balancing - is also fairly easy.
But my point was not to advocate this as the best solution for somewhere like Heroku - it doesn't take a very large setup before a custom solution starts to pay off.
My point was merely that even with an off the shelf solution like haproxy, throwing together a workable solution that beats random balancing is not all that hard - there's a large number of viable solutions -, so there really is no excuse not to for someone building a PaaS.
> it's just important to understand than Rails doesn't get any special treatment on a PaaS done correctly, so it's important to come up with a generic solution.
Its kind of weird to describe not optimizing the entire platform provided to apps as "PaaS done correctly". Making a PaaS more generic has a certain kind of value in terms of broadening the audience and enabling heterogenous systems to be implemented on it, but if you are doing that by sacrificing the optimization of the individual application platforms available, you are losing some of what makes a PaaS valuable as opposed to roll-your-own platform support on top of a generic IaaS.
Its especially problematic to say that worsening support for the main existing app framework in use on an establish PaaS and giving existing customers orders of magnitude less value for their money is doing something right.
> I think part of the solution would be customizable option
That's probably a good idea, though the default for existing apps should not have changed, especially without clear up-front notice.
> But apps are increasingly unlike a checkout line.
Existing apps are, for the most part, exactly as much like a checkout line as they were before the unannounced change.
Because PaaS is a generic technology for running and scaling applications with a multitude of different language/framework/stacks, and many/most of those apps do not share the specific properties of single-threaded Rails (including many Ruby/Rails apps!)
And Rails 4 is going to bake-in "live streaming", making single-threaded app servers even more of an edge case.
It's like saying EC2 should tailor its virtualization to Fedora 16, or Mac OS X should tailor its windowing system to Photoshop CS4, or Apache should tailor mod_proxy to Joomla. There may be specific attributes of popular applications that need to be adapted to, but those adaptations need to be built in a generic way and exposed through a standard API.
Since even many Rails apps now do not follow a single threaded request-response model, that model of running a web application needs to be considered as one case of many, and building a platform that supports many/all use-cases as well as possible is more complicated than building a platform that fits one use case like a leather glove.
> Mac OS X should tailor its windowing system to Photoshop CS4
I think statements like these obscure away the very tight coupling Heroku has historically had with Rails. While certainly Heroku now perhaps envisions itself as a do-it-all PaaS, there's no denying Rails at one point (and, numerically, perhaps still) was their bread and butter.
While I don't have numbers to support or refute the assertion that "most Rails apps are primarily single threaded", my suspicion is that this is in fact still the case.
> or Mac OS X should tailor its windowing system to Photoshop CS4
I'm taking this example out specifically, because it illustrates my point quite nicely.
If there were a sizeable community of people who only wanted to use a computer for Photoshop, and tailoring the windowing system to them made it a significant usability improvement for those people, then it would be a completely imaginable situation that upon first opening your brand new Mac, it'd ask you whether you're one of those Photoshop people and want the special windowing system setup.
Well, ok, haha Apple and customizing anything for anyone, ever. But many other vendors might make such a choice.
The apparent stubborn refusal of many PaaS services, including Heroku and yours, to particularly tailor to a very common configuration of Rails sounds like a hole in the market, to me. As a customer, I don't care whether this is "incorrect" because Rails does not conform to some yet-to-be-defined standard or simply because the PaaS doesn't have their shit together. The customer experience is the same: I'm running a blog tutorial app, and the performance sucks.
The 'P' stands for platform. Providing the platform as a service absolutely means catering to the specific needs of the platform. There is no generic platform. If you want to support multiple platforms, then you support multiple platforms. You don't stop supporting any platform at all.
This whole we can't optimize for Rails anymore seems like a red herring. Dumb (random) routing is dumb routing. It doesn't matter if you have single threaded Rails or Django stack or highly concurrent Node.js or Erlang serving requests, if you distribute the requests randomly you're not going to efficiently use your resources and the Heroku "promise" of spinning up a 2nd dyno and getting 100% more concurrency is just not true.
All it changes is the details of the analysis, not the core finding. It makes the problem not as worse, but it's still pretty bad, and it's worse the more uneven your traffic is (in terms of how long each request takes to service).
All apps, even Unicorn, JRuby, Node.js, Erlang, etc. would benefit from something better than random routing.
It doesn't yet, but that doesn't really change the situation if you've already got long-polling/SSE. I mentioned it because we support it and it seems like a big part of the model the the web is moving towards (which is significantly less request-response based).
Just a little thing: why can't we just say "servers" and "requests" instead of "registers" and "customers", because stores don't get 9000 customers per minute and don't have 500 registers. Everybody here would understand servers and requests.
Distributed load balancing is a tough problem with two pieces to it. One is the queueing theory part.
The other is the systems side to it. If you have multiple customers and multiple checkout lines, and if your customers act independently without seeing the lines (no feedback from servers, network failures and delays, implementation complexity), what do you do?
It isn't a trivial problem. The easy route is paying Cisco's load balancers millions of dollars, but those only scale so far.
The bigger internet companies spend years of development time trying to make distributed load balancing work, but the issues there are a bit more complicated than a few customers walking to checkout lines.
You are certainly right that it becomes an M/G/1 model and thus that M/M/1 and M/D/1 will give reasonable approximations. The reason that it's an M/?/1 model is partly what you're saying, but also partly because the random assignment of the incoming requests acts as a random filter on a Poisson process. Poisson processes are just defined by not having history, and the random filter is not history-dependent either, so the output from the filter -- the input to any node -- is still a Poisson process.
What's interesting to me here is: Suppose rather than doing this with a random variable, you do it with a summing filter, a simple counter:
on request r:
n = (n + 1) % node.length
That's a really simple solution but it should tend to average out the Poisson input, which gives you something closer to a D/G/1 problem.
Queuing theory is cool, but I'm not 100% sure it actually applies here in a meaningful sense (although, disclaimer: I'm no more experienced here than you). A lot of queuing theory assumes that you must route a request to a handler immediately as you receive it, and that reassigning a request is a very expensive process.
This intuitively explains why queueing theory is very big in router design - imagine that you send a packet through 10 hops and at the last hop experiences a significant delay, does the packet then turn around and go back through another router? Which hop does it pick to look for a different path through? What happens if the packet gets delayed going backwards looking for another route? Does it reverse _again_ looking for a quicker route to another route? Answer: it doesn't, routers deliver messages with the "best effort" (in protocol terminology) they can, and high level latency trends are adapted for through the routing algorithms themselves (read: not adapted for each individual packet). This keeps transport much simpler and therefore faster.
In the case of load balancing, if the "(re)assignment cost" (my terminology) of a request is sufficiently small, then it doesn't make sense to pre-distribute requests until you can be 100% sure a worker is ready. If a request takes 40ms to process and 0.5ms to distribute/assign to a worker, then waiting for feedback (which would also take 0.5ms) from a worker would incur a slowdown of (40 + 0.5 + 0.5)/40 versus if you pre-assigned before a worker was finished. This seems like a no-brainer if it would keep the width of the distribution of your latencies down.
Edit: thinking about this more, if you have an asynchronous worker model, Queueing theory comes back into play. If a worker stops processing Request A to wait for a network response and takes up Request B, and then the network responds while Request B is still working, moving Request A mid-handling to another free machine may be very hard/expensive, if not entirely impossible. As themgt brings up, it sounds like Heroku enabled an asynchronous model in a recent stack and may have dropped the feedback loop that allows for intelligent routing because there's no obvious way to implement it in an asychronous model.
That being said, you could still have a feedback loop when a worker is completely idle. It's certainly very hard to reason about a worker currently executing any task, but it is very easy to reason about a worker that isn't executing anything. Therefore, it should be straightforward (theoretically) to do intelligent routing to empty workers and then keep the random routing for busy workers in such a way that if there is an idle worker no request should ever go to a busy worker. A more advanced version would keep track of the number of open connections to each worker and simply distribute to the worker with the fewest number of active open connections.
Yeah, you want leastconn + app/router affinity. Affinity is the statement that all of your requests for an app go through one router (to avoid distributed connection state tracking).
In the past, I've accomplished this by having the next layer up consistent hash the api keys onto the router list. If you don't control the top layer (ELB), you need to add a dumb layer just for the hashing.
Normally I would just downvote you and move on, but in this case your comment is frustrating enough that I have to say something. I found the comment you responded to (by nsrivast) quite fascinating. A well-written but brief analysis of the problem, with sources attached for further reading -- what's not to like? In-depth and thoughtful comments like that are what keep me coming back to this site, and are what make the community great.
I for one am very thankful that nsrivast took the time to write something so technical and detailed. However, I found your response to be in extremely poor taste. It added nothing to the conversation, and IMHO was rude and unnecessary.
Yes. Really. Some people spend their time writing lengthy and technical posts about specific technical issues. The rest of the world benefits from this. And the person writing the post benefits too, because trying to write something like that makes you smarter. Perhaps you should try it sometime.
I'm inclined to wait until Heroku weighs in to render judgement. Specifically, because their argument depends on this premise:
> But elsewhere in their current docs, they make the same old statement loud and clear:
> The heroku.com stack only supports single threaded requests. Even if your applicaExplaintion were to fork and support handling multiple requests at once, the routing mesh will never serve more than a single request to a dyno at a time.
They pull this from Heroku's documentation on the Bamboo stack , but then extrapolate and say it also applies to Heroku's Cedar stack.
However, I don't believe this to be true. Recently, I wrote a brief tutorial on implementing Google Apps' openID into your Rails app.
The underlying problem with doing so on a free (single-dyno) Heroku app is that while your app makes an authentication request to Google, Google turns around and makes a "oh hey" request to your app. With a single-concurrency system, Google your app times out waiting for Google to get back to you and Google won't get back to you until your app gets back to you so hey deadlock.
However, there is a work-around on the Cedar stack: configure the unicorn server to supply 4 or so worker processes for your web server, and the Heroku routing mesh appropriately routes multiple concurrent requests to Unicorn/my app. This immediately fixed my deadlock problem. I have code and more details in a blog post I wrote recently. 
This seems to be confirmed by Heroku's documentation on dynos :
> Multi-threaded or event-driven environments like Java, Unicorn, and Node.js can handle many concurrent requests. Load testing these applications is the only realistic way to determine request throughput.
I might be missing something really obvious here, but to summarize: their premise is that Heroku only supports single-threaded requests, which is true on the legacy Bamboo stack but I don't believe to be true on Cedar, which they consider their "canonical" stack and where I have been hosting Rails apps for quite a while.
Yeah, the only real question is whether or not it's true that they no longer do intelligent routing. If that is the case, then regardless of anything else the problem exists once you pass a certain scale/request cost. It won't matter if that one dyno can handle hundreds of requests at once, it will still queue stupidly.
One of the hidden costs of Unicorn is spin-up time. Unicorn takes a long time to start, then fork. We would get a ton of request timeouts during this period. Switching back to Thin, we never got timeouts during deploys - even under very heavy load.
Maybe this is a stupid question, but with unicorn it forks the request and can process multiple requests at the same time. Previously it seems that only one request could be handled by the dyno so requests had to queue on the dynamic routing layer but with multiple request support with unicorn or whatever, wouldn't it be more efficient to dump all the requests to dynos? Followup question, also how would intelligent routing work if it just previously checked to see if which dyno had no requests? That seems like an easy thing to do, now you would have to check CPU/IO whatever and route based on load. Not specifically targeted at you but to everyone reading the thread.
> Previously it seems that only one request could be handled by the dyno so requests had to queue on the dynamic routing layer but with multiple request support with unicorn or whatever, wouldn't it be more efficient to dump all the requests to dynos?
It would be if all requests were equal. If all your requests always take 100ms, spreading them equally would work fine.
But consider if one of them takes longer. Doesn't have to be much, but the effect will be much more severe if you e.g. have a request that grinds the disk for a few seconds.
Even if each dyno can handle more than one requests, since those requests share resources, if some of them slows down due to some long running request, response times for the other requests are likely to increase, and as response times increase, it's queue is likely to increase further, and it gets more likely to pile up more long running requests.
> Followup question, also how would intelligent routing work if it just previously checked to see if which dyno had no requests? That seems like an easy thing to do, now you would have to check CPU/IO whatever and route based on load. Not specifically targeted at you but to everyone reading the thread.
There is no perfect answer. Just routing by least connections is one option. it will hurt some queries that will end up being piled up on servers processing a heavy request in high load situations, but pretty soon any heavily loaded servers will have enough connections all the time that most new requests will go to lighter loaded servers.
Adding "buckets" of servers for different types of requests is one option to improve it, if you can easily tell by url which requests will be slow.
That gets pretty unlikely, especially if you have many dynos and a low frequency of slow requests. The main reason unicorn can drastically reduce queue times here is that it does not use random routing internally.
I don't have a stake in the Ruby webserver wars, but the unicorn site has a very good discussion about how it works internally, why it's build as it is, pros & cons etc..
This seems to be missing from most of these project sites, which are often just marketing (look! It's better!!), and therefore not very trustworthy.
From the outside it looks like the biggest differentiator in each generation of ruby servers (and, I guess, db managment systems :) is not that the new is better or worse, but simply that has different trade-offs.
You seem have written one of the very few articles I've seen benchmarking this. I'd love to know more about how Puma compares to Unicorn (especially unicorn configurations mentioned by some in this conversation) and Thin for serving rails on Heroku. Many of the Unicorners pushing their solution don't appear to be aware of Puma and its potential benefits. I'm curious if Puma with MRI has benefits, too. Thanks!
To amplify, I do a ton of queue adjustment work with Unicorn at the backlog level. It is so frequent that we set up Unicorn startup to read from an ENV variable on Heroku that we set as needed.
With two Unicorn workers we found that 25 was the best backlog threshold to accept (it refuses additional requests). When we were able to go to 5 Unicorn workers on Heroku we had to start to adjust that.
Here's a gist of our unicorn.rb config (https://gist.github.com/apangeajwrubel/4953849)
Using env variables lets us adjust parameters without a code push (still requires a restart). We saw dramatic reduction in 'long tail' requests when we dropped the backlog to 25. We're experimenting now (thanks to @scottshea) with even lower values. At some point the routing mesh will give up retrying and throw an H21 (https://devcenter.heroku.com/articles/error-codes#h21-backen...). One data point that would be useful from heroku is how many retries we get.
You have to remove the port declaration from the line for Unicorn in your Procfile, and then add a line like this to your unicorn.rb file to define the listener port along with adjusting the backlog size:
We did this in unicorn.rb `:backlog => Integer(ENV['UNICORN_BACKLOG'] || 200)` and then set the UNICORN_BACKLOG variable by the Heroku command line `heroku config:set UNICORN_BACKLOG=25 -a <app_name>`. We have been as high as 1024 and as low as 10. We settled in at 25 for us.
I've read through the OP, and all of the comments here. Our job at Heroku is to make you successful and we want every single customer to feel that Heroku is transparent and responsive. Getting to the bottom of this situation and giving you a clear understanding of what we’re going to do to make it right is our top priority. I am committing to the community to provide more information as soon as possible, including a blog post on http://blog.heroku.com.
Thanks for the response, but I have to admit that the lack of a clear-cut answer here is a little worrisome.
Anyone who wants to like Heroku would hope that the OP is flat out, 100%, wrong. The fact that Heroku's official answer requires a bit of managing implies otherwise.
On a related tangent, I would also encourage future public statements to be a little less opaque than some Heroku has put out previously.
For instance, the cause of the outage last year was attributed to "...the streaming API which connects the dyno manifold to the routing mesh" . While that statement is technically decipherable, it's far from clear.
Maybe it doesn't need "managing", Oren might just want to talk with whoever was responsible for the change and see what the best way forward is. I don't think panicked, knee-jerk reactions like "OMG we were wrong and will revert that commit pronto!" are beneficial in situations as complex as this.
Yeah, absolutely. I'm just saying we can't expect a to manager to immediately respond to a highly-technical issue questioning subtle changes in internal behaviour which might have been introduced years ago.
What's the point of posting a link to the front page of your blog, where the most recent article is 15 days old (4 hours after the comment above)?
What we want to know:
- is the OP right or wrong? That is, did you switch from smart to naive routing, for all platforms, and without telling your existing or future customers?
- if you did switch from smart to naive routing, what was the rationale behind it? (The OP is light on this point; there must be a good reason to do this, but he doesn't really say what it is or might be)
- if the OP is wrong, where do his problems might come from?
What's the point of posting a link to the front page of your blog
Well, he promised a detailed blog post, at which point that link will be extremely helpful.
I do not think it is fair to expect an immediate detailed response to those questions. If I were CEO of Heroku, I wouldn't say anything definite until after talking to the engineers and product managers involved--even if I was already pretty sure what happened. The worst thing you could do at this point is say something that's just wrong.
I don't expect an immediate response; I would have been happy with just: "This is Heroku's CEO. I'm on it."
But a link, that doesn't point anywhere useful, introduced by a PR phrase that sounds a little like "Your call is important to us", was a little annoying, esp. after reading the OP where they say they have contacted Heroku multiple times on this issue.
> if you did switch from smart to naive routing, what was the rationale behind it?
Most probable cause: smart routing is hard to scale. Multiple routers, with each one doing random distribution independently of others will still produce a globally random distribution. No need for inter-router synchronization.
If multiple routers try smart routing, they must do quite a bit of state sharing to avoid situations where N routers try to schedule their tasks on a single dyno. And even if you split dynos between routers then you need to move requests between routers in order to balance them.
While I think it is appropriate for Heroku to respond to this thread (and other important social media outlets covering this), linking to a blog without any messaging concerning your efforts might not be the greatest move... This may not be a sink or swim moment for Heroku, but tight management of your PR is key to mitigating damage. Best of luck, Heroku is a helpful product and I want to see you guys bounce back from the ropes on this one.
MQ might be a solution but certainly not in the "just use" class. Unless you want to introduce a bottleneck and a single point of failure, this queue has to be distributed.
Managing a distributed queue is hard, for reasons similar to ones making the original problem hard - DQs require global state in a distributed environment. There are tradeoffs involved - the synchronization cost might become a bottleneck in itself.
Pushing the problem on the distributed brokers is making a big bet on the queuing solution. Nope, definitely not in the "just use" category.
We've always been of the opinion that queues were happening on the router, not on the dyno.
We consistently see performance problems that, whilst we could tie down to a particular user request (file uploads for example, now moved to S3 direct), we could never figure out why this would result in queuing requests given Heroku's advertised "intelligent routing". We mistakenly thought the occasion slow request couldn't create a queue....although evidence pointed to the contrary.
Now that it's apparent that requests are queuing on the dyno (although we have no way to tell from what I can gather) it makes the occasional "slow requests" we have all the more fatal. e.g. data exports, reporting and any other non-paged data request.
Several Rails apps I develop have been suffering from similar issues. Perhaps 2-3% of requests take 0.4-2s in just processing. If the allocation is a little intelligent, it'll not perform too badly and is less work than much harder optimization. Yet if it's random, it'll queue up horribly.
I'm pissed. Spent way too much time unable to explain it to coworkers, thinking I just didn't understand Heroku's platform and that it was my fault.
Turns out, I didn't understand it, because Heroku never thought to clearly mention something that's pretty important.
Easiest fix: moving to EC2 next week. I've wanted to ever since our issues became evident but it's hard to make a good argument from handwaving about 'problems'.
I've done dev-ops in the ads industry before, it's really not that hard if you're a competent programmer. You just have to take a more studious approach than most non-dev-ops programmers and read up on things before deploying them.
But if you want to let the various PAAS providers put the fear into you, that's your cowardice.
Let the others learn as they may.
To clarify, Heroku is making the problem harder on themselves than it would be for an individual to serve their own needs because of the complexity managing so many customers and apps.
You don't have to be Heroku to do for yourself what they offer.
> Of course, then you need to solve all these problems yourself. That sounds pretty easy, you'll have it done next week no problem!
I agree with this, actually. I know it's not simple to do your own servers when you're growing. Yet I'd rather improve my existing ops skills a bit than have to setup everything as async APIs (on EC2 anyway). That's the only way I can see that I can solve this.
"Yet I'd rather improve my existing ops skills a bit than have to setup everything as async APIs (on EC2 anyway). That's the only way I can see that I can solve this."
You're going to discover that a lot of "ops skills" boils down to "do things asynchronously whenever possible". And while nearly any smart engineer can think of the "right" way of doing something, finding the time to do it all is a huge opportunity cost.
That's what the parent is trying to say. It's not that you can't do it; it's that it's a really bad idea to do it, at first.
It's a lot easier than you think when you aren't limited by artificial restrictions on the number of concurrent requests you can serve. Taking out what could be considered here to be a hostile intermediary will free up tech resources to fix problems that actually exist.
I can only imagine how these guys must have been beating their heads against the wall. Heroku charges a premium price and should be providing a premium service.
Solving this problem is easy: Run haproxy, and he'll have detailed control over the balancing algorithm, including balancing by least connections and a number of other measures (if he, for example, wants to segregate the long running requests on a specific set of backends, it's trivial).
Best thing we can do is follow through on the article's call-to-action for emailing email@example.com:
"After reading the following RapGenius article (http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics), we are reevaluating the decision to use Heroku. I understand that using a webserver like unicorn_rails will alleviate the symptoms of the dyno queuing problem, but as a cash-strapped startup, cost-efficiency is of high importance.
I look forward to hearing you address the concerns raised by the article, and hope that the issue can be resolved in a cost-effective manner for your customers."
This solves the issue of the application seeing out-of-whack queue times if there's clock skew between the front-end routing framework and the actual dyno box, but misses all the queued time spent in the dyno-queue per rap genius's post.
I work on New Relic's ruby agent, and you're right (hey Justin). In fact we support both methods (i.e. passing a timestamp or a duration). We rely on the front end server (e.g. nginx, apache) to set a timestamp in an HTTP header and forward that to the ruby application. In the case of heroku there is a special header that they pass which describes the queuing duration. Because we're in the ruby application we don't have control over whether this timestamp is accurate but I'm very interested in ideas on how we could do a better job in this situation.
My thoughts before I left the project were to add increased granularity of queue times via having headers added at each passing server and show a rainbow chart for the 'depth' of queue at each layer, not sure if that ever got added.
There's facility for that in the Agent, to allow multiple copies of the header and use whichever came first (for the beginning) and whichever came last (for the end ), it'd be relatively easy to hook metrics into each of those.
I can confirm this. We experimented with Unicorn as a way to get some of the benefits of availability-based routing despite Heroku's random routing. Our medium-sized app (occupying ~230 MB on boot) would quickly exceed Heroku's 512 MB memory limit when forking just 2 unicorn workers, so we had to revert to thin and a greater number of dynos.
That makes more efficient use of the nodes, because otherwise idle cpu time and memory allocation within a node has a chance of getting used, but it doesn't stop the queueing problem (it mearly gives you more virtual nodes to hand tasks to unintelligently).
Also as the nodes are virtual machines anyway and may be contending with each other for IO, and for most apps these days you spend more time waiting for IO than you do spinning the CPU (unless you have a lot of static content so don't need to hit the db for many requests - but such requests are better handled by a caching layer above that which handles the fancier stuff), so the benefit of running multiple processes per node is going to be a lot less noticable than if you are talking about the nodes being physical machines with dedicated storage channels.
Thanks for the blog post, by the way. When we were struggling with our own Heroku scaling issues last year (we eventually moved to AWS), I came across it and it was good vindication that somebody else was facing the same issue.
Because the premise is a good one, abstracting away some of the lower level details so people can just worry about building their apps. It has always been a tough abstraction to pull off though. It is possible the answer is simply that rap genius has outgrown heroku, in the same way as twitter outgrowing rails it doesn't make it a bad tool for everyone.
Precisely. I'm definitely not a startup worshipper; in fact, in terms of the HN spectrum, I'm probably at the more cynical end of the spectrum. But I like the ease of integration, and the fact that, for a smaller application, it certainly helps with my administration and infrastructure overhead.
As others here have pointed out in different discussions: Ideas aren't worth much by themselves. It's the subset that's implementable of which it's the subset that has gotten implemented that's worth anything. If we just love ideas we should better become philosophers.
Sorry to be a bit harsh, but I find it a bit shocking how even in this field where we can basically play god and do whatever we want and what we think is best on increasingly powerful bit-cruncher-and-storer-machines, so many here seem to behave like a herd of sheep and just do what 'everyone else' does. Just sit down for a moment and think! What are my requirements right now? What could be a requirement in the near future? What technologies are there which can help me? Am I sure about these feature? Better read up on it first! How difficult is it to get it to behave in ways that are or can be important for me?
Now list that stuff down. If it's puzzling sleep over it, forget it for a few days. Then suddenly, for example under a hot shower you get an idea - that requirement I had isn't really one, I can solve it differently! Come back, take the now fitting piece of the puzzle and do your job in 20% of the time that would have been needed if you would just have blindly followed some path. That's how it usually works for me. Be picky, be exact, but be lazy.
Now about that routing dispatcher problem: Couldn't we solve that in one to two weeks on a generic plattform, but specifically for a certain use case? Let's say you want to have a worker queue of rails request handlers that work in parallel. Just write that damn router! Maybe I'd be lazy, learn Erlang for a week and think about it afterwards.
I guess the problem I have is with the word 'love' itself. It's such a strong feeling for something that's just a tool that can in some situations make you turn around bits more efficiently. Can't we just look at that stuff rationally and reserve that kind of feelings for family and friends?
> But really, throwing in the towel at intelligent routing and replacing it with "random routing" is horrific, if true.
The thing is, their old "intelligent routing" was really just "we will only route one request at a time to a dyno." In other words, what changed is that they now allow dynos to serve multiple requests at a time. When you put it that way, it doesn't sound as horrific, does it?
Yes it does sound as horrific. It should never route more than N requests to a dyno, where N is the number of requests it can handle simultaneously. It doesn't matter if N is 1, 2, or 10. And it should always send new requests to the emptiest dyno.
While I agree that routing to the emptiest dyno would be a good thing, I think it's a little melodramatic to describe giving every dyno at least three times and up to several thousand times as much power as "horrific."
The problem is when you send a dyno that has all its threads stuck on long-running computations a new request, because it won't be able to even start processing it. The power is orthogonal to the problem.
The only mitigation is that if a dyno can handle a large number of threads, it probably won't get clogged. But if it can only handle 3 and gets new requests at random, you're in a bad place.
Thing is though, it says that a Dyno is a Ubuntu virtual machine. In what sort of horrendous configuration can an ENTIRE VM serve only a SINGLE REQUEST AT A TIME?!
That is utter madness, and the validity of the argument depends on whether it's the Heroku or this dude's fault that the VM is serving only a single request at a time (and it taking >1sec to handle a request).
You are exactly right, but I would hope that someone wouldn't be using Node at all if CPU bound requests are even remotely commoneplace on their app. But of course, if they built their app to spawn processes when this happens, you end up with a case of a need for intelligent routing.
At some point you hit memory limits, disk IO limits, or simply a connection limit. It doesn't matter what limit:
If you have some requests that are longer running than others, random load balancing will make them start piling up once you reach some traffic threshold.
You can increase the threshold by adding more backends or increasing the capacity of each backend (by optimizing, or picking beefier hardware if you're on a platform that will let you), and maybe you can increase it enough that it won't affect you.
But no matter what you do, you end up having to allocate more spare resources at it than what you would need with more intelligent routing.
If you're lucky, the effect might be small enough to not cost you much money, and you might be able to ignore it, but it's still there.
> But no matter what you do, you end up having to allocate more spare resources at it than what you would need with more intelligent routing.
I think we have to remember that the "intelligent routing" in question here is actually marketing-speak for "one request per server." Are you saying that when your servers can only receive one request at a time, you will necessarily need fewer than if your servers can handle three requests at a time but are assigned requests randomly?
No it's not. "Intelligent routing" means giving each server exactly as many requests as it can handle, and no more. If your servers can handle 3 requests each, using intelligent routing to make sure they never get more than 3 will blow the latency of the random method out of the water.
> That's one reason customers are throwing money at you to solve it, Heroku.
People are throwing money at Heroku because it's really easy to use, not because it's the best long-term technology choice. Seriously - what percentage of Heroku paying users do you think actually read up on the finest technical details like routing algorithms before they put in their credit card? Heroku knows. They know you can't even build a highly-available service on top of it, since it's singly-homed, and they're still making tons of money.
> So heroku doesn't want these $20,000/mo accounts, just technically understaffed startups paying $1000/mo?
> I think heroku does want to be a long-term technology choice.
Oh, I'm sure Heroku wants to be a long-term technology choice. That doesn't mean they're trying to be one with their current product offerings.
Consider their product choices since launch: they've added dozens of small value-added technology integrations. Features for a few bucks a month like New Relic to upsell their smallest customers. The price drop was also a big move to reduce barriers to using their platform - which also targets smaller customers. They launched Clojure as their third supported language! Meanwhile, they're singly-homed and have had several protracted outages, and have no announced plans to build a multi-homed product. Scalability has gotten worse with this random routing development.
I think Heroku has known for a long time that they don't have a long-term platform product and that they can't keep big accounts until they build one.
We ran into this exact same problem at Impact Dialing. When we hit scale, we optimized the crap out of our app; our New Relic stats looked insanely fast, but Twilio logs told us that we were taking over 15 seconds to respond to many of their callbacks. After spending a few weeks working with Heroku support (and paying for a dedicated support engineer), we moved to raw AWS and our performance problems disappeared. I want to love Heroku, but it doesn't scale for Rails apps.
* 89/100 requests failed (according to
* Heroku times out requests after 30 seconds, so the 30000ms
numbers may be timeouts (I've forgotten if *ab* includes
those in the summary).
* That said, the *ab* stats could be biased by using overly
large concurrency settings (not probably if you're running 50 dynos...),
But still WTF. 89/100 requests failed? That's not happy-making.
Uncertainty is DiaI (death-in-an-infrastructure). I just created a couple of projects on Heroku and love the service, but this needs to be addressed ASAP (even if addressing it is just a blog post).
Also, if you have fewest-connections available, I've never understood using round-robin or random algorithms for load-balancers...
> I've never understood using round-robin or random algorithms for load-balancers...
LeastConns/FastestConn selection is very dangerous when a backend host fails. Imagine a host has a partial failure, allowing health checks to pass. This host now fast fails and returns a 500 faster than other hosts in the pool generate 200s. This poison host will have less active connections and your LB will route more requests to it. A single host failure just became a major outage.
I like WRR selection for back ends, then use a queue or fast fail when your max active conns is exceeded. Humans prefer latency to errors, so let the lb queue (to a limit) on human centric VIPs. Automated clients deal better with errors so have your lb throw a 500 directly, or RST, or direct to a pool that serves static content.
No, lbs don't inspect established streams. The lb will periodically send requests to a known uri as a health heck instead. The problem is when the health check uri isn't indicative of availabilty. (Hint: it never is)
Nope, don't do this either. Unless you like getting pages because things are working?
You can configure a LB to inspect layer 7 - in HAProxy this is done with an observe directive. Error 500s would then take the member out of a pool. You are right that the health check of a static file may put it right back into the pool, but you can give it a slow rise value so that it waits a few minutes to do that. I'm not saying this is easy to get right but it is definitely possible to at least reduce the frequency of selection of a server raising 500 errors.
> But still WTF. 89/100 requests failed? That's not happy-making.
Unless something has changed recently, ab doesn't handle dynamic pages very well. It takes the first pageload as a baseline, and any subsequent request with any portion of the page that is randomized, or is a CSRF token, or reflects most recent changes, etc., is marked as "failed" because it doesn't match the baseline's length.
The page in question does have a block in the footer reflecting "hot songs", which I'm guessing changed a bit during the run.
AB is not a particularly smart tool (though it's very handy), but I've never seen it reject as not-accurate any non-baseline sized requests... In fact, I can't imagine the rationale that would lead a benchmarking tool to do so. [Obviously, a testing tool might do so...]
I imagine the rationale was something along the lines of many servers/apps are written to incorrectly return 200 with a descriptive error page rather than 500 or whatever the appropriate status code would be. And at the time ab was first written, pages were a lot more static than they are now, so a different page would be more likely to indicate an incorrect response.
I've been scanning through the comments, and I have yet to see anything written by a Heroku engineer to defend the company. I'm hoping its in here and I missed it. I have a feeling that this all might be absolutely true, and they have lawyers/PR trying to think of a damage control plan.
I suspect that the reason they'be been pushed to do this is financial, and it makes me think that Nodejitsu's model of simply not providing ANY free plans other than one month trials is a good one. I realize it's apples and oranges, since NJ is focused on async and this wouldn't even be a problem for a Node app, but from a business perspective I feel like this would alleviate pressure. How many dynos does Heroku have running for non-paying customers? Do these free dynos actually necessitate this random routing mesh bullshit? If not, what?
So the issue here is two-fold:
- It's very hard to do 'intelligent routing' at scale.
- Random routing plays poorly with request times with a really bad tail (median is 50ms, 99th is 3 seconds)
The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).
I do perf work at Facebook, and over time I've become more and more convinced that the most crucial metric is the width of the latency histogram. Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny.
I can chime in here that I have had similar experiences at another large scale place :). Some requests would take a second or more to complete with the vast majority finishing in under 100MS. A solution was put in place that added about 5 MS to the average request, but also crushed the long tail(it just doesn't even exist anymore) and everything is hugely more stable and responsive.
1) Once per minute (or less often if you have a zillion dynos), each dyno tells the router the maximum number of requests it had queued at any time over the past minute.
2) Using that information, the router recalculates a threshold once a minute that defines how many queued requests is "too many" (e.g. maybe if you have n dynos, you take the log(n)th-busiest-dyno's load as the threshold -- you want the threshold to only catch the tail).
3) When each request is sent to a dyno, a header fields is added that tells the dyno the current 'too many' threshold.
4) If the receiving dyno has too many, it passes the request back to the router, telling the router that it's busy ( http://news.ycombinator.com/item?id=5217157 ). The 'busy' dyno remembers that the router thinks it is 'busy'. The next time its queue is empty, it tells the router "i'm not busy anymore" (and repeats this message once per minute until it receives another request, at which point it assumes the router 'heard').
5) When a receiving dyno tells the router that it is busy, the router remembers this and stops giving requests to that dyno until the dyno tells it that it is not busy anymore.
I haven't worked on stuff like this myself, do you think that would work?
I seem to recall Google mentioning on some blog several years ago that high variance in response latency degrades user experience much more than slightly higher average request times. I can't find the link though; if anyone has it, I'd be grateful.
There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first request times fails to respond after a relatively short period of time (say, 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically since it's very unlikely that the second request will hit the same overloaded server.
From experience, this is an incredibly effective way to DoS yourself. It was the default behaviour of nginx LB ages ago. Maybe only on EngineYard. Doesn't really matter as nobody uses nginx LB anymore.
Even ignoring the POST requests problem (yup, it tried to replay those) properly cancelling a request on all levels of a multi-level rails stack is very hard/not possible in practice. So you end up DOSing the hard to scale lower levels of the stack (e.g. database) at the expense of the easy to scale LB.
It's a capacity/latency tradeoff. Perhaps I'm biased by working at Google, where capacity is cheap and latency will kill you, but IIUC Heroku runs off of AWS and their database tier needs to scale horizontally anyway, so reserving sufficient overflow capacity should simply be a matter of throwing money at the problem.
Right now, heroku has one inbound load-balancer that's out of their control (probably ELB(s)). This load balancer hits another layer of mesh routers that heroku does control, and that perform all of herokus magic. In order for "intelligent routing," which is more commonly known as "least-conn" routing to work amongst the mesh layer, all of the mesh routers would have to share state with each other in real-time, which makes this a hard problem.
Alternately, heroku can introduce a third layer between the mesh routers and the inbound random load balancer. This layer consistently hashes (http://en.wikipedia.org/wiki/Consistent_hashing) the api-key/primary key of your app, and sends you to a single mesh router for all of your requests. Mesh routers are/should be blazing fast relative to rails dynos, so that this isn't really a bottleneck for your app. Since the one mesh router can maintain connection state for your app, heroku can implement a least-conn strategy. If the mesh router dies, another router can be automatically chosen.
A relatively easy fix, for read-only or idempotent requests. Also, if long-tail latency requests wind up being run twice, this technique might accelerate tip-over saturation. Still, this 'hedged request' idea is good to keep in mind, thanks for the pointer.
The 'tied request' idea from the Dean paper is neat, too, and Heroku could possibly implement that, and give dyno request-handlers the ability to check, "did I win the race to handle this, or can this request be dropped?"
> There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first request times fails to respond after a relatively short period of time (say, 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically since it's very unlikely that the second request will hit the same overloaded server.
Your solution doesn't work if requests aren't idempotent.
Yeah, I know. I figure that for incoming HTTP traffic it's relatively easy to balance the GET requests, and if they're doing anything remotely sane with HTTP those ought to be idempotent (if they're not, Googlebot will come along and delete their site ;-)).
For mutating requests, there's a solution as well, but it involves checksumming the request and passing the checksum along so that the database layer knows to discard duplicate requests that it's already handled. You need this anyway if there's any sort of retry logic in your application, though.
This is something I've read in networked game literature: players react far better to consistent and high latency than to inconsistent and low latency, even if the averages are lower in the latter case. (It might even have been a John Carmack article).
Correct, this matches my observations as well. I'd trade an increase in mean latency for a decrease in worst-case latency anytime. It makes it so much easier to reason about how many resources are needed for a given workload when your latency is bounded.
Re the distribution, absolutely. That "FIFTY TIMES" is totally due to the width of the distribution. Although, you know, even if their app was written such that every single request took exactly 100ms of dyno time, this random routing would create the problem all over again, to some degree.
As for the intelligent routing, could you explain the problem? The goal isn't to predict which request will take a long time, the goal is to not give more work to dynos that already have work. Remember that in the "intelligent" model it's okay to have requests spend a little time in the global queue, a few ms mean across all requests, even when there are free dynos.
Isn't it as simple as just having the dynos pull jobs from the queue? The dynos waste a little time idle-spinning until the central queue hands them their next job, but that tax would be pretty small, right? Factor of two, tops? (Supposing that the time for the dyno-initiated give-me-work request is equal to the mean handling time of a request.) And if your central queue can only handle distributing to say 100 dynos, I can think of relatively simple workarounds that add another 10ms of lag every factor-of-100 growth, which would be a hell of a lot better than this naive routing.
I think the problem is that any servers which can handle concurrent requests now need to decide how many requests they can handle. Since most application servers seem to have concurrency values of "1, ever" or "I dunno, lots" this is a hard problem.
Your solution would likely work if you had some higher level (application level? not real up on Heroku) at which you could specify a push vs. pull mechanism for request routing.
Given that, according to TFA (and it's consistent with some other things I've read) Heroku's bread and butter is Rails apps, and given that, according to TFA, Rails is single-threaded, that (valid) point about concurrency in a single dyno is perhaps not that relevant? You'd think that Heroku would continue to support the routing model that almost all of their marketing and documentation advertises, right? Even if it's a configurable option, and it only works usefully with single-threaded servers?
And if you did do it pull-based, it wouldn't be Heroku's problem to decide how many concurrent requests to send. Leave it to the application (or whatever you call the thing you run on a dyno).
And it doesn't need to be pull-based, if the router can detect HTTP connections closing in dynos, or whatever.
But the idea of pull-based work distribution is pretty straightforward. It's called a message queue.
Simulation author here with some additional analysis using a faster distribution of request times. If you use a distribution with median 50 ms, 90th percentile 225 ms, and 99.9th percentile 898 ms, then you need 30 intelligent dynos to handle 9000 requests/minute without queueing. In the same scenario with 30 naive dynos, 44% of requests get queued.
The solution here is to figure out why your 99th is 3 seconds.
I'm not sure this applies to the OP. His in-app measurements were showing all requests being handled very fast by the app itself; the variability in total response time was entirely due to the random routing.
Wouldn't the bad tails of random routing be an unpredictably long length of time since long running requests times are unpredictable?
Even if you work on narrowing the fat tails, shouldn't you still need to be upfront and clear about how adding a new dyno only gives you an increased chance of better request handling times as you scale?
That isn't the only problem with random routing - the problems aren't as pronounced with uniform response speeds, but you still get a significant difference in net effective queue time, especially if you're operating close to your throughput limit.
> The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).
The Golang runtime uses non-blocking I/O to get around this problem.
Normally when I read "X is screwing Y!!!" posts on Hacker News I generally consider them to be an overreaction or I can't relate. In this case, I think this was a reasonable reaction and I am immediately convinced never to rely on Heroku again.
Does anyone have a reasonably easy to follow guide on moving from Heroku to AWS? Let's keep it simple and say I'm just looking to move an app with 2 web Dynos and 1 worker. I realize this is not the type of app that will be hurt by Heroku's new routing scheme but I might as well learn to get out before it's too late.
My company switched off of Heroku for our high-load app because of these same problems, but I still really like Heroku for apps with smaller loads, or ones in which I'm will to let a very small percentage of requests take a long time.
We are long time users of Heroku and are big fans of the service. Heroku allows us to focus on application development. We recently read an article on HN entitled 'Heroku's Ugly Secret' http://s831.us/11IIoMF
We have noticed similar behavior, namely increasing dynos does not provide performance increases we would expect. We continue to see wildly different performance responses across different requests that New Relic metrics and internal instrumentation can not explain.
We would like the following:
1. A response from Heroku regarding the analysis done in the article, and
2. Heroku-supplied persistant logs that include information how long requests are queued for processing by the dynos
Thanks in advance for any insight you can provide into this situation and keep up the good work.
I've been reading through all the concerns from customers, and I want every single customer to feel that Heroku is transparent and responsive. Our job at Heroku is to make you successful. Getting to the bottom of this situation and giving you a clear and transparent understanding of what we’re going to do to make it right is our top priority. I am committing to the community to provide more information as soon as possible, including a blog post on http://blog.heroku.com.
I don't know, $20k/mo strikes me as an awful lot of money to avoid engaging in work that a scrappy internet startup really ought to be competent at. If you don't know how to run the pipes and they get clogged and your plumber's not picking up the phone, you're screwed.
That amount buys a whole lotta dedicated servers and the talent to run them. (Sidenote: Every time I price AWS or one of its competitors for a reasonably busy site, my eyes seem to pop out at the cost when compared to dedicated hardware and the corresponding sysadmin salary.)
The larger issue is: Invest in your own sysadmin skills, it'll pay off in spades, especially when your back's up against the wall and you figure out that the vendor-which-solves-all-your-problems won't.
1. Employees are expensive. A good ops guy who believes in your cause and wants to work at an early stage startup can be had for $100k. (Maybe NYC is much cheaper than the bay area, but I'll use bay area numbers for now because it's what I know). That's base. Now add benefits, payroll taxes, 401k match, and the cost of his options. So what... $133k?. That's one guy who can then never go on vacation or get hit by the proverbial bus. Now buy/lease your web cluster, database cluster, worker(s), load balancers, dev and staging environments, etc. Spend engineering time building out Cap and Chef/Puppet scripts and myriad other sysops tools. (You'd need some of that on AWS for sure, but less on Heroku which is certainly much much more expensive than AWS)
2. When you price-out these AWS systems are you using the retail rates or are you factoring in the generous discount larger customers are getting from Amazon? You realize large savings first by going for reserved instances and spot pricing and stack on top of that a hefty discount you negotiate with your Amazon account rep.
3. I've worked at 2 successful, talent Bay Area startups in the last few years: One that was built entirely on AWS, and now, currently, one that owns all of their own hardware. Here's what I think: It's a wash. There isn't a huge difference in cost. You should go with whatever your natural talents lead you towards. You have a founding team with solid devops experience? Great, start on the cloud and then transition early to your own hardware. If not, focus on where your value-add is and outsource the ops.
Your last sentence hides another option which is the route we took: outsource devops. It cost us about $2000 in consultant fees for a fully setup system, easy to expand and add hardware to, and much more cost effective long term than AWS or Heroku. Our guy runs a small ops company who have 24/7 on-call. It's really the perfect solution.
We were very surprised to discover Heroku no longer has a global request queue, and spent a good bit of time debugging performance issues to find this was the culprit.
Heroku is a great company, and I imagine there was some technical reason they did it (not an evil plot to make more money). But not having a global request queue (or "intelligent routing") definitely makes their platform less useful. Moving to Unicorn helped a bit in the short term, but is not a complete solution.
I'd been using Heroku since forever, but bailed on them for a high traffic app last year (Olympics related) due to poor performance once we hit a certain load (adding dynos made very little difference). We were paying for their (new at the time) critical app support, and I brought up that it appears to be failing at a routing level continuously. And this was with a Sinatra app served by Unicorn (which at the time at least was considered unsupported).
We went with a metal cluster setup and everything ran super smooth. I never did figure out what the problem was with Heroku though and this article has been a very illuminating read.
If Heroku had the data needed to do minimum-of-two random routing, they'd have the data needed to do intelligent routing. The problem is not the algorithm itself: "decrement and reheap" isn't going to be a performance bottleneck. The problem is tracking the number of requests queued on the dyno.
If Heroku had the data needed to do minimum-of-two random routing, they'd have the data needed to do intelligent routing.
Not strictly true; imagine that they can query the load state of a dyno, but at some non-zero cost. (For example, that it requires contacting the dyno, because the load-balancer itself is distributed and doesn't have a global view.)
Then, contacting 2, and picking the better of the 2, remains a possible win compared to contacting more/all.
See for example the 'hedged request' strategy, referenced in a sibling thread by nostradaemons from a Jeff Dean Google paper, where 2 redundant requests are issued and the slower-to-respond is discarded (or even actively cancelled, in the 'tiered request' variant).
I'm not familiar with minimum-of-two-random routing, but it does seem like assigning request to dynos in sequence would perform much better than assigning randomly (ie in a scenario with n dyno capacity, request 1 => dyno 1, request 2 => dyno 2, ... request n => dyno n, request n+1 => dyno 1, ..., repeat)
That'd be probably significantly better than the case of (request i => dyno picked out of hat) for all i
If this is such a problem for you, why are you still on Heroku? It's not a be-all end-all solution.
I got started on Heroku for a project, and I also ran into limitations of the platform. I think it can work for some types of projects, but it's really not that expensive to host 15m uniques/month on your own hardware. You can do just about anything on Heroku, but as your organization and company grow it makes sense to do what's right for the product, and not necessarily whats easy anymore.
FYI I wrote up several posts about it, though my reasons were different (and my use-case is quite a bit different from a traditional app):
Funny - I thought this was a really interesting article - but I couldn't stand all these annotations.
And when I was selecting text (what i do mindlessly when I am reading an article) all the hell broke loose and it tried to load something.
Very annoying when I want to concentrate on the technical details. So we'll see once again - everyone's different.
Wow. I suspect Rap Genius has the dollars now where it's totally feasible for them to go beyond Heroku, but it still might not be the best use of their time. But if they have to do it, they have to do it.
OTOH, having a customer have a serious problem like this AND still say "we love your product! We want to remain on your platform", just asking you to fix something, is a pretty ringing endorsement. If you had a marginal product with a problem this severe, people would just silently leave.
There's also the outage hell. It's been ok for a month or two, but getting killed whenever AWS has a blip in US-East (there's no cross-region redundancy, and minimal resilience with an AZ or Region-wide service has serious problems) isn't great.
It probably doesn't hurt RG as much as lower overall performance during normal operations does, though.
Data Dep't here. Postgres scales great on Heroku, AWS, and in general. We've got users doing many thousands of query per second, and terabytes of data. Not a problem.
The issue with the number of connections is that each connection creates a process on the server. We cap the connections at 500, because at that point you start to see problems with O(n^2) data structures in the Postgres internals that start to make all kinds of mischief. This has been improved over the last few releases, but in general, it's still a good idea to try and keep the number of concurrent connections down.
Postgres has a limitation on number of open connections. This is because 1 connection = 1 process. MySQL uses threads, which scales better but has other downsides. The thread-based approach is also possible with Postgres using 3rd party connection pooling apps, e.g. pgbouncer.
It has to deal with the number of open connections to postgresql, not disk IO. You don't want to have 4,000 connections open at once (unless you were using a connection balancer, but that would only work in transaction mode).
Why is everyone like against rapgenius.com for "forcing the issue with a public spat". They are the customer not getting a service they are paying for. I would be fucking pissed too. Heroku isn't being the darling the service they advertised. They tried to work on it with Heroku. This is useful information to most of you. Are most of you against Yelp?
The only person using the words 'force the issue with a public spat' is me, and I judged that as 'fair enough'. I'm not against RapGenius and I'm glad the issue is being discussed.
But we haven't seen Heroku's comments, and while some parts of RapGenius's complaint are compelling, I'm not sure their apparent conclusions - that 'intelligent routing' is needed, and its lack is screwing Heroku customers — are right. I strongly suspect some small tweaks, ideally with Heroku's help, but perhaps even without it, can fix most of RapGenius's concerns.
Perhaps there was a communication or support failure, which led to the public condemnation, or maybe that's just RG's style, to heighten the drama. (That's an observation without judgement; quarrels can benefit both parties in some attention- and personality-driven contexts.)
It's an integration with an OpenRTB bidding marketplace that's sending our way that traffic.
And 25K is not the whole story. In a lot of ways it's similar to high frequency trading. Not only do you need to decide in real time if you want to respond with a bid or not, but the total response time should be under 200ms, preferably under 100ms, otherwise they start bitching about latency and they could drop you off from the exchange.
And the funny thing is 25K is actually nothing, compared to the bigger marketplace that we are pursuing and that will probably send us 800K per second at peak.
But then again, those two links don't address the core of the problem:
Heroku is used by tons of people around the world. Some of them are paying good money for the service. Given the amount of scrutiny under which they operate, what is the incentive for them to turn an algorithm into a less effective one and still charge the same amount of money in a growing "cloud economy" where companies providing the same kind of service are a dime a dozen (AWS, Linode, Engine Yard, etc)?
How does that benefit their business if "calling their BS" is as easy as firing Apache Benchmark, collecting results, drawing a few charts and flat out "prove" that they're lying about the service they provide??
I mean, I doubt Heroku is that stupid, they know how their audience doesn't give them much room for mistakes. So as nice as the story sounds on paper, I'd really like another take on all this, either from other users of Heroku, independent dev ops, researchers, routing algorithms specialists or even Heroku themselves before we all too hastily jump to sensationalist conclusions.
My bad if that wasn't clear in my comment: the sensationalist bit wasn't directed at you but at the general tone of the thread. In fact, I actually upvoted your comment for providing two links highly relevant to the article.
Given how bold the claims are, I'd say it'd be best if Heroku reacted quickly, indeed. Cause right now, it seems that the author is either not understanding how Heroku works or ignoring important specifics/nuances inherent to their "secret-sauce" algorithms. Whether the analysis done in the article is sound or the author is actually pushing an agenda remains to be discussed and the sooner the better.
I'm also quite wary of the incentive a 20K monthly bill would give you to try and shake Heroku down for a rebate. By the way, the figure in itself seems very high, but out of context it's impossible for me, the reader, to evaluate if that's actually good money or not. Maybe other solutions (handling everything yourself) would actually be WAY more costly, maybe Heroku actually provides a service that is well-worth the money or maybe the author is right and it's actually swindling on Heroku's part, no way to know.
This is a knee-jerk reply. I know, because my knee jerked as well. Think about the problem a little more: if you have the data necessary to pick the min-of-two, then you have the data you need to do intelligent routing.
Not necessarily. Heroku claims that a global request queue is hard to scale, and therefore they switched to random load balancing. The comment above shows that a global request queue is not necessary. Lets say that minimum-of-n scales up to 10 dynos. If your application requires 40 dynos, you can have one front load balancer which dispatches the requests to 4 back load balancers, each of which has 10 dynos assigned to them on which they perform min-of-10 routing. This gives you routing that's almost as good as min-of-40 but it scales up nonetheless.
Assuming by intelligent, you mean minimally loaded, choice of 2 requires less bookkeeping than this. (choice of 2 requires only local (per-node level) info, where as most other intelligent load-balancing requires global info ie. the min, etc.)
Taking the case of minimally loaded, you need to keep track of how many active requests each node/replica is serving, as well as globally keeping track of the min. (which past a certain load, will suffer a lot of contention to update)
To do choice of 2, all you need is to keep track of active requests per node/replica.
Under spiky workloads, there is also a problem with choosing minimally loaded. The counter for numRequests of a node might not update fast enough, so that a bunch of requests will go to that node, quickly saturating its capacity.
Choice of 2 doesn't suffer this problem bec of its inherent randomization.
My hunch is that Heroku isn't doing this to bleed customers dry. I know more than a few really, really great people who work there, and I don't think they'd stand for that type of corporate bullshittery. If this were the case, I think we'd have heard about it by now.
My best guess is that they hit a scaling problem with doing smart load balancing. Smart load balancing, conceptually, requires persistent TCP connections to backend servers. There's some upper limit per LB instance or machine at which maintaining those connections causes serious performance degredations. Maybe that overhead became too great at a certain point, and the solution was to move to a simpler random round-robin load balancing scheme.
I'd love to hear a Heroku employee weigh in on this.
The thing that baffles me is that you could do high-level random balancing onto smaller clusters that do smart balancing. This would solve most of the problem of overloaded servers. An entire cluster would have to clog up with slow requests before there was any performance impact. So why don't they do this?
OK, maybe I'm missing something here, but it seems to me that the OP's real problem is that he's artificially limiting himself to one request per dyno. They now allow a dyno to serve more than one request at a time, and he's presenting that as a bad thing! It seems to me that the answer to Rap Genius' problems is not "rage at Heroku," but rather "gem 'unicorn'".
We did this, but all it did was buy us a bit of extra time before we ran into the same problem again - a very small percentage (<0.1%) of requests creating a queue that destroyed performance for the rest of them. Also, FWIW, Heroku does not officially support Unicorn, and you have to make sure that you don't run out of memory on your dynos (we tanked our app the first time we tried Unicorn with 4 processes).
The cost of New Relic on Heroku looks really high because each dyno is billed like a full server, which makes it many times more expensive than if you were to manage your own large EC2 servers and just have multiple rails workers.
New Relic could be much more appealing if they had a pricing model that was based on usage instead of number of machines.
I spent a lot of time trying to come up with a stable installation of EB for a Rails 3.2 app using RDS and just couldn't get it to a state where I'd ever deploy it as production. Here's where I decided to pull the plug:
Setting up SSL on Elastic Beanstalk was very easy for us. The documentation explained the entire process. It is easier if you get a wildcard SSL cert, so then you can use the same SSL cert for your various deployments under the same domain.
i don't think some details of the argument hold. it alleges that you need more dynos to get the same throughput. but that's not true once you have sufficient demand to keep a queue of about sqrt(n) (i think - someone who knows more theory than me can correct me) in size on the dyno (where you have n dynos). because at that point all dynos will be running continuously, and the throughput will be the same with either routing.
the average latency will be higher, though (and the spread in latency larger).
But you never want to have a queue on any of your dynos! A queued request means that a user is waiting with no response. If your goal is to have 0 (or less than epsilon) requests queued, it takes far fewer dynos if the requests are routed intelligently
If you have 10 dynos and 1000 simultaneous requests, the difference between naive and intelligent might well be reduced, but that's also a scenario in which your end user response times would be horrendously slow and so you'd need more dynos either way
I think this analysis and simulation does not account for one important thing: random routing is stateless and thus easy to be distributed. Routing to the least loaded Dyno needs to be stateful. It is quite easy to implement when you have one centralized router, but for 75 dynos this router would likely become a bottleneck. With many routers, intelligent routing has its own performance cost, the routers need to somehow synchronize state, and the simulation ignores this cost.
> With many routers, intelligent routing has its own performance cost, the routers need to somehow synchronize state, and the simulation ignores this cost.
Which is why we pay companies like Heroku to engineer clouds in which to run our applications. Because they're supposed to be better at this than us and spend the time and money building this difficult infrastructure well. That includes a scalable, stateful intelligent routing service.
I believe routing is not random, but round robin. I'd like Heroku to confirm. It's still a problem. If you are looking to run Unicorn on Heroku, use the heroku-forward gem (https://github.com/dblock/heroku-forward). Works well, but application RAM is quickly its own issue, we failed to run that in production as our app takes ~300MB.
The problem is the request arrival rate vs the distribution of service times in your app.
New Relic may be giving you an average number you feel happy about, but the 99th percentile numbers are extremely important. If you have a small fraction of requests that take much longer to process, you'll end up with queuing, even with a predictive least loaded balancing policy.
This is a very common performance problem in rails apps, because developers often use active record's associations without any sort of limit on row count, not considering that in the future individual users might have 10000 posts/friends/whatever associated object.
Fix this and you'll see your end user latency come back in line.
Why not get yourself ONE beefy server? (or two) That should be able to handle your 150 requests per second, simplify your architecture a lot, and buying it would be cheaper than 1 month on Heroku (at $20,000/month).
Why? The second is idling until the first fails, then the second takes over. Unless both fail simultaneously of course, for example due to power outage, but then your 40 servers will also fail simultaneously. Not to mention that with just one running server there are a lot fewer failure scenarios.
Personally - I prefer Linode to Heroku, sure there is more of my time consumed with sys admin, but I like having full control over my platform & setup, rather than having it virtually dictated to me. I'm always open to change but this strategy has served me very well for almost 3 years now.
I was in contact with Heroku support a couple of weeks ago since we experienced some timeout on our production app. Got a detailed explanation how the routing on heroku works by a Heroku engineer, and thought I could share:
"I am a bit confused by what you mean by an "available" dyno. Requests get queued at the application level, rather than at the router level. Basically, as soon as a request comes in, it gets fired off randomly to any one of your web dynos.
Say your request that takes 2 seconds to be handled by the dyno was dispatched to a dyno that was running a long running request. Eventually, after 29 seconds, it completed serving the response, and started working on the new, faster 2 second request. Now, at this point it had already been waiting in the queue for 29 seconds, so after 1 second, it'll get dropped, and after another 1 second, the dyno will be done processing it, but the router is no longer waiting for the response as it has already returned an H12.
That's how a fast request can be dropped. Now, the one long 29 second request could also be a series of not-that-long-but-still-long requests. Say you had 8 requests dispatched to that dyno at the same time, and they all took 4 seconds to process. The last one would have been waiting for 28 second, and so would be dropped before completion and result in an H12."
Watch out, this affects small rails applications with few dynos as well.
If you hit the wall with one dyno and add another one, you won't get twice the throughput even though you pay twice the price.
I've always had suspicions about this on some smaller apps but never really looked into it. You can configure New Relic to measure round-trip response times on the client side. At peak loads those would be unreasonably high. Much higher than can be explained by huge latencies even.
This is something that I have been struggling with the past long while. Very troublesome when a dyno cycles itself (like they always will at least every 24 hours), as the routing layer continues to send it requests, resulting in router level "Request Timeouts" if it takes too long to restart.
Especially difficult to diagnose when the queue and wait time in your logs are 0. What is the point of these in the logs if it never waits or queues?
For those of you looking to migrate to other, barer hosting solutions like AWS or another VPS provider, I've put together a Capistrano add-on that let's you use Heroku-style buildpacks to deploy with Nginx doing front-end proxy. I use it for half a dozen apps on my VPSs and it works swimmingly well.
"For a Rails app, each dyno is capable of serving one request at a time."
Is this a deliberate design choice on Heroku's part, or is this just how Ruby and Rails work? It sounds bizarre that you would need multiple virtual OS instances just to serve multiple requests at the same time. What are the advantages of this over standard server fork()/threaded accept designs?
Rails is commonly run as one or more application servers behind an http server that proxies requests to them. Rails itself doesn't manage threads or forked processes for accepting requests, so the only way it fits into Heroku's dyno model is as an app server per dyno.
> What are the advantages of this over standard server fork()/threaded accept designs?
It's simple to build and manage in that you don't have to worry about thread safety and can use the already built and tested proxy capabilities of existing web servers to distribute traffic.
Those charts of "simulated" load balancing strategies don't look at all reasonable at first glance. You certainly don't see such spiky patterns with normal web loads. I think you'd have to have some crazy amount of std. dev in completion time cranked way, way up in your simulation before you saw a bunch of servers stacked at 30 with others at 1.
It's not that there is no benefit to better balancing, it's just that I've never seen it have anything close to that impact. It seems like it's only being perceived as a problem here because somebody drank too much of the (old) kool-aid.
Some of the other numbers are hard to take at face value as well. 6000ms avg on a specific page? If requests are getting distributed randomly shouldn't all your pages show a similar average time in queue? Sounds more like they're using a hash balancing alg and the static page was hashing on to a hot spot.
> If requests are getting distributed randomly shouldn't all your pages show a similar average time in queue?
A common misconception, called "the law of small numbers".
Probability theory tells us this is only true over a large amount of requests, i.e. in the long term (the law of large numebrs). In the short term, results can vary wildly and thus form these kind of queues.
You don't think that page was chosen as an example because it should be a very inexpensive page to render? I read that as saying that the performance problem does affect all pages "even on THIS mostly static page."
As for your skepticism of the graphs, take a look at the annotated R source they provided. I didn't do a deep dive on it or anything but it looked reasonable to me.
I searched for variance and apparently nobody mentioned this before:
They talk of
Mean request time: 306ms
Median request time: 46ms
Which indicates a very high variance, so don't take for granted that an x50 increase of performance would result from intelligent routing. The problem is that the fast tasks suffer from being queued after the slow tasks, so each such fast task takes an extra latency. If the variance is lower, the random routing will be favorable at some point as the delay of getting the task from the router queue to the dyno is not zero neither. In the case of no variance, "intelligent routing" would always add that delay as soon as all dynos are at their limit. Before that, the router would simply keep a list of idle dynos and send work there without delay.
Sure if you never hit 100% load, intelligent routing is cheap and comes at no delay. Imagine 40ms jobs getting all dynos to 100% load. Now the dynos would be idle for the duration of the ping that it takes to report being idle. let that be 4ms. That is 10% less throughput than with items queuing up on the dyno.
The router being the bottleneck would therefore justify to make it stateless and give the dynos a chance to use these last 10% of processing power as well, ultimately increasing the throughput by 10%. Sure, a serious project would not run its servers at 120% load hoping to eventually get back to 100% within time, so all this being said I would always favor intelligent routing to get responsive servers, add dynos in rush hours and only opt for dyno-queuing for stuff that may come with a delay (scientific number crunching, …)
Just adding a top-level post to point out something buried in one of the threads here that is an important point on what is happening here:
The "queue at the dyno level" is coming from the Rails stack -- it's not something that Heroku is doing to/for the dynos.
Thin and Unicorn (and others, I imagine) will queue requests as socket connections on their listener. Both default to 1024 backlog requests. If you lower that number, Heroku will (according to the implications in the documentation on H21 errors) try multiple other dynos first before giving up.
For a single-threaded process to be willing to backlog a thousand requests is problematic when combined with random load balancing. Dropping this number down significantly will lead to more sane load-balancing behavior by the overall stack, as long as there are other dynos available to take up the slack.
Also, the time the request spends on the dyno, including the time in the dyno's own backlog, is available in the heroku router log. It's the "service" time that you'll see as something like "... wait=0ms connect=1ms service=383ms ...". Definitely wish New Relic was graphing that somewhere...
Intelligent routing would presumably need to know and act on a lot of state across their cluster, and if that's got to flow through a single node, you can see how it would present a bottle-neck as the cluster size and requests per second increased.
On the other hand, you can do randomised routing without knowing any state at all. You can do it with more than one routing node as well, which makes scaling almost trivial.
I presume there are Hard Problems associated with partitioning a Heroku-style cluster for intelligent routing, or that's what they would have done.
It seems that the delay is in the variance in the length of the different jobs. Having slow jobs is generally not a good idea, and I can imagine that they are happening for uncommon tasks.
When you are running a 100+ servers it seems like a simple answer would be to think about these uncommon tasks differently. Options would be for prioritizing them differently, showing different UI indicators, and also wanting them happening on a separate set of machines.
Doing these would mean that an intelligent routing mechanism would not have as much use. Am I wrong here?
I do believe that Heroku should document such problems of theirs more clearly, so that we know what challenges that we are facing as we develop applications, but in this particular case, it seems that they do have the right plumbing, and that they just need to be used differently.
I know nothing about Heroku's architecture than what I just read in this post, but couldn't you alleviate this problem greatly by having the dyno's implement work stealing? Obviously the they would have to know about each other then, but perhaps that is easier to do that global intelligent routing.
Thanks for calling this out. As you said, random routing is about as naive as it gets. They need to make upgrades to the routing mesh - expose some internal stats about dyno performance and route accordingly. Even if the stats were rudimentary, anything would be an improvement over random.
Maybe this is a dumb question, but wouldn't straightforward Round Robin routing by Heroku restore their "one dyno = one more concurrent request" promise without incurring the scaling liabilities of tracking load across an arbitrarily large number of dynos?
Sure, but the real issue the article identifies is that, under random routing, they need to keep doubling the number of dynos to halve the odds of bad queueing, which leads to absurd factor of 50 requirements to get back to what they had before. With round robin, the increase should be much more linear.
Over many requests, both should average to N requests served to each instance assuming a uniform random distribution.
The real issue is being able to figure out instances to avoid when some requests end up being slow. To put it another way, ideal balancing in this case isn't about evenly splitting all requests, but evenly splitting all processing (or waiting) time.
If you can guarantee that requests tend to take pretty stable and uniform time, then random or round-robin distribution should give good results. If you can't, some requests will be stuck waiting behind others and their waiting time will accumulate. You'll see worse behaviour when two or more of the bad slow requests get queued one after the other.
With round robin it's going to chose the dynos most likely to have the shortest queue given the simple information available. (longest time since it got something) so it's biased to putting stuff on empty queues.
Where as random picks randomly, so there's no bias to empty queues, so random should be less efficient.
I'd think that most of the requests being served by rapgenius.com would be highly cacheable (99% are likely just people viewing content that rarely changes).
Seems weird that the site would have such a massive load of non-cacheable traffic. Heroku used to offer free and automatic varnish caching, but the cedar stack removed it. Some architectures make it easy to use cloudfront to cache most of the data being served. My guess that refactoring the app to lean on cloudfront would be easier and more cost-effective (and faster) than manually managing custom scaling infrastructure on EC2.
Perhaps easy deployments are not worth the performance and blackbox trade-off. An alternative approach is a cloud infrastructure provider with baremetal and virtual servers on L2 broadcast domain, and one that provides a good API and orchestration framework so that you can easily automate your deployments. Here are some things we at NephoScale suggest you consider when choosing an infrastructure provider: http://www.slideshare.net/nephoscale/choosing-the-right-infr...
This kind of validates an idea I've been flirting with: a Heroku-like service which routes requests via AMQP or similar message broker and actually exposes the routing dynamics to the client apps.
From a naive, inexperienced view the idea of having web nodes "pull" requests from a central queue rather than the queue taking uneducated guesses seems to be a no-brainer. I can see this making long-running requests (keep-alive, streaming, etc) a bit more difficult, but not impossible.
What am I missing? This seems so glaringly obvious that it must have been done before...
That is exactly how an app I develop at $WORK works.
Requests come in on a front-end, it gets passed off to a router that has multiple different workers connected. The workers send a request to the router letting the router know that they are ready to start responding to requests. The router hands the worker a request, marks the request as being worked on, and moves on to the next request. It is a basic Least Recently Used queue at that point, and if all workers are busy but worker number 3 which received work last finished first, he gets handed new work instantly.
The worker then sends the request back to the router, which sends it back to the appropriate front-end that originally responded to the user.
We are using ZeroMQ for our communication.
For our use case we can handle around 200 requests a second from a TCP/IP connected client to our router, to a worker and back to a client. That is with 3 backend workers, which are hitting the disk/database.
The pull model is very hard to implement because the router behaves like a proxy for a much larger set of dynos (think tens of thousands). When you have 10K clients yielding "i'm available" 10 times a second, you have a nightmare, it's not sustainable.
A possible solution for the proxy and the dynos to agree on a protocol where the proxy passes a request to the dyno and the latter can give up with a status code that says "retry with another dyno". This could go on to up to the 30s timeout limit that Heroku has now.
I believe Mongrel2 (http://mongrel2.org) is close to what you're talking about. It uses zeromq (http://www.zeromq.org) to talk to a backend application and the backend application talks back to the web server using the same protocol.
Not exactly nodes pulling messages off a queue, but closer to something like that?
The problem with this model is you'd need to re-engineer all existing web applications to work with the new service. If I just want to boot up an open-source Rails/Django/node.js/etc app on your service, how would I do it?
Interesting article but every time I read something on RapGenius and move my eyes from the screen, I keep seeing white lines, very annoying. White font on black background is bad. Off topic I know, but still.
This might be the case "out of the box", but it's very simple to go multithreaded on the Cedar stack and avoid this issue (provided that your app is threadsafe, of course).
You can do this pretty easily with a Procfile and thin: bundle exec thin -p $PORT -e $RACK_ENV --threaded start
And then config.threadsafe! in the app
Regarding Rails app threadsafety, there are some gotchas around class-level configuration and certain gems, but by and large these issues are easily manageable if you watch out for them during app development.
Engine Yard is more like opinionated configuration management. It allocates and configures EC2 instances that you can log into like normal. The software stack is HAProxy, nginx, unicorn, etc, and customizable through the web interface and/or chef.
For python / Django you can use a Procfile specifying gunicorn (rather than stock manage.py) with multiple worker threads, eg.
web: gunicorn myapp.wsgi -b 0.0.0.0:$PORT -w 5
Then you will have 5 parallel "single-threaded" instances on each dyno rather than just 1. This will partially ameliorate the problem, but probably not 100%. (NOTE: This is speculation since Heroku hasn't weighed in yet)
Fundamentally we're talking about a load balancer. Even the most basic load-balancers can use a least-connections algorithm. Even a round-robin algorithm would be better since that would give each dyno (number-of-dynos * msec per request) to finish a long-running request. Random routing is a viable option where the number of concurrent requests a node can handle is large or unknown, but when the limit is known and in the single-digits, random routing is a recipe for disaster.
Seems like message here is that if you use an off the shelf solution you need to work around its limitations. In this case random load balancing may sound dumb but it's actually quite a reasonable way to spread load. The customers real problem is the single threaded server bottleneck compounded by the sporadic slow requests. Seems like they have outgrown Heroku and a more custom solution is required. Either that or rebuild the server in whole or in part with a more concurrent one.
I would like to see why actually Heroku fell back to random routing. It doesn't really make sense. Of course this all routing stuff is really tricky, but on the other hand there is a lot of work done (look at TCP agorithms).
When I was studying ZeroMQ routing based stuff for one project, I came across "credit-based flow control" pattern, that could make perfect sense in this kind of situation (Publisher-Subscriber scenario). Why not implementing such thing?
You're not committing to a usage level, you're committing to a pricing level. If you reserve a small instance, you get a small instance, no matter what utilization level you choose. It's the same exact resources no matter what you pick.
Medium utilization = medium upfront cost, medium hourly rate.
Heavy utilization = highest upfront cost, lowest hourly rate.
The names are meant to signify the trade-off you're making. If you run your instance only an hour a day, you will pay the least by choosing "light utilization": the hourly cost is high but you're only going to multiply that by a small number, so the savings in the up-front cost will dominate the total cost. If you run your instance 24 hours a day, then the hourly rate will dominate your total costs, so you'll save money by choosing "heavy utilization" with a higher up-front cost but lower hourly cost.
Segmenting the costs makes the pricing table more difficult to read, but it optimizes for everything else: you pay the lowest possible price for guaranteed resources, and Amazon has better knowledge of how much spare capacity it actually needs to handle the reservations.
~$20,000 sounds like a lot of money, they would need at least $100M a year in revenue to justify this number. This will be a major challenge if they want to grow profitable after the $15M VC money runs out. I'd assume they'd get the same for $5000 in rented servers which would free up enough money - outside of the valley - to have an DevOps and another developer.
Take a look at deliver if you like heroku push, but want it on a machine you control at a bit lower level: https://github.com/gerhard/deliver
I got it working on EC2 / Ubuntu real easy, and even added some basic support for SmartOS/illumos for joyent cloud
They want to force the issue with a public spat. Fair enough.
But, they also might also be able to self-help quite a bit. RG makes no mention of using more than 1 unicorn worker per dyno. That could help, making a smaller number of dynos behave more like a larger number. I think it was around when Heroku switched to random routing that they also became more officially supportive of dynos handling multiple requests at once.
There's still the risk of random pileups behind long-running requests, and as others have noted, it's that long-tail of long-running requests that messes things up. Besides diving into the worst offender requests, perhaps simply segregating those requests to a different Heroku-app would lead to a giant speedup for most users, who rarely do long-running requests.
Then, the 90% of requests that never take more than a second would stay in one bank of dynos, never having pathological pile-ups, while the 10% that take 1-6 seconds would go to another bank (by different entry URL hostname). There'd still be awful pile-ups there, but for less-frequent requests, perhaps only used by a subset of users/crawler-bots, who don't mind waiting.
On further thought, Heroku users could probably even
approximate the benefits from the Mitzenmacher power-of-two-choices insight (mentioned elsewhere in thread), without Heroku's systemic help, by having dynos shed their own excess load.
Assume each unicorn can tell how many of its workers are engaged. The 1st thing any worker does – before any other IO/DB/net-intensive work – would be to check if the dyno is 'loaded', defined as all other workers (perhaps just one, for workers=2) on the same dyno already being engaged. If so, the request is redirected to a secondary hostname, getting random assignment to a (usually) different dyno.
The result: fewer big pileups unless completely saturated, and performance approaching smart routing but without central state/queueing. There is an overhead cost of the redirects... but that seems to fit the folk wisdom (others have also shared elsewhere in thread) that a hit to average latency is worth it to get rid of the long tails.
(Also, perhaps Heroku's routing mesh could intercept a dyno load-shedding response, ameliorating pile-ups without taking the full step back to stateful smart balancing.)
Added: On even further thought: perhaps the Heroku routing mesh automatically tries another dyno when one refuses the connection. In such a case, you could set your listening server (g/unicorn or similar) to have a minimal listen-backlog queue, say just 1 (or the number of workers). Then once it's busy, a connect-attempt will fail quickly (rather than queue up), and the balancer will try another random dyno. That's as good as the 1-request-per-dyno-but-intelligent-routing that RapGenius wants... and might be completely within RapGenius's power to implement without any fixes from Heroku.
Can anyone explain why this random routing is supposedly good for Node.JS and Java? I mean the net effect is busy dynos might serve more requests while idle ones remain idle and that is certainly not good for Node.JS or anything. What am I missing?
Great article. Have you thought about the alternative of building your own auto-scaling architecture with 99.9% uptime? I'd be interested to hear if you plan to move off heroku and, if so, what your plans are.