It's a good response in that they are taking responsibility, but it is pretty obvious that they are reluctant to say anything about a fix. In my mind, "it's hard" isn't a valid excuse in this case, especially when there are relatively straightforward solutions that will solve this at a practical level. For example, you could imagine a naive form of intelligent routing that would work simply by keeping a counter per dyno:
- request comes in and gets routed to the dyno with the lowest count. Inc the count.
- response goes out. Dec the counter.
Since they control the flow both in and out, this requires at most a sorted collection of counters and would solve the problem at a "practical" level. Is it possible to still end up with one request that backs up another one or two? Sure. Is it likely? No. While this isn't as ideal as true intelligent routing, I think it's likely the best solution in a scenario where they have incomplete information about what a random process on a dyno can reliably handle (which is the case on the cedar stack).
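For concreteness, here's a minimal single-router sketch of that counter scheme in Python (the class and method names are mine, not anything Heroku runs): a min-heap keyed on in-flight request count, with stale heap entries discarded lazily.

```python
import heapq
import itertools

class LeastConnRouter:
    # Naive least-connections routing: inc on request in, dec on response out.

    def __init__(self, dyno_ids):
        self.counts = {d: 0 for d in dyno_ids}        # authoritative in-flight counts
        self._tie = itertools.count()                 # tie-breaker so heap tuples always compare
        self._heap = [(0, next(self._tie), d) for d in dyno_ids]
        heapq.heapify(self._heap)

    def route(self):
        # Pop until we find an entry that still matches the authoritative count;
        # stale entries (left behind by earlier route/finish calls) are discarded.
        while True:
            count, _, dyno = heapq.heappop(self._heap)
            if count == self.counts[dyno]:
                break
        self.counts[dyno] += 1                        # request comes in: increment
        heapq.heappush(self._heap, (self.counts[dyno], next(self._tie), dyno))
        return dyno

    def finish(self, dyno):
        self.counts[dyno] -= 1                        # response goes out: decrement
        heapq.heappush(self._heap, (self.counts[dyno], next(self._tie), dyno))
```

That keeps routing at O(log n) per request. The hard part, of course, is where this state lives once there's more than one router.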
Alternatively, they could just add some configuration that lets you set the request density (i.e., how many concurrent requests each dyno can handle), and then you could bring intelligent routing back. The couple of milliseconds that lookup/comparison would take is far better than the scenario they're in now.
EDIT: I realized my comment could be read as though I'm suggesting this naive solution is "easy". At scale it certainly isn't, but I do believe it's possible, and since this is their business, that's not a valid reason to do what they're doing.
What if their inbound routing is hundreds of machines, each of which may get a request for any of their thousands of apps, spread across tens of thousands of web dynos?
Do you have a distributed sufficiently-consistent counter strategy that won't itself become a source of latency or bottlenecks or miscounts under traffic surges?
I doubt they want every inbound request to require:
• query remote redis for lowest-connection-count dyno(s) (from among potentially hundreds): 1 network roundtrip
• increment count at remote redis for chosen dyno: 1 network roundtrip (maybe can be coalesced with above?)
• when connection ends, decrement count at remote redis for chosen dyno: 1 network roundtrip
That's 2-3 extra roundtrips for each inbound request, and new potential failure modes and bottlenecks around the redis instance(s). And the redis instance(s) might need retuning as operations scale and more state is needed.
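To make the cost concrete, here's roughly what those per-request roundtrips would look like with redis-py (the key naming, and the assumption that "dynos:<app>" is pre-seeded with every dyno at score 0, are mine; each call is one network roundtrip):

```python
import redis

r = redis.Redis(host="routing-redis.internal")   # hypothetical shared instance

def route(app):
    key = f"dynos:{app}"
    # Roundtrip 1: dyno with the lowest in-flight count.
    dyno, _score = r.zrange(key, 0, 0, withscores=True)[0]
    # Roundtrip 2: claim it. This can be pipelined with the read, but not merged
    # into one trip without a Lua script, since the choice depends on the read.
    # (Two routers can also race between the read and the claim.)
    r.zincrby(key, 1, dyno)
    return dyno

def finished(app, dyno):
    # Roundtrip 3: release it once the response has gone out.
    r.zincrby(f"dynos:{app}", -1, dyno)
```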
Random routing lets a single loosely-consistent (perhaps distributed) table of 'up' dynos, with no other counter state, drive an arbitrarily large plant of simple, low-state routers.
This has all been solved previously. In Google App Engine the scheduler is aware of, for each instance:
* the type of instance it is
* the amount of memory currently being used
* the amount of CPU currently being used
* the last request time handled by that instance
It also tracks the profile of your application and applies a scheduling algorithm based on what it has learned. For example, the URL /import may take 170MB and 800ms to run, on average, so it would schedule it on an instance that has more resources available.
> Each instance has its own queue for incoming requests. App Engine monitors the number of requests waiting in each instance's queue. If App Engine detects that queues for an application are getting too long due to increased load, it automatically creates a new instance of the application to handle that load
Heroku essentially needs to build all of that. The way it's solved is that the network roundtrips to poll the instances happen in parallel with the scheduler's own work. You don't do:
* accept request
* poll scheduler
* poll instance/dyno
* serve request
* update scheduler
* update instance/dyno
This all happens asynchronously. At most your data is 10ms out of date. It would also use a very lightweight UDP-based protocol and would broadcast (and not round-trip), since you send the data frequently enough, with a checksum, that a single lost packet doesn't really matter; at worst it delays a request or two.
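As a rough illustration of that kind of reporting (the endpoint, interval, and payload fields below are my guesses, not how GAE or Heroku actually do it), the instance-side sender could be as small as:

```python
import json
import socket
import time
import zlib

ROUTER_ADDR = ("router.internal", 9999)   # hypothetical collector endpoint

def report_loop(dyno_id, get_queue_len):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        payload = json.dumps({
            "dyno": dyno_id,
            "queue_len": get_queue_len(),   # requests currently queued/in flight
            "ts": time.time(),
        }).encode()
        # CRC32 lets the receiver drop corrupted datagrams; a lost packet just
        # means the scheduler's view is one interval staler.
        sock.sendto(payload + b"|" + hex(zlib.crc32(payload)).encode(), ROUTER_ADDR)
        time.sleep(0.01)                    # state is at most ~10ms out of date
```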
A big problem is that the newer stack is not homogeneous - the applications deployed on dynos have much, much bigger variability than Heroku's old "Rails/Rack only" stack. Meanwhile the GAE stack is fully controlled by Google and reuses, AFAIK, their impressive "Google scale" toolchest, which ranges from replacement naming systems through monitoring, IPC, custom load balancing, etc.
While F5 and similar vendors offer nice hardware for that, I'm not sure their hardware (or HAProxy's software) supports the architecture Heroku uses (many heterogeneous workers running wildly different applications, with dynamic association of workers to machines, etc.).
> It also tracks the profile of your application and applies a scheduling algorithm based on what it has learned. For example, the URL /import may take 170MB and 800ms to run, on average, so it would schedule it on an instance that has more resources available.
That is very awesome technology, but is something like that available for non-Google people?
Sounds nice, but I'm not sure it's the only way -- that Heroku 'essentially needs to build all of that'. It'd be interesting to see whose routing-to-instance is faster in the non-contended case, between Heroku and GAE. Do you know of any benchmarks?
Don't know of any benchmarks, but I have/had a number of projects on App Engine and it is very good (but expensive). I would look to include Elastic Beanstalk in a comparison as well, as it has been gaining popularity since it launched (it doesn't have the lock-in and supports any environment).
My argument was built on the premise that random routing isn't acceptable given the potential slowdowns it can cause (as pointed out in the Rap Genius post). If you believe otherwise, then there's no real argument for me to make :)
With that said, in your example, you could do #1 and #2 together, and the response doesn't need to wait on the completion of #3. So it's one network roundtrip, which I would imagine is a tiny fraction of what they're having to do already. It is certainly another moving piece, but again, my argument is that they have to have a solution, and this doesn't seem infeasible.
As I've written elsewhere, I think just having a way for dynos to refuse load (by not accepting a connection, or returning an error or redirect), such that the load tries another dyno, will probably achieve most of the benefits of 'intelligent' routing. And, preserve the stateless scalability of the 'routing mesh'.
Really? Is that so hard? All you need is a table on the router that gives it the best information it can currently have about the number of requests being processed on each dyno - without doing a roundtrip. This requires exactly one additional (one-way) packet: a message from the dyno to the router, telling it that it has finished the current request.
Now, to avoid dynos getting starved of traffic (because a finished message might have been lost somewhere), the dyno can repeat the finished message every 30 seconds or so (and the router ignores the message when its count for that dyno is already <= 0).
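A rough sketch of the router side of that scheme (the dyno names, port, and JSON payload are all made up for illustration; `counts` would be seeded with every known dyno at 0):

```python
import json
import socket

counts = {"web.1": 0, "web.2": 0}       # dyno -> requests believed in flight

def pick_dyno():
    dyno = min(counts, key=counts.get)  # best information currently available
    counts[dyno] += 1                   # incremented locally as we forward
    return dyno

def finished_listener(port=9998):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, _addr = sock.recvfrom(4096)
        dyno = json.loads(data)["dyno"]
        if counts.get(dyno, 0) > 0:     # repeated keepalives never push below 0
            counts[dyno] -= 1
```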
A problem with this proposal is the assumption that there is one 'the router' with this info, updated for tens of thousands of dynos and millions of requests per second.
Well, if this is the case, that would be a pretty deep architectural problem for a PaaS like Heroku, I'd say.
I think it's pretty obvious that you need at least two layers of hierarchy for the routing here: one (or more) routers forwarding requests to virtualized routers (per Heroku customer 'instance' or whatever that's called), which in turn provide the functionality I described in software. I'd probably use VMs running a specialized minimal Linux distro for the per-instance routers.
There are dozens of possible ways to implement this in a distributed, atomic or near-atomic, low-impact way.
* One way is to have a list in Redis: just pop a dyno off it (atomic, so each dyno is popped off exactly once), send the request to that dyno, and as soon as it's done, push the dyno back onto the list (sketched after this list). 1 RTT incoming, and let the dyno push itself back on after it's finished.
* Another way is to use sorted sets in Redis, incrementing/decrementing the score based on the connections you're sending to it/returning from it. Get the first dyno in line, which will have the lowest score. This is harder but maybe more flexible.
* Presumably they already have a system in place in the router that caches the dynos a request to a particular app can be sent to, which includes detecting when a dyno has gone dead. Just use that same system, but instead of detecting whether it has gone dead, detect whether it has more than 1 request waiting for it.
etc...
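Here's a minimal sketch of the first option (the Redis host and key names are placeholders, and "idle-dynos:<app>" would need seeding with the app's dynos):

```python
import redis

r = redis.Redis(host="routing-redis.internal")

def acquire_dyno(app):
    # Blocks until an idle dyno is available: the 1 RTT on the way in.
    # BLPOP is atomic, so each idle dyno goes to exactly one router at a time.
    _key, dyno = r.blpop(f"idle-dynos:{app}")
    return dyno

def release_dyno(app, dyno):
    # Called once the response has gone out (the dyno can do this itself).
    r.rpush(f"idle-dynos:{app}", dyno)
```

Note this gives each dyno at most one in-flight request, which is exactly what a single-threaded Rails dyno wants; a concurrent dyno could simply be pushed onto the list once per slot it can handle.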
But in the end, 2-3 extra roundtrips for each inbound request are peanuts; that's the least of the problems with these ideas. That would add maybe 10ms to each request? It's not like the servers are on the other side of the world. They're in the same datacenter, connected by high-throughput cabling.
It would seem at first glance that the extra round trip(s) would be less costly than the potential bottleneck by a large margin. As mentioned several times in the other thread, increasing average latency to narrow your latency histogram is almost always the correct choice.
This is NoSQL at startups in a nutshell. A master-slave, vertically scaled database won't work when we're the size of Twitter, so we're going to settle for a shoddy user experience while we customize the hell out of it. Heroku gets away from this with Postgres, and hopefully they can get away from this with their load balancing too.
You don't need a distributed, sufficiently-consistent counter strategy. You can do just fine with a two-layer routing mechanism where the first layer "sorts" the incoming requests (e.g. by Host header) so that the second layer is grouped by customer and can apply any number of simple alternatives like least-connections.
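A minimal sketch of that first layer (the router hostnames are placeholders; a real deployment would want consistent hashing or a lookup table so adding routers doesn't reshuffle every customer):

```python
import hashlib

SECOND_LAYER_ROUTERS = ["router-a.internal", "router-b.internal", "router-c.internal"]

def first_layer_route(host_header: str) -> str:
    # Layer 1 is stateless: it only decides which second-layer router "owns"
    # this customer. All least-connections state lives on that one machine.
    digest = hashlib.md5(host_header.lower().encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(SECOND_LAYER_ROUTERS)
    return SECOND_LAYER_ROUTERS[index]
```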
The naive approach would fail badly if they didn't have some way of supporting a healthcheck. A customizable URL path (/healthcheck is often used) where the app returns 200 if things look good would work. Otherwise you may wind up with a dyno that's quickly sending back 500s where not appropriate, and since it's handling lots of requests, the router would keep giving it more.
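Something like the following probe, run periodically by the router or a separate checker, is what I have in mind (the path and timeout are just conventions, not anything Heroku exposes):

```python
import urllib.request

def is_healthy(dyno_addr, path="/healthcheck", timeout=2.0):
    # Only dynos answering 200 on the configured path stay in the routing table.
    try:
        with urllib.request.urlopen(f"http://{dyno_addr}{path}", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:        # covers HTTPError (non-2xx), URLError, and timeouts
        return False
```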
I'm not sure that should be a concern at the routing layer, or even necessarily a concern of Heroku. It's not their job to ensure that your code isn't blowing up.
That being said, health checks are nice for other reasons and could be used outside of the routing layer (which needs to sail along as quickly as possible).
The problem with that is when your code fails to run properly because of a hardware problem on their end - now whose fault is it?
I don't know how big they are. 50k machines? Could be off by an order of magnitude either way but I'll go with that. Suppose that your servers have, let's be generous, a 5 year mean time between failure. That's 10k machines dying every year. About 27 per day. A bit over 1 per hour.
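The back-of-the-envelope math, for anyone who wants to check it (the 50k machines and 5-year MTBF are, as stated, guesses):

```python
machines = 50_000                             # guessed fleet size
mtbf_years = 5                                # generous per-machine MTBF
failures_per_year = machines / mtbf_years     # 10,000
failures_per_day = failures_per_year / 365    # ~27.4
failures_per_hour = failures_per_day / 24     # ~1.1, a bit over one per hour
```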
Machines don't necessarily die cleanly. They get flaky. Bits flip. Memory gets corrupted. Network interfaces claim to have sent data they didn't. Every kind of thing that can go wrong, will go wrong, regularly. And every one of them opens you up to following "impossible" code paths where the machine still looks pretty good, but your software did something that should not be possible, and is now in a state that makes no sense. Eventually you figure it out and pull the machine.
Yeah, it doesn't happen to an individual customer too often. But it is always happening to someone, somewhere. And if you use least connections routing, many of those failures will be much, much bigger deals than they would be otherwise. And every time it happens, it was Heroku's fault.
I'm having trouble finding information about this. It's disturbing how little attention seems to be given to load balancing relative to how important it is.
Yeah, it's interesting what sort of emergent properties come out of massively-scalable, massively-distributed systems. For example, when you write software in school or for single-machine deployment, you're taught to assume that when there's a bug it's your fault, a defect in your software. That's no longer the case when you get into massive (10K+ machine) clusters, where when your program fails, it might be your software, or it might be the hardware, or it might be a random event like a cosmic ray (seriously...in early Google history, there were several failed crawls that happened because cosmic rays caused random single-bit errors in the software).
And so all the defect-prevention approaches you learn for writing single-machine software - testing, assertions, invariants, static-typing, code reviews, linters, coding standards - need to be supplemented with architectural approaches. Retries, canaries, phased rollouts, supervisors, restarts, checksums, timeouts, distributed transactions, replicas, Paxos, quorums, recovery logic, etc. A typical C++ or Java programmer thinks of reliability in terms of "How many bugs are in my program?" The Erlang guys figured out a while ago that this is insufficient for reliability, because the hardware might (and will, in a sufficiently large system) fail, and so to build reliable systems you need at least two computers, and it's better to let errors kill the process and trigger a fallback than to try to never have errors.
Seems obvious that a naive solution wouldn't be as easy at large scale. You have to imagine that Heroku would have considered a whole lot of options before deciding on random distribution. Give them some credit at least.
What's worse is that they are not dealing with the behaviour of just one app. Random routing probably works best for a certain subset of their customers, maybe even the majority. If they switched to it, it's unlikely to have happened without measuring a few different approaches and deciding this gave the best results overall. Unfortunately they have to deal with all of their customers' apps, some of which might have huge variation in response time, which seems to trigger these issues - it's a complicated topic, and one solution might not work for everyone.
To be clear, I'm not saying the naive solution I proposed would be "easy". Just suggesting that better solutions do exist and given that solving this very problem is one of their biggest value propositions, taking the easier way out isn't acceptable.
We used to do long polling/comet in our app on Heroku (we axed that feature for non-technical reasons), which meant that many requests were taking several minutes while the framework could still process more requests. In your system, how would you account for asynchronous web frameworks?