Lawyers -- well, judges really -- are good at coming up with answers for this exact sort of question.
I am not being facetious. There are legal rules for assessing losses in even very complex, very entangled situations. If you feel Heroku has dudded you, find a torts lawyer.
Heck, Salesforce.com have deep pockets. Round up a few other $20k/month customers and start a class action.
Web companies need to realise that boring old-fashioned rules like "your claims should not be misleading" apply to them too.
If AWS could be sued by every customer who didn't understand how their infrastructure config interacted with the underlying EC2 network, or who did something incorrectly for a while because of a documentation mistake, there would be more lawyers working at Amazon than engineers.
Not to mention the productivity lost because their documentation was misleading.
I don't recommend that you use this patch - work needs to be done on Heroku's end, this is not a satisfactory workaround. The ideal would be for them to add a timestamp to the headers at the front end of the dyno machine (i.e. in apache/nginx/whatever) to allow the calculation to include a local-machine-relative timestamp rather than one reliant on two servers being in sync.
The major issue is that servers on AWS do not have synchronized clocks, in general. I'm not sure how Heroku manage their servers, but I do know that in the samples I saw several years ago, we had a very large variance in the queue time reported based solely on that clock skew.
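To make the skew problem concrete, here is a toy calculation (my own illustration, with made-up numbers) of what happens when the router's clock lags the app server's:

```ruby
# Toy illustration (made-up numbers): queue time is typically computed as
# (app-server clock at dequeue) - (timestamp stamped by the router).
# If the two clocks disagree, the skew lands directly in the measurement.
router_clock_skew = -0.2   # router clock runs 200 ms behind the app server
true_queue_time   = 0.05   # the request actually waited 50 ms in queue

request_start  = Time.now.to_f + router_clock_skew   # router's stamp
now_at_dequeue = Time.now.to_f + true_queue_time     # app server's clock
measured_queue = now_at_dequeue - request_start

# measured_queue comes out near 0.25 s: 50 ms of real queueing plus
# 200 ms of pure clock skew reported as phantom queue time.
```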
The New Relic reported value is an average, which is a poor choice for something like this, but it's very difficult to graphically illustrate queue time across a network of machines without resorting to it.
I'd be happy to discuss it further, and I know that sgrock is also around the neighborhood - he's one of the current Ruby Agent maintainers.
That's just how they roll.
> The issue they are facing is certainly a real issue, but they are handling it completely inappropriately.
They've managed to start a huge discussion around Heroku's infrastructure while showing off the broader applications of their own annotation technology. I think they're doing just fine.
I don't think Adam@Heroku's response on the 11th is that bad. He accepts the feedback and also wants Heroku to help RapGenius 'modernize their stack'. That's not a full and proper solution, nor a remedy for the lost cost/effort so far, but it would have offered a lot of performance and cost relief.
In fact, I think that's why this problem festered: many customers managed to soften the pain by going to Cedar, multiple-workers, app-optimizations, and more dynos... so deeper investigations kept getting backburnered, both inside and outside Heroku, until now.
RapGenius has done us a mitzvah by finally digging deeper, but I'm still eager to see what Heroku thinks the right remedies are, beyond RapGenius's 'must do' ultimatums.
The assumptions built into the queue time and queue depth monitoring were essentially the same - that routing would hold requests until a dyno was free, and all queueing happened at the routing fabric level not the dyno level.
Unfortunately, so far as I am aware the only way to get a 'true' round trip time for a given web page is to look at it from the user's perspective as they request it - as a percentage of time, the network roundtrip is the only number you really care about.
I think this whole thing is composed of two issues: Rap Genius realized that requests were queueing at the dyno level (bad) and decided that they needed numbers to back that up. Unfortunately they picked a number (queue time) that doesn't have much functional basis on Heroku's stack at the moment, which weakens their argument.
What I would like to see is an additional header placed on the front end of the dyno machine by Nginx or Apache or Yaws or whatever web server runs local to the dyno, immediately as the request hits the machine. That would enable the current New Relic Agent to pick up the queue time spent on the local machine correctly, and essentially eliminate the problem of inaccurate queue time statistics.
There's actually code in there already to handle this - add an HTTP_X_REQUEST_START header to your requests as they enter the machine and it'll be recorded. I'm not sure how it's displayed these days, since I haven't been privy for a couple of years now, but the code still exists and records statistics in the Agent.
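For illustration, here's a minimal Rack middleware (my own sketch, not the actual agent code) that reads such a header and computes the time spent queued on the local machine; the header-setting side is a one-liner in the front-end server, e.g. nginx's `proxy_set_header X-Request-Start "t=${msec}";`:

```ruby
# Hypothetical middleware sketch: X-Request-Start is assumed to be stamped
# by the web server on the same machine, so no cross-host clock sync is
# needed to compute local queue time.
class LocalQueueTime
  def initialize(app)
    @app = app
  end

  def call(env)
    if (raw = env['HTTP_X_REQUEST_START'])
      started = raw.sub('t=', '').to_f                 # epoch seconds
      # expose the queue time to the app / logging layer
      env['local.queue_time_ms'] = ((Time.now.to_f - started) * 1000.0).round
    end
    @app.call(env)
  end
end
```

The key name `local.queue_time_ms` is my own placeholder; a real agent would report the value through its own channels.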
Note that they were using New Relic as part of an expensive add-on package from Heroku. It gave them a queue time value in its reports, but that value was extremely misleading, since the only thing it showed was the router queue time, which should always have been extremely low (and was displayed as such). It didn't say "router queue time" or "dyno queue time shown elsewhere".
Since New Relic is supposed to be showing them everything that happens with their request on Heroku's servers, it seems logical that it would include dyno queue times.
> Unfortunately they picked a number (queue time) that doesn't have much functional basis on Heroku's stack at the moment, which weakens their argument.
I don't think it weakens their argument, as it clearly shows that the biggest problem they have is not only out of their control (even with very short run-times, the more requests you have per minute, the more this problem will affect you), but that even buying very expensive tools integrated into Heroku's "stack" will not help you see where the problem is. The tools were basically hiding the one problem that was solely Heroku's responsibility.
Do not forget that even though there were a lot of statistics about how long-running requests can cause other, much shorter requests to take just as long and even time out, the heart of the matter is that even with a high number of extremely short requests, the router can end up sending many requests to a single dyno while other dynos remain idle. There were plenty of graphs, some even animated, showing the effect of this random dyno routing over time.
I can tell you that personally when I was writing the code that calculates that queue time value, several years ago, we didn't think such time existed. It was either router time or nothing.
> The tools were basically hiding the one problem that was solely Heroku's responsibility.
1. Operating at scale with parallel routing.
2. Handling faults while operating at scale with parallel routing.
3. Providing correct statistical models for the situation. The one we have right now is a crude approximation.
4. Measuring the real system for problems.
The optimum routing is to have each dyno holding 0 or 1 jobs at a time, fed from a global queue of all incoming requests. But then you have a latency problem, since it takes time for a dyno to tell the router that it is "ready". The net result is very bad performance, and the global queue is a single point of failure. The solution is to queue at the dynos, because this removes the latency --- but at the price you see RG paying when a dyno can only serve one request at a time.
If a dyno does not report "ready" to the routing mesh, then you can't route optimally:
Queue length doesn't work, since a queue of length 1 may hold a single 7000ms request, while a queue of length 5 holding five 70ms requests is the better place to route to.
The time the last request spent in the queue is not useful either, because the very next one may take 7000ms.
So to solve this problem, you must do something else. You cannot use "intelligent routing" unless you can describe how it will work distributed with, say, 8 routing machines while avoiding latency. And while you are at it, you better measure your solution in a real-world scenario.
You are right that there is an inherent latency hit to intelligent routing versus random routing. However, in a resource-constrained world, where no one can afford infinite dynos, there is also an average latency hit to random routing -- but rather than being largely fixed for all requests, it is highly variable. While the magnitudes of the two need to be factored in, ceteris paribus low variance in latency is better.
As for your single point of failure point, there are distributed queue algorithms that handle router failure gracefully.
There seems to be some more reading material linked at the bottom.
Happily this class of problems is very well studied, and so there is a ton of research out there on various options. I know it can be daunting to write a program directly from a research paper, but hey that's why they make the big bucks.
The issue here is that you have N routers all with their own estimates of free vs. in-use. Selecting a random free dyno is fine with one router because you can guarantee that it is truly free. With N routers, each will randomly distribute its work amongst the dynos in the pool, so each now has 0-N queued requests. This means that even with fixed-length response times, you get a dyno-queue-time of N*response_time under full load and some potential jitter even at low loads. Add in huge variations of response time and this gets far worse since the queue length in time is unknown. (There's probably a half-decent bufferbloat analogy hiding in this somewhere...)
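A quick back-of-the-envelope simulation (my own toy model, not Heroku's router) makes the unevenness visible:

```ruby
# Toy model: requests assigned uniformly at random to dynos, as if by
# independent routers sharing no state. Even with identical requests,
# per-dyno queue depths spread out instead of staying at the fair share.
def random_assign(num_dynos, num_requests, rng)
  queues = Array.new(num_dynos, 0)
  num_requests.times { queues[rng.rand(num_dynos)] += 1 }
  queues
end

queues = random_assign(10, 100, Random.new(42))
# Fair share is 10 requests per dyno; random assignment reliably leaves
# some dynos well above it (their requests wait) and some below it (idle).
p queues.sort.reverse
```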
It sounds like modern Rails is OK with handling multiple requests simultaneously as long as you're I/O bound and using non-blocking database IO. Would it be possible to use Node-like callbacks within Rails to break up CPU-bound tasks? It would not be an ideal solution, but might help work around this without resorting to Go and the like.
Agreed that routers need to be aware of whether workers are busy. This might be really simple if the dispatcher receives the worker's reply and forwards it, or you could have the worker explicitly signal.
Why would you have a single global dispatch? While it is "optimum" if the dispatch were instant and infallible, presumably you'd have random assignment to any of a handful of independent dispatchers, each perhaps with dedicated workers, perhaps not.
(also agreed that fault tolerance is tricky if your operations aren't idempotent, and costly in any case)
The front line router picks a random appropriate dyno, then hands off the request to the router responsible. That router does intelligent routing to the dynos it is responsible for.
As long as you cluster dynos for a given app behind a set of routers, this essentially solves the problem on Bamboo for the cost of an extra step at the start of each request.
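A sketch of that first hop (my own illustration, all names hypothetical): every front-line router can agree on which back-end router owns an app, with no shared state, by hashing the app name.

```ruby
require 'digest'

# Hypothetical first-hop lookup: hash the app name to pick the back-end
# router responsible for it. That router then does intelligent routing
# over only the dynos it owns. MD5 is used here purely for a stable,
# process-independent hash.
def responsible_router(app_name, routers)
  routers[Digest::MD5.hexdigest(app_name).to_i(16) % routers.size]
end

routers = %w[router-a router-b router-c router-d]
responsible_router('rapgenius', routers)  # same answer on every front-line node
```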
People are quick to simulate their own stack, but do not necessarily simulate what might happen to the other layers.
Otherwise, it also means you'll possibly have twice the data flying around your network due to the overhead of routing from one layer to the other. The effects of that may be hard to predict.
It's also difficult to know what the impact would be of a router that handles a lot of traffic for an app being cycled or replaced, or crashing. Does this mean you create single point of failure for apps? Do you go for redundancy, failovers, sharding or anything else to fix this? What about netsplits between units across layers?
Random distribution has the advantage of generally being pretty reliable across all of these scenarios.
The data issues are not so bad. You just do what load balancers do and hand off during the initial TCP handshake. Only a small fraction of packets go through the first router.
As for the rest, the architectural issues are similar to the ones that Heroku was dealing with in 2009, except that each router has a constrained set of dynos it is responsible for. You would have standard health checks for routers, and the ability to migrate dynos from one router to another. There are issues to solve, but they are fairly reasonable.
Maybe I'd feel differently if I were a Heroku customer, so I'll defer to people directly affected.
Besides, even if a lawsuit did get them their money back after two or three years, it might not be enough to cover the legal fees. Nobody really wins most lawsuits...
I don't have any experience with Ruby web stacks so I'm curious if this is actually an option for you guys? What would it take to do that? Would the performance increase on Heroku be worth it?
It also seems like if you wanted to self host you would probably need to do those same improvements, right?
Please don't take my comment the wrong way, I'm not trying to say Heroku is somehow excused from their mistakes here. I'm just trying to understand that suggestion from Heroku.
The conclusion is that it will buy you a bit more time, but does not fix the underlying issue.
In Rap Genius' case, they're large enough that they would still have significant issues even if they switched to cedar and unicorn with 2-4 worker processes.
Yeah, but how about config.threadsafe! with puma or thin in multi-thread mode?
It is odd to me that very little of this conversation on the nets recognizes that Rails _does_ support multi-threaded concurrent request handling, given the right app server stack (figuring out the right app server stack can be non-trivial, although it's getting better).
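For anyone curious, the Rails-3-era setup being suggested looks roughly like this (a hedged sketch; the puma thread counts are assumptions to tune for your own app):

```ruby
# config/environments/production.rb (Rails 3.x era sketch):
# opt in to handling concurrent requests; in Rails 4 this behavior
# became the default and the flag was deprecated.
config.threadsafe!

# Procfile entry for a threaded server (thread counts are placeholders):
#   web: bundle exec puma -t 8:8 -p $PORT
```

Note that this only helps if your workload is largely I/O-bound and your gems and drivers are actually thread-safe.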
"We're working on evolving away from the global backlog concept in order to provide better support for different concurrency models, and the docs are no longer accurate."
"Getting user perspective is very helpful and I'll apply your feedback as we continue to evolve our product."
"You're correct that we've made some product decisions over the past few years that have evolved our HTTP routing layer away from the "intelligent routing" approach that we used in 2009."
"Evolve" to me connotes natural selection -- which is rather more haphazard than I would hope for from an engineering process.
Using the word "change" would make it more obvious that the changes they made to their system (debatably for the worse) were made deliberately.
I suppose "awkward" is in the eye of the beholder. This usage is standard enough to be in every dictionary I know. As Oxford succinctly puts it, the word means "develop gradually" — and much like the word "develop," it can be either transitive or intransitive.
> "Evolve" to me connotes natural selection -- which is rather more haphazard than I would hope for from an engineering process.
Even in biology, there have been several other theories of evolution than natural selection (e.g. Lamarckism and artificial/theistic evolution). The others became disfavored in biology as it became obvious that natural selection was indeed the best explanation of how evolution worked in nature, but natural selection has never been an inherent aspect of the word "evolution."
His product is in the PaaS ecosystem, so why not "evolve"? :-)
Basically it breaks the web's "what happens when I click a link" contract; middle-clicking should not open the same page in a new tab, which is confusing. Breaking that convention for a very specific purpose (annotating song lyrics) makes sense, but it's extremely misguided on a technical blog post.
There's a reason they are the fastest growing YC company ever, and got a16z in for 15M -- because they are straight killers. They have quietly created an internet empire until this point, and are building something that people love and use everyday.
A lot of folks wouldn't have the chutzpah to call out Heroku like that, or are just too small to attract this kind of attention. To me it seems as though they are helping Ruby devs save money and time. 8 dynos vs 4 dynos is a hell of a big difference when you're starting out. Also, it seems like something that would be pretty fun to do if you worked there.
The crazy amount they're spending should give you an idea of how much throughput they have and what would be required for their own setup. That means a lot of design, additional time implementing, more time updating to new tech (including on the software side, in gems, etc), then in the end, ongoing maintenance.
There's probably a lot more that I'm not mentioning as hey, like them, I'm not interested in spending my time developing, purchasing and administrating my own hosting systems, either. I have apps to write.
Also, the amount of money really gives no indication of how optimized a site is or how much traffic it gets. I've seen sites in the top 500 spending ~3K/mo on hosting, and sites that get 60K uniques a month spending 2x that.
We've been seeing strange delays and optimizing based on New Relic for a long time... and whenever we reported this to Heroku, they would not admit to an issue.
We ended up using threads (on cedar stack) to get more concurrency per dyno.
Good for at least one chortle.
Oh, and it also scales incredibly well. Like I said, there's no reason not to use it over purely random load balancing.
Also, on the topic of lawsuits, how many small startups will go out of business if they get hit with a class action lawsuit every time their documentation accidentally diverges from reality? In this case, RapGenius is small and Salesforce is big, but the legal system will apply the same standard when the plaintiff is big and the defendant is poor. If this becomes precedent, then soon we will have lawyers trying to treat any public post by company employees as 'documentation', forcing startups to adopt a policy of not allowing their employees to freely help others with their product in public forums. Any small startup with a large competitor will also have that competitor paying people to sign up for the product with the sole intent of finding a bug in the documentation, so that the small startup can be sued out of business.
> Also, on the topic of lawsuits, how many small startups will go out of business if they get hit with a class action lawsuit every time their documentation accidentally diverges from reality?
I was chatting with a law academic of my acquaintance; her specialty is torts and, in particular, remedies for torts. We discussed some of the different sorts of actions you could bring.
Heroku in their Terms of Service 11.1 have language that basically says "we can change our technicals without telling you". And that's very reasonable. It would be impossible to test every minor change with every client's application. It would also be very annoying for clients to get dozens of emails per day of the form "Updated foolib-2.5.3-55 to foolib-2.5.3-56a".
But torts are a different beast; they live outside and alongside contracts. She could tell me what actions you could take in Australian torts law. We have a tort of "misleading and deceptive conduct" which would probably be whistled up for this case, given the magnitude and duration of the divergence, but that particular tort seems to be an Australian-only innovation. US law has, she told me, a lot of unique features that she hasn't studied very closely. Heroku's ToS requires all disputes to be settled under California law in Californian courts.
> Also, any small startup with a large competitor will have the large competitor paying people to sign up for the product with the sole intent of finding a bug in the documentation so that the small startup can be sued out of business.
Heroku also has ToS language to deter anti-competitive behaviour; not to mention anti-competitive practice laws. Plus, interference with a contract is itself legally troublesome.
Generally speaking, if you can think of a HUGE GAPING PROBLEM in the law, the lawyers and judges have already thought of it and closed it. Usually hundreds of years ago.
There is also, in Australian law, a statutory offence of misleading and deceptive conduct, but that would not lead to remedies for Heroku customers.
Of course, I am not a lawyer and this is not legal advice.
Heroku is massively crippling its own product with random routing. Other cloud providers have been able to get this right, and Heroku very obviously knows what kind of applications are running on its servers (e.g. deploy a Rails application and Heroku says "Rails" in the console). It would not be difficult to apply different routing schemes to each type of application.
Given that this has been going on for years now, Heroku is either acting with pronounced malice or incompetence. Any competent engineer would not be satisfied with switching the routers over to random and calling it a day. How could that have possibly been approved, then remained for years? They must not have realized what a grave mistake it is.
The #1 thing they should be doing right now (aside from damage control) is to move the routers over to round-robin routing. Random is the most naive scheme possible and is laughably inappropriate for this situation.
See for yourself using this simulator: http://ukautz.github.com/pages/routing-simulator.html
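Even the simplest alternative is easy to sketch (my own illustration, not Heroku's code): a round-robin counter guarantees per-dyno request counts never differ by more than one, whereas uniform-random selection makes no such promise.

```ruby
# Minimal round-robin router: cycle through dynos in order, so requests
# are spread as evenly as arithmetic allows.
class RoundRobinRouter
  def initialize(num_dynos)
    @num_dynos = num_dynos
    @next_dyno = 0
  end

  def pick
    dyno = @next_dyno
    @next_dyno = (@next_dyno + 1) % @num_dynos
    dyno
  end
end

router = RoundRobinRouter.new(4)
counts = Hash.new(0)
100.times { counts[router.pick] += 1 }
p counts   # each of the 4 dynos receives exactly 25 of the 100 requests
```

A real deployment still has to decide how N routers share (or shard) that counter, which is where the hard distributed-systems work lives.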
I've been looking at using something like Cloud66 and an ELB to move off of Heroku.
Let's say you set the health check to a URL of your app that maps to a very, very cheap Rack app that does nothing but return some really short string.
Then you set the health check timeout to a very, very low value (5-10ms or so).
Now all hosts that don't respond within that low timeout are seen as down with no requests routed to them.
So if that one node that is able to process one request is busy, the health check will fail and the request will be routed elsewhere.
This is a very poor man's solution with some drawbacks:
1) there's still a race condition here between determining that the host is up and sending a request to it, so you might still end up with a request being queued.
2) now you are practically doubling the latency between the load balancer and the app server
3) you are creating quite a bit of load on that small Rack app, which might have a negative impact on overall performance.
So in the end, this might be a very bad idea (I didn't think this through fully and I'm not in a position for trying it out), but it might also be a stop-gap measure until the problem is solved for real. Maybe worth trying this out in a staging environment - or with a percentage of hosts.
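The "very cheap Rack app" part is tiny; roughly something like this (my sketch, with placeholder names and a hypothetical probe path):

```ruby
# Sketch of the health-checked app. The ELB probes /elb-health; when the
# dyno's single worker is busy, the probe response is delayed past the
# aggressive timeout and the host is temporarily marked down, so no new
# requests get routed to it.
MAIN_APP = ->(env) { [200, { 'Content-Type' => 'text/html' }, ['real app']] }

HEALTH_CHECKED = lambda do |env|
  if env['PATH_INFO'] == '/elb-health'   # hypothetical health-check path
    [200, { 'Content-Type' => 'text/plain' }, ['ok']]
  else
    MAIN_APP.call(env)                   # your real application
  end
end
# In config.ru you would `run HEALTH_CHECKED`.
```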
Also, when you hit the threshold for consecutive failures (2 or 3), the ELB immediately closes all connections to that backend without waiting for their responses to finish.
You should be looking at improving your English!
Sorry to see Rap Genius investing all that money in New Relic; I can't really imagine being in their shoes.
I would be so pissed.
PS: Heroku user here