Money Trees – Rap Genius Response to Heroku (rapgenius.com)
283 points by tomlemon on Feb 18, 2013 | 100 comments

> This works for future customers, since once Heroku makes these documentation changes, everyone who signs up will understand exactly how routing works. But it does nothing to address the time and money that existing customers have spent over the past few years. What does Heroku owe them?

Lawyers -- well, judges really -- are good at coming up with answers for this exact sort of question.

I am not being facetious. There are legal rules for assessing losses in even very complex, very entangled situations. If you feel Heroku has dudded you, find a torts lawyer.

Heck, Salesforce.com have deep pockets. Round up a few other $20k/month customers and start a class action.

Web companies need to realise that boring old-fashioned rules like "your claims should not be misleading" apply to them too.


ummmmm......salesforce bought heroku

The implication there was that because Salesforce owns Heroku, they would be able to pay up major cash if Rap Genius got some other Heroku customers involved in the lawsuit.

I don't think you can sue Heroku because it was "slow". Were any SLAs offered by Heroku explicitly not met? If so that's a different story.

If AWS was able to be sued by every customer who didn't understand how their infrastructure config worked with the underlying EC2 network, or did something incorrectly for a time due to a documentation mistake there would be more lawyers working at Amazon than engineers.

The argument is not about suing Heroku because of them being slow. It's suing them because they lied.

Define "lied". Not having clear documentation or you not understanding 100% how their back end system is implemented does not mean they were lying.

Seems like at most you could sue to get back some of what you paid them.

Do not treat loss as only the money lost. Lost customers are priceless and one can put a high price tag on that.

Also loss in productivity because their documentation was misleading.

If you use Heroku and New Relic, make sure you install the gem we wrote to make New Relic report correct queue times: https://github.com/RapGenius/heroku-true-relic

Hey Tom, I'm the guy who wrote the queue time instrumentation in the New Relic ruby agent. Note that I no longer work for New Relic and my opinions are my own, not those of New Relic or my current employer.

I don't recommend that you use this patch - work needs to be done on Heroku's end, this is not a satisfactory workaround. The ideal would be for them to add a timestamp to the headers at the front end of the dyno machine (i.e. in apache/nginx/whatever) to allow the calculation to include a local-machine-relative timestamp rather than one reliant on two servers being in sync.

The major issue is that servers on AWS do not have synchronized clocks, in general. I'm not sure how Heroku manage their servers, but I do know that in the samples I saw several years ago, we had a very large variance in the queue time reported based solely on that clock skew.

The New Relic reported value is an average, which is a poor choice for something like this, but it's very difficult to graphically illustrate queue time across a network of machines without resorting to it.
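To illustrate why an average hides this kind of problem, here is a minimal sketch with made-up numbers: 95 requests that queued for 50 ms and 5 that queued for 7000 ms. These figures are illustrative assumptions, not measurements from Heroku or New Relic.

```python
import statistics

# Hypothetical sample: 95 requests queued for 50 ms, 5 queued for 7000 ms.
queue_times_ms = [50] * 95 + [7000] * 5

mean_ms = statistics.mean(queue_times_ms)      # pulled far up by the 5 slow requests
median_ms = statistics.median(queue_times_ms)  # what a typical request experienced
worst_ms = max(queue_times_ms)                 # what the unlucky requests experienced

print(mean_ms, median_ms, worst_ms)
```

The mean here is 397.5 ms, which describes neither the typical request (50 ms) nor the slow ones (7000 ms). A median plus a high percentile or maximum would likely say more than the average alone.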

I'd be happy to discuss it further, and I know that sgrock [1] is also around the neighborhood - he's one of the current Ruby Agent maintainers.

1: http://news.ycombinator.com/user?id=sgrock

Man, there is a LOT of expertise over there at Rap Genius just to have a website where you can figure out what "hollatickin" means.

I know dude, it was just a joke. RG is cool tech. I don't need some VC to tell me that.

The tech. isn't exactly crazy though. A JavaScript popup over some text.

The implementation is very good though. It's clean, easy to use, and very useful. I could imagine it being very useful on something like Wikipedia. I don't always need to go to an entire article, maybe I only need to see the first paragraph on hover.

Don't get me wrong, I think they've done an awesome job and I can see the use in other fields, absolutely. I'm just not sure this is one of those "we need top talent" type businesses... I'm not even really sure software is their real value add.

Wikipedia actually already has this in the form of Navigation Popups: http://en.wikipedia.org/wiki/Wikipedia:Tools/Navigation_popu...

hey, you're right! I usually surf with JS off and never noticed!

WOAH okay that's legitimately awesome.

"Man, there is a LOT of expertise over there at AirBNB just to have a website where you can stay at a bad hotel."

I worked in the same department as Tom Lehman/Lemon @ D. E. Shaw in 2007...the guy is wicked sharp. So were most of the folks there.

Really? Their tone is entirely wrong in my opinion. They sound like children who have never worked in the real world before. The issue they are facing is certainly a real issue, but they are handling it completely inappropriately.

> Really? Their tone is entirely wrong in my opinion

That's just how they roll.

> The issue they are facing is certainly a real issue, but they are handling it completely inappropriately.

They've managed to start a huge discussion around Heroku's infrastructure while showing off the broader applications of their own annotation technology. I think they're doing just fine.

Heroku is a bit of a black hole with this kind of stuff. They shrugged off the legitimate concerns of a 20k/month customer. I certainly appreciate them publicizing the issue and starting a conversation.

My guess as an armchair observer (and tiny-scale Heroku user) would be that Heroku will offer some affected customers refunds, especially if those customers "threw dynos" at latency problems that were aggravated by the drift in Bamboo routing behavior and hidden by the misleading NewRelic monitoring.

I don't think Adam@Heroku's response on the 11th is that bad. He accepts the feedback and also wants Heroku to help RapGenius 'modernize their stack'. That's not a full and proper solution, nor a remedy for the lost cost/effort so far, but it would have offered a lot of performance and cost relief.

In fact, I think that's why this problem festered: many customers managed to soften the pain by going to Cedar, multiple-workers, app-optimizations, and more dynos... so deeper investigations kept getting backburnered, both inside and outside Heroku, until now.

RapGenius has done us a mitzvah by finally digging deeper, but I'm still eager to see what Heroku thinks the right remedies are, beyond RapGenius's 'must do' ultimatums.

> hidden by the misleading NewRelic monitoring.

The assumptions built into the queue time and queue depth monitoring were essentially the same - that routing would hold requests until a dyno was free, and all queueing happened at the routing fabric level not the dyno level.

Unfortunately, so far as I am aware the only way to get a 'true' round trip time for a given web page is to look at it from the user's perspective as they request it - as a percentage of time, the network roundtrip is the only number you really care about.

If they had been using New Relic correctly (note that I don't work for or speak for New Relic, I'm just a former employee), they'd have seen on the javascript-enabled monitoring that requests were taking a long time. The server-side time is only a portion of that, but it's clearly delineated.

I think this whole thing is composed of two issues: Rap Genius realized that requests were queueing at the dyno level (bad) and decided that they needed numbers to back that up. Unfortunately they picked a number (queue time) that doesn't have much functional basis on Heroku's stack at the moment, which weakens their argument.

What I would like to see is an additional header placed on the request at the front end of the dyno machine by Nginx or Apache or Yaws or whatever web server runs local to the dyno, immediately as the request hits the machine. That would enable the current New Relic Agent to pick up the queue time spent on the local machine correctly, and basically entirely eliminate the problem of inaccurate queue time statistics.

There's actually code in there already to handle this - add an HTTP_X_REQUEST_START header to your requests as they enter the machine and it'll be recorded. Not sure how it's displayed these days, I haven't been privy for a couple years now, but the code still exists and records statistics in the Agent.
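As a rough sketch of how an application could turn such a header into a local queue time: the function below reads HTTP_X_REQUEST_START from a WSGI-style environ. The header value format ("t=" followed by microseconds since the epoch) and the helper name queue_time_ms are assumptions made for illustration; this is not the New Relic agent's actual code.

```python
import time

def queue_time_ms(environ, now=None):
    """Return milliseconds between the front-end timestamp and `now`, or None."""
    raw = environ.get("HTTP_X_REQUEST_START")
    if raw is None:
        return None
    # Assumed format: "t=<microseconds since epoch>", stamped by the local web server.
    if raw.startswith("t="):
        raw = raw[2:]
    try:
        start_us = int(raw)
    except ValueError:
        return None  # unparseable header: report nothing rather than a bogus number
    if now is None:
        now = time.time()
    elapsed_ms = now * 1000.0 - start_us / 1000.0
    # A negative value would indicate clock skew; clamp it instead of reporting it.
    return max(elapsed_ms, 0.0)

print(queue_time_ms({"HTTP_X_REQUEST_START": "t=1000000000"}, now=1000.25))
```

Because the timestamp and the measurement would both come from the same machine's clock, the skew between servers described earlier in the thread should not enter the calculation.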

> If they had been using New Relic correctly (note that I don't work for or speak for New Relic, I'm just a former employee), they'd have seen on the javascript-enabled monitoring that requests were taking a long time. The server-side time is only a portion of that, but it's clearly delineated.

Note that they were using New Relic as part of an expensive add-on package from Heroku. It gave them a queue time value in its reports, but it was extremely misleading, since the only value it showed was for the router queue time, which should have always been extremely low (and was displayed as such). It didn't say "router queue time" or "dyno queue time shown elsewhere".

Since New Relic is supposed to be showing them everything that happens with their request on Heroku's servers, it seems logical that it would include dyno queue times.

Javascript-enabled monitoring would only show you that the request times are much longer than what Heroku says they are, then you still have to troubleshoot down to figuring out why.

> Unfortunately they picked a number (queue time) that doesn't have much functional basis on Heroku's stack at the moment, which weakens their argument.

I don't think it weakens their argument as it clearly shows that the biggest problem they have is not only out of their control (even with very short run-times, the higher the number of requests you have per minute, the more this problem is going to affect you) but that even buying very expensive tools integrated into Heroku's "stack" will not help you to see where the problem is. The tools were basically hiding the one problem that was solely Heroku's responsibility.

Do not forget that even while there were a lot of statistics about how long running requests can cause other, much shorter requests to take just as long and even timeout, the heart of the matter is that even with a high number of extremely short requests, the router can end up sending many requests to a single dyno while other dynos remain idle. There were plenty of graphs, even animated to show you the effect over time for this random dyno routing.
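The "some dynos pile up while others sit idle" effect can be sketched with a toy snapshot: assign a batch of requests to dynos uniformly at random and count how many dynos got nothing. The dyno and request counts are arbitrary assumptions, and this ignores request duration entirely.

```python
import random
from collections import Counter

def random_routing_snapshot(num_dynos, num_requests, seed=0):
    """Assign each request to a uniformly random dyno; report idle share and deepest queue."""
    rng = random.Random(seed)
    per_dyno = Counter(rng.randrange(num_dynos) for _ in range(num_requests))
    idle_fraction = 1 - len(per_dyno) / num_dynos  # dynos that received no request
    deepest_queue = max(per_dyno.values())         # most requests landing on one dyno
    return idle_fraction, deepest_queue

idle, deepest = random_routing_snapshot(num_dynos=1000, num_requests=1000)
print(f"idle dynos: {idle:.0%}, deepest queue: {deepest}")
```

With as many requests as dynos, the expected idle share is (1 - 1/n)^n, which is close to 1/e or about 37%, so roughly a third of the dynos would be expected to receive nothing while others hold several requests.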

> It didn't say "router queue time" or "dyno queue time shown elsewhere".

I can tell you that personally when I was writing the code that calculates that queue time value, several years ago, we didn't think such time existed. It was either router time or nothing.

> The tools were basically hiding the one problem that was solely Heroku's responsibility.

Right, absolutely agree, but the problem is their current queue time measurements do not make a strong argument for this due to the issue of clock skew. They're essentially picking a more-or-less nonsense number and saying that it demonstrates that Heroku is bad. The only numbers that are fundamentally reliable are the total request time from the javascript side.

I think the problem festered because there was a gaping hole in the response time monitoring and reporting.

There are still some important points missing from the discussion:

1. Operating at scale with parallel routing.
2. Handling faults while operating at scale with parallel routing.
3. Providing correct statistical models for the situation. The one we have right now is a crude approximation.
4. Measuring on the real system for problems.

The optimum routing is to have each dyno with 0 or 1 job at a time and a global queue of all incoming requests. But this is a latency problem then since it takes time for a dyno to tell that it is "ready". The net result is very bad performance and the global queue is a single point of failure. The solution is to queue because this removes the latency --- but with the price you see RG paying if a Dyno can only serve one request at a time.

If a dyno does not report "ready" to the routing mesh, then you can't route optimally:

Queue length doesn't work since a request in queue may take 7000ms while still having a length of 1. Another queue with length 5 consisting of 5 70ms requests is better to route to.

The time the last request spent in queue is not useful either because the very next message may be a 7000ms one.
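The queue-length point above can be made concrete with the same numbers. In this sketch the dyno names are hypothetical, and the service times are assumed to be known, which a real router generally cannot know in advance.

```python
# Hypothetical per-dyno queues; each entry is a request's service time in ms.
queues = {
    "dyno_a": [7000],    # one slow request, queue length 1
    "dyno_b": [70] * 5,  # five fast requests, queue length 5
}

# Routing on queue length picks the dyno with the fewest waiting requests.
shortest_by_length = min(queues, key=lambda name: len(queues[name]))

# Routing on pending work picks the dyno that will actually free up first.
shortest_by_work = min(queues, key=lambda name: sum(queues[name]))

print(shortest_by_length, shortest_by_work)
```

Length-based routing sends the next request behind 7000 ms of work, while the longer queue holds only 350 ms of work.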

So to solve this problem, you must do something else. You cannot use "intelligent routing" unless you can describe how it will work distributed with, say, 8 routing machines while avoiding latency. And while you are at it, you better measure your solution in a real-world scenario.

"The optimum routing is to have each dyno with 0 or 1 job at a time and a global queue of all incoming requests. But this is a latency problem then since it takes time for a dyno to tell that it is "ready". The net result is very bad performance and the global queue is a single point of failure. The solution is to queue because this removes the latency --- but with the price you see RG paying if a Dyno can only serve one request at a time."

You are right that there is an inherent latency hit using intelligent routing versus random routing. However in a resource constrained world, where no one can afford infinite dynos, there is also an average latency hit to random routing -- but rather than being largely fixed for all requests it is instead highly variable. While the magnitude of the two needs to be factored in, ceteris paribus low variance in latency is better.

As for your single point of failure point, there are distributed queue algorithms that handle router failure gracefully.

Could you point me to such distributed queue algorithms that can handle failure gracefully? I'm interested in reading on the topic.

while not exactly what you're looking for, this might be a good starting point: http://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Proto...

there seems to be some more reading material linked to at the bottom.

I really liked your links to mononcqc :) Those are way better suggestions than just "intelligent routing". I only skimmed the first one, with Join-Idle-Queueing, but that sounds like a way better algorithm to apply.

Yeah, Heroku's original method wasn't all that great, and random routing can give surprisingly decent performance in some workloads. Unfortunately the workload they were/are working with doesn't seem to be one of them.

Happily this class of problems is very well studied, and so there is a ton of research out there on various options. I know it can be daunting to write a program directly from a research paper, but hey that's why they make the big bucks.

If I understand the articles and your comment correctly:

The issue here is that you have N routers all with their own estimates of free vs. in-use. Selecting a random free dyno is fine with one router because you can guarantee that it is truly free. With N routers, each will randomly distribute its work amongst the dynos in the pool, so each now has 0-N queued requests. This means that even with fixed-length response times, you get a dyno-queue-time of N*response_time under full load and some potential jitter even at low loads. Add in huge variations of response time and this gets far worse since the queue length in time is unknown. (There's probably a half-decent bufferbloat analogy hiding in this somewhere...)

It sounds like modern Rails is OK with handling multiple requests simultaneously as long as you're I/O bound and using non-blocking database IO. Would it be possible to use Node-like callbacks within Rails to break up CPU-bound tasks? It would not be an ideal solution, but might help work around this without resorting to Go and the like.

"The solution is to queue because this removes the latency" - what? Queueing when your workers aren't at capacity surely adds latency.

Agreed that routers need to be aware of whether workers are busy. This might be really simple if the dispatcher receives the worker's reply and forwards it, or you could have the worker explicitly signal.

Why would you have a single global dispatch? While it is "optimum" if the dispatch were instant and infallible, presumably you'd have random assignment to any of a handful of independent dispatchers, each perhaps with dedicated workers, perhaps not.

(also agreed that fault tolerance is tricky if your operations aren't idempotent, and costly in any case)

I suggested an intermediate solution to them.

The front line router picks a random appropriate dyno, then hands off the request to the router responsible. That router does intelligent routing to the dynos it is responsible for.

As long as you cluster dynos for a given app behind a set of routers, this essentially solves the problem on Bamboo for the cost of an extra step at the start of each request.

What's unclear for this kind of scheme is what impact it has on stacks or dynos that can actually take concurrent requests. There's the distinct possibility that a concurrent server will see its average time get worse under such circumstances.

People simulate their stack quite quickly, but do not necessarily simulate what might happen to other ones.

Otherwise, it also means that you'll possibly have twice the data flying around your network due to the overhead of routing from one layer to the other one. Effects of these may be hard to predict.

It's also difficult to know what the impact would be of a router that handles a lot of traffic for an app being cycled or replaced, or crashing. Does this mean you create a single point of failure for apps? Do you go for redundancy, failovers, sharding or anything else to fix this? What about netsplits between units across layers? Random distribution has the advantage of generally being pretty reliable across all of these scenarios.

Bamboo is explicitly for stacks that can't take concurrent requests.

The data issues are not so bad. You just do as load balancers do and do the handoff in the initial TCP handshake. Only a small fraction of packets go through the first router.

As for the rest, the architectural issues are similar to the ones that Heroku was dealing with in 2009, except that each router has a constrained set of dynos it is responsible for. You would have standard health checks for routers, and the ability to migrate dynos from one router to another. There are issues to solve, but they are fairly reasonable.

This incident has done wonders for RapGenius's technical brand. I don't know how many people would've identified them as a 'tech company' before, but that number has surely gone up.

Guys, you've made a lot more money than me, so you don't need my advice. But if you want money back, you should probably be communicating in private through your lawyers. Posts like this look like you're trying to get (more) attention.

They ARE trying to get more attention and it's working. The only reason Heroku has prioritized this issue is because it has grown bigger than a customer complaint in a dead-end tracker. Rap Genius, by being smart and noisy, has made this issue about reputation. Heroku isn't on the line for fixing this one issue, they're on the line to do the right thing. And that's way more important to the long-term success of the company than one performance issue.

I'm super glad they are taking this public and getting tons of attention. I'm consulting for some people who are plagued with the same H12 errors. We've spent tons of time and money trying to mitigate these errors as much as possible to no avail. Bringing this problem up to Heroku through their paid support channels has always come back to them finger pointing at everyone but themselves. I'm glad someone is stepping up and pushing back.

The first post, definitely. But this one?

Maybe I'd feel differently if I were a Heroku customer, so I'll defer to people directly affected.

I think they're being "good guys" by keeping this transparent. I don't think they're really serious about getting their personal money back, as I'm sure it pales in comparison to the developer expenses in trying to optimize on bad data.

RapGenius probably don't want their money back --- they want a service that works as advertised. And making this a big, public issue is a better way to get that than legal threats.

Besides, even if a lawsuit did get them their money back, after two or three years, it might not be enough to pay back the legal fees. Nobody wins most lawsuits...

No, but people do win settlements.

Unless your supplier is a giant company where the left hand doesn't know what the right hand is doing, it's seldom wise to sue your supplier before you're ready to dump them.

Is it less wise than publicly taunting your supplier?

Heroku did false advertising of its routing intelligence. They deceived their customers and must be criticized for that. Rapgenius is having an issue that many of us also have, so it is nice that they are sharing the details.

One minor clarification: by "posts like these..." I did not mean to include their first post on these issues with Heroku. That one had a big effect because of the attention it got. But the follow-up posts come across differently (to a non-customer of Heroku).

Heroku's suggestion: "modernize and optimize your web stack."

I don't have any experience with Ruby web stacks so I'm curious if this is actually an option for you guys? What would it take to do that? Would the performance increase on Heroku be worth it?

It also seems like if you wanted to self host you would probably need to do those same improvements, right?

Please don't take my comment the wrong way, I'm not trying to say Heroku is somehow excused from their mistakes here. I'm just trying to understand that suggestion from Heroku.

Here's a discussion about that point in the original article's comments: http://news.ycombinator.com/item?id=5216553

The conclusion is that it will buy you a bit more time, but does not fix the underlying issue.

In Rap Genius' case, they're large enough that they would still have significant issues even if they switched to cedar and unicorn with 2-4 worker processes.

> they would still have significant issues even if they switched to cedar and unicorn with 2-4 worker processes.

Yeah, but how about config.threadsafe! with puma or thin in multi-thread mode?

It is odd to me that very little of this conversation on the nets recognizes that Rails _does_ support multi-threaded concurrent request handling, with the right app server stack (figuring out the right app server stack can be non-trivial, although it's getting better).

Agreed. I am working with a client who is slightly lower on the Heroku customer food chain and using Unicorn with four workers right now. Our next step is Puma though that will likely not be the end point.

Certainly investigate Puma on Rubinius or JRuby, and make sure you have config.threadsafe! turned on.

I've lost a lot of faith in Heroku this last week. Going to be doing a lot of investigating Cloud66/Elastic Beanstalk + EC2 for my Rails app. Good excuse to up my sysadmin abilities a bit.

Why does Adam Wiggins repeatedly use the word 'evolve' as a transitive verb in an awkward fashion? Is this some sort of start-up usage that I managed to avoid thus far?

"We're working on evolving away from the global backlog concept in order to provide better support for different concurrency models, and the docs are no longer accurate."

"Getting user perspective is very helpful and I'll apply your feedback as we continue to evolve our product."

"You're correct that we've made some product decisions over the past few years that have evolved our HTTP routing layer away from the "intelligent routing" approach that we used in 2009."

Evolve to me connotes natural selection -- which is rather more haphazard than I would hope for from an engineering process.

People don't like change.

Using the word "change" makes it more obvious that the changes they made to their system (debatably for the worse) were deliberate.

> Why does Adam Wiggins repeatedly use the word 'evolve' as a transitive verb in an awkward fashion?

I suppose "awkward" is in the eye of the beholder. This usage is standard enough to be in every dictionary I know. As Oxford succinctly puts it, the word means "develop gradually" — and much like the word "develop," it can be either transitive or intransitive.

> Evolve to me connotes natural selection -- which is rather more haphazard than I would hope for from an engineering process.

Even in biology, there have been several other theories of evolution than natural selection (e.g. Lamarckism and artificial/theistic evolution). The others became disfavored in biology as it became obvious that natural selection was indeed the best explanation of how evolution worked in nature, but natural selection has never been an inherent aspect of the word "evolution."

"evolving away from" is not transitive ("evolve our product" is)

his product is in the PaaS ecosystem, so why not "evolve"? :-)

Maybe this is offtopic, but I really don't like the way Rap Genius does links. It makes it so I essentially have to click on each link twice to get to what it actually goes to...

They're not links, they're text annotations. Rap Genius is a text annotation platform currently focused on rap lyrics. In this case most of the annotations happen to be a link + context, which is pretty rare.

Yea, I got that... but in the context of a tech blog post, the interface just doesn't seem to make sense, or at least not in my head. It's a blog post, not rap lyrics.

I think their hypothesis is something like all long-form text should/will be annotated. While I think it's important to dog-food, I agree that the execution in this case isn't great. The "links" really should just be links.

Agreed, this is incredibly annoying and has tipped the scales for me to "not worth the effort." It's as tho facebook attempted to put a technical blog post in a status update and only used @tags or something; the content does not fit the format.

Basically it's breaking the web's "what happens when I click a link" contract; middle clicking should not open the same page in new tab, confusing me. Breaking that convention for a very specific purpose (annotating song lyrics) makes sense, but it's extremely misguided on a technical blog post.

I liked it - most of them were annotations where the additional context provided let me stay on the page in most cases and still get most of the additional value.

I'm sorry, but I don't understand any of this hating on Rap Genius.

There's a reason they are the fastest growing YC company ever, and got a16z in for 15M -- because they are straight killers. They have quietly created an internet empire until this point, and are building something that people love and use everyday.

A lot of folks wouldn't have the chutzpah to call out Heroku like that or are just too small to make this kind of attention. To me it seems as though they are helping Ruby devs save money and time. 8 dynos vs 4 dynos is a hell of a big difference when you're starting out. Also, seems like something that would be pretty fun to do if you worked there.

Yet they are spending a crazy amount on hosting with Heroku when they could be doing it themselves and get cost savings plus know the entire stack.

TL;DR - They've said repeatedly that they would rather work on their product/projects than have to deal with all of the details themselves.

The crazy amount they're spending should give you an idea of how much throughput they have and what would be required for their own setup. That means a lot of design, additional time implementing, more time updating to new tech (including on the software side, in gems, etc), then in the end, ongoing maintenance.

There's probably a lot more that I'm not mentioning as hey, like them, I'm not interested in spending my time developing, purchasing and administrating my own hosting systems, either. I have apps to write.

But it seems like they had to deal with the details themselves, plus they are paying a crazy amount of money. Seems like a lose/lose situation for them.

Also the amount of money really gives no indication of how optimized a site is or how much traffic it gets. I've seen sites that are in the top 500 spending ~3K/mo on hosting and sites that get 60K uniques a month spending 2x that.

Thank you so much for forcing Heroku to confront this issue!

We've been seeing strange delays and optimizing based on New Relic for a long time... and whenever we reported this to Heroku, they would not admit to an issue.

We ended up using threads (on cedar stack) to get more concurrency per dyno.

Wonder how all of these people are feeling right now..


"Rap Genius became a cult phenomenon and scaled from zero to ten million users with ease thanks to the Heroku process model."

Good for at least one chortle.

"Explain Now, as Rap Genius is widely known for its expertise in queuing theory" Is this true, or are they being sarcastic that if they could do it Heroku really should've?

pretty sure that's sarcasm

There's actually a text annotation that explains that, demonstrating how their platform works.

Thanks Steve, when I wrote the question there wasn't one...

Ah. No worries. :)

Ironically, it's possible to get a huge gain over purely random load balancing by examining just two queues at random -- essentially, you should always be doing this since the cost is O(1) and the improvement is large.[0] This doesn't require any distributed locking and at least would qualify as "intelligent" routing -- probably the bare minimum needed to justify that marketing label.

Oh, and it also scales incredibly well. Like I said, there's no reason not to use it over purely random load balancing.

[0] http://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.p...
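A minimal sketch of the two-choice idea described above, compared with purely random assignment. This is a static balls-into-bins snapshot in which no request ever finishes, not a full queueing simulation, and it assumes the router can read the current queue depth of the dynos it samples. The function name and sizes are made up for illustration.

```python
import random

def max_queue_depth(num_dynos, num_requests, choices, seed=0):
    """Place requests one at a time on the least-loaded of `choices` randomly sampled dynos."""
    rng = random.Random(seed)
    depth = [0] * num_dynos
    for _ in range(num_requests):
        # Sample `choices` dynos at random; choices=1 is purely random routing.
        sampled = [rng.randrange(num_dynos) for _ in range(choices)]
        # Send the request to whichever sampled dyno currently has the shortest queue.
        target = min(sampled, key=lambda d: depth[d])
        depth[target] += 1
    return max(depth)

purely_random = max_queue_depth(1000, 1000, choices=1)
two_choices = max_queue_depth(1000, 1000, choices=2)
print(purely_random, two_choices)
```

Each routing decision only looks at two dynos, so the per-request cost stays constant no matter how many dynos or routers there are, which is the property the comment highlights.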

Wow, the gain from using two choices is indeed impressive! According to section 3.6 (page 72): ``[For a system with 500 servers, i]n equilibrium, the expected time a [request] spends in the system given one choice (d=1) is 1/(1-λ) {where λ is the average request rate}. Hence ... when λ = 0.99 the expected time in the system when d = 1 is 100.00; with two choices {when d=2}, this drops to under 6.''
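The quoted figures can be reproduced with a few lines. The d=1 case uses the 1/(1-λ) formula quoted above. For d≥2 the sketch uses the series Σ λ^((d^i - d)/(d - 1)) over i ≥ 1, which is the equilibrium expression as I recall it from the supermarket-model analysis; it is not quoted in this thread, so treat it as an assumption and check it against the thesis.

```python
def expected_time_in_system(arrival_rate, choices):
    """Equilibrium expected time per request in the supermarket model (service rate 1)."""
    if choices == 1:
        # Closed form quoted in the thread for a single random choice.
        return 1.0 / (1.0 - arrival_rate)
    total = 0.0
    for i in range(1, 64):
        # Exponent (d^i - d) / (d - 1) is always an integer for integer d >= 2.
        exponent = (choices**i - choices) // (choices - 1)
        term = arrival_rate ** exponent
        total += term
        if term < 1e-12:  # remaining terms shrink doubly exponentially
            break
    return total

one_choice = expected_time_in_system(0.99, choices=1)
two_choice = expected_time_in_system(0.99, choices=2)
print(round(one_choice, 2), round(two_choice, 2))
```

With λ = 0.99 the one-choice value comes out at 100 and the two-choice series sums to a little over 5, consistent with the "drops to under 6" in the quote.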

Heroku should have done something about the issue earlier, but it seems like the problem was just poor prioritization/time management on their end. Yes, these posts got them to finally get moving, but I wonder if perhaps RapGenius could have had the same effect by continuing to bug them privately in the same unyielding manner, instead of going public with it so quickly. That would have allowed Heroku to focus their energy on fixing the problem, rather than on worrying about PR and class action lawsuits.

Also, on the topic of lawsuits, how many small startups will go out of business if they get hit with a class action lawsuit every time their documentation accidentally diverges from reality? In this case, RapGenius is small and Salesforce is big, but the legal system will apply the same standard when the plaintiff is big and the defendant is poor. If this becomes precedent, then soon we will have lawyers trying to treat any public post by company employees as 'documentation', forcing startups to have a policy of not allowing their employees to freely help others with their product in public forums. Also, any small startup with a large competitor will have the large competitor paying people to sign up for the product with the sole intent of finding a bug in the documentation so that the small startup can be sued out of business.

I'm the guy who mentioned lawsuits.

> Also, on the topic of lawsuits, how many small startups will go out of business if they get hit with a class action lawsuit every time their documentation accidentally diverges from reality?

I was chatting with a law academic of my acquaintance; her specialty is torts and, in particular, remedies to torts. We discussed some of the different sorts of actions you could bring.

Heroku in their Terms of Service 11.1 have language that basically says "we can change our technicals without telling you". And that's very reasonable. It would be impossible to test every minor change with every client's application. It would also be very annoying for clients to get dozens of emails per day of the form "Updated foolib-2.5.3-55 to foolib-2.5.3-56a".

But torts are a different beast; they live outside and alongside contracts. She could tell me what actions you could take in Australian torts law. We have a tort for "misleading and deceptive conduct" [1] which would probably be whistled up for this case, given the magnitude and length of the divergence, but that particular tort seems to be an Australian-only innovation. US law has, she told me, a lot of unique features that she hasn't studied very closely. Heroku's ToS requires all disputes to be settled under California law in Californian courts.

> Also, any small startup with a large competitor will have the large competitor paying people to sign up for the product with the sole intent of finding a bug in the documentation so that the small startup can be sued out of business.

Heroku also has ToS language to deter anti-competitive behaviour; not to mention anti-competitive practice laws. Plus, interference with a contract is itself legally troublesome.

Generally speaking, if you can think of a HUGE GAPING PROBLEM in the law, the lawyers and judges have already thought of it and closed it. Usually hundreds of years ago.

[1] There is also in Australian law a statutory offence of misleading and deceptive conduct, but that would not lead to remedies for Heroku customers.

Of course, I am not a lawyer and this is not legal advice.

I agree that Heroku's response is pretty unbelievable and their engineering choices very suspect. Reading the email chain between Tom & Adam really drives home how badly this has been handled by Heroku.

Heroku is massively crippling its own product with random routing. Other cloud providers have been able to get this right, and Heroku very obviously knows what kind of applications are running on its server (e.g. deploy a Rails application, Heroku says "Rails" in the console). It would not be difficult to apply different routing schemes for each type of application.

Given that this has been going on for years now, Heroku is either acting with pronounced malice or incompetence. Any competent engineer would not be satisfied with switching the routers over to random and calling it a day. How could that have possibly been approved, then remained for years? They must not have realized what a grave mistake it is.

The #1 thing they should be doing right now (aside from damage control) is to move the routers over to round-robin routing. Random is the most naive scheme possible and is laughably inappropriate for this situation.

See for yourself using this simulator: http://ukautz.github.com/pages/routing-simulator.html
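
If you'd rather poke at it in code, here is a minimal Python sketch comparing random and round-robin assignment. It is not the linked simulator and not Heroku's router. It assumes a single router that sees every request, dynos that handle one request at a time, Poisson arrivals and exponential service times with mean 1; the function name, the policy strings and the default load of 0.8 are made up for illustration.

```python
import random


def mean_wait(n_dynos, n_requests, policy, load=0.8, seed=1):
    """Average time a request sits queued before a dyno starts on it.

    Assumptions (illustrative only): one router, single-request dynos,
    Poisson arrivals at total rate load * n_dynos, exponential service
    with mean 1. policy is "random" or "round_robin".
    """
    rng = random.Random(seed)
    free_at = [0.0] * n_dynos  # when each dyno will have cleared its backlog
    now = 0.0
    waited = 0.0
    for i in range(n_requests):
        now += rng.expovariate(load * n_dynos)
        if policy == "random":
            dyno = rng.randrange(n_dynos)
        elif policy == "round_robin":
            dyno = i % n_dynos
        else:
            raise ValueError(f"unknown policy: {policy}")
        # The request starts once the dyno is free, or immediately if idle.
        start = max(now, free_at[dyno])
        waited += start - now
        free_at[dyno] = start + rng.expovariate(1.0)
    return waited / n_requests
```

Under these assumptions round-robin should show a noticeably lower average wait than random, because each dyno receives requests at evenly spaced turns instead of in random bursts. With several independent routers each doing their own round-robin, the advantage would presumably shrink.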

To what extent would using something like Amazon's ELB mitigate this sort of issue in a bring-your-own-cloud approach? Completely?

I've been looking at using something like Cloud66 and an ELB to move off of Heroku.

I don't think it would help at all, as ELB doesn't intelligently route either.

My understanding was that you can specify how the "health check" works for each EC2 node, so if you can report that a particular node is having issues, ELB can observe that. But upon further reading, you're right: you can't do anything other than report that a node is "up" or "down", it seems.

Well, I'm talking out of my ass here, but I have this idea:

Let's say you set the health check to check a URL of your app that maps to a very, very cheap rack app that does nothing but return some really short string.

Then you set the health check timeout to a very, very low value (like 5-10ms or something).

Now all hosts that don't respond within that low timeout are seen as down with no requests routed to them.

So if that one node that is able to process one request is busy, the health check will fail and the request will be routed elsewhere.

This is a very poor man's solution with some drawbacks:

1) there's still a race condition here between determining that the host is up and sending a request to it, so you might still end up with a request being queued.

2) now you are practically doubling the latency between the load balancer and the app server

3) you are creating quite a bit of load to that small rack app which might have a negative impact on the overall performance.

So in the end, this might be a very bad idea (I didn't think this through fully and I'm not in a position for trying it out), but it might also be a stop-gap measure until the problem is solved for real. Maybe worth trying this out in a staging environment - or with a percentage of hosts.
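
To make the "very cheap app" part concrete, here is a rough sketch of such an endpoint. The idea above is a rack app; this is a Python WSGI analogue using only the standard library, so treat it as an illustration of the shape rather than something to drop into a Rails stack. The names health_app and serve and the port 8000 are hypothetical, and nothing here configures the load balancer side.

```python
from wsgiref.simple_server import make_server


def health_app(environ, start_response):
    """Answer every request with a tiny fixed body and do no other work."""
    body = b"ok"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def serve(port=8000):
    """Run the health endpoint on its own port (hypothetical setup)."""
    with make_server("", port, health_app) as server:
        server.serve_forever()
```

One catch with running it standalone like this: a separate process will answer quickly even while the real app is busy, so for the trick to work the check has to go through the same worker that serves real requests.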

ELB health checks have a minimum interval of 2 seconds.

Also, when you hit the threshold for consecutive failures (2 or 3), the ELB immediately closes all connections to that backend without waiting for their responses to finish.

The difference is that an EC2 instance is a lot more powerful than a Heroku dyno, so you definitely wouldn't be running a single Rails instance on each; you'd be running several, using something like Passenger or HAProxy, something that does do intelligent routing.

No? That's unfortunate. Is it random too?

>off of

You should be looking at improving your English!

Welcome to Hacker News! Please feel free to look over the Guidelines to see how your comments can be improved: http://ycombinator.com/newsguidelines.html

This is big stuff...

Sorry to see Rap Genius investing all that money in New Relic, I can't really imagine being in their shoes.

I would be so pissed.

PS: Heroku user here

Am I the only one who read "crap genius"?

Holy crap - over $60k to get app performance graphs! Wow - that is super expensive!!

This is unrelated, but every time I see that domain my mind thinks it's either rapegenius.com or ragepenius.com. Surely I can't be the only one?

This comment seems to be made every time rapgenius.com is on HN, every time it gets downvoted. No you're not the only one who thinks that, but why does it even matter.
