
Money Trees – Rap Genius Response to Heroku - tomlemon
http://rapgenius.com/Lemon-money-trees-rap-genius-response-to-heroku-lyrics
======
jacques_chester
> _This works for future customers, since once Heroku makes these
> documentation changes, everyone who signs up will understand exactly how
> routing works. But it does nothing to address the time and money that
> existing customers have spent over the past few years. What does Heroku owe
> them?_

Lawyers -- well, judges really -- are good at coming up with answers for this
exact sort of question.

I am not being facetious. There are legal rules for assessing losses in even
very complex, very entangled situations. If you feel Heroku has dudded you,
find a torts lawyer.

Heck, Salesforce.com have deep pockets. Round up a few other $20k/month
customers and start a class action.

Web companies need to realise that boring old-fashioned rules like "your
claims should not be misleading" apply to them too.

(IANAL, TINLA)

~~~
chacham15
Ummmmm... Salesforce bought Heroku.

~~~
LeafStorm
The implication there was that because Salesforce owns Heroku, they would be
able to pay up major cash if Rap Genius got some other Heroku customers
involved in the lawsuit.

~~~
res0nat0r
I don't think you can sue Heroku because it was "slow". Were any SLAs offered
by Heroku explicitly not met? If so that's a different story.

If AWS could be sued by every customer who didn't understand how their
infrastructure config worked with the underlying EC2 network, or who did
something incorrectly for a time due to a documentation mistake, there would
be more lawyers working at Amazon than engineers.

~~~
dfischer
The argument is not about suing Heroku for being slow. It's about suing them
because they lied.

~~~
res0nat0r
Define "lied". Not having clear documentation or you not understanding 100%
how their back end system is implemented does not mean they were lying.

------
tomlemon
If you use Heroku and New Relic, make sure you install the gem we wrote to
make New Relic report correct queue times:
<https://github.com/RapGenius/heroku-true-relic>
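
A hypothetical Gemfile entry, in case it helps anyone (gem name and git
source inferred from the repo URL above; check the repo's README for the
actual instructions):

    # Gemfile -- assumed entry for the gem linked above
    gem 'heroku-true-relic', git: 'https://github.com/RapGenius/heroku-true-relic.git'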

~~~
jaggederest
Hey Tom, I'm the guy who wrote the queue time instrumentation in the New Relic
ruby agent. Note that I no longer work for New Relic and my opinions are my
own, not those of New Relic or my current employer.

I don't recommend that you use this patch - work needs to be done on Heroku's
end; this is not a satisfactory workaround. The ideal would be for them to add
a timestamp to the headers at the front end of the dyno machine (i.e. in
apache/nginx/whatever) to allow the calculation to use a local-machine-relative
timestamp rather than one reliant on two servers being in sync.

The major issue is that servers on AWS do not, in general, have synchronized
clocks. I'm not sure how Heroku manages its servers, but I do know that in
the samples I saw several years ago, we had a very large variance in the
reported queue time based solely on that clock skew.

The New Relic reported value is an average, which is a poor choice for
something like this, but it's very difficult to graphically illustrate queue
time across a network of machines without resorting to it.

I'd be happy to discuss it further, and I know that sgrock [1] is also around
the neighborhood - he's one of the current Ruby Agent maintainers.

1: <http://news.ycombinator.com/user?id=sgrock>

------
homosaur
Man, there is a LOT of expertise over there at Rap Genius just to have a
website where you can figure out what "hollatickin" means.

~~~
jdavis703
It's not just about rap:
<http://rapgenius.com/Marc-andreessen-why-andreessen-horowitz-is-investing-in-rap-genius-lyrics>

~~~
gburt
The tech isn't exactly crazy, though. A JavaScript popup over some text.

~~~
homosaur
The implementation is very good though. It's clean, easy to use, and very
useful. I could imagine it being very useful on something like Wikipedia. I
don't always need to go to an entire article, maybe I only need to see the
first paragraph on hover.

~~~
itafroma
Wikipedia actually already has this in the form of Navigation Popups:
<http://en.wikipedia.org/wiki/Wikipedia:Tools/Navigation_popups>

~~~
homosaur
hey, you're right! I usually surf with JS off and never noticed!

------
gojomo
My guess as an armchair observer (and tiny-scale Heroku user) would be that
Heroku will offer some affected customers refunds, especially if those
customers "threw dynos" at latency problems that were aggravated by the drift
in Bamboo routing behavior and hidden by the misleading NewRelic monitoring.

I don't think Adam@Heroku's response on the 11th is that bad. He accepts the
feedback and also wants Heroku to help RapGenius 'modernize their stack'.
That's not a full and proper solution, nor a remedy for the lost cost/effort
so far, but it would have offered a lot of performance and cost relief.

In fact, I think that's why this problem festered: many customers managed to
soften the pain by going to Cedar, multiple-workers, app-optimizations, and
more dynos... so deeper investigations kept getting backburnered, both inside
and outside Heroku, until now.

RapGenius has done us a mitzvah by finally digging deeper, but I'm still eager
to see what Heroku thinks the right remedies are, beyond RapGenius's 'must do'
ultimatums.

~~~
jaggederest
> hidden by the misleading NewRelic monitoring.

The assumptions built into the queue time and queue depth monitoring were
essentially the same - that routing would hold requests until a dyno was free,
and that all queueing happened at the routing fabric level, not the dyno level.

Unfortunately, so far as I am aware, the only way to get a 'true' round trip
time for a given web page is to look at it from the user's perspective as they
request it - as a percentage of time, the network roundtrip is the only number
you really care about.

If they had been using New Relic correctly (note that I don't work for or
speak for New Relic, I'm just a former employee), they'd have seen on the
javascript-enabled monitoring that requests were taking a long time. The
server-side time is only a portion of that, but it's clearly delineated.

I think this whole thing is composed of two issues: Rap Genius realized that
requests were queueing at the dyno level (bad) and decided that they needed
numbers to back that up. Unfortunately they picked a number (queue time) that
doesn't have much functional basis on Heroku's stack at the moment, which
weakens their argument.

What I would like to see change is for an additional header to be placed on
the request by Nginx or Apache or Yaws or whatever web server runs local to
the dyno, immediately as the request hits the machine.
That would enable the current New Relic Agent to pick up the queue time spent
on the local machine correctly, and basically entirely eliminate the problem
of inaccurate queue time statistics.

There's actually code in there to handle this already - add an
HTTP_X_REQUEST_START header to your requests as they enter the machine and
it'll be recorded. I'm not sure how it's displayed these days; I haven't been
privy for a couple of years now, but the code still exists and records
statistics in the Agent.
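
For illustration only, here's a minimal Rack middleware sketch that consumes
such a header, assuming the front-end web server stamps each request (e.g.
nginx: proxy_set_header X-Request-Start "t=${msec}";). The header name and
"t=" format follow that convention; the exact format the agent expects may
differ:

    # Hypothetical sketch: compute local queue time from a timestamp set by
    # the web server sitting in front of the app, so the calculation never
    # compares clocks on two different hosts.
    class LocalQueueTime
      def initialize(app)
        @app = app
      end

      def call(env)
        if (stamp = env['HTTP_X_REQUEST_START'])
          start_s = stamp.sub(/\At=/, '').to_f  # "t=<epoch seconds.millis>"
          env['local.queue_time_ms'] = ((Time.now.to_f - start_s) * 1000).round
        end
        @app.call(env)
      end
    end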

~~~
neura
> If they had been using New Relic correctly (note that I don't work for or
> speak for New Relic, I'm just a former employee), they'd have seen on the
> javascript-enabled monitoring that requests were taking a long time. The
> server-side time is only a portion of that, but it's clearly delineated.

Note that they were using New Relic as part of an expensive add-on package
from Heroku. It gave them a queue time value in its reports, but it was
extremely misleading, since the only value it showed was the router queue
time, which should have always been extremely low (and was displayed as such).
It didn't say "router queue time" or "dyno queue time shown elsewhere".

Since New Relic is supposed to be showing them everything that happens with
their request on Heroku's servers, it seems logical that it would include dyno
queue times.

Javascript-enabled monitoring would only show you that the request times are
much longer than what Heroku says they are; you would still have to
troubleshoot from there to figure out why.

> Unfortunately they picked a number (queue time) that doesn't have much
> functional basis on Heroku's stack at the moment, which weakens their
> argument.

I don't think it weakens their argument. It clearly shows that the biggest
problem they have is not only out of their control (even with very short run-
times, the higher the number of requests you have per minute, the more this
problem is going to affect you), but that even buying very expensive tools
integrated into Heroku's "stack" will not help you see where the problem is.
The tools were basically hiding the one problem that was solely Heroku's
responsibility.

Do not forget that even though there were a lot of statistics about how long-
running requests can cause other, much shorter requests to take just as long
and even time out, the heart of the matter is that even with a high number of
extremely short requests, the router can end up sending many requests to a
single dyno while other dynos remain idle. There were plenty of graphs, even
animated ones, showing the effect of this random dyno routing over time.

~~~
jaggederest
> It didn't say "router queue time" or "dyno queue time shown elsewhere".

I can tell you that personally when I was writing the code that calculates
that queue time value, several years ago, we didn't think such time existed.
It was either router time or nothing.

> The tools were basically hiding the one problem that was solely Heroku's
> responsibility.

Right, absolutely agree, but the problem is their current queue time
measurements do not make a strong argument for this due to the issue of clock
skew. They're essentially picking a more-or-less nonsense number and saying
that it demonstrates that Heroku is bad. The only numbers that are
fundamentally reliable are the total request times from the javascript side.

------
jlouis
There are still some important points missing from the discussion:

1. Operating at scale with parallel routing.

2. Handling faults while operating at scale with parallel routing.

3. Providing correct statistical models for the situation. The one we have
right now is a crude approximation.

4. Measuring the real system for problems.

The optimum routing is to have each dyno with 0 or 1 job at a time and a
global queue of all incoming requests. But that creates a latency problem,
since it takes time for a dyno to tell the router that it is "ready"; the net
result is very bad performance, and the global queue is a single point of
failure. The solution is to queue at the dynos, because this removes the
latency --- but at the price you see RG paying if a dyno can only serve one
request at a time.

If a dyno does not report "ready" to the routing mesh, then you can't route
optimally:

Queue length doesn't work, since a queue of length 1 may hold a request that
takes 7000ms, while another queue of length 5 consisting of five 70ms requests
is the better one to route to.

The time the last request spent in queue is not useful either, because the
very next request may be a 7000ms one.

So to solve this problem, you must do something else. You cannot use
"intelligent routing" unless you can describe how it will work distributed
across, say, 8 routing machines while avoiding latency. And while you are at
it, you had better measure your solution in a real-world scenario.
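
A toy illustration of the queue-length point (service times are made up, and
of course the router can observe lengths but not the service times hiding
behind them):

    # Routing by queue length alone picks the worse queue here.
    queues = {
      a: [7000],                # length 1, ~7000ms of pending work
      b: [70, 70, 70, 70, 70],  # length 5, ~350ms of pending work
    }
    shortest   = queues.min_by { |_, q| q.length }.first  # => :a (wrong choice)
    least_work = queues.min_by { |_, q| q.sum }.first     # => :b
    puts "by length: #{shortest}, by pending work: #{least_work}"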

~~~
bradleyjg
"The optimum routing is to have each dyno with 0 or 1 job at a time and a
global queue of all incoming requests. But this is a latency problem then
since it takes time for a dyno to tell that it is "ready". The net result is
very bad performance and the global queue is a single point of failure. The
solution is to queue because this removes the latency --- but with the price
you see RG paying if a Dyno can only serve one request at a time."

You are right that there is an inherent latency hit to intelligent routing
versus random routing. However, in a resource-constrained world, where no one
can afford infinite dynos, there is also an average latency hit to random
routing -- but rather than being largely fixed for all requests, it is highly
variable. While the magnitudes of the two need to be factored in, ceteris
paribus, low variance in latency is better.

As for your single point of failure point, there are distributed queue
algorithms that handle router failure gracefully.

~~~
mononcqc
Could you point me to such distributed queue algorithms that can handle
failure gracefully? I'm interested in reading on the topic.

~~~
bradleyjg
Sure. See <http://research.microsoft.com/pubs/153348/idleq.pdf>,
<http://arxiv.org/ftp/arxiv/papers/1103/1103.2408.pdf>, and
<http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf>.

------
goronbjorn
This incident has done wonders for RapGenius's technical brand. I don't know
how many people would've identified them as a 'tech company' before, but that
number has surely gone up.

------
pseut
Guys, you've made a lot more money than me, so you don't need my advice. But
if you want money back, you should probably be communicating in private
through your lawyers. Posts like this look like you're trying to get (more)
attention.

~~~
latchkey
I'm super glad they are taking this public and getting tons of attention. I'm
consulting for some people who are plagued with the same H12 errors. We've
spent tons of time and money trying to mitigate these errors as much as
possible, to no avail. Bringing this problem up to Heroku through their paid
support channels has always resulted in them pointing fingers at everyone but
themselves. I'm glad someone is stepping up and pushing back.

~~~
pseut
The first post, definitely. But this one?

Maybe I'd feel differently if I were a Heroku customer, so I'll defer to
people directly affected.

------
socialist_coder
Heroku's suggestion: "modernize and optimize your web stack."

I don't have any experience with Ruby web stacks, so I'm curious whether this
is actually an option for you guys. What would it take to do that? Would the
performance increase on Heroku be worth it?

It also seems like if you wanted to self-host, you would probably need to make
those same improvements, right?

Please don't take my comment the wrong way, I'm not trying to say Heroku is
somehow excused from their mistakes here. I'm just trying to understand that
suggestion from Heroku.

~~~
WillieBKevin
Here's a discussion about that point in the original article's comments:
<http://news.ycombinator.com/item?id=5216553>

The conclusion is that it will buy you a bit more time, but does not fix the
underlying issue.

In Rap Genius' case, they're large enough that they would still have
significant issues even if they switched to Cedar and Unicorn with 2-4 worker
processes.

~~~
scottshea
Agreed. I am working with a client who is slightly lower on the Heroku
customer food chain and who is using Unicorn with four workers right now. Our
next step is Puma, though that will likely not be the end point.
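
For reference, that kind of multi-worker setup is only a few lines of Unicorn
config (a minimal sketch; the numbers are illustrative and bounded by a dyno's
memory):

    # config/unicorn.rb -- run several worker processes per dyno so one slow
    # request doesn't block every other request routed to that dyno.
    worker_processes 4
    timeout 30
    preload_app true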

~~~
steveklabnik
Certainly investigate Puma on Rubinius or JRuby, and make sure you have
config.threadsafe! turned on.
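
In a Rails 3.x app that's a one-line setting (a minimal sketch, assuming the
usual config location):

    # config/environments/production.rb -- enable thread-safe mode so a
    # threaded server like Puma can actually run requests concurrently.
    config.threadsafe!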

------
instakill
I've lost a lot of faith in Heroku this last week. I'm going to spend a lot of
time investigating Cloud66/Elastic Beanstalk + EC2 for my Rails app. It's a
good excuse to improve my sysadmin abilities a bit.

------
bradleyjg
Why does Adam Wiggins repeatedly use the word 'evolve' as a transitive verb in
an awkward fashion? Is this some sort of start-up usage that I managed to
avoid thus far?

"We're working on evolving away from the global backlog concept in order to
provide better support for different concurrency models, and the docs are no
longer accurate."

"Getting user perspective is very helpful and I'll apply your feedback as we
continue to evolve our product."

"You're correct that we've made some product decisions over the past few years
that have evolved our HTTP routing layer away from the "intelligent routing"
approach that we used in 2009."

To me, "evolve" connotes natural selection -- which is rather more haphazard
than I would hope for from an engineering process.

~~~
madeofchalk
People don't like change.

Using the word "change" would make it more obvious that the changes they made
to their system (which were debatably for the worse) were deliberate.

------
tibbon
Maybe this is offtopic, but I really don't like the way Rap Genius does links.
It makes it so I essentially have to click on each link twice to get to what
it actually goes to...

~~~
teej
They're not links, they're text annotations. Rap Genius is a text annotation
platform currently focused on rap lyrics. In this case most of the annotations
happen to be a link + context, which is pretty rare.

~~~
tibbon
Yea, I got that... but in the context of a tech blog post, the interface just
doesn't seem to make sense, at least in my head. It's a blog post, not rap
lyrics.

~~~
teej
I think their hypothesis is something like all long-form text should/will be
annotated. While I think it's important to dog-food, I agree that the
execution in this case isn't great. The "links" really should just be links.

------
zmitri
I'm sorry, but I don't understand any of this hating on Rap Genius.

There's a reason they are the fastest-growing YC company ever and got a16z in
for $15M -- because they are straight killers. They have quietly created an
internet empire up to this point, and are building something that people love
and use every day.

A lot of folks wouldn't have the chutzpah to call out Heroku like that, or are
just too small to attract this kind of attention. To me it seems as though
they are helping Ruby devs save money and time. 8 dynos vs 4 dynos is a hell
of a big difference when you're starting out. Also, it seems like something
that would be pretty fun to do if you worked there.

~~~
oijaf888
Yet they are spending a crazy amount on hosting with Heroku when they could do
it themselves, get cost savings, and know the entire stack.

~~~
neura
TL;DR - They've said repeatedly that they would rather work on their
product/projects than have to deal with all of the details themselves.

The crazy amount they're spending should give you an idea of how much
throughput they have and what would be required for their own setup. That
means a lot of design, additional time implementing, more time updating to new
tech (including on the software side, in gems, etc), then in the end, ongoing
maintenance.

There's probably a lot more that I'm not mentioning because, hey, like them,
I'm not interested in spending my time developing, purchasing, and
administering my own hosting systems either. I have apps to write.

~~~
oijaf888
But it seems like they had to deal with the details themselves anyway, plus
they are paying a crazy amount of money. It seems like a lose/lose situation
for them.

Also, the amount of money really gives no indication of how optimized a site
is or how much traffic it gets. I've seen sites in the top 500 spending
~$3K/mo on hosting, and sites that get 60K uniques a month spending 2x that.

------
friendstock
Thank you so much for forcing Heroku to confront this issue!

We've been seeing strange delays and optimizing based on New Relic for a long
time... and whenever we reported this to Heroku, they would not admit to an
issue.

We ended up using threads (on the Cedar stack) to get more concurrency per
dyno.

------
WillieBKevin
Wonder how all of these people are feeling right now...

<http://success.heroku.com/>

~~~
jaytaylor
"Rap Genius became a cult phenomenon and scaled from zero to ten million users
with ease thanks to the Heroku process model."

Good for at least one chortle.

------
porker
"Explain Now, as Rap Genius is widely known for its expertise in queuing
theory" Is this true, or are they being sarcastic that if they could do it
Heroku really should've?

~~~
steveklabnik
There's actually a text annotation that explains that, demonstrating how their
platform works.

~~~
porker
Thanks Steve, when I wrote the question there wasn't one...

~~~
steveklabnik
Ah. No worries. :)

------
erichocean
Ironically, it's possible to get a huge gain over purely random load balancing
by examining just two queues at random -- essentially, you should always be
doing this since the cost is O(1) and the improvement is large.[0] This
doesn't require any distributed locking and at least would qualify as
"intelligent" routing -- probably the bare minimum needed to justify that
marketing label.

Oh, and it also scales incredibly well. Like I said, there's no reason not to
use it over purely random load balancing.

[0] <http://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf>
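
A minimal sketch of the two-choice idea, where queue lengths stand in for
whatever load signal the router can actually observe:

    # Sample two dynos uniformly at random; route to the one with the shorter
    # queue. Still O(1) per request, with no global coordination required.
    def pick_dyno(queue_lengths)
      a, b = (0...queue_lengths.size).to_a.sample(2)
      queue_lengths[a] <= queue_lengths[b] ? a : b
    end

    pick_dyno([3, 0, 5, 2])  # routes to the shorter of the two sampled queues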

~~~
bumbledraven
Wow, the gain from using two choices is indeed impressive! According to
section 3.6 (page 72): "[For a system with 500 servers, i]n equilibrium, the
expected time a [request] spends in the system given one choice ( _d_ = 1) is
1/(1-λ) {where λ is the average request rate}. Hence ... when λ = 0.99 the
expected time in the system when _d_ = 1 is 100.00; with two choices {when
_d_ = 2}, this drops to under 6."
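
For anyone wanting to sanity-check those figures, here's a quick script using
the closed forms as I read them from the thesis (1/(1-λ) for one choice, and
for two choices the sum of λ^(2^i - 2) over i >= 1):

    lam = 0.99
    t1 = 1.0 / (1.0 - lam)                            # one choice (d = 1)
    t2 = (1..40).sum { |i| lam**((2**i - 2).to_f) }   # two choices (d = 2)
    puts "d=1: #{t1.round(2)}"  # => 100.0
    puts "d=2: #{t2.round(2)}"  # => ~5.4, i.e. "under 6"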

------
bshanks
Heroku should have done something about the issue earlier, but it seems like
the problem was just poor prioritization/time management on their end. Yes,
these posts got them to finally get moving, but I wonder if perhaps RapGenius
could have had the same effect by continuing to bug them privately in the same
unyielding manner, instead of going public with it so quickly. That would have
allowed Heroku to focus their energy on fixing the problem, rather than on
worrying about PR and class action lawsuits.

Also, on the topic of lawsuits, how many small startups will go out of
business if they get hit with a class action lawsuit every time their
documentation accidentally diverges from reality? In this case, RapGenius is
small and Salesforce is big, but the legal system will apply the same standard
when the plaintiff is big and the defendant is poor. If this becomes
precedent, then soon we will have lawyers trying to treat any public post by
company employees as 'documentation', forcing startups to have a policy of not
allowing their employees to freely help others with their product in public
forums. Also, any small startup with a large competitor will have the large
competitor paying people to sign up for the product with the sole intent of
finding a bug in the documentation so that the small startup can be sued out
of business.

~~~
jacques_chester
I'm the guy who mentioned lawsuits.

> _Also, on the topic of lawsuits, how many small startups will go out of
> business if they get hit with a class action lawsuit every time their
> documentation accidentally diverges from reality?_

I was chatting with a law academic of my acquaintance; her specialty is torts
and, in particular, remedies for torts. We discussed some of the different
sorts of actions you could bring.

Heroku, in their Terms of Service (11.1), have language that basically says
"we can change our technical details without telling you". And that's very
reasonable. It
would be impossible to test every minor change with every client's
application. It would also be very annoying for clients to get dozens of
emails per day of the form "Updated foolib-2.5.3-55 to foolib-2.5.3-56a".

But torts are a different beast; they live outside and alongside contracts.
She could tell me what actions you could take in Australian torts law. We have
a tort for "misleading and deceptive conduct" [1] which would probably be
whistled up for this case, given the magnitude and length of the divergence,
but that particular tort seems to be an Australian-only innovation. US law
has, she told me, a lot of unique features that she hasn't studied very
closely. Heroku's ToS requires all disputes to be settled under California law
in Californian courts.

> _Also, any small startup with a large competitor will have the large
> competitor paying people to sign up for the product with the sole intent of
> finding a bug in the documentation so that the small startup can be sued out
> of business._

Heroku also has ToS language to deter anti-competitive behaviour, not to
mention anti-competitive practice laws. Plus, interference with a contract is
itself legally troublesome.

Generally speaking, if you can think of a HUGE GAPING PROBLEM in the law, the
lawyers and judges have already thought of it and closed it. Usually hundreds
of years ago.

[1] There is also in Australian law a _statutory_ offence of misleading and
deceptive conduct, but that would not lead to remedies for Heroku customers.

Of course, I am not a lawyer and this is not legal advice.

------
STRML
I agree that Heroku's response is pretty unbelievable and their engineering
choices very suspect. Reading the email chain between Tom & Adam really drives
home how badly this has been handled by Heroku.

Heroku is massively crippling its own product with random routing. Other cloud
providers have been able to get this right, and Heroku very obviously knows
what kinds of applications are running on its servers (e.g. deploy a Rails
application and Heroku says "Rails" in the console). It would not be difficult
to apply different routing schemes to each type of application.

Given that this has been going on for years now, Heroku is acting with either
pronounced malice or incompetence. No competent engineer would be satisfied
with switching the routers over to random and calling it a day. How could that
possibly have been approved, then remained in place for years? They must not
have realized what a grave mistake it was.

The #1 thing they should be doing _right now_ (aside from damage control) is
to move the routers over to round-robin routing. Random is the most naive
scheme possible and is laughably inappropriate for this situation.

See for yourself using this simulator:
<http://ukautz.github.com/pages/routing-simulator.html>
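
For contrast, a toy round-robin dispatcher (not Heroku's router, just an
illustration of how simple the scheme is):

    # Requests cycle through dynos in order, so no dyno is handed a second
    # request while another dyno has received none.
    class RoundRobinRouter
      def initialize(dyno_count)
        @dyno_count = dyno_count
        @next = -1
      end

      def route
        @next = (@next + 1) % @dyno_count
      end
    end

    router = RoundRobinRouter.new(4)
    results = (1..6).map { router.route }
    p results  # => [0, 1, 2, 3, 0, 1]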

------
jxf
To what extent would using something like Amazon's ELB mitigate this sort of
issue in a bring-your-own-cloud approach? Completely?

I've been looking at using something like Cloud66 and an ELB to move off of
Heroku.

~~~
jcheng
I don't think it would help at all, as ELB doesn't intelligently route either.

~~~
jxf
My understanding was that you can specify how the "health check" works for
each EC2 node, so if you can report that a particular node is having issues,
ELB can observe that. But upon further reading, you're right: you can't do
anything other than report that a node is "up" or "down", it seems.

~~~
pilif
Well, I'm talking out of my ass here, but I have this idea:

Let's say you set the health check to hit a URL of your app that maps to a
very, very cheap Rack app that does nothing but return some really short
string.

Then you set the health check timeout to a very, very low value (like 5-10ms
or something).

Now all hosts that don't respond within that low timeout are seen as down with
no requests routed to them.

So if a node that is only able to process one request at a time is busy, the
health check will fail and the request will be routed elsewhere.

This is a very poor man's solution with some drawbacks:

1) there's still a race condition here between determining that the host is up
and sending a request to it, so you might still end up with a request being
queued.

2) now you are practically doubling the latency between the load balancer and
the app server

3) you are creating quite a bit of load on that small rack app, which might
have a negative impact on overall performance.

So in the end, this might be a very bad idea (I didn't think it through fully
and I'm not in a position to try it out), but it might also be a stop-gap
measure until the problem is solved for real. It's maybe worth trying out in a
staging environment - or with a percentage of hosts.
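
For concreteness, the cheap Rack app could be as small as this (a hypothetical
config.ru; the response body is illustrative):

    # config.ru -- a do-nothing endpoint. A single-threaded dyno that is busy
    # with a real request can't answer this within an aggressive health-check
    # timeout, so the balancer would mark that node as unavailable.
    run lambda { |env|
      [200, { 'Content-Type' => 'text/plain' }, ['ok']]
    }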

~~~
bgentry
ELB health checks have a minimum interval of 2 seconds.

Also, when you hit the threshold for consecutive failures (2 or 3), the ELB
immediately closes all connections to that backend without waiting for their
responses to finish.

------
MrGando
This is big stuff...

Sorry to see Rap Genius investing all that money in New Relic; I can't really
imagine being in their shoes.

I would be so pissed.

PS: Heroku user here

------
Giszmo
Am I the only one who read "crap genius"?

------
drudru11
Holy crap - over $60k to get app performance graphs! Wow - that is super
expensive!!

------
signed0
This is unrelated, but every time I see that domain my mind thinks it's either
rapegenius.com or ragepenius.com. Surely I can't be the only one?

~~~
citricsquid
This comment seems to be made every time rapgenius.com is on HN, and every
time it gets downvoted. No, you're not the only one who thinks that, but why
does it even matter?

