
Heroku - Bamboo Routing Performance - nigma
https://blog.heroku.com/archives/2013/2/15/bamboo_routing_performance/
======
redguava
I don't understand why people think this is a great response. They know how
their routing works, just say so. It can't be that hard to give a basic
overview of it before they release a more comprehensive post.

As for the comment "Improving our documentation and website to accurately
reflect our product": that is a very roundabout way of saying "our website
indicates our service does things that it does not", which is a VERY bad thing.
People are paying for this service based on what Heroku claims it does.

If the website has been inaccurate for years, that is false advertising and
really a bigger problem than they are giving credit to.

If anything, I am more disappointed now that I have read this response, it has
not appeased anything.

~~~
spiralganglion
Documentation discrepancies happen. I've seen them with pretty much every
platform I've worked on.

Just yesterday, I found a critical discrepancy between the ActionScript
documentation and the actual behaviour of the ActionScript compiler, costing
my team a day of work. (I tried to report the issue to Adobe, but the Adobe
Bug Reporting System was down. Perhaps they need a Bug Reporting System for
the Bug Reporting System.)

I think it's pretty heroic (yeah, pun) for Heroku to own their mistake, make
the changes they've proposed, and accept the fire we've been pouring on them.
They could have easily tried to weasel their way out of this, or attack the
claims (Tesla/NYT comes to mind). Instead, they've accepted their own
wrongdoing, and have pledged to make it right.

Who cares if the explanation comes today or tomorrow? Give them a few more
hours to make sure their new round of technical claims are accurate, since
such accuracy is exactly what's at issue.

~~~
redguava
In answer to "who cares if the explanation comes today or tomorrow", I care if
the explanation comes today or tomorrow. I use Heroku and have hit scaling
issues in the last few weeks very similar to this. More information on what is
going on behind the scenes will help me immediately.

As for the discrepancy in documentation, this is one of the most critical parts
of their infrastructure and directly relates to how well applications scale. To
claim they have intelligent routing when they do not is completely misleading,
not just a minor documentation discrepancy. This isn't a tech document that got
out of date; this is straight from their main "how it works" page...
<http://www.heroku.com/how/scale>. Read the bit on routing.

~~~
bitcartel
Well, for marketing purposes, random doesn't sound as impressive as
intelligent. It's a discrepancy but it does appear to be disclosed.

On the page you link to, it says: "Incoming web traffic is automatically
routed to web dynos, with intelligent distribution of load instantly as you
scale."

When you click on "Read more about routing..." it says: "Request distribution
- The routing mesh uses a random selection algorithm for HTTP request load
balancing across web processes."
<https://devcenter.heroku.com/articles/http-routing>

~~~
RobAley
It may sound more impressive, but it's simply wrong. It's not even ambiguous.

If they had said something like "with advanced algorithmic distribution of
load instantly as you scale", it's wishy-washy enough that it's technically
correct, and those that need to know exactly how it does it will need to go and
look at the docs.

As it is, intelligent distribution tells those that need to know that the
distribution of load is based on intelligence gathered from the system, so
they may not look farther. And it's simply not true.

------
antoko
That's actually a pretty impressive response as far as it goes. Obviously
there are no details at this point, but he absolutely takes responsibility,
doesn't try to deflect or sugar coat it, and manages to find a tone that is
both professional/serious, yet also down-to-earth and earnest. I guess the
real impact will be how they go about "making it right" but in terms of a
first response to the situation the tone is near perfect.

~~~
kami8845
No this is not impressive. This is them fucking up and misleading customers
for 3 years, enjoying a great reputation and now FINALLY getting called out
for their BS. They're about to lose that great reputation that they've spent
the past years building up, so of course they're in major crisis mode and
doing everything they can to fix this.

~~~
antoko
Your response doesn't really seem to be directed at my comment. In your first
sentence you're using "this" as if you're talking about the blog post, which is
what I praise in my comment (the tone of it), but then in your second sentence
"this" refers to the routing issue raised by rapgenius. I wasn't commenting on
that at all: I'm not a Heroku customer; I was merely commenting on a GM's
response to negative press.

~~~
kami8845
Yeah, it's still not impressive. The only thing that I'd say is impressive in
response to the situation would be:

"Hi, CEO of Heroku here.

Sorry. We've been misleading customers and only telling the truth when pressed
hard, for years. We've created this financial model for all of our
customers who have been overpaying on dynos because of our shitty routing and
will be reimbursing them based on that.

We've also rolled out a second dyno-tier, called "dyno with non-shitty
routing". It's 10x as expensive, but at least we're being honest about it. All
current customers will enjoy our "dyno with non-shitty routing" for the price
they're currently paying for the next 2 years. Enough time for them to migrate
away, like any reasonable person would expect them to after this."

~~~
white_devil
This is spot-on.

_Of course_ they're going to post _something_, and of course they'll make it
sound as good as possible. But it's baffling how so many people applaud such
meaningless damage-control drivel time after time.

~~~
antoko
Considering how many people/companies fail so spectacularly at it, why would
it be baffling? I also think you underestimate the difficulty of making things
"sound as good as possible"; that one quality is the basis of the entire
marketing and political industries, and a large component of many others.
You're basically saying "It's obvious - just be perfect!" It is not that
easy.

All that being said, I'm really not "applauding" Heroku or their actions
(which are what matter); I'm waiting to hear what they'll say. In the meantime,
I thought their messaging (which matters much less) was good.

~~~
kami8845
No worries antoko. We're not really ganging up on you, just pissed off at
Heroku :) Their wording was good, better than usual when companies fuck up.
But they're deeply embedded in the startup community, so good PR with us is
expected; betraying our trust like this, however, is not.

------
ibdknox
It's a good response in that they _are_ taking responsibility, but it is
pretty obvious that they are reluctant to say anything about a fix. In my
mind, "it's hard" isn't a valid excuse in this case, especially when there are
relatively straightforward solutions that will solve this at a practical
level. For example, you could imagine a naive form of intelligent routing that
would work simply by keeping a counter per dyno:

- request comes in and gets routed to the dyno with the lowest count. Inc the
count.

- response goes out. Dec the counter.

Since they control the flow both in and out, this requires at most a sorted
collection of counters and would solve the problem at a "practical" level. Is
it possible to still end up with one request that backs up another one or two?
Sure. Is it likely? No. While this isn't as ideal as true intelligent routing,
I think it's likely the best solution in a scenario where they have incomplete
information about what a random process on a dyno can reliably handle (which
is the case on the cedar stack).

Alternatively, they could just add some configuration that allows you to set
the request density and then you _could_ bring intelligent routing back. The
couple of milliseconds that lookup/comparison would take is far better than
the scenario they're in now.

EDIT: I realized my comment could be read as though I'm suggesting this naive
solution is "easy". At scale it certainly isn't, but I do believe it's
possible, and as _this_ is their business, that's not a valid reason to do
what they're doing.
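For what it's worth, that counter scheme fits in a few lines. Here's a toy,
single-process sketch (all names are made up, and it deliberately dodges the
hard distributed-state part):

```python
class LeastBusyRouter:
    """Toy in-memory version of the counter-per-dyno idea: route each
    request to the dyno with the fewest in-flight requests."""

    def __init__(self, dynos):
        # One in-flight request counter per dyno.
        self.counts = {d: 0 for d in dynos}

    def route(self):
        # Request comes in: pick the dyno with the lowest count, inc it.
        dyno = min(self.counts, key=self.counts.get)
        self.counts[dyno] += 1
        return dyno

    def done(self, dyno):
        # Response goes out: dec the counter.
        self.counts[dyno] -= 1
```

A real router would need these counters shared (or at least approximated)
across many routing nodes, which is where the difficulty actually lives.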

~~~
gojomo
What if their inbound routing is hundreds of machines, each of which may get a
request for any of their thousands of apps, spread across tens of thousands of
web dynos?

Do you have a distributed sufficiently-consistent counter strategy that won't
itself become a source of latency or bottlenecks or miscounts under traffic
surges?

~~~
ibdknox
Atomic counters are pretty fast. Redis, for example, should be able to handle
it without breaking a sweat: <http://redis.io/topics/benchmarks>

~~~
gojomo
I doubt they want every inbound request to require:

• query remote redis for lowest-connection-count dyno(s) (from among
potentially hundreds): 1 network roundtrip

• increment count at remote redis for chosen dyno: 1 network roundtrip (maybe
can be coalesced with above?)

• when connection ends, decrement count at remote redis for chosen dyno: 1
network roundtrip

That's 2-3 extra roundtrips each inbound request, and new potential failure
modes and bottlenecks around the redis instance(s). And the redis instance(s)
might need retuning as operations scale and more state is needed.

Random routing lets a single loosely-consistent (perhaps distributed) table of
'up' dynos, with no other counter state, drive an arbitrarily large plant of
simple, low-state routers.

~~~
nikcub
This has all been solved previously. In Google App Engine, the scheduler is
aware of, for each instance:

* the type of instance it is

* the amount of memory currently being used

* the amount of CPU currently being used

* the last request time handled by that instance

It also tracks the profile of your application, and applies a scheduling
algorithm based on what it has learned. For example, the URL /import may take
170MB and 800ms to run, on average, so it would be scheduled on an instance
that has more resources available.

It does all this _prior_ to the requests running.

You can find more docs on it here:

[https://developers.google.com/appengine/docs/adminconsole/in...](https://developers.google.com/appengine/docs/adminconsole/instances)

For example:

> Each instance has its own queue for incoming requests. App Engine monitors
> the number of requests waiting in each instance's queue. If App Engine
> detects that queues for an application are getting too long due to increased
> load, it automatically creates a new instance of the application to handle
> that load

This is what it looks like from a user point of view:

<http://i.imgur.com/QFMXeT1.png>

Heroku essentially needs to build all of that. The way it is solved is that
the network roundtrips to poll the instances run in parallel to the scheduler.
You don't do:

* accept request

* poll scheduler

* poll instance/dyno

* serve request

* update scheduler

* update instance/dyno

This all happens asynchronously. At most your data is 10ms out of date. It
would also use a very lightweight UDP-based protocol and would broadcast (not
round-trip), since you send the data frequently enough, with a checksum, that
a single failure doesn't really matter; at worst it delays a request or two.
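That async pattern (route off a slightly stale local snapshot, refresh the
snapshot out-of-band) can be sketched roughly like this; the names are
hypothetical and the UDP broadcast plumbing is stubbed out as a plain method
call:

```python
import threading

class CachedLoadRouter:
    """Routes against a locally cached load table. The request hot path
    does no network I/O; load reports arrive asynchronously, so the
    data may be slightly (e.g. ~10ms) out of date."""

    def __init__(self, dynos):
        self.load = {d: 0.0 for d in dynos}  # last reported load per dyno
        self.lock = threading.Lock()

    def report(self, dyno, load):
        # Called out-of-band when a load broadcast arrives (e.g. over UDP).
        with self.lock:
            self.load[dyno] = load

    def route(self):
        # Hot path: one scan over the cached snapshot, no roundtrips.
        with self.lock:
            return min(self.load, key=self.load.get)
```

The point is that accepting slightly stale data moves all the roundtrips off
the request path.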

~~~
kawsper
> It also tracks the profile of your application, and applies a scheduling
> algorithm based on what it has learned. For eg. the url /import may take
> 170MB and 800ms to run, on average, so it would schedule it with an instance
> that has more resources available.

That is very awesome technology, but is something like that available for
non-Google people?

~~~
nikcub
Expensive commercial appliances like the popular F5 BIG-IPs can, and that is
what a lot of large-scale websites use:

<http://www.f5.com/glossary/load-balancer/>

In terms of open source, HAProxy has layer 7 algorithms but they are much
simpler:

[http://cbonte.github.com/haproxy-dconv/configuration-1.5.htm...](http://cbonte.github.com/haproxy-dconv/configuration-1.5.html#4-balance)

If you were inclined, you could write an algorithm to implement something
similar in one of the open source routers.

------
nikcub
There is a perverse conflict with platform service providers - the worse your
scheduler performs the more profitable your service will be.

You replace intelligent request scheduling with more hardware and instances,
which you charge the user for.

How much investment is there in platform service providers towards developing
better schedulers that would reduce the number of instances required to serve
an application? The answer, in this case, is "not a lot".

The incentives between provider and user are not aligned, which is why I am
more inclined to buy and manage at a layer lower with virtual machines.

Edit: AppEngine went through a similar issue. Here is an interesting response
from an engineer on their team:

[https://groups.google.com/forum/#!msg/google-appengine/y-LnZ...](https://groups.google.com/forum/#!msg/google-appengine/y-LnZ2WYJ5Q/j_w13F4oSSkJ)

~~~
praptak
> There is a perverse conflict with platform service providers - the worse
> your scheduler performs the more profitable your service will be.

I think the practical significance of this kind of incentive is overrated.
The company I work for does outsourcing work, paid by the hour. Do they have
incentives to make me produce less so that their customers pay for more hours?
Theoretically. Do they act on them? Hell, no - there is competition, and
customer satisfaction matters.

~~~
quahada
The business of government contracting shows these conflicts of interest are
real and lead to billions of dollars of waste annually.

There is plenty of competition for government work, but there are many ways to
game the system even in the rare truly open/fair competitive bids.

~~~
chc
That's because competition for government work is not really based on customer
satisfaction in the same way it is in other industries AFAIK.

------
programminggeek
Wow, I feel like Heroku is really dropping the ball here. Like, they are
acting punch drunk or something. Basically all this says is "we hear you and
we are sorry". They could have posted that a day ago. This still says nothing
about what is wrong and what they are doing to fix it.

Also, I'm not sure exactly where the line is, but at some point around, say,
$3-5k a month (100+ dynos), you really should rethink using Heroku. At that
point and higher, you really ought to know your infrastructure well enough to
optimize for scale. The "just add more dynos" approach is stupid because adding more
web fronts is often the lazy/expensive approach. Add a few queues or some
smarter caching and you'll need fewer web servers. Throw in something like
Varnish where you can and you need even fewer servers. Point being, at some
point scaling is no longer "free", it takes work and Heroku isn't magic.

~~~
7952
At $3-5k a month, Heroku may as well start offering a consultancy service
rather than hosting. Unlimited scaling without needing local talent is a
reasonable thing to want, but it's unrealistic to expect it from one single
platform.

~~~
ceejayoz
A lot of their success stories at <http://success.heroku.com/> are sites you'd
expect to be spending the $3-5k/month.

If their platform can't handle higher amounts of load, they really should
indicate as much.

------
aneth4
This is a horribly inadequate response. Prices for hardware have dropped 30%
over the last 3 years and heroku is admitting their performance has degraded
by many orders of magnitude. It's completely unacceptable to simply say, "yeah
there's a problem, we'll give you some metrics to understand it better."

Sure, it's great they responded. But the response should be "you're right, we
are fixing it and issuing credits" for the revenue gained from fraudulent
claims about the performance of their product and a credibility-straining
bait-and-switch.

------
salman89
Most people are going to come here and mention how they are not planning on
fixing the problem.

Put it into context. Heroku made this change 3 years ago, and also has had no
issues admitting the change to users. Their documentation has lagged far
behind and I believe they will be more transparent in the future. This is an
engineering decision they made a long time ago that happened to get a lot of
PR in the past 24 hours. Until there is a business reason (losing customers),
I don't see them "fixing" the problem.

~~~
twog
I think this PR has already hurt Heroku & caused them to lose customers.

~~~
WestCoastJustin
_The only thing worse than being talked about is not being talked about - OW_

You are almost investing in Heroku by using their stack and tool chain; it
isn't easy for well-established customers to just up and move. This is
probably a PR win for them, rather than a loss. Truth be told, it will be how
they handle this in the coming months that makes them win or lose customers.

~~~
camus
If you are locked in the trunk with any cloud solution, then you are a bad
programmer/sysadmin/whatever making bad decisions, period.

You should be able to move your project infrastructure quickly from one
service to another; if you can't do that, well, too bad when your
infrastructure fails...

------
xwowsersx
What the hell? It's good he owned up...I guess. But the response basically
sounds like "yeah, we've been charging the same prices over the last few years
for increasingly degraded performance and we would have continued to do so,
but someone finally caught on so I guess we have to now do something about
this, right?"

------
ibrahima
I think this is really a fine response considering the pretty terrible way the
original post was written and the community responded. The simulation was a
bit of a stretch because the supposed number of servers you need to achieve
"equivalent" performance is highly dependent on how slow your worst case
performance is, and if your worst case isn't that bad the numbers look a lot
better. I don't remember the precise math, but back when I studied random
processes we covered this problem, and the conclusion was that randomly
routing requests is generally not that much worse than doing the intelligent
thing, and doing the intelligent thing is nowhere near as trivial as Rapgenius
and random HN posters would have you believe. Given generally well-behaved
requests, the random solution should be maybe 2-3x worse, but nothing near 50x
worse.

And besides, I really don't see why someone who needs that many dynos is still
on Heroku.

~~~
tomlemon
Rap Genius cofounder:

> The simulation was a bit of a stretch because the supposed number of servers
> you need to achieve "equivalent" performance is highly dependent on how slow
> your worst case performance is, and if your worst case isn't that bad the
> numbers look a lot better

It's still pretty bad. Here's a graph of the relative performances of the
different routing strategies when your response times are much better (50%:
50ms, 99%: 537ms, 99.9%: 898ms)

[http://s3.amazonaws.com/rapgenius/1360871196_routerstyles_fa...](http://s3.amazonaws.com/rapgenius/1360871196_routerstyles_fast.png)

See <http://rapgenius.com/1504222> for more
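You can get a feel for the gap with a crude queueing simulation. This one is
mine, not Rap Genius's; the arrival rate and tail numbers are made up, loosely
echoing the percentiles above:

```python
import random

def simulate(n_dynos, n_requests, strategy, seed=0):
    """Crude single-run sketch: each dyno works through a FIFO backlog;
    a request's wait is whatever backlog (in ms) is already queued on
    its dyno. Service times are mostly ~50ms with a rare slow tail."""
    rng = random.Random(seed)
    backlog = [0.0] * n_dynos
    total_wait = 0.0
    for _ in range(n_requests):
        # Work drains during the gap before the next arrival.
        gap = rng.expovariate(1 / 10.0)  # mean 10ms between arrivals
        backlog = [max(0.0, b - gap) for b in backlog]
        service = 50.0 if rng.random() < 0.99 else 900.0
        if strategy == "random":
            target = rng.randrange(n_dynos)
        else:  # "least_busy": route to the emptiest queue
            target = min(range(n_dynos), key=backlog.__getitem__)
        total_wait += backlog[target]
        backlog[target] += service
    return total_wait / n_requests  # mean queueing delay in ms
```

Under these made-up parameters random routing's mean wait should come out
higher than least-busy routing's, but the exact ratio depends heavily on how
fat the tail is, which is exactly the parent's point.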

~~~
gingerlime
Disclaimer: I've never used heroku :)

If I understand the chart correctly, using unicorn with two workers gets you
pretty close to intelligent routing with no intelligence. I imagine going up
to three or four would make things even better... I don't know about puma/thin
etc., where you can perhaps crank it even further without too much memory
tax(?)

To me this seems like the easiest approach for everybody concerned. Heroku can
keep using random routing without adding complexity, and most users will not
get affected if they are able to split the workload on the dyno-level.

On a slight tangent: On the rails app I'm working on I'm trying to religiously
offload anything that might block or take too long to a resque task. It's not
always feasible, but I think it's a good common-sense approach to try to avoid
bottlenecks.

------
drchiu
What I find incredibly irritating about this blog response by Heroku is that
it took a very visible post on Hacker News for them to act and reconsider
their way of doing business.

They saw the potential loss in customers, and then acted. What this means is
that they never had in mind to provide the best support and product they could
for their customers before this news broke out.

Sad.

------
mcgwiz
Credit for owning the scope of the problem (allowing serious discrepancies for
3 years), which is sure to cost them trust from the community. But the skeptic
in me reminds me that it's likely there was no way out of admitting it.

What disheartens me is that the documentation discrepancy caused real,
extremely substantial aggregate monetary impact on customers, yet there is no
mention of refunds. Perhaps that will come, but in my opinion, anything short
of that is just damage control.

This is a time for them to demonstrate integrity in excess, to go above and
beyond. It's in their interest not to just paper over the whole thing.

------
spankalee
It's so refreshing to see this kind of communication. I don't use Heroku, and
don't know much about this specific issue, but their responses to downtime
and complaints have been so direct and BS-free that I'll definitely consider
them when I need a PaaS.

------
ivzar
I feel like there is an answer for this, but why are two companies in the "YC
family" at odds so publicly? If RapGenius is "starting beef" like is done in
the music industry, I find it odd that it would happen with someone on their
own "label".

Perhaps this is ignorance on my behalf of how companies who have already been
sold (Heroku) fit into the picture, but some explanation would be appreciated.

~~~
damian2000
Heroku has been owned by Salesforce.com since Dec. 2010.

------
auggierose
Well, let's put it like this. Those of us who know our programming shit and
aren't afraid of a little math know exactly what has been going on here and
that this answer is pretty much BS (what else is he supposed to say? basically
he makes minimal concessions given the facts).

------
timothya
_Working closely with our customers to develop long-term solutions_

Of the five action items they listed, it seems that only the last of them is
about actually solving the problem. I hope they are committed to it - better
visibility of the problem can help, but I'd rather not have the problem in the
first place.

~~~
raylu
Actually, that's the only one that smells like BS to me. The others have clear
meaning and goals.

~~~
dragonwriter
The other ones are things that are obvious and immediate responses to the
problem as described, and don't take any deep analysis of alternatives.

Long-term fixes actually do require deep analysis of alternatives (and even
what the appropriate parameters are for a solution that will deal with
customers problems while maintaining Heroku's scalability), and aren't
something you can do much more than make vague references to off the cuff.

The key question on that point will be follow-through.

------
RaphiePS
Interesting -- he seems to be saying that they'll explain all about the
problem, but not do anything about it.

~~~
gojomo
That's not fair. What do you think "tools to understand and improve the
performance of your apps" and "develop long-term solutions" from his bullets
mean?

But, I'm surprised they didn't wait until the "in-depth technical review" was
available to apologize. And the idea that they were informed of a problem
"yesterday" doesn't quite match the impression RapGenius gave, that they'd
been discussing this with Heroku support for a while.

~~~
callum85
I think RaphiePS's comment is fair.

"tools to understand and improve the performance of your apps" only commits
them to updating their docs and tools to reflect how their system really
works. It doesn't indicate any intention to fix the actual problem (the fact
that requests can be routed to busy dynos), nor that they will make any kind
of reimbursement to people who made business decisions based on incorrect
docs.

"develop long-term solutions" doesn't really mean anything.

~~~
RaphiePS
Yeah, that phrase is a bit suspect. It could easily mean "teaching people how
to effectively deal with our dumb routing."

------
kevinfat
Can someone explain, to people who know nothing about scaling infrastructure,
why routing to idle dynos is a hard problem?

~~~
gojomo
It requires statefulness and decision-making at the routing layer, and that's
another thing that adds overhead and can go wrong at scale. (For example,
there may be no one place with knowledge of all in-process requests. Traffic
surges may lead to arbitrary growth of state in the routing layer, rather
than at the dynos.)

There are probably some simple techniques whereby dynos can themselves
approximate the throughput of routing-to-idle, while Heroku's load-balancers
continue to route randomly. For example, if a 'busy' dyno could shed a
request, simply throwing it back to get another random assignment, most
jam-ups could be alleviated until most dynos are busy. (And even then, the load
could be spread more evenly, even in the case of some long requests and
unlucky randomization.) Heroku may just need to coach their customers in that
direction.
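The shed-and-rethrow idea is only a few lines in sketch form (names are mine;
`busy` stands in for whatever cheap busy-signal a dyno could expose):

```python
import random

def route_with_shedding(dynos, busy, rng, max_tries=5):
    """Router still picks randomly, but a busy dyno 'throws back' the
    request for another random draw, up to max_tries attempts."""
    for _ in range(max_tries):
        choice = rng.choice(dynos)
        if not busy[choice]:
            return choice
    # Everything looked busy: accept the last pick and queue there.
    return choice
```

This keeps the routers stateless; the only extra cost is a few redraws when
load is high, and it degrades to plain random routing when every dyno is busy.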

~~~
jacques_chester
I must be stupid, because surely it can't be _that_ hard to partition the
routing groups?

For example, use a hashing algorithm that switches to 1 of N intelligent
routers based on domain name.

If you pick the right algo you can pretty much add routers whenever you like.

(It would be nice to know what Heroku have tried so far, at the very least to
drive off know-it-all blowhards like me.)
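The hash-to-a-router idea is essentially consistent hashing. A toy sketch
(router names and vnode count are mine):

```python
import bisect
import hashlib

class HashPartitioner:
    """Maps each app's domain onto a ring of routers (consistent
    hashing), so each 'intelligent' router only tracks state for its
    own slice of apps, and adding a router only remaps a fraction
    of domains."""

    def __init__(self, routers, vnodes=64):
        # Place several virtual nodes per router on a hash ring.
        self.ring = sorted(
            (int(hashlib.md5(f"{r}:{v}".encode()).hexdigest(), 16), r)
            for r in routers
            for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def router_for(self, domain):
        # Hash the domain and walk clockwise to the next ring point.
        h = int(hashlib.md5(domain.encode()).hexdigest(), 16)
        i = bisect.bisect(self.keys, h) % len(self.ring)
        return self.ring[i][1]
```

With a stable hash like this, adding a fourth router remaps only roughly a
quarter of the domains rather than reshuffling everything, which is the
property that makes "add routers whenever you like" work.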

~~~
gojomo
They could partition the routing, and maybe they do. But then (a) there's one
extra hop mapping to the specialist routing group; and (b) it's still nice to
have super-thin minimal-state routers, for example with just a list of up
dynos updated once every few seconds, as opposed to live dyno load state
updated thousands of times per second.

I too hope their full response gives more insight into their architecture... I
have a couple of small projects at Heroku already and may use them for
several larger ones in the future.

~~~
jacques_chester
> _there's one extra hop mapping_

I thought about mentioning this. Because 1 small hop is still less than random
blowouts in response time.

You can even cheat by pushing the router IP into DNS. Hop eliminated.

> _it's still nice to have super-thin minimal-state routers_

I imagine Heroku's customers are not interested in what is nice for Heroku,
they want Heroku to do the icky difficult stuff _for them_. That was the whole
pitch.

Anyway, we're arguing about Star Wars vs Star Trek here because we have no
earthly idea what they've tried.

~~~
gojomo
_...pushing the router IP into DNS..._

Maybe, but they don't currently give each app its own IP, and might not want
the complications of volatile IP reassignments, DNS TTLs, and so on. (Though,
their current "CNAME-to-yourapp.herokuapp.com" recommendation would allow for
this.)

 _...want Heroku to do the icky difficult stuff..._

Yes, but to a point. Customers also want Heroku to provide a simple model that
allows scaling as easy as twisting a knob to deploy more dynos, or move to a
higher-resourced plan. Customers accept some limitations to fit that model.

Maybe Heroku has a good reason for thin, fast, stateless routing -- and that
works well for most customers, perhaps with some app adjustments. Then,
coaxing customers to fit that model, rather than rely on any sort of 'smart'
routing that would be overkill for most, is the right path.

We'll know a lot more when they post their "in-depth technical review" Friday.

------
encoderer
(Wonders what this response would look like if Elon Musk was running Heroku.)

~~~
pkulak
RapGenius wasn't deliberately making shit up to get page views.

------
damian2000
So the issue only affects Bamboo? That's what it seems to be saying.

~~~
thenduks
It does _seem_ to be what they are saying but, unfortunately, no :(

Random request routing is also present on Cedar [1]. The difference is that,
on Cedar, you can easily run multi-threaded or even multi-process apps (the
latter being harder due to a 512MB memory limit), which can mitigate the
problem but does not solve it. Modifying your app so all of your requests are
handled extremely quickly also mitigates the problem, but does not solve it.

Seems to me the obvious solution is to do these things (multi-threaded app
server, serve only/mostly short requests) _and_ use at least a somewhat
intelligent routing algorithm (perhaps 'least connections' would make sense).

[1] - [https://devcenter.heroku.com/articles/http-routing#request-d...](https://devcenter.heroku.com/articles/http-routing#request-distribution)

------
zensavona
Maybe I'm missing something here; this response speaks specifically about
Bamboo. Don't all new services now run on Cedar?
------
mhartl
This is a great response, and I'll look forward to the follow-ups in the days
to come. Kudos to the Heroku team. Bravo.

------
tyler_grady
Is it me not understanding Disqus, or did Heroku's moderator just delete my
comment?

~~~
benatkin
I think I saw your comment and that they must have deleted it. Apparently
their idea of keeping it civil means keeping out links to the blog post that
it was a response to.

------
alberth
It seems strange to me to read in Heroku's response how forthcoming they are
in accepting blame and responsibility for "a degradation in performance over
the past 3 years".

Yet they state their action plan to "fix" this issue is to update their
DOCUMENTATION and no mention of fixing the DEGRADATION issues itself.

Just bizarre.

~~~
dragonwriter
> Yet they state their action plan to "fix" this issue is to update their
> DOCUMENTATION and no mention of fixing the DEGRADATION issues itself.

This is flat out untrue. The third bullet point in their action plan is to
update their documentation, and the fifth is "Working closely with our
customers to develop long-term solutions".

Updating the documentation to accurately reflect what the platform does is
obviously critical to allow people to make decisions and manage applications
on the platform as it is, so is an important and immediate part of the action
plan.

Long-term fixes to the problem are also important, and are explicitly part of
the action plan. It's clear that they haven't identified what those solutions
are, but it's not at all true that they haven't mentioned them as part of the
action plan.

------
austingunter
I'm very curious to see what the technical review turns up tomorrow.

This feels like something that would have been connected to the Salesforce
acquisition 3 years ago: making the service less efficient in order to
increase profits or hit revenue targets on paid accounts, not to mention
saving money on the free ones.

It would be a little bit like Tesla not only selling you the Model S, but also
selling you the electricity you charge the vehicle with. At some point, they
make the car less efficient, forcing you to charge more often, and then
claiming they didn't document this very well. Frankly, there are only so many
people who will be capable enough electrical engineers (or, in Heroku's case,
sysadmins) to catch the difference and measure it.

The apology should be, "we misled you, and betrayed your trust. Here's how
we're planning on resolving that, and working to rebuild our relationship with
our customers over the next year. [Insert specific, sweeping measures...]"

------
podperson
Seems to me like a classy response to a real problem from Heroku.

We all need to remember that there are no magic bullets. The fact that Heroku
can get a startup to, say, 5M uniques per day by dragging some sliders on a
web panel and running up a bill on a corporate AMEX is pretty impressive.

At some point scaling a web business becomes a core competency and one needs
to deal with it. I'm guessing by the time scaling an app on Heroku becomes an
issue, if better understanding your scaling needs and handling them directly
isn't going to save you a TON of money, your business model is probably
broken.

------
tomlemon
Rap Genius cofounder:

Our response: [http://rapgenius.com/Oren-teich-bamboo-routing-performance-l...](http://rapgenius.com/Oren-teich-bamboo-routing-performance-lyrics)

------
habosa
So do the issues in the RapGenius post only affect those on the Bamboo stack?
I've been procrastinating on migrating to Cedar, but this could be a very good
reason to finally do it.

Also, I really love seeing a company take responsibility like this. I know the
situations (and the stakes) are not comparable but this is a lot better than
what Musk did when Tesla got a bad review. As a company just take the blame
and say you can and will fix it, that's good enough for most people.

------
twog
Honest question: why would Rap Genius still be on Heroku if they needed 100
dynos? Why not go directly to AWS at that scale? The cost savings would be
pretty significant. Am I missing something?

~~~
ibdknox
Ops guys cost a lot more than just using Heroku, not to mention the cost of
simply having the responsibility of servers (even if they are virtual). Never
underestimate the value of just not having to think about something,
especially when you're a small group of people.

~~~
philwelch
I think the amount of time and energy they've invested in studying Heroku's
routing and queueing strategy counts as having to think about something.

------
mattquiros
Did they just say that they have no plans to return to intelligent routing,
just making naive routing more visible to you?

------
instakill
Bamboo routing? Is Cedar not affected?

~~~
psynapse
I think Cedar is just not affected _as badly_ because it will route > 1
request at a time to a dyno, which helps if you're using something like
Unicorn.
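For context, the multi-worker setup being described is configured through Unicorn. A minimal sketch of what that looks like on Cedar (the worker count and timeout below are illustrative, not recommendations — tune workers to your dyno's memory):

```ruby
# config/unicorn.rb — illustrative Unicorn config for a Cedar dyno
worker_processes 3   # fork 3 workers, so one dyno can serve 3 requests at once
timeout 30           # match Heroku's 30-second router timeout
preload_app true     # load the app before forking; workers share memory via CoW
```

paired with a Procfile entry along the lines of `web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb`.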

------
wowzer
At this point they haven't really done anything. I'm really curious to see
what they come up with.

------
seivan
Wait, so those guys were on Bamboo, and complaining? Fuck, that is so not
cool.

We've been on cedar ever since it launched, and been running puma threads or
unicorn workers. The idea of one dyno per request is bullshit, and I wasn't
sure if they were on cedar or not. A dyno is an allocated resource (512 MB,
not counting db, k/v store, etc.).

How ballsy of them to complain when they are doing it wrong.

~~~
badgar
> We've been on cedar ever since it launched, and been running puma threads or
> unicorn workers. The idea of one dyno per request is bullshit, and I wasn't
> sure if they were on cedar or not. A dyno is an allocated resource (512mb,
> not counting db, k/v store etc)

It doesn't matter if you think one dyno per request is "bullshit" or not,
Rails isn't multithreaded, so what do you propose they do? Using unicorn_rails
on Cedar lets you fork off a few subprocesses to handle more requests on the
dyno queue which gets you a constant factor bite at the dyno queue lengths, a
few weeks or months of scale at best - it's not a real solution.

Heroku knows that Rails on Cedar is _just_ as affected by their inability to
route requests and they're only not copping to it in this blog post because
they don't have a customer running Cedar complaining so loudly. Which is
cowardly.

> How ballsy of them to complain when they are doing it wrong.

If you mean that deploying a rails app to Heroku is doing it wrong - a
sentiment many are agreeing with right now - then yes, you're correct!

~~~
btilly
If you pay attention to queueing theory, you'd know that even a modest amount
of parallelism per worker will let you run much closer to capacity while still
having very few bad request pileups.

Another way to put that is that using Cedar lets you get acceptable end user
performance with far fewer dynos.
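The queueing-theory claim is easy to sanity-check with a toy simulation (all numbers here — arrival rate, service times, dyno counts — are invented for illustration): route each request to a uniformly random dyno, then compare the average wait when 32 workers are spread across 32 single-worker dynos versus pooled into 8 dynos of 4 workers each.

```ruby
# Toy model of random (non-intelligent) routing: the router sends each
# request to a random dyno; within a dyno, the next-free worker takes it.
def average_wait(num_dynos:, workers_per_dyno:, requests:, router_rng:)
  # free_at[d][w] = time when worker w of dyno d next becomes free
  free_at = Array.new(num_dynos) { Array.new(workers_per_dyno, 0.0) }
  total_wait = 0.0
  requests.each do |arrival, service|
    dyno  = router_rng.rand(num_dynos)             # random routing
    w     = free_at[dyno].index(free_at[dyno].min) # earliest-free worker
    start = [arrival, free_at[dyno][w]].max
    total_wait += start - arrival
    free_at[dyno][w] = start + service
  end
  total_wait / requests.size
end

svc_rng = Random.new(42)
# 10,000 requests, one every 40 ms, service time uniform in [0, 2) seconds:
# ~25 worker-seconds of load per second against 32 workers (~78% utilization)
requests = (0...10_000).map { |i| [i * 0.04, svc_rng.rand * 2.0] }

solo   = average_wait(num_dynos: 32, workers_per_dyno: 1,
                      requests: requests, router_rng: Random.new(1))
pooled = average_wait(num_dynos: 8, workers_per_dyno: 4,
                      requests: requests, router_rng: Random.new(1))
# Pooling the same 32 workers into fewer, fatter dynos cuts the average wait.
```

The pooled configuration wins because a request only waits if all four workers in its dyno are busy, which is much rarer than one worker being busy — the same reason a few Unicorn workers per dyno go a long way.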

