
Answering your questions about Heroku routing and web performance - adamwiggins
https://blog.heroku.com/archives/2013/4/2/routing_and_web_performance_on_heroku_a_faq
======
ebbv
It seems to me that Heroku is still failing to understand (or at least cop to)
the fact that the switch from intelligent to randomized routing meant losing a
major reason people chose Heroku in the first place.

A lot of Heroku's apparent value came from the intelligent routing feature.
Everybody knew that it was harder to implement than random routing, that's why
they were willing to pay Heroku for it.

Nobody's arguing random routing isn't easier and more stable; of course it is.

The problem is that by switching over to it, Heroku gave up a major selling
point of their platform. Are they really blind enough not to know this? I have
a hard time believing that.

It seems to me the real way to make people happy is to discount the "base"
products which come with random routing and make intelligent routing available
as a premium feature. Of course, people who thought they were getting
intelligent routing should be credited.

~~~
adamwiggins
I hear you. Heroku's value proposition is that we abstract away infrastructure
tasks and let you focus on your app. Keeping you from needing to deal with
load balancers is one part of that. If you're worried about how routing works
then you're in some way dealing with load balancing.

However, if someone chose us purely because of a routing algorithm, we
probably weren't a great fit for them to begin with. We're not selling
algorithms, we're selling abstraction, convenience, elasticity, and
productivity.

I do see that part of the reason this topic has been very emotional (for us
inside Heroku as well as our customers and the community) is that the Heroku
router has historically been seen as a sort of magic black box. This matter
required peeking inside that black box, which I think has created a sense of
some of that magic being dispelled.

~~~
cynicalkane
You sound like a politician talking to someone of the opposite party: you say
"I hear you", but then completely fail to address anyone's concerns. Selling a
"magic black box" that guarantees certain properties, then changing those
properties and lying about having changed them, creates a liability for users
who want to do serious work.

A major selling point of Heroku was that scaling dynos wouldn't be a risk.
This guarantee is now gone, and it's not coming back soon even if the routing
behavior is reverted, because users value good communication and trust with
their providers. Heroku's responses are blithe non-acknowledgment
acknowledgments of this problem.

~~~
bcgraham
This is really unfair. This comment:

>A lot of Heroku's apparent value came from the intelligent routing feature.
Everybody knew that it was harder to implement than random routing, that's why
they were willing to pay Heroku for it.

is being addressed by Adam in this comment:

>Heroku's value proposition is that we abstract away infrastructure tasks and
let you focus on your app. Keeping you from needing to deal with load
balancers is one part of that. If you're worried about how routing works then
you're in some way dealing with load balancing.

I think Adam is getting at a really fair point here, which is that nobody
really minds which particular algorithm is used. If A-Company uses
"intelligent routing" and B-Company uses "random routing," but B-Company has
better performance and shorter queue times, who are you going to choose?
You're going to choose B-Company.

At the end of the day, "intelligent routing" is really nothing more than a
feather in your cap. People care about performance. That's what started this
whole thing - lousy performance. Better performance is what makes it go away,
not "intelligent routing."

~~~
cynicalkane
Intelligent routing and random routing have different Big O properties. For
someone familiar with routing, or someone who's looked into the algorithmic
properties, "intelligent routing" gives one high-level picture of what the
performance will be like (good with sufficient capacity), whereas random
routing gives a different one (deep queues at load factors where you wouldn't
expect deep queues).

This is why it was good marketing for Heroku to advertise intelligent routing,
instead of just saying 'oh, it's a black box, trust us'. You need to know, at
the very least, the asymptotic performance behavior of the black box.

And that's why the change had consequences. In particular, RapGenius designed
their software to fit intelligent routing. For their design, the number of
dynos needed to guarantee good near-worst-case performance increased with the
_square_ of the load, and my back-of-the-envelope math suggests the average
case grows as O(n^1.5).

The original RapGenius post documents the numbers here:
<http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics>

The alleged fix, "switch to a concurrent back-end", is hardly trivial, and it
doesn't solve the underlying problem of maldistribution and underutilization.
Maybe intelligent routing doesn't scale, but 1) there are efficient
non-deterministic algorithms that have the desired properties, and 2) it
appears the old "doesn't scale" algorithm actually worked better at scale, at
least for RapGenius.
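
A toy simulation makes the asymmetry easy to see. This is nothing like
Heroku's real router (single-threaded dynos, one request served per dyno per
tick, made-up parameters), but it shows how random routing builds deep
backlogs at load factors where shortest-queue routing keeps every queue
shallow:

    # Toy model: single-threaded dynos, each serving one request per tick,
    # at a fixed load below capacity. All parameters are made up.
    DYNOS = 50
    TICKS = 1_000
    LOAD  = 0.8                       # arrivals per dyno per tick (80% load)

    def simulate(pick)
      queues = Array.new(DYNOS, 0)
      worst  = 0
      TICKS.times do
        (LOAD * DYNOS).round.times do
          queues[pick.call(queues)] += 1         # route one incoming request
        end
        queues.map! { |q| [q - 1, 0].max }       # each dyno finishes one request
        worst = [worst, queues.max].max
      end
      worst
    end

    random   = ->(qs) { rand(qs.size) }                       # Cedar-style
    shortest = ->(qs) { qs.each_index.min_by { |i| qs[i] } }  # Bamboo-style

    puts "worst backlog under random routing:  #{simulate(random)}"
    puts "worst backlog under shortest-queue:  #{simulate(shortest)}"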

------
deedubaya
Haters gonna hate.

I wonder how many people bitching here are actual customers who are having
problems that haven't been addressed with a solution. I'm guessing that number
is low.

Oh, you're a _potential_ customer? That's why you're bitching? About a problem
you may or _may not_ have if you _actually choose the product?_ Think about
that argument for a second.

I've never seen such a transparent response and follow up as I have from
Heroku on this issue. Most other companies would have gone into immediate
damage mitigation mode and let the wound heal instead of re-opening it and
_giving feedback on how to fix the problem_ as Heroku has.

I applaud the Heroku team for their effort on their platform and being a
kickass company.

~~~
stickfigure
I'm a real customer with real problems. Grep this page for latchkey's
description of it.

The funny thing is, I don't have much sympathy for Rails users. Scaling
problems with a single-threaded, serial request-processing architecture? No
surprise there. But we have inexplicable H12 problems with Node.js. There's
something broken in the system and it _isn't_ random routing.

~~~
Cabal
You're talking like Node offers real concurrency. It doesn't.

~~~
stickfigure
There's nothing wrong with Node's concurrency. Our app, like most webapps, is
I/O bound. Any individual instance should be able to handle thousands of
concurrent requests as long as they are all blocked on I/O.

Being able to process more than one concurrent request (as Node can) is "real
concurrency". Java-style native threading is a step above and beyond this, and
unnecessary for most web applications.

------
bradleyjg
Random routing to concurrent servers works fairly well if the kind of long-
running requests you need to worry about spend a lot of time blocking on some
external service (e.g. a database call). Then you get a lot of benefit from
cooperative or preemptive multitasking on the server, so the performance
characteristics of each server, from the point of view of a new request, are
roughly the same, and going to a random server is pretty good.

However, if you have long-running requests because they make intensive use of
server resources (CPU, RAM), then concurrent servers buy you very little. In
that case, sending a new request to a server that is chugging along on a
difficult problem is significantly different from sending it to one that
isn't. That's where knowing the resource state of each server, and routing
accordingly, is of huge benefit.

While load balancing is a very difficult problem, with some counterintuitive
aspects, it is an area of active research, and there are some very clever
algorithms out there.

For example, this article
(<http://research.microsoft.com/pubs/153348/idleq.pdf>) from UIUC and
Microsoft introduces Join-Idle-Queue, which is suitable for distributed
load balancers, has considerably lower communication overhead than Shortest
Queue (AFAICT the original 'intelligent' routing algorithm), and compares its
characteristics to both SQ and random routing.
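
Stripped of the distributed-systems machinery, the core of JIQ looks roughly
like this (a single-process Ruby sketch; the class and method names are mine,
not the paper's):

    # Simplified sketch of the Join-Idle-Queue idea. Real JIQ runs across
    # many distributed dispatchers; this just shows the data flow.
    class Dispatcher
      def initialize
        @idle = []                    # the "I-queue": dynos that reported idle
      end

      def report_idle(dyno)           # called by a dyno when it finishes work,
        @idle << dyno                 # off the critical path of routing
      end

      def route(request, all_dynos)
        target = @idle.shift || all_dynos.sample  # known-idle dyno, else random
        target.handle(request, self)
      end
    end

    class Dyno
      def handle(request, dispatcher)
        # ... process the request ...
        # On completion, join a dispatcher's I-queue (a random dispatcher, in
        # the paper's JIQ-Random variant) rather than being polled by routers.
        dispatcher.report_idle(self)
      end
    end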

------
tvladeck
All of the stuff Heroku is doing now to mitigate the routing issues for their
Bamboo customers are things they should have done when this first became an
issue. They are not going above and beyond in any way to make up for the time
and money their customers lost.

Again, this is not about

-how advanced Heroku's technology is on an absolute level

-how challenging routing is for their scale

-what competitors offer in this space and for what prices

This is only about the delta between what Heroku sold their customers and what
their customers received. They are collapsing the delta now, by being honest
about what they are selling (and improving their offering, it seems), but they
are doing _nothing_ to address the long time for which the delta was
significant for a subset of their customers.

~~~
chc
Your complaints are all about the past. What are you looking for here? For
Heroku to invent a time machine?

~~~
ambrice
How about refunds? Isn't that the normal response when a company screws up in
a way that causes the customers to pay more money?

~~~
adamwiggins
Yes, we've given credits (or a refund, at the customer's preference) in cases
where lack of visibility or inaccurate docs led to over-provisioning of dynos.

There are actually very few cases where people paid more money than they would
have otherwise. Heroku is a service with your bill prorated to the second. For
the most part, if people don't like the performance (which is measurable
externally via benchmarks and browser inspectors), they leave the service.
Many people who hit problems with visibility and performance did exactly that.

Naturally, we'll be working hard to try to recapture their business, as well
as to remove any reasons that existing customers might leave as they hit
performance or visibility problems scaling up in the future.

------
cmelbye
They still fail to understand that using Unicorn doesn't magically fix this
issue. Like, at all. It simply means that the dyno gets tied up when n+1
requests (where n is the number of Unicorn workers) get randomly routed to it,
instead of just 1. It's in no way comparable to a node.js server that handles
thousands of concurrent requests asynchronously. They're simply two different
designs, and Ruby's traditional design is fundamentally incompatible with
Heroku's router.

~~~
mattsoldo
That's not right. In any configuration, the goal is to _minimize_ queue time.
What is critical to doing this is having a request queue and a pool of
concurrent "workers" (to use the generic queueing theory term) behind it.

Unicorn uses the operating system's TCP connection queue to hold incoming
requests that it is not able to immediately serve. While n+1 requests can
(and will) get routed to a single dyno, this only results in 1 request being
queued. It will be queued until the first of the in-process requests is
served, which takes roughly the average response time for the app. Given that
the other n requests did not get queued (queue time = 0), the average queue
time will equal Sum(queue times) / # requests = (average response time) / (n + 1).
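
Plugging made-up numbers into that arithmetic:

    # Made-up numbers: n = 3 Unicorn workers, 100 ms average response time,
    # and n + 1 = 4 requests land on the dyno at once.
    n, response_ms = 3, 100
    queue_times = [0] * n + [response_ms]  # n served immediately, 1 waits
    puts queue_times.sum.to_f / (n + 1)    # => 25.0, i.e. response_ms / (n + 1)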

------
streptomycin
He forgot to explain why they won't be refunding customers who were defrauded.

~~~
res0nat0r
Because they weren't victims of fraud.

~~~
dangrossman
Is promising one service but delivering another not fraud?

Promises from the Heroku website pre-Rap Genius posts:

    
    
        "Incoming web traffic is automatically routed to web dynos, with intelligent distribution of load instantly"
    
        "Intelligent routing: The routing mesh tracks the availability of each dyno and balances load accordingly. Requests are routed to a dyno only once it becomes available. If a dyno is tied up due to a long-running request, the request is routed to another dyno instead of piling up on the unavailable dyno’s backlog."
    
        "Intelligent routing: The routing mesh tracks the location of all dynos running web processes (web dynos) and routes HTTP traffic to them accordingly."
    
        "the routing mesh will never serve more than a single request to a dyno at a time"
    

Actual service provided: requests are routed randomly to dynos regardless of
how many requests they are currently handling or their current load.

~~~
res0nat0r
Your definition of fraud differs from most: "Wrongful or criminal deception
intended to result in financial or personal gain."

They weren't _intentionally_ and _purposefully_ misleading people. Not having
up-to-date docs on your website, or not knowing how the underlying backend
works, is not fraud.

As I've mentioned before, if every AWS customer could sue Amazon for not
understanding how all of the underlying tech worked, or could sue when some of
the docs were out of date, there would be more lawyers working there than
engineers.

------
hglaser
This is a great response. I'm curious why something like this wasn't posted
within 24 hours of RapGenius going public. I'd bet that a more thorough,
technical reply would have mitigated a lot of the PR issues.

~~~
adamwiggins
Agreed, I wish we could have done it much sooner. It took a shocking amount of
time to sort through all the entangled issues, emotion, and speculation to try
to get to the heart of the matter, which ultimately was about web performance
and visibility.

Also, we wanted to respond to our _customers_ first and foremost, and general
community discussion second. So we spent close to a month on
skype/hangout/phone with hundreds of customers understanding how and at what
magnitude this affected their apps.

That was hugely time-consuming, but it gave us the confidence to speak about
this in a manner that would be focused on customer needs instead of purely
answering community speculation and theoretical discussions.

~~~
hglaser
Thanks for replying. As a paying Heroku customer (who's not affected by the
routing issue), while seeing a blog post earlier would have been nice, it's
great to hear that you spent so much time with affected customers.

~~~
adamwiggins
Glad to hear you're not affected. But we always like talking to customers;
feel free to drop me a line at adam at heroku dot com if you'd ever like to
spend a few minutes chatting on skype or jabber.

------
bifrost
I'd say there's a major flaw here - "the new world of concurrent web backends"
- if it were 1995 I might agree with you, but concurrency in web-app servers
is not new. The lack of concurrency in test/demo app situations is totally
understandable; in a production environment you'd have to be completely
bonkers to think that's ever useful. I also agree that "having to deal with
load balancing" is something that most people don't get and shouldn't really
have to get, but when the way you do load balancing is so fundamentally flawed
that it's worse than round-robin DNS, you also clearly don't understand it.

To be fair, I have to say I meet people all the time who don't understand it
at all and think $randomcrappything is great at load balancing. If
"connections go to more than one box!" is your metric, then yes, that's load
balancing. My metric is "do you send connections to servers that are
responsive and not overloaded, maybe with session affinity?", and in general
most hardware load balancer products since 1998 have supported that. So if
you're not better than 1998 technology, you may want to reevaluate your
solution.

~~~
bgentry
_I'd say there's a major flaw here - "the new world of concurrent web
backends" - if it were 1995 I might agree with you, but concurrency in
web-app servers is not new._

Sadly, concurrency is relatively new and unfamiliar territory to many in the
Ruby on Rails community.

------
persei8
Is it true that you don't fully buffer the body of POST requests at the
router? This limitation works against Unicorn's design and makes it hardly a
"fix": [http://rubyforge.org/pipermail/mongrel-unicorn/2013-April/00...](http://rubyforge.org/pipermail/mongrel-unicorn/2013-April/001743.html)

~~~
geekylucas
Heroku only buffers the request headers:
[https://devcenter.heroku.com/articles/http-routing#request-b...](https://devcenter.heroku.com/articles/http-routing#request-buffering)

------
scottshea
So their solution to the random routing is for the customer to switch to
Unicorn/Puma on JRuby. Wow.

~~~
adamwiggins
Yes, because that is the solution. Empirically.

We've run many experiments over the past month to try other approaches to
routing, including recreating the exact layout of the Bamboo routing layer
(which would never scale to where we are today, but just as a point of
reference). None have produced results that are anywhere near as good as using
a concurrent backend. (I'd love to publish some of these results so that you
don't have to take my word for it.)

That said, we're not done. There are changes to the router behavior that could
have an additive effect with apps running Unicorn/Puma/etc, and we'll continue
to look into those. But concurrent backends are a solution that is ready and
fully-baked today.

~~~
habosa
Please publish these results. I think a chart showing that Unicorn + Random
routing is better than Thin + Intelligent routing would go a long way to
ending this whole thing. That's assuming that you can make deploying a Unicorn
app as easy as it was with Thin ('git push heroku')

~~~
adamwiggins
We might. But what does this actually get us? It helps clear Heroku's name,
but it doesn't help our customers at all. I'd prefer to spend our time and
energy making customers' lives better.

Given the choice between continuing the theoretical debate over routing
algorithms vs working on real customer problems (like the H12 visibility
problem mentioned elsewhere in this thread), I much prefer the latter.

~~~
habosa
I respect that mindset; I just don't think it would hurt. Maybe a middle
ground would be a full-scale tutorial on how to switch from Thin on
Bamboo/Cedar to Unicorn on Cedar for Rails users. It's a non-trivial process
and I know I'd like some help with it. And in that same tutorial/article you
could throw down the benchmarks you ran as motivation/justification.

------
ROFISH
Something that's been bugging me about Heroku is that the dyno price has
stayed the same ever since launch: $0.05 per hour. Compared to services like
DigitalOcean and AWS (who have lowered prices significantly in the past few
years), Heroku is starting to get very expensive.

The 2X dyno at 2x cost doesn't really make me happy, it just invites me to
spend more money when it would be more cost-efficient to move.

------
cjackson27
I can't speak to the issues that people are running into when they reach large
scale, but I run a small app with two dynos and we've been having issues with
H12 request timeout errors for weeks now. This has been bringing down our
production app for periods of about fifteen minutes almost daily.

I've been completely disappointed with Heroku's support so far. First they
obviously skimmed my support request and provided a canned response that was
completely off base. Their next response didn't come for four days and only
after I called their sales team to see what I could do to get better support.
Their only option is a $1k / mo support contract. If you're running a mission
critical app, I'd think twice before choosing Heroku.

~~~
adamwiggins
Diagnosing H12 errors is really challenging. One thing I can recommend is
using the http-request-id labs feature:
<https://devcenter.heroku.com/articles/http-request-id> With this enabled and
some code in your app, you can correlate your app's request logs directly
against the router logs and trace what happened with any particular H12.
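
For example, a minimal Rack middleware along these lines (a sketch: it
assumes the feature surfaces the ID as an X-Request-ID header, per that
article, and the log line format here is just illustrative) puts the ID on
every app log line:

    # Sketch of a Rack middleware for that correlation. With the request ID
    # on every app log line, you can grep both the app's and the router's
    # logs for the same request when chasing an H12.
    class RequestIdLogger
      def initialize(app)
        @app = app
      end

      def call(env)
        id      = env["HTTP_X_REQUEST_ID"]   # injected by the Heroku router
        started = Time.now
        status, headers, body = @app.call(env)
        ms = ((Time.now - started) * 1000).round
        $stdout.puts "request_id=#{id} status=#{status} service=#{ms}ms"
        [status, headers, body]
      end
    end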

I'd be happy to help you do this if you're game. Contact me via adam at heroku
dot com.

Could you also email me some links to your support tickets so I can check out
what happened there?

------
derengel
Heroku's message: "If you have slow clients you are screwed." Unicorn is
designed to only serve fast clients on low-latency connections.

And no, they don't do any buffering.

~~~
adamwiggins
I think that overstates it a bit, but yes, there are problems with Unicorn and
slow clients. We're investigating:
[https://blog.heroku.com/archives/2013/4/3/routing_and_web_pe...](https://blog.heroku.com/archives/2013/4/3/routing_and_web_performance_on_heroku_a_faq#comment-850974645)

If this is an immediate problem for you, it might be worth your while to make
your app threadsafe, which gives you more concurrent webserver options.

------
latchkey
I am working for a fairly large Heroku app running Node on ~50-100 web dynos
with another 20-50 backends. Here are the problems as I see them:

We get H12s all the time. Randomly. The only suggestion we get from Heroku is
to make the requests process as fast as possible. Thus, we've spent a
considerable amount of time going through everything we can possibly do to
make all requests respond as fast as possible. I've given up. I see this as a
fundamental issue with the routing system. If you are going to use Heroku for
a large production deployment, H12s (and your users getting dropped
connections) will become a fact of life.

There is no autoscaling. We have no idea how many dynos we actually need, so
we over-provision in order to handle peak traffic times. This must be a great
money maker for Heroku. There is no incentive for them to build autoscaling
into their system, because that would mean they wouldn't make as much money.
Yes, autoscaling is a hard problem to solve, but there should at least be a
plan to start on it, and there is none that I have found.

Up until someone bitched loudly, nothing was happening to fix any of this. We
have an expensive paid support contract with Heroku, and before this whole
routing issue blew up in public, their only recommendation was to tune the app
more and buy into New Relic for ~$8k/month. We did both and found that New
Relic didn't give us any relevant information to help. We did a NodeTime trial
for ~$49/mo and that actually helped a lot in identifying slow spots in our
app. We fixed all the slow spots and still see an endless stream of H12s.
Regardless, it shouldn't take a public bitch slapping for a company to listen
to its customers.

You log into a dyno and see a load average of 30+. Who knows if that number is
accurate or how big the underlying box really is, but regardless, I can't
imagine that number being good. Am I getting H12s because I'm on an overloaded
box, or is it because the routing system is fundamentally broken? I don't
know, and nobody can tell me. This is not a good position to be in.

I have heard from several sources that Heroku isn't happy being on AWS and has
been wanting to migrate off AWS for a while now. So, if your hosting provider
isn't happy with their hosting provider, there must be a reason for that, and
in the bigger picture you, the customer, are getting screwed.

Given these things, I will never recommend that a company use Heroku. It is
great if you know you are never going to have more than one dyno, but if you
think you are going to run a large production system on it, it is far better
to find something else. Which brings me to another rant... how come none of
these other PaaS solutions are as easy as Heroku? The git deploy is seriously
the one thing they got mostly right. I'd love to see someone build a layer on
top of all the PaaS solutions so that I can just deploy my code to any one of
them (or even multiple).

~~~
adamwiggins
We're aware of the random H12 problems. Some apps are affected pretty badly,
others not at all, and we're not sure why yet. Sorry that you've had such a
bad experience with this. We're continuing to investigate. If we're not able
to find a solution in a timely fashion, I'll completely understand if you no
longer want to use our product for this app.

Knowing how many dynos you need is definitely a problem. We have implemented
autoscaling in the past... but it always sucked. It's hard to find a
one-size-fits-all solution. Rather than ship something sub-par, we chose not
to ship anything at all.

I understand a lot of people do well with autoscaling libraries and 3rd party
add-ons. Would be curious to hear your experience with any of these.

I completely agree that it shouldn't take complaining in public to get a
company to listen to its customers. That was our biggest mistake in all of
this, IMO — not listening.

For dyno load, have you tried log-runtime-metrics?
<https://devcenter.heroku.com/articles/log-runtime-metrics> It provides per-
dyno load averages.

I gladly accept your compliment that our git deploy remains the best on the
market. :)

I'm sorry we haven't been able to serve you better. Let me know if you'd be
willing to talk via skype sometime — even if you end up leaving the platform
(or already have), I'd like to understand in more depth where we went wrong so
we can do better in the future.

~~~
latchkey
Your response only reinforces my hard-won opinion that Heroku should never be
used as a production environment by any business that is trying to be
successful and popular. Admitting that you have no idea why critical areas of
your infrastructure are causing issues, while at the same time charging people
an arm and a leg for services (we pay ~$4k/mo), feels like theft to me. I've
built solutions for a large porn company that runs on significantly less
infrastructure than what we are running on Heroku and handles 100x more
traffic. Something is wrong with the dyno/router model, and maybe it is that
you guys are just oversubscribed and not admitting to it in public.

Yes, autoscaling is hard. I have apps on Google AppEngine and see their issues
as well. That said, at least they are trying. Maybe even take one of those
3rd-party libraries and try to harden and adapt it into a real solution? I
think the real problem, though, lies in the fact that there aren't any good
metrics for what dynos are doing, so there is no signal for when something is
too busy. Yes, log-runtime-metrics puts out some numbers, but those numbers
are meaningless when all I have is a slider to change the amount of money we
are paying you.

I should qualify that git deploy compliment, because there are issues there as
well. For example, why do you have to rebuild the npm modules from scratch
each time? Why not have a directory full of pre-built modules for your dynos
that are just copied into my slug? This relatively simple change would greatly
increase the speed of deployments. Never mind that deployments aren't reliable
and fail randomly. At least it is easy to just try again.

~~~
adamwiggins
Again, sorry to hear about your bad experience. Hard-to-diagnose errors, no
autoscaling or other method to know proper dyno provisioning, and slow deploy
times — these things suck.

Would love the chance to win back your trust and hang onto your business. Let
me know your app name (in email if you prefer) and I can see if there's
anything we can do for you in the near term.

------
jholman
I'd like to start by acknowledging that I'm one of the "non-customers who are
watching from the sidelines". I think Adam's right that this is an important
distinction.

Adam, there's something that confuses me about this. I'm no expert in routing
theory, nor have I done the experiments, so forgive me if my reasoning misses
something.

I understand why RapGenius took you up on your original promises of
"intelligent routing", and I think I understand what you're saying about
scaling, and how scaling "intelligent routing" is so far unsolvable, and the
motivation for your transition from Bamboo to Cedar, especially in the context
of concurrent clients. What I don't understand is this:

It seems to me that if you split into two (or more) tiers, and random-load-
balance in the front tier (hit first by the customer), and then at the second
tier only send requests to unloaded clients, that you eliminate RapGenius's
problem for customers who followed your specific recommendations for good
performance on Bamboo (to go single-threaded and trust the router).

Do you have reason to believe that this doesn't one-shot RapGenius's problem?
Do you have strategic/architectural reasons for rejecting this even though it
would work? Did you try it and it failed? What's the story there?

Maybe I'll write a simulator to (dis)prove my naive theory. :P
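
Roughly, the architecture I have in mind (a Ruby sketch; all names invented,
and obviously a toy):

    # The point: the front tier stays stateless and random, while each
    # second-tier router owns a small disjoint pool of dynos, so the
    # "who is busy" state stays local instead of being shared fleet-wide.
    Dyno = Struct.new(:backlog) do
      def idle?
        backlog.empty?
      end
    end

    class PoolRouter                  # second tier: intelligent, small scope
      def initialize(dynos)
        @dynos = dynos
      end

      def route(request)
        target = @dynos.find(&:idle?) || @dynos.min_by { |d| d.backlog.size }
        target.backlog << request
      end
    end

    class FrontRouter                 # first tier: random, trivially scalable
      def initialize(pool_routers)
        @pool_routers = pool_routers
      end

      def route(request)
        @pool_routers.sample.route(request)
      end
    end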

~~~
adamwiggins
> It seems to me that if you split into two (or more) tiers, and random-load-
> balance in the front tier (hit first by the customer), and then at the
> second tier only send requests to unloaded clients [...]

I'm unclear how you'd think introducing a second tier changes things. That
tier would need to track dyno availability and then you're right back to the
same distributed state problem.

Perhaps you mean if the second tier was smaller, or even a single node? In
that case, yes, we did try a variation of that. It had some benefits but also
some downsides, one being that the extra network hop added latency overhead.
We're continuing to explore this and variations of it, but so far we have no
evidence that it would provide a major short-term benefit for RG or anyone
else.

> Do you have reason to believe that this doesn't one-shot RapGenius's
> problem?

As a rule of thumb, I find it's best to avoid one-shots (or "specials"). It's
appealing in the short term, but in the medium and long term it creates huge
technical debt and almost always results in an upset customer. Products made
for, and used by, many people have a level of polish and reliability that will
never be matched by one-offs.

So if we're going to invest a bunch of energy into trying to solve one (or a
handful) of customer's problems, a better investment is to get those customers
onto the most recent product, and using all the best practices (e.g.
concurrent backend, CDN, asset compilation at build time). That's a more
sustainable and long-term solution.

~~~
jholman
Sorry, yes, I'm supposing that the second tier serves fewer dynos;
sufficiently few that your solutions from 2009 (that motivated you to
advertise intelligent routing in the first place) are still usable.

> As a rule of thumb, I find it's best to avoid one-shots (or "specials").

Absolutely, and I would never suggest that. However, it's not just RG that has
this problem, right? If I understand correctly, isn't it every single customer
who believed your advertising and followed your suggested strategy to use
single-threaded Rails, and doesn't want to switch?

So it's not about short or medium term; it's about letting customers take the
latency hit (as you note), in order to get the scaling properties that they
already paid for.

------
jordanthoms
My biggest issue with Heroku is the general slowness of the API - maybe I'm
just impatient, but most simple commands like listing releases or viewing logs
take at least a second, sometimes five, before anything happens. Pushes also
take quite a while; even the git push part is much slower than pushing to
GitHub. It's a general sluggishness that gets annoying after a while.

If they could get all the API requests down under 500ms I'd be much happier.

~~~
adamwiggins
Yeah, the developer-facing control surfaces on the platform (API calls and git
push, mainly) have gotten slower over the past year or two. This is on my list
of personal pet peeves, but so far it has not made it onto our list of
priorities.

We try to drive priorities based on what customers want, not what we want; and
what we've heard in the last year or so is all about app uptime, security, and
now performance and visibility.

I'm very much hoping that bringing back "fast is a feature" on the developer-
facing portions of the product is something we can work on this year.

------
randall
I think the most annoying thing is they still don't answer Rap Genius's
questions about being owed money for paying megabucks for New Relic. I mean,
if you offer a service that provides incorrect data for two years and you
don't offer any sort of framework for reimbursement, that still seems annoying
at best and dishonest at worst.

~~~
adamwiggins
We spent quite a bit of time trying to find a one-size-fits-all framework.
There just wasn't one, so we've done credits on a case-by-case basis.

Sorry you find it annoying. It's what was best for our customers.

------
bhauer
I am surprised they do not moderate the comments on their blog. There is one
visible presently that is plainly offensive.

------
siong1987
Since we are on the question of the visibility of Heroku dynos: how much CPU
power does each dyno have?

What about a 2X dyno?

~~~
adamwiggins
This is a tough area. If you go look at various types of infrastructure
providers (e.g. EC2, Linode, Rackspace) you'll see that they always end up
making up vague units of measurement (e.g. "cores") and then showing all the
resources in reference to whatever the base unit is. So there's really no good
way to talk about CPU power like there is with memory.

That said, I can say that a 1X dyno is not very powerful compared to, say, any
server you'd purchase for your own datacenter. Our intention is that 2X dynos
provide twice the CPU horsepower, although CPU and I/O are harder to allocate
reliably in virtualized environments.

------
j-kidd
From the article:

> Q. Did the Bamboo router degrade?

> A. Yes. Our older router was built and designed during the early years of
> Heroku to support the Aspen and later the Bamboo stack. These stacks did not
> support concurrent backends, and thus the router was designed with a per-app
> global request queue. This worked as designed originally, but then degraded
> slowly over the course of the next two years.

From Adam's message on Feb 17th, 2011
([https://groups.google.com/forum/?fromgroups=#!topic/heroku/8...](https://groups.google.com/forum/?fromgroups=#!topic/heroku/8eOosLC5nrw)):

> You're correct, the routing mesh does not behave in quite the way described
> by the docs. We're working on evolving away from the global backlog concept
> in order to provide better support for different concurrency models, and the
> docs are no longer accurate. The current behavior is not ideal, but we're on
> our way to a new model which we'll document fully once it's done.

It looks like random load balancing was already the expected behavior 2 years
ago? The "slow degradation" part seems a bit dishonest to me.

~~~
adamwiggins
There are two separate issues here, and it's easy to get them confused. One is
the slow degradation on Bamboo without any change to the routing algorithm
code, and the other was the explicit product choice for Cedar with a different
code path in the router. Both are described fully here:
[https://blog.heroku.com/archives/2013/2/16/routing_performan...](https://blog.heroku.com/archives/2013/2/16/routing_performance_update)

The reason it's easy to confuse these two is also part of what confused us at
the time. The slow degradation of the Bamboo routing behavior was causing it
to gradually become more and more like the explicit choice we had made for our
new product.

But of course it's up to you (and everyone else observing) to judge whether
this was some kind of malicious intent to mislead, or a series of oversights
that added up to some serious problems for our customers. Either way, those
are problems we are now doing everything in our power to be fully transparent
about, to rectify, and to make sure never happen again.

~~~
j-kidd
Sorry about the accusation. I read the Bamboo issue wrongly. The article from
Feb 2013 seems to imply that the slow degradation happened from 2011 to 2013.
It starts with "Over the past couple of years", and I guess that's what got me
confused. The FAQ clarifies that the slow degradation happened from 2009 to
2011.

------
thatthatis
What data do you have to show that the random selection algo has superior
performance to a round-robin algo?

~~~
adamwiggins
We investigated round-robin. With N routing (or load balancer) nodes and any
degree of request variance, round robin effectively becomes random very
quickly: each node's rotation drifts out of phase with the others, so the
interleaved stream of arrivals at any given dyno looks random.

------
badgar
> 1k req/min

Also known as <17 requests per second... or a _trickle_ of traffic. Hooray for
using bigger numbers and a nonstandard unit to hide inadequacy!

Does Heroku use req/min throughout their service? I can't understand why they
would, unless they also can't build the infrastructure to measure on a per-
second basis.

> After extensive research and experimentation, we have yet to find either a
> theoretical model or a practical implementation that beats the simplicity
> and robustness of random routing to web backends that can support multiple
> concurrent connections.

Does this CTO think companies like Google and Amazon route their HTTP traffic
randomly? No... he knows there are scalable routing solutions and that random
routing isn't the best. So he cites "simplicity and robustness." Here, that
means "we can't be bothered."

~~~
ozgune
(I was on the bigger engineering team at Amazon that looked into this between
'04 and '08.)

After we had notable issues with Cisco's hardware load balancers, there was an
internal project at Amazon aimed at developing scalable routing solutions.

After years of development effort, it turned out that the "better" solutions
didn't work well in production, at least not for our workloads. So we went
back to million-dollar hardware load balancers and random routing.

I don't know if things changed after I left, but I can tell you it wasn't an
easy problem. So I completely buy the robustness and simplicity argument these
guys are making.

~~~
senderista
Nope, DRR is still dead :)

