A lot of Heroku's apparent value came from the intelligent routing feature. Everybody knew that it was harder to implement than random routing, that's why they were willing to pay Heroku for it.
Nobody's arguing random routing isn't easier and more stable; of course it is.
The problem is that by switching over to it, Heroku gave up a major selling point of their platform. Are they really blind enough not to know this? I have a hard time believing that.
It seems to me the real way to make people happy is to discount the "base" products which come with random routing and make intelligent routing available as a premium feature. Of course, people who thought they were getting intelligent routing should be credited.
However, if someone chose us purely because of a routing algorithm, we probably weren't a great fit for them to begin with. We're not selling algorithms, we're selling abstraction, convenience, elasticity, and productivity.
I do see that part of the reason this topic has been very emotional (for us inside Heroku as well as our customers and the community) is that the Heroku router has historically been seen as a sort of magic black box. This matter required peeking inside that black box, which I think has created a sense of some of that magic being dispelled.
Developers don't want to pay for abstraction just for abstraction's sake, they want to pay for abstraction of difficult things. Intelligent routing is one of those difficult things. Random routing is easy, which is of course why it's also more reliable, but also why you're seeing people feeling like they didn't get what they paid for.
I should be clear; this doesn't affect me personally but I am totally sympathetic with the customers who are bent out of shape about this and I still see a divide between your response and why people are upset, and I'm trying to help you bridge that divide.
It's interesting — very few customers are actually bent out of shape about this. (A few are, for sure.) It's more non-customers who are watching from the sidelines that seem to be upset. I do want to try to explain ourselves to the community in general, and that's what this post was for. But my first loyalty is to serving our current customers well.
I'm using gunicorn with python, and if I use the sync worker the request queue easily hits 10 seconds and nothing works; if I switch to gevent or eventlet, new-relic tells me that redis is taking the same time stuck getting a value. This is using the same code in my current provider that works just fine with eventlet and scales well.
To add insult to injury, adding dynos actually degrades performance.
Can you email me at adam at heroku dot com and I'll connect you with one of our Python experts? I can't promise we'll solve it, but I'd like to take a look.
Would you be willing to pair up with one of our developers on your app's performance? If so email me (adam at heroku dot com).
Fwiw, I've found the addon / db connection limits to be the primary blocker when load testing so far.
Seems to me you don't get it. Sure there are some very vocal non-customers but you also have a lot of potential customers and users (spinning up free instances) evaluating your product and hoping for a better product. I agree that your true value is the abstraction you provide. Some of these potential customers want to ensure Heroku is as good an abstraction as promised to justify the cost and commitment.
I'd argue that we dropped the ball on that before (on web performance, at least), and are rectifying it now.
A major selling point of Heroku is that scaling dynos wouldn't be a risk. This guarantee is now gone and it's not coming back soon even if routing behavior is reverted, because users prefer good communication and trust with their providers. The responses of Heroku are blithe non-acknowledgement acknowledgements of this problem.
>A lot of Heroku's apparent value came from the intelligent routing feature. Everybody knew that it was harder to implement than random routing, that's why they were willing to pay Heroku for it.
is being addressed by Adam in this comment:
>Heroku's value proposition is that we abstract away infrastructure tasks and let you focus on your app. Keeping you from needing to deal with load balancers is one part of that. If you're worried about how routing works then you're in some way dealing with load balancing.
I think Adam is getting to a really fair point here, which is that nobody really minds which particular algorithm is used. If A-Company is using "intelligent routing" and B-Company uses "random routing," but B-Company has better performance and shorter queue times, who are you going to choose? You're going to choose B-Company.
At the end of the day, "intelligent routing" is really nothing more than a feather in your cap. People care about performance. That's what started this whole thing - lousy performance. Better performance is what makes it go away, not "intelligent routing."
This is why it was good marketing for Heroku to advertise intelligent routing, instead of just saying 'oh it's a black box, trust us'. You need to know, at the very least, the asymptotic performance behavior of the black box.
And that's why the change had consequences. In particular, RapGenius designed their software to fit intelligent routing. For their design, the number of dynos needed to guarantee good near-worst-case performance increased with the square of the load, and my back-of-the-envelope math suggests the average case increases by O(n^1.5).
The original RapGenius post documents them here: http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics
The alleged fix, "switch to a concurrent back-end", is hardly trivial and doesn't solve the underlying problem of maldistribution and underutilization. Maybe intelligent routing doesn't scale but 1) there are efficient non-deterministic algorithms that have the desired properties and 2) it appears the old "doesn't scale" algorithm actually worked better at scale, at least for RapGenius.
That's not the point.
As I think of it, "performance" is an observation of a specific case while intelligent/random can be used to predict performance across all cases.
"A major selling point of Heroku is that scaling wouldn't be a risk"
This is interesting, especially the word "risk." Can you expand on this?
From my point of view, the routing is "random" and thus kind of unpredictable. If scaling becomes more of an issue with my business, the last thing I want is to have random scaling issues that I cannot do anything about because the load balancer is queuing requests randomly to my dynos.
I want my business to be predictable and if I'm not able to have it I'm going to pack my stuff and move somewhere else.
For now, I'm happy with you except for your damn customer service. They take way too long to answer our questions!
Can you email me (adam at heroku dot com) some links to your past support tickets? I'd like to investigate.
Thanks for running your app with us. Naturally, I expect you to move elsewhere if we can't provide you the service you need. That's the beauty of contract-free SaaS and writing your app with open standards: we have to do a good job serving you, or you'll leave. I wouldn't want it any other way.
Fire-fighting during the scaling phase is a problem that every fast-growing software-as-a-service business will probably have to face. I think Heroku makes it easier, perhaps way easier; but I hope our marketing materials etc have not implied a scaling panacea. Such a thing doesn't exist and is most likely impossible, in my view.
As I understand it, Heroku (on the Bamboo stack) didn't up and decide "Hey, we're gonna switch from intelligent to random routing." The routers are still (individually) intelligent. It's just that there are more of them now, and they were never designed to distribute their internal queue state across the cluster. The system as a whole behaves more and more like a random router as the number of intelligent bamboo routing nodes increases.
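A rough way to see this effect is a toy simulation. The sketch below assumes each router only counts the requests it has sent itself, picks the dyno that looks least busy from that local view (ties broken at random), and that no request finishes during the burst. The function name and parameters are made up for illustration and are not taken from Heroku's router.

```python
import random

def max_dyno_load(num_requests, num_dynos, num_routers, seed=0):
    """Return the largest number of requests stacked on any one dyno.

    Each router is "intelligent" only about its own traffic: it tracks the
    requests it sent, not what the other routers sent (an assumption).
    """
    rng = random.Random(seed)
    actual = [0] * num_dynos                              # true load per dyno
    local = [[0] * num_dynos for _ in range(num_routers)]  # per-router view
    for _ in range(num_requests):
        view = local[rng.randrange(num_routers)]  # request lands on a random router
        least = min(view)
        candidates = [d for d, count in enumerate(view) if count == least]
        dyno = rng.choice(candidates)             # least busy dyno in the local view
        view[dyno] += 1
        actual[dyno] += 1
    return max(actual)
```

With a single router the local view is the true view, so a burst of 200 requests over 200 dynos never puts two requests on the same dyno. With 200 routers, each one sees almost none of the others' traffic and the placement is effectively random.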
For Cedar, we did make an explicit decision. People on the leading edge of web development were running concurrent backends and long-running requests. Our experimental support for Node.js in 2010 was a big driver here, but also people who wanted to use concurrency in Ruby, like Ilya Grigorik's Goliath and Rails becoming threadsafe. These folks complained that the single-request-per-backend algorithm was actually in their way.
This plus horizontal scaling / reliability demands caused us to make an explicit product decision for Cedar.
that makes sense, i was wondering how intelligent routing was implemented in the first place.
In the early days, Heroku only had a single routing node that sat out front. So it wasn't a distributed systems problem at that point. You could argue that Heroku circa 2009 was more of a prototype or a toy than a scalable piece of infrastructure. You couldn't run background workers, or large databases. We weren't even charging money yet.
Implementing a single global queue in a single node is trivial. In fact, this is what Unicorn (and other concurrent backends) do: put a queue within a single node, in this case a dyno. That's how we implemented it in the Heroku router (written in Erlang).
Later on, we scaled out to a few nodes, which meant a few queues. This was close enough to a single queue that it didn't matter much in terms of customer impact.
In late 2010 and early 2011 our growth started to really take off, and that's when we scaled out our router nodes to far more than a handful. And that's when the global queue effectively ceased to exist, even though we hadn't changed a line of code in the router.
The problem with this, of course, is that we didn't give it much attention because we had just launched a new product which made the explicit choice to leave out global queueing. It's this failure to continue full stewardship of our existing product that's the mistake that really hurt customers.
So to answer your question, there was never some crazy-awesome implementation of a distributed global queue that we got rid of. It was a single node's queue, a page of code in Erlang which is not too different from the code that you'll find in Unicorn, Puma, Gunicorn, Jetty, etc.
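To make the "single queue in a single node" idea concrete, here is a minimal sketch in Python. It is not the Erlang router code; threads stand in for backends, and the name run_single_queue is hypothetical.

```python
import queue
import threading

def run_single_queue(requests, num_workers):
    """Serve requests from one shared queue; return (request, worker_id) pairs."""
    q = queue.Queue()
    served = []
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            item = q.get()          # a worker only takes work when it is free
            if item is None:        # sentinel: no more work for this worker
                q.task_done()
                return
            with lock:
                served.append((item, worker_id))
            q.task_done()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for r in requests:
        q.put(r)
    for _ in threads:
        q.put(None)                 # one sentinel per worker
    q.join()
    for t in threads:
        t.join()
    return served
```

Because every worker pulls from the same queue, a request never waits behind a busy worker while another one is idle. That property is easy inside one process and is exactly what gets hard once the queue state has to be shared across many nodes.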
This sentence is really good, and I would humbly suggest you consider hammering it home even more than you have.
I gathered early on that there were inherent scaling issues with the initial router (which makes sense intuitively if you think about Heroku's architecture for more than 10 seconds), but I feel like most of the articles I've seen the past few weeks have this "Heroku took away our shiny toys because they could!" vibe. (Alternative ending: "Heroku took away our shiny toys to expand their market to nodeJS!")
Anyway, that's my take.
So it was an oversimplified system that worked great but wasn't scalable and was at some point going to completely fall over under increasing load.
IMExp, this is not a wrong thing to build initially and it's not wrong to replace it either. But the replacement is going to have a hard time being as simple or predictable. :-)
Sorry, correct me if I missed something, but I believe that as the overall volume of system transactions increases (thus necessitating more "intelligent" nodes) the volume of random dispersal from the core routers increases as well, which can create situations like what we saw with RapGenius where some requests are 100ms and others are 6500ms (the reason being that random routing is not intelligent and can assign jobs to a node that's completely saturated). Adding more and more intelligent nodes doesn't solve the crux of the issue, which is the random assignment of jobs in the core routers to the "intelligent" routers/nodes.
This whole situation boils down to this: "Intelligent routing is hard, so fuck it" and that's why everyone is pissed off. Heroku could've said "hey Intelligent routing is hard so we're not doing that anymore" but instead they just silently deprecated the service. It's a textbook example of how to be a bad service provider.
Ok, let's really dig in on this. Is this truly a case of us being lazy? We just can't be bothered to implement something that would make our customers' lives better?
The answer to these questions is no.
Single global request queues have trade-offs. One of those trade-offs is more latency per request. Another is lower availability for the app. Despite the sentiment here on Hacker News, most of our customers tell us that they're not willing to accept lower availability and higher latency per request in exchange for the efficiencies of a global request queue.
Are there other routing implementations that would be a happy medium between pure random (routers have no knowledge of what dynos are up to) and perfect, single global queue (routers have complete, 100% up-to-date knowledge of what dynos are up to)? Yes. We're experimenting with those; so far none have proven to be overwhelmingly good.
In the meantime, concurrent backends give the ability to run apps at scale on Heroku today; and offer other benefits, like cost efficiencies. That's why we're leaning on this area in the near term.
What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application?
Also why would that mean "lower availability" or "higher latency"? Did you look into zookeeper?
This is how it works. Dynos register their presence into a dyno manager which publishes the results into a feed, and then all the routing nodes subscribe to that feed.
But dyno presence is not the rapidly-changing data which is subject to CAP constraints; it's dyno activity, which changes every few milliseconds (e.g. whenever a request begins or ends). Any implementation that tracks that data will be subject to CAP, and this is where you make your choice on tradeoffs.
> why would that mean "lower availability" or "higher latency"?
I'll direct you back to the same resources we've referenced before:
> Did you look into zookeeper?
This is the best question ever. Not only did we look into it, we actually invested several man-years of engineering into building our own Zookeeper-like datastore:
Zookeeper and Doozerd make almost the opposite trade-off as what's needed in the router: they are both really slow, in exchange for high availability and perfect consistency. Useful for many things but not tracking fast-changing data like web requests.
If that's what you actually mean then I'd ask: Can't the dynos reject requests when they're busy ("back pressure")?
AFAIK that's the traditional solution to distributing the "leastconn" constraint.
In practice we've implemented this either with the iptables maxconn rule (reject if count >= worker_threads), or by having the server immediately close the connection.
What happens is that when a loadbalancer hits an overloaded dyno the connection is rejected and it immediately retries the request on a different backend.
Consequently the affected request incurs an additional roundtrip per overloaded dyno, but that is normally much less of an issue than queueing up requests on a busy backend (~20ms retry vs potentially a multi-second wait).
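Roughly, the reject-and-retry loop looks like the sketch below. The Backend class, its capacity field, and route_with_retry are hypothetical names for illustration; a real setup would do the rejection at the TCP level (e.g. via the connection limit mentioned above) rather than in Python.

```python
import random

class Backend:
    """A backend that refuses new work once all of its workers are busy."""

    def __init__(self, capacity):
        self.capacity = capacity    # assumed number of worker threads
        self.in_flight = 0

    def try_accept(self):
        if self.in_flight >= self.capacity:
            return False            # back pressure: reject instead of queueing
        self.in_flight += 1
        return True

def route_with_retry(backends, rng):
    """Try backends in random order; return (index, attempts), or (None, attempts)
    if every backend rejected the request."""
    order = list(range(len(backends)))
    rng.shuffle(order)
    for attempts, idx in enumerate(order, start=1):
        if backends[idx].try_accept():
            return idx, attempts
    return None, len(order)
```

Each rejected attempt costs one extra round trip, but the request ends up on a backend that can start working on it immediately.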
PS: Do you seriously consider Zookeeper "really slow"?! http://zookeeper.apache.org/doc/r3.1.2/zookeeperOver.html#Pe...
> What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application
I suspect this is a consequence of the CAP theorem. You'll end up with every loadbalancer needing a near-instantaneous perception of every server's queue state and then updating that state atomically when routing a request. Now consider the failure modes that such a system can enter and how they affect latency. Best not to go there.
My understanding is that Apache Zookeeper is designed for slowly-changing data.
But that's not true. Only the loadbalancers concerned with a given application need to share that state amongst one another. And the number of loadbalancers per application is usually very small. I.e. the number is <1 for >99% of sites and you need quite a popular site to push it into the double digits (a single haproxy instance can sustain >5k connect/sec).
Assigning pooled loadbalancers to apps while ensuring HA is not trivial, but it's also not rocket science. I'm a little surprised by the heroku response here, hence my question which constraint I might have missed.
Dyno-presence per application is very slowly-changing data by zookeeper standards.
> the number of loadbalancers per application is usually very small. I.e. the number is <1 for >99% of sites and you need quite a popular site to push it into the double digits (a single haproxy instance can sustain >5k connect/sec).
So most Heroku sites have only a single frontend loadbalancer doing their routing, and even these cases are getting random routed with suboptimal results?
Or is the latency issue mainly with respect to exactly those popular sites that end up using a distributed array of loadbalancers?
> Assigning pooled loadbalancers to apps while ensuring HA is not trivial, but it's also not rocket science.
To me the short history of "cloud-scale" (sorry) app proxy load balancing shows that very well-resourced and well-engineered systems often work great and scale great, that is until some weird failure mode unbalances the whole system and response time goes all hockey stick.
> Dyno-presence per application is very slowly-changing data by zookeeper standards.
OK, but instantaneous queue depth for each and every server? (within a given app)
The system as a whole behaves more and more like a random router as the number of intelligent bamboo routing nodes increases
The point is that this process was gradual and implicit, so there's no point at which the intelligence was explicitly "deprecated". That doesn't excuse how things ended up, but it does explain it to some degree.
But that's the service people thought they were getting and what they wanted.
If Heroku prices out the intelligent routing and says, "Ok you can have intelligent routing with your current backend stack, but it's going to cost you $25/mo for every 10 dynos, or you can switch your stack and use randomized routing for free," then they are empowering their customers to make the choice rather than dictating to them what they should do.
Aside from that, I am extremely sympathetic to Heroku's engineering point here --- it's obviously hard for HN to extract the engineering from the drama in this case! Randomized dispatch seems like an eminently sound engineering solution to the request routing problem, and the problems actually implementing it in production seem traceable almost entirely to††† the ways Rails managed to set back scalable web request dispatch by roughly a decade††††.
††† IT IS ALL LOVE WITH ME AND THIS POINT COMING UP HERE...
†††† ...it was probably worth it!
The solution is to combine the two approaches. You split the 100 nodes into 10 groups of 10, you route randomly to one of the groups, and then within a group you route intelligently. This works really well. The probability of one of the request queues filling up is astronomically small, because for a request queue to fill up, all 10 request queues in a group have to fill up simultaneously (and as we know from math, the chance that an event with probability p occurs at n places simultaneously is exponentially small in n). Even if you route randomly to 50 groups of 2, that works a lot better than routing randomly to 100 groups of 1 (though obviously not as well as 10 groups of 10). There is a paper about this: http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...
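Here is a small sketch of the grouped scheme, as a static toy model: requests are placed one at a time and none of them finish, so it only shows how unevenly a burst lands. The function name and parameters are assumptions for illustration.

```python
import random

def grouped_max_load(num_dynos, group_size, num_requests, seed=0):
    """Route randomly to a group, then to the least loaded dyno in that group.

    Returns the largest number of requests placed on any single dyno.
    Assumes group_size divides num_dynos evenly.
    """
    rng = random.Random(seed)
    load = [0] * num_dynos
    num_groups = num_dynos // group_size
    for _ in range(num_requests):
        g = rng.randrange(num_groups)                       # random choice of group
        members = range(g * group_size, (g + 1) * group_size)
        target = min(members, key=lambda d: load[d])        # intelligent within group
        load[target] += 1
    return max(load)
```

Calling grouped_max_load(100, 1, 100) is pure random routing, grouped_max_load(100, 10, 100) is 10 groups of 10, and grouped_max_load(100, 100, 100) is a single global queue. Comparing the three over a few seeds gives a feel for how quickly the worst-case pile-up shrinks as the groups grow.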
This is essentially what they are suggesting: run multiple concurrent processes on one dyno. Then the requests are routed randomly to a dyno, but within a dyno the requests are routed intelligently to the concurrent processes running on that dyno. There are two problems with this: (1) dynos have ridiculously low memory so you may not be able to run many (if any) concurrent processes on a single dyno (2) if you have contention for a shared resource on a dyno (e.g. the hard disk) you're back to the old situation. They are partially addressing point (1) by providing dynos with 2x the memory of a normal dyno, which given a Rails app's memory requirements is still very low (you probably have to look hard to find a dedicated server that doesn't have at least 20x as much memory).
They could be providing intelligent routing within groups of dynos (say groups of 10) and random routing to each group, but apparently they have decided that this is not worth the effort. Another thing is that apparently their routing is centralized for all their customers. Rapgenius did have what, 150 requests per second? Surely that can even be handled by a single intelligent router if they had a dedicated router per customer that's above a certain size (of course you still have to go to the groups of dynos model once a single customer grows beyond the size that a single intelligent router can handle).
There's a tradeoff between:
* a well-engineered request handler (a solved problem more than a decade ago) and
* an efficient development environment (arguably a nearly-unsolved problem before the Rails era)
And I feel like mostly the Heroku drama is a result of Rails developers not grokking which end of that tradeoff they've selected.
Of course Heroku is under no obligation to do anything, but its customers have to justify its cost and low performance relative to a dedicated server. And most applications run just fine on a single or at most a couple dedicated servers, which means you don't have routing problems at all, whereas to get reasonable throughput on Heroku you have to get many Dynos, plus a database server. A database server with 64GB ram costs $6400 per month. You can get a dedicated server with that much ram for $100 per month. Heroku is supposed to be worth that premium because it is convenient to deploy on and scale. Because of these routing problems, which may require a lot of engineering effort in your application (e.g. making it use less memory so that you can run many concurrent request handlers on a single Dyno), it's not even clear that Heroku is more convenient.
I'm not sure there are such providers, and if there aren't, I think it's safe to point the finger towards Rails.
As a system for efficiently handling database-backed web requests, Rails is archaic. Not just because of its memory use requirements! It is simultaneously difficult to thread and difficult to run as asynchronous deferrable state machines.
These are problems that Schmidt and the ACE team wrote textbooks about more than 10 years ago.
(Again, Rails has a lot of compensating virtues; I like Rails.)
> I'm not sure there are such providers, and if there aren't, I think it's safe to point the finger towards Rails.
This is not sound logic. I described above two methods for solving the problem: (1) increase the memory per Dyno (see below: they're doing this, going from 512MB to 1GB per Dyno IIRC, which although still low will be a great improvement if that means that your app can now run 2 concurrent processes per Dyno instead of 1), or (2) do intelligent routing for small groups of Dynos. Do you understand the problem with random routing, and why either of these two would solve it? If not you might find the paper I linked to previously very interesting:
"To motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that n balls are thrown into n bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately log n / log log n with high probability. Now suppose instead that the balls are placed sequentially, and each ball is placed in the least loaded of d >= 2 bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this case, the maximum load is log log n / log d + Θ(1) with high probability [ABKU99].
The important implication of this result is that even a small amount of choice can lead to drastically different results in load balancing. Indeed, having just two random choices (i.e., d = 2) yields a large reduction in the maximum load over having one choice, while each additional choice beyond two decreases the maximum load by just a constant factor."
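The balls-and-bins setup from that quote is easy to try out. A minimal sketch, with a made-up function name; d = 1 is plain random placement and d = 2 is the "two random choices" variant:

```python
import random

def two_choice_max_load(n, d, seed=0):
    """Throw n balls into n bins; each ball samples d bins uniformly at random
    (with replacement) and goes into the least loaded one. Returns the max load."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        choices = [rng.randrange(n) for _ in range(d)]
        target = min(choices, key=lambda b: bins[b])
        bins[target] += 1
    return max(bins)
```

Running it for n = 1000 with d = 1 and d = 2 over a handful of seeds shows the gap the survey describes. Note that this is a static model with no balls ever leaving, so it says nothing about latency, only about how lopsided the placement gets.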
Most things are inferior to other substitutable things! :)
And here we've re-invented the airport passport checking queue - everybody hops onto the end of a big long single queue, then near the front you get to choose the shortest of the dozen or two individual counter queues
I wonder what the hybrid intelligent/random queue analogues of the in-queue intelligence gathering and decision making you can do at the airport might be? "Hmmm, a family with small children, I'll avoid their counter queue even if it's shortest", "a group of experienced-looking business travellers, they'll probably blow through the paperwork quickly, I'll queue behind them". I wonder if it's possible/profitable to characterize requests in the queue in those kinds of ways?
The difference, of course, is that ELBs are single-tenant. So a big app might only end up with half a dozen nodes, instead of the much larger number in Heroku's router fleet.
Offering some kind of single-tenant router is one possibility we've considered. Partitioning the router fleet, homing... all are ideas we've experimented with and continue to explore. If one of these produces conclusive evidence that it provides a better product for our customers and in keeping with the Heroku approach to product design, obviously we'll do it.
My hypothesis is that tenant-specific intelligent load balancers would be plausible; i would guess that you would never need more than a handful of HAProxy or nginx-type balancers to front even a large application. Your main challenge would then be routing requests to the right load balancer cluster. If you had your own hardware, LVS could handle that (i believe that Wikipedia in 2010 ran all page text requests through a single LVS in each datacentre), but i'm not sure what you do on EC2.
However, "hypothesis" is just a fancy way of saying "guess", which is why your findings from actual experiments would be so interesting.
There are screenshots of the website where it said that in multiple places.
I wonder how many people bitching here are actual customers who are having problems that haven't been addressed with a solution. I'm guessing that number is low.
Oh, you're a potential customer? That's why you're bitching? About a problem you may or may not have if you actually choose the product? Think about that argument for a second.
I've never seen such a transparent response and follow up as I have from Heroku on this issue. Most other companies would have gone into immediate damage mitigation mode and let the wound heal instead of re-opening it and giving feedback on how to fix the problem as Heroku has.
I applaud the Heroku team for their effort on their platform and being a kickass company.
The funny thing is, I don't have much sympathy for Rails users. Scaling problems with a single-threaded, serial request-processing architecture? No surprise there. But we have inexplicable H12 problems with Node.js. There's something broken in the system and it isn't random routing.
Being able to process more than one concurrent request (as Node can) is "real concurrency". Java-style native threading is a step above and beyond this, and unnecessary for most web applications.
However, if you have long running requests because they make intensive use of server resources (CPU, RAM) then concurrent servers buys you very little. In that case, sending a new request to a server that is chugging along on a difficult problem is significantly different than sending it to one that isn't. That's where knowing the resource state of each server, and routing accordingly is of huge benefit.
While load balancing is a very difficult problem, with some counterintuitive aspects, it is an area of active research, and there are some very clever algorithms out there.
For example, this article (http://research.microsoft.com/pubs/153348/idleq.pdf) from UIUC and Microsoft introduces Join-Idle-Queue, which is suitable for distributed load balancers and has considerably lower communication overhead than Shortest Queue (AFAICT the original 'intelligent' routing algorithm), and compares its characteristics to both SQ and random routing.
Again, this is not about
-how advanced Heroku's technology is on an absolute level
-how challenging routing is for their scale
-what competitors offer in this space and for what prices
This is only about the delta between what Heroku sold their customers and what their customers received. They are collapsing the delta now, by being honest about what they are selling (and improving their offering, it seems), but they are doing nothing to address the long time for which the delta was significant for a subset of their customers.
There's actually very few cases where people paid more money than they would have otherwise. Heroku is a service with your bill prorated to the second. For the most part, if people don't like the performance (which is measurable externally via benchmarks and browser inspectors), they leave the service. Many people who hit problems with visibility and performance did exactly that.
Naturally, we'll be working hard to try to recapture their business, as well as to remove any reasons that existing customers might leave as they hit performance or visibility problems scaling up in the future.
Unicorn uses the operating system's TCP connection queue to queue incoming requests that it is not able to immediately serve. While n+1 requests can (and will) get routed to a single dyno, this only results in 1 request being queued. It will be queued until the first of the in-process requests is served, which will take ~ the average response time for the app. Given that the other n requests did not get queued (queue time = 0), the average queue time will equal Sum(queue times) / # requests = Average Response Time / (n+1).
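The arithmetic can be checked with a few lines of Python. This sketch assumes every request takes exactly the average response time and that all requests arrive at the same moment; the function name and the generalisation to more than n+1 requests are mine, not part of the argument above.

```python
def average_queue_time(workers, simultaneous_requests, response_time):
    """Average time spent queued when requests hit one dyno at the same instant.

    Request i (0-based) waits for i // workers full "waves" of requests
    ahead of it, each wave lasting one response_time.
    """
    total_wait = 0.0
    for i in range(simultaneous_requests):
        waves_ahead = i // workers
        total_wait += waves_ahead * response_time
    return total_wait / simultaneous_requests
```

For n = 4 workers and a 200 ms response time, average_queue_time(4, 5, 200.0) gives 40.0 ms, i.e. 200 / (4 + 1).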
Let's completely ignore the random vs. intelligent routing question for the moment and just talk about the New Relic analytics add-on. Heroku customers were paying ~$40 per month PER DYNO for this service. One of the most compelling things about New Relic is that it not only tracks average request time, but also breaks down this data into its components so developers can see which systems are slowing down their requests. Not only did it fail to report the queueing component, it failed to account for it in the total request time, meaning both values were wrong and basically useless.
I don't understand how Heroku can admit that the tools for which their customers paid obscene amounts of money were completely broken for two years straight and yet do nothing about it besides apologize. The fact that most customers did not realize it was broken does not mean it didn't cause real, tangible harm to their businesses.
It's hard to draw an analogy here since it's rare to use a product for so long without noticing it's not what you paid for, but I'll try - imagine you lease a car for two years. You can pay $500/month for the standard model with n tons of CO2 emissions/mile, or $800/month for a hybrid model with n/5 tons of emissions. You opt for the more expensive hybrid, but after driving the car around happily for two years, you pop the hood only to find that the company gave you the standard model. After complaining to the company, they apologize and give you the standard model. Wouldn't you expect them to also retroactively refund you $300/month for every month you paid for a product that you didn't receive? Does the fact that this was a genuine mistake, rather than an attempt to defraud, change your expectation for receiving a refund? It certainly wouldn't for me.
The New Relic question is tricky. The free version of NR includes queue time, so that implies that the incremental value you're getting from the paid service does not include this. I'm also not sure how "this product you've gotten for free has a bug" fits into this equation. But overall, yes, NR is a product that is designed to give you visibility, and due to incorrect data being reported, some of that visibility wasn't there.
Therefore, we've given credits to people who have had substantial financial impact of the sort you describe. There aren't very many in this category and I believe we've already covered them all, but if you believe you're in this category please email me: adam at heroku dot com
Promises from the Heroku website pre-Rap Genius posts:
"Incoming web traffic is automatically routed to web dynos, with intelligent distribution of load instantly"
"Intelligent routing: The routing mesh tracks the availability of each dyno and balances load accordingly. Requests are routed to a dyno only once it becomes available. If a dyno is tied up due to a long-running request, the request is routed to another dyno instead of piling up on the unavailable dyno’s backlog."
"Intelligent routing: The routing mesh tracks the location of all dynos running web processes (web dynos) and routes HTTP traffic to them accordingly."
"the routing mesh will never serve more than a single request to a dyno at a time"
They weren't intentionally and purposefully misleading people. Not having docs up to date on your website, or you not knowing how the underlying backend works is not fraud.
As I've mentioned before, if every AWS customer could sue Amazon for not understanding how all of the underlying tech worked, or could sue when some of the docs were out of date, there would be more lawyers working there than engineers.
Like if I order a laptop from Dell and, oops, it gets lost in the mail, it's not okay to just say "oh, we changed our shipping company, so that should happen less in the future."
A closer comparison might be: What if Dell shipped you a laptop that they advertised as having an SSD with really good performance (and they thought that was true, or at least they did when they wrote the ad for the laptop), and it turns out that for your workload the performance isn't so great? Would you expect a refund then?
Also, we wanted to respond to our customers first and foremost, and general community discussion second. So we spent close to a month on skype/hangout/phone with hundreds of customers understanding how and at what magnitude this affected their apps.
That was hugely time-consuming, but it gave us the confidence to speak about this in a manner that would be focused on customer needs instead of purely answering community speculation and theoretical discussions.
The rest of the whole thing was people who were unsatisfied by the change and wanted the old product offering to return and/or felt shortchanged by Heroku. Luckily, Heroku doesn't lock you into their platform like GAE does, for example; it's essentially a bunch of shell scripts to deploy binaries on EC2 workers, a hosted instance of Postgres on EC2, some routing and ops as a service. Anyone who wasn't happy about the change could've contacted Heroku to say "you said we would be sold non-random routing, but you've sold us random routing" or moved their app away to another provider or even their own servers.
Does this suck if you wanted the old service? Sure. Witness the flak that the canning of Google Reader got and it's obvious that discontinuing old services isn't exactly a popular decision. On the other hand, you can self-host (it's massively cheaper than Heroku) or switch to another provider. And if there are no other providers that specialize in Rails hosting, isn't that a business opportunity?
That's not to say I wouldn't send it to a lawyer. But I'd do it with some hustle. Something like: "This is going out in 24 hours. Please comment ASAP!"
To be fair, I have to say I meet people all the time who don't understand it at all and think $randomcrappything is great at loadbalancing. If "connections go to more than one box!" is your metric, then yes, that's loadbalancing. My metric is "Do you send connections to servers that are responsive and not overloaded, maybe with session affinity?" and in general most HW loadbalancer products since 1998 have supported that. So if you're not better than 1998 technology, you may want to reevaluate your solution.
Sadly, concurrency is relatively new and unfamiliar territory to many in the Ruby on Rails community.
We've run many experiments over the past month to try other approaches to routing, including recreating the exact layout of the Bamboo routing layer (which would never scale to where we are today, but just as a point of reference). None have produced results that are anywhere near as good as using a concurrent backend. (I'd love to publish some of these results so that you don't have to take my word for it.)
That said, we're not done. There are changes to the router behavior that could have an additive effect with apps running Unicorn/Puma/etc, and we'll continue to look into those. But concurrent backends are a solution that is ready and fully-baked today.
To be honest it's the fumbling around in the dark that has annoyed me. I am with you 100% on your manifesto and your points about the type of service you provide. However the time we have spent on this (starting before you came clean about the issues) and the time spent on other increasingly suspicious advice to "up dynos" or spend time "optimising your app" sours this slightly. I accept the "magic black box" comes with its compromises and required understanding at our end but it also means needing to be far more communicative and honest about it at yours. Something which you are putting right I can see.
I for one think the premise of Heroku is a great one and you have succeeded for us in many of the things you have set out to achieve. This whole situation has been a real shame. I'm sure this must have been a pretty shoddy time for you guys and I hope you come out the better for it. The quicker the better to be honest so you can focus on the new features we'd like to see.
Regarding your app: indeed, Unicorn is a huge improvement, but far from the end of the story. "Performance" is like "security" or "uptime" — it's not a one-time feature, something you check off a list and move on. It's something that requires constant work, and every time you fix one problem or bottleneck that just leads you to the next one.
Over time, though, your vigilance pays off with a service that its users deem to be fast or secure or have good uptime. Yet there's no such thing as a finish line on these.
Bringing it back to details. Kazuki from Treasure Data made this Unicorn worker killer that might help you: https://github.com/kzk/unicorn-worker-killer If you're still not happy with your app's performance, give me a shout at adam at heroku dot com and we'll see if we can help.
Given the choice between continuing the theoretical debate over routing algorithms vs working on real customer problems (like the H12 visibility problem mentioned elsewhere in this thread), I much prefer the latter.
It works fantastically well for backends that can support 20+ concurrent connections, e.g. Node.js, Twisted, JVM threading, etc. It works less well as you can put fewer connections in each backend, which is part of why we're working on larger dynos.
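One rough way to see why per-backend concurrency matters so much under random routing is a static snapshot model. This is an assumption for illustration, not a description of Heroku's router: some number of requests are already in flight, each sitting on a uniformly random dyno, and we ask how likely it is that a new request lands on a dyno whose slots are all taken. The function name and the numbers are hypothetical.

```python
from math import comb


def prob_dyno_full(dynos: int, slots: int, in_flight: int) -> float:
    """Chance a randomly routed request lands on a dyno with no free slot.

    Assumption: the `in_flight` existing requests are spread independently
    and uniformly over `dynos`, so the count on any one dyno is
    Binomial(in_flight, 1 / dynos). Timing and queue draining are ignored.
    """
    p = 1.0 / dynos
    # Probability the chosen dyno holds fewer than `slots` requests.
    has_room = sum(
        comb(in_flight, j) * p**j * (1.0 - p) ** (in_flight - j)
        for j in range(min(slots, in_flight + 1))
    )
    return max(0.0, 1.0 - has_room)


# Same total capacity (10 slots) and same load (5 requests in flight):
print(prob_dyno_full(dynos=10, slots=1, in_flight=5))  # single-threaded dynos
print(prob_dyno_full(dynos=5, slots=2, in_flight=5))   # two slots per dyno
```

Under these assumptions, packing the same capacity into fewer, more concurrent backends lowers the chance of hitting a full one, which is the direction of the claim above.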
The 2X dyno at 2x cost doesn't really make me happy, it just invites me to spend more money when it would be more cost-efficient to move.
I've been completely disappointed with Heroku's support so far. First they obviously skimmed my support request and provided a canned response that was completely off base. Their next response didn't come for four days and only after I called their sales team to see what I could do to get better support. Their only option is a $1k / mo support contract. If you're running a mission critical app, I'd think twice before choosing Heroku.
I'd be happy to help you do this if you're game. Contact me via adam at heroku dot com.
Could you also email me some links to your support tickets so I can check out what happened there?
And no, they don't do any buffering.
If this is an immediate problem for you, it might be worth your while to make your app threadsafe, which gives you more concurrent webserver options.
We get H12's all the time. Randomly. The only suggestion we get from Heroku is to make the requests process as fast as possible. Thus, we've spent a considerable amount of time going through everything we can possibly do to make all requests respond as fast as possible. I've given up. I see this as a fundamental issue with the routing system. If you are going to use Heroku for a large production deployment, H12's (and your users getting dropped connections) will become a fact of life.
There is no auto scaling. We have no idea how many dynos we actually need. So, we overdo it in order to handle peak traffic times. This must be a great money maker for Heroku. There is no incentive for them to build auto scaling into their system because that would mean they wouldn't make as much money. Yes, auto scaling is a hard problem to solve, but there should at least be a plan to start on it and there is none that I have found.
Up until someone bitched loudly, nothing was happening to fix any of this. We have an expensive paid support contract with Heroku and before this whole routing issue blew up in public, their only recommendation was to tune the app more and buy into NewRelic for ~$8k / month. We did both and found NewRelic to not give any relevant information to help us. We did a NodeTime trial for ~$49/mo and that actually helped a lot in identifying slow spots in our app. We fixed all the slow spots in our app and still see an endless stream of H12's. Regardless, it shouldn't take a public bitch slapping for a company to listen to their customers.
You log into a dyno and see a load average of 30+. Who knows if that number is accurate or how big the underlying box really is, but regardless, I can't imagine that number being good. Am I getting H12's because I'm on an overloaded box or is it because the routing system is fundamentally broken? I don't know and nobody can tell me. This is not a good position to be in.
I have heard from several sources that Heroku isn't happy being on AWS and has been wanting to migrate off AWS for a while now. So, if your hosting provider isn't happy on their hosting provider, there must be a reason for that, and in the bigger picture you, the customer, are getting screwed.
Given these things, I will never recommend that a company use Heroku. It is great if you know you are going to never have more than one dyno, but if you think you are going to go into a large production system with it, it is far better to find something else. Which brings me to another rant... how come none of these other PaaS solutions are as easy as Heroku? The git deploy is seriously the one thing they got mostly right. I'd love to see someone build a layer on top of all the PaaS solutions so that I can just deploy my code to any one of them (or even multiple).
Knowing how many dynos you need is definitely a problem. We have implemented autoscaling in the past... but it always sucked. It's hard to find a one-size-fits-all-solution. Rather than ship something sub-par we chose not to ship anything at all.
I understand a lot of people do well with autoscaling libraries and 3rd party add-ons. Would be curious to hear your experience with any of these.
I completely agree that it shouldn't take complaining in public to get a company to listen to its customers. That was our biggest mistake in all of this, IMO: not listening.
For dyno load, have you tried log-runtime-metrics? https://devcenter.heroku.com/articles/log-runtime-metrics It provides per-dyno load averages.
I gladly accept your compliment that our git deploy remains the best on the market. :)
I'm sorry we haven't been able to serve you better. Let me know if you'd be willing to talk via skype sometime — even if you end up leaving the platform (or already have), I'd like to understand in more depth where we went wrong so we can do better in the future.
Yes, autoscaling is hard. I have apps on Google AppEngine and see their issues as well. That said, at least they are trying. Maybe even take one of those 3rd party libraries and try to harden and adapt that and make it a real solution? I think the real problem though lies in the fact that there aren't any good metrics for what dynos are doing, so there is no metric for when something is too busy or not. Yes, log-runtime-metrics puts out some numbers, but those numbers are meaningless when all I have is a slider to change the amount of money we are paying you.
I should qualify that git deploy compliment because there are issues with that as well. For example, why do you have to rebuild the npm modules from scratch each time? Why not have a directory full of pre-built modules for your dynos that are just copied into my slug? This relatively simple change would increase the speed of deployments greatly. Never mind that deployments aren't reliable and fail randomly. At least it is easy to just try again.
Would love the chance to win back your trust and hang onto your business. Let me know your app name (in email if you prefer) and I can see if there's anything we can do for you in the near term.
'why not get a dedicated professional to handle devops on your team'
It sure is easy to write that, but the reality isn't as rosy. I've gone through the process at two companies to try to do that, interviewing ~50-80 people and it has been a nightmare. It is really really difficult to find quality devops people. Again, this makes PaaS like AWS, Heroku and AppEngine a lot more attractive. They are betting their entire business on being able to hire good devops people, so they tend to attract better talent.
Any thoughts on that? It offers a Heroku-esque deploy.
I should also add that one thing that Heroku did get 100% correct is the heroku logs -t command (aka: tailing the logs). Nobody else does that one quite as well.
Adam, there's something that confuses me about this. I'm no expert in routing theory, nor have I done the experiments, so forgive me if my reasoning misses something.
I understand why RapGenius took you up on your original promises of "intelligent routing", and I think I understand what you're saying about scaling, and how scaling "intelligent routing" is so far unsolvable, and the motivation for your transition from Bamboo to Cedar, especially in the context of concurrent clients. What I don't understand is this:
It seems to me that if you split into two (or more) tiers, and random-load-balance in the front tier (hit first by the customer), and then at the second tier only send requests to unloaded clients, that you eliminate RapGenius's problem for customers who followed your specific recommendations for good performance on Bamboo (to go single-threaded and trust the router).
Do you have reason to believe that this doesn't one-shot RapGenius's problem? Do you have strategic/architectural reasons for rejecting this even though it would work? Did you try it and it failed? What's the story there?
Maybe I'll write a simulator to (dis)prove my naive theory. :P
I'm unclear how you'd think introducing a second tier changes things. That tier would need to track dyno availability and then you're right back to the same distributed state problem.
Perhaps you mean if the second tier was smaller, or even a single node? In that case, yes, we did try a variation of that. It had some benefits but also some downsides, one being that the extra network hop added latency overhead. We're continuing to explore this and variations of it, but so far we have no evidence that it would provide a major short-term benefit for RG or anyone else.
> Do you have reason to believe that this doesn't one-shot RapGenius's problem?
As a rule of thumb, I find it's best to avoid one-shots (or "specials"). It's appealing in the short term, but in the medium and long term it creates huge technical debt and almost always results in an upset customer. Products made for, and used by, many people have a level of polish and reliability that will never be matched by one-offs.
So if we're going to invest a bunch of energy into trying to solve one (or a handful) of customer's problems, a better investment is to get those customers onto the most recent product, and using all the best practices (e.g. concurrent backend, CDN, asset compilation at build time). That's a more sustainable and long-term solution.
> As a rule of thumb, I find it's best to avoid one-shots (or "specials").
Absolutely, and I would never suggest that. However, it's not just RG that has this problem, right? If I understand correctly, isn't it every single customer who believed your advertising and followed your suggested strategy to use single-threaded Rails, and doesn't want to switch?
So it's not about short or medium term; it's about letting customers take the latency hit (as you note), in order to get the scaling properties that they already paid for.
If they could get all the API requests down under 500ms I'd be much happier.
We try to drive priorities based on what customers want, not what we want: and what we've heard in the last year or so is all about app uptime, security, and now performance and visibility.
I'm very much hoping that bringing back "fast is a feature" on the developer-facing portions of the product is something we can work on this year.
Sorry you find it annoying. It's what was best for our customers.
What about 2Xdyno?
That said, I can say that a 1X dyno is not very powerful compared to, say, any server you'd purchase for your own datacenter. Our intention is that 2X dynos will provide twice the CPU horsepower, although CPU and I/O are harder to allocate reliably in virtualized environments.
> Q. Did the Bamboo router degrade?
> A. Yes. Our older router was built and designed during the early years of Heroku to support the Aspen and later the Bamboo stack. These stacks did not support concurrent backends, and thus the router was designed with a per-app global request queue. This worked as designed originally, but then degraded slowly over the course of the next two years.
From Adam's message on Feb 17th, 2011 (https://groups.google.com/forum/?fromgroups=#!topic/heroku/8...):
> You're correct, the routing mesh does not behave in quite the way described by the docs. We're working on evolving away from the global backlog concept in order to provide better support for different concurrency models, and the docs are no longer accurate. The current behavior is not ideal, but we're on our way to a new model which we'll document fully once it's done.
It looks like random load balancing was already the expected behavior 2 years ago? The "slow degradation" part seems a bit dishonest to me.
The reason it's easy to confuse these two is also part of what confused us at the time. The slow degradation of the Bamboo routing behavior was causing it to gradually become more and more like the explicit choice we had made for our new product.
But of course it's up to you (and everyone else observing) to judge whether this was some kind of malicious intent to mislead, or whether we made a series of oversights that added up to some serious problems for our customers, which we are now doing everything in our power to be fully transparent about, to rectify, and to make sure never happen again.
Also known as <17 requests per second... or a trickle of traffic. Hooray for using bigger numbers and a nonstandard unit to hide inadequacy!
Does Heroku use req/min throughout their service? I can't understand why they would, unless they also can't build the infrastructure to measure on a per-second basis.
> After extensive research and experimentation, we have yet to find either a theoretical model or a practical implementation that beats the simplicity and robustness of random routing to web backends that can support multiple concurrent connections.
Does this CTO think companies like Google and Amazon route their HTTP traffic randomly? No... he knows there are scalable routing solutions and random routing isn't the best. So he cites "simplicity and robustness." Here, this means "we can't be bothered."
After having notable issues with Cisco's hardware load balancers, there was an internal project at Amazon aimed at developing scalable routing solutions.
After years of development effort, it turned out that the "better" solutions didn't work well in production, at least not for our workloads. So we went back to million $ hardware load balancers and random routing.
I don't know if things changed after I left, but I can tell you it wasn't an easy problem. So I completely buy the robustness and simplicity argument these guys are making.
In theory, clever load distribution algorithms (of which one can imagine many variations) are very compelling. Maybe like object databases, or visual programming, or an algorithm that can detect when your program has hit an infinite loop. These are all compelling, but ultimately impractical or impossible in the real world.
Re: I can't speak to Google and Amazon, and they aren't representative of the size of our customers anyway. We have discussed with many folks who run ops at many companies that are more on par with the size of our mid- and large-sized customers, and single global request queues are exceedingly rare.
The most common setup seems to be clusters of web backends (say, 4 clusters of 40 processes each) that each have their own request queue, but with random load balancing across those clusters. This is a reasonable happy medium between pure random and global request queue, and isn't too different from what you get running (say) 16 processes inside a 2X dyno and 8 web dynos.
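For anyone who wants to poke at this trade-off, here is a small simulation sketch in Python. Everything in it is an assumption chosen for illustration: Poisson arrivals at 50 req/s, 32 single-threaded processes in total, and a service-time mix where 5% of requests take 3 s and the rest take 200 ms. Requests go to a random cluster, and each cluster has one FIFO queue shared by its processes. A cluster size of 1 is pure random routing to single-threaded backends, and a cluster size equal to the total is a single global request queue.

```python
import heapq
import random


def mean_queue_wait(total_workers, cluster_size, requests=20000,
                    arrival_rate=50.0, seed=1):
    """Mean seconds a request waits in queue under random routing to clusters."""
    if total_workers % cluster_size != 0:
        raise ValueError("cluster_size must divide total_workers")
    workload = random.Random(seed)     # drives arrivals and service times
    routing = random.Random(seed + 1)  # drives cluster choice only
    # One min-heap per cluster holding the time each worker next becomes free.
    clusters = [[0.0] * cluster_size
                for _ in range(total_workers // cluster_size)]
    now = 0.0
    total_wait = 0.0
    for _ in range(requests):
        now += workload.expovariate(arrival_rate)
        # Assumed mix: mostly fast requests with an occasional slow one.
        service = 3.0 if workload.random() < 0.05 else 0.2
        free_times = routing.choice(clusters)
        earliest_free = heapq.heappop(free_times)
        start = max(now, earliest_free)  # FIFO: take the first worker to free up
        total_wait += start - now
        heapq.heappush(free_times, start + service)
    return total_wait / requests


for size in (1, 8, 32):
    print(size, round(mean_queue_wait(32, size), 4))
```

Because the workload generator is seeded separately from the routing choice, every configuration sees the same arrivals and service times, so differences in the printed waits come from the queue layout alone. The model ignores router latency and the cost of tracking shared state, which is the hard part being debated here.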
It takes $179/mo (6 dynos) to handle 17 requests/second? That's insane.
There are apps on Heroku that serve 30k–50k reqs/min on 10–20 dynos, typically written in something like Scala/Akka or Node.js and serving incredibly short (~30ms) response times with very little variation. But these are unusual.
The more common case of a website, written in non-threadsafe Rails, with median response times of ~200ms but 95th percentile at 3+ seconds, would probably use those same 10 dynos to do only a few thousand requests per minute. Whether or not you use a CDN and page caching also makes a big difference (see Urban Dictionary for an example that does it well).
But it really depends. We were trying to quantify when you should be worried. If you're running a blog that serves 600 rpm / 10 reqs/sec off of two dynos, you don't need to sweat it.
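A quick back-of-the-envelope check on numbers like these, sketched in Python. It converts req/min to req/s and applies Little's law (average requests in flight = arrival rate x mean response time). The function names are made up for illustration, and the law needs the mean response time, which is not the same as the median quoted above.

```python
def per_second(requests_per_minute: float) -> float:
    """Convert a req/min figure to req/s."""
    return requests_per_minute / 60.0


def average_in_flight(requests_per_minute: float,
                      mean_response_seconds: float) -> float:
    """Little's law: average concurrent requests across the whole app."""
    return per_second(requests_per_minute) * mean_response_seconds


# The blog example: 600 rpm is 10 req/s.
print(per_second(600))  # 10.0

# 50k req/min at a ~30 ms mean response time, spread over 20 dynos.
in_flight = average_in_flight(50_000, 0.030)
print(in_flight, in_flight / 20)  # about 25 in flight, about 1.25 per dyno
```

This only gives the average load. It says nothing about queueing when a slow request ties up a single-threaded dyno, which is where response-time variation comes in.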
Visibility is hard no matter where you run your app. But this is an area where Heroku can get a lot better, and we intend to.
Another key part of your statement is 'with very little variation'. The code pretty much can't be doing anything other than serving up some static content, because anything that requires any sort of I/O or CPU will instantly throw the system into H12 hell. Yes, a CDN will take load off your Heroku dynos, because god forbid that your dyno actually do anything itself. Except that you forget that not all apps are webapps, and in my case there is no reason to add a CDN when I'm just serving requests and responses to an iPhone app.
The other part of the problem is being able to actually do something about it. I've tried anywhere between 50 and 300 dynos (yes we got that number increased). If we could just throw money at the problem that would be one thing, but nothing was able to resolve the H12's that we see and our paid support contract was no help either.
"If you're running a blog that serves 600 rpm / 10 reqs/sec off of two dynos, you don't need to sweat it."
Once again, we are back at the same conclusion... don't use Heroku if you want to run a production system.
Thanks. I admit I'm not familiar with the platform.