
It seems to me that Heroku is still failing to understand (or at least cop to) the fact that the switch from intelligent to randomized routing took away a major reason people chose Heroku in the first place.

A lot of Heroku's apparent value came from the intelligent routing feature. Everybody knew that it was harder to implement than random routing; that's why they were willing to pay Heroku for it.

Nobody's arguing random routing isn't easier and more stable; of course it is.

The problem is that by switching over to it, Heroku gave up a major selling point of their platform. Are they really blind enough not to know this? I have a hard time believing that.

It seems to me the real way to make people happy is to discount the "base" products which come with random routing and make intelligent routing available as a premium feature. Of course, people who thought they were getting intelligent routing should be credited.




I hear you. Heroku's value proposition is that we abstract away infrastructure tasks and let you focus on your app. Keeping you from needing to deal with load balancers is one part of that. If you're worried about how routing works then you're in some way dealing with load balancing.

However, if someone chose us purely because of a routing algorithm, we probably weren't a great fit for them to begin with. We're not selling algorithms; we're selling abstraction, convenience, elasticity, and productivity.

I do see that part of the reason this topic has been very emotional (for us inside Heroku as well as our customers and the community) is that the Heroku router has historically been seen as a sort of magic black box. This matter required peeking inside that black box, which I think has created a sense of some of that magic being dispelled.

-----


I doubt anyone chose Heroku solely because of the routing algorithm, but the sales point of intelligent routing was certainly an appealing one.

Developers don't want to pay for abstraction just for abstraction's sake; they want to pay for abstraction of difficult things. Intelligent routing is one of those difficult things. Random routing is easy, which is of course why it's also more reliable, but also why you're seeing people feeling like they didn't get what they paid for.

To be clear: this doesn't affect me personally, but I am totally sympathetic with the customers who are bent out of shape about this. I still see a divide between your response and the reason people are upset, and I'm trying to help you bridge that divide.

-----


Well said.

It's interesting — very few customers are actually bent out of shape about this. (A few are, for sure.) It's more non-customers who are watching from the sidelines that seem to be upset. I do want to try to explain ourselves to the community in general, and that's what this post was for. But my first loyalty is to serving our current customers well.

-----


What about potential customers? I've been evaluating your platform and have just been completely frustrated, with 3 days wasted trying to solve these performance issues.

I'm using gunicorn with Python. If I use the sync worker, the request queue easily hits 10 seconds and nothing works; if I switch to gevent or eventlet, New Relic tells me that Redis spends just as long stuck getting a value. This is the same code that works just fine with eventlet on my current provider and scales well.
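
(For reference, the kind of gunicorn config I'm toggling between looks roughly like this; the values are illustrative, not my exact settings:)

    # gunicorn.conf.py -- illustrative values only
    bind = "0.0.0.0:8000"
    workers = 4                  # worker processes per dyno
    worker_class = "gevent"      # or "sync" / "eventlet"
    worker_connections = 100     # only used by the async worker classes
    timeout = 30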

To add insult to injury, adding dynos actually degrades performance.

-----


That sucks. It may or may not be related to how the router works, but it's definitely about performance and visibility, which is what this is all about.

Can you email me at adam at heroku dot com and I'll connect you with one of our Python experts? I can't promise we'll solve it, but I'd like to take a look.

-----


I'd be really interested to see some public information resulting from debugging Python apps. We're holding pretty steady, but see a fairly constant stream of timeouts due, apparently, to variance in response times. To be sure, we're working on that. But, in the meantime, our experiments with gevent and Heroku have been less than inspiring.

-----


I've connected Nathaniel (poster above) with one of our Python folks. Looking forward to seeing what they discover.

Would you be willing to pair up with one of our developers on your app's performance? If so email me (adam at heroku dot com).

-----


I'm an existing customer using python with gunicorn. I'd be very keen to see any learnings about an optimal setup.

Fwiw, I've found the addon / db connection limits to be the primary blocker when load testing so far.

-----


"It's interesting — very few customers are actually bent out of shape about this"

Seems to me you don't get it. Sure, there are some very vocal non-customers, but you also have a lot of potential customers and users (spinning up free instances) evaluating your product and hoping it gets better. I agree that your true value is the abstraction you provide. Some of these potential customers want to ensure Heroku is as good an abstraction as promised, to justify the cost and commitment.

-----


Fair enough. I think the best thing we can do for those potential future customers is be really clear about what the product does and give them good tools and guidance to evaluate whether it fits their needs.

I'd argue that we dropped the ball on that before (on web performance, at least), and are rectifying it now.

-----


If it's only a few customers who are bent out of shape, how come you haven't quickly offered them refunds?

-----


We did.

-----


When did this happen? As of March 2, Rap Genius was still seeking to get money refunded.

http://venturebeat.com/2013/03/02/rap-genius-responds/

-----


Just in the last few weeks. I won't disclose details for any particular customer, but I assume that Tom @ RG will make a public statement about it at some point.

-----


You sound like a politician talking to someone of the opposite party, in that you say "I hear you" but then completely fail to address anyone's concerns. Selling a "magic black box" that guarantees certain properties, then changing them and denying the change, creates a liability for users who want to do serious work.

A major selling point of Heroku is that scaling dynos wouldn't be a risk. That guarantee is now gone, and it's not coming back soon even if the routing behavior is reverted, because users value good communication from and trust in their providers. Heroku's responses have been blithe non-acknowledgement acknowledgements of this problem.

-----


This is really unfair. This comment:

>A lot of Heroku's apparent value came from the intelligent routing feature. Everybody knew that it was harder to implement than random routing; that's why they were willing to pay Heroku for it.

is being addressed by Adam in this comment:

>Heroku's value proposition is that we abstract away infrastructure tasks and let you focus on your app. Keeping you from needing to deal with load balancers is one part of that. If you're worried about how routing works then you're in some way dealing with load balancing.

I think Adam is getting at a really fair point here, which is that nobody really minds which particular algorithm is used. If A-Company is using "intelligent routing" and B-Company uses "random routing," but B-Company has better performance and lower queue times, who are you going to choose? You're going to choose B-Company.

At the end of the day, "intelligent routing" is really nothing more than a feather in your cap. People care about performance. That's what started this whole thing - lousy performance. Better performance is what makes it go away, not "intelligent routing."

-----


Intelligent routing and random routing have different Big O properties. For someone familiar with routing, or someone who's looked into the algorithmic properties, "intelligent routing" gives one high-level picture of what the performance will be like (good with sufficient capacity), whereas random routing gives a different one (deep queues at load factors where you wouldn't expect deep queues).

This is why it was good marketing for Heroku to advertise intelligent routing, instead of just saying 'oh, it's a black box, trust us'. You need to know, at the very least, the asymptotic performance behavior of the black box.

And that's why the change had consequences. In particular, RapGenius designed their software around intelligent routing. For their design, the number of dynos needed to guarantee good near-worst-case performance increases with the square of the load, and my back-of-the-envelope math suggests the average case grows by O(n^1.5).

The original RapGenius post documents this here: http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics

The alleged fix, "switch to a concurrent back-end", is hardly trivial and doesn't solve the underlying problem of maldistribution and underutilization. Maybe intelligent routing doesn't scale but 1) there are efficient non-deterministic algorithms that have the desired properties and 2) it appears the old "doesn't scale" algorithm actually worked better at scale, at least for RapGenius.

-----


>If A-Company is using "intelligent routing" and B-Company uses "random routing," but B-Company has better performance and lower queue times, who are you going to choose?

That's not the point.

As I think of it, "performance" is an observation of a specific case while intelligent/random can be used to predict performance across all cases.

-----


Harsh. I'm going to ignore the more inflammatory parts of this (feel free to restate if you want me to engage in discussion), but one bit did grab my attention:

"A major selling point of Heroku is that scaling wouldn't be a risk"

This is interesting, especially the word "risk." Can you expand on this?

-----


Very small customer here. I don't know much about the abstraction you provide us and I don't want to know as long as things go well.

From my point of view, the routing is "random" and thus kind of unpredictable. If scaling becomes more of an issue for my business, the last thing I want is random scaling issues I cannot do anything about because the load balancer is queuing requests to my dynos at random.

I want my business to be predictable, and if I can't have that I'm going to pack my stuff and move somewhere else.

For now, I'm happy with you except for your damn customer service. They take way too long to answer our questions!

Cheers! :)

-----


Absolutely right, I totally agree. Random scaling issues that you can neither see nor control are exactly the opposite of what we want to be providing.

Can you email me (adam at heroku dot com) some links to your past support tickets? I'd like to investigate.

Thanks for running your app with us. Naturally, I expect you to move elsewhere if we can't provide you the service you need. That's the beauty of contract-free SaaS and writing your app with open standards: we have to do a good job serving you, or you'll leave. I wouldn't want it any other way.

-----


[deleted]

Very clarifying, thanks.

Fire-fighting during the scaling phase is a problem that every fast-growing software-as-a-service business will probably have to face. I think Heroku makes it easier, perhaps way easier; but I hope our marketing materials etc have not implied a scaling panacea. Such a thing doesn't exist and is most likely impossible, in my view.

-----


My company has been building apps for startups for years, and I can confirm that Heroku is consistently perceived as a "I never have to worry about scaling" solution.

-----


Very useful observation. I'd love to figure out how we can better communicate that while we aim to make scaling fast and easy, "you never have to worry about scaling" is much too absolute.

-----


Your marketing materials clearly use phrases like "forget about servers", "easily scale to millions of users", "scale effortlessly" and so forth. You're making it very easy to misunderstand you.

-----


"the switch from intelligent to randomized routing"

As I understand it, Heroku (on the Bamboo stack) didn't up and decide "Hey, we're gonna switch from intelligent to random routing." The routers are still (individually) intelligent. It's just that there are more of them now, and they were never designed to distribute their internal queue state across the cluster. The system as a whole behaves more and more like a random router as the number of intelligent bamboo routing nodes increases.
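
A toy, time-stepped model (my own sketch, nothing to do with Heroku's actual code) shows the effect: each router does leastconn against only the work it has dispatched itself, and as you add routers the worst-case backlog drifts toward what pure random routing would give you.

    import random

    def worst_backlog(n_routers, n_dynos=50, ticks=2000, arrivals=20, cost=2):
        # Each tick: every dyno completes one unit of work, and `arrivals`
        # requests (each worth `cost` units) come in via randomly chosen
        # routers. A router picks the dyno with the least work that *it*
        # dispatched and hasn't yet seen finish (its private, partial view).
        backlog = [0] * n_dynos                            # true queued units per dyno
        owners = [[] for _ in range(n_dynos)]              # which router sent each unit
        view = [[0] * n_dynos for _ in range(n_routers)]   # per-router private view
        worst = 0
        for _ in range(ticks):
            for d in range(n_dynos):                       # one unit of service per tick
                if backlog[d]:
                    backlog[d] -= 1
                    view[owners[d].pop(0)][d] -= 1
            for _ in range(arrivals):
                r = random.randrange(n_routers)
                mine = view[r]
                d = min(range(n_dynos), key=lambda i: (mine[i], random.random()))
                backlog[d] += cost
                owners[d].extend([r] * cost)
                mine[d] += cost
            worst = max(worst, max(backlog))
        return worst

    for routers in (1, 2, 10, 100):
        print(f"{routers:>3} router(s): worst dyno backlog = {worst_backlog(routers)} units")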

-----


True, although it's more confusing than that. We never made an explicit decision for Bamboo, and that's the big problem: at a minimum, the docs fell out of date.

For Cedar, we did make an explicit decision. People on the leading edge of web development were running concurrent backends and long-running requests. Our experimental support for Node.js in 2010 was a big driver here, but so were people who wanted to use concurrency in Ruby, via things like Ilya Grigorik's Goliath and Rails becoming threadsafe. These folks complained that the single-request-per-backend algorithm was actually in their way.

This plus horizontal scaling / reliability demands caused us to make an explicit product decision for Cedar.

-----


> True

That makes sense. I was wondering how intelligent routing was implemented in the first place.

-----


Oh, got it. How's this:

In the early days, Heroku only had a single routing node that sat out front. So it wasn't a distributed systems problem at that point. You could argue that Heroku circa 2009 was more of a prototype or a toy than a scalable piece of infrastructure. You couldn't run background workers, or large databases. We weren't even charging money yet.

Implementing a single global queue in a single node is trivial. In fact, this is what Unicorn (and other concurrent backends) do: put a queue within a single node, in this case a dyno. That's how we implemented it in the Heroku router (written in Erlang).
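
To make the idea concrete, here's the shape of a single-node global queue in a few lines of Python (a sketch for illustration only; the real thing was Erlang):

    import queue, threading, time

    requests = queue.Queue()              # the single, global request queue

    def worker(wid):
        while True:
            req = requests.get()          # an idle worker pulls the next request,
            time.sleep(req)               # handles it (simulated by sleeping)...
            requests.task_done()          # ...so nothing waits behind a busy worker

    for wid in range(4):                  # a few workers draining one shared queue
        threading.Thread(target=worker, args=(wid,), daemon=True).start()

    for req in (0.1, 0.5, 0.1, 0.1):      # the slow request doesn't block the others
        requests.put(req)
    requests.join()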

Later on, we scaled out to a few nodes, which meant a few queues. This was close enough to a single queue that it didn't matter much in terms of customer impact.

In late 2010 and early 2011 our growth started to really take off, and that's when we scaled out our router nodes to far more than a handful. And that's when the global queue effectively ceased to exist, even though we hadn't changed a line of code in the router.

The problem with this, of course, is that we didn't give it much attention because we had just launched a new product which made the explicit choice to leave out global queueing. It's this failure to continue full stewardship of our existing product that's the mistake that really hurt customers.

So to answer your question, there was never some crazy-awesome implementation of a distributed global queue that we got rid of. It was a single node's queue, a page of Erlang code not too different from what you'll find in Unicorn, Puma, Gunicorn, Jetty, etc.

-----


> So to answer your question, there was never some crazy-awesome implementation of a distributed global queue that we got rid of.

This sentence is really good, and I would humbly suggest you consider hammering it home even more than you have.

I gathered early on that there were inherent scaling issues with the initial router (which makes sense intuitively if you think about Heroku's architecture for more than 10 seconds), but I feel like most of the articles I've seen the past few weeks have this "Heroku took away our shiny toys because they could!" vibe. (Alternative ending: "Heroku took away our shiny toys to expand their market to nodeJS!")

Anyway, that's my take.

-----


> So to answer your question, there was never some crazy-awesome implementation of a distributed global queue that we got rid of.

So it was an oversimplified system that worked great but wasn't scalable and was at some point going to completely fall over under increasing load.

IMExp, this is not a wrong thing to build initially and it's not wrong to replace it either. But the replacement is going to have a hard time being as simple or predictable. :-)

-----


Not a wrong thing to build initially, but perhaps a wrong thing to advertise a feature based on, unless you have a plan for how to continue to deliver that feature as you scale up.

-----


Everybody who scales rapidly has some growing pains, so I'm sympathetic. But I agree that by advertising that as a feature they're specifically asking customers to outsource this hard problem to them.

-----


I don't understand how the system behaves more intelligently as the number of edge routers increases. The core routers' random behavior gets worse with larger load, and increasing the number of intelligent routers doesn't solve that problem in any meaningful way.

Sorry, correct me if I missed something, but I believe that as the overall volume of system transactions increases (thus necessitating more "intelligent" nodes), the volume of random dispersal from the core routers increases as well. That can create situations like what we saw with RapGenius, where some requests take 100ms and others 6500ms, because random routing is not intelligent and can assign jobs to a node that's completely saturated. Adding more and more intelligent nodes doesn't solve the crux of the issue, which is the random assignment of jobs by the core routers to the "intelligent" routers/nodes.

This whole situation boils down to "Intelligent routing is hard, so fuck it", and that's why everyone is pissed off. Heroku could've said "hey, intelligent routing is hard, so we're not doing that anymore", but instead they just silently deprecated the service. It's a textbook example of how to be a bad service provider.

-----


> "Intelligent routing is hard, so fuck it"

Ok, let's really dig in on this. Is this truly a case of us being lazy? We just can't be bothered to implement something that would make our customers' lives better?

The answer to these questions is no.

Single global request queues have trade-offs. One of those trade-offs is more latency per request. Another is lower availability for the app. Despite the sentiment here on Hacker News, most of our customers tell us that they're not willing to trade lower availability and higher latency per request for the efficiencies of a global request queue.

Are there other routing implementations that would be a happy medium between pure random (routers have no knowledge of what dynos are up to) and perfect, single global queue (routers have complete, 100% up-to-date knowledge of what dynos are up to)? Yes. We're experimenting with those; so far none have proven to be overwhelmingly good.

In the meantime, concurrent backends make it possible to run apps at scale on Heroku today, and they offer other benefits, like cost efficiencies. That's why we're leaning into this area in the near term.

-----


> most of our customers tell us that they're not willing to trade lower availability and higher latency per request

What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application?

Also why would that mean "lower availability" or "higher latency"? Did you look into zookeeper?

-----


> What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application?

This is how it works. Dynos register their presence into a dyno manager which publishes the results into a feed, and then all the routing nodes subscribe to that feed.

But dyno presence is not the rapidly-changing data which is subject to CAP constraints; it's dyno activity, which changes every few milliseconds (e.g. whenever a request begins or ends). Any implementation that tracks that data will be subject to CAP, and this is where you make your choice on tradeoffs.

> why would that mean "lower availability" or "higher latency"?

I'll direct you back to the same resources we've referenced before:

http://aphyr.com/posts/278-timelike-2-everything-fails-all-t... http://ksat.me/a-plain-english-introduction-to-cap-theorem/

> Did you look into zookeeper?

This is the best question ever. Not only did we look into it, we actually invested several man-years of engineering into building our own Zookeeper-like datastore:

https://github.com/ha/doozerd

Zookeeper and Doozerd make almost the opposite trade-off from what's needed in the router: they are both really slow, in exchange for high availability and perfect consistency. That's useful for many things, but not for tracking fast-changing data like web requests.

-----


Hm. Until now I thought dyno presence was your issue, but now I realize you're talking about the actual "leastconn" part, i.e. the requests queueing up on the dynos themselves?

If that's what you actually mean then I'd ask: Can't the dynos reject requests when they're busy ("back pressure")?

AFAIK that's the traditional solution to distributing the "leastconn" constraint.

In practice we've implemented this either with the iptables maxconn rule (reject if count >= worker_threads), or by having the server immediately close the connection.

What happens is that when a loadbalancer hits an overloaded dyno the connection is rejected and it immediately retries the request on a different backend.

Consequently the affected request incurs an additional roundtrip per overloaded dyno, but that is normally much less of an issue than queueing up requests on a busy backend (~20ms retry vs potentially a multi-second wait).
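
Roughly, in Python, the router-side half of that looks something like this (a sketch of the general technique, not anyone's actual implementation):

    import random, socket

    def dispatch(payload, backends, attempts=3):
        # Try a few randomly chosen backends. An overloaded backend refuses
        # the connection outright (back pressure), so we move on to the next
        # one instead of queueing behind it.
        for host, port in random.sample(backends, min(attempts, len(backends))):
            try:
                conn = socket.create_connection((host, port), timeout=0.05)
            except OSError:               # refused or timed out: backend is busy
                continue
            try:
                conn.sendall(payload)
                return conn.recv(65536)
            finally:
                conn.close()
        raise RuntimeError("all sampled backends refused the request")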

PS: Do you seriously consider Zookeeper "really slow"?! http://zookeeper.apache.org/doc/r3.1.2/zookeeperOver.html#Pe...

-----


Note: Just a bystander here

> What's the constraint that prevents you from having your dynos register with the loadbalancer cluster and then having the latter perform leastconn balancing per application

I suspect this is a consequence of the CAP theorem. You'll end up with every loadbalancer needing a near-instantaneous perception of every server's queue state and then updating that state atomically when routing a request. Now consider the failure modes that such a system can enter and how they affect latency. Best not to go there.

My understanding is that Apache Zookeeper is designed for slowly-changing data.

-----


> You'll end up with every loadbalancer needing a near-instantaneous perception of every server's queue

But that's not true. Only the loadbalancers concerned with a given application need to share that state amongst one another. And the number of loadbalancers per application is usually very small. I.e. the number is <1 for >99% of sites and you need quite a popular site to push it into the double digits (a single haproxy instance can sustain >5k connect/sec).

Assigning pooled loadbalancers to apps while ensuring HA is not trivial, but it's also not rocket science. I'm a little surprised by the Heroku response here, hence my question about which constraint I might have missed.

> My understanding is that Apache Zookeeper is designed for slowly-changing data.

Dyno-presence per application is very slowly-changing data by zookeeper standards.

-----


Again, I'm no expert on Heroku's architecture. Just thinking out loud here, and feel free to tell me to RTFA. :-)

> the number of loadbalancers per application is usually very small. I.e. the number is <1 for >99% of sites and you need quite a popular site to push it into the double digits (a single haproxy instance can sustain >5k connect/sec).

So most Heroku sites have only a single frontend loadbalancer doing their routing, and even these cases are getting randomly routed with suboptimal results?

Or is the latency issue mainly with respect to exactly those popular sites that end up using a distributed array of loadbalancers?

> Assigning pooled loadbalancers to apps while ensuring HA is not trivial, but it's also not rocket science.

To me the short history of "cloud-scale" (sorry) app proxy load balancing shows that very well-resourced and well-engineered systems often work great and scale great, that is until some weird failure mode unbalances the whole system and response time goes all hockey stick.

> Dyno-presence per application is very slowly-changing data by zookeeper standards.

OK, but instantaneous queue depth for each and every server? (within a given app)

-----


You seem to have misread the above comment. Here's what it actually said:

> The system as a whole behaves more and more like a random router as the number of intelligent bamboo routing nodes increases

The point is that this process was gradual and implicit, so there's no point at which the intelligence was explicitly "deprecated". That doesn't excuse how things ended up, but it does explain it to some degree.

-----


Er, I wasn't trying to claim that the system behaves more intelligently. The performance degradation is just an emergent consequence of gradually adding more nodes to the routing tier. This post might be useful for some context: http://aphyr.com/posts/277-timelike-a-network-simulator

-----


Is anybody operating at Heroku's scale offering centralized request routing queues? At what price?

-----


Not that I know of but that's why I'm saying it would be a premium product. Likely pricing would have to scale with the number of dynos running behind the router.

But that's the service people thought they were getting and what they wanted.

If Heroku prices out intelligent routing and says, "OK, you can have intelligent routing with your current backend stack, but it's going to cost you $25/mo for every 10 dynos, or you can switch your stack and use randomized routing for free," then they are empowering their customers to make the choice rather than dictating what they should do.

-----


If it's truly impossible to get centralized request routing queues at Heroku's scale in any other product offering, that is evidence that a demand that Heroku provide it might be unreasonable.

Aside from that, I am extremely sympathetic to Heroku's engineering point here --- it's obviously hard for HN to extract the engineering from the drama in this case! Randomized dispatch seems like an eminently sound engineering solution to the request routing problem, and the problems actually implementing it in production seem traceable almost entirely to††† the ways Rails managed to set back scalable web request dispatch by roughly a decade††††.

††† IT IS ALL LOVE WITH ME AND THIS POINT COMING UP HERE...

†††† ...it was probably worth it!

-----


Random routing vs fully centralized request routing is a false dichotomy. Suppose you have 100 nodes, and you have a router that routes randomly to one of those 100 nodes. This works very poorly. Now suppose you have 100 nodes, and you have a router that routes intelligently to one of those 100 nodes, e.g. to the one with the smallest request queue. From a theoretical perspective this works really well, but it may be impossible to implement efficiently.

The solution is to combine the two approaches. You split the 100 nodes into 10 groups of 10, you route randomly to one of the groups, and then within a group you route intelligently. This works really well. The probability of one of the request queues filling up is astronomically small, because for a request queue to fill up, all 10 request queues in a group have to fill up simultaneously (and as we know from math, the chance that an event with probability p occurs at n places simultaneously is exponentially small in n). Even if you route randomly to 50 groups of 2, that works a lot better than routing randomly to 100 groups of 1 (though obviously not as well as 10 groups of 10). There is a paper about this: http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...
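
A quick toy illustration of the grouping idea (a back-of-the-envelope sketch, not a proper queueing simulation, since no request ever finishes here, but it shows how grouping squeezes out the imbalance): throw requests at 100 dynos, picking a group at random and then the shortest queue within that group. Group size 1 is pure random routing; group size 100 is a single global queue.

    import random

    def deepest_queue(n_dynos=100, group_size=10, n_requests=10_000):
        depth = [0] * n_dynos
        n_groups = n_dynos // group_size
        for _ in range(n_requests):
            g = random.randrange(n_groups)                    # random group...
            members = range(g * group_size, (g + 1) * group_size)
            depth[min(members, key=depth.__getitem__)] += 1   # ...shortest queue inside it
        return max(depth)

    for size in (1, 2, 10, 100):
        print(f"group size {size:>3}: deepest queue = {deepest_queue(group_size=size)}")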

This is essentially what they are suggesting: run multiple concurrent processes on one dyno. Then the requests are routed randomly to a dyno, but within a dyno the requests are routed intelligently to the concurrent processes running on that dyno. There are two problems with this: (1) dynos have ridiculously low memory, so you may not be able to run many (if any) concurrent processes on a single dyno, and (2) if you have contention for a shared resource on a dyno (e.g. the hard disk) you're back to the old situation. They are partially addressing point (1) by providing dynos with 2x the memory of a normal dyno, which given a Rails app's memory requirements is still very low (you probably have to look hard to find a dedicated server that doesn't have at least 20x as much memory).

They could be providing intelligent routing within groups of dynos (say groups of 10) and random routing to each group, but apparently they have decided that this is not worth the effort. Another thing is that apparently their routing is centralized for all their customers. Rapgenius did have what, 150 requests per second? Surely that can even be handled by a single intelligent router if they had a dedicated router per customer that's above a certain size (of course you still have to go to the groups of dynos model once a single customer grows beyond the size that a single intelligent router can handle).

-----


I understand and don't disagree with everything you are saying, but the focus of my attention is on what you're talking about in your 3rd graf. When you talk about your example problems (1) and (2) with routing to concurrent systems on large number of dynos, what you're really discussing is an engineering flaw in the typical Rails stack.

There's a tradeoff between:

* a well-engineered request handler (a solved problem more than a decade ago) and

* an efficient development environment (arguably a nearly-unsolved problem before the Rails era)

And I feel like mostly the Heroku drama is a result of Rails developers not grokking which end of that tradeoff they've selected.

-----


I'm not sure I agree. Yes, it's a Rails problem that it is using large amounts of memory (on the other hand (2) isn't Rails specific at all, it applies equally to e.g. Node). But it's a Heroku problem that it gives Dynos just 512MB of memory. It's a Heroku problem that it doesn't have a good load balancer. Heroku is in the business of providing painless app hosting, and part of that is painless request routing. These problems may not be completely trivial to solve, but they're not rocket science either. Servers these days can hold hundreds of gigabytes of memory, the 512MB limitation is completely artificial on Heroku's part. Intelligent routing in groups is also very much achievable. Sure, it requires engineering effort, but that's the business Heroku is in.

Of course Heroku is under no obligation to do anything, but its customers have to justify its cost and low performance relative to a dedicated server. And most applications run just fine on a single dedicated server or at most a couple, which means you don't have routing problems at all, whereas to get reasonable throughput on Heroku you have to get many Dynos, plus a database server. A database server with 64GB of RAM costs $6400 per month there; you can get a dedicated server with that much RAM for $100 per month. Heroku is supposed to be worth that premium because it is convenient to deploy on and scale. Because of these routing problems, which may require a lot of engineering effort in your application (e.g. making it use less memory so that you can run many concurrent request handlers on a single Dyno), it's not even clear that Heroku is more convenient.

-----


If there is another provider that seamlessly operates at Heroku's scale (ie, that can handle arbitrarily busy Rails apps) at a reasonable price that has better request dispatching, I think it's very easy to show that you're right.

I'm not sure there are such providers, and if there aren't, I think it's safe to point the finger towards Rails.

As a system for efficiently handling database-backed web requests, Rails is archaic. Not just because of its memory use requirements! It is simultaneously difficult to thread and difficult to run as asynchronous deferrable state machines.

These are problems that Schmidt and the ACE team wrote textbooks about more than 10 years ago.

(Again, Rails has a lot of compensating virtues; I like Rails.)

-----


I certainly already agreed that Rails' architecture is bad (though the reason it has this problem is its memory usage, and not any of the other reasons you mention). Heroku's architecture is bad as well. It's the combination of the two that causes the problem. But that does not mean it's impossible, or even hard, to solve the problem at Heroku's end.

> I'm not sure there are such providers, and if there aren't, I think it's safe to point the finger towards Rails.

This is not sound logic. I described above two methods for solving the problem: (1) increase the memory per Dyno (see below: they're doing this, going from 512MB to 1GB per Dyno IIRC, which although still low will be a great improvement if that means that your app can now run 2 concurrent processes per Dyno instead of 1), or (2) do intelligent routing for small groups of Dynos. Do you understand the problem with random routing, and why either of these two would solve it? If not you might find the paper I linked to previously very interesting:

"To motivate this survey, we begin with a simple problem that demonstrates a powerful fundamental idea. Suppose that n balls are thrown into n bins, with each ball choosing a bin independently and uniformly at random. Then the maximum load, or the largest number of balls in any bin, is approximately log n / log log n with high probability. Now suppose instead that the balls are placed sequentially, and each ball is placed in the least loaded of d >= 2 bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this case, the maximum load is log log n / log d + Θ(1) with high probability [ABKU99].

The important implication of this result is that even a small amount of choice can lead to drastically different results in load balancing. Indeed, having just two random choices (i.e., d = 2) yields a large reduction in the maximum load over having one choice, while each additional choice beyond two decreases the maximum load by just a constant factor."

-- http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...
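
The quoted result is easy to reproduce empirically with a few lines of Python (my own quick sketch, not from the paper):

    import math, random

    def max_load(n, d):
        # n balls into n bins; each ball goes to the least loaded of d random bins
        bins = [0] * n
        for _ in range(n):
            bins[min(random.sample(range(n), d), key=bins.__getitem__)] += 1
        return max(bins)

    n = 100_000
    print(f"d=1: max load {max_load(n, 1)}  (log n / log log n ≈ {math.log(n) / math.log(math.log(n)):.1f})")
    print(f"d=2: max load {max_load(n, 2)}  (log log n / log 2 ≈ {math.log(math.log(n)) / math.log(2):.1f})")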

-----


I understand that one approach to dispatching requests at the load balancer is superior to the other, just as I understand that one way of absorbing requests at the app server is better than the other.

Most things are inferior to other substitutable things! :)

-----


That's a mild way of putting it. With the current way of dispatching requests you need exponentially many servers to handle the same load at the same queuing time, if your application uses too much memory to run multiple instances concurrently on a single server.

-----


I work at Heroku. To address your concerns about memory limitations, know that we're fast-tracking 2X dynos (this is also mentioned in the FAQ blog post). Extra memory will make it easier to get more concurrency out of each dyno.

-----


Yes, that will be a huge improvement!

-----


"You split the 100 nodes into 10 groups of 10, you route randomly to one of the groups, and then within a group you route intelligently."

And here we've re-invented the airport passport checking queue - everybody hops onto the end of a big long single queue, then near the front you get to choose the shortest of the dozen or two individual counter queues

I wonder what the hybrid intelligent/random queue analogues might be of the in-queue intelligence gathering and decision making you can do at the airport: "Hmm, a family with small children, I'll avoid their counter queue even if it's shortest"; "a group of experienced-looking business travellers, they'll probably blow through the paperwork quickly, I'll queue behind them." I wonder if it's possible/profitable to characterize requests in the queue in those kinds of ways?

-----


$25 a month? Did you forget a few zeroes?

-----


It was just a placeholder price. :)

-----


Amazon ELB? It does cost significantly more than Heroku AFAIK.

-----


My understanding is that ELBs are HAProxy, and they may be set to use the leastconn algorithm (a global request queue that is friendly to concurrent backends). However, once you get any amount of traffic they start to scale out the nodes in the ELB, which produces essentially the same results as the degradation of the Bamboo router that we've documented.

The difference, of course, is that ELBs are single-tenant. So a big app might only end up with half a dozen nodes, instead of the much larger number in Heroku's router fleet.

Offering some kind of single-tenant router is one possibility we've considered. Partitioning the router fleet, homing... all are ideas we've experimented with and continue to explore. If one of these produces conclusive evidence that it provides a better product for our customers, and is in keeping with the Heroku approach to product design, obviously we'll do it.

-----


I hope you'll be able to share your findings with us, even if they're negative. As someone who has no stake in Heroku, I have the luxury of finding this problem simply interesting!

My hypothesis is that tenant-specific intelligent load balancers would be plausible; I would guess that you would never need more than a handful of HAProxy or nginx-type balancers to front even a large application. Your main challenge would then be routing requests to the right load balancer cluster. If you had your own hardware, LVS could handle that (I believe that Wikipedia in 2010 ran all page text requests through a single LVS in each datacentre), but I'm not sure what you do on EC2.

However, "hypothesis" is just a fancy way of saying "guess", which is why your findings from actual experiments would be so interesting.

-----


ELBs have a least-conn per node routing behavior. If your ELB is present in more than 1 AZ, then you have more than one node. If you have any non-trivial amount of scale, then you probably have well more than 1 node.

-----


Random routing will work fine so long as the operating system has the ability to intelligently schedule the work. The problem is that the dynos were set up to handle requests sequentially. Unicorn will help mitigate the problem, but the best solution is not to try to serve web pages from hardware that is about as powerful as an old mobile phone. The 2x dynos are a step in the right direction, but I have no idea why they don't offer a real app server like a 16x dyno.

-----


It's probably easy to guess that once we have 2X dynos, 4X dynos (and up) may be on the way. We'll drive this according to demand, so if you need/want dynos of a particular size, drop us a line.

-----


I'm sure at least one person's job depends on this working out in the direction they've been trying to go, cf. Blaine Cook (who I think got a raw deal).

-----


[deleted]

http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics

There are screenshots of the website where it said that in multiple places.

-----


Unless I'm entirely misunderstanding the situation, this was definitely the premise of the "intelligent routing mesh" that Heroku used for so long on the Bamboo stack.

-----



