Heroku's Ugly Secret: The story of how the cloud-king turned its back on Rails

teich · on Feb 14, 2013

This is Oren Teich, I run Heroku.

I've read through the OP, and all of the comments here. Our job at Heroku is to make you successful and we want every single customer to feel that Heroku is transparent and responsive. Getting to the bottom of this situation and giving you a clear understanding of what we’re going to do to make it right is our top priority. I am committing to the community to provide more information as soon as possible, including a blog post on http://blog.heroku.com.

doktrin · on Feb 14, 2013

Thanks for the response, but I have to admit that the lack of a clear-cut answer here is a little worrisome.

Anyone who wants to like Heroku would hope that the OP is flat out, 100%, wrong. The fact that Heroku's official answer requires a bit of managing implies otherwise.

On a related tangent, I would also encourage future public statements to be a little less opaque than some Heroku has put out previously.

For instance, the cause of the outage last year was attributed to "...the streaming API which connects the dyno manifold to the routing mesh" [1]. While that statement is technically decipherable, it's far from clear.

[1] https://status.heroku.com/incidents/372

toyg · on Feb 14, 2013

Maybe it doesn't need "managing", Oren might just want to talk with whoever was responsible for the change and see what the best way forward is. I don't think panicked, knee-jerk reactions like "OMG we were wrong and will revert that commit pronto!" are beneficial in situations as complex as this.

doktrin · on Feb 14, 2013

You're assuming that the change was actually made. Until we hear definitively from Heroku, the only evidence is an (admittedly, well documented) blog post.

toyg · on Feb 14, 2013

Yeah, absolutely. I'm just saying we can't expect a manager to immediately respond to a highly-technical issue questioning subtle changes in internal behaviour which might have been introduced years ago.

bambax · on Feb 14, 2013

What's the point of posting a link to the front page of your blog, where the most recent article is 15 days old (4 hours after the comment above)?

What we want to know:

- is the OP right or wrong? That is, did you switch from smart to naive routing, for all platforms, and without telling your existing or future customers?

- if you did switch from smart to naive routing, what was the rationale behind it? (The OP is light on this point; there must be a good reason to do this, but he doesn't really say what it is or might be)

- if the OP is wrong, where might his problems come from?

- etc.

shabble · on Feb 14, 2013

>> I am committing to the community to provide more information as soon as possible, including a blog post on http://blog.heroku.com

> What's the point of posting a link to the front page of your blog, where the most recent article is 15 days old (4 hours after the comment above)?

I think OP is saying 'I am going to investigate the situation; when I am finished here [the blog] is where I will post my response', not that there is something there already.

That said, it's all a little too PR-Bot for my taste (although there's probably only so many ways to say the same info without accidentally accepting liability or something).

mcguire · on Feb 14, 2013

Note: I think we have different referents for "OP" here; bombax's is, I think, the whining customer; while shabble's is the pompous CEO.

Me, I'm the swarthy pirate. Arrrh.

bambax · on Feb 14, 2013

Upvoted, although it's bambax, not bombax ;-)

mcguire · on Feb 26, 2013

I'm having trouble reading around the eye patch.

eli · on Feb 14, 2013

> What's the point of posting a link to the front page of your blog

Well, he promised a detailed blog post, at which point that link will be extremely helpful.

I do not think it is fair to expect an immediate detailed response to those questions. If I were CEO of Heroku, I wouldn't say anything definite until after talking to the engineers and product managers involved--even if I was already pretty sure what happened. The worst thing you could do at this point is say something that's just wrong.

bambax · on Feb 14, 2013

I don't expect an immediate response; I would have been happy with just: "This is Heroku's CEO. I'm on it."

But a link, that doesn't point anywhere useful, introduced by a PR phrase that sounds a little like "Your call is important to us", was a little annoying, esp. after reading the OP where they say they have contacted Heroku multiple times on this issue.

eli · on Feb 14, 2013

I guess it's a matter of perception. I thought "I'm on it, expect an update later" is what he said.

praptak · on Feb 14, 2013

> if you did switch from smart to naive routing, what was the rationale behind it?

Most probable cause: smart routing is hard to scale. Multiple routers, with each one doing random distribution independently of others will still produce a globally random distribution. No need for inter-router synchronization.

If multiple routers try smart routing, they must do quite a bit of state sharing to avoid situations where N routers try to schedule their tasks on a single dyno. And even if you split dynos between routers then you need to move requests between routers in order to balance them.
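The first point above (independent random routers still give a globally even spread, with no synchronization) can be checked with a small simulation. This is only a sketch under stated assumptions: the function name, the router and dyno counts, and the even spread of requests over routers are all made up for illustration and say nothing about Heroku's actual routing mesh.

```python
import random
from collections import Counter

def simulate_random_routers(num_routers, num_dynos, num_requests, seed=0):
    """Count requests per dyno when every router picks dynos at random."""
    seeder = random.Random(seed)
    # One private RNG per router: the routers share no state at all.
    routers = [random.Random(seeder.getrandbits(32)) for _ in range(num_routers)]
    load = Counter()
    for i in range(num_requests):
        # Assumption: incoming requests are spread evenly over the routers.
        router = routers[i % num_routers]
        load[router.randrange(num_dynos)] += 1
    return load

NUM_DYNOS = 10
NUM_REQUESTS = 100_000
load = simulate_random_routers(num_routers=4, num_dynos=NUM_DYNOS,
                               num_requests=NUM_REQUESTS)
shares = [load[d] / NUM_REQUESTS for d in range(NUM_DYNOS)]
print(f"smallest share {min(shares):.3f}, largest share {max(shares):.3f}")
```

Every dyno ends up with roughly a tenth of the traffic even though no router knows what the others did. Note that this only shows the request counts are even; it says nothing about whether a given dyno is busy when a request lands on it, which is the complaint in the original article.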

character0 · on Feb 14, 2013

While I think it is appropriate for Heroku to respond to this thread (and other important social media outlets covering this), linking to a blog without any messaging concerning your efforts might not be the greatest move... This may not be a sink or swim moment for Heroku, but tight management of your PR is key to mitigating damage. Best of luck, Heroku is a helpful product and I want to see you guys bounce back from the ropes on this one.

csense · on Feb 14, 2013

Telling people where to look for a reply when they have one is a great idea, IMHO.

teich · on Feb 15, 2013

I've posted an update on our blog, with another to follow tomorrow:

https://blog.heroku.com/archives/2013/2/15/bamboo_routing_pe...

GhotiFish · on Feb 14, 2013

I'm looking forward to hearing why Heroku is using such a strange load balancing strategy.

tjbiddle · on Feb 14, 2013

Looking forward to your blog post. Hoping things get cleared up!

willvarfar · on Feb 14, 2013

hint just use a rabbitmq queue or something. Don't have a 'smart' LB that has to know everyone's state; instead, have dynos that get more work as quick as they can.

praptak · on Feb 14, 2013

MQ might be a solution but certainly not in the "just use" class. Unless you want to introduce a bottleneck and a single point of failure, this queue has to be distributed.

Managing a distributed queue is hard, for reasons similar to ones making the original problem hard - DQs require global state in a distributed environment. There are tradeoffs involved - the synchronization cost might become a bottleneck in itself.

Pushing the problem on the distributed brokers is making a big bet on the queuing solution. Nope, definitely not in the "just use" category.

willvarfar · on Feb 14, 2013

yes, I know all the ins and outs.

But they will end up building a pull rather than push system in the end.
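For readers unfamiliar with the push/pull distinction being argued here, a minimal in-process sketch of a pull system follows. It uses Python's standard `queue.Queue` and threads as stand-ins for a message broker and dynos; the names `worker`, `jobs`, and `handled`, and the job durations, are hypothetical. It deliberately ignores the hard part raised above: a single in-memory queue is exactly the bottleneck and single point of failure that a real deployment would have to distribute.

```python
import queue
import threading
import time

def worker(name, jobs, handled):
    """Pull jobs until a None sentinel arrives; a busy worker simply pulls nothing."""
    while True:
        duration = jobs.get()
        if duration is None:
            break
        time.sleep(duration)  # simulate handling the request
        handled.append((name, duration))

jobs = queue.Queue()
handled = []
workers = [
    threading.Thread(target=worker, args=(f"dyno-{i}", jobs, handled))
    for i in range(3)
]
for w in workers:
    w.start()

# One slow request followed by several fast ones.
for duration in [0.2, 0.01, 0.01, 0.01, 0.01, 0.01]:
    jobs.put(duration)
for _ in workers:
    jobs.put(None)  # one shutdown sentinel per worker
for w in workers:
    w.join()

print(handled)
```

Because workers ask for work only when they are free, the fast requests are not stuck behind the slow one; nothing has to track each worker's state.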

avodonosov · on Feb 14, 2013

I hope the solution will not break the possibility for multithreaded apps to receive several requests

sylvinus · on Feb 14, 2013

Me too. I see this as a side-effect of Rails single-threaded craziness, our "modern" Node.js apps run faster than ever.

antihero · on Feb 14, 2013

Can "Dynos" serve multiple requests simultaneously? That's the question, really.

neilmiddleton · on Feb 14, 2013

That's up to the process you have running on the dyno

toast76 · on Feb 13, 2013

Wow. This explains a lot.

We've always been of the opinion that queues were happening on the router, not on the dyno.

We consistently see performance problems that, whilst we could tie down to a particular user request (file uploads for example, now moved to S3 direct), we could never figure out why this would result in queuing requests given Heroku's advertised "intelligent routing". We mistakenly thought the occasional slow request couldn't create a queue....although evidence pointed to the contrary.

Now that it's apparent that requests are queuing on the dyno (although we have no way to tell from what I can gather) it makes the occasional "slow requests" we have all the more fatal. e.g. data exports, reporting and any other non-paged data request.

46Bit · on Feb 13, 2013

Several Rails apps I develop have been suffering from similar issues. Perhaps 2-3% of requests take 0.4-2s in just processing. If the allocation is a little intelligent, it'll not perform too badly and is less work than much harder optimization. Yet if it's random, it'll queue up horribly.

I'm pissed. Spent way too much time unable to explain it to coworkers, thinking I just didn't understand Heroku's platform and that it was my fault.

Turns out, I didn't understand it, because Heroku never thought to clearly mention something that's pretty important.

Easiest fix: moving to EC2 next week. I've wanted to ever since our issues became evident but it's hard to make a good argument from handwaving about 'problems'.

jholman · on Feb 14, 2013

> Easiest fix: moving to EC2 next week. I've wanted to ever since these issues became evident but it's hard to make a good argument from handwaving about 'problems'.

Of course, then you need to solve all these problems yourself. That sounds pretty easy, you'll have it done next week no problem!

That was sarcastic, but this isn't: good luck, let us know how it goes.

codewright · on Feb 14, 2013

I've done dev-ops in the ads industry before, it's really not that hard if you're a competent programmer. You just have to take a more studious approach than most non-dev-ops programmers and read up on things before deploying them.

But if you want to let the various PAAS providers put the fear into you, that's your cowardice.

Let the others learn as they may.

Edit:

To clarify, Heroku is making the problem harder on themselves than it would be for an individual to serve their own needs because of the complexity managing so many customers and apps.

You don't have to be Heroku to do for yourself what they offer.

46Bit · on Feb 14, 2013

Thanks for this codewright. I've consistently found that learning things others are afraid of is a good business decision - and I reckon this is a big one.

46Bit · on Feb 14, 2013

> Of course, then you need to solve all these problems yourself. That sounds pretty easy, you'll have it done next week no problem!

I agree with this, actually. I know it's not simple to do your own servers when you're growing. Yet I'd rather improve my existing ops skills a bit than have to setup everything as async APIs (on EC2 anyway). That's the only way I can see that I can solve this.

timr · on Feb 14, 2013

"Yet I'd rather improve my existing ops skills a bit than have to setup everything as async APIs (on EC2 anyway). That's the only way I can see that I can solve this."

You're going to discover that a lot of "ops skills" boils down to "do things asynchronously whenever possible". And while nearly any smart engineer can think of the "right" way of doing something, finding the time to do it all is a huge opportunity cost.

That's what the parent is trying to say. It's not that you can't do it; it's that it's a really bad idea to do it, at first.

coldtea · on Feb 14, 2013

>That's what the parent is trying to say. It's not that you can't do it; it's that it's a really bad idea to do it, at first.

This "bad idea" is how 90% of the web works...

loopdoend · on Feb 14, 2013

It's a lot easier than you think when you aren't limited by artificial restrictions on the number of concurrent requests you can serve. Taking out what could be considered here to be a hostile intermediary will free up tech resources to fix problems that actually exist.

I can only imagine how these guys must have been beating their heads against the wall. Heroku charges a premium price and should be providing a premium service.

coldtea · on Feb 14, 2013

>Of course, then you need to solve all these problems yourself. That sounds pretty easy, you'll have it done next week no problem!

Depending on the complexity of their setup, they COULD have it done next week no problem.

After all tens of thousands of other sites have. It's not like everybody except Google and Facebook is using Heroku.

vidarh · on Feb 14, 2013

Solving this problem is easy: Run haproxy, and he'll have detailed control over the balancing algorithm, including balancing by least connections and a number of other measures (if he, for example, wants to segregate the long running requests on a specific set of backends, it's trivial).

tbenst · on Feb 14, 2013

Best thing we can do is follow through on the article's call-to-action for emailing support@heroku.com:

"After reading the following RapGenius article (http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics), we are reevaluating the decision to use Heroku. I understand that using a webserver like unicorn_rails will alleviate the symptoms of the dyno queuing problem, but as a cash-strapped startup, cost-efficiency is of high importance.

I look forward to hearing you address the concerns raised by the article, and hope that the issue can be resolved in a cost-effective manner for your customers."

jaggederest · on Feb 14, 2013

It's interesting, because initially the way that queue time detection worked within New Relic was via timestamps.

Currently, though, I believe it's just fed as a number of milliseconds: https://github.com/newrelic/rpm/blame/master/lib/new_relic/a...

This solves the issue of the application seeing out-of-whack queue times if there's clock skew between the front-end routing framework and the actual dyno box, but misses all the queued time spent in the dyno-queue per rap genius's post.

sgrock · on Feb 14, 2013

I work on New Relic's ruby agent, and you're right (hey Justin). In fact we support both methods (i.e. passing a timestamp or a duration). We rely on the front end server (e.g. nginx, apache) to set a timestamp in an HTTP header and forward that to the ruby application. In the case of heroku there is a special header that they pass which describes the queuing duration. Because we're in the ruby application we don't have control over whether this timestamp is accurate but I'm very interested in ideas on how we could do a better job in this situation.

We do provide javascript based browser instrumentation ("Real User Monitoring") which measures request time from the browser's perspective. This might give you a more accurate idea of what real users are experiencing in this case.

jaggederest · on Feb 14, 2013

My thoughts before I left the project were to add increased granularity of queue times via having headers added at each passing server and show a rainbow chart for the 'depth' of queue at each layer, not sure if that ever got added.

There's a facility for that in the Agent, to allow multiple copies of the header and use whichever came first (for the beginning) and whichever came last (for the end); it'd be relatively easy to hook metrics into each of those.

runarb · on Feb 13, 2013

Is it so that a dyno can only handle a single user request at a time?

Why does it not use some kind of scheduling system to handle other tasks while one task is waiting on i/o?

Cushman · on Feb 13, 2013

It's not exactly so, if you use a server that spawns child processes: http://michaelvanrooijen.com/articles/2011/06/01-more-concur... you can potentially handle 3-4 requests per dyno at a time. That doesn't fix the root problem, though.

toast76 · on Feb 14, 2013

Investigating this approach now. It won't fix the problem, but will certainly reduce the occurrence of blocked dynos. Thx!

EDIT: will need to look into our memory perf though, looks like we'll need to do some work to get more than a couple of workers.

joeya · on Feb 14, 2013

I can confirm this. We experimented with Unicorn as a way to get some of the benefits of availability-based routing despite Heroku's random routing. Our medium-sized app (occupying ~230 MB on boot) would quickly exceed Heroku's 512 MB memory limit when forking just 2 unicorn workers, so we had to revert to thin and a greater number of dynos.

dspillett · on Feb 14, 2013

That makes more efficient use of the nodes, because otherwise idle cpu time and memory allocation within a node has a chance of getting used, but it doesn't stop the queueing problem (it merely gives you more virtual nodes to hand tasks to unintelligently).

Also, the nodes are virtual machines anyway and may be contending with each other for IO, and for most apps these days you spend more time waiting for IO than you do spinning the CPU (unless you have a lot of static content so don't need to hit the db for many requests - but such requests are better handled by a caching layer above that which handles the fancier stuff). So the benefit of running multiple processes per node is going to be a lot less noticeable than if the nodes were physical machines with dedicated storage channels.

michaelrkn · on Feb 13, 2013

We ran into this exact same problem at Impact Dialing. When we hit scale, we optimized the crap out of our app; our New Relic stats looked insanely fast, but Twilio logs told us that we were taking over 15 seconds to respond to many of their callbacks. After spending a few weeks working with Heroku support (and paying for a dedicated support engineer), we moved to raw AWS and our performance problems disappeared. I want to love Heroku, but it doesn't scale for Rails apps.

WillieBKevin · on Feb 14, 2013

We moved our Twilio app off Heroku for the same reasons. Extensive optimizations and we would still get timeouts on Twilio callbacks.

The routing dynamics should be explained better in Heroku's documentation. From an engineering perspective, they're a very important piece of information to understand.

We're with https://bluebox.net now and are very happy.

FireBeyond · on Feb 13, 2013

This should be more prominent. I want to love Heroku, and am sure that I could.

But really, throwing in the towel at intelligent routing and replacing it with "random routing" is horrific, if true.

It's arguable that the routing mesh and scaling dynamics of Heroku are a large part, if not -the- defining reason for someone to choose Heroku over AWS directly.

Is it a "hard" problem? I'm absolutely sure it is. That's one reason customers are throwing money at you to solve it, Heroku.

timmaah · on Feb 13, 2013

Even their own docs were wrong on this for a long time. It bit me in the ass back in 2011 and I got them to clarify and update the documentation just a little.

http://tiwatson.com/blog/2011-2-17-heroku-no-longer-using-a-...

michaelrkn · on Feb 13, 2013

Thanks for the blog post, by the way. When we were struggling with our own Heroku scaling issues last year (we eventually moved to AWS), I came across it and it was good vindication that somebody else was facing the same issue.

chc · on Feb 14, 2013

> But really, throwing in the towel at intelligent routing and replacing it with "random routing" is horrific, if true.

The thing is, their old "intelligent routing" was really just "we will only route one request at a time to a dyno." In other words, what changed is that they now allow dynos to serve multiple requests at a time. When you put it that way, it doesn't sound as horrific, does it?

Dylan16807 · on Feb 14, 2013

Yes it does sound as horrific. It should never route more than N requests to a dyno, where N is the number of requests it can handle simultaneously. It doesn't matter if N is 1, 2, or 10. And it should always send new requests to the emptiest dyno.

chc · on Feb 14, 2013

While I agree that routing to the emptiest dyno would be a good thing, I think it's a little melodramatic to describe giving every dyno at least three times and up to several thousand times as much power as "horrific."

Dylan16807 · on Feb 14, 2013

Power, what?

The problem is when you send a dyno that has all its threads stuck on long-running computations a new request, because it won't be able to even start processing it. The power is orthogonal to the problem.

The only mitigation is that if a dyno can handle a large number of threads, it probably won't get clogged. But if it can only handle 3 and gets new requests at random, you're in a bad place.

arcatek · on Feb 14, 2013

The requests will not be served at the same time, that's the whole point. If a request is routed to a busy dyno, you will have to wait for the previous job to finish before being able to start yours.

antihero · on Feb 14, 2013

Thing is though, it says that a Dyno is a Ubuntu virtual machine. In what sort of horrendous configuration can an ENTIRE VM serve only a SINGLE REQUEST AT A TIME?!

That is utter madness, and the validity of the argument depends on whether it's the Heroku or this dude's fault that the VM is serving only a single request at a time (and it taking >1sec to handle a request).

vemv · on Feb 14, 2013

Not Heroku's fault in this case, Rails (and any other single-threaded environment) can handle a single request at a time.

joesb · on Feb 15, 2013

But one VM can host multiple Rails instances, each on a different port. That's what Passenger or Unicorn do, acting as a proxy for a group of locally spawned Rails instances.

damian2000 · on Feb 14, 2013

I believe the article is saying that this wait will only occur with Rails due to it being single threaded? Reactor pattern frameworks like node don't suffer the same issue.

teraflop · on Feb 14, 2013

They will if the request is CPU-bound. In that case, throwing more concurrent requests at a server than it has cores just slows all of them down.

JPKab · on Feb 14, 2013

You are exactly right, but I would hope that someone wouldn't be using Node at all if CPU bound requests are even remotely commonplace on their app. But of course, if they built their app to spawn processes when this happens, you end up with a case of a need for intelligent routing.

joseph_cooney · on Feb 14, 2013

How many web requests are typically CPU bound though? Not many.

vidarh · on Feb 14, 2013

Doesn't matter.

At some point you hit memory limits, disk IO limits, or simply a connection limit. It doesn't matter what limit:

If you have some requests that are longer running than others, random load balancing will make them start piling up once you reach some traffic threshold.

You can increase the threshold by adding more backends or increasing the capacity of each backend (by optimizing, or picking beefier hardware if you're on a platform that will let you), and maybe you can increase it enough that it won't affect you.

But no matter what you do, you end up having to allocate more spare resources at it than what you would need with more intelligent routing.

If you're lucky, the effect might be small enough to not cost you much money, and you might be able to ignore it, but it's still there.

chc · on Feb 14, 2013

> But no matter what you do, you end up having to allocate more spare resources at it than what you would need with more intelligent routing.

I think we have to remember that the "intelligent routing" in question here is actually marketing-speak for "one request per server." Are you saying that when your servers can only receive one request at a time, you will necessarily need fewer than if your servers can handle three requests at a time but are assigned requests randomly?

Dylan16807 · on Feb 14, 2013

No it's not. "Intelligent routing" means giving each server exactly as many requests as it can handle, and no more. If your servers can handle 3 requests each, using intelligent routing to make sure they never get more than 3 will blow the latency of the random method out of the water.
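The latency claim in this sub-thread is easy to test in a toy model. The sketch below compares random assignment with "send the request to whichever dyno frees up first". Everything in it is an assumption for illustration: each dyno handles one request at a time, arrivals average 50 per second, 5% of requests take 1.5 s and the rest 50 ms, and the "earliest free dyno" rule is only a stand-in for intelligent routing, not Heroku's actual algorithm.

```python
import random

def mean_wait(policy, num_dynos=10, num_requests=20_000, seed=1):
    """Average time a request spends queued before a dyno starts working on it."""
    workload = random.Random(seed)      # same arrivals and durations for both policies
    routing = random.Random(seed + 1)   # separate stream for random dyno picks
    free_at = [0.0] * num_dynos         # time at which each dyno clears its backlog
    now = 0.0
    total_wait = 0.0
    for _ in range(num_requests):
        now += workload.expovariate(50.0)  # about 50 requests per second
        service = 1.5 if workload.random() < 0.05 else 0.05
        if policy == "random":
            dyno = routing.randrange(num_dynos)
        else:  # "earliest_free": pick the dyno that can start soonest
            dyno = min(range(num_dynos), key=free_at.__getitem__)
        start = max(now, free_at[dyno])
        total_wait += start - now
        free_at[dyno] = start + service
    return total_wait / num_requests

random_wait = mean_wait("random")
smart_wait = mean_wait("earliest_free")
print(f"random: {random_wait:.3f}s queued, earliest-free: {smart_wait:.3f}s queued")
```

With the same workload and the same number of dynos, the random policy leaves requests queued behind the occasional slow one while other dynos sit idle, so its average queue time comes out several times higher. Changing the dyno count or the share of slow requests moves the gap but, in this model, does not close it.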

DeepDuh · on Feb 14, 2013

Why do you 'want to love' Heroku? Because their marketing speak is so great?

robryan · on Feb 14, 2013

Because the premise is a good one, abstracting away some of the lower level details so people can just worry about building their apps. It has always been a tough abstraction to pull off though. It is possible the answer is simply that rap genius has outgrown heroku, in the same way that twitter outgrew rails; that doesn't make it a bad tool for everyone.

FireBeyond · on Feb 14, 2013

Precisely. I'm definitely not a startup worshipper; in fact, in terms of the HN spectrum, I'm probably at the more cynical end of the spectrum. But I like the ease of integration, and the fact that, for a smaller application, it certainly helps with my administration and infrastructure overhead.

DigitalJack · on Feb 14, 2013

I'm sure it's because the idea of heroku is so great.

DeepDuh · on Feb 14, 2013

As others here have pointed out in different discussions: Ideas aren't worth much by themselves. It's the subset that's implementable of which it's the subset that has gotten implemented that's worth anything. If we just love ideas we should better become philosophers.

Sorry to be a bit harsh, but I find it a bit shocking how even in this field where we can basically play god and do whatever we want and what we think is best on increasingly powerful bit-cruncher-and-storer-machines, so many here seem to behave like a herd of sheep and just do what 'everyone else' does. Just sit down for a moment and think! What are my requirements right now? What could be a requirement in the near future? What technologies are there which can help me? Am I sure about these features? Better read up on it first! How difficult is it to get it to behave in ways that are or can be important for me?

Now list that stuff down. If it's puzzling sleep over it, forget it for a few days. Then suddenly, for example under a hot shower you get an idea - that requirement I had isn't really one, I can solve it differently! Come back, take the now fitting piece of the puzzle and do your job in 20% of the time that would have been needed if you would just have blindly followed some path. That's how it usually works for me. Be picky, be exact, but be lazy.

Now about that routing dispatcher problem: Couldn't we solve that in one to two weeks on a generic platform, but specifically for a certain use case? Let's say you want to have a worker queue of rails request handlers that work in parallel. Just write that damn router! Maybe I'd be lazy, learn Erlang for a week and think about it afterwards.

ajanuary · on Feb 14, 2013

They want to love it because the idea is great. They don't love it because it doesn't deliver. Sounds like you both agree.

DeepDuh · on Feb 15, 2013

I guess the problem I have is with the word 'love' itself. It's such a strong feeling for something that's just a tool that can in some situations make you turn around bits more efficiently. Can't we just look at that stuff rationally and reserve that kind of feelings for family and friends?

Ingaz · on Feb 14, 2013

"Love" is complicated matter.

Java is almost done from "Love".

Rails is done from "Love" of those who do not "love" java.

Hype.

I don't "love" java and Rails.

badgar · on Feb 14, 2013

> That's one reason customers are throwing money at you to solve it, Heroku.

People are throwing money at Heroku because it's really easy to use, not because it's the best long-term technology choice. Seriously - what percentage of Heroku paying users do you think actually read up on the finest technical details like routing algorithms before they put in their credit card? Heroku knows. They know you can't even build a highly-available service on top of it, since it's singly-homed, and they're still making tons of money.

aneth4 · on Feb 14, 2013

So heroku doesn't want these $20,000/mo accounts, just technically understaffed startups paying $1000/mo?

I think heroku does want to be a long-term technology choice.

badgar · on Feb 14, 2013

> So heroku doesn't want these $20,000/mo accounts, just technically understaffed startups paying $1000/mo?

> I think heroku does want to be a long-term technology choice.

Oh, I'm sure Heroku wants to be a long-term technology choice. That doesn't mean they're trying to be one with their current product offerings.

Consider their product choices since launch: they've added dozens of small value-added technology integrations. Features for a few bucks a month like New Relic to upsell their smallest customers. The price drop was also a big move to reduce barriers to using their platform - which also targets smaller customers. They launched Clojure as their third supported language! Meanwhile, they're singly-homed and have had several protracted outages, and have no announced plans to build a multi-homed product. Scalability has gotten worse with this random routing development.

I think Heroku has known for a long time that they don't have a long-term platform product and that they can't keep big accounts until they build one.

lkrubner · on Feb 14, 2013

Good lord!!!!!

Percentage of the requests served within a certain time (ms)

  50%    844

  66%   2977

  75%   5032

  80%   7575

  90%  16052

  95%  20069

  98%  29282

  99%  30029

 100%  30029 (longest request)

Those numbers are amazingly awful. If I ever run ab and see 4 digits I assume I need to optimize my software or server. But 5 digits?

Why in the world would a company spend $20,000 a month for service this awful?

CoffeeDregs · on Feb 14, 2013

Worse than that:

  * 89/100 requests failed (according to
    https://gist.github.com/a-warner/c8cc02565dc214d5f77d ).  
  * Heroku times out requests after 30 seconds, so the 30000ms
    numbers may be timeouts (I've forgotten if *ab* includes 
    those in the summary).
  * That said, the *ab* stats could be biased by using overly 
    large concurrency settings (probably not if you're running 50 dynos...),
    but still...

But still WTF. 89/100 requests failed? That's not happy-making.

Uncertainty is DiaI (death-in-an-infrastructure). I just created a couple of projects on Heroku and love the service, but this needs to be addressed ASAP (even if addressing it is just a blog post).

Also, if you have fewest-connections available, I've never understood using round-robin or random algorithms for load-balancers...

donavanm · on Feb 14, 2013

> I've never understood using round-robin or random algorithms for load-balancers...

LeastConns/FastestConn selection is very dangerous when a backend host fails. Imagine a host has a partial failure, allowing health checks to pass. This host now fast fails and returns a 500 faster than other hosts in the pool generate 200s. This poison host will have less active connections and your LB will route more requests to it. A single host failure just became a major outage.

I like WRR selection for back ends, then use a queue or fast fail when your max active conns is exceeded. Humans prefer latency to errors, so let the lb queue (to a limit) on human centric VIPs. Automated clients deal better with errors so have your lb throw a 500 directly, or RST, or direct to a pool that serves static content.
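The failure mode described in the first paragraph above can be reproduced with a tick-based toy model. This is a sketch under assumptions chosen for illustration only: one request arrives per tick, healthy backends take 10 ticks per request, the broken backend "answers" with an error in 1 tick, and ties go to the lowest-numbered backend (which favours the healthy ones here). The function name and parameters are hypothetical.

```python
def poisoned_share(num_backends=5, ticks=10_000, healthy_ticks=10, poison_ticks=1):
    """Fraction of requests a least-connections balancer sends to one fast-failing backend."""
    poison = num_backends - 1                    # the last backend is the broken one
    in_flight = [[] for _ in range(num_backends)]  # completion ticks of open requests
    sent_to_poison = 0
    for tick in range(ticks):
        # Retire requests that have completed by this tick.
        for b in range(num_backends):
            in_flight[b] = [done for done in in_flight[b] if done > tick]
        # Least connections: fewest open requests wins, lowest index on ties.
        target = min(range(num_backends), key=lambda b: len(in_flight[b]))
        if target == poison:
            sent_to_poison += 1
            in_flight[target].append(tick + poison_ticks)
        else:
            in_flight[target].append(tick + healthy_ticks)
    return sent_to_poison / ticks

share = poisoned_share()
print(f"share of requests sent to the failing backend: {share:.0%}")
```

With five backends a fair share would be 20%, but the failing host always looks idle, so it attracts well over that. In the model every one of those requests is an error returned to a user.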

shabble · on Feb 14, 2013

You'd have some sort of error threshold/rate limit[1] at which point the server is marked dead and falls out of the backends list, surely?

Or even, an alarm threshold if responses are averaging /too fast/, based on your expected load & response times.

I've not done any deployment/ops beyond the trivial/theoretical though, so I don't know how this would work in reality.

donavanm · on Feb 15, 2013

No, lbs don't inspect established streams. The lb will periodically send requests to a known uri as a health check instead. The problem is when the health check uri isn't indicative of availability. (Hint: it never is)

Nope, don't do this either. Unless you like getting pages because things are working?

jeremyjh · on Feb 15, 2013

You can configure a LB to inspect layer 7 - in HAProxy this is done with an observe directive. Error 500s would then take the member out of a pool. You are right that the health check of a static file may put it right back into the pool, but you can give it a slow rise value so that it waits a few minutes to do that. I'm not saying this is easy to get right but it is definitely possible to at least reduce the frequency of selection of a server raising 500 errors.

donavanm · on Feb 15, 2013

Yes, all things are possible. You'll also have to keep state and track rates. Otherwise a very low error rate could down all of your backend hosts.

But you're now running a stateful l7 application proxy. That's waaaaaay more expensive than a tcp proxy with asynchronous checks.

bbatsell · on Feb 14, 2013

> But still WTF. 89/100 requests failed? That's not happy-making.

Unless something has changed recently, ab doesn't handle dynamic pages very well. It takes the first pageload as a baseline, and any subsequent request with any portion of the page that is randomized, or is a CSRF token, or reflects most recent changes, etc., is marked as "failed" because it doesn't match the baseline's length.

The page in question does have a block in the footer reflecting "hot songs", which I'm guessing changed a bit during the run.
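As a rough illustration of the behaviour described (a sketch, not ab's actual source), a length-based check looks something like this. The response bodies are invented:

```ruby
# Hypothetical response bodies from three requests to the same page.
bodies = [
  "<footer>hot songs: A, B</footer>",
  "<footer>hot songs: A, B</footer>",
  "<footer>hot songs: A, B, C</footer>",
]

# The first response sets the baseline length; any later response with a
# different length is counted as a failure, even if it was a valid 200.
baseline = bodies.first.bytesize
length_failures = bodies.count { |body| body.bytesize != baseline }
puts "#{length_failures} of #{bodies.size} responses counted as failed"
```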

CoffeeDregs · on Feb 14, 2013

AB is not a particularly smart tool (though it's very handy), but I've never seen it reject as not-accurate any non-baseline sized requests... In fact, I can't imagine the rationale that would lead a benchmarking tool to do so. [Obviously, a testing tool might do so...]

bbatsell · on Feb 14, 2013

I've seen it frequently, and yes, this is how it behaves, e.g.:

[1]: http://stackoverflow.com/a/579466/94830

[2]: https://issues.apache.org/bugzilla/show_bug.cgi?id=42040

I imagine the rationale was something along the lines of many servers/apps are written to incorrectly return 200 with a descriptive error page rather than 500 or whatever the appropriate status code would be. And at the time ab was first written, pages were a lot more static than they are now, so a different page would be more likely to indicate an incorrect response.

stock_toaster · on Feb 14, 2013

use wrk[1] or weighttp[2] instead.

[1]: https://github.com/wg/wrk

[2]: redmine.lighttpd.net/projects/weighttp/wiki

JPKab · on Feb 14, 2013

I've been scanning through the comments, and I have yet to see anything written by a Heroku engineer to defend the company. I'm hoping it's in here and I missed it. I have a feeling that this all might be absolutely true, and they have lawyers/PR trying to think of a damage control plan.

I suspect that the reason they've been pushed to do this is financial, and it makes me think that Nodejitsu's model of simply not providing ANY free plans other than one month trials is a good one. I realize it's apples and oranges, since NJ is focused on async and this wouldn't even be a problem for a Node app, but from a business perspective I feel like this would alleviate pressure. How many dynos does Heroku have running for non-paying customers? Do these free dynos actually necessitate this random routing mesh bullshit? If not, what?

joeblau · on Feb 14, 2013

Nodejitsu actually does have a free plan if your project is Open Source.

http://blog.nodejitsu.com/opensource-dot-jitsu

badgar · on Feb 14, 2013

> How many dynos does Heroku have running for non-paying customers? Do these free dynos actually necessitate this random routing mesh bullshit? If not, what?

Of course the random routing mesh isn't necessitated by anything, this problem is already solved by bigger companies.

csense · on Feb 14, 2013

High cost/risk associated with switching providers, and frog-in-heating-water syndrome.

eli · on Feb 14, 2013

Well, at X level of concurrency, wouldn't most set ups with load balancers start to spit numbers like that?

trotsky · on Feb 14, 2013

no, at x level of concurrency most set ups won't spend 16 seconds or more on 10% of their requests.

eli · on Feb 14, 2013

You mean because they'll crash before then? Otherwise I don't follow. Surely there's always a limit to how many simultaneous requests can be processed at once.

trotsky · on Feb 14, 2013

Hmm, you're right of course. Somewhere our terms got crossed, I wouldn't call (total requests/servers) concurrency, I'd call that request density.

bignoggins · on Feb 13, 2013

Rap Genius is employing a classic rap-mogul strategy: start a beef

parsnips · on Feb 13, 2013

Not only that, East Coast vs. West Coast at that...

hunterhusar · on Feb 14, 2013

hunterhusar · on Feb 15, 2013

-4 for removing an LOL darn

jussij · on Feb 14, 2013

We've had Battle Rap, Gangsta Rap, you name it Rap.

Maybe this is the start of Off the Rails Rap.

mattj · on Feb 13, 2013

So the issue here is two-fold:

- It's very hard to do 'intelligent routing' at scale.

- Random routing plays poorly with request times with a really bad tail (median is 50ms, 99th is 3 seconds)

The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).

aristus · on Feb 13, 2013

I do perf work at Facebook, and over time I've become more and more convinced that the most crucial metric is the width of the latency histogram. Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny.

jhspaybar · on Feb 13, 2013

I can chime in here that I have had similar experiences at another large scale place :). Some requests would take a second or more to complete with the vast majority finishing in under 100 ms. A solution was put in place that added about 5 ms to the average request, but also crushed the long tail (it just doesn't even exist anymore) and everything is hugely more stable and responsive.

bshanks · on Feb 14, 2013

Let's assume that it is unacceptable to have each dyno tell the router each time it finishes a request ( http://news.ycombinator.com/item?id=5217157 ). And also that the goal is to reduce worst-case latency. And also that we don't know a priori how many requests each dyno should queue before it is 'full' and rejects further requests ( http://news.ycombinator.com/item?id=5216771 ).

Proposal:

1) Once per minute (or less often if you have a zillion dynos), each dyno tells the router the maximum number of requests it had queued at any time over the past minute.

2) Using that information, the router recalculates a threshold once a minute that defines how many queued requests is "too many" (e.g. maybe if you have n dynos, you take the log(n)th-busiest-dyno's load as the threshold -- you want the threshold to only catch the tail).

3) When each request is sent to a dyno, a header field is added that tells the dyno the current 'too many' threshold.

4) If the receiving dyno has too many, it passes the request back to the router, telling the router that it's busy ( http://news.ycombinator.com/item?id=5217157 ). The 'busy' dyno remembers that the router thinks it is 'busy'. The next time its queue is empty, it tells the router "i'm not busy anymore" (and repeats this message once per minute until it receives another request, at which point it assumes the router 'heard').

5) When a receiving dyno tells the router that it is busy, the router remembers this and stops giving requests to that dyno until the dyno tells it that it is not busy anymore.
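Step 2 could be sketched roughly like this. The natural log, the rounding, the "at or above the threshold counts as busy" rule, and the sample reports are all assumptions made for illustration; `busy_threshold` is a hypothetical helper.

```ruby
# max_queued: one entry per dyno, the peak queue depth it reported for the
# last minute (step 1). Returns the load of the log(n)-th busiest dyno.
def busy_threshold(max_queued)
  n = max_queued.size
  rank = [Math.log(n).ceil, 1].max   # natural log, rounded up, at least 1st
  max_queued.sort.reverse[rank - 1]
end

# Made-up reports from 20 dynos.
reports = [0, 1, 0, 2, 1, 0, 9, 1, 3, 0, 1, 2, 0, 1, 6, 1, 0, 2, 1, 0]
threshold = busy_threshold(reports)

# Dynos at or above the threshold would hand new requests back (step 4).
busy = reports.each_index.select { |i| reports[i] >= threshold }
puts "threshold=#{threshold}, busy dynos=#{busy.inspect}"
```

With 20 dynos the rank works out to the 3rd busiest, so only the tail of the distribution is caught.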

I haven't worked on stuff like this myself, do you think that would work?

genwin · on Feb 14, 2013

How was 5 ms added? Multiple sleep states per request?

I imagine the long tail disappears in a similar way that a traffic jam is prevented by lowering the speed limit.

mixedbit · on Feb 14, 2013

I think you misunderstood: they optimized the long running requests and the optimization incurred 5ms performance loss for short requests. It is not that the additional 5ms solved the problem.

harshreality · on Feb 13, 2013

I seem to recall Google mentioning on some blog several years ago that high variance in response latency degrades user experience much more than slightly higher average request times. I can't find the link though; if anyone has it, I'd be grateful.

nostrademons · on Feb 14, 2013

Jeff Dean wrote a paper on it for CACM:

http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scal...

There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first request fails to respond after a relatively short period of time (say, 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically since it's very unlikely that the second request will hit the same overloaded server.
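A back-of-the-envelope model of the effect, as a sketch only: the two-point latency distribution (50 ms normally, 3000 ms for 5% of requests) and the 100 ms hedge cutoff are invented numbers, and it assumes the backup lands on an independently chosen dyno.

```ruby
rng = Random.new(42)        # fixed seed so runs are repeatable
fast_ms = 50
slow_ms = 3000
slow_prob = 0.05
hedge_after_ms = 100        # hypothetical cutoff, roughly "95th percentile"

sample = -> { rng.rand < slow_prob ? slow_ms : fast_ms }
p99 = ->(xs) { xs.sort[(xs.size * 0.99).floor] }

plain = []
hedged = []
10_000.times do
  first = sample.call
  plain << first
  if first <= hedge_after_ms
    hedged << first
  else
    # The backup starts after the cutoff; whichever response arrives first wins.
    hedged << [first, hedge_after_ms + sample.call].min
  end
end

plain_p99 = p99.call(plain)
hedged_p99 = p99.call(hedged)
puts "p99 without hedging: #{plain_p99} ms, with hedging: #{hedged_p99} ms"
```

The model ignores the extra load the backup requests create, which is the capacity cost of this approach.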

gleb · on Feb 14, 2013

From experience, this is an incredibly effective way to DoS yourself. It was the default behaviour of nginx LB ages ago. Maybe only on EngineYard. Doesn't really matter as nobody uses nginx LB anymore.

Even ignoring the POST requests problem (yup, it tried to replay those) properly cancelling a request on all levels of a multi-level rails stack is very hard/not possible in practice. So you end up DOSing the hard to scale lower levels of the stack (e.g. database) at the expense of the easy to scale LB.

nostrademons · on Feb 14, 2013

It's a capacity/latency tradeoff. Perhaps I'm biased by working at Google, where capacity is cheap and latency will kill you, but IIUC Heroku runs off of AWS and their database tier needs to scale horizontally anyway, so reserving sufficient overflow capacity should simply be a matter of throwing money at the problem.

krunaldo · on Feb 14, 2013

Nginx introduced least_conn lb method in 1.3.1 which makes it a bit better. http://nginx.org/en/docs/http/ngx_http_upstream_module.html#...

ha-proxy is a lot better than nginx + more flexible if you want to introduce non-http to your stack.

Shouldn't the request be canceled on all levels if you cut the HTTP connection to the frontend?

fizx · on Feb 14, 2013

Right now, heroku has one inbound load-balancer that's out of their control (probably ELB(s)). This load balancer hits another layer of mesh routers that heroku does control, and that perform all of heroku's magic. In order for "intelligent routing," which is more commonly known as "least-conn" routing, to work amongst the mesh layer, all of the mesh routers would have to share state with each other in real-time, which makes this a hard problem.

Alternately, heroku can introduce a third layer between the mesh routers and the inbound random load balancer. This layer consistently hashes (http://en.wikipedia.org/wiki/Consistent_hashing) the api-key/primary key of your app, and sends you to a single mesh router for all of your requests. Mesh routers are/should be blazing fast relative to rails dynos, so that this isn't really a bottleneck for your app. Since the one mesh router can maintain connection state for your app, heroku can implement a least-conn strategy. If the mesh router dies, another router can be automatically chosen.
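A minimal sketch of that hashing layer, assuming three mesh routers with made-up names and using an app name as the key. `build_ring` and `router_for` are hypothetical helpers, not anything Heroku exposes.

```ruby
require "digest"

def ring_hash(s)
  Digest::SHA1.hexdigest(s)[0, 8].to_i(16)
end

# Each router gets several points on the ring, sorted by hash value.
def build_ring(routers, points_per_router: 64)
  routers.flat_map { |r|
    (0...points_per_router).map { |i| [ring_hash("#{r}:#{i}"), r] }
  }.sort
end

# An app is served by the first router point at or after its hash, wrapping around.
def router_for(ring, app_key)
  h = ring_hash(app_key)
  (ring.bsearch { |point, _router| point >= h } || ring.first)[1]
end

routers = %w[router-a router-b router-c]   # hypothetical mesh routers
ring = build_ring(routers)
puts "app-1234 -> #{router_for(ring, "app-1234")}"

# If router-c dies, only the apps that were on router-c get remapped.
survivors = build_ring(routers - ["router-c"])
apps = (1..1000).map { |i| "app-#{i}" }
moved = apps.count do |a|
  before = router_for(ring, a)
  before != "router-c" && before != router_for(survivors, a)
end
puts "apps moved off a surviving router: #{moved}"
```

Every request for a given app lands on the same router, so that router alone can track the app's open connections.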

gojomo · on Feb 14, 2013

A relatively easy fix, for read-only or idempotent requests. Also, if long-tail latency requests wind up being run twice, this technique might accelerate tip-over saturation. Still, this 'hedged request' idea is good to keep in mind, thanks for the pointer.

The 'tied request' idea from the Dean paper is neat, too, and Heroku could possibly implement that, and give dyno request-handlers the ability to check, "did I win the race to handle this, or can this request be dropped?"

badgar · on Feb 14, 2013

> There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first request fails to respond after a relatively short period of time (say, 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically since it's very unlikely that the second request will hit the same overloaded server.

Your solution doesn't work if requests aren't idempotent.

nostrademons · on Feb 14, 2013

Yeah, I know. I figure that for incoming HTTP traffic it's relatively easy to balance the GET requests, and if they're doing anything remotely sane with HTTP those ought to be idempotent (if they're not, Googlebot will come along and delete their site ;-)).

For mutating requests, there's a solution as well, but it involves checksumming the request and passing the checksum along so that the database layer knows to discard duplicate requests that it's already handled. You need this anyway if there's any sort of retry logic in your application, though.
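Roughly, as a sketch: the in-memory set stands in for whatever the database layer would really use. The per-request id mixed into the checksum is an assumption added here so that two legitimately identical writes are not merged.

```ruby
require "digest"
require "set"

seen = Set.new      # checksums of writes already applied
applied = []        # stand-in for the database

# Returns true if the write was applied, false if it was a duplicate.
apply_once = lambda do |request_id, payload|
  key = Digest::SHA256.hexdigest("#{request_id}\n#{payload}")
  next false unless seen.add?(key)   # add? returns nil when the key is already present
  applied << payload
  true
end

apply_once.call("req-1", "INSERT comment 'hi'")   # original request
apply_once.call("req-1", "INSERT comment 'hi'")   # backup/retried copy, discarded
puts "writes applied: #{applied.size}"
```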

donavanm · on Feb 14, 2013

Or use a parameter that's effectively a nonce.

javajosh · on Feb 14, 2013

This is something I've read in networked game literature: players react far better to consistent and high latency than to inconsistent and low latency, even if the averages are lower in the latter case. (It might even have been a John Carmack article).

mvgoogler · on Feb 14, 2013

Sounds like Jeff Dean :-)

http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scal...

dbpatterson · on Feb 14, 2013

That's probably true, but the value that Heroku is selling (and they charge a lot for it!) is that you _don't_ need to deal with this - that they will balance that out for you.

bmohlenhoff · on Feb 13, 2013

Correct, this matches my observations as well. I'd trade an increase in mean latency for a decrease in worst-case latency anytime. It makes it so much easier to reason about how many resources are needed for a given workload when your latency is bounded.

cmccabe · on Feb 14, 2013

> Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny.

Yeah, it's a lot more practical than implementing QoS, isn't it?

jholman · on Feb 14, 2013

Re the distribution, absolutely. That "FIFTY TIMES" is totally due to the width of the distribution. Although, you know, even if their app was written such that every single request took exactly 100ms of dyno time, this random routing would create the problem all over again, to some degree.

As for the intelligent routing, could you explain the problem? The goal isn't to predict which request will take a long time, the goal is to not give more work to dynos that already have work. Remember that in the "intelligent" model it's okay to have requests spend a little time in the global queue, a few ms mean across all requests, even when there are free dynos.

Isn't it as simple as just having the dynos pull jobs from the queue? The dynos waste a little time idle-spinning until the central queue hands them their next job, but that tax would be pretty small, right? Factor of two, tops? (Supposing that the time for the dyno-initiated give-me-work request is equal to the mean handling time of a request.) And if your central queue can only handle distributing to say 100 dynos, I can think of relatively simple workarounds that add another 10ms of lag every factor-of-100 growth, which would be a hell of a lot better than this naive routing.

What am I missing?

awj · on Feb 14, 2013

I think the problem is that any servers which can handle concurrent requests now need to decide how many requests they can handle. Since most application servers seem to have concurrency values of "1, ever" or "I dunno, lots" this is a hard problem.

Your solution would likely work if you had some higher level (application level? not real up on Heroku) at which you could specify a push vs. pull mechanism for request routing.

jholman · on Feb 14, 2013

Yeah, I dunno squit about Heroku either.

Given that, according to TFA (and it's consistent with some other things I've read) Heroku's bread and butter is Rails apps, and given that, according to TFA, Rails is single-threaded, that (valid) point about concurrency in a single dyno is perhaps not that relevant? You'd think that Heroku would continue to support the routing model that almost all of their marketing and documentation advertises, right? Even if it's a configurable option, and it only works usefully with single-threaded servers?

And if you did do it pull-based, it wouldn't be Heroku's problem to decide how many concurrent requests to send. Leave it to the application (or whatever you call the thing you run on a dyno).

And it doesn't need to be pull-based, if the router can detect HTTP connections closing in dynos, or whatever.

But the idea of pull-based work distribution is pretty straightforward. It's called a message queue.
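A toy version with threads standing in for dynos, as a sketch. The job durations are made up and scaled down, and a real setup would use a network queue rather than an in-process `Queue`.

```ruby
# Three hypothetical "dynos" pull jobs from one shared queue.
jobs = Queue.new
[3000, 50, 50, 50, 50, 50, 50].each { |ms| jobs << ms }   # one slow request, six fast
jobs.close                                                  # no more work will arrive

handled = Hash.new { |h, k| h[k] = [] }
lock = Mutex.new

dynos = 3.times.map do |id|
  Thread.new do
    # pop returns nil once the closed queue is empty, which ends the loop.
    while (ms = jobs.pop)
      sleep(ms / 100_000.0)            # scaled down: 3000 "ms" is 30 real ms
      lock.synchronize { handled[id] << ms }
    end
  end
end
dynos.each(&:join)

handled.sort.each { |id, done| puts "dyno #{id} handled #{done.inspect}" }
```

A dyno only takes a new job when it is free, so the one stuck on the slow request never has fast requests queued behind it.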

lil_tee · on Feb 14, 2013

Simulation author here with some additional analysis using a faster distribution of request times. If you use a distribution with median 50 ms, 90th percentile 225 ms, and 99.9th percentile 898 ms, then you need 30 intelligent dynos to handle 9000 requests/minute without queueing. In the same scenario with 30 naive dynos, 44% of requests get queued.

Animations and results are in the explanation at http://rapgenius.com/1502046
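For anyone who wants to poke at the idea, here is a much cruder model than the simulation linked above. It is a sketch under invented assumptions: evenly spaced arrivals, a two-point service-time distribution, and "intelligent" meaning "send to whichever dyno frees up first". Its numbers will not match the ones quoted.

```ruby
# Returns the fraction of requests that had to wait behind another request.
def queued_fraction(dynos:, requests:, gap_ms:, seed:, smart:)
  rng = Random.new(seed)
  free_at = Array.new(dynos, 0.0)     # time at which each dyno becomes idle
  queued = 0
  requests.times do |n|
    now = n * gap_ms
    service = rng.rand < 0.1 ? 225.0 : 50.0   # made-up two-point distribution
    pick = rng.rand(dynos)                    # drawn in both modes so service times match
    i = smart ? free_at.each_index.min_by { |d| free_at[d] } : pick
    queued += 1 if free_at[i] > now
    free_at[i] = [free_at[i], now].max + service
  end
  queued.to_f / requests
end

args = { dynos: 30, requests: 9000, gap_ms: 60_000.0 / 9000, seed: 1 }
random_frac = queued_fraction(**args, smart: false)
smart_frac = queued_fraction(**args, smart: true)
puts format("queued with random routing: %.1f%%", random_frac * 100)
puts format("queued with least-loaded routing: %.1f%%", smart_frac * 100)
```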

DannyBee · on Feb 14, 2013

Yes, it is very hard to do it at scale, but so what? I mean, isn't the whole premise of their company to do intelligent things at scale so you don't have to?

It's not an insurmountable problem by any measure, and it's definitely worth it.

pdonis · on Feb 14, 2013

> The solution here is to figure out why your 99th is 3 seconds.

I'm not sure this applies to the OP. His in-app measurements were showing all requests being handled very fast by the app itself; the variability in total response time was entirely due to the random routing.

mcphilip · on Feb 13, 2013

Wouldn't the bad tails of random routing be an unpredictably long length of time since long running request times are unpredictable?

Even if you work on narrowing the fat tails, shouldn't you still need to be upfront and clear about how adding a new dyno only gives you an increased chance of better request handling times as you scale?

nevinera · on Feb 14, 2013

That isn't the only problem with random routing - the problems aren't as pronounced with uniform response speeds, but you still get a significant difference in net effective queue time, especially if you're operating close to your throughput limit.

cmccabe · on Feb 14, 2013

> The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).

The Golang runtime uses non-blocking I/O to get around this problem.

cmccabe · on Feb 15, 2013

Wow, it's annoying to get modded down for saying something correct. Don't believe me? Run a golang program with strace -f sometime.

You could write a pthreads-compliant threading library without using threads at all, just epoll.

nthj · on Feb 13, 2013

I'm inclined to wait until Heroku weighs in to render judgement. Specifically, because their argument depends on this premise:

> But elsewhere in their current docs, they make the same old statement loud and clear: > The heroku.com stack only supports single threaded requests. Even if your application were to fork and support handling multiple requests at once, the routing mesh will never serve more than a single request to a dyno at a time.

They pull this from Heroku's documentation on the Bamboo stack [1], but then extrapolate and say it also applies to Heroku's Cedar stack.

However, I don't believe this to be true. Recently, I wrote a brief tutorial on implementing Google Apps' openID into your Rails app.

The underlying problem with doing so on a free (single-dyno) Heroku app is that while your app makes an authentication request to Google, Google turns around and makes a "oh hey" request to your app. With a single-concurrency system, your app times out waiting for Google to get back to you and Google won't get back to you until your app gets back to you so hey deadlock.

However, there is a work-around on the Cedar stack: configure the unicorn server to supply 4 or so worker processes for your web server, and the Heroku routing mesh appropriately routes multiple concurrent requests to Unicorn/my app. This immediately fixed my deadlock problem. I have code and more details in a blog post I wrote recently. [2]

This seems to be confirmed by Heroku's documentation on dynos [3]: > Multi-threaded or event-driven environments like Java, Unicorn, and Node.js can handle many concurrent requests. Load testing these applications is the only realistic way to determine request throughput.

I might be missing something really obvious here, but to summarize: their premise is that Heroku only supports single-threaded requests, which is true on the legacy Bamboo stack but I don't believe to be true on Cedar, which they consider their "canonical" stack and where I have been hosting Rails apps for quite a while.

[1] https://devcenter.heroku.com/articles/http-routing-bamboo

[2] http://www.thirdprestige.com/posts/your-website-and-email-ac...

[3] https://devcenter.heroku.com/articles/dynos#dynos-and-reques...

[edit: formatting]

wwarnerandrew · on Feb 14, 2013

Yes, it's true that the Cedar stack supports forking web servers like unicorn, and that an individual dyno can run multiple workers and therefore serve multiple requests at the same time.

However, dumb routing is still very problematic – even if your dyno can work on two requests simultaneously it's still bad for it to get sent a third request when there are other open dynos.

Also, for apps with a large-ish memory footprint, you can't run very many workers. A heroku dyno has 512mb memory, so if your app has a 250mb footprint, then you can basically only have two workers.

Another essential point to note is that the routing between cedar and bamboo is essentially unchanged. They simply changed the type of apps you can run.

richcollins · on Feb 14, 2013

Right, if the app's internal queue is full and it stops accepting connections I'm assuming it will still queue at the dyno level anyway.

kennystone · on Feb 14, 2013

If you have 2 unicorn servers and you happen to get 3 slow requests routed to it, you are still screwed, right? Seems to me like it will still queue on that dyno.

michaelrkn · on Feb 14, 2013

That's exactly what happened to us - switching to unicorn bought us a little time and a bit of performance, but we hit the exact same problems again after a couple more weeks of growth.

ibdknox · on Feb 14, 2013

Yeah, the only real question is whether or not it's true that they no longer do intelligent routing. If that is the case, then regardless of anything else the problem exists once you pass a certain scale/request cost. It won't matter if that one dyno can handle hundreds of requests at once, it will still queue stupidly.

barmstrong · on Feb 14, 2013

This is true - unicorn masks the symptoms for a period of time but does not solve the underlying problem in the way a global request queue would.

Also, if the unicorn process is doing something cpu intensive (vs waiting on a 3rd party service or io etc) then it won't serve 3 requests simultaneously as fast as single processes would.

rjacoby5 · on Feb 14, 2013

One of the hidden costs of Unicorn is spin-up time. Unicorn takes a long time to start, then fork. We would get a ton of request timeouts during this period. Switching back to Thin, we never got timeouts during deploys - even under very heavy load.

adrr · on Feb 14, 2013

Maybe this is a stupid question, but with unicorn it forks the request and can process multiple requests at the same time. Previously it seems that only one request could be handled by the dyno so requests had to queue on the dynamic routing layer but with multiple request support with unicorn or whatever, wouldn't it be more efficient to dump all the requests to dynos? Followup question, also how would intelligent routing work if it just previously checked to see which dyno had no requests? That seems like an easy thing to do, now you would have to check CPU/IO whatever and route based on load. Not specifically targeted at you but to everyone reading the thread.

vidarh · on Feb 14, 2013

> Previously it seems that only one request could be handled by the dyno so requests had to queue on the dynamic routing layer but with multiple request support with unicorn or whatever, wouldn't it be more efficient to dump all the requests to dynos?

It would be if all requests were equal. If all your requests always take 100ms, spreading them equally would work fine.

But consider if one of them takes longer. Doesn't have to be much, but the effect will be much more severe if you e.g. have a request that grinds the disk for a few seconds.

Even if each dyno can handle more than one request, since those requests share resources, if some of them slow down due to some long running request, response times for the other requests are likely to increase, and as response times increase, its queue is likely to increase further, and it gets more likely to pile up more long running requests.

> Followup question, also how would intelligent routing work if it just previously checked to see which dyno had no requests? That seems like an easy thing to do, now you would have to check CPU/IO whatever and route based on load. Not specifically targeted at you but to everyone reading the thread.

There is no perfect answer. Just routing by least connections is one option. It will hurt some queries that will end up being piled up on servers processing a heavy request in high load situations, but pretty soon any heavily loaded servers will have enough connections all the time that most new requests will go to lighter loaded servers.

Adding "buckets" of servers for different types of requests is one option to improve it, if you can easily tell by url which requests will be slow.

nevinera · on Feb 14, 2013

That gets pretty unlikely, especially if you have many dynos and a low frequency of slow requests. The main reason unicorn can drastically reduce queue times here is that it does not use random routing internally.

richcollins · on Feb 14, 2013

how does it decide to queue at the dyno level anyway? Does it check for connection refusal at the TCP level?

dblock · on Feb 14, 2013

The connection is accepted, and a single-threaded web server will do the queuing.

richcollins · on Feb 15, 2013

Oh so the server process hosting rails is itself queueing? Is that what they refer to as "dyno queueing"? I thought perhaps there was another server between the router and your apps server process.

spoiler · on Feb 14, 2013

Slightly off topic, but what are everyone's experiences and thoughts about Puma[1]?

I am using it on a small production environment with Heroku and I like it, but when we officially launch the app, should we switch to Unicorn?

[1] http://puma.io/

gizzlon · on Feb 14, 2013

I don't have a stake in the Ruby webserver wars, but the unicorn site has a very good discussion about how it works internally, why it's built as it is, pros & cons etc.

This seems to be missing from most of these project sites, which are often just marketing (look! It's better!!), and therefore not very trustworthy.

From the outside it looks like the biggest differentiator in each generation of ruby servers (and, I guess, db management systems :) is not that the new is better or worse, but simply that it has different trade-offs.

ylansegal · on Feb 14, 2013

I did some performance analysis on puma vs unicorn vs thin a while ago:

http://ylan.segal-family.com/blog/2012/08/20/better-performa...

Although as noted in the comments, I neglected to run threadsafe! and should have probably tried rubinius or jruby. I have been meaning to redo it. Take it with a grain of salt.

argarg · on Feb 14, 2013

You should give a try to puma 2.0 currently in beta 6 (https://rubygems.org/gems/puma). Lots of performance improvements. I haven't benchmarked it but my guess is it outperforms unicorn.

_gtly · on Feb 14, 2013

You seem to have written one of the very few articles I've seen benchmarking this. I'd love to know more about how Puma compares to Unicorn (especially unicorn configurations mentioned by some in this conversation) and Thin for serving rails on Heroku. Many of the Unicorners pushing their solution don't appear to be aware of Puma and its potential benefits. I'm curious if Puma with MRI has benefits, too. Thanks!

miloshadzic · on Feb 14, 2013

AFAIK Puma should have a lower memory footprint on MRI than Thin but I haven't done benchmarks myself.

steveklabnik · on Feb 14, 2013

Also note that Rails 4 will have threadsafe! on by default, too, so if you didn't have it on here, that'll make it different for those apps.

tcc619 · on Feb 14, 2013

I have been using it on a few small projects. Haven't run into an issue yet and the setup has been really easy.

sorentwo · on Feb 14, 2013

We are running a smaller app on JRuby using Puma with thread safe on. It has a significantly smaller footprint, as we are only booting one server.

Overall really solid, though more useful if you can use something other than MRI.

jarcoal · on Feb 13, 2013

This is how I run my apps as well, and they seem to handle more than one request concurrently per dyno, but I'm not smart enough to dispute this post, so I'm just sitting back and watching.

dedsm · on Feb 14, 2013

Nevertheless, random routing is a bad idea even if a dyno can handle multiple requests simultaneously

scottshea · on Feb 14, 2013

To amplify, I do a ton of queue adjustment work with Unicorn at the backlog level. It is so frequent that we set up Unicorn startup to read from an ENV variable on Heroku that we set as needed.

With two Unicorn workers we found that 25 was the best backlog threshold to accept (it refuses additional requests). When we were able to go to 5 Unicorn workers on Heroku we had to start to adjust that.

tibbon · on Feb 14, 2013

You don't happen to have any documentation for how to do that do you? Very curious. Never seen anything about setting up Unicorn like this prior (I'm just using 3 or 4 Unicorns/dyno currently)

jwrubel · on Feb 14, 2013

Here's a gist of our unicorn.rb config (https://gist.github.com/apangeajwrubel/4953849) Using env variables lets us adjust parameters without a code push (still requires a restart). We saw dramatic reduction in 'long tail' requests when we dropped the backlog to 25. We're experimenting now (thanks to @scottshea) with even lower values. At some point the routing mesh will give up retrying and throw an H21 (https://devcenter.heroku.com/articles/error-codes#h21-backen...). One data point that would be useful from heroku is how many retries we get.

pointful · on Feb 14, 2013

To expand on this:

You have to remove the port declaration from the line for Unicorn in your Procfile, and then add a line like this to your unicorn.rb file to define the listener port along with adjusting the backlog size:

listen ENV['PORT'], :backlog => Integer(ENV['UNICORN_BACKLOG'] || 100)

scottshea · on Feb 14, 2013

We did this in unicorn.rb `:backlog => Integer(ENV['UNICORN_BACKLOG'] || 200)` and then set the UNICORN_BACKLOG variable by the Heroku command line `heroku config:set UNICORN_BACKLOG=25 -a <app_name>`. We have been as high as 1024 and as low as 10. We settled in at 25 for us.

seivan · on Feb 13, 2013

How I have been doing it for the last year.

Puma define 4:8 threads or Unicorn 3 workers.

habosa · on Feb 13, 2013

Wow.

Normally when I read "X is screwing Y!!!" posts on Hacker News I generally consider them to be an overreaction or I can't relate. In this case, I think this was a reasonable reaction and I am immediately convinced never to rely on Heroku again.

Does anyone have a reasonably easy to follow guide on moving from Heroku to AWS? Let's keep it simple and say I'm just looking to move an app with 2 web Dynos and 1 worker. I realize this is not the type of app that will be hurt by Heroku's new routing scheme but I might as well learn to get out before it's too late.

michaelrkn · on Feb 14, 2013

My company switched off of Heroku for our high-load app because of these same problems, but I still really like Heroku for apps with smaller loads, or ones in which I'm willing to let a very small percentage of requests take a long time.

stevewilhelm · on Feb 14, 2013

Heroku Support Request #76070

To whom it may concern,

We are long time users of Heroku and are big fans of the service. Heroku allows us to focus on application development. We recently read an article on HN entitled 'Heroku's Ugly Secret' http://s831.us/11IIoMF

We have noticed similar behavior, namely that increasing dynos does not provide the performance increases we would expect. We continue to see wildly different performance responses across different requests that New Relic metrics and internal instrumentation cannot explain.

We would like the following:

1. A response from Heroku regarding the analysis done in the article, and

2. Heroku-supplied persistent logs that include information on how long requests are queued for processing by the dynos

Thanks in advance for any insight you can provide into this situation and keep up the good work.

stevewilhelm · on Feb 15, 2013

Heroku's response:

Hi Steve,

I've been reading through all the concerns from customers, and I want every single customer to feel that Heroku is transparent and responsive. Our job at Heroku is to make you successful. Getting to the bottom of this situation and giving you a clear and transparent understanding of what we’re going to do to make it right is our top priority. I am committing to the community to provide more information as soon as possible, including a blog post on http://blog.heroku.com.

Oren Teich, Heroku GM

htsh · on Feb 13, 2013

Why not hire a devops guy & rack your own hardware? Or get some massive computing units at amazon (just as good but more expensive)?

This reminds me of the excellent 5 stages of hosting story shared on here from a while back:

http://blog.pinboard.in/2012/01/the_five_stages_of_hosting/

regularfry · on Feb 13, 2013

Because the whole point is that you shouldn't have to.

Scramblejams · on Feb 14, 2013

I don't know, $20k/mo strikes me as an awful lot of money to avoid engaging in work that a scrappy internet startup really ought to be competent at. If you don't know how to run the pipes and they get clogged and your plumber's not picking up the phone, you're screwed.

That amount buys a whole lotta dedicated servers and the talent to run them. (Sidenote: Every time I price AWS or one of its competitors for a reasonably busy site, my eyes seem to pop out at the cost when compared to dedicated hardware and the corresponding sysadmin salary.)

The larger issue is: Invest in your own sysadmin skills, it'll pay off in spades, especially when your back's up against the wall and you figure out that the vendor-which-solves-all-your-problems won't.

encoderer · on Feb 14, 2013

Three thoughts:

1. Employees are expensive. A good ops guy who believes in your cause and wants to work at an early stage startup can be had for $100k. (Maybe NYC is much cheaper than the bay area, but I'll use bay area numbers for now because it's what I know). That's base. Now add benefits, payroll taxes, 401k match, and the cost of his options. So what... $133k? That's one guy who can then never go on vacation or get hit by the proverbial bus. Now buy/lease your web cluster, database cluster, worker(s), load balancers, dev and staging environments, etc. Spend engineering time building out Cap and Chef/Puppet scripts and myriad other sysops tools. (You'd need some of that on AWS for sure, but less on Heroku, which is certainly much much more expensive than AWS)

2. When you price-out these AWS systems are you using the retail rates or are you factoring in the generous discount larger customers are getting from Amazon? You realize large savings first by going for reserved instances and spot pricing and stack on top of that a hefty discount you negotiate with your Amazon account rep.

3. I've worked at 2 successful, talented Bay Area startups in the last few years: One that was built entirely on AWS, and now, currently, one that owns all of their own hardware. Here's what I think: It's a wash. There isn't a huge difference in cost. You should go with whatever your natural talents lead you towards. You have a founding team with solid devops experience? Great, start on the cloud and then transition early to your own hardware. If not, focus on where your value-add is and outsource the ops.

baudehlo · on Feb 14, 2013

Your last sentence hides another option, which is the route we took: outsource devops. It cost us about $2000 in consultant fees for a fully set-up system that is easy to expand and add hardware to, and much more cost effective long term than AWS or Heroku. Our guy runs a small ops company that has 24/7 on-call. It's really the perfect solution.

Scramblejams · on Feb 14, 2013

I may have a use for this. Link to your ops provider?

jamwt · on Feb 14, 2013

But... you do. If you get big and need to scale, you're going to want to take the control back. Them's the breaks.

They should go dedicated for now, it's too early to colo IMO.

c3 · on Feb 13, 2013

We just switched 1/3rd of our infrastructure off our existing host (Engine Yard, which uses AWS) onto raw AWS and saved about $2500/month. You can do it too!

bherms · on Feb 14, 2013

I've been around a few companies migrating from EY/Heroku -> AWS and the cost effectiveness is always astonishing. In addition you gain full control over your architecture, which is a major plus.

Vitaly · on Feb 14, 2013

And how much more engineering time do you waste on it now?

douglasfshearer · on Feb 14, 2013

Interested in how you achieved this. Did you change server setup significantly from the default EY stack?

adanto6840 · on Feb 14, 2013

I'm curious about this too, especially the last part (similar / identical stack) ?

barmstrong · on Feb 14, 2013

We were very surprised to discover Heroku no longer has a global request queue, and spent a good bit of time debugging performance issues to find this was the culprit.

Heroku is a great company, and I imagine there was some technical reason they did it (not an evil plot to make more money). But not having a global request queue (or "intelligent routing") definitely makes their platform less useful. Moving to Unicorn helped a bit in the short term, but is not a complete solution.

homosaur · on Feb 14, 2013

While I generally agree with your thoughts, I also wonder what's the reason for continuing to misrepresent their service until you dig 20 layers deep in docs.

rapind · on Feb 14, 2013

I'd been using Heroku since forever, but bailed on them for a high traffic app last year (Olympics related) due to poor performance once we hit a certain load (adding dynos made very little difference). We were paying for their (new at the time) critical app support, and I brought up that it appears to be failing at a routing level continuously. And this was with a Sinatra app served by Unicorn (which at the time at least was considered unsupported).

We went with a metal cluster setup and everything ran super smooth. I never did figure out what the problem was with Heroku though and this article has been a very illuminating read.

gojomo · on Feb 14, 2013

They want to force the issue with a public spat. Fair enough.

But they might also be able to self-help quite a bit. RG makes no mention of using more than 1 unicorn worker per dyno. That could help, making a smaller number of dynos behave more like a larger number. I think it was around when Heroku switched to random routing that they also became more officially supportive of dynos handling multiple requests at once.

There's still the risk of random pileups behind long-running requests, and as others have noted, it's that long-tail of long-running requests that messes things up. Besides diving into the worst offender requests, perhaps simply segregating those requests to a different Heroku-app would lead to a giant speedup for most users, who rarely do long-running requests.

Then, the 90% of requests that never take more than a second would stay in one bank of dynos, never having pathological pile-ups, while the 10% that take 1-6 seconds would go to another bank (by different entry URL hostname). There'd still be awful pile-ups there, but for less-frequent requests, perhaps only used by a subset of users/crawler-bots, who don't mind waiting.

gojomo · on Feb 14, 2013

On further thought, Heroku users could probably even approximate the benefits from the Mitzenmacher power-of-two-choices insight (mentioned elsewhere in thread), without Heroku's systemic help, by having dynos shed their own excess load.

Assume each unicorn can tell how many of its workers are engaged. The 1st thing any worker does – before any other IO/DB/net-intensive work – would be to check if the dyno is 'loaded', defined as all other workers (perhaps just one, for workers=2) on the same dyno already being engaged. If so, the request is redirected to a secondary hostname, getting random assignment to a (usually) different dyno.

The result: fewer big pileups unless completely saturated, and performance approaching smart routing but without central state/queueing. There is an overhead cost of the redirects... but that seems to fit the folk wisdom (others have also shared elsewhere in thread) that a hit to average latency is worth it to get rid of the long tails.

(Also, perhaps Heroku's routing mesh could intercept a dyno load-shedding response, ameliorating pile-ups without taking the full step back to stateful smart balancing.)

Added: On even further thought: perhaps the Heroku routing mesh automatically tries another dyno when one refuses the connection. In such a case, you could set your listening server (g/unicorn or similar) to have a minimal listen-backlog queue, say just 1 (or the number of workers). Then once it's busy, a connect-attempt will fail quickly (rather than queue up), and the balancer will try another random dyno. That's as good as the 1-request-per-dyno-but-intelligent-routing that RapGenius wants... and might be completely within RapGenius's power to implement without any fixes from Heroku.
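Whether the routing mesh really retries after a refused connection is the open question in this comment; the toy model here simply assumes it does. `Dyno`, `route_with_retries`, the capacity of 1 and the retry budget of 10 are all made-up names and numbers, not Heroku's behaviour.

```ruby
# Hypothetical model: each dyno accepts at most `capacity` in-flight requests
# (workers plus listen backlog) and refuses everything beyond that.
Dyno = Struct.new(:capacity, :in_flight) do
  def accept
    return false if in_flight >= capacity
    self.in_flight += 1
    true
  end
end

# Pick random dynos until one accepts or the retry budget runs out.
# Returns the index of the accepting dyno, or nil if every attempt was refused.
def route_with_retries(dynos, max_attempts:, rng:)
  max_attempts.times do
    index = rng.rand(dynos.size)
    return index if dynos[index].accept
  end
  nil
end

rng = Random.new(1)
dynos = Array.new(4) { Dyno.new(1, 0) }
# Six requests arrive and none finish: at most four can be placed.
results = Array.new(6) { route_with_retries(dynos, max_attempts: 10, rng: rng) }
p results
```

In this model no dyno ever holds more than its capacity, so a long-running request cannot collect a queue behind it; the cost is the extra connection attempts, and requests are dropped (nil here) once the retry budget is exhausted.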

mononcqc · on Feb 14, 2013

Most of the 'Power of two choices' I've read about assumes the presence of a global queue (http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...) -- there's a parallel variety, but they go light on details in that text.

I'm unaware of how Heroku does things. I'd guess they dropped the global queue because it's impractical (failure prone, not scalable as it's a single point of contention).

I'm mostly surprised to see people happy being able to handle 1 or 2 requests in parallel per instance in general. That sounds absolutely insane to me.

gojomo · on Feb 14, 2013

FWIW, the gunicorn in my Heroku web dynos is set to use 12 workers, though it hasn't been stressed at that level.

zenazn · on Feb 13, 2013

Randomized routing isn't all bad. In fact, if Heroku were to switch from purely random routing to minimum-of-two random routing, they'd perform asymptotically better [1].

[1]: http://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.p...
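A rough way to see the difference is a static balls-into-bins simulation. This is a simplification, not a model of Heroku's router: requests never complete, every request costs the same, and the dyno and request counts are arbitrary. `max_queue_depth` is a name invented for this sketch.

```ruby
# Assign `requests` to `dynos` queues and report the deepest queue.
# choices: 1 is purely random routing; choices: 2 samples two dynos at random
# and sends the request to whichever currently has the shorter queue.
def max_queue_depth(dynos:, requests:, choices:, rng:)
  queues = Array.new(dynos, 0)
  requests.times do
    candidates = Array.new(choices) { rng.rand(dynos) }
    target = candidates.min_by { |d| queues[d] }
    queues[target] += 1
  end
  queues.max
end

rng = Random.new(42)
random_max = max_queue_depth(dynos: 100, requests: 10_000, choices: 1, rng: rng)
two_choice_max = max_queue_depth(dynos: 100, requests: 10_000, choices: 2, rng: rng)
puts "purely random:  deepest queue = #{random_max}"
puts "minimum-of-two: deepest queue = #{two_choice_max}"
```

With an average of 100 requests per dyno, the purely random run typically ends with its worst queue well above that average, while the two-choice run stays within a few requests of it. Note that the sketch reads `queues[d]` for free, which is exactly the information a distributed router may not have.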

jemfinch · on Feb 14, 2013

If Heroku had the data needed to do minimum-of-two random routing, they'd have the data needed to do intelligent routing. The problem is not the algorithm itself: "decrement and reheap" isn't going to be a performance bottleneck. The problem is tracking the number of requests queued on the dyno.

gojomo · on Feb 14, 2013

"If Heroku had the data needed to do minimum-of-two random routing, they'd have the data needed to do intelligent routing."

Not strictly true; imagine that they can query the load state of a dyno, but at some non-zero cost. (For example, that it requires contacting the dyno, because the load-balancer itself is distributed and doesn't have a global view.)

Then, contacting 2, and picking the better of the 2, remains a possible win compared to contacting more/all.

See for example the 'hedged request' strategy, referenced in a sibling thread by nostradaemons from a Jeff Dean Google paper, where 2 redundant requests are issued and the slower-to-respond is discarded (or even actively cancelled, in the 'tied request' variant).
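As a sketch of the simplified form described here (both requests sent at once, first answer wins), with lambdas and `sleep` standing in for real backends. `hedged_request` is a hypothetical name, and killing the slower thread is just one way to discard it.

```ruby
require 'thread'

# Send the same call to every backend and return whichever answer arrives first.
def hedged_request(backends)
  answers = Queue.new
  threads = backends.map do |backend|
    Thread.new { answers << backend.call }
  end
  first = answers.pop    # blocks until the fastest backend responds
  threads.each(&:kill)   # discard the slower attempt(s)
  first
end

# Stand-in backends: one is stuck behind a long request, the other is idle.
slow = -> { sleep 0.5; :slow }
fast = -> { sleep 0.01; :fast }
puts hedged_request([slow, fast])   # prints "fast"
```

The trade-off is doubled work for every request in exchange for a shorter tail, which is the same bargain as contacting 2 dynos and picking the better one.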

lil_tee · on Feb 13, 2013

I'm not familiar with minimum-of-two-random routing, but it does seem like assigning requests to dynos in sequence would perform much better than assigning randomly (i.e., in a scenario with n dynos, request 1 => dyno 1, request 2 => dyno 2, ... request n => dyno n, request n+1 => dyno 1, ..., repeat)

That'd probably be significantly better than the case of (request i => dyno picked out of hat) for all i
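The sequence above, as a minimal sketch. `RoundRobinRouter` is an invented name, dynos are numbered from 0 rather than 1, and a single in-process counter is assumed, which is the part that gets harder once there is more than one router.

```ruby
# Hand out dyno indices in a fixed rotation: 0, 1, ..., n-1, 0, 1, ...
class RoundRobinRouter
  def initialize(dyno_count)
    @dyno_count = dyno_count
    @next = 0   # state that every router instance would have to share
  end

  def route
    dyno = @next
    @next = (@next + 1) % @dyno_count
    dyno
  end
end

router = RoundRobinRouter.new(3)
assignments = Array.new(7) { router.route }
p assignments   # => [0, 1, 2, 0, 1, 2, 0]
```

This evens out the number of requests per dyno, not the time they take, so a dyno that draws a slow request can still build a queue.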

icebraining · on Feb 14, 2013

That's essentially round-robin, but it still requires syncing the data of which dyno is next, which is probably what they're trying to avoid.

pencilcode · on Feb 14, 2013

Thanks! Very useful.

goronbjorn · on Feb 13, 2013

Aside from the Heroku issue, this is an amazing use of RapGenius for something besides rap lyrics. I didn't have to google anything in the article because of the annotations.

brusch · on Feb 14, 2013

Funny - I thought this was a really interesting article - but I couldn't stand all these annotations. And when I was selecting text (which I do mindlessly when I am reading an article) all hell broke loose and it tried to load something.

Very annoying when I want to concentrate on the technical details. So we'll see once again - everyone's different.

yeonhoyoon · on Feb 14, 2013

Rap Genius is definitely something bigger than rap lyrics. It could potentially be used as a platform for collectively deciphering any text.

http://rapgenius.com/Marc-andreessen-why-andreessen-horowitz...

zeeg · on Feb 14, 2013

If this is such a problem for you, why are you still on Heroku? It's not a be-all end-all solution.

I got started on Heroku for a project, and I also ran into limitations of the platform. I think it can work for some types of projects, but it's really not that expensive to host 15m uniques/month on your own hardware. You can do just about anything on Heroku, but as your organization and company grow it makes sense to do what's right for the product, and not necessarily what's easy anymore.

FYI I wrote up several posts about it, though my reasons were different (and my use-case is quite a bit different from a traditional app):

* http://justcramer.com/2012/06/02/the-cloud-is-not-for-you/

* http://justcramer.com/2012/08/30/how-noops-works-for-sentry/

rdl · on Feb 13, 2013

Wow. I suspect Rap Genius has the dollars now where it's totally feasible for them to go beyond Heroku, but it still might not be the best use of their time. But if they have to do it, they have to do it.

OTOH, having a customer have a serious problem like this AND still say "we love your product! We want to remain on your platform", just asking you to fix something, is a pretty ringing endorsement. If you had a marginal product with a problem this severe, people would just silently leave.

zende · on Feb 13, 2013

Rap Genius is limited more by time than by money if anything. It would make more sense to throw money at the problem instead of people.

gleb · on Feb 13, 2013

It doesn't appear that running on Heroku is free for them in terms of time.

rdl · on Feb 13, 2013

There's also the outage hell. It's been ok for a month or two, but getting killed whenever AWS has a blip in US-East (there's no cross-region redundancy, and minimal resilience when an AZ or Region-wide service has serious problems) isn't great.

It probably doesn't hurt RG as much as lower overall performance during normal operations does, though.