Heroku Blog: Routing Performance Update (heroku.com)
309 points by sinak on Feb 16, 2013 | 192 comments

Come on startups, you should be technically skilled and able to optimize in order to spend little money. If you sum EC2 and Heroku you are going to pay like 10x what it takes to run the same machines power in a dedicated server, all this because you can't handle the operations? This is absurd IMHO.

Also, for people who want to start a business, there is a huge opportunity here: create software that makes managing Apache, Redis, PostgreSQL, ..., on dedicated servers very easy and robust. Target a popular and robust non-commercial distribution like Ubuntu LTS, and provide all that is needed to deploy web nodes and database nodes, with backups, monitoring, and everything else made trivial.

Startups can give you 5% of what they are giving now to EC2 and Heroku and you still will make a lot of money.

"I can only write my Ruby code but can't handle operations" is not the right attitude, grow up. (This is what, in a different context, a famous security researcher told me in a private mail 15 years ago, and it was the right advice)

> Also, for people who want to start a business, there is a huge opportunity here: create software that makes managing Apache, Redis, PostgreSQL, ..., on dedicated servers very easy and robust. Target a popular and robust non-commercial distribution like Ubuntu LTS, and provide all that is needed to deploy web nodes and database nodes, with backups, monitoring, and everything else made trivial.

So the pitch is: clone Heroku, which has taken dozens of very smart engineer man-years to build and refine, then charge 5% of what the market will bear?

> "I can only write my Ruby code but can't handle operations" is not the right attitude, grow up.

I handled ops for 120+ Rails apps while managing a team of juniors and making time to write code. What a stupid waste of my time: ops is forever & never-ending, while implementing & delivering a new feature to 100K customers can bump sales' conversions permanently.

If I can outsource relatively-linearly-increasing Ops costs and instead focus on delivering value that multiplies compounding-interest style, that's not childish.

It's great business.

That's what I think about your reasoning.

1) Once you develop a culture of PaaS the reality is, if you check the history of most startups, you'll continue along this way, and the money you'll burn will be massive. I suspect that if you build a sysop culture earlier, later you'll be able to just hire a full time sysop instead of spending a lot more.

2) If you don't understand your platform very well from the point of view of operations, you'll likely build a non-scalable system (scalable not just in the sense of "let's distribute the load among N nodes", but in terms of the constant time / CPU / energy it takes to serve a request). You'll end up spending all that time again porting your app to another framework / language. This happens all the time.

3) The first months of a startup are usually not so critical from the point of view of operations, it is unlikely you'll spend a lot of time managing servers, unless...

Unless you selected a platform that is too complex to start with. I think that many of the PaaS companies I see today are here mainly for two reasons: Rails & Java frameworks.

So first of all, select your platform wisely.

I'm not sure Rails and Java are the culprits. Rails deployment is actually not that difficult once you get a few things figured out.

Perhaps the problem is a generation that has grown up without looking under the hood as much, preferring the flashy "just works" Mac, for instance, to running Linux as a desktop OS and knowing something about how to administer it. That's a skill that may not allow you to be an admin for a big site, but should be enough to help you get a server up and running, and maintain it.

Also - and this is a positive development - as of late there has been more focus on making sure you've got something the market wants before you worry about scaling and other problems like that.

Still though, I think that it's not that hard to get a basic server up and running, and deploy to something you manage yourself. Amongst other things, it gives you way more flexibility than what something like Heroku can offer. Want to recompile Postgres yourself, with some funky option enabled? Not a problem!

David what you say is absolutely possible, that is, the reason could be more of a cultural / focus shift, but when I think to invest my money on a startup where the core guys can't, if they want, setup a server and run their operations if needed or fix a bug overnight understanding what's wrong in a server, I feel there is something wrong.

About the focus on making sure you build something the market wants, it is a good approach, but on the other side maybe the focus is just on that, and there is little interest about the ability to create a sustainable service where the only exit is not being acquired. A big part of being sustainable, especially with the freemium model, is cheap operations. It's a key value of being able to say to your users once you start to get some traction, ok, give me 5$ per month, and I'll be profitable.

From that point of view startups spending like 2000$ / month while not even remotely profitable while it is possible perhaps with more wise coding and in-house ops to spend 200$ seems very odd to me, and not a trend I want to see encouraged, so I posted my original message in order to provide, maybe with some provocation intent, a different point of view.

Just think for a moment if there are no longer big companies acquiring you (that is now the norm but it was not in the past) how the operations and costs point of view changes.

> Unless you selected a platform that is too complex to start with. I think that many of the PaaS companies I see today are here mainly for two reasons: Rails & Java frameworks.

Can't speak for Heroku but with EC2 you get the ability to dynamically add/remove instances. While this seems like it's needed only for certain kinds of startups, the reality is that most background processing is batch oriented and it's extremely cost effective to just spin up instances when I need them, crunch the data and terminate them after that. And, if you go for spot instances (and use a fault tolerant framework like Hadoop to distribute jobs), it becomes even cheaper.

I think EC2 is not quite "PaaS", since you still need to administer the instances you run.

> Unless you selected a platform that is too complex to start with. I think that many of the PaaS companies I see today are here mainly for two reasons: Rails & Java frameworks.

I think this is where I misunderstood you originally. For 3 days, Hacker News has been discussing the performance of -Rails- on Heroku. If I had understood at the beginning that your real argument is to use Sinatra/PHP/NodeJS/whatever (you're somewhat vague on recommended alternatives), I would not have jumped in. I do not have enough experience hosting other tech stacks at scale to be advising on the viability of open source solutions.

As for your other 3 points, presuming a Rails stack: all of those smell like a premature optimization for any company with less than 500K customers. If I went to my clients and said "Hey, we should invest 1-2 days in spooling up a server, because if we gain traction 6 months from now, we'll have to pay Heroku a lot of money until we fix it!", they would grin a bit and say:

"That would be a great problem to have. Now go back to shipping features so we gain traction and actually have to worry about fantastic problems like how to deal with having too many users."

Yes, features are important, and you don't want to waste your time doing stuff you can easily outsource. However, I think it's not about saving $2000 by outsourcing, it's really about understanding how your software runs. How each request is handled. How each layer and architecture affects scalability, performance and stability.

One of the things we've seen in this Rails/Heroku example is that the platform can't really be completely transparent. If you treat it as some obscure service and don't understand how it fits together - you lose. You can't optimize, you can't scale, you can't give consistent experience to your users.

I personally don't particularly like the distinction between devops and developers. I think all developers should have a good grasp of the platform. Of how pieces fit together. This is a worthwhile investment, not to save money on your PaaS, but to build a good solid app with good architecture. Performance, scalability and security are also features you should be shipping.

No, presumably the pitch is: clone Heroku's software stack as an OSS project, then get the server/VPS sellers to let people opt to install it as a package when they rent out a node, the same way things like Wordpress can automatically be installed.

Sure, see OpenStack: http://www.openstack.org/

[that said, comparing automated scaling of a Ruby on Rails SaaS app with 100K+ users to a one-click WordPress install is kinda humorous. Even WordPress has a market for expert DevOps guys to manage your site: http://wpengine.com]

I don't think OpenStack is the right comparison to make to Heroku. In the common parlance, OpenStack is infrastructure as a service, while Heroku is platform as a service. Cloud Foundry http://cloudfoundry.com/ and OpenShift https://openshift.redhat.com/app/ are more comparable. They're both open source.

I think OpenStack, CloudFoundry, ..., are still not what I have in mind, as these solutions are designed more for scale, that is, for when you want to provide a PaaS service to third parties or you are a very large organization. IMHO the missing piece here is something that you can install on just one or a few nodes to start providing easy-to-set-up and monitored services, with proven configurations and setups and so forth.

Something like TurnkeyLinux? http://www.turnkeylinux.org/

Well, Heroku's stack may indeed be way more impressive in sheer scope, but for any one individual user, there's no difference.

As long as bringing up more than one node in your "application" gets them to start talking to one-another and auto-distributing load between them [including sometimes just electing one of them to serve as a router node that doesn't actually have apps running on it], your experience will be identical to paying Heroku, save for the fact that when one of the machines goes down--although the load will be redirected away from it--you'll have to "manually" fix it if you want it back.

In the default case, this just means dropping the IaaS node responsible and provisioning a new one [possibly on a different IaaS provider, if there's a lot of failure in one provider]. Though if you did pay for physical hardware, there can be some pain here. ;)

Basically, what I'm suggesting is a software stack that is to computation as Tahoe-LAFS is to storage. Just give it a set of machines that resemble Unix boxen--with more separate providers and regions = more better--and it'll give you something that looks like Heroku in exchange.

It could even have an optional provisioner service running atop it, that globally aggregates known IaaS providers and gives them ratings based on machine response times from the perspective of various other components, then lets you just say "okay, you're allowed $N/mo more to scale me up" and it'll find places around the globe where your users are being underserved, find well-rated API-compatible IaaS providers in those places, and provision nodes in those locations, adding them to your mesh. And then dropping them back out for alternatives if they start to suck :)

--you know, if this doesn't already exist, I might be willing to put in some time...

And there are some VERY good reasons to invest in DevOps even for Wordpress.

I'd write a lengthy post on this subject right here, but I already did so elsewhere: http://www.mmomeltingpot.com/2012/03/wpengine-review-after-1...

(And Patio11 wrote on the same topic a while ago, too: http://www.kalzumeus.com/2012/02/09/why-i-dont-host-my-own-b... )

>"I can only write my Ruby code but can't handle operations" is not the right attitude, grow up.

How is such a condescending post at the top?

Everyone running a startup is an idiot because they choose not to waste their time on your priority?

Heroku isn't 10 times more and it wouldn't matter even if it was. Talented people are hard to find and spending time on operations when you might not be around in 4 months may not be the most important thing to focus on. On a 6 to 8 month time scale, Heroku would be 10 times cheaper than taking the hit of setting up everything to emulate it. Deployments, backups, monitoring. Those things take a lot of time to set up correctly.

The annoying thing about that post is that it derails the conversation. The top post should be analyzing Heroku's response, whether it was sufficient or not. Or how developers can mitigate Heroku's dumb-by-design routing issue.

Right, I'm interested in XYZ, so that is what the top post should be ...

I run a startup that is hosted on Heroku. I write Ruby code but can't handle operations (I have and can set up servers but if shit hits the fan I wouldn't have a clue. Also I don't have the time). My startup is a one-man show. And you suggest I should grow up? By grow up you mean that I should watch my customers face days of downtime because I don't know how to fix the server, how to secure the server properly, how to scale or have money to hire someone who can?

Of the "ShitHNSays"-stuff I read here this surely takes the cake.

The primary bottleneck for most startups isn't money, it's time.

Also for most startups (assuming you're not doing something CPU/bandwidth heavy) the actual cost of hosting is going to be a relatively small part of the budget. If your burn rate is a million dollars, reducing 10k of hosting costs to 1k isn't really worth the effort.

Kind of depends

There's the "two guys in a dorm eating ramen" and there's the YC backed "startup" that already has a couple of hires

So if your monthly budget is of hundreds of dollars, using a VPS instead of Heroku makes sense

Let's say your average HTTP request takes 200ms to serve, then with 5 Heroku worker dynos you can serve about 10 million requests a day which should comfortably cover the requirements of most startups.

How much do those dynos cost? $143/month. That's less than three hours' salary for a developer.

If your monthly budget is hundreds of dollars I'm guessing your traffic is low enough you can just use the free tiers of heroku, appfog, etc.

10,000,000 requests each day are 115 requests per second. I don't follow your math here with a 200ms CPU time per request.

115/30 is almost 4, which is almost 5; 5*200ms = 1 second;

I guess he meant 10M/mo, not 10M/day.

Yes, my mistake.
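
The arithmetic in this exchange can be checked with a rough sketch. It assumes each dyno serves exactly one request at a time, is never idle, and that load is perfectly even, which is the most optimistic case; the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_requests_per_day(dynos: int, service_time_s: float) -> float:
    """Upper bound on daily throughput if every dyno serves one request
    at a time and is never idle (no queuing, perfectly even load)."""
    per_dyno_per_second = 1.0 / service_time_s
    return dynos * per_dyno_per_second * SECONDS_PER_DAY


def required_rate_per_second(requests: float, period_s: float) -> float:
    """Average arrival rate needed to serve `requests` within `period_s`."""
    return requests / period_s


# 5 dynos at 200 ms per request: 25 requests/s at best.
daily_capacity = max_requests_per_day(dynos=5, service_time_s=0.2)
per_day_rate = required_rate_per_second(10_000_000, SECONDS_PER_DAY)
per_month_rate = required_rate_per_second(10_000_000, 30 * SECONDS_PER_DAY)

print(f"capacity: {daily_capacity:,.0f} requests/day")
print(f"10M/day needs {per_day_rate:.1f} requests/s")
print(f"10M/month needs {per_month_rate:.2f} requests/s")
```

Under these assumptions five dynos top out at roughly 2.2 million requests a day, so 10M/day is out of reach while 10M/month (under 4 requests per second on average) fits comfortably.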

Heroku customers don't typically do math when they do capacity planning. That's part of the Operations toolchest they're buying. They just up/down dynos.

Yes, but there's a step before

$143 is money that is missed by a college student, for example.

Better go with something cheaper or the free tiers of Heroku, AWS, etc

You're probably still developing the system, or having only a small amount of requests.

Well the point of all this is that simple scaling math like that doesn't add up because of the queuing latency introduced by heroku's routing layer. But still at that scale your point is valid.
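
To illustrate why the simple math breaks down, here is a small simulation sketch comparing random routing with routing to the dyno that frees up first. Everything in it is an assumption made for illustration (Poisson arrivals, a fixed 200 ms service time, single-request dynos, the function names); it is not a model of Heroku's actual router.

```python
import random


def simulate(num_dynos, num_requests, arrival_rate, service_time, pick, seed=1):
    """Return the mean time requests spend waiting behind other requests."""
    rng = random.Random(seed)
    free_at = [0.0] * num_dynos    # when each dyno finishes its current backlog
    now = 0.0
    total_wait = 0.0
    for _ in range(num_requests):
        now += rng.expovariate(arrival_rate)   # Poisson arrivals
        dyno = pick(rng, free_at)
        start = max(now, free_at[dyno])        # wait if that dyno is busy
        total_wait += start - now
        free_at[dyno] = start + service_time
    return total_wait / num_requests


def random_routing(rng, free_at):
    """Pick any dyno, ignoring whether it is busy."""
    return rng.randrange(len(free_at))


def least_busy_routing(rng, free_at):
    """Pick the dyno that will be free soonest (behaves like one shared queue)."""
    return min(range(len(free_at)), key=free_at.__getitem__)


# 5 dynos, 200 ms per request, 20 requests/s on average (80% utilisation).
args = dict(num_dynos=5, num_requests=20_000, arrival_rate=20.0, service_time=0.2)
random_wait = simulate(pick=random_routing, **args)
least_busy_wait = simulate(pick=least_busy_routing, **args)
print(f"random routing mean wait:     {random_wait:.3f}s")
print(f"least-busy routing mean wait: {least_busy_wait:.3f}s")
```

Both runs have the same total capacity, yet random routing queues requests behind busy dynos while others sit idle. Variable request times, which this sketch leaves out, would be expected to widen the gap further.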

And the real problem is the non-uniformity of the load. If your 10M/mo requests start concentrating around weekends, you've got a problem... Except you haven't, as your infrastructure is elastic, and this is the real bonus.

Also, I'd be very afraid to get accidentally hit by HN traffic if I'd host my own server.

Money can buy time in various forms. I think PaaS is one abstraction level too much to remain flexible and efficient as you grow.

Definitely. This is possible with e.g. puppet already - where I work I've set up puppet modules for provisioning everything automatically for developers, they just need to enter the customer name and whether they need dev, stage and/or live for that customer - the rest is automated. We use http://www.hetzner.de/en/hosting/produktmatrix/rootserver-pr... for hosting - let me just say that I can buy an entire rack with 64 GB ram in each server, for the same cost as one amazon instance :D

We're also looking into Cloud Foundry - basically Open Source Heroku - with support for buildpacks. If the installation gets easy enough, this could be the new thing.

This feels like a distraction. How about instead of pontificating on how someone else chooses to do business, we discuss the actual merits (or lack thereof) of Heroku's routing issues.

That business model is basically what Cloud66 seems to be doing: https://www.cloud66.com/

Looks pretty decent, would make sense to see something similar for other stacks.

It's more like 10x from Heroku to AWS, then another order of magnitude to metal. AWS has other advantages in addition to convenience, though.

My thoughts exactly. Like you don't outsource dev, you can't outsource sysops, it's a core competency to a tech startup.

Is it not more like you can't outsource customer support and product development, but you'd better outsource everything else if it's economically viable?

Thus if your product is some kind of hosting (i.e. blog or image hosting), the reliability is your product, so you don't outsource sysops, but if you offer some project management software... Well it's almost certainly hosted so reliability is your product again and you better have your own sysops.

So the only viable conclusion seems like you limit your trust to the PaaS provider and have some plan for the rare case when they get screwed. Or you could bet on technology not requiring such monstrous resources as Rails and host your entire MegaCorp on a single server rack.

Fail fast.

I can perfectly well install, set up and maintain my own Ruby servers - but it takes time.

Alternatively, I can pay someone else to do that, and remove that timesink from the elapsed time between "start developing" and "find out how well we've achieved market fit".

I can always optimise later - move off Heroku, develop our own load balancing, all that stuff. Once I've got a working product/market fit, I probably will.

But doing that before I know if I'm going to chuck the entire infrastructure in the garbage and move on to idea #2 (and #3, and #4...), or indeed pivot so wildly that we'll have to reorganise all our server stuff anyway, is a waste of time. And time is valuable.

Chef/Puppet/Stackscripts/Salt/etc, and develop your own load balancing? Why on earth would you do that instead of just using HAProxy?

Misspoke - when I said "develop our own load balancing", I meant "test and pick the best existing load balancing solution, then implement that on our servers". That's as opposed to "use whatever our hosting provider offers".

If your Ruby code deploys to X you should know X, if it's part of your stack it should be part of your toolbox.

"create software that makes managing Apache, Redis, PostgreSQL" Like Plesk, but better, faster and less resource hungry.

I'm honestly surprised there aren't tools available now replicating what Heroku does.

We should have puppet scripts to deploy, instrument and manage all the popular infrastructure choices by now.

The same way originally Linux was a build it yourself box of parts, we ought to have "cloud infrastructure" distributions, from bootup to app deployment.

The neat thing here is that as people improve the distribution you get cumulative savings. Back in the 90's you needed a skilled individual to setup a Unix/Linux system. Now, even an MBA can do it. The same could happen with infrastructure on a higher level.

" If you sum EC2 and Heroku you are going to pay like 10x what it takes to run the same machines power in a dedicated server, all this because you can't handle the operations? This is absurd IMHO."

It really depends. My current experience with heroku is that it is absurdly cheap - at least, for our use cases. We would have to do a lot more traffic for us to ever consider moving to a colo.

Not to mention I'd bet that they do have someone spending all their time dealing with Heroku ops... and my sense is it's not just because of this bug.

Inaka has a combined total of close to a billion pageviews/month across all our EC2-hosted apps for all of our clients and we have zero full time operations staff - we have 2 guys that spend (much less than) part time on it.

This seems to me like the server admin version of "build a website like Amazon."


What's more likely, that everyone else in the world is too dumb to see this opportunity, or that perhaps you have underestimated how hard it is to do?

Is Canonical's Juju along the lines of what you mean? https://juju.ubuntu.com/

I'm watching all this debate with great interest. The company I'm consulting for has ~150 dynos, a paid support contract, etc. We're on Cedar... NodeJS latest (which just got upgraded to 0.8.19 after being ignored since October), with Mongo, Rabbit and Redis hosted by third party addon companies.

After a ton of H12 errors, they helped us find some slow points and optimize some things that were relatively slow. On our own, we did a huge amount of work to make things as fast as possible. While the H12's have gotten better, nothing has gotten rid of them completely. It really points to something fundamentally wrong with the routing layer because at some level we just can't optimize our code any further. There are definitely quite a few times in the logging where we just can't explain how things are insanely slow and we certainly can't explain why we get H12 errors anymore. To the point where we just gave up on it.

The thing that bothers me the most is that we have been complaining for a month now behind the scenes through our paid support contract about the things that are now being semi admitted in public. No PaaS is perfect and certainly hard problems are being worked on by smart people... the real issue here is the way that Heroku has pointed fingers at everyone but themselves until finally someone had the time and balls to get a posting to the top of HN.

150 dynos with Node.js and you are still getting H12 errors?

The issue so far reported is that if we ever get multiple requests on a single dyno then we'll have a queuing delay because Rails is essentially single-threaded. But with Node.js I think it would be a fairly large amount of requests on a single dyno I suspect before we get any queuing delays

Yes, it is a heavy traffic app. It is a backend for a very popular iphone app. The thing is that it doesn't matter if we have it set to 200, 300 or 70. We still get H12's. Sure, less now that we spent weeks tuning the absolute fuck out of our app and moving most of the processing of data offline to backends (using Rabbit as the queue), but the H12's still happen in the web tier on a regular basis (about 1-2 a minute instead of the 100's we were getting before).

The amazing part here to me is that the guys at Heroku don't even have a clear picture themselves of why H12 errors occur. They use internal tools like Splunk to provide visibility into their app, but they don't expose those to the customers. There really should be better reporting for all this stuff.

If your requests are CPU intensive, Node.js won't help since it doesn't support preemption.

And even if you're primarily IO-limited, a single request that consumes too much CPU will cause queuing.

At this point we've done so much optimization of our app that our requests are not CPU or IO bound (those things have been offloaded to backend processes through a Rabbit message queue) and we still get H12 errors and random slowness. At the time of this writing, in the last 10 minutes, we've had 2 H12 errors. It should be zero.

I'm still confused: since you're using NodeJS, I imagine your dynos are effectively handling a large amount of concurrent requests. This should in turn negate any impact of long-running requests, since they don't cause further requests to be queued in any way. So, where are requests (or responses) being queued (or lost) in your app? In Rabbit? Are you getting errors and slowness as streaks rather than isolated random events? Could it be due to spinup time of new backend workers, or something along those lines?

Node handles one request at a time. It isn't multithreaded. It will receive a request, process that request and return a response. If another request comes in at the same time another request is in process, it is queued until the currently processing request is finished. I googled around, here is a good explanation for you. http://howtonode.org/understanding-process-next-tick

The way my application worked is that we had 'long' running things like saving data to a database happening before we return a response to a client (in this case an iphone app). Sometimes mongo, the dyno, networking, phase of the moon, talking to the facebook api, etc... we would get 'slow' processing and it would take a few seconds for a response to make it to the client. As soon as this happens, on a heavily loaded system, the heroku router would get backed up (since it only routes to 2-3 dynos at a time) and would start throwing H12 errors.

So, what we did was rewrite the entire app to do minimal data processing in the web tier, send the response back to the client as quickly as possible. At the same time, we also send a rabbit queue message out with all the instructions in it to process the data 'offline' in a worker task. There is no spinup since these workers are running all the time. We even have several groups of workers depending on the message type so that we can segregate the work across multiple groups of dyno workers. This also allows us to easily scale to more than a 100 dynos to process messages. It works great. Rabbit is a godsend.
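
The pattern described above can be sketched in a few lines. This is only an illustration: an in-process `queue.Queue` and a thread stand in for RabbitMQ and the worker dynos, and every name (`handle_request`, `worker`, `processed`) is hypothetical rather than taken from the poster's app.

```python
import queue
import threading

jobs = queue.Queue()    # stand-in for a message broker queue
processed = []          # stand-in for the database the worker writes to


def handle_request(payload):
    """Web tier: do the minimum, enqueue the rest, respond immediately."""
    jobs.put({"type": "save", "payload": payload})
    return {"status": "accepted"}


def worker():
    """Worker tier: always running, drains the queue off the request path."""
    while True:
        job = jobs.get()
        if job is None:               # sentinel: shut down
            jobs.task_done()
            break
        processed.append(job["payload"])   # the slow work would happen here
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

responses = [handle_request(n) for n in range(3)]
jobs.put(None)
jobs.join()             # wait until the worker has drained everything
print(responses[0], processed)
```

The response time of `handle_request` no longer depends on how slow the database or a third-party API is, only on how quickly the message can be enqueued.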

I say 'long' and 'slow' above because the longest amount of time we should be taking is a couple seconds at most. Unfortunately, the way that the heroku router is designed is fundamentally broken. As soon as you get a lot of 'slow' requests going to the same dyno's they start to stack up and the router just starts returning H12 errors. It doesn't matter how many dyno's you have because the router only talks to 2-3 dyno's at a time. We get H12's with 50, 100, 200, 300, etc dynos.

We also saw very strange behavior with the dyno's. We use nodetime to log how long things take and we'd see redis/mongo take only a few ms, but we'd have >15s just for the request to complete... somewhere things are slow and we can't figure out where. Until this whole mess came out, Heroku just pointed fingers at everyone else but themselves.

Oh by the way, as soon as you get around 200-300 dyno's deploys start error'ing out as well because heroku can't start up enough dyno's fast enough and that whole process times out too. You can't tell if a deploy worked or didn't. They didn't seem to care about that at all either.

Anyway, I could keep going... but once again, I'll repeat that I'm glad that the Rapgenius guys are calling Heroku out in public on this stuff. There are some big issues here that need to be addressed and the H12/router stuff is the big issue. I'm looking forward to seeing how they pull out of this one.

> Node handles one request at a time. It isn't multithreaded. It will receive a request, process that request and return a response. If another request comes in at the same time another request is in process, it is queued until the currently processing request is finished.

I'm missing something here. Node does not multithread requests, but it surely can process many requests simultaneously if these requests are waiting for async operations: database, external APIs or other types of I/O usually. That's the very core idea of evented servers.

So, my model is that i.e. if a node process receives 100 requests over the period of 1 second, and each request takes 3 seconds to process but most of that time is spent waiting for async, then the 100 responses will be sent back essentially 3 seconds after they arrived, no queuing to speak of.
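
That model is easy to try out. The sketch below uses Python's asyncio as a stand-in for Node's event loop (an assumption, since the discussion is about Node), scales the 3 s wait down to 0.3 s, and starts all 100 requests at once rather than over one second.

```python
import asyncio
import time

ASYNC_WAIT_S = 0.3   # scaled-down stand-in for the 3 s of async waiting


async def handle_request(request_id):
    # Awaiting yields control to the event loop, so other requests can proceed.
    await asyncio.sleep(ASYNC_WAIT_S)
    return request_id


async def serve(n):
    started = time.perf_counter()
    results = await asyncio.gather(*(handle_request(i) for i in range(n)))
    return len(results), time.perf_counter() - started


count, elapsed = asyncio.run(serve(100))
print(f"{count} requests finished in {elapsed:.2f}s")
```

If the requests were handled strictly one after another, 100 of them would take about 30 s here; on a single-threaded event loop they should finish in roughly the time of one wait.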

From your description, routers do not send multiple requests to the same dyno even if dynos could handle them, and only have a limited amount of dynos they talk to. So queuing is happening in the routers, while dynos idle away waiting for async.

This would be complementary to the problem described by Rapgenius, and mean that the Heroku architecture does not play well with any type of server, neither evented (Node, yours) nor sequential (Rails, as shown by Rapgenius) nor presumably multithreaded or multiprocess (which effectively behaves like evented to the outside world). A huge mess indeed!

> Node does not multithread requests, but it surely can process many requests simultaneously if these requests are waiting for async operations: database, external APIs or other types of I/O usually. That's the very core idea of evented servers.

Within a single request, Node can async its dealing with outside services (databases, api's, etc), but it is still only processing one request at a time. There is no 'synchronized' keyword in javascript. ;-)

Check this out:


There is an interesting header in there: X-Heroku-Dynos-In-Use. From what I understand, this header is the number of dynos that a router is communicating with. For us, this is always around 2-3.

I suspect that the router is just a dumb nginx process sitting in front of my app. It is setup to communicate with 2-3 of my dyno's in a round robin fashion. If any one of those dyno's doesn't process the request fast enough, then requests start to back up. Once requests start to back up past 30s worth of execution, the router starts just killing those queued requests instead of just leaving them in a queue or sending the requests to another set of dyno's. Even worse is if you have a dyno that crashes (nodejs likes to crash at the first sign of an exception). I suspect that is why we see 2 or 3 in that header.

I think that part of the problem is that the routers don't just start talking to more dyno's if you have them available. So, it doesn't matter if you have 50, 200, 500 dyno's because the router is always only talking to a small subset of them. Even if you add more dyno's in the middle of heavy requests, you are still stuck with H12 for the existing dyno's. A full system restart is necessary then.

By 'processing' I mean the Node application has received the request and has not yet sent the response, i.e. the connection is still alive. 'synchronized' has no bearing here. If the request processing is purely CPU-bound with no async operations then only one request will be processed at any time, otherwise Node will happily process up to thousands of requests simultaneously. This is the ideal use case for Node. It should be trivial to log the amount of simultaneous requests being processed.
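
Logging the number of simultaneous requests could look roughly like this. Again asyncio stands in for Node's event loop, and the handlers and counter names are hypothetical; the point is only to contrast handlers that wait on I/O with handlers that never yield.

```python
import asyncio
import time

in_flight = 0
peak_in_flight = 0


async def tracked(handler):
    """Run a handler while counting requests received but not yet answered."""
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    try:
        await handler()
    finally:
        in_flight -= 1


async def io_bound():
    await asyncio.sleep(0.05)     # yields to the event loop while waiting


async def cpu_bound():
    deadline = time.perf_counter() + 0.005
    while time.perf_counter() < deadline:
        pass                      # busy loop: never yields to the event loop


async def peak_for(handler, n=20):
    """Start n requests together and report the peak number in flight."""
    global in_flight, peak_in_flight
    in_flight = peak_in_flight = 0
    await asyncio.gather(*(tracked(handler) for _ in range(n)))
    return peak_in_flight


io_peak = asyncio.run(peak_for(io_bound))
cpu_peak = asyncio.run(peak_for(cpu_bound))
print("peak in flight, I/O-bound:", io_peak)
print("peak in flight, CPU-bound:", cpu_peak)
```

With I/O-bound handlers all 20 requests are in flight together; with the CPU-bound handler the loop never gets a chance to start a second one, so the peak stays at 1.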

According to Heroku docs, Cedar routers do not do any queuing and just serve requests immediately to any random dyno. They are pretty clear on this in multiple places, specifically talking about concurrent requests in Node. They also mention a 'routing mesh', which suggests there are many routers doing their thing. But that header you see may not be relevant to Cedar, just like the other header 'X-Heroku-Queue-Depth' should not apply to Cedar either.

> Within a single request, Node can async its dealing with outside services (databases, api's, etc), but it is still only processing one request at a time. There is no 'synchronized' keyword in javascript. ;-)

This is as wrong as it can get. Multiple requests are processed "concurrently", in the sense that a request gets served as soon as an existing request awaits on async calls. It is different from thread-based concurrency and thus there is no need for things like "synchronized" keyword.

I think we are saying the same thing. When I say 'it is still only processing one request at a time', I'm referring to my code execution, not the async stuff or anything else within Node.

I found this to be a good explanation...


Ah, the issue of CPU intensive code in a single-threaded event loop of a system designed to handle I/O bound applications... let's not go there again!

Depending on what side of Hanlon's razor you fall, the only conclusion I get from this is that they are either incompetent or dishonest. I have a very hard time believing that this issue remained unknown to them for years.

As for the post, it's pretty much just documentation. I didn't see any apology. And the only promise of a better tomorrow is a vague "Working to better support concurrent-request Rails apps on Cedar".

I also didn't see any mention of refunds for all of the extra dynos that were needed due to the degrading performance of their service - or all the extra support hours where they told everyone 'not our problem!'.

I wish I could vote this up many more times. It's exactly what I want to find out about.

What SLA did they explicitly fail to deliver on such that they should offer a rebate?

They apologized in the last post. Also, self-critical language like "fallen short of [our] promise" and "we failed to..." is a de facto apology and acceptance of responsibility even when the word 'sorry' only appeared earlier.

I can understand how this developed. Things worked well for most customers. Many of those with problems got them under control with more dynos or multi-worker setups. Heroku's Rails roots biased them towards a "keep it simple, throw hardware at it, or look for optimizations in the app/sql/db" mindset. Well, many of their Rails/Bamboo customers complaining about latency, even in the presence of this growing issue, may have also (or even primarily) had other app issues too. (When supporting developers, especially many beginning/free-plan developers, it doesn't take long for your conditional probability P((we have a real problem)|(customer thinks we have a problem)) to go very low, and P((customer app has a problem)|(customer thinks we have a problem)) to go very high.)

Even when Heroku had a unitary (and thus 'smart') router, they surely got latency complaints that were completely due to customer app issues or under-provisioning, so they stuck with the 'optimize app or throw dynos at it' recommendation for too long. And, when they habitually threw more hardware at the Bamboo routing mesh, they were unwittingly making the pile-up issues for Bamboo web dynos worse. Some key data about the uneven pre-accept queueing at dynos was missing, which combined with habits of thought that had worked so far gave them a blind spot.

Despite the growing issue, adding dynos at the margin would still always help (at least a little) — as well as adding to Heroku revenues. Even without any nefarious intent, a 'problem' that fits neatly into your self-conception ("we give people the dyno knob to handle any scaling issues and it works"), and is also correlated with rising business, may not be recognized promptly. That's just a natural human biased-perception issue, not incompetence or dishonesty.

In short, Heroku needs to hire someone with some operations research experience. This is a mathematical modelling problem, not really a code problem.

Break out Mathematica, Matlab or R and model the damn problem. Then go research the solutions already available (Hint: look at many grocery stores, queuing problems).

I think apologies are over-demanded by our somewhat hysterical media that likes nothing better than to enhumble/humiliate a public figure (because it sells papers); and this flows through into expectations of private and corporate behaviour. But I've never had much use for apologies from other people. Years of abuse make "sorry" an entirely debased term in my lexicon. I've seen statements of regret that omit the word and are all the more sincere for it.

Much more useful than an apology is an acceptance of fault (which is not the same thing); an expression of desire to improve, and a sincere and demonstrable commitment to doing so.

[NB: don't mean to imply that Heroku have necessarily achieved all of that here]

> I didn't see any apology. And the only promise of a better tomorrow is a vague "Working to better support concurrent-request Rails apps on Cedar".


They apologized in another earlier post. https://blog.heroku.com/archives/2013/2/15/bamboo_routing_pe...

"We failed to explain how our product works. We failed to help our customers scale. We failed our community at large. I want to personally apologize, and commit to resolving this issue"

> ... they are either incompetent or dishonest ...

Exactly. If they didn't know, they should have. If they did know, well, ...

I fully agree that they are either incompetent or dishonest. I hope this response gets more press because Heroku better be beyond perfect from this point on. There is no excuse for this.

Your razor is a false dilemma. They may be very competent, but have no intention of caring for non-concurrent applications, either because they did not think about the scenario or because the way RoR operates is silly.

Not to be too harsh, but I'm not sure whether "we had no idea it was so bad" is better or worse than "we knew it was bad, but didn't tell anybody" for a platform company. The tone of the post is appropriately apologetic, but this does make you wonder what other problems they're missing.

As a long time NetAdmin for a small WISP I understand where they are coming from. Even with monitoring and health checks, there are a lot of unknowns on the network. Sometimes I don't know there's a problem until a customer calls to complain. And even when they do complain, it can be tough to identify a root cause even with all the data we collect.

We try to do our best to be proactive, but sometimes you don't even know to look for something until it's a problem. Heroku has quite a few more moving parts than our modest network with a large percentage of that in the form of code written by thousands of other people. As long as Heroku is learning lessons from their failures, which I believe is the case, they are doing a good job.

In this whole sad, sorry tale, the problem is that the Heroku support engineers clearly knew, and communicated to the customer, that the unexpected/unreported lag was caused by the random queuing. They knew what the problem was for three years; it was only when a big customer went very public that they adopted the right tone and action plan.

I imagine there's great discomfort inside Heroku these days because somewhere between the line engineers who knew about this issue, and the CEO, the fact that customers were complaining was swallowed up.

That's all very understandable up to a point. But what I find more than a little disconcerting is that they had paying customers complaining about this issue for years but were "unable" to find the root cause or were unwilling to explain it. After one customer's complaint was picked up by a wider public it took them barely a day or two to find out and explain what was going on.

In other words, as long as it suited them financially, they couldn't be bothered to analyze, document and fix the issue properly. Now they come out apologizing because the reputational damage is greater than the money they make from letting customers use their stupid architecture. This is a classic really.

Time for a refund.

Actually, didn't the customer find out the problem? They identified random routing as the culprit. They also identified that New Relic does not report these metrics.

Also, Heroku told them that "requests are queued up at the dyno level". So how can they now claim that they didn't know?

It seems to me that Heroku has chosen to be dishonest:

Heroku's blog response: "but until this week, we failed to see a common thread among these reports."


Adam's response to Tim Watson, a year ago:

"You're correct, the routing mesh does not behave in quite the way described by the docs. We're working on evolving away from the global backlog concept in order to provide better support for different concurrency models, and the docs are no longer accurate. The current behavior is not ideal, but we're on our way to a new model which we'll document fully once it's done."


It is quite possible to know that the mesh is not exactly as documented without having realized that the difference has a severe performance impact.

I think they also addressed the performance issue in addition to saying it is not documented: "...evolving away from the global backlog concept in order to provide better support for different concurrency models"

The fact that a year later, performance for their starting use case had not been addressed says that they failed to address the performance issue.

It is still not clear to me that they actually really understood the performance issue, or how big it was going to be.

Hmm, no tangible solutions yet, but I expect that will be next.

From the discussion I've seen they have roughly two minimal options:

(1) Shard/tier the Bamboo routing nodes, so that a single router tends to handle any particular app, and thus the original behavior is restored. Consistent hashing on the app name could do the trick, or DNS tricks on the app names mapping to different routing subshards.

(2) Enable dynos to refuse requests, perhaps by refusing a connect or returning an error or redirect that tells a router to try the next dyno. (There are some indications a 'try another' logic already exists in their routers, so it might even be possible for customers to do this without Heroku's help. I have a question in with Heroku support about which request-shedding techniques might work without generating end-user visible errors.)

Both could potentially benefit from some new per-dyno load-monitoring features... which would also allow other more-sophisticated (but more costly and fragile at scale) balancing or load-shedding strategies.
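For what it's worth, option (1) is easy to sketch. The following is a minimal consistent-hash ring that maps an app name to one routing node, so that all requests for an app tend to land on the same router. The class name, the router names and the replica count are all hypothetical; nothing here reflects how Heroku's routers are actually built.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class RouterRing:
    """Hypothetical consistent-hash ring of routing nodes."""

    def __init__(self, routers, replicas=100):
        # Each router gets several points on the ring to even out its share.
        ring = sorted(
            (_hash(f"{router}#{i}"), router)
            for router in routers
            for i in range(replicas)
        )
        self._points = [point for point, _ in ring]
        self._owners = [router for _, router in ring]

    def router_for(self, app_name):
        # The first ring point clockwise from the app's hash owns the app.
        idx = bisect.bisect(self._points, _hash(app_name)) % len(self._points)
        return self._owners[idx]


if __name__ == "__main__":
    ring = RouterRing(["router-1", "router-2", "router-3"])
    print(ring.router_for("rapgenius"))
```

The useful property is that adding a routing node only moves the apps that the new node takes over; every other app stays on the router it already had, so most per-app queue state survives a resize.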

I can see the commentariat lynch mob is out, but definitive recommendations and fixes take time. As they've admitted and apologized for the problem, I'd guess they'll have a more comprehensive response before their end-of-the-month user conference.

I would suggest a slight variation on the first. Have initial routers pick a random appropriate dyno then pick the router for that dyno. The second router picks a random appropriate dyno from the list it is in charge of.

If you make sure that the dynos for a given app are clumped behind few routers in the second layer, then you effectively get the old behavior. But you get it in a much more scalable way. The cost is, of course, that you add an extra router to everything.

(I emailed this suggestion to them. I have no idea whether they will listen.)

> (1) Shard/tier the Bamboo routing nodes...

This would work, but would require another layer before the routers to do the sharding. I'm not sure whether Heroku currently has such a layer, or whether it is beyond their control (provided by AWS). But this is definitely an option.

> (2) Enable dynos to refuse requests...

The way I understand the architecture, the dyno code is fully controlled by the client. They would need to introduce another component that would be responsible for rejecting requests. With such an additional component they could also try to reverse the flow: each dyno is assigned to a single router and asks this router for the next request when it has finished processing the previous one. Routers queue all requests and handle them only when dynos ask them to.

re (2): The router could just interpret a specific response code, or an early-close of the socket, as a directive to try another dyno. (I think the router already treats a failure-to-connect-to-listening-socket this way.) So, it would be up to the client code to optionally send such a refusal, when appropriate. It doesn't require any new Heroku component to decide when to send rejections... only router support for respecting them when received.
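A toy model of that refusal scheme, assuming a dyno can answer "busy" and the router then tries another randomly chosen dyno, up to some attempt limit. The Dyno class, the offer method and max_attempts are all invented for the sketch; in a real setup the refusal would be a response code or an early socket close as described above.

```python
import random


class Dyno:
    """Hypothetical single-request dyno that can refuse work while busy."""

    def __init__(self, name, busy=False):
        self.name = name
        self.busy = busy

    def offer(self, request):
        # Returning False is the "try another dyno" signal.
        if self.busy:
            return False
        self.busy = True
        return True


def route(request, dynos, rng, max_attempts=3):
    """Try up to max_attempts distinct random dynos; None if all refuse."""
    candidates = rng.sample(dynos, min(max_attempts, len(dynos)))
    for dyno in candidates:
        if dyno.offer(request):
            return dyno.name
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    dynos = [Dyno("web.1", busy=True), Dyno("web.2", busy=True), Dyno("web.3")]
    print(route("GET /", dynos, rng))  # web.3 is the only dyno that accepts
    print(route("GET /", dynos, rng))  # every dyno is busy now, so None
```

The router still picks at random, so it needs no shared state; the only new requirement is that it respects a refusal. What to do when every attempt is refused (queue at the router, or return an error) is left open here.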

A simple solution might be to just have the request routers connect in parallel to multiple dynos, and use the first one to connect successfully. You'll increase overall load (which can be mitigated to an extent by waiting a short delay before opening additional sockets), but a single slow dyno won't hold you up either.

> "... but until this week, we failed to see a common thread among these reports."

So Rap Genius, a customer, was able to figure out the issues (from the outside looking in) but Heroku, "on the inside" wasn't able to figure them out?

Or they're playing the "we didn't know, we're going to fix it right away" angle?

EDIT: also, s/failed to/did not/ makes more sense. "failed" implies they tried.

If you want a press release to be ignored, release it at EOD on a Friday.

...after a meteor explodes over Russia.

Until your customer blogs about it on Monday morning :)

Yeah, it's not like this one was gonna slip by; that somehow, over the weekend, everyone would forget about it.

Although, if I were Heroku, that's certainly what I would have been praying for. =)


To me it read as "we didn't know, we're not going to fix it right away (if ever), we'll just document it better"

It sounds like they have one routing cluster for all of Heroku. If this is the case, and large routing clusters are the problem's root cause, they should just shard the cluster.

I.e., give their bigger customers like RapGenius (who said they pay Heroku $20k/month and whose HN post spurred this debate) their own dedicated routing cluster with 2-5 nodes. Once a single customer exceeds that size, they're probably paying Heroku enough that Heroku can afford to devote an engineer to working specifically with them to implement large-scale architecture choices like running the app's DB on a cluster, etc.

Pile several smaller customers on a shared routing cluster to cut costs and keep the cluster utilization high, but once the cluster gets to be a certain size (reaches a fixed number of backend dynos or metrics get bad enough), start putting customers on a new cluster.

It should be fairly trivial to use DNS or router rules to dynamically move existing customers from one routing cluster to another.

The problem, as documented by the customer who went public with this issue, is that their request distribution scheme went from intelligent (i.e., load-based) to random, and a random distribution of requests is almost guaranteed to cause significant queuing for some non-trivial number of requests unless one has an absurd amount of extra capacity in place already, with ruinous financial aspects.
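A rough way to see the effect is a small simulation. This is only a sketch under made-up assumptions: 20 single-request dynos, Poisson arrivals at 50 requests per second, and a workload where 2% of requests take 3 s and the rest take 0.1 s. It compares sending each request to a random dyno against sending it to the dyno that will be free soonest.

```python
import random


def make_workload(n, rate_per_s, seed=1):
    """Generate (arrival_time, service_time) pairs; the numbers are assumptions."""
    rng = random.Random(seed)
    t = 0.0
    workload = []
    for _ in range(n):
        t += rng.expovariate(rate_per_s)                   # Poisson arrivals
        service = 3.0 if rng.random() < 0.02 else 0.1      # occasional slow request
        workload.append((t, service))
    return workload


def simulate(policy, workload, num_dynos, seed=0):
    """Return the mean time a request waits before a dyno starts on it."""
    rng = random.Random(seed)
    free_at = [0.0] * num_dynos   # when each dyno finishes its current backlog
    total_wait = 0.0
    for arrival, service in workload:
        if policy == "random":
            d = rng.randrange(num_dynos)
        else:  # "least_busy": the dyno that becomes free first
            d = min(range(num_dynos), key=free_at.__getitem__)
        start = max(arrival, free_at[d])
        total_wait += start - arrival
        free_at[d] = start + service
    return total_wait / len(workload)


if __name__ == "__main__":
    workload = make_workload(20000, rate_per_s=50.0)
    for policy in ("random", "least_busy"):
        print(policy, round(simulate(policy, workload, num_dynos=20), 4))
```

With these inputs the dynos are well under half utilized on average, yet under random routing a fast request that lands behind a 3 s request has to sit out the whole backlog. The least-busy policy sees the same traffic and almost never queues.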

The problem, as documented by the article, is that requests have always been served randomly, but the maximum number of requests that can be queued for any single backend node is equal to the number of frontend nodes N in the routing cluster.

When N is equal to one, it's exactly what the previous discussion has labeled "intelligent routing;" when N is small, it's similar enough to intelligent routing that few will notice the difference.

As N becomes large, you need a proportional (hence also large) number of requests to trigger the load balancing feature. At some point, the load balancing no longer kicks in, because for any real-world application the total workload is finite. But the performance complaints start much earlier; the "load balancing" on e.g. a 50-node balancing cluster might still kick in and stop routing requests to a hung server, but the 49 users waiting behind the hung request still suffer the latency and complain.

I've adopted this picture of the situation because it agrees with both the reported behavior discussed by RapGenius and others in the previous thread, and this article's discussion of Heroku's architecture.

Indeed, and this is another point where RapGenius's statistical model is wrong compared to the real-world behavior of the system.

As the article says, it was the gradual scaling of the router cluster that eventually led to the request routing being effectively random on the Bamboo stack.

Well, this certainly calls into question their competence. They're a PaaS company that doesn't understand or measure their load balancing performance.

If you are a PaaS company, and you only have 5 metrics you can record, then 99th percentile latency across all apps should be one of them.
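A quick sketch of why the tail matters more than the average here. The latency samples are invented (98 requests at 100 ms and 2 that hit a 30 s timeout), and the percentile uses the simple nearest-rank definition; a real metrics pipeline would more likely use histograms or a streaming estimator.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical data: most requests are fast, two hit the 30 s timeout.
    latencies_ms = [100] * 98 + [30000] * 2
    print("mean:", sum(latencies_ms) / len(latencies_ms))   # 698.0
    print("p50: ", percentile(latencies_ms, 50))            # 100
    print("p99: ", percentile(latencies_ms, 99))            # 30000
```

The median says everything is fine and the mean looks merely sluggish, while the 99th percentile shows that some users are waiting for the full timeout.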

On another note: why is Rails single-threaded??? That seems unbelievable. So if you have a 2 second database query, your Rails process does nothing else for that 2 seconds? I mean people complain about the GIL in Python, which actually has reasons behind it, but this is just crazy.

> On another note: why is Rails single-threaded??? That seems unbelievable. So if you have a 2 second database query, your Rails process does nothing else for that 2 seconds?


That's not what happens. Rails does not block on IO; it will switch to another thread and process another request while it waits for IO to complete in the first request.


I think you only read the beginning of that article, not the end. What the article seems to imply is that you have to use a web server that spins off Rails in multiple processes. Ruby does concurrent I/O (obviously) but Rails itself does not appear to do it by default, causing the problems discussed here.

"Rails, in fact, does not yet reliably support concurrent request handling. This leaves Rails developers unable to leverage the additional concurrency capabilities offered by the Cedar stack, unless they move to a concurrent web server like Puma or Unicorn."

"Most obviously, early versions of Rails were not threadsafe. As a result, all Rails users were operating with a mutex around the entire request, forcing Rails to behave like the first “Imagined” diagram above. Annoyingly, Mongrel, the most common Ruby web server for a few years, hardcoded this mutex into its Rails handler. As a result, if you spun up Rails in “threadsafe” mode a year ago using Mongrel, you would have gotten exactly zero concurrency."

Rails 3 is single threaded by default, but it can be made multi-threaded with config.threadsafe!. Rails 4 is multi-threaded by default.

Ah, no. Being single-threaded isn't the same as blocking on IO wait, because threads can switch tasks. This is the essence of the classic Unix select(2) loop and most of the higher-performing variants; poll(2), kqueues etc.

Welp, I was waiting for their official response to decide if I should deploy my app with Heroku or roll up my sleeves and rig up AWS servers (which I've done before but was looking forward to not having to deal with it.) Based upon this post, it sounds like there are really no concrete steps that they have planned to fix the underlying issue. So, AWS it is.

I am still considering having Heroku manage my PostgreSQL instance. This would be a large burden lifted leaving me to just manage the app servers, etc. Is there any reason to be concerned about their PostgreSQL hosting? Any horror stories?

If you're considering Heroku, don't automatically dismiss it because of all this. For one, it's unlikely your site/app will ever be as big as RapGenius. That's not a shot, just reality. They ran into problems as an edge case. Is Heroku and their architecture at fault? Hell yeah it is, but I have faith that they will fix it.


Because I really think that when push comes to shove, Heroku was actually trying to do the right thing with the changes they made and perhaps didn't consider or understand some of the ramifications of the changes they made to the Rails community. They may have fallen in love a little too much with the new node.js hotness and the like. Their CORE audience is startups/new business where Rails is very popular and they understand what this has done to their reputation. If they don't address this in a serious way it will damage their business severely.

I don't personally use Heroku but I have used it in the past and would not hesitate to use it on an appropriate project.

As for me, this story excludes Heroku from any future consideration. Why?

There are two reasons for paying a significant premium to a platform provider: a) you don't want to do this job yourself, b) someone knows better how to do it.

This story showed me that (b) isn't true. Now, I'm all for learning and growing and improving, but not on my business and not on my applications, and especially not while I'm paying a premium for it, being promised "scalability".

"it's unlikely your site/app will ever be as big as RapGenius"

Heroku's entire promise is scalability. And this isn't an edge case, it will bite you if you need anything over 1 dyno.

Yeah but for how long, a month? Do you honestly think this won't be fixed?

Well so far it's been unfixed for 3 years.

Fwiw our site isn't as big as rap genius and we see highly variable performance and timeouts even on requests that on average only take 100ms. I'm not ready to blame heroku entirely since there's a lot of moving parts that still could be at fault, but after optimizing as much as we could we still see the behavior. It feels like the routing layer could be the most likely problem. Summary: you might see highly variable request time regardless of traffic levels.

Or, use their Cedar stack and multi-worker dynos, where the problem is much less acute (and is only going to affect you once you need many dynos). Figure that in a month or two they'll have learned and deployed more than you would on your own.

Good points, but I think the other (maybe bigger) issue here is one of perception. Yes, they may nail this particular problem, but will customers trust their handling of the next issue?

Heroku's offering is based on instilling confidence in customers that it will just plain do what is advertised without the pain. The story here is that customers experienced additional pain not only because they used Heroku's service, but also because of Heroku's failure to address real customer concerns for so long.

If your app is Railsian and will have requests that take a few seconds to be served, I'd suggest AWS. If it's ultimately a simple CRUD app then Heroku should be fine until you're at significant scale - and this issue will never be as severe.

Bit I wrote earlier about time-consuming requests - http://news.ycombinator.com/item?id=5216593

What a scumbag move from Heroku.

1) Releasing a press release at 7 AM in the morning on a Saturday (CET)

2) The release looks mostly like the stuff a politician's spin doctor would ask the politician to say. Don't promise/admit too much.

3) They clearly state that they want to continue with this extremely inefficient way of routing. The right thing to do would be to make smaller clusters of Load Balancers who could then do proper routing, e.g. measuring the number of requests per dyno, last processing time, etc.

I'm currently working on a large project on Heroku and I'm very disappointed about this. We chose Heroku because we believed we could just `heroku scale web=X` when needed. Instead, now we know that it will be of very little use.

In the next week, I will be looking into a solution where I can utilize Heroku's add-on system without running my apps in Heroku Dynos. Creating a small system to host LXCs on AWS EC2 seems within my capabilities (or I could use Cloud Foundry's application server component) - and I believe I can configure a load balancer better than Heroku.

Let me know if anyone else is interested - we could make an open source project for this :-)

> 1) Releasing a press release at 7 AM in the morning on a Saturday (CET)

I think they aimed to put out a response ASAP.

I agree, the other way to look at this is that they're working through the weekend to figure it out.

That would be great - more specifically if they would release a new one on Monday stating that they found a solution. I still don't like the concept of using a PaaS and having to discover these kinds of things yourself.


Just put it up on Rap Genius – create an account to help explain this post to the Heroku users it affects!

This is still not ideal. Even if you're running unicorn, you're still susceptible to queueing spikes due to random load balancing. The concurrency just gives you a small buffer and/or some smoothing on 95th percentile responses. Right?

At least there's a commitment to update the reporting tools... getting bad data in New Relic was (IMHO) the worst -- even worse than out-of-date docs.

Actually, no. Using unicorn with only 2 workers makes a tremendous difference, not just incremental. RapGenius' own statistical model demonstrates this.

Picture each individual dyno in that case as its own "intelligent router". Since it's not distributed and this requires no network coordination, the job of knowing which workers are available becomes trivial.

If you're inclined to read up on queuing theory, you'll see that having at least 2 processes per worker makes the problem much simpler.
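Here is a small simulation sketch of that claim, under assumptions of my own choosing: the same total number of worker processes (20) arranged either as 20 dynos with 1 worker each or as 10 dynos with 2 workers each, random routing to dynos in both cases, Poisson arrivals at 50 requests per second, and 2% of requests taking 3 s while the rest take 0.1 s. None of these numbers come from RapGenius or Heroku.

```python
import random


def make_workload(n, rate_per_s, seed=1):
    """Generate (arrival_time, service_time) pairs; the numbers are assumptions."""
    rng = random.Random(seed)
    t = 0.0
    workload = []
    for _ in range(n):
        t += rng.expovariate(rate_per_s)
        service = 3.0 if rng.random() < 0.02 else 0.1
        workload.append((t, service))
    return workload


def mean_wait(num_dynos, workers_per_dyno, workload, seed=0):
    """Random routing to a dyno; inside the dyno, the first free worker takes the request."""
    rng = random.Random(seed)
    # free_at[d][w] is when worker w on dyno d finishes its current backlog.
    free_at = [[0.0] * workers_per_dyno for _ in range(num_dynos)]
    total_wait = 0.0
    for arrival, service in workload:
        workers = free_at[rng.randrange(num_dynos)]                  # router: random dyno
        w = min(range(workers_per_dyno), key=workers.__getitem__)    # dyno: local choice
        start = max(arrival, workers[w])
        total_wait += start - arrival
        workers[w] = start + service
    return total_wait / len(workload)


if __name__ == "__main__":
    workload = make_workload(20000, rate_per_s=50.0)
    print("20 dynos x 1 worker :", round(mean_wait(20, 1, workload), 4))
    print("10 dynos x 2 workers:", round(mean_wait(10, 2, workload), 4))
```

Total capacity is identical in both runs; the only difference is that a slow request now blocks half a dyno instead of a whole one, so a fast request routed to the same dyno can usually slip through on the other worker.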

Fantastic, that's good to hear. I'm able to run ~5 workers on Heroku with gunicorn (Django), so I imagine that means I'm outta the woods for a while at least.

I would love to read up a bit on queuing theory. Any good pointers?

There's a basic interactive + theory overview at http://homepages.inf.ed.ac.uk/jeh/Simjava/queueing/mm1_q/mm1... (NB: via in-browser java applets, which most things have disabled by default just now)

Rap Genius cofounder here. Below is the full unedited text of https://help.heroku.com/tickets/37665, a Heroku support ticket I logged about 1 year ago. Sorry it is so long, but I think you'll find it interesting:

Tom@Rapgenius| about 1 year ago I know this is a bit of a vague problem, but I've been getting a bunch of Error H12 (Request Timeout)s recently, and I'm not sure what to do about it. It's not like I have some particularly slow actions; I'm getting this error for actions that under most circumstances work totally fine (i.e., return in less than 300ms). Also I don't have a deep request queue (I'm running 40 dynos which is more than enough). Maybe I'm doing some slow queries? Should I upgrade my DB? Also, I do notice that most of my app's time (according to New Relic) is being spent in Ruby (http://cl.ly/29132F272W2D0K1l2I3P). Would upgrading Ruby to 1.9 noticeably help this performance? (I'm a bit nervous it'll create a ton of problems).

Phil@Heroku Hello - I can look into this, but I'll need access to your New Relic account. Will you make sure 'phil@heroku.com' has access? Also, from your screenshot I notice your DB times are ~ 100 ms. We recommend keeping those times closer to 50 ms. You might be able to speed things up with a database upgrade. I'll look into New Relic once I have access and let you know what I find.

Tom@Rapgenius Thanks, Phil! How do I give you access to my New Relic account? I tried clicking "account settings" and got this: http://cl.ly/0V2J3i0826400I2s3b2c

Phil@Heroku Tom, I have access now. I'm not sure what was blocking me earlier. After looking at New Relic and the database server, I think a larger database will help. At the very least, it will be helpful to try the next level for a week and compare performance statistics in New Relic with the prior week. Your app is using an Ika right now, and the next step up is the Zilla database. We've made the upgrade process very simple, and it's outlined here - http://devcenter.heroku.com/articles/fast-database-changeove... Your database is ~ 5.4 GB in size (via the 'heroku pg:info' command) so an upgrade shouldn't take too long. You will be able to test the process by adding a Follower and timing it via the 'heroku pg:wait' command. This should give you a good idea of how long it will take to spin up the new database. Also, should the Zilla not help much, the downgrade process to an Ika will be the same. You only pay for the resources used.

The current database server appears to be a bit under-powered when it comes to Compute Units. The Zilla has more power and should provide some room to grow. As for an upgrade to Ruby 1.9.2, I'm not sure how much that would help. It would be an involved upgrade that would take time to plan and deploy. The database upgrade should be a quicker solution. Long-term you may want to consider moving to the Cedar stack and Ruby 1.9.2.

Tom@Rapgenius Thanks! I'm upgrading now

Tom@Rapgenius I'm still getting a ton of "Request Timeout" errors. E.g.: 2011-12-08 14:46:53.222 219 1 2011-12-08T14:46:53+00:00 d. heroku router - - Error H12 (Request timeout) -> GET rapgenius.com/Wale-ambition-lyrics dyno=web.17 queue= wait= service=30000ms status=503 bytes=0 one weird thing: there aren't any values listed for the "queue" and "wait" parameters. Could that indicate a problem? Could an exception have been thrown earlier in the request before the timeout? Or does the timeout error just indicate that the request took too long? If it's the latter I'm not sure how to troubleshoot all these errors since the associated actions are fast the vast majority of the time

Tom@Rapgenius Here's another interesting example:

2011-12-08 15:59:32.293 222 1 2011-12-08T15:59:32+00:00 d. heroku router - - Error H12 (Request timeout) -> GET rapgenius.com/static/templates_for_js dyno=web.17 queue= wait= service=30000ms status=503 bytes=0 This action is extremely simple – it doesn't access the DB or any external services. Here's the template: Ballin! <% unless current_user %> <% form_for User.new, :html => { :id => '' } do |f| %> Tired of entering your email address? Create a Rap Genius account and you'll never have to worry about it (or anything else) ever again: <%= render :partial => "/users/form", :object => f %> <%= f.submit "Create Account" %> <small>(Already have an account? <%= link_to 'Sign in', login_path, :class => :facebox %>)</small> <% end %> <% end %>

Besides a big request queue (which there isn't), how could this action possibly time out?

Phil@Heroku Tom - sorry for not getting back to you sooner.

It's possible for H12s to occur even for simple actions if there is already queueing for the app. With a busy site like your's, even a few H12s can cause a cascade of H12s for successive requests.

It looks like New Relic has not reported any downtime over the past 24 hours. Can we let the site run through the weekend and see how things look Monday after 3 days of New Relic data with the new Zilla?

Tom@Rapgenius > It's possible for H12s to occur even for simple actions if there is already queueing for the app.

I feel what you're saying, but I don't think my app's queuing. For one thing, New Relic shows 0 time spent in the queue during the period in which I'm getting all these timeouts. For another, I'm running 40 dynos and my average request time is <400ms. So:

400 ms * 3000 requests / minute * 1 min / 60000 ms = 20 simultaneous requests (i.e., 20 dynos) so 40 dynos should definitely be more than enough.

Also, shouldn't Heroku be showing me the queue / wait stats at the time of the timeout? That would help prove whether my app was queuing at the time in question

> It looks like New Relic has not reported any downtime over the past 24 hours.

New Relic isn't great at catching intermittent problems like this; you really feel it when you're using the site continuously for an hour or whatever. Also, users make many more HTTP requests than New Relic (since every page load kicks off several AJAX requests).

That said, there has been downtime in the past 24 hours (though less than in the previous 24): http://cl.ly/3D2b1Z170B0w1f1m113m

Tom@Rapgenius Here's some additional data: At 5am this morning (EST), Rap Genius went down. I woke up at 11am (it's a Saturday!), did a logs --tail and observed that basically every request was timing out. I did heroku restart, and now every request started returning a backlog too deep error

Finally, I added another 10 dynos (bumping the total to 50, which is a log of dynos!), and this seems to have fixed the problem – perhaps because my app needs the additional capacity, or perhaps because merely changing the number of dynos reset something else. Either way, I'm sticking with 50 dynos for now out of fear even though I doubt my app needs that many (right?)

Either way, the 5 hours of unexplained downtime (there weren't any application-level exceptions or anything) that was fixable by tweaking my dyno count further supports my theory that something's going on with my app on Heroku's end.

Phil@Heroku Tom - I've been looking over your New Relic stats.

First - the good news - the upgrade to a Zilla seems to have helped. Database times are down a bit, which can only help. I checked the actual database server and it's not showing signs of over-work like the previous Ika was. Second, I notice that downtimes reported by New Relic over the past two weeks are in the early morning hours - 3 to 6 AM PST. Do you have any scheduled tasks that run during these times?

Also, request queueing is nearly zero, so 50 dynos does seem like a lot. What are your usage patterns like? The RPM graph in New Relic indicates the normal cyclical usage pattern, lower during the night, but what does Google Analytics tell you?

Finally, the Heroku platform has been having issues over the past week, but none of them correspond to the downtime you had Saturday morning.

That all sounds so familiar. Just like what Phil was telling us too. Get a New Relic account. "Why do we need a $7k/mo NR account?" Oh, because you have some slow requests...

We go fix all of our slow requests, but we still have H12 errors.

Can you tell us how many dynos we actually need to serve our requests and not get H12 errors? No.

Hey Rap Genius... thanks for having the balls to call Heroku out in public on this stuff. We are in the same boat.

Have you guys considered suing Heroku to get some of your money back? Given the nature of Heroku's deception and the resulting ill-gotten gains across its entire customer base, it would seem like you could work with an enterprising attorney to form a class-action suit against the company and get money back not just for yourselves but for the entire affected customer base.

Just a thought. ;)

A Y-Combinator company suing another Y-Combinator company. Now that would be interesting.

Heroku was acquired by Salesforce, so I guess it's not really a YC company anymore.

Ah that explains why they have turned evil. Usually once a company has been acquired they are no longer worth using.

Normally I'd say yes, but one of the core Cedar stack engineers is a close friend of mine and I've been over at Heroku a few times to nerd out over beers. Given my conversations with him and other engineers about the designs I can guarantee you that (1) Heroku is extremely autonomous relative to Salesforce (I bet you most of SF's influence is on the BD side of things and not on the engineering side. In fact I doubt they have much influence on engineering) and (2) there was no malice involved. Like any engineering-first organization, they probably spotted an opportunity to innovate on efficiency and optimize a key process. Problem is that in this case v1.0 of their solution turned out to be a dud and they probably didn't have the instrumentation needed for this new approach in place to detect and prevent these issues. They may not have been measuring all the right things, so a modification detrimental to performance slipped through the cracks.

I think all this drama is a great opportunity for Heroku to bring out a new stack for non-concurrent apps with an intelligent routing mesh (v2.0a, perhaps). If I were you I'd go for another beer with the Heroku engineers and design/prototype it.

I upvoted this comment, it has been my experience this is the case.

It'd be like suing an airline for a delayed flight.

In a way, yes––if the airline's product for sale wasn't a particular "flight," but instead a speedy "travel system" that guaranteed customers to fly on the first available (ergo fastest) flight.

To further extend the airline analogy, it's as if you had bought one of these "fast travel" products and were told to buy more plane tickets to avoid delays!

The tort (aka crime) committed is the deceitful way in which the nature of the product was marketed to the clear financial benefit of Heroku and to the clear financial detriment of the customer. The fix offered to customers is always an up-sell effort to "buy more Heroku products."

The thing about Heroku's intelligent queuing system is that it lowers customers' costs and thus Heroku's profits. Switching away from intelligent queuing to random queuing was a business decision with clear moneymaking advantages for Heroku.

In one fell swoop, the move made the value of a customer to Heroku skyrocket, potentially increasing its valuation upon acquisition and/or helping the founders more easily hit any earn-out revenue targets required by Salesforce as part of their acquisition.

They don't really claim to be the most reliable or performant, just the easiest to deploy and maintain (arguably scale). Keep in mind they don't even have any meaningful SLA.

Yes, but false advertising may still be a claim, even if there is no concrete agreement. And the legal standard there doesn't even require overt lies. That is, even statements that are technically true but misleading such that a reasonable person would conclude something untrue can meet the test.

And, in this case, documentation regarding the way their system works, as well as some of their other materials, was overtly untrue. Beyond that, customers would merely need to show damages and also that Heroku profited from its untrue statements.

Whether it intended to mislead or not, Heroku could have a real problem on its hands.

Yes, perfectly possible and regulated in the EU.

I had a string of similar requests with Heroku between Feb 2011 and June 2012, before we migrated off their platform.

I would complain about H12 errors, and they would tell me to upgrade my resources and/or that it was my problem and there was nothing they could do. We ended up with a solution that was easily 10x as expensive (over-powered DB, too many dynos) as our initial configuration, and it still didn't fix the issue.

I'm happy to provide the full text support requests, but they don't tend to be quite as juicy as the one you posted.

What did you migrate to?

We're with bluebox.net and have been very happy.

I love how their first suggestion is upgrade your db $$$ .

Whatever you do, get your database off Heroku. Heroku's postgres offering has sweet options, but performance is fucking awful. We went from a 20k+ month heroku bill to a fraction of that price on EC2 so we could get a slave that could keep up with the writes to our master.

Performance being shit would be fine if it wasn't a miserable experience trying to migrate off. Not being able to setup replication to slaves off of Heroku means you have to deal with things like Bucardo and all the problems that brings along.

Heroku is really awesome for little web apps that performance doesn't matter on. If you think you're building something that people will want to use, you don't want to be on Heroku.

That's quite interesting; I don't see how one can really so soundly beat the disk configuration, but it certainly can be matched by something similar.

It was posted about quite some time ago: http://orion.herokuapp.com/past/2009/7/29/io_performance_on_.... If this has become bad advice, inquiring minds wish to know. EBS Priops seem promising and different.

Replication out of Heroku would be neat, but the WAL as-is is architecture and operating system dependent. Luck and tenaciousness willing, logical replication in Postgres may yet serve that use case, some day -- we'll have to see as to the details. A lot of effort has been expended attempting to get it in some form into 9.3, but it's far from a sure thing.

Also, don't forget to set up archiving, if you continue to retain fast interconnect with S3: https://github.com/wal-e/wal-e, which I principally wrote on Heroku's behalf, but now is a project of multiple users and contributors. It has a focus on ease of use and reliability, although it could suffer improvement.

Yeah, we're using wal-e in conjunction with streaming replication.

Either of those options would have been awesome (anything but Bucardo, really. Bucardo's triggers overwhelmed an already overwhelmed database)

"EBS Priops seem promising and different."

We run many busy PG databases on PIOPS. The difference from normal EBS is huge. Very happy with PIOPS.

Interesting -- can you go into more detail or provide references that you found helpful when trying to run Postgres on ec2?

Yeah, if you run Ubuntu, don't try running any versions other than 11.04 if you're going to be under heavy i/o. We had all kinds of CPU soft lock issues with software raid on other versions. Otherwise? 2 m2.4xlarge running mostly standard postgresql installs absolutely dominate Mecha instances.

Obviously, you need to deal with backups and that sort of thing, but, you can hire a full time person for the price difference between Heroku and EC2 as soon as you get to the top end of Heroku pricing.

Definitely interesting, but from this 'full unedited text' it looks like both you and Heroku had better things to do at the time than investigate more deeply. (At the time you seemed satisfied but suspicious that 50 dynos fixed things; Phil@Heroku shares your suspicions but his followup questions get no response. Case closed, everyone moves on to other things until another complaint or fresh info comes in.)

Tom, like many others here, I'd like to personally thank you for publicly surfacing this issue. Two months ago we ran into the same exact problem while doing some performance stress testing. After going back and forth with five different Heroku support staff members for a week, we ended up nowhere. Their response was simply to increase the # of dynos, but seeing as our average response time was 80ms with 0 request queueing that didn't make any sense. In the end we dropped it, since we were just doing some stress tests, but I'm glad they are finally "doing" something about it.

Our documentation recommends the use of Thin, which is a single-threaded, evented web server. In theory, an evented server like Thin can process multiple concurrent requests, but doing this successfully depends on the code you write and the libraries you use. Rails, in fact, does not yet reliably support concurrent request handling. This leaves Rails developers unable to leverage the additional concurrency capabilities offered by the Cedar stack, unless they move to a concurrent web server like Puma or Unicorn.

So, do/will they now recommend Puma/Unicorn over Thin?

I'm not able to follow parts of the post.

> Our routing cluster remained small for most of Bamboo’s history, which masked this inefficiency.

If you went from 1 router to 2, 50% of routers can't optimally route a request. If you went from 2 to 3, you would have 66% which can't route. 3 to 4, 75%.

Once you get to, say, 10 routers, you are already at 90% sub-optimal routing. So are they saying they had only 1 or 2 routers earlier?
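To spell out the pattern behind those percentages: under the simplifying assumption (mine, not Heroku's) that each router only knows about the requests it forwarded itself and that traffic is spread evenly, a given in-flight request is invisible to (N - 1) / N of the routers. A quick sketch:

```python
def blind_fraction(routers: int) -> float:
    """Fraction of routers that did not forward a given in-flight request.

    Assumes routers share no state and traffic is spread evenly across them.
    """
    if routers < 1:
        raise ValueError("need at least one router")
    return (routers - 1) / routers


for n in (1, 2, 3, 4, 10):
    print(f"{n:>2} routers: {blind_fraction(n):.1%} of them have no record of the request")
```

The curve flattens quickly, so most of the damage is already done by the time you reach a handful of routers.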

This is exactly what I was thinking as I was reading this blog post. I did a double take when I read that because it makes absolutely no sense at all. It seems to imply that their fabled web serving stack never actually worked as advertised but no one cared or complained publicly enough to make Heroku explain the issue.

I just can't believe that Heroku had like 2-5 routers in their entire stack. I'm wondering if maybe they have some kind of simple sharding for routers based on the app name so all hash values go to a small set of routers. I can understand if that was the case how large numbers of small applications would seem to developers and heroku customer support to be using a per application queue. Of course, this would break down the instant there were very large applications served on heroku that would overwhelm the small number of routers assigned to each shard bucket.

I think that is what they're saying: for a long time Bamboo only had very few routers. One simple HTTP router can move a lot of requests!

(Maybe other topologies also hid the problem for a while, such as apps having traffic that still tended to come through one router even though coming from any was possible.)

Nothing really changed. If you cannot see how Heroku fucked you as a customer, then they actually did not fuck you; Otherwise, just do it yourself.

So to recap:

Ruby on Rails is using a default configuration where each process can serve one request at a time. There is no cooperative switch (as in Node.js) or (near) preemptive switch (as in Erlang, Haskell, Go, ...).

The routing infrastructure at Heroku is distributed. There are several routers and one router will queue at most one message per back-end dyno in the Bamboo stack and route randomly in the Cedar stack. If two front-end routers route messages to the same Dyno, then you get a queue, which happens more often on a large router mesh.

Forgetting who is right and wrong, there are a couple of points to make in my opinion.

The RoR model is very weak. You need to handle more than one connection concurrently, because under high load queueing will eventually happen. If one expensive request goes into the queue, then everyone further down the queue waits. In a more modern system like Node.js you can manually break up the expensive request and thus give service to other requests in the queue while the back-end works on the expensive req. In stronger models, Haskell, Go and Erlang, this break-up is usually automatic and preemption makes sure it is not a problem. If you have a 5000ms job A and 10 50ms jobs, then after 1ms, the A job will be preempted and then the 50ms jobs will get service. Thus an expensive job doesn't clog the queue. Random queueing in these models is often a very sensible choice.

Note that Heroku is doing distributed routing. Thus the statistical model Rapgenius has made is wrong. One, requests do not arrive as a Poisson process. Usually one page load gives rise to several other calls to the back-end, and this makes the requests dependent on each other. Two, there is not a single queue and router but several of each. This means:
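Here is a toy model of that 5000ms-plus-ten-50ms example, comparing run-to-completion (one request at a time, in order) with round-robin time slicing at a 1ms quantum. It is only a sketch: real runtimes schedule differently, and context-switch cost is assumed to be zero.

```python
from collections import deque


def run_to_completion(jobs):
    """Serve jobs one at a time in arrival order; return finish time (ms) per job."""
    clock, finish = 0, {}
    for name, cost_ms in jobs:
        clock += cost_ms
        finish[name] = clock
    return finish


def round_robin(jobs, quantum_ms=1):
    """Give each unfinished job one quantum in turn; return finish time (ms) per job."""
    clock, finish = 0, {}
    queue = deque(jobs)
    while queue:
        name, remaining = queue.popleft()
        step = min(quantum_ms, remaining)
        clock += step
        if remaining > step:
            queue.append((name, remaining - step))  # preempted, back of the line
        else:
            finish[name] = clock
    return finish


# Job A arrives first, followed by ten cheap jobs.
jobs = [("A", 5000)] + [(f"short{i}", 50) for i in range(1, 11)]

fifo = run_to_completion(jobs)
sliced = round_robin(jobs)
print("run-to-completion: short1 done at", fifo["short1"], "ms")
print("time-sliced:       short1 done at", sliced["short1"], "ms")
print("job A finishes at", fifo["A"], "vs", sliced["A"], "ms")
```

The cheap jobs finish roughly ten times sooner when sliced, and A only pays the 500ms of work the others needed anyway.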

* You need to take care of state between the queues - if they are to share information. This has overhead. Often considerable overhead.

* You need to take care of failures of queues dynamically. A singular queue is easy to handle, but it also imposes a single point of failure and is a performance bottleneck.

* You have very little knowledge of what kind of system is handling requests.

Three, nobody is discussing how to handle the overload situation. Forget about routing for a moment: what if your dynos can take 2000 req/s but the current arrival rate is 3000? How do you choose which requests to drop? Because you will have to drop some.
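To make the overload point concrete, here is a rough sketch with the 2000/3000 req/s figures. The drop policy (a bounded queue that discards whatever does not fit, i.e. tail drop) and the queue limit of 500 are assumptions picked for illustration; nothing here describes what Heroku actually does.

```python
def simulate_overload(arrival_rate, service_rate, max_queue, seconds):
    """Second-by-second model of a bounded queue under sustained overload.

    Returns (served, dropped, still_queued) request counts.
    """
    served = dropped = queued = 0
    for _ in range(seconds):
        queued += arrival_rate               # new requests join the queue
        done = min(queued, service_rate)     # dynos drain what they can
        served += done
        queued -= done
        if queued > max_queue:               # tail drop: shed the overflow
            dropped += queued - max_queue
            queued = max_queue
    return served, dropped, queued


served, dropped, queued = simulate_overload(
    arrival_rate=3000, service_rate=2000, max_queue=500, seconds=10
)
print(f"served={served} dropped={dropped} still queued={queued}")
```

Without the bound, the queue would grow by 1000 requests every second and latency would grow with it, so some drop policy is unavoidable.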

If you want to solve this going forward, you probably need Dyno queue feedback. Rapgenius uses the length of the queue in their test, but this is also wrong. They should use the sojourn time in the queue, which indicates how long you wait in the queue before being given service. According to Rapgenius, they have a distribution where requests usually take 46ms (median) but the maximum is above 2000ms. A queue of length 43 and a queue of length 1 can then have roughly the same sojourn time. Given this, you can feed back to the routers how long a request will usually stay in the queue.
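A small sketch of why queue length and sojourn time can disagree, using the 46ms and 2000ms figures. The dyno names are made up, and the per-request costs are assumed known here purely for illustration; a real dyno would have to report measured waits instead.

```python
# Service time (ms) of each request already queued on two hypothetical dynos.
dynos = {
    "dyno-a": [46] * 43,  # 43 typical (median) requests
    "dyno-b": [2000],     # a single worst-case request
}


def expected_wait_ms(queue):
    """Time a new arrival waits if the queue is served one request at a time."""
    return sum(queue)


by_length = min(dynos, key=lambda name: len(dynos[name]))
by_wait = min(dynos, key=lambda name: expected_wait_ms(dynos[name]))

for name, queue in dynos.items():
    print(f"{name}: length={len(queue)}, expected wait={expected_wait_ms(queue)} ms")
print("shortest queue:", by_length)
print("shortest wait: ", by_wait)
```

The two dynos look wildly different by length yet nearly identical by wait, and the two criteria even pick different dynos.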

But again, this is without assuming distribution of the routers. The problem is way way harder to solve in that case.

(edit for clarity in bullet list)

"Forgetting who is right and wrong"

The right vs wrong issue here is not the proper way to architect a router. The issue is that Heroku glossed over an extremely important aspect of their engineering documentation, because it painted their platform in a bad light. This is particularly damning, because as an engineer working on their platform, I could design around their shortcomings as long as they don't hide them from me.

Furthermore, I believe we could make an argument Heroku intentionally misled (both in documentation and in their support responses) clients as to how their router worked.

The reason I want to stay out of that discussion is that it often just amounts to some mud-slinging from one side upon the other. Intentionally misleading customers is quite an accusation and I don't think Heroku would indulge in it.

Personally, I think it is incredibly naive to build an application around a framework where you have no built-in concurrency. The main reason is that the queue you will build up in front of it is outside your reach so you have to sustain it.

It is also naive to think that your cooked up statistical model resembles reality in any way. Routing is a hard problem so it is entirely plausible that your model does not hold up in reality. Besides, the time it takes to construct those R models is a missed opportunity for improving the backend you have. And the R models don't say a lot, sorry. At best they just stir up the storm --- and boy did they succeed.

I agree it is unfortunate that Heroku's documentation isn't better and that New Relic doesn't provide accurate latency statistics. But to claim that this is entirely Heroku's fault is, frankly, naive as well.

It is quite an accusation, but I'm far from a bystander in this issue. I have extensive, documented communication with Heroku engineers over the course of 1.5 years (Feb 2011 - June 2012).

I'm not discussing any sort of cooked up statistical models. I'm discussing the real-world experience I had scaling an application on Heroku.

"mud-slinging from one side upon the other"

You imply that I was uninvolved before Rap Genius's expose. I assure you that is not the case. I had chosen a side in this argument well before Rap Genius went public.

The "you" was addressed to RG, not to you or your experience. I completely agree that scaling is one of the harder problems in computer science. Most people just run a stateless system and hope for the best, but this is a delicate matter and it is a hard problem.

The hard part of being a customer, or of being Heroku, is that the problem might be on the other side of the fence. And how do you communicate that in a diplomatic way?

Personally my opinion is something along the lines of "If you use Ruby + Rails in that configuration, then you deserve the problem".

Great summary, and I too think that it's best to separate one discussion about blame/responsibility/trust and a different one about the architecture/technology.

What I feel is also missing from most comments about the technical aspects is the effect of running more than one process on each node (which is possible with rails using e.g. unicorn). Being able to process more than one request at a time on each node should alleviate wait times and bottle-necks, even with a simple e.g. round-robin routing layer. It might not be the absolute optimum, but it could still provide a pretty good / good enough balance.

As a side note about the overload scenario you mentioned - It's very interesting to consider, but we can easily get dragged into Denial-of-Service territory, and designing against DoS is an even harder problem in my opinion (until you considered handling 'normal' load anyway).

Overload is as much about establishing a baseline. Even under "perfect" routing, the problem is that you need to know what load your system can cope with.

You need queueing to a certain extent since a queue will absorb spikes in the load and smooth out the sudden arrival of a large number of requests in a short time. But excessive queueing leads to latency.

One of the "intelligent" routing problems is we have to consider something more than queue length. We need to know, prior to running, how expensive a query is going to be. Otherwise we may end up queueing someone after the expensive query while a neighboring dyno could serve the request quickly. But you generally can't do this. Under load, such a system would "amplify" expensive queries: all queries after the expensive one in queue will be expensive as well.

This is why I would advise people to move to a model where concurrency happens "in the process" as well. It is actually easier to dequeue the work off the routing layer as fast as possible and then interleave expensive and cheap work in the dyno.

True for Rails 3, though you can enable threading (multiple requests per process). Rails 4 will likely default new apps threaded, but that option would still have to be manually enabled for older ones. (I say likely because there are no formal releases yet, and it's still possible stuff will get pulled back.)

It was news to me that Rails operated in this way. It is a very weak concurrency model (to be precise: it is nonexistent). I knew that Rails was naive, but it had not dawned on me just how naive it is. It is interesting because then a system like node.js is a giant leap in the right direction - even if there are better languages out there.

(disclaimer: I do Erlang for a living and often operate in highly concurrent settings.)

Rails has a pure shared-nothing scale-out philosophy; it achieves concurrency with multiple processes. The fact that Heroku only runs one process per dyno by default is a shame though.

It won't help much for several reasons:

Context switches are expensive. Very expensive. So your system runs overhead in the operating system. And you need to spawn a process per request.

Pooling processes is a problem as well. If we only have 4 workers in the pool, then we have to queue requests on one of the 4 workers. But we don't know how expensive those requests are to serve, a priori. Even knowing the queue length or the queue sojourn time won't be able to divulge this information to us, only help a little. More workers just push the problem out further.

If you want to be fast, you need:

The ability to switch between work quickly, in the same process.

The ability to interleave expensive work with cheap work.

The two main solutions are evented servers: Node.js, Twisted, Tornado (the latter two are Python); and preemptive runtimes: Go, Haskell, Erlang (of which Erlang is the only truly preemptive one). I much prefer the preemptive solution because it is automatic and you don't have to code anything yourself.

There is a strong similarity to cooperative and preemptive multitasking in operating systems by the way. Events are cooperative. Do note there are no more cooperative operating systems around which you use on a daily basis :)

As far as I know, the one process per dyno is not a restriction that Heroku puts on the application, it is an architectural decision of the application.

Spin up a worker pool if you want > 1 req at a time.

> The RoR model is very weak.

It's not "the RoR model". Rails is just an actor in the Rack space; how requests are fed to Rails is entirely up to the Rack adapter.

Rails itself is thread-safe and can handle requests concurrently with multiple threads; it can also run in a multi-process configuration.

The setup described in the article focuses on Thin, a web server that is primarily single-threaded. These days, many people use Unicorn, which uses a forking model to serve requests from concurrent processes.

From the perspective of someone who might be looking at Heroku as a host in the future, this is a bit scary. Their response appears to be mostly apologetic in that they're sorry that it happened - but does nothing to address the issue. It's more of a "we screwed up, oh well" than anything else.

They would have warranted a better response if they said they were actively looking into how to improve the routing system, but by the looks of things they're going to sit by and hope developers switch practices so they don't have to solve their problem.

Well, they did say that they were working on fixing the speed of concurrent requests for Rails on their platform. While that is vague, it would point to them actively working on a solution.

Regardless, this is a problem of web applications at scale. Personally, I've never had to scale an app above 2 dynos. So I will continue to use their service since it works as advertised for the domain of small startups that are not yet at scale. The work of porting from Heroku to AWS is work that I would have to put in anyways, so I see no reason to waste that time any earlier in a project than I have to. Sure, moving the db will be a pain, but it's something I'm willing to live with.

The bit where they mention they're working on concurrency:

Working to better support concurrent-request Rails apps on Cedar

I appreciate the honesty, but I don't see any "this is how we'll fix it", rather, just "we promise to document it and make it clear to anyone who wants to measure it".

Effectively, they have a fundamental architectural problem, and don't know how to work past it.

They also have a fundamental cultural problem, and don't appear to have recognised it.

In short, they went for ~2 years with documentation that advertised features that the implementation didn't have, while receiving a string of support issues that they wouldn't acknowledge as their problem.

Yet, the blog posts show no indication that they are interested in working out why they offered such terrible service to their customers and how they can fix the company culture to take these issues seriously in the future.

There's a chance they'll come up with an architectural solution for their routing problem. But unless they do some serious introspection and work out why they (as a team) stuffed up so badly, then there's no chance that they'll fix that problem.

And if they don't fix that, then why would anyone have confidence that this sort of issue isn't going to be commonplace?

(Disclaimer: Not a Heroku customer)

tl;dr Don't run your single-threaded Rails app (read: the vast majority of Rails apps) on Heroku. Neither the Cedar nor the Bamboo stack is optimized for them at this point.

It's incredible to me how the performance of their stack has degraded so far for what is probably the most common use case for their platform.

I think that's partly an insensitivity driven by Rails itself. Ruby and Rails have many good features, but speed is not one of them. Spending time working with Rails definitely wore down my previously high standards for page speed.

Given that their metrics hid the problem, I could see most shops saying, "Well, the stats are fast, so the occasional slow page we see must be a fluke."

There's another large issue with queueing requests on the dyno: when the app restarts, all the requests currently queued up on that dyno get dropped and the client receives a 503.

Sad to know. `httpd -k restart` was available since the beginning of time.

For what it's worth, Google App Engine uses a so-called "intelligent" global request queue/scheduler. In most circumstances it's quite effective.

It is effective, but not without problems.

http://code.google.com/p/googleappengine/issues/detail?id=78... http://code.google.com/p/googleappengine/issues/detail?id=57... http://code.google.com/p/googleappengine/issues/detail?id=78...

Needless to say, it is VASTLY better than Heroku's router. You'll never see an H12 error with GAE.

Indeed, while having a global request queue is good, the scheduler's one-size-fits-all behavior can sometimes be suboptimal. Thankfully there are several ways in which it can be tuned: https://developers.google.com/appengine/docs/adminconsole/pe...

Source please?

"Pending request latency arises when all of your application's available instances are too busy to serve new requests. When this happens, incoming requests go to a pending request queue." https://developers.google.com/appengine/docs/adminconsole/pe...

Pending latency is exposed as a metric on the App Engine Dashboard.

I am not a Rails developer nor a sys admin. I didn't really understand what the problem reported by Rap Genius was.

My question is does it affect Node.js apps?

The smoking gun is H12 errors, and they occur quite a bit on our NodeJS app. The key is to get the instances to serve requests as fast as possible, which is a good thing, except at some point you can only get responses out as fast as you can, and there are quite a few mysterious unexplained things that cause responses to happen slowly. The routing layer is fundamentally designed incorrectly. H12 errors should never happen in a random fashion, and they do. A request can sometimes take 10ms and other times time out after 30s, even if it is just returning 'ok' and not doing anything else.

It will. If X is the max number of requests node can handle per dyno, the router can send >X requests to a single dyno even if you have other dynos which are processing <X requests.

Essentially the router sends requests at random. This results in request pileups. In the original Rap Genius post, there is an animated image which shows this in action.
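If you'd rather see the pileup in numbers than in an animation, here is a deliberately simplified simulation. It assumes evenly spaced arrivals, a fixed service time, one request per dyno at a time, and (for the comparison case) a single idealized router with perfect knowledge of every dyno. None of that is Heroku's actual implementation; the 40 dynos, 400ms and 3000 req/min figures are borrowed from the support ticket quoted earlier in the thread.

```python
import random


def simulate(num_dynos, num_requests, interarrival_ms, service_ms, pick, seed=0):
    """Return the queueing delay (ms) each request sees before service starts."""
    rng = random.Random(seed)
    free_at = [0.0] * num_dynos  # when each dyno will have cleared its backlog
    waits = []
    for i in range(num_requests):
        now = i * interarrival_ms
        dyno = pick(free_at, rng)
        start = max(now, free_at[dyno])
        waits.append(start - now)
        free_at[dyno] = start + service_ms
    return waits


def pick_random(free_at, rng):
    """Random routing: ignore dyno state entirely."""
    return rng.randrange(len(free_at))


def pick_least_busy(free_at, rng):
    """Idealized routing: always choose the dyno that frees up first."""
    return min(range(len(free_at)), key=free_at.__getitem__)


# 3000 req/min is one request every 20 ms; simulate one minute of traffic.
for label, pick in (("random", pick_random), ("least busy", pick_least_busy)):
    waits = simulate(40, 3000, 20, 400, pick)
    queued = sum(1 for w in waits if w > 0)
    print(f"{label:>10}: {queued} of {len(waits)} requests queued, worst wait {max(waits):.0f} ms")
```

With these inputs the idealized router never queues anything, because only about 20 of the 40 dynos are busy at any moment, while random routing keeps stacking requests behind busy dynos that sit next to idle ones.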

> On the Cedar stack, the root cause is the fact that Cedar is optimized for concurrent request routing, while some frameworks, like Rails, are not concurrent in their default configurations.

First of all, Ruby doesn't exactly block on I/O [1], so there's that. Secondly, while Rails 3 has config.threadsafe! turned off, Rails 4 will have it turned on. [2]

1: https://github.com/tenderlove/fibur#synopsis 2: http://tenderlovemaking.com/2012/06/18/removing-config-threa...

EDIT: I just realized non-Rubyists might not get the Fibur joke. Here's the code: https://github.com/tenderlove/fibur/blob/master/lib/fibur.rb...

I think the discussion comes down to which option you ultimately prefer -- 1) be in full control of your stack, having only yourself to criticize, and having to deal with all the devops action that comes with it, or 2) let Heroku deal with the underlying infrastructure while you focus on your app, as their promise suggests, and suffer (also financially) from time to time due to faults like this.

Everyone has their own reasoning on this decision, and that's ok. I think that the voices here saying "I won't start my project on heroku because of this" are not thinking this through. I think that for any project starting up, focusing on app rather than infrastructure is much more important than routing issues.

People pay $35/mo for a Heroku Dyno that can only handle 1 request at a time? That's absurd.

I feel like Heroku is punting on this one in a big way.

I think this really means that for a small app, Heroku is fine; once it gets bigger, you probably want to get a real scaling strategy.

Agree. Though that's a bit ironic --- one of Heroku's selling points was scalability, or at least a big part of the Rails community believed it to be.

Don't know about Rails, but I'm building an MVP using Node.JS and hosting it on a free dyno. It can handle 250 reqs/s for dynamic requests (i.e. full stack, Postgres, Redis, Jade, Express etc.) and 1500+ reqs/s for static content. I do use cluster with 3 workers though. Pretty happy with the result so far.

Performance and reliability have never been their forte. They probably have the worst uptime in the industry.

Scalability... yes until the point you really need to scale.

The main/only advantage I really see to heroku is the ease of deployment and how they managed to keep it simple.

This guy's writing style and tone make me want to never do business with him.

You know, I wrote a really fast HTTP router using Redis pub/sub and a pool of single-threaded nodejs processes running on separate EC2 boxes and it worked great. I don't know why it is so hard to scale RoR when it is single-threaded.

They should have just used nodejs for routing.

So the tl;dr from Heroku is: we can't scale single-threaded applications.
