Dear Heroku: Quit blaming all of us when you fail. Do this instead… (pardner.com)
271 points by pardner on June 7, 2012 | 125 comments

This is misguided. Nobody cares why your site is down, and for most sites 99% of your users will have no clue what is meant by "This site is hosted by Heroku". And a good chunk of that other 1% isn't even going to bother reading the error text accompanying the whitescreen.

In the end, you chose to host your site on a platform that went down. That is just as much your fault as a typo in the code. If you had a setup with a hosted machine at Rackspace and the power goes out, you don't expect a custom error. So why would you expect one from Heroku?

"99% of your users will have no clue what is meant by "This site is hosted by Heroku""

Couldn't agree more.

Further, I wouldn't want anyone to know where the site is hosted anyway. We have some customers on a VPS at Media Temple and use custom DNS so they don't see MT's DNS servers (though of course they could find out by checking the IP). They don't need to know that info. As far as they are concerned, we are the vendor.

The message could simply say "We are having technical difficulties and we're working to get the service restored as quickly as possible".

I don't think anyone should take it upon themselves to say "Nobody cares" about the relevant details of an error. It's not just an issue of blame, but what the end customer can do with the information. The information that it is the platform that failed temporarily is useful for the end customers because they don't have to lose too much confidence in the software vendor because their code is failing. It lets the customers know the problem is probably temporary. Customers do care because a lot of them can interpret that level of error detail, and it is actionable, useful information to them. Heroku should at least give developers the option of finer grained error messages.

>In the end, you chose to host your site on a platform that went down.

That's only partially true. There are limited viable options for platform hosts out there, and virtually no one has cracked the 100% up-time challenge, so a software developer cannot choose a host with 100% up-time when none exists. Thus they're not at fault for choosing the "wrong" host, or for not building something astronomically difficult themselves.

"The information that it is the platform that failed temporarily is useful for the end customers because they don't have to lose too much confidence in the software vendor because their code is failing."

You're failing to understand that the overwhelming, vast majority of people have no idea what half the words in your sentence meant, have no idea what idea you're trying to convey, and even if they understood the idea, wouldn't be able to get it from your sentence.

Nobody cares whether it's a bug in your code or your host going down, because 99% of people don't know the difference.

I think that business customers do care quite a bit, actually. If, for instance, your business customer experiences significant downtime, but loves you otherwise, a verifiable explanation as to the source of that downtime can be the difference between keeping that customer or losing that customer.

If the customer attributes the downtime to your host rather than you, then your customer might tell you to switch hosts rather than fire you. If you have a good relationship, the added transparency can be the difference between keeping the customer and losing the customer.

Of course, this line of thinking is no incentive for Heroku to change its practices.

This is BS. Your users DO care.

When our service is down for some reason, the ONLY question we get (like this one we got today) is - "Has the service been down recently? If so, no worries - it happens, just wanted to report and see if this is temporary or if it is just me."

Having a "holy shit our entire service is down" message means your users don't have to ask "is it just me?".

That's a big difference. It has nothing to do with shifting blame, it's about keeping your users informed. Information makes people happy.

Yes, information does make people happy. And my point is that a message that says "this site is hosted at Heroku" gives absolutely zero information to the vast majority of internet users. Most people simply don't know what that phrase means. It's no different than if someone suggested a new feature that is only visible for Opera users. It's just a waste of time and effort to do something that caters to such a small sliver of the market.

Also, a Heroku-specific error page would give zero help with the problem you describe above. Anecdotal evidence in this thread suggests Heroku sometimes goes down briefly for small chunks of its users. So if you see the Heroku error, it could be for just you. Or it could be a system-wide outage. It wouldn't even solve that problem!

Showing enough information for your users to sensibly understand what they should do is important, but that's not what the article was about[1]. It was about the application developer wanting to pass the buck when their application failed because of a Heroku error.

This is really lame because you chose to host on Heroku, not your customers, so it is your fault. You can play pass-the-buck, but as far as your customers are concerned it is your fault.

1. "..Quit blaming all of us"

You say "pass the buck," I say "tell the truth." Heroku's current message says it's an "application error" even when it is not an application error. It is incorrect. And I suggested one simple way to correct it. They could word the message as they see fit, but platform_problem != application_error. What percent of customers grok/care about the difference is irrelevant... an incorrect error message should be corrected.

My only gripe with the Heroku app error page is it doesn't show your company's branding. I would like to be able to upload a static fail page with a generic message for my customers. Heck, let me specify a URL to redirect to when Heroku crashes.

That was my first thought as well. I was thinking "why is this author demanding that Heroku put their brand on his app's error messages?"

>If you had a setup with a hosted machine at Rackspace and the power goes out, you don't expect a custom error. So why would you expect one from Heroku?

Because you pay them to be a platform? (As opposed to paying Rackspace for some hardware to run things yourself.)

This is exactly the same stance that allowed Netscape to win the browser wars initially by being more user friendly. Arrogant and 'misguided' technical people wanted to see developers shamed for not conforming with strict HTML standards. They instead took it out on the end user.

Why do you care? Do you think this helps your customer at all?

Perhaps you're worried about your technical reputation. All you're doing is moving the blame from some part of your code to the decision you made to host on Heroku.

Down is down. Unavailable is unavailable. To your customers that's all that matters.

For what it's worth I think hosting on Heroku makes plenty of sense and I'm actually moving my app (Crisply) to Heroku so that my technical team spends less time writing chef scripts, less time managing database clusters and more time adding value. But when Heroku goes down my customers will be just as screwed.

In my experience, users actually do care quite a bit. Back when my company was on Rackspace, we had several periods of significant downtime that weren't our fault. Customers who contacted us were very upset, but when we made it clear that it wasn't a bug with our software, but rather a problem with the hosting provider, they all calmed down. I believe there are at least a few key things a customer takes away from a message like that, even if they don't understand anything about website hosting:

1) This isn't a bug with our software. They don't need to worry about the security of their data or anything else like that.

2) There's no point in getting mad at us. Sure we chose our hosting provider, but we're just as upset about the downtime as the customer is.

3) This problem is affecting other websites as well. People seem to be better at handling stress if they think everyone else is stressed out too.

4) There's a team of professionals working on the problem. My customers know I run a small company, and they seem to appreciate knowing that a much larger company handles the hosting.

>"1) This isn't a bug with our software. They don't need to worry about the security of their data or anything else like that." So I guess no one has ever loosed up a firewall rule when something was down to try to get it back up again, providing a perfect opportunity for someone with, let's say, stolen MySQL credentials to connect to your now-exposed DB.

>"2) There's no point in getting mad at us. Sure we chose our hosting provider, but we're just as upset about the downtime as the customer is." And they're losing just as much business because /their/ site/service is now broken too. That is plenty of reason to get mad at you guys, since to them, /you are the total solution/.

>"3) This problem is affecting other websites as well. People seem to be better at handling stress if they think everyone else is stressed out too." Maybe, but then I'd just say "wow, so you have really bad planning and expect that this stuff is all very reliable then, no?" The internet goes down. Power goes out. Expect it and build around it.

>"4) There's a team of professionals working on the problem. My customers know I run a small company, and they seem to appreciate knowing that a much larger company handles the hosting." Yes, because I fully expect IBM to resolve my issue faster than a mom-and-pop store. If anything, that would make me /more/ anxious. Ever had to call Level3, Cogent, or ATT for a null route? It takes us ~1 minute and them at least 20, sometimes much longer.

I don't mean to sound snarky, but it seems like you don't deal with customers very often. You're thinking rationally when you should be thinking emotionally. Customers are hugely inconvenienced by downtime, but there's nothing to be done about it. You just need to get them to calm down and remember that you're someone they like doing business with, and this is just an unfortunate mistake that's outside of your control. Like everyone else has already mentioned, everyone knows that websites go down, so customers can deal with it as long as you give them some peace of mind[1].

Here's a related anecdote: it's common for potential customers to ask me about security before signing up. I used to actually answer them by explaining the details of our security practices, but no one understood what I was talking about. Now I tell people that the site is hosted on Amazon's servers so we can take advantage of the infrastructure they've built. Obviously this is a non-answer, but it makes people feel a lot better. They knew they wouldn't be able to evaluate our security anyway, they just wanted some sign that they can trust us (and they already trusted us because we're the only company that picks up the phone when they call).

So I guess what I'm saying is that when you're dealing with technology, it's good to be rational. When you're dealing with people, emotions are what matter.

[1] Just so you know, one other thing that I would say during the downtimes was that we were in the process of switching from Rackspace to Amazon precisely because of these problems. We weren't just sitting around accepting them. Since making the switch, the service has been rock solid, so customers know we meant what we said.

Frankly, I believe it certainly can matter. It may not make any difference to all clients, but some will draw a distinction between avoidable / unavoidable downtime.

Rational clients understand they are not hiring demigods or living in some uptime utopia. However, they expect their service providers to adhere to best practices and make logical decisions.

In this case it's Heroku, in another setting it might be a NetApp filer. Both are reasonable solutions in their appropriate environment, yet both may fail. There's a significant difference between downtime caused by the failure of a reasonably trusted solution as opposed to that due to design flaws, bleeding edge prototyping and flat out bad decisions. The former wouldn't undermine my faith in a service provider, while the latter certainly might.

I believe you're looking at this issue through the eyes of a technical person who understands those distinctions. There's a huge group of sites and apps catering to entirely non-technical folks to whom that distinction would only bring confusion.

For a software bug tracking app, I can see why you'd want this. For a store that sells something like cloth diapers, I think it would be confusing (at absolute best) and certainly not any kind of improvement for either Heroku's customer or their customer's customer.

That's correct. I certainly am not pretending to speak for all consumers in all situations, which is why I stated it can matter.

My primary point was that since it can matter, the information should be presented. An underlying assumption being that it can't hurt, but could help.

However, you brought up a point I hadn't fully considered : namely that this information could be directly confusing to some users and by extension undermine their faith in their service provider.

That said, I think the sample error message in the OP is sufficiently clear for the broad spectrum of users. Therefore, I believe presenting users with that message, or something similar, would do more good than harm.

But how do you know what kind of users Heroku's customers' customers are? How do you know that none of them are technical users? How do you know that none of Heroku's customers are running a software bug tracking app?

I don't know, but your comment suggests you're missing my point entirely. It's not about what each individual customer does; it's that implementing such a feature across the board might work well for Heroku customers whose own customers are technical, but would fall down big time for those whose customers are non-technical.

Heroku going down isn't unavoidable downtime. It's clownshoes downtime.

That's a strong statement. Can you please elaborate?

Heroku is generally regarded as a reliable hosting solution. If it was a bargain basement alternative that was expected to fail randomly, then this would in fact be "clownshoes" downtime.

Building in full redundancy behind Heroku is possible, but non-trivial. They are after all being paid to provide a reasonably fault tolerant solution.

Therefore, while not literally unavoidable, it may not make good business sense for a venture to incur the additional expenses associated with implementing full redundancy. Most of their clients probably understand that downtime happens and will be perfectly content as long as every reasonable effort was made to keep the system online.

edited to add : Communication is key. Business relationships typically don't crumble due to aberrations like this (power outage, hosting provider going down, etc.). However, they do crumble if communication either doesn't occur, or the wrong information is conveyed. That is the crux of this discussion.

Suppose some of your customers see an error message and report it. You could deliver valuable uptime to them that much faster when you know it's a bug in your code and not Heroku.

When you know it's Heroku, you can deliver valuable uptime to your customers faster because you don't have to spend time testing for bugs and checking server logs. You can jump straight to what you would do anyway: contacting Heroku.

Your customers won't be just as screwed--3 hours of down is not the same as 6 hours of down.

Well said; I'd even take it a step further and say that your customers (unless your customers are technical folks) have absolutely no idea about the distinction anyway. Saying that it's hosted on Heroku means nothing to them.

Moreover, there may be companies hosting on Heroku who insist that the service is "white label" for any number of reasons; forcing something like this onto them would devalue Heroku to those companies.

>unless your customers are technical folks

But Heroku can't know if the customers of their customers are technical folks or not, and neither can you. Some people do care. The OP, for example, seems to care, and we can assume that's because his customers do in fact care (only the OP can decide if his customers care or not).

Bullshit. It's your fault. I'm your user, you took my money. We're done here. Everything is your fault.

Actually, I do not disagree... I make the platform choices, so I live with the results. However, IMO an error message that distinguishes between a "hosting" issue and an "app" issue is not only fair, it is in fact meaningful data to (at least some) customers.

The people who care that it's a Heroku issue not app issue (hint: not many) have probably already heard about the Heroku outage.

Everyone else will be confused about what the hell this "Heroku" thing is.

That said, if they don't use appropriate error codes (maybe 502 or 504 for Heroku issues and 503 for app issues?) they should. But I don't think error messages should mention "Heroku" by name.
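A minimal Python sketch of the status-code split suggested above. The mapping (502/504 for platform trouble, 503 for app trouble) is only this comment's proposal, not Heroku's documented behavior, and `classify_outage`, `user_message`, and the message wording are hypothetical names made up for illustration.

```python
# Assumed mapping, taken from the suggestion in the comment above.
PLATFORM_CODES = {502, 504}  # Bad Gateway, Gateway Timeout
APP_CODES = {503}            # Service Unavailable


def classify_outage(status_code: int) -> str:
    """Return 'platform', 'app', or 'other' for an HTTP status code."""
    if status_code in PLATFORM_CODES:
        return "platform"
    if status_code in APP_CODES:
        return "app"
    return "other"


def user_message(status_code: int) -> str:
    """Pick a message that tells the two cases apart without naming the host."""
    kind = classify_outage(status_code)
    if kind == "platform":
        return "Our hosting platform is having trouble. We're working on it."
    if kind == "app":
        return "Our application is having trouble. We're working on it."
    return "Something went wrong. We're working on it."
```

The point of the split is that a monitoring script or a technical customer can tell the cases apart from the status code alone, while the visible text stays generic.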

I think that's true with Heroku, but gets less true as providers get bigger. If you tell a user that your service is temporarily down "because Google is down", even a lot of regular people will know what that means, and not really blame you for it.

>"and not really blame you for it"

Delusional. If it's down, it's done, and unless your product is a developer tool, >99% of people won't care why.

Not necessarily true, especially if your customers are management types looking to lay on the blame as thick as possible wherever they find it, and will terminate a relationship if they believe the service-provider is incompetent.

For those kinds of customers, they may understand what Heroku is and why their vendor is using it and will definitely make at least some distinction about outage fault.

So far my evidence suggests that that's incorrect. I suspect it's because a lot of people think "Google" is synonymous with "the Internet", though.

I'm not sure about that either. If it was my product I think I'd want a polite on-brand 500 page saying my hamsters are working on it. Seeing something that suggests hosting is down doesn't frankly matter to me, even as a technical user... and to someone in between myself and my mom, with enough knowledge to understand what a "hosting company" is... I wonder if they might feel ripped off as well: "Oh great. I gave this company my money and they don't even have their own servers!"

Your customer wants to know whether it's you or Heroku because...? (There may be a legitimate business case for the distinction. Perhaps you can clarify.)

Why, do your users care?

You might be providing a service to people who would like to know what part of the chain broke. If hosting company A fails all the time, and that's visible to me as a technical user of service A, I would avoid hosting company A for anything that I happen to host. I'm not the average end user, but that information is still valuable to me, and I would think less of hosting company A, instead of service A.

By not saying when it's their downtime, Heroku is insulating its reputation as a hosting company from end users, at the expense of the customers already using them. It's a little shady.

I agree that the average end user would probably not care, but most not caring does not mean it's not valuable information to some people. So I see where the original poster is coming from.

This is a poor response: Abusive, insulting and makes only a token effort to advance the discussion. Does the fact that it sits at the top of the page mean that it's the most highly rated?

And in practical terms, it seems totally theoretical.

In my experience, more information is always valuable. It's not a matter of shifting responsibility, it's a matter of understanding what the problem is and efficiently getting it resolved.

Sending out an error that is incorrect is wrong -- both a theoretical and practical observation.

I apologize if anyone felt it was abusive or insulting, it certainly wasn't intended that way. I'm very passionate about holding myself directly accountable for the entirety of my user's experience with what they paid me for. It's possible that the message suggested by the OP would improve the user's experience, but I don't see how and the OP didn't make a case for that.

Instead the article read to me as if the benefit of displaying this message is that the user's frustration might be allowed to shift to the sub-contracted vendor. I find it hard not to be infuriated by that idea.

And yeah, I think it means it's the most highly rated... or at least something very close to that.

>"It's possible that the message suggested by the OP would improve the user's experience"

Unless the error message is a slightly-less-functional app, no.


OP was playing the role of the user. Paying customers don't care about the implementation details of the products they pay for. They only care about whether or not they work.

Leave it to paying customers to decide what they care about, and leave it to Heroku's paying customers to decide whether they care about the content of the error page.

I believe the key lies in the second sentence, 'Nobody cares why your site is down, and for most sites 99% of your users will have no clue what is meant by "This site is hosted by Heroku"'. My mother wouldn't care whom the hosting provider is nor understand what it is.


The hard fact is that /your app is unavailable/ and I promise that >99% of users won't care why. Did anyone care why Twitter used to Failwhale? It was down, and that sucked, and software exists these days (and has for >20 years) to eliminate single points of failure.

Please take a close look at the comment you replied to and your response. I think you'll see that you're feigning outrage, in an attempt to advance your own point of view.

Not always true. As a freelance dev I tend to give options to my clients so they end up owning their hosting. If something happens, I'm authorized to get in and contact support, but I always make it clear to them that they are responsible for it.

Wow, totally. Really good point. In this situation your client is a hosting user and the distinction between the error messages makes loads of sense.

Exactly! A customer couldn't care less who you host with and what stack you use. The customer thinks one thing: if YOU are down, then YOU are down.

If you want to look at it that way, in most cases it's the customer's fault, because they don't want to pay what it would require to have a fully redundant system that could weather certain kinds of outages.

You get what you pay for.

Bullshit again. Most customers think they have paid for a fully redundant system... unless I missed a recent wave of conversion funnels that included a message about how the reason it's only $9 a month is because it might go down, like, whenever.

But you're right about the second part. You get what you pay for... unfortunately that's often different than getting what you bought.

You're missing the point.

What OP is complaining about is that when Heroku has an outage, it says that there's an error within the client's application. I agree that it's the client's responsibility to have an up-and-running app, and that the average user doesn't really care what's going on behind the scenes, but in this case Heroku is still giving out factually wrong and misleading information.

I can imagine users will oftentimes tell Heroku's clients to fix their app when in reality there's nothing they can do.

I don't think so. I think you're missing the point. I think the point is that there is an error in the application. Heroku is part of the application, and the customer doesn't care. If you extend your line of thinking then you could put up error messages like "We're sorry, but this gosh darn database driver has totally let us down but we didn't write it so go complain to the people who did".

> "...in reality there's nothing they can do."

This is not a reality I'm familiar with.

A hundred times. You need a backup plan for if Heroku is out.

...and if Heroku isn't reliably reliable, time to find new hosting or work on that HA scheme that the team has been itching for.

Devs don't get that UX is all that matters for most software.

Agreed. The last thing a customer cares about is vendors pointing fingers at each other.

When you go to a restaurant to get lunch, and they are closed due to a power failure, do you blame them?

I agree with the other commenter that the analogies are getting stretched but I'll bite:

1. You haven't paid them for anything yet.

2. Wrong question!!! It's an opportunity. If I showed up for lunch and despite power being out (probably on the entire block or neighborhood) the proprietors were set up outside making cold sandwiches next to a sign that said "Sorry, power's out so only egg salad" I'd be thrilled. Here are people single-mindedly devoted to my experience.

That's really my point. It's an attitude problem. I want to spend my money with people who hustle when it hurts, and I want to do that for my users. I'm not saying it's not "fair" to close shop and blame the other guy. Sure it's fair, but the person who cares more is gonna eat (or make, in this analogy) your lunch... and the world will be a better place for it.

>> If I showed up for lunch and despite power being out (probably on the entire block or neighborhood) the proprietors were set up outside making cold sandwiches next to a sign that said "Sorry, power's out so only egg salad" I'd be thrilled.

The relevant part of the analogy is that you, the end user, wouldn't actually know why the restaurant is closed. Could be a power outage, or it could be due to health code violations. Having this information accurately communicated to the customer could impact their willingness to return to the establishment.

Purely as a customer, I do not give a crap whether you make egg salad in an attempt to show your hustle and dedication. I am not thrilled because I do not go out for egg salad to begin with. If you are an awesome hustling entrepreneur, it does not make your egg salad taste any better to me. If the power is out, there is nothing you can do and I understand that and do not penalize you for it, if you normally provide me with good service.

I don't want to start a whole "actually, it's more like x analogy" thread, but really, there are gaping fundamental holes when applying yours to the situation.

If they put up a sign which says "temporarily closed due to power failure" then I do not blame them.

I go someplace outside of the power failure.

You should plan for your customer to tell you that, but it's complicated when your vendor tells your customer that.

True, though the dev has the same relationship with Heroku.

I agree with the content of the article, but you're right -- users don't wanna hear it.

Continue in that vein, and if there is a natural disaster, it is their fault as well.

The information is useful. Heroku should provide it. We're done here.

I totally agree with you (the first part). It's all calculated risk and evaluation of cost: you keep it up when the math works out. If it's down and it hurts your customers to the point that they'll walk, it's on you. Bad math, your problem.

I speak as someone who's worked primarily in healthcare building services where I assure you we held ourselves personally responsible for natural disasters.

So it depends on your app. If my startup lets people take photos of their dessert and paste lolcats on them, then maybe my hosting goes down and I show my users a page that says the server must have farted, who cares.. but the last thing I'd do is show a page that said the people I pay with their money must be fucking up at the moment and we'll all wait together for things to get better.

Point is: My users are not my peers, they're my responsibility and livelihood. Even when something totally out of my control occurs. Fuck, especially when something out of my control occurs.

It is their fault. The company could have used global load balancing with the app hosted in multiple geographically distributed data centers in multiple national jurisdictions through multiple independent providers. This would ensure that the earthquake which leveled Amazon's California data center and the flood which took out Hetzner's data center in Germany and the martial law declaration which took out Linode's Japanese data center and the bankruptcy which closed down Rackspace's Amsterdam data center have no operational impact on Peer1's data center in New York, where the service continues uninterrupted. As the data is fully synchronized between all data centers, the company keeps running, all customers are online, and the company can work on setting up additional redundancy in a Canadian data center to make up for the others which were lost.

Yes, this costs money. It's why people accustomed to getting everything for free on the internet can't fathom why larger companies charge six or seven figures for a service that they could roll out themselves by installing an open source package on some Linode VM. If you're paying that kind of money for the reliability, it's because you're extending a promise to your end customers, and the service contract you receive from your provider should come with lots of guarantees and financial penalties if the conditions warranting the price tag aren't met.

It is absolutely your fault if the site goes down due to natural disaster. This is why companies that are large enough to do so host the site at multiple locations with the ability to fail over.
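To put rough numbers on the multi-location argument: if each location is up a fraction `a` of the time and failures are independent, the chance that all `n` are down at once is (1 - a)^n. Independence is a strong assumption (shared DNS, a bad deploy, or a botched failover all break it), and the function names and the 99.9% figure below are made up for illustration, so treat this as a back-of-the-envelope sketch only.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Availability of `sites` independent replicas, any one of which can serve."""
    if not 0.0 <= per_site <= 1.0:
        raise ValueError("per_site must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - per_site) ** sites


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    # Hypothetical 99.9% per-location availability, 1 to 3 locations.
    for n in (1, 2, 3):
        a = combined_availability(0.999, n)
        print(n, "location(s):", downtime_hours_per_year(a), "hours/year")
```

Each extra location costs real money, which is the trade-off the replies below argue about.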

Customers usually don't care about the reason for outage. They gave you money. If the service is running, good. If it's not, you screwed up. No matter what actually happened, you should've been prepared. Sad but true.

"No matter what actually happened, you should've been prepared."

This is just plain false. Being prepared comes at a cost. If you over-prepare, then your customers have to pay more for no good reason, and they don't necessarily want to. You have to draw a line and make a judgement call.

There are such things as natural (or political) disasters so serious that it would be extremely stupid to plan for them. And there are other disasters in between this and run of the mill. Again, it's a judgement call. And it's not your "fault" if the customer wants a combination of low price and reliability, and you made a reasonable tradeoff in order to achieve it.

As long as you are being honest with your customer and explain this somewhere.

The problem is one of expectations. If you don't say anywhere what you have prepared for and what you are going to do when something you didn't prepare for happens, you are misleading the customer, as they will rightly assume you have prepared for most ordinary things (a Heroku outage, for instance).

It is false. Most of the time, at least. But then again I wasn't expressing my opinion but a probable opinion of Your Regular Customer.

Yes, Heroku should put up a different error notification when the problem is on their side, but I doubt it would make that much of a difference in the eyes of the user.

That's all.

I don't think the OP is right in general.

What if your customer sees Heroku's name, and gets confused?

She starts asking questions like: Who is on the other end? Am I in business with X or with Heroku? Who should I call?

This is actually a really good point. I can't find it now, but there's an article floating around somewhere about a government employee in some small town accusing Apache or Debian or some project of being "hackers" because his web server broke and was showing the default "congrats it works!" page instead of the town web site.

An ex-boss accused me of hacking and embezzlement(!) when he discovered he was paying £10/month for DNS services a few years back. It was something in place from years before I even started!

He also flipped out over firewall issues on his new Macbook (the site is down and taken the internet with it!) and problems with his ISP (some weird adult filter).

Point is, some people expect a site to be a single atomic thing with no dependencies or connections between. Throw anything "unusual" in there and confusion and chaos reign. My ex-boss was somewhat extreme I admit, but I've experienced it to a lesser degree many many times.

That's why Apache's default page is now <h1>It works!</h1>.

That's not an isolated occurrence. Most Web projects get it all the time from randoms, sometimes completely unaffiliated with the site in question.

Just don't mention Heroku by name. "Webhost" or something is enough. I'm in two minds about the whole thing though.

As someone who was inconvenienced by the outage, and with no mitigation strategy in place, I DON'T blame Heroku. The weight is placed squarely on me (lone tech in our company) for not having researched how to distribute services alongside Heroku, or fall back to something else, or whatever the proper term is.

I've been googling like mad since this morning, finding a few mostly-unanswered StackOverflow questions and a smattering of blog posts, but I haven't learned much. The only clear-cut answers I've seen are:

1. Hire a sysadmin who knows more than you do (But the whole point is that I want to learn for myself!).

2. Pay for a service that will host in multiple geographic locations for you, and do the switchover (recovery? fallback? I don't know my terms here) for you.

3. A few mentions of "load balancers" and "heartbeat monitors". Sounds self-explanatory, and these are my current terms of googling.

Any suggestions on where to start acquiring this sort of skill? I'm prepared to teach myself anything, but the problem is not knowing the terms for what I want to learn.

EDIT: Well, just watching this thread is helping a bit.
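
On the "heartbeat monitor" term: as far as I understand it, it usually just means something that probes your site on a schedule and declares it down after several consecutive failed checks. Here is a minimal Python sketch of that decision logic, not a production tool. The hostnames, the threshold of 3, and the `choose_target` helper are all made up for illustration, and the list of booleans stands in for real HTTP checks.

```python
from typing import Iterable


class HeartbeatMonitor:
    """Counts consecutive failed health checks for one host."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True if the host counts as down."""
        if healthy:
            # Any successful check resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def choose_target(results: Iterable[bool], primary: str, secondary: str,
                  failure_threshold: int = 3) -> str:
    """Replay a series of check results and return where traffic should go."""
    monitor = HeartbeatMonitor(failure_threshold)
    target = primary
    for healthy in results:
        down = monitor.record(healthy)
        target = secondary if down else primary
    return target


# Hypothetical hostnames; the booleans are simulated check results.
print(choose_target([True, False, False, False],
                    "primary.example.com", "backup.example.com"))
```

With this rule a single good check fails straight back to the primary. A real setup would probably want to require several good checks in a row before switching back.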

How do you not blame Heroku? You are paying them a good chunk of money to handle not only hosting, but failover strategies, multiple locations, etc. If you have to worry about any of those things they aren't doing their job.

That's like if you chose MySQL as the database, and then when an update had a huge bug that broke your site, you say "totally my fault that I don't have a version of the site that uses PostgreSQL."

To clarify part of this, Heroku doesn't have the same position as (say) a router manufacturer because they are offering an all-in-one 'platform'. And unlike MySQL, they are charging you a decent rate for using it.

Failover across geographically distributed datacenters is a challenge that doesn't get talked about all that much.

As a small company you probably aren't able to easily get your own IP block allocated (that I know of) so BGP [0] isn't really an option and the best you can do is probably DNS switching. Use a good DNS provider and set your TTLs to something low like 30 seconds or 1 minute. Then when you have an outage, change the DNS entry to point to a secondary datacenter, which would have a static error page or a reduced-functionality site. There seems to be some debate around whether low DNS TTLs increase users' request times, but we haven't seen it.
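
To put rough numbers on the TTL advice: assuming resolvers honor the TTL (not all do), the worst case for a user is the time it takes you to notice the outage and change the record, plus one full TTL for cached answers to expire. A back-of-the-envelope sketch in Python, where the 5-minute reaction time and the one-hour comparison TTL are made-up figures:

```python
def worst_case_switchover_seconds(reaction_seconds: int, dns_ttl_seconds: int) -> int:
    """Upper bound on how long a user may keep hitting the dead primary.

    Assumes every resolver honors the TTL, which is not guaranteed.
    """
    return reaction_seconds + dns_ttl_seconds


# Hypothetical 5-minute manual reaction time with a 30-second TTL.
low_ttl = worst_case_switchover_seconds(300, 30)
# Same reaction time with a one-hour TTL.
high_ttl = worst_case_switchover_seconds(300, 3600)
print(low_ttl, high_ttl)
```

Under those assumptions the low TTL bounds the outage at about five and a half minutes, versus over an hour with the long TTL, which is why the TTL has to be lowered before the outage rather than during it.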

There are some companies that will handle the monitoring and switchover for you (Dyn comes to mind) but we prefer to switch over manually for the time being. We have a Big Red Button Sinatra app that reports the status of the site and allows you to fail over to the secondary and recover when the primary returns; I'm planning on open sourcing it once it gets some documentation.

I think the reason failover doesn't get talked about as much in the startup world is just because it's hard to do and the costs are disproportionately high for a small company unless availability is really critical to you. For most people, just using multiple availability zones on EC2 is probably sufficient.

[0] http://ajohnstone.com/achives/high-availability-across-multi...

Even vendors that promise that, e.g. Amazon, aren't infallible.

The gist that I'm getting from a few places seems to be: Have separate hosts/service providers, and do the load balancing yourself, or make the switch yourself, when one fails. I've yet to find many detailed examples, though, as most similar articles deal with load balancing within your own locally-managed network, or co-located set of machines.

The more generalized "Cloud plus Dedicated" fallback/load balancing seems fairly involved, and raises a lot of other questions, but at least I've got a path to follow now. Also would be more expensive, as a backup server might just be hanging around doing nothing at times.

Then again, it would pay for itself in satisfied customers after just a single event.

The cost of running your own infrastructure at this level is slower development and ongoing hassle, vs. Heroku.

Unless your application absolutely must be up with higher availability than Heroku provides, it's probably not worth the effort. The easiest thing to do is to use something like Cloudflare in front of Heroku, so at least when Heroku is down, you can serve a static page to customers informing them of the problem and estimated time to fix.

Heroku has a mechanism for displaying custom error and maintenance pages, served off of S3.


The earlier Heroku outage also brought down custom error pages. Our site was only displaying a 500 server error via nginx.

But was it serving _your_ 500 error page? In which case you could make it say anything you want. Heroku platform errors are not 500s; they're 502s (or 503s, I forget).

From a customer's perspective, there are only two parties in their relationship with you: you, and them. When something goes wrong with your application, you either accept the blame, or you make the customers feel like they broke something. To the average user, seeing an error message like "heroku is down" (or any other jargon) leaves the possibility that they might have broken something, and the failure is on their end. The end result of this interaction is that your software has made your user feel bad about themselves. This is not a way to get your users to return to you.

Heroku's error message could be friendlier, but it currently contains only words that any user can understand, which reassures your customers that even though the service they are looking for is unavailable, there is nothing they could have done to improve the situation. Your customers might leave with a lowered opinion of your service, but your app doesn't make them feel ashamed of themselves, which is a much better outcome.

If we put our normal-user-hat on for a minute: I don't see how that would make any difference for 90%+ of users. They'll see the website isn't working, but another site is, so your site is broken. End of story.

(With my developer hat on, Heroku outages are fun: our internal switchboard at http://www.pagerduty.com lights up like a christmas tree)

Heroku isn't for apps that can't stand downtime. My experience has been that if you have 2-3 heroku apps, and you monitor them with a 3rd party tool, you'll see random "server not found" behavior every few weeks. (And no, they're not just timing out from dyno spinup). Usually this isn't a system wide outage and never gets mentioned on their status page.

So only use heroku if:

a) Uptime is non-critical & you just don't want to deal with setting up a server

b) Uptime is non-critical & you don't know how to set up a server

You can customize the error page. It is your fault for not reading the docs and serving the default.


I don't want my users to know I use Heroku, and by using Heroku I understand that if they go down my site goes down. It really is "our" problem at that point with regards to how our users understand it.

If Heroku is down, and I discover that site X that I want to visit was hosted on Heroku, I'm more likely to hear that Heroku is back up or that site X is back up, than just that site X is back up. I also can skip checking any other sites I know to be using Heroku during the outage. It is therefore mildly useful data to a user.

Add a custom 500 error page. Problem solved: you can make it say anything you want.

It’s your fault for not having a fault tolerant site that runs on another service provider. This is what happens when you put your eggs in one basket and that basket bursts into flames.

If reliability is so important, make it a priority instead of just expecting stuff to work or asking for a more politically correct error message. Which leads me to my next point: who cares about the ERROR message? The damage has been done by that point and half the people won't bother to read any further. Cue sounds of people clicking back buttons as fast as they can.

Question: given that Heroku involves a certain amount of platform lock-in, how do you write a Heroku app that runs on another service provider?

It hardly makes sense if Heroku says "well, it's your fault for trusting us."

This seems specious. Correctly assigning blame won't matter to readers; most people couldn't care less. Sticking a different brand name on the failure is side-stepping the issue: nothing stays up 100% of the time. Create a custom page that treats the situation with a little bit of levity.

If legitimate downtime happens often enough that someone would actually internalize the difference between your failures and Heroku's, you have bigger problems than your error page.

Quite off topic, but I'm always sad to see really poor scalability:

To my surprise, this blog post hit the top spot on HN at least briefly. My blog started throwing some app errors.

I've had a couple of hit HN stories on my blog without a problem, and it was hosted from my apartment on an old server with 256MB of RAM. Now, it is static pages served through nginx, but I'm pretty sure that a few thousand hits shouldn't require 10 Heroku dynos to not fall over.

Kids these days. (the mindset, not the age)

I doubt it requires anywhere near 10 dynos, normally I just run 1, and I suspect 2 or 3 dynos would work fine today... but since 10 dynos only costs 45 cents per hour, and it's presumably just for an hour or two, I simply threw two handfuls at the problem and went back to my actual work. Handling a one-time spike didn't seem like something worth optimizing when I could throw the price of a cup of coffee at the problem.

Ah, my fault for not recognizing pragmatism at work. Rare events are definitely not worth optimizing for.

I've seen so much blame in IT/systems/coding in the last 15 years that I can't recount it all. Anytime a vendor or service provider or consultant is involved, get ready for finger pointing when things go wrong (from both sides). I think many managers like being able to blame them and see this as a benefit of the relationship. Outside providers should just expect to be blamed for things they did not do and charge for that accordingly.

I think that for completeness, it should display a complete blame derivation graph that explains to the user the full chain of events, right back to the original person who was ultimately responsible.

After all, it wouldn't be fair for Heroku to be blamed just because a piece of networking equipment failed - the user should be informed which vendor is at fault, and in turn, which supplier the failed component within said equipment came from.

At the end of the day, if you have a service that other people/businesses/clients rely on, and they need 24/7 uptime, then you really need to have a plan B that is not on Heroku or AWS. A REAL disaster recovery plan needs to be thought out and implemented. If you don't want your users to see the "there is a problem with this app" page on Heroku, then it's your job to figure out what that plan B is. If you can't afford a plan B, then well, tough shit. As someone who has worked in the hosting business for years on the operations side, I'd say it's also the responsibility of the client to plan for the scenario where the primary host is not reachable (regardless of whether it's an application-level issue, or a network or power outage). The hosting company can only build as many N+1 backups (network/power/etc.) as it can afford and physically fit. You can buy all the load balancing you want, redundant web servers and database servers. If you aren't hosting in a secondary place and your primary host fails, all those redundant servers you are paying for aren't going to mean a damn thing.

We know when Heroku is down because emails from our clients' apps drown our inboxes, and our clients get pounded by their clients.

I think it would be ideal to allow you to customize these messages to make things easier, but I can't imagine the infrastructure they would need to have in place to support this.

The option presented by the article is a lot simpler.

Many companies I know would immediately fire a service provider for ever disclosing their existence to an end customer. If anything, Heroku's customers should be able to replace the default error message such that it conforms to the customer's site branding.

You can customize your error pages to be whatever you want.


So the solution is for you to run a script which monitors Heroku for outages and changes the error page?

If you are doing that, you might as well write an app against another platform.

They ought to create an interface for serving up a custom 500 error page.

What if the problem only affects a subset of users? Then wouldn't any application errors for unaffected users (e.g. a typo in code) say it is Heroku's fault when it really isn't?

It may be difficult, from a platform standpoint, for them to tell where the outage is exactly.

Also I don't think they should tell everyone heroku is hosting it so I don't think that is a good solution.

Re your second point: Yeah, I pondered that, and they could make the "ooops" more generic. However, we freely talk about being on Heroku since by and large it seems to engender customer confidence. And it's not exactly a secret, since the DNS will point to proxy.heroku.com after all.

All this ruckus for 18 mins of downtime? Moved my main app off heroku this monday for various reasons (mainly to get better log access and to run the app in europe).

The only change I would make, if any, is to remove the sentence about being the application owner. Aside from that, that's all I'm going to tell a customer anyway.

This is what a culture of pointing fingers leads to. The author should realize the customer does not care why the site is down.

"throw more dynos at it"

You kids and your lingo these days....

/back in my day/ we used to have servers. REAL, PHYSICAL servers :)

Does Heroku not have custom ErrorDocuments? We had those in the 90's...

Isn't it your fault to have picked Heroku in this case?

Yes, it is your fault for trusting what Heroku says about its availability. But it would be classy for Heroku to take responsibility. It is not classy for Heroku to say "it's your fault because you trusted us" in front of the users, which seems to be the principal defense of Heroku in these comments.

Or you could get on Engine Yard and stop caring what Heroku does. That's worked wonders for my business.


Wow, didn't know that (don't use Heroku). Good article. Good solution.
