Wouldn't that only affect a small subset of visitors? For example, why would I be seeing any issues if I were hitting an Asia-Pacific volume instead of a US-EAST-1 one? It seems like it goes deeper than that.
One problem we've seen before: when a large percentage of AWS infrastructure goes down, customers don't just quietly suffer. They scramble to launch replacement infrastructure in other zones or regions, which creates a cascading series of load spikes throughout the AWS system.
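That cascading-spike failure mode is exactly why retry logic typically uses capped exponential backoff with jitter, so that thousands of clients don't re-launch in lockstep. A minimal sketch (the function is illustrative, not part of any AWS SDK):

```python
import random

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Capped exponential backoff with full jitter.

    Spreading retries out randomly keeps a synchronized
    "thundering herd" of re-launch attempts from turning one
    zone's outage into a region-wide load spike.
    """
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
        delays.append(random.uniform(0, ceiling))  # full jitter
    return delays
```

Each client sleeps for the sampled delay before its next attempt, so the retry load arrives smeared over time instead of all at once.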
AWS is a fascinating science experiment. Pity about the websites, though.
It's now yellow with this: "9:27 PM PDT We continue to investigate this issue. We can confirm that there is both impact to volumes and instances in a single AZ in US-EAST-1 Region. We are also experiencing increased error rates and latencies on the EC2 APIs in the US-EAST-1 Region."
AWS has been historically bad at reporting the severity of their outages promptly.
Is Heroku still in beta? The name has been around for a while; I would have thought the platform would have stabilized by now and that complete outages would be very unlikely.
In the past 30 days they have had two outages which lasted more than two hours each. That's a lot of downtime.
Any customer of Heroku who didn't weigh the cost/benefit of cloud hosting is a fool and deserves to watch their app have downtime.
Cloud hosting is fantastic, but it's a trade-off. There are so many layers of abstraction between you and the hardware that you are completely at the mercy of one, two, or (more!) technical organizations, each with its own support systems and its own degree of opacity around its infrastructure.
The fact that Amazon went down IS VERY interesting for a customer of Heroku.
And if it isn't, then that customer is a fool for outsourcing so much of their system without even understanding the risks involved.
I don't know; I think Heroku should offer the value proposition that: hey, we have it covered - if we rely on another cloud, it's on us to build in redundancy and reliability on top of it, so as not to burden you, the app developer who is paying us to take care of operations and scaling.
What would interest me about an Amazon outage being behind a Heroku outage is keeping a tally: if Heroku didn't manage to build in enough reliability to be resilient even to an Amazon outage in a particular region, I'd question whether they were a good fit.
Dear Heroku -- I know it's my job to make sure my site is available (/thread). However, I think I speak for most enterprise customers when I say I will throw money at your company the second you come up with a multi-zone/highly-available offering.
It's an open offer. Do they support python?
edit: they do, but it's not clear how to easily deploy a multi-zone application. Could you point me towards some docs?
Right now any single app can be deployed to any one of a number of infrastructures; AppFog is also working on the ability to run one app on multiple infrastructures simultaneously.
appfog.com has outages as well - http://blog.appfog.com/october-27th-downtime-postmortem/ - when you go with a public cloud, your app's well-being is always in a third party's hands. If you want to mitigate that risk, host your own private PaaS on your own infrastructure; then you can only point the finger at yourself when outages occur.
Please take this outage as proof that you need to build out your own infrastructure and hire your own operations team in multiple geographic locations.
In the meantime, we will continue to focus on building new features and products that our customers love on our EC2-, Heroku-, and cloud-based system.
Nah, I'm going to pick dedicated hardware with SSDs for IO consistency that beats the pants off of AWS for a fraction of the price and not much more time commitment, rather than for the potentially better uptime.
If reliability is that critical, you need multiple data centers. This is far easier to implement with EC2 than by building out hardware.
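To illustrate the point about multiple data centers being easier with EC2: with an SDK it's a short loop over regions. A hedged sketch using the boto3 SDK (the region list and AMI IDs are placeholders - AMIs are region-specific, so each region needs its own image ID):

```python
# Hypothetical per-region AMI map; the IDs below are placeholders,
# not real images.
REGION_AMIS = {
    "us-east-1": "ami-xxxxxxxx",
    "us-west-2": "ami-yyyyyyyy",
}

def launch_specs(instance_type="m1.small"):
    """Build one run_instances parameter set per region."""
    return {
        region: {
            "ImageId": ami,
            "InstanceType": instance_type,
            "MinCount": 1,
            "MaxCount": 1,
        }
        for region, ami in REGION_AMIS.items()
    }

def launch_everywhere():
    """Launch one instance in every configured region."""
    import boto3  # deferred so the spec-building part stays dependency-free
    for region, params in launch_specs().items():
        ec2 = boto3.client("ec2", region_name=region)
        ec2.run_instances(**params)  # one small fleet per region
```

The same pattern with owned hardware means a separate colo contract, cage, and ops presence per location, which is the asymmetry the comment above is pointing at.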
Also: Most downtime is caused by bad code deployment, poorly-conceived network or system configuration changes, and sysadmins with fat fingers. Do you really think your hired talent is going to be better than Amazon's hired talent?
Sorry to be rude, but that comment is a bit incendiary. That said, given stats for the past 12 months, I can show that availability for the two racks I manage is in fact higher than AWS's. Thanks for the compliment!
I don't consider a few hours per month administering servers to be much more than the time I would spend working around Heroku's proprietary app model for the more esoteric things I'd want to do. I would guess it's less, but I don't really know. Throw saving thousands a month into the mix and it's not something I'd lose sleep over. Database servers that can handle tens of thousands of IOPS and web servers that can handle thousands of uncached requests per second make machine administration a lot more pleasant than it was just five years ago. Hardware has been scaling vertically quickly enough that it's no longer strictly necessary to scale horizontally in a massive way as you grow.
Honestly, it's not that hard to set up a server, and furthermore, it's just not that different to maintain a hard server than a virtual server.
And the most important part: when you have your own hardware, you at least maintain control over every aspect of your systems. The value of this cannot be overstated.
it's just not that different to maintain a hard server than a virtual server
I'm guessing you haven't used Heroku. Server setup is "git push heroku master".
you at least maintain control over every aspect of your systems
There are still plenty of things you don't control: network feeds to your cage, continuous power, bugs and failures in the hardware you buy. You cannot provision new systems without either buying machines or keeping a hot standby, and somebody has to make a trip to the cage. If you're renting hardware by the month from a service, you will probably have faster turnaround (hours, not days), but once again you're giving up some control.
OR, buy a couple of monster 64GB+ RAM systems, add SSDs, and place in LA or Denver.
I have an ancient quad 900MHz Xeon with 6GB RAM (the customer does not want to migrate) that has an uptime of 1,600 days, and total network issues during that period amounted to a few hours (wonky power to a switch).
Cloud is too often comparable to "vapor" in terms of the claims of redundancy and availability.
This solution naively assumes a fixed resource need. Growing startups steadily provision additional resources.
They also try new things with large amounts of data. This requires scaling up to additional machines for hours or days at a time, and then scaling back some or most of them when they optimize new ideas and services for production.
When the new services are a hit with clients, traffic increases and the whole cycle starts again.
And you've noted the EXACT use case for the cloud. If you are a new app, or a hot start-up and get tons of sporadic traffic, the cloud is absolutely where you need to be. To do anything else is beyond foolish.
Now, what I think the original post is getting at is mid-to-large enterprise companies with stable but significant traffic. In cases like that you must do a cost-benefit calculation, because you risk a lot if you don't. The cloud might very well not be the right solution then, because costs could be 10x higher than the alternatives... so the answer, to my mind, is not always clear.
I believe FriendFeed ran off of a single machine for most of its life. A few years down the line, I think most people who have only used EC2/Heroku would be shocked at how much traffic a single recent 8+ core Xeon machine with 32+ GB of RAM and RAIDed SSDs can handle, and at the price it does it for. Short of that, even a mid-tier VPS is probably a better option than EC2 for most.
Sure, EC2 is probably best for a startup that expects to double every week from a nontrivial starting point and has large machine resource needs per user (viral video startups, for example). The vast, vast majority of startups won't have anything that resembles that kind of growth graph, though, and thus shouldn't blindly follow what the Pinterests of the world do. It's a completely different type of demand. If they find out that they actually are going to have double digit daily organic growth percentages, then they can switch to EC2 before it gets out of hand, but otherwise, it's premature optimization.
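To make the "double every week" threshold concrete, here's a quick compounding sketch (user counts are illustrative):

```python
def weeks_to_reach(start, target, weekly_growth=2.0):
    """Weeks of compounding growth until `start` users exceed `target`."""
    weeks, users = 0, start
    while users < target:
        users *= weekly_growth
        weeks += 1
    return weeks

# Doubling weekly, 10k users pass 10M in about 10 weeks -- the
# kind of curve where elastic capacity genuinely matters. At 10%
# weekly growth the same climb takes over a year, which fixed
# hardware can absorb with ordinary upgrade cycles.
```

The gap between those two curves is the difference between "must have elastic capacity" and "premature optimization" in the comment above.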
Once again, proof that "the cloud" isn't always the best solution. I am amazed that a cloud provider like Amazon can still suffer outages, considering the size of their infrastructure and how decoupled it is supposed to be - obviously not decoupled enough. Perhaps it's my lack of understanding of cloud hosting, but when issues like this present themselves, it shows that cloud hosting still has a long way to go.
Agreed. 99.97% uptime for May is excellent. I've worked with a lot of enterprise systems (to include military!) and we would love to have that high % of uptime.
My cheap shared hosting provider has 99.98% this month, and they give you a free month if it goes below 99.8%, which has only happened once in two years.
Yes, but it doesn't compare when it comes to price either, considering a single dyno costs $36/month while I'm paying $44/year for the whole plan.
(And that comes with excellent support - for example, I asked if they were planning to offer Python and they said, "Sure, just give us a couple of days to set up a machine with Python for you," even though I was only on the cheapest plan.)
I know they don't serve the same market, but I find it strange that a service that costs an order of magnitude more doesn't have a better uptime than cheap shared hosting.
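For scale, the uptime percentages quoted in this thread translate into concrete minutes per month. A quick sketch (using a 31-day month):

```python
def downtime_minutes(uptime_pct, days=31):
    """Minutes of downtime implied by an uptime percentage over `days` days."""
    total_minutes = days * 24 * 60  # 44,640 for a 31-day month
    return total_minutes * (1 - uptime_pct / 100)

# 99.97% over 31 days allows roughly 13.4 minutes of downtime;
# 99.98% roughly 8.9 minutes; the 99.8% refund threshold
# mentioned above is roughly 89.3 minutes.
```

Note the gap between "three nines-ish" and a multi-hour outage: a single two-hour incident blows through a month of 99.8% by itself.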
Only if you count small outages which affected a fraction of the customers as affecting the entire service. I had 100% uptime during that period using only the multi-AZ redundancy.
I'm not a big EC2 fan, but there is not much "black" about network volumes. Also, a few hundred thousand customers seem to disagree with the "doesn't work" part.
I'm pretty sure nothing qualifies as "always the best" solution. "The cloud" can be an imperfect solution yet still be the best solution for certain apps.
I've only been using Heroku for the last month or so, while working with a client who manages it themselves. I'm surprised at how much downtime there's been. Is this typical, or is it just an unlucky stretch?