Wouldn't that only affect a small subset of visitors? For example, why would I be seeing any issues if I were hitting an Asia-Pacific volume instead of a US-EAST-1 one? It seems like it goes deeper than that.
One problem we've seen before: when a large percentage of AWS infrastructure goes down, customers don't just quietly suffer. They scramble to launch replacement infrastructure in other zones or regions, which creates a cascading series of load spikes throughout the AWS system.
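That cascading-spike failure mode is exactly why retry logic typically uses capped exponential backoff with jitter, so that thousands of clients don't re-launch in lockstep. A minimal sketch (the function is illustrative, not part of any AWS SDK):

```python
import random

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Capped exponential backoff with full jitter.

    Spreading retries out randomly keeps a synchronized
    "thundering herd" of re-launch attempts from turning one
    zone's outage into a region-wide load spike.
    """
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
        delays.append(random.uniform(0, ceiling))  # full jitter
    return delays
```

Each client sleeps for the sampled delay before its next attempt, so the retry load arrives smeared over time instead of all at once.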
AWS is a fascinating science experiment. Pity about the websites, though.
It's now yellow with this: "9:27 PM PDT We continue to investigate this issue. We can confirm that there is both impact to volumes and instances in a single AZ in US-EAST-1 Region. We are also experiencing increased error rates and latencies on the EC2 APIs in the US-EAST-1 Region."
AWS has been historically bad at reporting the severity of their outages promptly.
Is Heroku still in beta? The name has been around for a while; I would have thought the platform would have stabilized by now and that complete outages would be very unlikely.
In the past 30 days they have had two outages which lasted more than two hours each. That's a lot of downtime.
Any customer of Heroku who didn't weigh the cost/benefit of cloud hosting is a fool and deserves to watch their app have downtime.
Cloud hosting is fantastic, but it's a trade-off. There are so many layers of abstraction between you and the hardware that you are completely at the mercy of one, two, or (more!) technical organizations, each with its own support systems and its own degree of opacity around its infrastructure.
The fact that Amazon went down IS VERY interesting for a customer of Heroku.
And if it isn't, then that customer is a fool for outsourcing so much of their system without even understanding the risks involved.
I don't know; I think Heroku should offer the value proposition that: hey, we have it covered - if we rely on another cloud, it's on us to build in redundancy and reliability on top of it, so as not to burden you, the app developer who is paying us to take care of operations and scaling.
What would interest me about an Amazon outage being behind a Heroku outage is keeping a tally: if Heroku didn't manage to build in enough reliability to be resilient even to an Amazon outage in a particular region, I'd question whether they were a good fit.
Dear Heroku -- I know it's my job to make sure my site is available (/thread). However, I think I speak for most enterprise customers when I say I will throw money at your company the second you come up with a multi-zone/highly-available offering.
It's an open offer. Do they support python?
edit: they do, but it's not clear how to easily deploy a multi-zone application. Could you point me towards some docs?
Right now any single app can be deployed to any one of a number of infrastructures; AppFog is also working on the ability to run one app on multiple infrastructures simultaneously.
appfog.com has outages as well - http://blog.appfog.com/october-27th-downtime-postmortem/ - when you go with a public cloud, your app's well-being is always in a third party's hands. If you want to mitigate that risk, host your own private PaaS on your own infrastructure; then you can only point the finger at yourself when outages occur.
Please take this outage as proof that you need to build out your own infrastructure and hire your own operations team in multiple geographic locations.
In the meantime, we will continue to focus on building new features and products that our customers love on our EC2-, Heroku-, and cloud-based system.
Nah, I'm going to pick dedicated hardware with SSDs for IO consistency that beats the pants off of AWS for a fraction of the price and not much more time commitment, rather than for the potentially better uptime.
If reliability is that critical, you need multiple data centers. This is far easier to implement with EC2 than by building out hardware.
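To illustrate the point about multiple data centers being easier with EC2: with an SDK it's a short loop over regions. A hedged sketch using the boto3 SDK (the region list and AMI IDs are placeholders - AMIs are region-specific, so each region needs its own image ID):

```python
# Hypothetical per-region AMI map; the IDs below are placeholders,
# not real images.
REGION_AMIS = {
    "us-east-1": "ami-xxxxxxxx",
    "us-west-2": "ami-yyyyyyyy",
}

def launch_specs(instance_type="m1.small"):
    """Build one run_instances parameter set per region."""
    return {
        region: {
            "ImageId": ami,
            "InstanceType": instance_type,
            "MinCount": 1,
            "MaxCount": 1,
        }
        for region, ami in REGION_AMIS.items()
    }

def launch_everywhere():
    """Launch one instance in every configured region."""
    import boto3  # deferred so the spec-building part stays dependency-free
    for region, params in launch_specs().items():
        ec2 = boto3.client("ec2", region_name=region)
        ec2.run_instances(**params)  # one small fleet per region
```

The same pattern with owned hardware means a separate colo contract, cage, and ops presence per location, which is the asymmetry the comment above is pointing at.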
Also: Most downtime is caused by bad code deployment, poorly-conceived network or system configuration changes, and sysadmins with fat fingers. Do you really think your hired talent is going to be better than Amazon's hired talent?
Sorry to be rude, but that comment is a bit incendiary. That said, given stats for the past 12 months, I can show that availability for the two racks I manage is in fact higher than AWS's. Thanks for the compliment!
I don't consider a few hours per month administering servers to be much more than the time I would spend working around Heroku's proprietary app model for the more esoteric things I'd want to do. I would guess it's less, but I don't really know. Throw saving thousands a month into the mix and it's not something I'd lose sleep over. Database servers that can handle tens of thousands of IOPS and web servers that can handle thousands of uncached requests per second make machine administration a lot more pleasant than it was just five years ago. Hardware has been scaling vertically quickly enough that it's no longer strictly necessary to scale horizontally in a massive way as you grow.
Honestly, it's not that hard to set up a server, and furthermore, it's just not that different to maintain a hard server than a virtual server.
And the most important part: when you have your own hardware, you at least maintain control over every aspect of your systems. The value of this cannot be overstated.
it's just not that different to maintain a hard server than a virtual server
I'm guessing you haven't used Heroku. Server setup is "git push heroku master".
you at least maintain control over every aspect of your systems
There are still plenty of things you don't control: network feeds to your cage, continuous power, bugs and failures in the hardware you buy. You cannot provision new systems without either buying machines or keeping a hot standby, and somebody has to make a trip to the cage. If you're renting hardware by the month from a service, you will probably have faster turnaround (hours, not days), but once again you're giving up some control.
OR, buy a couple of monster 64GB+ RAM systems, add SSDs, and place in LA or Denver.
I have an ancient quad 900MHz Xeon with 6GB RAM (the customer does not want to migrate) that has an uptime of 1,600 days, and total network issues during that period amounted to a few hours (wonky power to a switch).
Cloud is too often comparable to "vapor" in terms of the claims of redundancy and availability.
This solution naively assumes a fixed resource need. Growing startups steadily provision additional resources.
They also try new things with large amounts of data. This requires scaling up to additional machines for hours or days at a time, and then scaling back some or most of them when they optimize new ideas and services for production.
When the new services are a hit with clients, traffic increases and the whole cycle starts again.
And you've noted the EXACT use case for the cloud. If you are a new app, or a hot start-up and get tons of sporadic traffic, the cloud is absolutely where you need to be. To do anything else is beyond foolish.
Now, what I think the original post is getting at is mid-to-large enterprise companies with stable but significant traffic. In cases like that you must do a cost-benefit calculation, because you risk a lot if you don't. The cloud might very well not be the right solution then, because costs could be 10x higher than the alternatives... so the answer, to my mind, is not always clear.
I believe FriendFeed ran off of a single machine for most of its life. A few years down the line, I think most people who have only used EC2/Heroku would be shocked at how much traffic a single recent 8+ core Xeon machine with 32+ GB of RAM and RAIDed SSDs can handle, and at the price it does it for. Short of that, even a mid-tier VPS is probably a better option than EC2 for most.
Sure, EC2 is probably best for a startup that expects to double every week from a nontrivial starting point and has large machine resource needs per user (viral video startups, for example). The vast, vast majority of startups won't have anything that resembles that kind of growth graph, though, and thus shouldn't blindly follow what the Pinterests of the world do. It's a completely different type of demand. If they find out that they actually are going to have double digit daily organic growth percentages, then they can switch to EC2 before it gets out of hand, but otherwise, it's premature optimization.
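To make the "double every week" threshold concrete, here's a quick compounding sketch (user counts are illustrative):

```python
def weeks_to_reach(start, target, weekly_growth=2.0):
    """Weeks of compounding growth until `start` users exceed `target`."""
    weeks, users = 0, start
    while users < target:
        users *= weekly_growth
        weeks += 1
    return weeks

# Doubling weekly, 10k users pass 10M in about 10 weeks -- the
# kind of curve where elastic capacity genuinely matters. At 10%
# weekly growth the same climb takes over a year, which fixed
# hardware can absorb with ordinary upgrade cycles.
```

The gap between those two curves is the difference between "must have elastic capacity" and "premature optimization" in the comment above.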
Once again, proof that "the cloud" isn't always the best solution. I am amazed that a cloud provider like Amazon can still suffer outages, considering the size of their infrastructure and how decoupled it is supposed to be - obviously not decoupled enough. Perhaps it's my lack of understanding of cloud hosting, but when issues like this present themselves, it shows that cloud hosting still has a long way to go.
Agreed. 99.97% uptime for May is excellent. I've worked with a lot of enterprise systems (to include military!) and we would love to have that high % of uptime.
My cheap shared hosting provider has 99.98% this month, and they give you a free month if it goes below 99.8%, which has only happened once in two years.
Yes, but it doesn't compare when it comes to price either, considering a single dyno costs $36/month while I'm paying $44/year for the whole plan.
(And that comes with excellent support - for example, I asked if they were planning to offer Python and they said, "Sure, just give us a couple of days to set up a machine with Python for you," even though I was only on the cheapest plan.)
I know they don't serve the same market, but I find it strange that a service that costs an order of magnitude more doesn't have a better uptime than cheap shared hosting.
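For scale, the uptime percentages quoted in this thread translate into concrete minutes per month. A quick sketch (using a 31-day month):

```python
def downtime_minutes(uptime_pct, days=31):
    """Minutes of downtime implied by an uptime percentage over `days` days."""
    total_minutes = days * 24 * 60  # 44,640 for a 31-day month
    return total_minutes * (1 - uptime_pct / 100)

# 99.97% over 31 days allows roughly 13.4 minutes of downtime;
# 99.98% roughly 8.9 minutes; the 99.8% refund threshold
# mentioned above is roughly 89.3 minutes.
```

Note the gap between "three nines-ish" and a multi-hour outage: a single two-hour incident blows through a month of 99.8% by itself.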
Only if you count small outages which affected a fraction of the customers as affecting the entire service. I had 100% uptime during that period using only the multi-AZ redundancy.
I'm not a big EC2 fan, but there is not much "black" about network volumes. Also, a few hundred thousand customers seem to disagree with the "doesn't work" part.
I'm pretty sure nothing qualifies as "always the best" solution. "The cloud" can be an imperfect solution yet still be the best solution for certain apps.
I've only been using Heroku for the last month or so, while working with a client who manages it themselves. I'm surprised at how much downtime there's been. Is this typical, or is it just an unlucky stretch?