EDIT: Various non-Heroku EC2-East-based sites (e.g. Quora) seem to be down as well, lending more evidence to this being an EC2/EBS outage.
Also, Amazon Relational Database Service (N. Virginia) is unavailable.
Seems like it's snowballing.
That IP isn't in the EC2 IP ranges (https://forums.aws.amazon.com/ann.jspa?annID=1528), so at the very least it's not hosted on EC2.
It couldn't come soon enough for us, but also for Heroku, as AppFog seems to have its foundations built on a multi-zone/region/provider architecture.
How do you explain to your customers/users/etc that you were down and have absolutely no control of when you will be back online? How can you explain it to yourself?
"Running a web server is very expensive. After we've built the site, if we want to keep it running, someone needs to be on-call 24 hours a day. That means at least one full-time staff member who does nothing else-- more if we want them to stay sane.
"To save us and you some money, we've contracted maintenance of our servers out to a third-party service. This is great for us, since they run it more reliably than we could, and it's great for you, because it costs a lot less. But the downside is that things still break sometimes, and when they do, it's completely out of our hands. We're left waiting for things to get fixed just like you are.
"So we understand your frustration; we're frustrated too. But unfortunately, downtimes do happen. Guaranteeing our service 100% of the time would cost hundreds of thousands of extra dollars per year, and for most of our users, that's simply not worth the cost. Our provider guarantees 99.[nines]% reliability for much less money, and this is the 0.01%.
"If you have something that absolutely must get done, shoot us an email right now and we'll take care of it for you as soon as the site is active again.
"Although this is technically out of our hands, we aren't trying to shift the blame; we made this decision with open eyes, and we stand by that decision. Again, we sincerely apologize for the temporary inconvenience. We hope we can make it up to you with some new features we'll be rolling out this month :)"
It doesn't have to be that hard.
It's not like you can just wake up one day and say "I'm gonna go build a fully fault tolerant distributed system that works across multiple data centers!" and then you're done by the time you go to sleep.
Go actually talk to some Netflix engineers. They'll tell you the same thing.
That makes it sound like Netflix has a more reliable platform than the PaaS company.
Though they're not 100% degradation proof today either: http://i.imgur.com/MJfqj.png
Even if you aren't a cat picture site, many startups can deal with some downtime occasionally and it's the right tradeoff to make.
Which is ridiculous; for most services, nines are quoted over years as standard. Of course, if Heroku did that they wouldn't look so good.
[edit: I realise they could be hiding worse months, but I don't think that's what the post I am replying to meant. Perhaps I am reading it wrong.]
The measurement in question is supposed to be about consistency.
31 versus 28 days.
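To make the month-length point concrete: the same outage produces a different uptime percentage depending on how many days the month has. A quick sketch (the 43-minute outage here is a made-up example):

```python
def uptime_pct(downtime_min: float, days: int) -> float:
    """Uptime percentage for a month, given total downtime in minutes."""
    total_min = days * 24 * 60
    return 100 * (1 - downtime_min / total_min)

# The same 43-minute outage:
print(uptime_pct(43, 28))  # a 28-day month: just under three nines
print(uptime_pct(43, 31))  # a 31-day month: just over three nines
```

So an identical incident can sit on either side of the 99.9% line purely depending on which month it lands in.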
We're already looking into alternatives -- perhaps not leaving Heroku altogether, but certainly not depending on them 100 percent. There's no way that we can entrust the business to something that can just catastrophically fail at any moment. I've been running my own servers for years, and they've never had such unpredictable issues.
I increasingly have to think that a few servers, on different providers, with the application deployed via Capistrano, will be more fault-tolerant than Heroku. At least, it seems that way right now.
Anything, including service providers, can catastrophically fail at any moment. Fault-tolerant architectures are based on redundancy (including infrastructure provider redundancy, as you mention), not on "guaranteed" SLAs.
I also feel like I've let my admin skills deteriorate because I've been dependent on Heroku. Back when I was running everything myself, in the worst case I could set up a new VPS from a backup in another datacenter. Now if Heroku goes out I just have to twiddle my thumbs while I wait for updates.
Eh, I'll probably continue converting to Heroku and not look back.
So they have their infrastructure (network, power) working very well.
The downtime we've had was due to our own unique hardware/software issues that come with a complex bare-metal installation.
...would be cool if there were a Linux package (or distro) that you could boot up and then just point your git remote at, and have your app up and running on your own hardware.
BTW, while the point is to enable a private PaaS, you won't get around the issues that hit sites like this without heeding all the warnings and recommendations about building in redundancy for high availability.
This was noted well in this post: http://www.newvem.com/blog/main/2012/06/aws-cloud-best-pract...
"""It is a lot cheaper to add 1% uptime to a 95% SLA than it is to add 0.09% to a 99.9% SLA. Cloud application vendors (SaaS) need to pay very close attention to the additional resources that are invested in order to support a 99.9XX…% uptime SLA, and perhaps build it into their pricing plans."""
"Heroku takes full responsibility for your app's health, keeping it up and running through thick and thin..."
*edit: Came up 12:42 AM ET.
You know there are OTHER data centers, right?
edit: I meant literally the EngineYard website at that address. Some EngineYard websites were up and some were down, no doubt based on region.
My love hate relationship with Heroku continues...
Various software used to hardcode 1a, so 1a received disproportionate load. Now, everyone's a-e is randomized among the "true" a-e, meaning that even if everyone hardcodes 1a, the load will still be evenly distributed.
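The idea behind that randomization can be sketched as a deterministic per-account shuffle of zone labels onto physical zones. This is only an illustration of the concept, not AWS's actual mechanism; the physical zone names and the seeding scheme here are invented:

```python
import random

# Hypothetical physical zones and the customer-facing labels (invented names).
PHYSICAL_ZONES = ["pz-1", "pz-2", "pz-3", "pz-4", "pz-5"]
LABELS = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d", "us-east-1e"]

def zone_mapping(account_id: str) -> dict:
    """Stable per-account mapping of zone labels to physical zones."""
    rng = random.Random(account_id)  # seed by account so the mapping never changes
    shuffled = PHYSICAL_ZONES[:]
    rng.shuffle(shuffled)
    return dict(zip(LABELS, shuffled))

# Two accounts that both hardcode "us-east-1a" will usually land on
# different physical zones, so the aggregate load is spread out.
print(zone_mapping("acct-1")["us-east-1a"])
print(zone_mapping("acct-2")["us-east-1a"])
```

Because the shuffle is seeded per account, each customer sees a consistent set of zones, while across all customers the labels are evenly scattered over the real hardware.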
That said, an outage is still an outage.
It seems this can be changed, though, fortunately for Heroku.