
Heroku learns the hard way from Amazon EC2 outage - blasdel
http://searchcloudcomputing.techtarget.com/news/article/0,289142,sid201_gci1378426,00.html
======
antirez
> A 15-person start up like Heroku could never support its thousands of users
> for a measly few million in venture capital with traditional hosting

I wonder if this is true. If they run on 22 virtualized instances, maybe 10
good servers could provide more or less the same performance. Setting aside
for a moment whether EC2 is the way to go, and all its benefits, I can't see
how a company with a few million can't afford to run 10 big Linux boxes
instead of using EC2.
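The comparison above is back-of-envelope arithmetic; a sketch with entirely hypothetical prices (neither figure below is an actual 2009 rate, they are placeholders for illustration):

```python
# Rough cost comparison: virtualized instances vs. dedicated servers.
# All prices are hypothetical placeholders, not real EC2 or hosting rates.
EC2_INSTANCES = 22
EC2_HOURLY_RATE = 1.00           # assumed $/hour per double-XL instance
HOURS_PER_MONTH = 730

DEDICATED_SERVERS = 10
DEDICATED_MONTHLY_RATE = 400.00  # assumed $/month per big Linux box

ec2_monthly = EC2_INSTANCES * EC2_HOURLY_RATE * HOURS_PER_MONTH
dedicated_monthly = DEDICATED_SERVERS * DEDICATED_MONTHLY_RATE

print(f"EC2:       ${ec2_monthly:,.2f}/month")
print(f"Dedicated: ${dedicated_monthly:,.2f}/month")
```

Under these assumed numbers the dedicated boxes come out far cheaper, which is the commenter's point; the real trade-off, as the rest of the thread argues, is elasticity and operations rather than raw price.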

~~~
moe
Something seems to be off here. Their frontpage claims 45k running
applications. There's no way to do that with only 22 instances. One of these
figures must be wrong.

~~~
jamesheroku
This article contains quite a bit of inaccurate information.

Despite their size, the 22 double-XL instances were a small portion of our
overall footprint in EC2; it takes well over that kind of capacity to run the
platform.

Those instance types all happened to be in one availability zone for a variety
of reasons. Our platform overall does not live in a single zone.

Losing machines is not a problem for us (we cycle them constantly, in fact).
Normally losing even that many machines would not even be noticed by our
customers; this was an unusual case in which several factors cascaded into a
larger problem.

To be clear, this downtime (45 minutes or so, with full normal state by 90
minutes) was, unfortunately, our fault - not Amazon's. EC2 instances
vaporizing is an expected part of using the service.

We've made a couple of operational changes that will prevent these issues in
the future, and we sincerely apologize to any customers who were affected.

------
houseabsolute
I'm truly shocked that there was not one person in their crew with the
capacity to realize that the very existence of an "availability zone"
indicates that you should spread your resources across several of them. How
many other "lessons" are waiting to pounce on them? They really need to
cultivate their sense of paranoia if they plan to deliver consistent value to
their clients.

~~~
jgilliam
My guess is that the reason Heroku hasn't done it yet is latency between
availability zones. Ping times can be 6x higher across availability zones
than within one:
http://orensol.com/2009/05/24/network-latency-inside-and-across-amazon-ec2-availability-zones/

More and more cloud services run on AWS deliberately, so that latency between
them is very low. I can host a SOLR index on one service and a MongoDB on
another, and still get reasonable performance.

~~~
houseabsolute
Yeah, it's expected that you'll get microsecond pings within the same
building and millisecond pings to buildings across the country. I guess I can
admit that they may have had their reasons, but they can't be good enough to
justify this lapse. If it's as inexpensive to run this setup as the article
says, they should be able to afford two completely separate stacks of servers,
one in each availability zone. It's not going to be easy to figure out how to
sync the data across each one, but the fact remains that you have to run at
least n+1 if you're going to deliver reliability.
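The n+1 argument is simple independence arithmetic: if zone failures really were independent (a strong assumption), the combined availability of redundant stacks would be

```python
# Availability of n independent replicas, each with availability `a`:
# the system is down only if every replica is down at the same time.
def combined_availability(a: float, n: int) -> float:
    return 1.0 - (1.0 - a) ** n

single = 0.99  # assumed availability of one zone's stack (hypothetical)
print(f"one stack:  {combined_availability(single, 1):.4%} uptime")
print(f"two stacks: {combined_availability(single, 2):.4%} uptime")
```

With these assumed numbers, a second independent stack turns roughly 3.7 days of downtime a year into under an hour; the catch, as noted further down the thread, is that failures are rarely fully independent.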

------
illumen
"A server failing was normal, he said, but it was unheard of for a whole class
of resources to suddenly vanish. "

Whole data centers go out all the time, for lots of different reasons. Using
multiple data centers from different providers is the only real solution.

~~~
mbrubeck
Agreed. Also: Hardware failures are (close to) statistically independent
across separate instances. Software bugs are not. And when you run on a
virtualized platform with a complex infrastructure, everything acts like
software.
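This distinction can be made concrete: independent hardware failures multiply away as you add replicas, but a shared software bug puts a floor under the outage probability no matter how many replicas you run. A minimal sketch with assumed probabilities:

```python
# Independent hardware failure vs. correlated software failure.
# Both probabilities below are assumed values for illustration.
p_hw = 0.01    # chance any one instance fails, independently of the others
p_bug = 0.001  # chance a shared software bug takes down *all* instances at once
n = 22

p_all_hw_down = p_hw ** n  # shrinks geometrically with replica count
p_correlated_outage = p_bug  # unchanged no matter how many replicas exist

print(f"all {n} fail independently: {p_all_hw_down:.3e}")
print(f"shared-bug outage:          {p_correlated_outage:.3e}")
```

Past a handful of replicas, the correlated term dominates completely, which is why a complex shared infrastructure layer makes "everything act like software."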

------
wastedbrains
We had just deployed a new version of our app to Heroku when this happened. I
thought our release was incredibly buggy and constantly crashing until I
noticed Heroku's status update. They came back quickly, and for us this is a
known risk of having our servers in the cloud.

