

Lessons Netflix Learned from the AWS Outage - ravstr
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

======
mattew
"Currently, Netflix uses a service called "Chaos Monkey" to simulate service
failure. Basically, Chaos Monkey is a service that kills other services. We
run this service because we want engineering teams to be used to a constant
level of failure in the cloud. Services should automatically recover without
any manual intervention. We don't, however, simulate what happens when an
entire AZ goes down and therefore we haven't engineered our systems to
automatically deal with those sorts of failures. Internally we are having
discussions about doing that and people are already starting to call this
service "Chaos Gorilla"."

I am wondering how they could simulate the loss of an AZ. Any ideas?

~~~
RyanKearney
Perhaps they have groups set up in their "Chaos Monkey" tool? Like a sort of
"take down ALL services in GROUP B" type of command?
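
Something like that seems plausible. As a rough sketch (not Netflix's
actual tooling), here's what an AZ-wide "Chaos Gorilla" run could look like
in Python with boto; the function name, filters, and dry_run flag are all
my own invention:

    import boto

    # Hypothetical "Chaos Gorilla": terminate every running instance in
    # one availability zone and watch whether the rest of the fleet
    # recovers on its own.
    def kill_availability_zone(zone, dry_run=True):
        conn = boto.connect_ec2()  # credentials come from the boto config
        reservations = conn.get_all_instances(
            filters={'availability-zone': zone,
                     'instance-state-name': 'running'})
        ids = [i.id for r in reservations for i in r.instances]
        print 'would terminate %d instances in %s' % (len(ids), zone)
        if ids and not dry_run:
            conn.terminate_instances(instance_ids=ids)

    kill_availability_zone('us-east-1a')  # defaults to a dry run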

------
woodrow
Interesting that they're not using EBS to provide durable storage for
Cassandra, but instead using S3 (along with S3-backed AMIs, i.e. instance-
store or "ephemeral" storage). I wonder if that means they're batching up
their database logs and writing them to S3, plus running enough instances
across AZs that it's generally okay to keep everything in memory even when
an instance or two fails.

Anyone have any experience running a NoSQL datastore in this fashion?

~~~
ddlatham
They're probably using the local ephemeral drives for Cassandra storage rather
than S3. I'm guessing they're then moving snapshots into S3 or elsewhere.

http://www.mail-archive.com/user@cassandra.apache.org/msg11022.html
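
For what it's worth, a rough sketch of what a snapshot-shipping job along
those lines might look like with boto. The data directory, bucket name, and
key layout here are assumptions for illustration, not Netflix's actual
setup:

    import os
    import subprocess
    import boto

    DATA_DIR = '/var/lib/cassandra/data'  # default Cassandra data path
    BUCKET = 'example-cassandra-backups'  # hypothetical bucket name

    # Ask Cassandra for an on-disk snapshot (hard links to the live
    # SSTables, so it's cheap and doesn't block writes).
    subprocess.check_call(['nodetool', '-h', 'localhost', 'snapshot'])

    # Walk the snapshot directories and ship each file to S3.
    bucket = boto.connect_s3().get_bucket(BUCKET)
    for dirpath, dirnames, filenames in os.walk(DATA_DIR):
        if 'snapshots' not in dirpath:
            continue
        for name in filenames:
            path = os.path.join(dirpath, name)
            key = bucket.new_key(os.path.relpath(path, DATA_DIR))
            key.set_contents_from_filename(path)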

~~~
dmuino
That's exactly what we do.

------
radioactive21
I really love articles like this from companies summarizing a failure or
disruption of service.

It's like a lessons learned. I hope more companies do this.

~~~
haribilalic
It _is_ a lessons learnt.

~~~
radioactive21
Agree. I was using "like" as in "like totally." I realized it after I
posted but didn't have time to go back and correct it.

------
huntero
In the discussions after the AWS outage, a lot of people seemed to assume
that Netflix was able to stay up because they had the $$$$ to spread their
service across multiple regions, not just AZs.

It looks like that wasn't the case: they stayed in one region but avoided
EBS like the plague (among other things).

------
SriniK
Yet another great post. Seems like they rely on SimpleDB a lot.

Scaling the webserver-LB-appserver stack up and down is the easy part;
managing the database (SQL or NoSQL) is the juggling act. It's great that
NFLX avoided hosting the database themselves by adopting SimpleDB.
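
For anyone who hasn't touched SimpleDB, the programming model is pretty
minimal: schemaless domains of items with string attributes. A toy example
with boto (the domain, item, and attribute names are invented):

    import boto

    sdb = boto.connect_sdb()
    domain = sdb.create_domain('movies')  # hypothetical domain name

    # Items are just named bags of string attributes -- no schema and
    # no database servers to manage, just API calls.
    domain.put_attributes('tt0133093', {'title': 'The Matrix',
                                        'year': '1999'})
    print domain.get_attributes('tt0133093')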

One thing the other players that went down during the AWS outage had in
common is that they all ran their own datastores:

quora - MySQL
4sq - MongoDB
reddit - PostgreSQL

------
g123g
In a way this outage could turn out to be financially positive for Amazon,
as more and more customers will start using multiple regions instead of
just one. On top of the extra instances customers will need to bring up,
they will be paying inter-region data transfer costs. That should
compensate for the customers Amazon loses because of this outage and for
the new customers who choose some other cloud instead.

