"Currently, Netflix uses a service called 'Chaos Monkey' to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't, however, simulate what happens when an entire AZ goes down, and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that, and people are already starting to call this service 'Chaos Gorilla'."
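The core idea fits in a few lines. This is just an illustrative sketch, not Netflix's actual implementation; the instance list and the `terminate` callback are hypothetical stand-ins for real AWS API calls.

```python
import random

def chaos_monkey(instances, terminate):
    """Pick one running instance at random and kill it.

    `instances` is a list of instance IDs; `terminate` is a callback
    (in practice, a wrapper around an AWS terminate call) applied to
    the chosen victim.
    """
    if not instances:
        return None
    victim = random.choice(instances)
    terminate(victim)
    return victim

# Example: record which instance got "killed".
killed = []
victim = chaos_monkey(["i-001", "i-002", "i-003"], killed.append)
```

Run continuously against production, this forces every team to build recovery in from day one instead of treating instance death as an exceptional event.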
I am wondering how they could simulate the loss of an AZ. Any ideas?
There are several ways to do it: kill all the instances in the zone; use a firewall to blackhole traffic to all the instances; or use traffic shaping to add latency or packet loss for all the instances.
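The simplest of those, terminating everything in one zone, could be sketched like this (the fleet layout and `terminate` callback are made up for illustration):

```python
def kill_az(instances_by_az, az, terminate):
    """Simulate losing an entire availability zone by terminating
    every instance tagged with that AZ, leaving other AZs untouched."""
    victims = instances_by_az.get(az, [])
    for instance_id in victims:
        terminate(instance_id)
    return victims

# Hypothetical fleet spread across two zones.
fleet = {
    "us-east-1a": ["i-0a1", "i-0a2"],
    "us-east-1b": ["i-0b1"],
}
killed = []
kill_az(fleet, "us-east-1a", killed.append)
```

The firewall and traffic-shaping variants are arguably better tests, since a real AZ failure often looks like unreachable-but-not-terminated instances rather than clean shutdowns.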
> I am wondering how they could simulate the loss of an AZ. Any ideas?
They could instrument whatever library they use to interact with AWS and make it report failures or fail to respond to "create new instance"-like commands.
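A fault-injecting wrapper around their AWS client could be as simple as this sketch; `RealClient`/`DummyClient` and the `run_instances` method name are placeholders for whatever library they actually use:

```python
class FaultInjectingClient:
    """Wraps a real AWS client and makes instance-creation calls fail,
    simulating an AZ that can no longer launch capacity.  All other
    calls pass through to the wrapped client untouched."""

    def __init__(self, real_client, fail_creates=True):
        self._real = real_client
        self._fail_creates = fail_creates

    def run_instances(self, *args, **kwargs):
        if self._fail_creates:
            raise RuntimeError("simulated AZ failure: capacity unavailable")
        return self._real.run_instances(*args, **kwargs)

    def __getattr__(self, name):
        # Delegate everything else to the real client.
        return getattr(self._real, name)

# Stand-in for the real library's client, for demonstration only.
class DummyClient:
    def run_instances(self):
        return "i-new"
    def describe_instances(self):
        return []

client = FaultInjectingClient(DummyClient())
try:
    client.run_instances()
    create_failed = False
except RuntimeError:
    create_failed = True
```

The nice part of this approach is that it exercises the application's own error-handling paths without touching any real infrastructure.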
Interesting that they're not using EBS to provide durable storage for Cassandra, but instead using S3 (along with S3-backed, or "ephemeral storage", AMIs). I wonder if that means they're batching up and writing their database logs to S3, plus running enough instances across AZs that it's generally okay to keep everything in memory even when an instance or two fails.
Anyone have any experience running a NoSQL datastore in this fashion?
They're probably using the local ephemeral drives for Cassandra storage rather than S3. I'm guessing they're then moving snapshots into S3 or elsewhere.
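If that's right, moving snapshots off-box is mostly a naming-and-upload loop over the snapshot's SSTable files. A sketch under assumptions (the bucket key layout and the `upload` callback are mine, not Netflix's actual scheme):

```python
def backup_snapshot(files, node_id, snapshot_name, upload):
    """Copy a Cassandra snapshot's files into S3-style keys.

    `upload(key, path)` would be a real S3 PUT in practice; here it is
    a callback so the key layout can be shown without AWS credentials.
    """
    keys = []
    for path in files:
        filename = path.rsplit("/", 1)[-1]
        key = f"backups/{node_id}/{snapshot_name}/{filename}"
        upload(key, path)
        keys.append(key)
    return keys

uploaded = []
keys = backup_snapshot(
    ["/var/lib/cassandra/data/ks/cf/snapshots/s1/cf-1-Data.db"],
    "node-1", "s1",
    lambda key, path: uploaded.append(key),
)
```

Since Cassandra SSTables are immutable once flushed, a snapshot is just hard links, so the upload can run without blocking writes.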
In the discussions after the AWS outage, a lot of people seemed to be assuming that Netflix was able to stay up because they had the $$$$ to spread their service across multiple regions, not just AZs.
It looks like that wasn't the case: they stayed in one region but avoided EBS like the plague (among other things).
Yet another great post. Seems like they rely on SimpleDB a lot.
Scaling the webserver-LB-appserver stack up and down is the easiest part; managing the database (SQL or NoSQL) is the real juggling act. It's great that Netflix avoided hosting the DB themselves by adopting SimpleDB.
One common thing among the other players that went down during the AWS outage:
quora - MySQL
4sq - MongoDB
reddit - PostgreSQL
In a way, this outage could turn out to be financially positive for Amazon, as more and more customers start using multiple regions instead of just one. On top of the extra instances customers will need to bring up, they will be paying inter-region data transfer costs. That could compensate for the customers Amazon loses over this outage and the new customers who pick some other cloud instead.
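As a back-of-envelope illustration of that extra spend (the prices here are assumed round numbers for the sketch, not Amazon's actual rates):

```python
# Hypothetical multi-region setup: 10 TB/month replicated between
# regions at an assumed $0.10/GB inter-region transfer rate, plus 50
# extra standby instances at an assumed $100/month each.
transfer_gb_per_month = 10 * 1024
assumed_rate_per_gb = 0.10
extra_instances = 50
assumed_cost_per_instance = 100.0

transfer_cost = transfer_gb_per_month * assumed_rate_per_gb
instance_cost = extra_instances * assumed_cost_per_instance
extra_monthly_revenue = transfer_cost + instance_cost
```

Even with made-up numbers, the point stands: most of the extra revenue comes from the standby capacity, not the transfer fees.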