I'm curious how certain services can survive Chaos Monkey. Memcached is one example; if you start destroying instances, you're going to stampede your persistent datastore to get that memcached replacement hot again.
The surviving part doesn't have to be in the server itself. Here's an idea for memcached "surviving":
- set up N available servers
- make clients store to all N servers whenever they calculate the value
- query M (where M<=N) servers before deciding you have to recalculate
- if the hit comes from a server other than the first (i.e. the 2nd through Mth), re-store the value everywhere (just pay attention to preserving the remaining TTL)
And you get a distributed, self-healing, Chaos Monkey-resistant memcached without any support on the server side.
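Rough sketch of what the client side could look like, assuming one pymemcache-style connection per server (the server list, M, and the recompute callback are all illustrative, not anything standard):

    from pymemcache.client.base import Client

    SERVERS = [("cache1", 11211), ("cache2", 11211), ("cache3", 11211)]  # the N servers
    clients = [Client(addr) for addr in SERVERS]
    M = 2  # how many servers to try before giving up and recalculating

    def store_everywhere(key, value, ttl):
        for c in clients:
            try:
                c.set(key, value, expire=ttl)
            except Exception:
                pass  # a dead replica just stays cold until it comes back

    def cached_get(key, recompute, ttl=300):
        for i, c in enumerate(clients[:M]):
            try:
                value = c.get(key)
            except Exception:
                continue  # treat a dead/unreachable server as a miss
            if value is not None:
                if i > 0:
                    # hit on the 2nd..Mth server: heal the replicas that missed
                    # (to truly preserve the remaining TTL you'd have to store
                    # the expiry next to the value; a fixed ttl keeps this short)
                    store_everywhere(key, value, ttl)
                return value
        # all M queried servers missed: recalculate and write to all N
        value = recompute()
        store_everywhere(key, value, ttl)
        return value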
Also, if you want to avoid a stampede, you could insert 1s-TTL placeholders meaning "back off, someone else is calculating" into keys you know are popular and may see contention. Just make sure you use CAS so you don't overwrite real data with the placeholder.
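A rough sketch of that guard, again pymemcache-style (the address and placeholder value are made up). I've used memcached's add instead of CAS here, since for a key that's currently empty add gives the same "don't clobber real data" guarantee:

    from pymemcache.client.base import Client

    client = Client(("cache1", 11211))   # illustrative address
    PLACEHOLDER = b"__computing__"       # illustrative sentinel value

    def get_popular(key, recompute, ttl=300):
        value = client.get(key)
        if value == PLACEHOLDER:
            return None  # someone else is calculating; back off or serve stale data
        if value is not None:
            return value
        # Miss: claim the key with a 1s placeholder. add only succeeds if the key
        # is still empty, so real data never gets overwritten by the placeholder.
        if not client.add(key, PLACEHOLDER, expire=1, noreply=False):
            return None  # lost the race; another client is already calculating
        value = recompute()
        client.set(key, value, expire=ttl)
        return value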
That's what I assumed. You lose capacity as you increase reliability, as you're going to need to redundantly store that data somewhere in your memcached cluster if you don't want to go back to the persistent layer. Thanks!
The memcache at Netflix is triple-replicated, so if one node goes away, the clients ask one of the other nodes that holds the same data in a different data center, and then repopulate the node in their own data center once a replacement has been launched.
(I worked on the system)
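In case it helps to picture it, here's a hand-wavy sketch of that read path (not the actual Netflix client; the data center names, addresses, and pymemcache-style clients are all illustrative):

    from pymemcache.client.base import Client

    LOCAL_DC = "dc-a"  # illustrative
    replicas = {
        "dc-a": Client(("cache-a", 11211)),
        "dc-b": Client(("cache-b", 11211)),
        "dc-c": Client(("cache-c", 11211)),
    }

    def dc_aware_get(key, ttl=300):
        local = replicas[LOCAL_DC]
        value = local.get(key)
        if value is not None:
            return value
        # Local copy is gone (e.g. the node was replaced): ask the other replicas,
        # then warm the local replacement node with whatever they return.
        for dc, client in replicas.items():
            if dc == LOCAL_DC:
                continue
            value = client.get(key)
            if value is not None:
                local.set(key, value, expire=ttl)
                return value
        return None  # all replicas missed; fall through to the persistent store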
There is also another project called Dynomite [0] that puts a gossip/Cassandra-like protocol in front of Redis.