I'm curious how certain services can survive Chaos Monkey. Memcached is one example; if you start destroying instances, you're going to stampede your persistent datastore to get that memcached replacement hot again.
The surviving part doesn't have to be in the server itself. Here's an idea for memcached "surviving":
- set up N available servers
- make clients store to all N servers whenever they calculate the value
- query M (where M<=N) servers before deciding you have to recalculate
- if the hit comes from a server other than the first (i.e. the 2nd through Mth), re-store the value everywhere (just pay attention to preserving the remaining TTL)
And you get a distributed, self-healing, Chaos Monkey-resistant memcached without any support on the server side.
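Rough sketch of what the client side could look like, assuming one pymemcache-style connection per server (the server list, M, and the recompute callback are all illustrative, not anything standard):

    from pymemcache.client.base import Client

    SERVERS = [("cache1", 11211), ("cache2", 11211), ("cache3", 11211)]  # the N servers
    clients = [Client(addr) for addr in SERVERS]
    M = 2  # how many servers to try before giving up and recalculating

    def store_everywhere(key, value, ttl):
        for c in clients:
            try:
                c.set(key, value, expire=ttl)
            except Exception:
                pass  # a dead replica just stays cold until it comes back

    def cached_get(key, recompute, ttl=300):
        for i, c in enumerate(clients[:M]):
            try:
                value = c.get(key)
            except Exception:
                continue  # treat a dead/unreachable server as a miss
            if value is not None:
                if i > 0:
                    # hit on the 2nd..Mth server: heal the replicas that missed
                    # (to truly preserve the remaining TTL you'd have to store
                    # the expiry next to the value; a fixed ttl keeps this short)
                    store_everywhere(key, value, ttl)
                return value
        # all M queried servers missed: recalculate and write to all N
        value = recompute()
        store_everywhere(key, value, ttl)
        return value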
Also, if you want to avoid a stampede, you could insert 1s-TTL placeholders meaning "back off, someone else is calculating" into keys you know are popular and may see contention. Just make sure you use CAS so you don't overwrite real data with the placeholder.
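A rough sketch of that guard, again pymemcache-style (the address and placeholder value are made up). I've used memcached's add instead of CAS here, since for a key that's currently empty add gives the same "don't clobber real data" guarantee:

    from pymemcache.client.base import Client

    client = Client(("cache1", 11211))   # illustrative address
    PLACEHOLDER = b"__computing__"       # illustrative sentinel value

    def get_popular(key, recompute, ttl=300):
        value = client.get(key)
        if value == PLACEHOLDER:
            return None  # someone else is calculating; back off or serve stale data
        if value is not None:
            return value
        # Miss: claim the key with a 1s placeholder. add only succeeds if the key
        # is still empty, so real data never gets overwritten by the placeholder.
        if not client.add(key, PLACEHOLDER, expire=1, noreply=False):
            return None  # lost the race; another client is already calculating
        value = recompute()
        client.set(key, value, expire=ttl)
        return value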
That's what I assumed. You lose capacity as you increase reliability, as you're going to need to redundantly store that data somewhere in your memcached cluster if you don't want to go back to the persistent layer. Thanks!
The memcache at Netflix is triple-replicated, so if one node goes away, the clients ask one of the other nodes that holds the same data in a different data center, and then repopulate the node in their own data center once a replacement has been launched.
(I worked on the system)
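In case it helps to picture it, here's a hand-wavy sketch of that read path (not the actual Netflix client; the data center names, addresses, and pymemcache-style clients are all illustrative):

    from pymemcache.client.base import Client

    LOCAL_DC = "dc-a"  # illustrative
    replicas = {
        "dc-a": Client(("cache-a", 11211)),
        "dc-b": Client(("cache-b", 11211)),
        "dc-c": Client(("cache-c", 11211)),
    }

    def dc_aware_get(key, ttl=300):
        local = replicas[LOCAL_DC]
        value = local.get(key)
        if value is not None:
            return value
        # Local copy is gone (e.g. the node was replaced): ask the other replicas,
        # then warm the local replacement node with whatever they return.
        for dc, client in replicas.items():
            if dc == LOCAL_DC:
                continue
            value = client.get(key)
            if value is not None:
                local.set(key, value, expire=ttl)
                return value
        return None  # all replicas missed; fall through to the persistent store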
There is also another project called Dynomite [0] that puts a gossip/Cassandra-like protocol in front of Redis.