Summary: assuming that your lookup patterns can be answered with a hash table, we can use a scale-out hash table (memcached) in lieu of a database index.
I am pessimistic. For one thing, the author did not address how he plans to keep the index fresh in the face of evictions or memcached node crashes. It gets really bad: if you did not find the data, is it absent, or has it been evicted? Can you create a new user named "FOO" right now? You won't know until you scan the whole data set, which is a "size of data" operation.
There is a case to be made for a resilient RAM-only distributed database, but memcached is not it.
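The eviction ambiguity described above is easy to demonstrate. The sketch below uses an invented `TinyCache` class as a stand-in for a memcached client (memcached's real eviction policy is more involved than plain LRU, and all names here are assumed for illustration): once a bounded cache evicts an entry, a miss for an evicted key looks exactly like a miss for a key that never existed.

```python
# Hypothetical illustration of the miss-ambiguity problem: a bounded
# cache evicts old entries, so a lookup miss cannot distinguish
# "absent" from "evicted". TinyCache is an invented stand-in, not a
# real memcached client.

from collections import OrderedDict

class TinyCache:
    """A memcached-like store with simple LRU eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently set

    def get(self, key):
        return self.data.get(key)  # None on miss, whatever the cause

cache = TinyCache(capacity=2)
cache.set("user:FOO", 1)
cache.set("user:BAR", 2)
cache.set("user:BAZ", 3)       # capacity exceeded: "user:FOO" is evicted

# Both lookups miss, and the index alone cannot tell you whether the
# name "FOO" is free to register or merely fell out of the cache.
print(cache.get("user:FOO"))   # evicted  -> None
print(cache.get("user:QUX"))   # never set -> None
```

This is why an index built only in a lossy cache cannot answer "does this key exist?" without falling back to a scan of the underlying data.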
My intent in using Memcache as the mechanism for this type of indexing was more an attempt to relate the subject to something that most developers are at least somewhat familiar with. I explicitly address the weaknesses of Memcache in the section titled "Weaknesses", and I recommend some work-arounds to lessen the effect of its limitations. In the "Wrap-Up" section, I even go so far as to say that Memcache is really only one example of a distributed hash table and that there are alternatives. Rolling your own is another option, and both of those solutions are probably better suited to the indexing problem than Memcache.
Again, I was afraid that the subject would be lost on most people if there wasn't at least some relation made between the concept and something that concretely exists. Try to look at it as an exercise in thinking "outside the box", using the best example I could think of.
Thanks for the comments, good and bad. These critiques really do help.
Stonebraker mentioned this idea on his "Database Column" blog about a year back. You might want to check what he's up to.
The idea has legs: in a distributed system you no longer need to commit to disk to guard against failures (assuming your power supplies are diversified), and if your data fits in RAM then it's certainly a good idea. RAM is getting cheaper along a Moore's-law curve, which means more and more tasks are coming within reach.
Yeah, using a cache to store data you need to always have is a terrible idea. I used to store session data in a volatile region like this (Cache::FastMmap, if you care), until one day I realized that sessions were being deleted long before they had expired. Oops, I was using a lossy store for data that must never be lost. Now I use BerkeleyDB for that, and make sure that expired sessions are manually cleaned up every so often.
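The lesson above, keep must-not-lose data in a durable store and handle expiry yourself, can be sketched in Python (this is an analogue of the commenter's Perl setup, not the actual code; the TTL value and session layout are assumptions). Python's standard-library `shelve` stands in for BerkeleyDB here: it never drops entries on its own, so stale sessions have to be swept explicitly.

```python
# A Python analogue (not the commenter's Perl/BerkeleyDB code) of moving
# sessions from a lossy cache to a durable store, with manual cleanup of
# expired entries. TTL and session fields are invented for illustration.

import os
import shelve
import tempfile
import time

SESSION_TTL = 3600  # one hour; an assumed expiry policy

path = os.path.join(tempfile.mkdtemp(), "sessions")

with shelve.open(path) as db:
    db["sess-1"] = {"user": "alice", "created": time.time()}
    db["sess-2"] = {"user": "bob", "created": time.time() - 2 * SESSION_TTL}

    # A durable store never evicts, so expired sessions must be swept
    # manually, e.g. from a periodic cron job.
    now = time.time()
    for sid in list(db.keys()):
        if now - db[sid]["created"] > SESSION_TTL:
            del db[sid]

    print(sorted(db.keys()))  # -> ['sess-1']: expired session swept, live one kept
```

The trade is explicit: the cache deleted data for you at unpredictable times; the durable store deletes nothing until you tell it to.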
No, they will magically reappear the next time memcached is restarted. The way the article describes the idea, the same thing will happen when there is a database error during user deletion: the user will be gone from the cache, but may reappear later. To make this scheme work, either both the database and memcached must be made to support two-phase commit, or the application must fall back to the database for the shard information when it isn't in the cache.
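The second fix mentioned above is the classic cache-aside pattern. A minimal sketch follows; the dicts standing in for memcached and the database, and all names (`shard_for`, `"shard-3"`, etc.), are assumptions for illustration:

```python
# Cache-aside sketch: treat the cache miss as "don't know", not "doesn't
# exist", and consult the database of record before answering. The two
# dicts are stand-ins for memcached and the durable shard map.

cache = {}                        # stand-in for memcached
database = {"alice": "shard-3"}   # stand-in for the authoritative shard map

def shard_for(user):
    shard = cache.get(user)
    if shard is None:              # miss: absent OR evicted, we can't tell
        shard = database.get(user) # the database gives the real answer
        if shard is not None:
            cache[user] = shard    # repopulate so later reads are cheap
    return shard

print(shard_for("alice"))    # first call falls through to the database
print(shard_for("alice"))    # second call is served from the cache
print(shard_for("mallory"))  # genuinely absent -> None
```

Note that this makes the cache a pure accelerator: correctness lives in the database, so an eviction or restart costs latency, never data.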
There is a case to be made for a resilient RAM-only distributed database, but memcached is not it.
Agreed. At the risk of sounding offensive, these kinds of ideas seem about as informed and rational as some of the arguments I hear from MySQL advocates about why their database is the best RDBMS on the planet.