Segcache: A memory-efficient and scalable key-value cache for small objects [pdf] (usenix.org)
87 points by luu on April 15, 2021 | 16 comments



I wonder if dormando, who sometimes comes around, would care to run memcached with the same traces used in this paper, which are available at https://github.com/twitter/cache-trace. I'm not sure I care about a cache that can scale to 24 cores, since in my experience I usually end up with hundreds of caches with a few cores each rather than fewer, bigger cache servers, but it would still be interesting to see what memcached can do.


Reads-wise I've done tests with up to 48 cores... It scales just about linearly. Writes don't scale much past a single core but really nobody asks for this so I haven't worked on it.

For fun I did a test benchmarking misses, getting 60 million rps on a single machine :) For my high-throughput tests the network overhead is so high that, to find the limit of the server, the benchmark client has to be run over localhost. Not terribly useful; most people's networks will peg well before the server software does. Especially true if your objects aren't tiny or if you batch requests at all.

I've yet to see anyone who really needs anything higher than a million RPS. The extra idle threads and general scalability help keep the latency really low, so they're still useful even if you aren't maxing out rps.

You can see tests here too: https://memcached.org/blog/persistent-memory/ - these folks might dismiss this testing as "not a cache trace", but I don't feel that's very productive.

Specifically to the cache traces though, that's just not how I test. I never get traces from users but still have to design software that /will/ typically work. Instead I test each subsystem to failure and ensure a non-pathological dropoff. For example: if you write so fast that the LRU would slow you down, the algorithm degrades the quality of the LRU instead of losing performance, which is fine since in most of these cases it's a bulk load, peak traffic period, load spike, etc.
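
To illustrate that degrade-instead-of-block idea, here's a minimal sketch (not memcached's actual code; the class and names are made up). Reads try the LRU lock and simply skip the recency bump if it's contended, so heavy write traffic costs a little LRU quality rather than read throughput:

    import java.util.LinkedHashMap;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantLock;

    // Sketch: reads bump recency only when the LRU lock is free, so the
    // cache degrades LRU quality gracefully under write pressure instead
    // of making readers wait on the lock. Eviction is omitted for brevity.
    class DegradingLruCache<K, V> {
        private final ConcurrentHashMap<K, V> data = new ConcurrentHashMap<>();
        private final LinkedHashMap<K, Boolean> lruOrder =
            new LinkedHashMap<>(16, 0.75f, true);          // access-ordered
        private final ReentrantLock lruLock = new ReentrantLock();

        V get(K key) {
            V value = data.get(key);
            if (value != null && lruLock.tryLock()) {      // don't wait if contended
                try {
                    lruOrder.get(key);                     // recency bump
                } finally {
                    lruLock.unlock();
                }
            }
            return value;                                  // hit either way
        }

        void put(K key, V value) {
            data.put(key, value);
            lruLock.lock();                                // writes still maintain order
            try {
                lruOrder.put(key, Boolean.TRUE);
            } finally {
                lruLock.unlock();
            }
        }
    }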

I've seen plenty of systems overfit to production testing where shifts in traffic (new app deploy, new use case, etc) will cause the system to grind to a halt. I try to not ship software like that.

All said, I will probably try the trace at some point. It looks like they did perfectly good work; I would mostly be hesitant to say it's a generic improvement. I also need to write up a full blog post on the way I test memcached. Too many people are born into BigCo culture and have never had to test software without just throwing it at production, a traffic shadow, or a trace. I'm a little tired of being hand-waved off when their software runs in one use case and mine runs in many thousands.


Based on his response on Twitter, I'd guess not: https://twitter.com/dormando/status/1381706098290225152

But I also agree with you that high concurrency is usually not needed when using distributed caching. As I said in my tweet response (https://twitter.com/thinkingfish/status/1382039915597164544), IO dominates. And there's the failure domain to consider. However, I've been asked a few times now about using the storage module by itself, or via messages over shared memory, in a local setting. That may very well present different requirements for cache scalability.


> However, if this design is used as a local cache over shared memory for a high throughput system (trading? ML feature store?) the scalability can be handy.

In that case, one might be concerned about the hit rates. While FIFO & LRU have been shown to work very well for a remote cache, especially in social network workloads, they are a poor choice in many other cases. Database, search, and analytical workloads are LFU & MRU biased due to record scans. I'd be concerned that Segcache's design is not general purpose enough and relies too heavily on optimizations that work for Twitter's use cases.

Unfortunately, as applications have dozens of local caches, these are rarely analyzed and tuned. Instead, implementations have to monitor the workload and adapt. Popular local caches can concurrently handle 300M+ reads/s, use an adaptive eviction policy, and leverage O(1) proactive expiration. As they are much smaller, there is less emphasis on minimizing metadata overhead and more on system performance, e.g. using memoization to avoid SerDe round trips to the remote cache store. See, for example, Expedia using a local cache to reduce their DB reads by 50%, which allowed them to remove servers, hit SLAs, and absorb spikes (at a cost of ~500 MB) [1]. A minimal configuration sketch along these lines follows the link below.

[1] https://medium.com/expedia-group-tech/latency-improvement-wi...
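
To make that kind of local-cache setup concrete, here's a minimal sketch using Caffeine (the loader, sizes, and TTL are placeholders, not Expedia's actual configuration). It bounds the cache, relies on the library's adaptive eviction policy, and memoizes an expensive DB/remote lookup:

    import com.github.benmanes.caffeine.cache.Caffeine;
    import com.github.benmanes.caffeine.cache.LoadingCache;
    import java.time.Duration;

    class LocalCacheExample {
        // Stand-in for whatever expensive DB or remote-cache call the
        // application would otherwise make on every request.
        static String loadUser(long id) {
            return "user-" + id;  // placeholder
        }

        public static void main(String[] args) {
            LoadingCache<Long, String> users = Caffeine.newBuilder()
                .maximumSize(100_000)                      // bound by entry count
                .expireAfterWrite(Duration.ofMinutes(10))  // keep entries reasonably fresh
                .recordStats()                             // expose hit rate for monitoring
                .build(LocalCacheExample::loadUser);       // memoize the expensive load

            String u = users.get(42L);  // first call loads; later calls hit locally
            System.out.println(u + " hitRate=" + users.stats().hitRate());
        }
    }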


I might be dumb about estimating throughput. According to https://github.com/Cyan4973/xxHash, the best hash functions can only do hundreds of millions of hashes per second, so how can a local cache run at such throughput? I assume that when measuring cache throughput, one needs to calculate the hash, do the lookup (maybe compare keys), and copy the data.


Are you comparing a single-threaded hash benchmark to a multi-threaded cache benchmark? An unbounded concurrent hash table does 1B reads/s on a 16-core machine (~60M ops/s per thread).


Oh, my bad, thank you for pointing that out! BTW, just out of curiosity, where is the 1B reads/s benchmark?


In a multi-threaded benchmark of a cache, the cache should be fully populated and keys should be drawn from a scrambled Zipfian distribution. This emulates hot/cold entries and highlights the areas of contention (locks, CASes, etc.). Lock-free reads benefit from CPU cache efficiency, which can cause super-linear growth.

This shows whether the implementation could be a bottleneck and whether it scales well enough; after that, the hit rate and other factors are more important than raw throughput. I would rather sacrifice a few nanoseconds on a read than suffer much lower hit rates or long pauses on a write due to eviction inefficiencies. A rough sketch of this kind of read benchmark follows the links below.

[1] https://github.com/ben-manes/caffeine/wiki/Benchmarks#read-1...

[2] https://github.com/ben-manes/caffeine/blob/master/caffeine/s...
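
As a rough illustration of that setup (a sketch only, not the linked benchmark harness; the key count, skew, and duration are arbitrary), the cache is fully populated up front, every thread draws keys from a scrambled Zipfian distribution, and only reads are measured:

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import java.util.Arrays;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.atomic.LongAdder;

    class ZipfReadBenchmark {
        static final int KEYS = 1_000_000;   // keyspace size (arbitrary)
        static final double SKEW = 0.99;     // Zipf exponent, YCSB-style default
        static final double[] CDF = buildCdf();

        // Cumulative distribution for P(rank i) proportional to 1 / i^SKEW.
        static double[] buildCdf() {
            double[] cdf = new double[KEYS];
            double sum = 0;
            for (int i = 0; i < KEYS; i++) {
                sum += 1.0 / Math.pow(i + 1, SKEW);
                cdf[i] = sum;
            }
            for (int i = 0; i < KEYS; i++) cdf[i] /= sum;
            return cdf;
        }

        // Draw a popularity rank, then scramble it so the hot keys are
        // spread across the table instead of clustered at low key values.
        static long nextKey() {
            int rank = Arrays.binarySearch(CDF, ThreadLocalRandom.current().nextDouble());
            if (rank < 0) rank = -rank - 1;
            long mixed = (rank + 1) * 0x9E3779B97F4A7C15L;
            return Math.floorMod(mixed, (long) KEYS);
        }

        public static void main(String[] args) throws InterruptedException {
            Cache<Long, Long> cache = Caffeine.newBuilder().maximumSize(KEYS).build();
            for (long i = 0; i < KEYS; i++) cache.put(i, i);  // fully populate first

            int threads = Runtime.getRuntime().availableProcessors();
            LongAdder reads = new LongAdder();
            long deadline = System.nanoTime() + 2_000_000_000L;  // ~2 second run
            Thread[] workers = new Thread[threads];
            for (int t = 0; t < threads; t++) {
                workers[t] = new Thread(() -> {
                    while (System.nanoTime() < deadline) {
                        if (cache.getIfPresent(nextKey()) != null) reads.increment();
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
            System.out.println(reads.sum() / 2 + " reads/s across " + threads + " threads");
        }
    }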


Thanks! I commented here and then realized it might be on Twitter. I appreciate you answering the same question in both places :-)


lol I'm glad you asked there since I am on HN about once every 3 months...


Thank you for the interest and thoughts! It is good to know that other places also use multi-cache approaches instead of multi-threading. This is Juncheng, one of the authors. The main contributions are 1. timely removal of expired objects, 2. reduced per-object metadata, and 3. reduced memory fragmentation. Near-linear scalability is a "good-to-have". :)


Here’s a blog version of the paper for anybody who isn’t into 12-page, double-column PDF files:

https://twitter.github.io/pelikan/2021/segcache.html


>Segcache prioritizes expiration over eviction by making sure expired objects are removed efficiently. It further offers the following features to achieve high memory efficiency and high throughput:

>>It has been shown at Twitter, Facebook, and Reddit that most objects stored in in-memory caches are small. Among Twitter’s top 100+ Twemcache clusters, the mean object size has a median value less than 300 bytes, and the largest cache has a median object size around 200 bytes. In contrast to these small objects, most existing solutions have relatively large metadata per object. Memcached has 56 bytes of metadata per key, Redis is similar, and Pelikan’s slab storage uses 39 bytes. This means more than one third of the memory goes to metadata for a cache where the average object size is 100 bytes.

>To summarize, we designed a new storage backend for Pelikan called Segcache. Segcache groups objects of similar TTLs into segments, and provides efficient and proactive TTL expiration, tiny object metadata, and almost no memory fragmentation. As a result of this design, we show that Segcache can significantly reduce the memory footprint required to serve Twitter’s production workloads. In addition, it allows Pelikan to better utilize the many cores offered by modern CPUs.
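
For a rough intuition of the design the quote describes (and to make the metadata arithmetic concrete: 56 bytes of metadata on a 100-byte object is 56/156 ≈ 36% overhead), here is a toy sketch of TTL-bucketed segments. This is not the actual Segcache code; the structures, bucket boundaries, and window size are made up. Objects with similar TTLs are appended to the same segment, and expiration drops whole segments from the head of each bucket's chain:

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;

    // Toy sketch of TTL-bucketed segments: items with similar TTLs share a
    // segment, items in a segment expire with the segment, and proactive
    // expiration only has to inspect the head of each bucket's chain.
    class TtlSegmentStore {
        static final long[] TTL_BUCKETS_SECONDS = {60, 300, 3600, 86400};  // made-up buckets
        static final long SEGMENT_WINDOW_MILLIS = 10_000;  // start a new segment every ~10s of writes

        static class Segment {
            final long createdAtMillis = System.currentTimeMillis();
            final long ttlMillis;
            final Map<String, byte[]> items = new HashMap<>();  // stand-in for an append-only byte log
            Segment(long ttlMillis) { this.ttlMillis = ttlMillis; }
            boolean expired(long nowMillis) { return nowMillis - createdAtMillis >= ttlMillis; }
        }

        // One chain of segments per TTL bucket, oldest segment first.
        private final Map<Long, ArrayDeque<Segment>> buckets = new HashMap<>();
        private final Map<String, Segment> index = new HashMap<>();  // key -> owning segment

        void put(String key, byte[] value, long ttlSeconds) {
            long bucket = bucketFor(ttlSeconds);
            ArrayDeque<Segment> chain = buckets.computeIfAbsent(bucket, b -> new ArrayDeque<>());
            Segment tail = chain.peekLast();
            if (tail == null || System.currentTimeMillis() - tail.createdAtMillis > SEGMENT_WINDOW_MILLIS) {
                tail = new Segment(bucket * 1000);   // new write segment for this bucket
                chain.addLast(tail);
            }
            tail.items.put(key, value);
            index.put(key, tail);
        }

        byte[] get(String key) {
            Segment seg = index.get(key);
            if (seg == null || seg.expired(System.currentTimeMillis())) return null;
            return seg.items.get(key);               // items expire with their segment
        }

        // Proactive expiration: the oldest segment in each chain expires first,
        // so expired objects are removed in bulk without scanning live ones.
        void expire() {
            long now = System.currentTimeMillis();
            for (ArrayDeque<Segment> chain : buckets.values()) {
                while (!chain.isEmpty() && chain.peekFirst().expired(now)) {
                    Segment dead = chain.pollFirst();
                    dead.items.keySet().forEach(k -> index.remove(k, dead));  // skip keys overwritten later
                }
            }
        }

        private static long bucketFor(long ttlSeconds) {
            for (long b : TTL_BUCKETS_SECONDS) {
                if (ttlSeconds <= b) return b;
            }
            return TTL_BUCKETS_SECONDS[TTL_BUCKETS_SECONDS.length - 1];
        }
    }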


Is there any documentation on how to use the project? Browsing the repo it looks like I'd have to read the code to be able to use it...


Internally I’ve written a generator that spits out a deploy config based on input vectors such as (qps, data size, connections), but it assumes Twitter’s container environment.

Publicly, I think I want to do two things: 1. write a blog post about cache operations and how to configure Pelikan properly in general; 2. create some templates for common deploy mechanisms. What do you use for deploying services? What are your cache requirements? I can produce an example and put it in the repo/docs.





