For fun I did a test benchmarking misses, getting 60 million RPS on a single machine :) For my high-throughput tests the network overhead is so high that, to find the server's limit, the benchmark client has to run over localhost. Not terribly useful; most people's networks will peg well before the server software does. Especially true if your objects aren't tiny or if you batch requests at all.
I've yet to see anyone who really needs anything higher than a million RPS. The extra idle threads and general scalability help keep latency really low, though, so they're still useful even if you aren't maxing out RPS.
You can see tests here too: https://memcached.org/blog/persistent-memory/ - these folks might dismiss this testing as "not a cache trace", but I don't feel that's very productive.
Specifically on the cache traces though: that's just not how I test. I never get traces from users but still have to design software that /will/ typically work. Instead I test each subsystem to failure and ensure a non-pathological dropoff. I.e., if you write so fast that the LRU would slow you down, the algorithm degrades the quality of the LRU instead of losing performance; that's fine, since in most of these cases it's a bulk load, a peak traffic period, a load spike, etc.
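As a toy sketch of what I mean by degrading quality instead of performance (my own illustration, in Java for brevity; memcached's actual LRU maintenance works differently): reads bump recency only when the lock is uncontended, so a write flood makes the ordering approximate instead of making everyone wait.

    import java.util.LinkedHashSet;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantLock;

    // Toy cache: reads are lock-free; the recency bump is best-effort.
    // Under heavy writes the LRU ordering goes stale (quality degrades),
    // but nobody ever blocks on the eviction lock just to read.
    final class BestEffortLru<K, V> {
        private final int capacity;
        private final ConcurrentHashMap<K, V> data = new ConcurrentHashMap<>();
        private final LinkedHashSet<K> order = new LinkedHashSet<>(); // LRU -> MRU
        private final ReentrantLock lock = new ReentrantLock();

        BestEffortLru(int capacity) { this.capacity = capacity; }

        V get(K key) {
            V value = data.get(key);               // never blocks
            if (value != null && lock.tryLock()) { // bump only if uncontended
                try { order.remove(key); order.add(key); }
                finally { lock.unlock(); }
            }
            return value; // contended? skip the bump and stay fast
        }

        void put(K key, V value) {
            lock.lock();                           // writers do take the lock
            try {
                data.put(key, value);
                order.remove(key);
                order.add(key);
                if (order.size() > capacity) {     // evict the (approximate) LRU
                    K eldest = order.iterator().next();
                    order.remove(eldest);
                    data.remove(eldest);
                }
            } finally { lock.unlock(); }
        }
    }

Under sustained write pressure the ordering drifts toward FIFO-ish, which is exactly the kind of non-pathological dropoff I test for.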
I've seen plenty of systems overfit to production testing, where shifts in traffic (new app deploy, new use case, etc.) cause the system to grind to a halt. I try not to ship software like that.
All said, I will probably try the trace at some point; it looks like they did perfectly good work. I would mostly be hesitant to call it a generic improvement. I also need to write up a full blog post on the way I test memcached. Too many people are born into BigCo culture and have never had to test software without just throwing it at production, a traffic shadow, or a trace. I'm a little tired of being hand-waved off when their software runs in one use case and mine runs in many thousands.
But I also agree with you that high concurrency is usually not needed when using distributed caching. As I said in my tweet response (https://twitter.com/thinkingfish/status/1382039915597164544), IO dominates. And there are failure domains to consider. However, I've been asked a few times now about using the storage module by itself, or via messages over shared memory in a local setting. That may very well present different requirements on cache scalability.
In that case, one might be concerned about hit rates. While FIFO & LRU have been shown to work very well for a remote cache, especially in social network workloads, they are a poor choice in many other cases. Database, search, and analytical workloads are LFU- & MRU-biased due to record scans.
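A tiny illustration of the scan problem (hypothetical keys, plain LinkedHashMap as the LRU): one sequential scan evicts the entire hot set.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // One sequential scan wipes out an LRU cache's entire hot set.
    public class ScanDemo {
        public static void main(String[] args) {
            final int capacity = 100;
            Map<Integer, String> lru =
                new LinkedHashMap<Integer, String>(capacity, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<Integer, String> e) {
                        return size() > capacity;
                    }
                };
            for (int k = 0; k < capacity; k++) lru.put(k, "hot");  // hot set
            for (int k = 0; k < capacity; k++) lru.get(k);         // ...in active use
            for (int k = 1000; k < 1000 + capacity; k++) lru.put(k, "scan"); // one scan
            System.out.println(lru.containsKey(0)); // false: every hot key was evicted
        }
    }

A frequency-biased policy would keep the hot keys and let the scan's one-time keys pass through.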
Unfortunately, as applications have dozens of local caches, they are rarely analyzed and tuned individually. Instead, implementations have to monitor the workload and adapt. Popular local caches can concurrently handle 300M+ reads/s, use an adaptive eviction policy, and leverage O(1) proactive expiration. As local caches are much smaller, there is less emphasis on minimizing metadata overhead and more on system performance, e.g. using memoization to avoid SerDe round trips to the remote cache store. See, for example, Expedia using a local cache to reduce their DB reads by 50%, which allowed them to remove servers, hit SLAs, and absorb spikes (at a cost of ~500 MB).
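As a sketch of that memoization pattern (the key and remote loader here are made up; Caffeine is one such local cache):

    import com.github.benmanes.caffeine.cache.Caffeine;
    import com.github.benmanes.caffeine.cache.LoadingCache;
    import java.time.Duration;

    public class LocalMemoization {
        // Hypothetical remote lookup: in practice a memcached/Redis client
        // call plus deserialization (the SerDe cost we want to memoize away).
        static String loadFromRemote(String key) {
            return "value-for-" + key;
        }

        public static void main(String[] args) {
            LoadingCache<String, String> local = Caffeine.newBuilder()
                .maximumSize(100_000)                    // bound the footprint
                .expireAfterWrite(Duration.ofMinutes(5)) // proactive expiration
                .build(LocalMemoization::loadFromRemote);

            local.get("user:42"); // miss: pays the remote + SerDe cost once
            local.get("user:42"); // hits locally, no round trip
        }
    }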
The point is to show whether the implementation could be a bottleneck; once it scales well enough, the hit rate and other factors matter more than raw throughput. I would rather sacrifice a few nanos on a read than suffer a much lower hit rate, or long pauses on a write due to eviction inefficiencies.
>>It has been shown at Twitter, Facebook, and Reddit that most objects stored in in-memory caches are small. Among Twitter’s top 100+ Twemcache clusters, the mean object size has a median value less than 300 bytes, and the largest cache has a median object size around 200 bytes. In contrast to these small objects, most existing solutions have relatively large metadata per object. Memcached has 56 bytes of metadata per key, Redis is similar, and Pelikan’s slab storage uses 39 bytes. This means more than one third of the memory goes to metadata for a cache where the average object size is 100 bytes.
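(The arithmetic checks out: with 56 bytes of metadata per key and 100-byte average objects, 56 / (56 + 100) ≈ 36% of memory is metadata.)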
>To summarize, we designed a new storage backend for Pelikan called Segcache. Segcache groups objects of similar TTLs into segments, and provides efficient and proactive TTL expiration, tiny object metadata, and almost no memory fragmentation. As a result of this design, we show that Segcache can significantly reduce the memory footprint required to serve Twitter’s production workloads. Besides, it allows Pelikan to better utilize the many cores offered by modern CPUs.
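A toy sketch of the TTL-bucketing idea described there (my own illustration in Java; the real Segcache is C and differs in layout and detail): objects with similar TTLs share a segment, so proactive expiration drops whole segments instead of scanning per object.

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;

    // Toy model of TTL-segmented storage: each TTL bucket owns a queue of
    // append-only segments, and a segment expires as a single unit.
    final class TtlSegments {
        static final class Segment {
            final long expiresAtMillis;
            final Map<String, byte[]> objects = new HashMap<>();
            Segment(long expiresAtMillis) { this.expiresAtMillis = expiresAtMillis; }
        }

        private static final long BUCKET_MS = 60_000; // group TTLs by the minute
        private static final int SEGMENT_CAP = 1024;  // seal segments once full
        private final Map<Long, ArrayDeque<Segment>> buckets = new HashMap<>();

        void put(String key, byte[] value, long ttlMillis, long nowMillis) {
            // Round the TTL up to its bucket so nothing expires early.
            long bucket = ((ttlMillis + BUCKET_MS - 1) / BUCKET_MS) * BUCKET_MS;
            ArrayDeque<Segment> segs =
                buckets.computeIfAbsent(bucket, b -> new ArrayDeque<>());
            Segment tail = segs.peekLast();
            if (tail == null || tail.objects.size() >= SEGMENT_CAP) {
                // Later arrivals in a segment expire with it, slightly early;
                // the real design bounds this error by bucketing expiry times.
                tail = new Segment(nowMillis + bucket);
                segs.addLast(tail);
            }
            tail.objects.put(key, value);
        }

        // Proactive expiration: segments in a bucket are ordered by creation
        // time, so only each queue's head needs checking, and reclaiming is
        // one pop per segment rather than a per-object scan.
        void expire(long nowMillis) {
            for (ArrayDeque<Segment> segs : buckets.values()) {
                while (!segs.isEmpty() && segs.peekFirst().expiresAtMillis <= nowMillis) {
                    segs.removeFirst(); // drop the whole segment at once
                }
            }
        }
    }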
Publicly I think I want to do two things:
1. write a blog post about cache operations and how to configure Pelikan properly in general;
2. create some templates for common deploy mechanisms. What do you use for deploying services? What are your cache requirements? I can produce an example and put it in the repo/docs.