I might be dumb about estimating throughput.
According to https://github.com/Cyan4973/xxHash, even the fastest hash functions only manage a few hundred million hashes per second, so how can a local cache run at such throughput? I assume that measuring cache throughput means calculating the hash, doing the lookup (maybe comparing keys), and copying the data.
Are you comparing a single-threaded hash benchmark to a multi-threaded cache benchmark? An unbounded concurrent hash table can do about 1B reads/s on a 16-core machine (~60M ops/s per thread).
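To make the per-thread math concrete, here's a rough, self-contained probe of multi-threaded read throughput against a plain ConcurrentHashMap (a minimal sketch, not a rigorous harness; the entry count, thread count, and run time are arbitrary assumptions, and a real measurement should use JMH):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.LongAdder;

public class MapReadBench {
  static final int ENTRIES = 1 << 20;   // assumed working set size
  static final int THREADS = Runtime.getRuntime().availableProcessors();
  static final long RUN_MILLIS = 5_000;

  public static void main(String[] args) throws InterruptedException {
    ConcurrentHashMap<Integer, Integer> map = new ConcurrentHashMap<>(ENTRIES);
    for (int i = 0; i < ENTRIES; i++) map.put(i, i);  // fully populate up front

    LongAdder reads = new LongAdder();
    AtomicBoolean stop = new AtomicBoolean();
    CountDownLatch go = new CountDownLatch(1);
    Thread[] workers = new Thread[THREADS];

    for (int t = 0; t < THREADS; t++) {
      workers[t] = new Thread(() -> {
        try { go.await(); } catch (InterruptedException e) { return; }
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long local = 0, sink = 0;
        while (!stop.get()) {
          sink += map.get(rnd.nextInt(ENTRIES));  // uniform keys, 100% hits
          local++;
        }
        reads.add(local);
        if (sink == -1) System.out.println();     // keep the JIT from eliding reads
      });
      workers[t].start();
    }

    go.countDown();
    Thread.sleep(RUN_MILLIS);
    stop.set(true);
    for (Thread w : workers) w.join();
    System.out.printf("%d threads: ~%,d reads/s%n",
        THREADS, reads.sum() * 1000L / RUN_MILLIS);
  }
}
```

A loop like this mostly stresses the memory hierarchy rather than the hash function itself, which is why per-thread rates in the tens of millions add up to the aggregate numbers above.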
A multi-threaded cache benchmark should run against a fully populated cache and use a scrambled Zipfian distribution. This emulates hot/cold entries and highlights the areas of contention (locks, CASes, etc.). A lock-free read path benefits from CPU cache efficiency, which can produce super-linear scaling.
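For the key distribution, here's a minimal scrambled-Zipfian sketch in the spirit of YCSB's generator (the 0.99-style skew parameter, the precomputed CDF, and the FNV-style scramble are my assumptions, not YCSB's exact algorithm; it samples a popularity rank from a Zipfian CDF, then hashes the rank so the hot keys don't cluster in adjacent slots):

```java
import java.util.concurrent.ThreadLocalRandom;

/** Skewed key generator: Zipfian ranks, scrambled so the hottest keys
 *  are spread across the key space instead of sitting at low ids. */
final class ScrambledZipfian {
  private final double[] cdf;  // cumulative probability per popularity rank
  private final int items;

  ScrambledZipfian(int items, double skew) {
    this.items = items;
    this.cdf = new double[items];
    double norm = 0;
    for (int i = 1; i <= items; i++) norm += 1.0 / Math.pow(i, skew);
    double sum = 0;
    for (int i = 0; i < items; i++) {
      sum += (1.0 / Math.pow(i + 1, skew)) / norm;
      cdf[i] = sum;
    }
  }

  /** Returns the next key in [0, items); hot ranks are drawn far more often. */
  int next() {
    double u = ThreadLocalRandom.current().nextDouble();
    int lo = 0, hi = items - 1;
    while (lo < hi) {                // binary-search the CDF for the rank
      int mid = (lo + hi) >>> 1;
      if (cdf[mid] < u) lo = mid + 1; else hi = mid;
    }
    return scramble(lo);
  }

  // FNV-1a-style mix; not a bijection, which is acceptable for a benchmark.
  private int scramble(int rank) {
    long h = 0xcbf29ce484222325L;
    h = (h ^ rank) * 0x100000001b3L;
    h = (h ^ (rank >>> 16)) * 0x100000001b3L;
    return (int) ((h & Long.MAX_VALUE) % items);
  }
}
```

Driving the read loop with keys from next() instead of uniform random picks is what exposes lock/CAS contention on the hot entries. (The O(n) CDF table only works for modest key counts; YCSB avoids it with an analytic sampler.)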
This tells you whether the implementation could be a bottleneck and whether it scales well enough; past that point, hit rate and other factors matter more than raw throughput. I would rather sacrifice a few nanoseconds on a read than suffer a much lower hit rate or long pauses on writes due to eviction inefficiencies.