Following links to the OSDI'18 paper gives this overview:
"Setup. In all experiments, Noria and other storage backends run on an Amazon EC2 c5.4xlarge instance with 16 vCPUs; clients run on separate c5.4xlarge instances unless stated otherwise."
That paragraph goes on to explain the nature of the workload.
Their Figure 2 gives example SQL commands.
His read throughput is linear in the number of cores, which means the benchmark must have been run with one thread per core. Otherwise, context switching and core sharing would have resulted in sublinear growth (you can get superlinear growth, but it wouldn't look like that). Assuming his numbers are genuine, his GitHub charts indicate a 32+ core machine.
My 2015 benchmarks were on an Azure G4, Xeon E5-2698B v3 @ 2.00GHz (16 cores, hyperthreading disabled), 224 GB RAM, Ubuntu 15.04. They show slightly superlinear growth for reads (Caffeine).
Both benchmarks use a Zipf distribution. I believe he said his map only supports single-writer semantics, whereas mine is roughly per-entry (per hash bin). So reads should be safe to compare, and we can ignore his slow write rate. His reads should be faster, as I do more work to maintain complex eviction policies, whereas his can simply read a pseudo-immutable map with no eviction.
Since he does not perform any cache operations (it is effectively a concurrent map), we should be comparing against another unbounded concurrent map. That achieves over 1 billion reads per second on the above hardware. That is why I think something is badly wrong, or I'm misunderstanding something fundamental. Achieving 25M/s is a very disappointing result.
> Since he does not perform any cache operations, as it is a concurrent map
Why is it a concurrent map? To me it seems to be a regular Rust hashtable. Unless you tweak something, most of the cost here is going to be the very slow hashing algorithm.
Both the Zipf (skewed) and uniform distributions have the same read throughput. That should not happen, because of hardware caching and memory effects. A Zipf distribution means some entries are read much, much more often (the 80/20 rule; hot vs. cold cache entries). The hardware caches should benefit from this, and that is how you get superlinear effects.
A uniform distribution is equivalent to random access, e.g. all eviction policies would degrade to random replacement, as there are no recency/frequency/locality benefits. This means that if the dataset cannot reside entirely within the CPU cache, we should see different throughputs for the two distributions. Yet they are the same.
His benchmarks only use 10k elements, so maybe it all fits in the CPU cache. If that's true, then 25M/s is absolutely awful. Something is wrong.
Where do you see this? I only see it lock on refresh.
I've read all your comments below, and figured I'd try to give a single overall reply, so here goes.
First, it's important to note the difference between Noria (the database) and evmap (the data structure). Noria is the research project I've been working on, and which was published in [OSDI'18](https://www.usenix.org/conference/osdi18/presentation/gjengs...). Its primary contribution is partially materialized, dynamic data-flow, something no existing systems provide. We use that to implement materialized views, and find that it gives expressive and high-performance materialized views, beyond what you can currently find in other existing systems. evmap is _not_ novel. Neither the epoch scheme, nor the swapped map. It is simply a neat data structure that we happen to _use_ in Noria, and which we find gives good performance in the face of highly concurrent reads. We could easily swap out evmap for another concurrent map if we find one that performs better.
The reason I built evmap initially was out of sheer laziness. I wanted a data structure that allowed basically perfectly scalable reads when writes are infrequent compared to reads, and I didn't want to implement my own hash map, because then you lose out on all the optimizations that more mature implementations carry. evmap ticks both those boxes. It scales pretty much linearly, and can use any map you want internally -- it does _not_ need any knowledge of the internals of the map, uses only standard map APIs, and supports things like dynamic map resizing (which many concurrent maps lack) without any special consideration.
As for performance, I'm not sure I know where you have those numbers from. The most recent paper I'm aware of with a concurrent key/value store is [FASTER from Microsoft](https://www.microsoft.com/en-us/research/uploads/prod/2018/0...), which was published in SIGMOD'18. The best numbers I see reported in Figure 8, with 56 cores, is 100-150M reads/s when there are no writes. That is obviously impressive, and far more than the simple evmap design can provide, but if you look at the competition (like MassTree), those get closer to 50M/s for a read _only_ benchmark, which is pretty close to evmap given the difference in hardware and number of cores.
You _are_ right that evmap is slower though, and I can hazard some guesses as to why. First, evmap is a multi-value map, so all operations need to chase one more pointer (to the value list for the key). Second, evmap has essentially half the cache space available to it, since the readers sort of end up sharing the cache between the two maps. And third, evmap reads do the double-epoch-counting trick, which adds two mostly-uncontended-but-not-free increments on the critical path. I haven't looked into the details of what FASTER and MassTree do, but they probably do something clever.
And finally, to some of your more specific points of contention:
- There are as many cores as threads in the evmap benchmark.
- Zipf and uniform having similar performance isn't that odd, because the main cost in that benchmark ends up being the readers needing to re-acquire read access to the cache line for the map pointer, which isn't affected by distribution.
- evmap is indeed not lock-free, but the read path is. The writer does take a lock, but that lock is just used to arrange for _new_ read handles to be added, which should be a rare occurrence. It should generally be uncontended.
Phew! Thanks for all your comments and digging -- hopefully this helps answer some of your concerns :)
Java and C++ concurrent maps scale reads quite well, both in practice and in research. I was comparing only in-memory, read-only map lookups, not stores that persist like FASTER and MassTree. I should note, though, that decade-old or older hardware will have lower multi-threaded throughput due to using many slow cores (Azul, SPARC), so those raw numbers are not comparable to today's x86 machines.
In my experience, when adding atomic writes to mostly uncontended fields during such a lookup, performance degrades to roughly 33% of the raw map. Therefore 25M/s did not make sense, and I believe there is a limiting factor that could be removed to increase throughput.
While doing some digging of my own, I noticed that the benchmark is using the default Rust hashing function, SipHash, which trades speed for resistance to hash-flooding attacks. I should probably re-run with FNV hashing, as that can potentially yield an order-of-magnitude speedup: https://cglab.ca/~abeinges/blah/hash-rs/. I'm currently away from my main setup, but will make that change and re-benchmark when I get back to work in a few days' time! Or maybe you'll beat me to it. It should just be a matter of using [`with_hasher`](https://docs.rs/evmap/4.0/evmap/struct.Options.html#method.w...) and passing in one of [these](https://docs.rs/fnv/1.0/fnv/type.FnvBuildHasher.html). That might be the entire explanation for the difference!
You're right about the persistence point, that was a brain-fart on my part. evmap does not provide persistence, and so that's not the right comparison point!
Digging a little further, I realized that the benchmarker spends a bunch of time generating random numbers. In particular, generating a Zipf-distributed number (which the benchmarker does even when you run uniform) takes about 100ns: https://github.com/jonhoo/rust-zipf, which sets an upper limit of 10M ops/s per thread on what the benchmarker can measure. With Zipf removed entirely, I can get the std HashMap up to 8M reads/s, but no higher, which makes me think that the map really then is the bottleneck (generating a uniformly random number takes ~5ns, so shouldn't be the bottleneck). Running evmap with uniform and the Zipf generation removed gives 73M reads/s, so ~4.5M reads/s/thread, which is again about half of the standard library HashMap with no synchronization.
So, all that said, I do not believe this is an error in evmap. Rather, if you believe 8M reads/s per core on a HashMap is slow, then it's the Rust HashMap implementation in the standard library that is slow. evmap's 2x overhead doesn't seem that bad to me.
I'm glad I dug through this though!
For microbenchmarks, I try to remove any overheads during runtime by precomputing. That means fully populating the map (only hits) and generating the distribution of keys ahead of time. Then each thread starts at a random index into the precomputed keys and increments it on each operation to fetch the next key. This is a good example in Java,
I would expect ~15-30ns per integer key lookup on a decent hashtable, w/o concurrency, at a modest size (few collisions). Using a concurrent map that supports lock-free reads doesn't have to be significantly different. Since hashtable design makes such a big difference, it sounds like you've narrowed it down to that as a reasonable bottleneck. Thanks for investigating.
That it is a multi-value map? It could be more, too; it could also be just one part of a larger system, where not everything is a race.