Rust at speed – building a fast concurrent database [video] (youtube.com)
201 points by henning on Jan 5, 2019 | 48 comments



I very much enjoyed this talk and I'm one of those in-the-database-industry "tell me something I don't know" types.

He put better words to many concepts I've worked on for years, in simpler terms than what other languages have and/or could express. Has me very excited for Noria.

As others have commented, though, the 25M ops/sec is a little confusing. It is certainly better than modern/legacy systems, but it is subpar for other recent/similar research and production systems (like ours). Theirs is about 2x Redis, but with that threading and that type of machine it should be capable of ~300M+ ops/sec, unless their cached reads were on very large values (but he says in the talk it is on Reddit-like messages, so that can't be it), or unless it was benchmarked relative to the client and includes round-trip latency to a local server.

Either way, the quality of the ideas, work, and presentation certainly strikes me as a sign that Jon and Noria are an up-and-coming smash success to keep an eye on.


I'm glad to hear that you enjoyed it, and that you found the concepts and ideas understandable!

It's worth pointing out that this was intended to be a relatively high-level talk. For the more technical details, I'd recommend reading the [Noria paper](https://www.usenix.org/conference/osdi18/presentation/gjengs...). As for the performance numbers, I've tried to give some better discussion of that [in a response](https://news.ycombinator.com/item?id=18838359) to NovaX's comments above.


While we're on the subject of Rust (and, in particular, /learning/ Rust), I'd like to ask folks here: is there a developer advocate who does what Francesc Campoy does for Go but with Rust?

I'm talking about Francesc's "JustForFunc"[0] video series, where he commonly does code reviews of Go code. I find it a really good way to see examples of working code and best practices, and I've been scouring YouTube looking for the same but for Rust.

Ideas?

[0]: https://www.youtube.com/channel/UC_BzFbxG2za3bp5NRRRXJSw


Coincidentally, the person giving this talk hosts live coding sessions and does some code review.


I've seen this guy talking nonsense (at a Google I/O?) about Go's best practices a few years ago, even though he was a total noob in Go (and not only in Go).

Research the authors a bit before taking their advice seriously! Even if they work for company X, it doesn't mean they are authoritative on whatever they evangelize.

Lately, more and more people seem to give talks just to earn some points.


I think steveklabnik was hired very early on in the Rust project's life as a community liaison. Not sure if he still does that work, but he posts here regularly and perhaps can point to one, or advocate for such a position if one does not exist.


My job is technically docs; I do the community stuff because I love it.

I don’t think there’s anyone doing this in Rust that I’m aware of.


steveklabnik is leaving Mozilla so... https://words.steveklabnik.com/thank-u-next


I’m confused by the read/write performance numbers. I don’t disagree that lock-free reads are better, but 25M reads/sec at 16 threads is really bad. I can do 13.5M with a naive exclusive lock on an LRU cache, and 380M reads / 48M writes on a concurrent cache. The left/right concurrency isn’t novel, and I feel bad being so very underwhelmed. What am I missing?
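
For reference, by "naive exclusive lock" I mean a baseline along these lines (a minimal Rust sketch, not my actual benchmark; the key set and counts are made up):

    use std::collections::HashMap;
    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        let map = Arc::new(Mutex::new(HashMap::new()));
        map.lock().unwrap().insert(1u64, 42u64);

        // Every reader serializes on a single lock; this is the
        // baseline a lock-free read path is supposed to beat.
        let readers: Vec<_> = (0..16)
            .map(|_| {
                let map = Arc::clone(&map);
                thread::spawn(move || {
                    for _ in 0..1_000_000 {
                        let _ = map.lock().unwrap().get(&1).copied();
                    }
                })
            })
            .collect();
        for r in readers {
            r.join().unwrap();
        }
    }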


How does your workload and server spec compare to theirs?

Following links to the OSDI'18 paper [1] gives this overview:

"Setup. In all experiments, Noria and other storage backends run on an Amazon EC2 c5.4xlarge instance with 16 vCPUs; clients run on separate c5.4xlarge instances unless stated otherwise."

That paragraph goes on to explain the nature of the workload.

Their Figure 2 gives example SQL commands.

[1] https://jon.tsp.io/papers/osdi18-noria.pdf


He was showing a microbenchmark of docs.rs/evmap and not of Noria. This is a load test on the local machine: https://github.com/jonhoo/rust-evmap/tree/master/benchmark

His read throughput is linear in the number of cores, which means it must have been run with the same number of cores as threads. Otherwise, context switching and sharing of cores would have resulted in sublinear growth (you can get superlinear growth, but it won't look like that). Assuming his numbers are not fake, his GitHub charts indicate a 32+ core machine.

My 2015 benchmarks were on an Azure G4, Xeon E5-2698B v3 @ 2.00GHz (16 cores, hyperthreading disabled), 224 GB, Ubuntu 15.04. They show a slight superlinear growth for reads (Caffeine). https://github.com/ben-manes/caffeine/wiki/Benchmarks#server...

Both benchmarks use a Zipf distribution. I believe he said it only supports single-writer semantics, whereas mine is roughly per-entry (per hash bin). So reads should be safe to compare, and we can ignore his slow write rate. His should be faster, as I do more work to maintain complex eviction policies, whereas his is reading a pseudo-immutable map with no eviction.

Since he does not perform any cache operations, as it is a concurrent map, we should be comparing against another unbounded concurrent map. That is over 1 billion reads per second on the above hardware. That is why I think something is miserably wrong, or I'm misunderstanding something fundamental. Achieving 25M/s is a very disappointing result.


I'm not the author but you should probably ask there and not on HN.

That said:

> Since he does not perform any cache operations, as it is a concurrent map

Why is it a concurrent map? To me it seems to be a regular Rust hashtable. Unless you tweak something, most of the cost is going to be the very slow hashing algorithm.


His chart is very wrong: https://github.com/jonhoo/rust-evmap/blob/master/benchmark/r...

Both the Zipf (skewed) and uniform distributions have the same read throughput. That should not occur, because of hardware caches and memory effects. A Zipf distribution means that some entries are read much, much more often (the 80/20 rule; hot/cold cache entries). The hardware caches should benefit from this, and that is how you get superlinear effects.

A uniform distribution is the same as random access, e.g. all eviction policies would degrade to random replacement, as there are no recency/frequency/locality benefits. This means that if the dataset cannot reside entirely within the CPU cache, then we should see a different throughput. Yet they are the same.

His benchmarks only have 10k elements, so maybe it all fits in the CPU cache. If that's true, then 25M/s is absolutely awful. Something is wrong.


It is not lock-free. Every time he reads he acquires a lock by calling epochs.lock(). That may not explain the numbers entirely, but this benchmark is completely bogus.


> Every time he reads he acquires a lock by calling epochs.lock().

Where do you see this? I only see it lock on refresh.


The benchmark allocates a new ReadHandle on every get(key) operation, whose constructor performs work under a shared lock. This is a design flaw of the benchmark, not the library.


I'm not sure where you get that from? The benchmark code clones one read handle for each read thread at the beginning, which is then used for the entire benchmark: https://github.com/jonhoo/rust-evmap/blob/d307999c1ad78d10ec...
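
For reference, the intended pattern looks roughly like this (a simplified sketch, not the benchmark code itself; error handling and the timed loop are omitted):

    use std::thread;

    fn main() {
        let (r, mut w) = evmap::new();
        w.insert(1u64, 42u64);
        w.refresh(); // publish the write to readers

        // Clone the read handle once per thread, up front --
        // not once per lookup, which is what would hit the lock.
        let readers: Vec<_> = (0..4)
            .map(|_| {
                let r = r.clone();
                thread::spawn(move || r.get_and(&1, |vals| vals[0]))
            })
            .collect();
        for h in readers {
            h.join().unwrap();
        }
    }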


Hi NovaX! I'm the presenter in the video above.

I've read all your comments below, and figured I'd try to give a single overall reply, so here goes.

First, it's important to note the difference between Noria (the database) and evmap (the data structure). Noria is the research project I've been working on, and which was published in [OSDI'18](https://www.usenix.org/conference/osdi18/presentation/gjengs...). Its primary contribution is partially materialized, dynamic data-flow, something no existing system provides. We use that to implement materialized views, and find that it gives expressive and high-performance materialized views, beyond what you can currently find in other existing systems. evmap is _not_ novel. Neither the epoch scheme, nor the swapped map. It is simply a neat data structure that we happen to _use_ in Noria, and which we find gives good performance in the face of highly concurrent reads. We could easily swap out evmap for another concurrent map if we find one that performs better.

The reason I built evmap initially was out of sheer laziness. I wanted a data structure that allowed basically perfectly scalable reads when writes are infrequent compared to reads, and I didn't want to implement my own hash map, because then you lose out on all the optimizations that more mature implementations carry. evmap ticks both those boxes. It scales pretty much linearly, and can use any map you want internally -- it does _not_ need to have any knowledge of the internals of the map, uses only standard map APIs, and supports things like dynamic map resizing (which many concurrent maps lack) without any special consideration.

As for performance, I'm not sure where you got those numbers from. The most recent paper I'm aware of with a concurrent key/value store is [FASTER from Microsoft](https://www.microsoft.com/en-us/research/uploads/prod/2018/0...), which was published in SIGMOD'18. The best numbers I see reported in Figure 8, with 56 cores, are 100-150M reads/s when there are no writes. That is obviously impressive, and far more than the simple evmap design can provide, but if you look at the competition (like MassTree), those get closer to 50M/s for a read _only_ benchmark, which is pretty close to evmap given the difference in hardware and number of cores.

You _are_ right that evmap is slower, though, and I can hazard some guesses as to why. First, evmap is a multi-value map, so all operations need to chase one more pointer (to the value list for the key). Second, evmap has essentially half the cache space available to it, since the readers sort of end up sharing the cache between the two maps. And third, evmap reads do the double-epoch-counting trick, which adds two mostly-uncontended-but-not-free increments on the critical path. I haven't looked into the details of what FASTER and MassTree do, but they probably do something clever.
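
In sketch form, the read-side trick is roughly the following (simplified, not the actual evmap code):

    use std::sync::atomic::{AtomicUsize, Ordering};

    // One epoch counter per read handle. Odd = reader is inside
    // the map, even = reader is outside. The writer waits until
    // every reader's epoch is even (or has moved on) before it
    // destructively reuses the map the readers just left.
    struct Epoch(AtomicUsize);

    impl Epoch {
        fn with_read<T>(&self, read: impl FnOnce() -> T) -> T {
            self.0.fetch_add(1, Ordering::AcqRel); // enter: odd
            let out = read();
            self.0.fetch_add(1, Ordering::AcqRel); // exit: even
            out
        }
    }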

And finally, to some of your more specific points of contention:

- There are as many cores as threads in the evmap benchmark.

- Zipf and uniform having similar performance isn't that odd, because the main cost in that benchmark ends up being the readers needing to re-acquire read access to the cache line for the map pointer, which isn't affected by distribution.

- evmap is indeed not lock-free, but the read path is. The writer does take a lock, but that lock is just used to arrange for _new_ read handles to be added, which should be a rare occurrence. It should generally be uncontended.

Phew! Thanks for all your comments and digging -- hopefully this helps answer some of your concerns :)


Thanks for the reply. I wasn't trying to critique your work, but when something seems wrong, it's worth trying to understand why. That's either a mistake in the project or in my understanding.

Java and C++ concurrent maps scale reads quite well, both in practice and in research [1]-[7]. I was comparing only in-memory read-only map lookups, not maps that persist like FASTER and MassTree. I should note, though, that decade-old or older hardware will have lower multi-threaded throughput due to using many slow cores (Azul, SPARC), so those raw numbers are not comparable to today's x86 machines.

In my experience, when adding atomic writes to mostly uncontended fields during this lookup, the performance degrades to 33% of the raw map's. Therefore 25M/s did not make sense, and I believe there is a limiting factor that could be removed to increase throughput.

[1] https://preshing.com/20160201/new-concurrent-hash-maps-for-c...

[2] http://people.csail.mit.edu/shanir/publications/LazySkipList...

[3] https://arxiv.org/pdf/1809.04339.pdf

[4] https://dl.acm.org/citation.cfm?id=3210408&dl=ACM&coll=DL

[5] https://www.usenix.org/legacy/event/atc11/tech/final_files/T...

[6] https://web.stanford.edu/class/ee380/Abstracts/070221_LockFr...

[7] https://arxiv.org/pdf/1601.04017.pdf


I completely agree with that sentiment :)

While doing some digging of my own, I noticed that the benchmark is using the default Rust hashing function, SipHash, which is cryptographically secure. I should probably re-run with FNV hashing, as that can potentially yield an order-of-magnitude speedup: https://cglab.ca/~abeinges/blah/hash-rs/. I'm currently away from my main setup, but will make that change and re-benchmark when I get back to work in a few days' time! Or maybe you'll beat me to it. It should just be a matter of using [`with_hasher`](https://docs.rs/evmap/4.0/evmap/struct.Options.html#method.w...) and passing in one of [these](https://docs.rs/fnv/1.0/fnv/type.FnvBuildHasher.html). That might be the entire explanation for the difference!
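
For the plain std HashMap comparison, the equivalent hasher swap is just this (a quick sketch using the fnv crate):

    use fnv::FnvBuildHasher;
    use std::collections::HashMap;

    fn main() {
        // Same map type, FNV instead of the default SipHash.
        let mut map: HashMap<u64, u64, FnvBuildHasher> =
            HashMap::with_hasher(FnvBuildHasher::default());
        map.insert(1, 42);
        assert_eq!(map.get(&1), Some(&42));
    }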

You're right about the persistence point, that was a brain-fart on my part. evmap does not provide persistence, and so that's not the right comparison point!


I did a quick test now, and with 16 cores + FNV hashing + disabled hyperthreads, I got ~41M ops/s total on those cores. That got me digging a little further, and I decided to use the same benchmark harness to benchmark just a std::collections::HashMap with FNV hashing. Running a single-threaded write-then-read benchmark yields a throughput of 5M reads/s, which is about 2x that of evmap per thread (which seems like a reasonable overhead).

Digging a little further, I realized that the benchmarker spends a bunch of time generating random numbers. In particular, generating a Zipf-distributed number (which the benchmarker does even when you run uniform) takes about 100ns: https://github.com/jonhoo/rust-zipf, which sets an upper limit of 10M ops/s per thread that the benchmarker can measure. With Zipf removed entirely, I can get the std HashMap up to 8M reads/s, but no higher, which makes me think that the map really is the bottleneck then (generating a uniformly random number takes ~5ns, so that shouldn't be the bottleneck). Running evmap with uniform and the Zipf generation removed gives 73M reads/s, so ~4.5M reads/s/thread, which is again about 1/2 of the standard library HashMap with no synchronization.

So, all that said, I do not believe this is an error in evmap. Rather, if you believe 8M reads/s per core on a HashMap is slow, then it's the Rust HashMap implementation in the standard library that is slow. evmap's 2x overhead doesn't seem that bad to me.

I'm glad I dug through this though!


That makes sense, thanks!

For microbenchmarks, I try to remove any runtime overheads by precomputing. That means fully populating the map (only hits) and pregenerating the distribution of keys. Then the threads start at a random index and increment it for the key on each operation. This is a good example in Java: https://github.com/ben-manes/caffeine/blob/master/caffeine/s...
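
Translated to Rust, that setup would look something like this (a sketch; it assumes the zipf crate linked above and a 2019-era rand API):

    use rand::distributions::Distribution;
    use rand::Rng;
    use zipf::ZipfDistribution;

    fn main() {
        let mut rng = rand::thread_rng();
        let zipf = ZipfDistribution::new(10_000, 1.07).unwrap();

        // Pay the ~100ns/sample cost up front, outside the timed loop.
        let keys: Vec<usize> =
            (0..1_000_000).map(|_| zipf.sample(&mut rng)).collect();

        // Each thread starts at a random index and just walks the
        // precomputed keys, so the measured loop is lookup-only.
        let start = rng.gen_range(0, keys.len());
        for i in 0..keys.len() {
            let _key = keys[(start + i) % keys.len()];
            // ... one map lookup with _key goes here ...
        }
    }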

I would expect ~15-30ns per integer-key lookup on a decent hashtable, w/o concurrency, at a modest size (few collisions). Using a concurrent map that supports lock-free reads doesn't have to be significantly different. Since hashtable design makes such a big difference, it sounds like you've narrowed it down to that as a reasonable bottleneck. Thanks for investigating.

https://tessil.github.io/2016/08/29/benchmark-hopscotch-map....


@NovaX it's worth noting that swapping out the underlying hash map that evmap uses is really easy, which is one of the advantages of the design as far as I'm concerned! For example, here's the diff for moving to someone else's custom hash map implementation: https://github.com/jonhoo/rust-evmap/compare/hashbrown. A better benchmarking harness is a good idea, though I'd like to see something that is somewhat disconnected from evmap! Pre-generating the randomness is something I've done in the past, but it does have the downside that you end up not actually exercising the distribution well (unless you generate vast amounts of keys)...


You might want to try with hashbrown too; there’s an open PR to replace HashMap’s guts with it, but there’s also a crates.io package so you don’t need to rebuild the stdlib yourself.
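
The crates.io package mirrors std's map API, so trying it is roughly a one-line swap (sketch):

    // hashbrown's HashMap mirrors std's, so existing code mostly
    // just changes the import.
    use hashbrown::HashMap;

    fn main() {
        let mut map: HashMap<u64, u64> = HashMap::new();
        map.insert(1, 42);
        assert_eq!(map.get(&1), Some(&42));
    }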


Someone else just suggested this on Twitter, but sadly the results are pretty much the same: https://twitter.com/Jonhoo/status/1082014596028715008


> What am I missing?

That it is a multi-value map? There could be more to it too; it could also be just one part in a larger system, and not everything is a race.


Link to the Github project: https://github.com/mit-pdos/noria.


PostgreSQL developers have responded to alternatives by adding functionality that those alternatives offer. What could PostgreSQL learn from Noria's value proposition?


They need to add support for incremental refreshes of materialized views. It is something they have thought about for a while: https://rhaas.blogspot.com/2010/04/materialized-views-in-pos... and it makes the lists of most requested features: https://rhaas.blogspot.com/2016/01/postgresql-past-present-a...


To observe the "epoch counters" between threads (which can run on different cores), don't you still need memory fences/barriers to make the epochs visible to other threads? I found that to be a significant part of the cost of locking.

Perhaps that accounts for "only" 25M reads/s, although I'm surprised that mutexes are sooo much slower.


Yes, you need both compiler barriers for the ordering, and memory fences to ensure the global visibility of previous operations in order. It is possible that the mutex implementation he used does no spinning in case of contention and just context-switches each time. That could explain the poor performance of mutexes, but I don't know the implementation.


The mutexes were the standard Rust Mutex, which I believe just forwards directly to pthread locks. I'm not sure what kind of spinning behavior they have, though.


Rust mutexes are in the process of being redone. The popular parking_lot library is about to replace the platform-native ones.


Can you link to the discussion this is referencing? I can't imagine parking_lot outright replacing the mutex in std, rather than being added as an optional alternative... what would be the supported way of deliberately using the platform-native mutex?


The discussion is partially here: https://internals.rust-lang.org/t/standard-library-synchroni...

The PR for std is here: https://github.com/rust-lang/rust/pull/56410

It mostly comes down to the platform-native APIs having various soundness issues.


My hunch would be that the results from the talk wouldn't materially change, except that the Mutex numbers would perhaps be _slightly_ better. The scalability issue would remain though, as it is fundamental to mutexes!


parking_lot mutexes don't support lock poisoning like the standard library mutexes do. Wouldn't this be a totally breaking change? Unless poisoning has been added to parking_lot?


Yup, some synchronization is certainly necessary there too. Take a look at the full code for the read path here if you're curious! https://github.com/jonhoo/rust-evmap/blob/d307999c1ad78d10ec...


As I understand it, it achieves lock-free reads by having two copies of the map: one for modifying and one for reading, and it atomically swaps them after modifications (and applies the changes to what was previously the read copy). Am I missing something, or doesn't this reduce the overall memory available for the cache by a factor of 2?
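
To illustrate what I mean (a hypothetical sketch, not evmap's actual code):

    use std::collections::HashMap;
    use std::sync::atomic::{AtomicPtr, Ordering};

    // Readers always dereference `live`; the writer mutates the
    // other map, publishes it with one atomic pointer swap, then
    // replays its pending writes into the map readers just left.
    struct LeftRight {
        live: AtomicPtr<HashMap<u64, u64>>,
    }

    impl LeftRight {
        fn read(&self, key: u64) -> Option<u64> {
            // Only sound because the writer waits out old readers
            // (the epoch counters discussed upthread) before it
            // reuses a map; that bookkeeping is omitted here.
            let map = unsafe { &*self.live.load(Ordering::Acquire) };
            map.get(&key).copied()
        }
    }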


He explains this at roughly the 40 minute mark. [1]

"There's 7 more lines of `unsafe` that avoids keeping both maps. You still keep two maps, but you de-duplicate the data between those two maps."

https://www.youtube.com/watch?v=s19G6n0UjsM&t=40m1s


Really cool ideas for automatic cache invalidation and for leveraging Rust's ownership to guarantee safety with concurrent reads and writes.


I am not sure how this would work in a production database, but the problem of concurrent reads and writes is pretty well solved in Clojure. https://clojure.org/reference/refs

I was also wondering if the Datomic approach is better.


Clojure's refs are actually STM based on MVCC, as explained in the link you mentioned. The problem with it is that it's really, really slow. It works elegantly and if performance is not too big of an issue, it's an appropriate solution.

There's a reason why Datomic is generally known as being quite slow, and Cognitect being really insistent on not allowing anyone to publish any benchmarks, ever.

(As a sidenote, I'm a long term Clojure developer and an absolute fan of Datomic's data model, but the context of the original article is all about speed)


Well, just saying something is slow is a little hand-wavy. I understand that writing a database in Clojure/Java is not great (hello Cassandra, Hadoop, HBase, ...), but in my Clojure code I found STM rather useful for sharing the same reference, and it was always fast enough for the use cases I was running into.

I guess Datomic might be slow; I was more referring to the idea of using a transactor to manage the concurrent writes, as opposed to the model that was proposed in the video. Splitting up your database into different parts makes sense to me.


If you want to check out more about high-throughput incremental view maintenance in a Datomic-like system, here's a decent read (caveat: I'm involved):

https://www.nikolasgoebel.com/2018/09/13/incremental-datalog...

Also, a recent video from ClojureConj on the subject:

https://www.youtube.com/watch?v=ZgqFlowyfTA


Datomic is not a database you want if you have any performance requirements. Tuning it for production is a nightmare; only Cognitect can do it, so if you use it for anything serious, make sure to buy a support contract.

They partly solve the 'no one can tune this black box' problem with their cloud offering by doing it for you, but Cloud is a different product and dependent on AWS; it doesn't have an on-premises version.


If you look at what Reddit [1] and Facebook [2] did in their data layers to scale past what the relational model can offer, they start to look pretty darn close to the Datomic architecture. Linearized. Time aware. Immutable. Reads from cache. Horizontal scaling. Query/writer/storage separation. And obviously they are cloud architectures, built and operated by teams of distributed systems engineers capable of tuning performance of such a system.

[1] https://news.ycombinator.com/item?id=15726376

[2] http://www.dustingetz.com/:datomic-facebook-tao/

Neither of these architectures resemble the relational model anymore. They resemble a half-baked bug-ridden implementation of half of Datomic.

TLDR: Nobody likes their database, and zero teams at scale run a vanilla out-of-the-box database.

This might shed light on what people mean when they are talking about tuning Datomic: http://www.dustingetz.com/:datomic-performance-gaare/ I'm sure there is more but this is the only thing I have ever come across.


I see. Just like Hadoop, Cassandra, HBase, MySQL, Couchbase, and many other systems. You might pull out the open-source card, but there are companies that make a living supporting these great open-source systems, because it is not trivial to install, configure, and tune them. I am not sure how this new system will be any different.



