Exploring Lock-Free Rust (2017) (morestina.net)
39 points by cjg on Sept 22, 2022 | 6 comments


Isn’t there a race condition here? Thread 1 acquires the lock, but stalls before it writes the value. Thread 2 will not acquire the lock, but immediately attempt to read the value - which will be None and the unwrap will panic.
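A minimal sketch of the pattern being described, assuming a try-lock flag plus a lazily written slot (hypothetical names, not the article's exact code; a Mutex<Option<u64>> stands in for the lock-free storage so the snippet compiles on its own):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Mutex;

    struct Lazy {
        lock: AtomicBool,          // try-lock: whoever wins does the work
        value: Mutex<Option<u64>>, // stand-in for the lock-free slot
    }

    impl Lazy {
        fn get(&self) -> u64 {
            if self
                .lock
                .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                // Thread 1 wins the lock but can be preempted right here,
                // before the value is written...
                *self.value.lock().unwrap() = Some(expensive_transform());
                self.lock.store(false, Ordering::Release);
            }
            // ...while thread 2 loses the lock race, falls through, and
            // reads the slot immediately: still None, so the unwrap panics.
            let v = *self.value.lock().unwrap();
            v.unwrap()
        }
    }

    fn expensive_transform() -> u64 {
        42
    }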


I concur. I imagine it's for compactness for the blog, but friends don't let friends call unwrap like that. Would've been better to return a Result.
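For illustration, a sketch of the Result-returning shape being suggested (NotReady and the signature are made up, not the article's API):

    use std::sync::Mutex;

    #[derive(Debug)]
    struct NotReady;

    // Surface the "value not produced yet" case in the return type instead
    // of panicking, and let the caller decide whether to retry or bail.
    fn get_transformed(slot: &Mutex<Option<u64>>) -> Result<u64, NotReady> {
        (*slot.lock().unwrap()).ok_or(NotReady)
    }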


Very interesting, but might need a (2017). Does anyone know if this is still state of the art?


Also curious how crossbeam compares with e.g. urcu, which sounds perfect for this kind of many-reads-few-updates scenario (Rust's urcu crate is a wrapper over the C lib, https://crates.io/crates/urcu -- there might be other native implementations somewhere). I couldn't find any comparison of the two; the algorithms are close but probably different enough to matter?
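No comparison here either, but for a rough feel of the crossbeam side, a sketch of the RCU-ish read-mostly pattern using crossbeam-epoch (assuming the current 0.9-era API; this only shows the shape of the API, it's not a benchmark or a urcu comparison):

    use crossbeam_epoch::{self as epoch, Atomic, Owned};
    use std::sync::atomic::Ordering;

    struct Config {
        threshold: u64,
    }

    fn main() {
        let current = Atomic::new(Config { threshold: 1 });

        // Reader side: no locks, just an epoch pin per access.
        {
            let guard = epoch::pin();
            let snapshot = current.load(Ordering::Acquire, &guard);
            // Safe as long as old snapshots are only freed via defer_destroy.
            if let Some(cfg) = unsafe { snapshot.as_ref() } {
                println!("threshold = {}", cfg.threshold);
            }
        }

        // Writer side: publish a new snapshot, retire the old one once all
        // pinned readers are gone.
        {
            let guard = epoch::pin();
            let old = current.swap(
                Owned::new(Config { threshold: 2 }),
                Ordering::AcqRel,
                &guard,
            );
            unsafe { guard.defer_destroy(old) };
        }

        // Reclaim the final snapshot when the structure itself goes away.
        unsafe { drop(current.into_owned()) };
    }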


This is the first of a series of three blog posts, with quite an interesting conclusion in the final post[1]:

> On my 2012 desktop workstation with 3GHz Xeon W3550, the Java benchmark reports an average of 7.3 ns per getTransformed invocation. The Rust benchmark reports 128 ns in get_transformed, a whopping 17 times slower execution. [...]

This article was written five years ago and things may have changed for better or for worse. Still, it's a clear example of how using Rust can make your code slower: because it makes you implement all the nitty-gritty details yourself (e.g. multi-threaded locking mechanisms), getting everything right and fast is left to you rather than to the standard library, as in Java, where dedicated people have been refining their solution to this problem for decades.

Rust can be lightning fast compared to Java, but for certain workloads (in this case, many cross-thread operations) it's worth considering whether the "slower" language actually comes out ahead. All the talk about Rust's fearless multithreading doesn't necessarily make it a good language, merely a less correctness-bug-prone alternative to C++!

I've run the benchmarks on my machine with an up-to-date Rust install; to rule out old library bugs, I've also upgraded Rust's dependencies as far as they'll go. For the Java samples, I'm running Java 18 with the default GC (haven't set up 19 on my machine yet).

The results are kind of wild. On the Rust side, the consumers take 46-70ns/op on my machine while the producer takes 400-800ns/op. When I run the Java benchmark, the consumers take a mere 4.71-12.24ns/op (10-65x faster!) but the producer takes a whopping 15486-25810ns/op, 40-368x slower! I've run the benchmark on Java 8 and there the performance for consumers is even better (3.4-6.4ns/op) but the producers continue to be incredibly slow.

I'm not sure what's causing this massive difference in performance (the comments on the article suggest that the producing code is triggering GC, which might very well explain the difference) but it's an interesting result to note.

I'm sure Java experts and Rust experts will both have suggestions about optimizing this code further to make the samples go faster, of course. Rust's multithreaded workloads _can_ be a lot faster (the prime drag race proves that [2]) but I don't think that's the point of this benchmark.

[1] https://morestina.net/blog/784/exploring-lock-free-rust-3-cr...

[2] https://plummerssoftwarellc.github.io/PrimeView/report?id=19...


One reason to like the "Prime Drag Race" is that it's easier to check whether you just cheated by mistake - because there's a correct answer. Ah, I forgot to actually do the computation, no wonder this new version was so quick!

I'm pretty sure the Java benchmarking code is bogus, and I'm also pretty sure the Rust benchmarking code is bogus, and I have no idea whether the effects cancel out, or are even meaningful. I think if I was presented with this code my question would be "What is the actual problem? What are we actually trying to do, because this isn't anything"

Because this is a wait-free algorithm, the most practical arrangement is to do unlimited production and measure the consumers, then separately do unlimited consumption and measure the producer. But even with this, you also want to measure the transformation rate, and that isn't done here so far as I can see.
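To make that concrete, a rough sketch of such a harness (std-only; a plain AtomicU64 stands in for the shared structure, so this shows the measurement arrangement rather than the article's data structure): the producer runs unthrottled while only the consumer side is timed, and the production count is reported alongside it.

    use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::{Duration, Instant};

    fn main() {
        let value = Arc::new(AtomicU64::new(0)); // stand-in for the shared cell
        let stop = Arc::new(AtomicBool::new(false));

        // Producer: runs flat out; its rate is reported, not timed per op.
        let producer = {
            let value = Arc::clone(&value);
            let stop = Arc::clone(&stop);
            thread::spawn(move || {
                let mut i = 0u64;
                while !stop.load(Ordering::Relaxed) {
                    i += 1;
                    value.store(i, Ordering::Release); // "publish" a value
                }
                i // total values produced during the run
            })
        };

        // Consumer: the side being measured in this pass.
        let start = Instant::now();
        let mut reads = 0u64;
        let mut sink = 0u64;
        while start.elapsed() < Duration::from_secs(1) {
            sink = sink.wrapping_add(value.load(Ordering::Acquire)); // "consume"
            reads += 1;
        }
        let elapsed = start.elapsed();

        stop.store(true, Ordering::Relaxed);
        let produced = producer.join().unwrap();

        println!(
            "consumer: {:.1} ns/op over {} reads (sink={}); producer made {} values",
            elapsed.as_nanos() as f64 / reads as f64,
            reads,
            sink,
            produced
        );
    }

Swapping the roles (unlimited consumption, timed producer) gives the other half of the picture.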

What if I told you the Rust does 10 000 transformations per second, while the Java does only 100? What if vice versa? Suddenly the "performance" takes on a different hue in each case. A single overall number without this information is at least somewhat misleading, even if (which I don't think is assured here) both measurements are of flat-out runs.



