
Lock-Free Rust: Crossbeam in 2019 - gbrown_
https://stjepang.github.io/2019/01/29/lock-free-rust-crossbeam-in-2019.html
======
pcwalton
People (including me) always talk about how memory safety without GC is the
most important feature of Rust, but just as important is how easy Rust makes
fine-grained parallelism. In Rust, you can often start off writing sequential
code and then bolt parallelism on after the fact and have it all "just work"
with improved performance, thanks to libraries like Crossbeam and Rayon and
the way the ownership/borrowing rules enforce data race freedom. That's really
powerful.

~~~
deepsun
Wait, but Crossbeam does have a GC inside it.

~~~
kinghajj
Yes, but that's only to support its goal of efficient lock-free data
structures, not for Rust's goal of memory safety.

------
2bitencryption
Allow me to ask a dumb question:

At the end of the day, how can something that is safely concurrent also be
lock-free? At the lowest level, what is the primitive that enforces the
safety, if it's not a lock or mutex?

My brain starts to run in circles when I think of this scenario: two threads
trying to write to one piece of data. To do so safely, they need to take
turns. Therefore, one has to wait for the other to complete. Therefore, one
must acquire a mutually exclusive... hold on, that's a mutex!

Can someone please clear this up for me?

~~~
coder543
One word: atomics.

A simple example would be concurrently writing into a shared queue. To make it
really simple, let's assume that this queue's buffer can only be written once,
and then the program has to exit.

If we have a buffer in this queue that can hold 20 messages, and an atomic
integer that represents the current index, then we could have two (or however
many) threads writing into it at the same time by just doing "index =
myAtomic.fetch_add(1)", which will _atomically_ add one to the index, then
return the previous value. Atomics are supported at a hardware level, so
they're generally pretty efficient, and there definitely is no lock like a
Mutex involved here. In the end, both threads are able to write into shared
memory without conflicting with each other. Using one or two more atomic
integers, we could support having one or more readers reading off of the queue
at the same time that we're writing to it, and we could get to the point where
we're able to start reusing the buffer in a circular fashion.

~~~
pmalynin
While what you said is all true, it's not turtles all the way down.

Eventually when it comes down to the hardware level the processor will have to
assert #LOCK (or whatever mechanism it uses). So arguably even atomics aren't
"lock" free, somewhere along the line some piece of hardware will have to
block all other hardware to do its thing. DRAM can only read one thing at a
time (and has to write it back afterwards).

~~~
deepsun
Yes, you're right. It's turtles all the way down until the RAM's locking
mechanisms.

RAM does the locking internally and exposes it as primitives like "compare-
and-set" and "compare-and-swap". Programming languages that use those
primitives usually call the result "atomic data structures". The thing is that
RAM is much faster in this regard than locking at the user/kernel level, so to
the program it looks like there are no locks. But atomics do slow your program
down, if only a little, because "compare-and-set" is still slower than a plain
"set".

~~~
Gladdyu
The RAM doesn't actually know anything about locking; it's the cache coherence
protocol that does this. The CPU requests a cache line in exclusive state,
performs the operation on it, and ensures that the line remained exclusive to
that core for the duration. Afterwards, all the other cores observe the change
(and, depending on the memory model and ordering, the operations that happened
before and after it).

------
jph
Excellent! Crossbeam is for concurrent programming.

Crossbeam provides lock-free data structures (e.g. ArrayQueues), thread
synchronization (e.g. ShardedLock), memory sharing (e.g. AtomicCell), and
utilities (e.g. Backoff).

Thank you to the author and team! Next I am very much looking forward to a
Crossbeam implementation of the LMAX Disruptor ring queue: [https://lmax-
exchange.github.io/disruptor/](https://lmax-exchange.github.io/disruptor/)

~~~
strictfp
LMAX? Really? Did you get LMAX to work well? I never did.

~~~
jph
More specifically, I mean a Rust ring buffer data structure, implemented by
using Crossbeam tooling, and sized to fit into CPU cache.

For readers who are interested: [https://dzone.com/articles/ring-buffer-a-
data-structure-behi...](https://dzone.com/articles/ring-buffer-a-data-
structure-behind-disruptor)

------
nindalf
> So these are the things I’d like to see in Crossbeam this year:
> AtomicReference and ConcurrentHashMap written in Rust.

Really looking forward to ConcurrentHashMap.

~~~
pedrocr
I've hacked around this issue in a contended hashmap by doing a
Vec<RwLock<HashMap>> where you index into the Vec with the first bits of the
key hash and then use the HashMap within it:

[https://github.com/pedrocr/syncer/blob/master/src/rwhashes.r...](https://github.com/pedrocr/syncer/blob/master/src/rwhashes.rs)

Worked fine but a full ConcurrentHashMap would be much nicer.

~~~
pas
Any thoughts on evmap?

~~~
pedrocr
I had never seen it. The API seems more complex to support
consistency/performance tradeoffs. I'm not sure if it would work in my case as
I definitely want writes to block if two threads are accessing the same entry.
It also doesn't support concurrent writes so that would be a huge penalty.

It may very well be that my hack is actually a very reasonable way to go about
this. If I need more concurrency I can just increase the number of bits of the
hash and get more individual buckets. It does make me slightly uneasy that
when two keys hash to the same prefix I end up with a pathological worst case
when I should get perfect scaling instead because there's no actual
contention.

~~~
pas
The Vec<RwLock<HashMap>> seems like a great hack, though you might still
benefit from trying to come up with a scheme that avoids that RwLock (which
internally uses a Mutex - as far as I know, even on reads), which can be slow
if you have a lot of reads. (That's why evmap [and most lock-free structures
I've ever heard of] use epochs [which is kind of like double buffering writes
to batch up updates].)

~~~
pedrocr
The problem is that I'm actually using the HashMap not only for concurrency
but also for synchronization. Looking at the Java ConcurrentHashMap it
wouldn't work. I need the equivalent of an RwLock per key so that stale data
is never read and there are never two writers for the same key. But thinking
about it, it's a fairly different data structure from ConcurrentHashMap.

------
tombert
Lock-free structures have always felt like the "holy grail" of concurrent
programming. I remember being blown away when I read through the paper on
CTries (which I'm assuming ConcurrentHashMap would be based on), and even more
blown away by how well they performed.

I always assumed that Ctries basically necessitated a GC, but I am very happy
to be wrong about this!

------
crimsonalucard
Forgive me, I'm not that familiar with rust, but I assumed that borrow
checking got rid of the notion of two threads sharing the same data structure
and therefore got rid of the need for locks? What's going on with this
library? Are locks often used in rust?

~~~
winstonewert
No, Rust has locks and shared data structures. What it does is enforce their
usage. It will be a compile error if you try to modify the same data structure
from multiple threads without a lock.

~~~
steveklabnik
You sometimes do not need a lock, depending. Scoped threads can have two
different threads modify two different elements of a vector simultaneously
without locks, for example.

------
presscast
I've been looking for something like this in Go (yes, yes, I know... lack of
generics...). Does such a thing exist?

~~~
majewsky
I doubt it. Go has good performance for most applications out of the box, but
if you're hitting the limits of what `chan` can do, it's either about to get
very ugly (which goes against everything that Go stands for) _or_ you should
be looking at something like Rust at least for the hot paths.

~~~
Thaxll
Not really. If chans are a problem perf-wise, then you use a mutex; it doesn't
have to be ugly.

~~~
vardump
Mutexes _cause_ performance problems.

~~~
Thaxll
What I'm saying is that if channels are too slow, you can reimplement
something with a mutex that will be faster, and it doesn't have to be ugly.

[https://youtu.be/DJ4d_PZ6Gns?t=535](https://youtu.be/DJ4d_PZ6Gns?t=535) (
it's one of the best Go performance videos on YouTube )

~~~
jerf
There are applications for which even that is not enough. When you get to that
point, it's best to not use Go. In fact it's best to have known that you would
get to that point and not use Go in the first place.

One example is high-performance routing on 10Gbps networks using user-space
networking. Pretty much every cycle counts into your throughput numbers,
because the rest of the hardware can barely keep up with the network card even
before you're trying to do something. Go is a poor fit for this use case.

(This from someone who has been accused of being a Go shill. I use it a lot
and like it a lot professionally, but it is no more suitable for every task
than any other language.)

~~~
Thaxll
It's true, but you're talking about a very specific case, one that can only be
handled by C / C++ / Rust. For 99% of scenarios it won't be an issue.

There are large-scale online services ( in the millions of req/sec ) that run
on Go without problems.

------
CyberDildonics
> the blog post titled Lock-freedom without garbage collection, which
> demonstrates that one doesn’t need a language with a tracing garbage
> collector to write fast lock-free programs. The secret sauce is a technique
> called epoch-based garbage collection, which is much different from
> traditional garbage collectors and is easily implemented as a library.

I'm not sure where there is a focus on lock free programming needing any kind
of garbage collection. Garbage collection is just for heap allocation and
there are already lock free heap allocators. Memory allocation isn't, in my
experience, a major difficulty of lock free data structures.

~~~
gpderetta
The issue is memory reclamation in node-based data structures with multiple
concurrent lock-free writers. With GC it is a non-issue. Otherwise you have to
resort to reference counting (or GC by another name), hazard pointers, RCU or
similar.

~~~
vardump
> Otherwise you have to resort to reference counting (or GC by another name),
> hazard pointers, RCU or similar.

Yeah. And atomic reference counting is expensive (20-80M contested atomic ops
per second until _all_ CPU cores are saturated), hazard pointers... are hard,
and RCU can block.

~~~
CyberDildonics
If someone is trying to do concurrency by using 80 million contested atomic
ops per second, they are likely doing just about everything wrong. The
currency of concurrency is isolation from dependencies and synchronization. 80
million small synchronizations per second is the polar opposite of how to get
good performance.

~~~
vardump
That was my point. The reason why atomic reference counting can be a bad idea.

~~~
CyberDildonics
Anything can be a bad idea if it is abused and having 80 million individual
overlapping reads of individual shared objects is total nonsense. That kind of
synchronization on tiny bits of data would just indicate terrible design
choices more than anything. Atomic reference counting can be extremely useful,
simple, elegant and fast, but there is no single silver bullet to concurrency.

~~~
gpderetta
Having an arbitrarily large number of concurrent read operations and expecting
no or minimal contention is completely reasonable.

~~~
CyberDildonics
That depends on the technique, the lifetime of the data and the lifetime of
the memory that holds it.

If you want to make sure the data won't be touched and the memory won't be
freed while you read it, reference counting can be a great technique.

If the memory allocation (including size) is already known to be stable but
the data could change, the data can be read along with a time or version
number that will let the reader make sure it didn't change while it was being
read.

If the data can be read atomically, this isn't a problem. If the 'data' is
just a pointer that is meant to transfer ownership to the reading thread this
isn't a problem, etc.

The underlying principle here is that there are many different techniques and
design trade-offs when it comes to concurrency and synchronization.
Discounting one thing because it isn't a silver bullet is ridiculous, because
there are no silver bullets. A system has to be architected as a whole.

