Hacker News new | past | comments | ask | show | jobs | submit login
Lock-Free Rust: Crossbeam in 2019 (stjepang.github.io)
322 points by gbrown_ 5 months ago | hide | past | web | favorite | 72 comments

People (including me) always talk about how memory safety without GC is the most important feature of Rust, but just as important is how easy Rust makes fine-grained parallelism. In Rust, you can often start off writing sequential code and then bolt parallelism on after the fact and have it all "just work" with improved performance, thanks to libraries like Crossbeam and Rayon and the way the ownership/borrowing rules enforce data race freedom. That's really powerful.

And if it doesn't "just work" then there will be a clear error message pointing to the problem instead of nasal demons haunting you. That way you can debug and fix the problem much more easily.

Wait, but Crossbeam does have a GC inside it.

Yes, but that's only to support its goal of efficient lock-free data structures, not for Rust's goal of memory safety.

Allow me to ask a dumb question:

At the end of the day, how can something that is safely concurrent also be lock-free? At the lowest level, what is the primitive that enforces the safety, if it's not a lock or mutex?

My brain starts to run in cricles when I think of this scenario: two threads trying to write to one piece of data. To do so safely, they need to take turns. Therefore, one has to wait for the other to complete. Therefore, one must acquire a mutually exclusive... hold on, that's a mutex!

Can someone please clear this up for me?

One word: atomics.

A simple example would be concurrently writing into a shared queue. To make it really simple, let's assume that this queue's buffer can only be written once, and then the program has to exit.

If we have a buffer in this queue that can hold 20 messages, and an atomic integer that represents the current index, then we could have two (or however many) threads writing into it at the same time by just doing "index = myAtomic.fetch_add(1)", which will atomically add one to the index, then return the previous value. Atomics are supported at a hardware level, so they're generally pretty efficient, and there definitely is no lock like a Mutex involved here. In the end, both threads are able to write into shared memory without conflicting with each other. Using one or two more atomic integers, we could support having one or more readers reading off of the queue at the same time that we're writing to it, and we could get to the point where we're able to start reusing the buffer in a circular fashion.

It’s when you want to synchronize multiple atomics it gets complicated.

With a mutex you can lock for an entire ”transaction” where you fetch a unique index and write your value to the queue. With just atomics you can get a unique index into the queue concurrently with other threads, sure, but how do you mark it as ready to be unqueued once you are done writing your data?

Thats where lock free data structures come into play. They have solved this intricate synchronization between multiple atomics so you don’t have to.

While what you said is all true, its not turtles all the way down.

Eventually when it comes down to the hardware level the processor will have to assert #LOCK (or whatever mechanism it uses). So arguably even atomics aren't "lock" free, somewhere along the line some piece of hardware will have to block all other hardware to do its thing. DRAM can only read one thing at a time (and has to write it back afterwards).

The hardware guarantees forward progress at the very least and possibly some amount t of fairness. So the underlying hardware is at least lock free and possibly wait free. Usually a CPU cannot hold a cache line indefinitely, another CPU can always sneak in and steal it.

Yes, you're right. It's turtles until RAM locking mechanisms.

RAM does the locking inside it, and expose it as "compare-and-set" and "compare-and-swap" etc primitives. Various computing languages that use those primitives usually call that "atomic data structures". The thing is that RAM is way faster in that regard than locking on user/kernel level, so for program it looks like there's no locks. But atomics indeed do slow down your program, if just a little, because "compare-and-set" is still slower than just "set".

The RAM does not know anything about locking, it's the cache coherence protocol. The CPU will request a cache line in exclusive state, it does the operation on the memory and ensures that during it has always been exclusive to that core. After all the other cores will observe the change (and depending on the memory model+ordering, the operations that have/will happen before and after).

For a general overview of what lock-free and wait-free algorithms mean, see https://en.wikipedia.org/wiki/Non-blocking_algorithm.

Much of the foundational theoretical work in this area was done by Maurice Herlihy in the 1980s and 1990s. For example, https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf I briefly paid for an ACM membership just so I could read his original work. (Well, I paid so I could read the literature in transactional memory, but I mostly ended up reading papers authored by Herlihy, co-authored by Herlihy, or analyzing Herlihy's work.)

"The Art of multiprocessor programming" by Herilhy and Shavit is also great.

I've read one of these articles and bookmarked the other. They give some introduction to the problem space, with pictures. I'm no expert though and can't rate their accuracy; Just know they helped me feel a little more confident about the basics.



FYI lock-free programming does not mean "free of locks" but rather "free of lockups", as in, there isn't any possible arrangement in which two threads could end up blocking each other indefinitely (as could happen with mutexes and deadlocks).

From [0]: "An algorithm is lock-free if, when the program threads are run for a sufficiently long time, at least one of the threads makes progress (for some sensible definition of progress)."

[0] https://en.wikipedia.org/wiki/Non-blocking_algorithm

As I understand it, it does actually mean "free of locks", since a lock-free algorithm, by definition, cannot be implemented in terms of classical locks (typically mutexes) [1] [2]. It's "free of lockups" by virtue of, well, being free of locks.

Note that there's also the distinction between lock-free algorithms and lock-free data structures. Crossbeam is the latter, I believe (but I could be wrong).

[1] https://www.cs.cmu.edu/~410-s05/lectures/L31_LockFree.pdf

[2] https://erdani.com/publications/cuj-2004-10.pdf

If you are in an environment where you can guarantee that a piece of code runs in bounded time (no interrupts, no faults, etc), then locks can actually be used in lock-free algorithms since code inside the lock will always be making forward progress and will exit in or before some deterministic time. If the lock is fair and there's a bounded number of threads accessing the lock, then it technically becomes wait-free under these conditions since time to completion is bounded by n_threads * task_time;

This environment only happens to exist in very specific and uncommon settings, so it's not a generally useful concept for lock-free programming. It's common in the linux kernel to see spinlocks taken in sections which have disabled all interrupts, for example, but this is not an environment that normal programs can live in.

I don't think that is quite correct. By that definition, any alhorithm that is free of deadlocks would be a lock free algorithm. But, I'm fairly sure that isn't the case - otherwise a hashmap protected by a single mutex would be considered a lock free data structure since as there is only a single mutex, deadlock would be impossible.

I think the key paragraph from the Wikipedia article is:

"In particular, if one thread is suspended, then a lock-free algorithm guarantees that the remaining threads can still make progress. Hence, if two threads can contend for the same mutex lock or spinlock, then the algorithm is not lock-free. (If we suspend one thread that holds the lock, then the second thread will block.)"

Yes, you are correct. I missed the "some thread must always be able to make progress regardless of the state of the other threads".

What are you calling the "lowest level"? Because at the CPU level there is no arbitrary scope mutex. That's not a thing that exists at the lowest levels. The mutex/lock that exists in basically every language is actually built on top of atomics, that's your "building block" so to speak.

There is an x86 instruction prefix named 'lock', but it only applies to a single instruction and is considered in the family of being atomic rather than a mutex.

You can't completely remove locks. You can make them very small and hold them for a very short tine using hardware support. Atomic increments, atomic compare-and-exchange are supported by modern CPUs.

On that base, you can build lock-free data structures.

There are some pretty good answers to this question posted, but the simple one is that there is, really, no such thing as a lock-free anything.

When people say "lock-free", they mean they rely on basic hardware primitives that end up doing pretty much what you would do with locks in your code, but you can't see it. Generally, the hardware can do it faster, but rarely as much faster as people think, or want. In particular, you will still have stalls -- 30 cycles, 100 cycles, sometimes longer, whenever anything clashes, "lock-free" or no. Everything you do to cut contention at your level will pay just as it would if you were using regular locking. If you can cut your contention enough, lock-free ends up little faster (although it can still reduce latency spikes).

If the lock-free stuff is buried in a library well-maintained by somebody else who is terrifyingly competent, then use it. But it is very, very unlikely that maintaining your own will be worthwhile; the most likely outcome is extremely subtle bugs. And who cares how fast buggy code is? There are much easier ways to make wrong code fast.

The take away is that lock free != Fast. Some lock free algorithms and data structures happen to be fast, but in general maintaining the lock-, and especially wait-, free guarantees is not cheap.

Excellent! Crossbeam is for concurrent programming.

Crossbeam provides lock-free data structures (e.g. ArrayQueues), thread synchronization (e.g. ShardedLock), memory sharing (e.g. AtomicCell), and utilties (e.g. Backoff).

Thank you to the author and team! Next I am very much looking forward to a Crossbeam implementation of the LMAX Disruptor ring queue: https://lmax-exchange.github.io/disruptor/

LMAX? Really? Did you get LMAX to work well? I never did.

More specifically, I mean a Rust ring buffer data structure, implemented by using Crossbeam tooling, and sized to fit into CPU cache.

For readers who are interested: https://dzone.com/articles/ring-buffer-a-data-structure-behi...

> So these are the things I’d like to see in Crossbeam this year: AtomicReference and ConcurrentHashMap written in Rust.

Really looking forward to ConcurrentHashMap.

I've hacked around this issue in a contended hashmap by doing a Vec<RwLock<HashMap>> where you index into the Vec with the first bits of the key hash and then use the HashMap within it:


Worked fine but a full ConcurrentHashMap would be much nicer.

Any thoughts on evmap?

I had never seen it. The API seems more complex to support consistency/performance tradeoffs. I'm not sure if it would work in my case as I definitely want writes to block if two threads are accessing the same entry. It also doesn't support concurrent writes so that would be a huge penalty.

It may very well be that my hack is actually a very reasonable way to go about this. If I need more concurrency I can just increase the number of bits of the hash and get more individual buckets. It does make me slightly uneasy that when two keys hash to the same prefix I end up with a pathological worst case when I should get perfect scaling instead because there's no actual contention.

The Vec<RwLock<HashMap>> seems like a great hack, though you might still benefit from trying to come up with a scheme that avoids that RwLock (which internally uses a Mutex - as far as I know, even on reads), which can be slow if you have a lot of reads. (That's why evmap [and most lock-free structures I've ever heard of] use epochs [which is kind of like double buffering writes to batch up updates].)

The problem is that I'm actually using the HashMap not only for concurrency but also for synchronization. Looking at the Java ConcurrentHashMap it wouldn't work. I need the equivalent of an RwLock per key so that stale data is never read and there are never two writers for the same key. But thinking about it, it's a fairly different data structure from ConcurrentHashMap.

Lock-free structures have always felt like the "holy grail" of concurrent programming. I remember being blown-away when I read through the paper on CTries (which I'm assuming ConcurrentHashMap would be based on), and even more blown away about how well they performed.

I always assumed that Ctries basically necessitated a GC, but I am very happy to be wrong about this!

Forgive me, I'm not that familiar with rust, but I assumed that borrow checking got rid of the notion of two threads sharing the same data structure and therefore got rid of the need for locks? What's going on with this library? Are locks often used in rust?

The borrow checker allows two threads to share the same data structure immutably without locks. If you need to mutate the data structure from multiple threads, you need some kind of synchronization, and the borrow checker enforces that this synchronization is present. Rust doesn't prevent you from using synchronized shared mutable state, though of course if you want to program without any shared mutable state that's a perfectly valid option too.

No, Rust has locks and shared data structures. What it does is enforce their usage. It will be a compile error if you try to modify the same data structure from multiple threads without a lock.

You sometimes do not need a lock, depending. Scoped threads can have two different threads modify two different elements of a vector simultaneously without locks, for example.

Rust enforces exclusive access to a variable before it allows mutation.

But there are plenty of data types like smart pointers, where multiple copies of the original smart pointer points to the same underlying data.

In the eyes of the compiler, these are different variables, and can be manipulated independently.

When writing said smart pointer, however, Rust won't allow you to share data between copies unless you explicitly ask for disabling some of the compile checks. When you do this, you need to manually ensure (thread) safety. And this is where locks and other concurrency primitives generally come in handy.

Ah so locks come into play when you deliberately disable compile checks?

Not exactly. Disabling the borrow checker is an internal implementation detail of the locking types, but it happens opaquely to the code using the locks.

The locking types make assurances with regards to the lifetimes of their parameters / contents / return values and these are still enforced by the compiler.

The borrow checker tracks two kinds of borrows (in addition to ownership), &T for a constant reference and &mut T for a mutable reference. But it's not really about constant/mutable, it's really about shared/unique or, if you prefer, reader/writer. Essentially the borrow checker is a compile-time reader-writer lock: multiple threads or multiple structures can have a read borrow at once, but only one can have a write borrow and you need to get rid of the read borrows first.

But there are common use cases where you need to have concurrent write access to the same structure from multiple threads. The way to implement this is "interior mutability," where you write a method that takes an &self - a shared reference - but can still mutate it. A simple example of this is the std::sync::atomic types (https://doc.rust-lang.org/std/sync/atomic/). There's a method AtomicUsize::compare_and_swap that takes an &self, not an &mut self, as well as the values for comparing and swapping. Because it executes in an atomic machine instruction (or otherwise figures out how to make it atomic), it's okay for compare_and_swap to take only a &self. There's no risk of data races/memory corruption from multiple threads calling it at once. You're dynamically enforcing safety: if the compare doesn't work, the compare_and_swap operation returns an error. There's also a method AtomicUsize::get_mut(&mut self) -> &mut usize that takes a unique reference to the AtomicUsize type, and gives you a unique reference to the underlying integer. As long as you hold a unique borrow, you can do normal memory read/write operations to the integer inside it, but at this point the compiler will statically make sure that nobody else can take shared references to the same AtomicUsize.

You can always do what you want via message-passing between threads, but sometimes you want higher performance. (I guess that's why you're using threads in the first place; you can always write code in a single-threaded manner if you want.) std::sync::atomic and a few other standard library types help you with this; crossbeam really helps you with this. And they all integrate into the borrow checker pretty smoothly.

See also the Rust book's chapter on interior mutability https://doc.rust-lang.org/book/ch15-05-interior-mutability.h... and the "Too Many Lists" exercise https://cglab.ca/~abeinges/blah/too-many-lists/book/README.h... for more examples.

I've been looking for something like this in Go (yes, yes, I know... lack of generics...). Does such a thing exist?

I doubt it. Go has good performance for most applications out of the box, but if you're hitting the limits of what `chan` can do, it's either about to get very ugly (which goes against everything that Go stands for) or you should be looking at something like Rust at least for the hot paths.

Not really if chan are problem perf wise then you use mutex, it doesn't have to be ugly.

Mutexes cause performance problems.

What I'm saying is if channel are too slow you can re implement something with mutex that will be faster and it doesn't have to be ugly.

https://youtu.be/DJ4d_PZ6Gns?t=535 ( it's one of the best Go performance video on Youtube )

There are applications for which even that is not enough. When you get to that point, it's best to not use Go. In fact it's best to have known that you would get to that point and not use Go in the first place.

One example is high-performance routing on 10Gbps networks using user-space networking. Pretty much every cycle counts into your throughput numbers, because the rest of the hardware can barely keep up with the network card even before you're trying to do something. Go is a poor fit for this use case.

(This from someone who has been accused of being a Go shill. I use it a lot and like it a lot professionally, but it is no more suitable for every task than any other language.)

It's true but you're talking about a very specific case, a case that can only runs on C / C++ / Rust, for 99% of scenarios it won't be an issue.

There are large scale online services ( in the millions req/sec ) that runs on Go without problems.

The point is: channels are slower than mutices.

The one thing I would love to see a good story in chan/go concurrency in general for still is 1-to-n fanout. Right now a lock and a loop seems to be the answer, which is a bit of a blunt tool. Possible I'm missing an option, though!

1 producer to many consumers?

just spawn goroutines that select on the (optionally buffered) channel

if you need to fan out even further, repeat this on the spawned routines

Not quite. Produce one message, have arbitrary <n> consumers all read that exact message.

>about to get very ugly

That's all relative. There are legitimate reasons to trade a bit of complexity for a bit of speed. Go is not special in this regard.

There are some similar lock free data structures in golang up on github such as "onering".

One challenge of writing this kind of code in go is that go routines are not pre-emptive, so you can't just port over an implementation that works in c.

> the blog post titled Lock-freedom without garbage collection, which demonstrates that one doesn’t need a language with a tracing garbage collector to write fast lock-free programs. The secret sauce is a technique called epoch-based garbage collection, which is much different from traditional garbage collectors and is easily implemented as a library.

I'm not sure where there is a focus on lock free programming needing any kind of garbage collection. Garbage collection is just for heap allocation and there are already lock free heap allocators. Memory allocation isn't, in my experience, a major difficulty of lock free data structures.

The issue is memory reclamation in node based data structures with multiple concurrent lock free writers. With GC it is a non issue. Otherwise you have resort to reference counting (or GC by another name), hazard pointers, RCU or similar.

Yep, but even reference counting doesn't really help: the basic building block of many lock free algorithms is the ability to read/write a reference to/from a shared location atomically.

With a reference counted pointer, you cannot "read" it without incrementing the count, or else someone else might drop the count to zero and free the memory. You would need hardware support for atomically reading a pointer, dereferencing it, and then incrementing the value at that location.

GC (whether tracing or epoch based), hazard pointers or avoiding dynamic memory allocations in the first place (so all memory is pre-allocated, outside of the lock free parts) are the only solutions to this I'm aware of.

True. IIRC there are ways [1] to implement try lock free updates of reference counted pointers but they are a bit exotic and not cheap.

[1] at the very least you can implement a lock free n-CAS operation given a plain CAS.

> Otherwise you have resort to reference counting (or GC by another name), hazard pointers, RCU or similar.

Yeah. And atomic reference counting is expensive (20-80M contested atomic ops per second until all CPU cores are saturated), hazard pointers... are hard, and RCU can block.

I guess you can think of hazard pointers as a degenerate mark and sweep concurrent GC, where the hazard pointers themselves are the roots and the scan goes only one level deep.

If someone is trying to do concurrency by using 80 million contested atomic ops per second, they are likely doing just about everything wrong. The currency of concurrency is isolation from dependencies and synchronization. 80 million small synchronizations per second is the polar opposite of how to get good performance.

But one issue with RC is that pure read operations might require refcount updates so even a read mostly data structure that it is expected to scale well with the number of readers will perform horribly as each reader acquires and release references.

That was my point. The reason why atomic reference counting can be a bad idea.

Anything can be a bad idea if it is abused and having 80 million individual overlapping reads of individual shared objects is total nonsense. That kind of synchronization on tiny bits of data would just indicate terrible design choices more than anything. Atomic reference counting can be extremely useful, simple, elegant and fast, but there is no single silver bullet to concurrency.

Having an arbitrary large number of concurrent read operations and expect no or minimal contention is completely reasonable.

That depends on the technique, the lifetime of the data and the lifetime of the memory that holds it.

If you want to make sure the data won't be touched and the memory won't be freed while you read it, reference counting can be a great technique.

If the memory allocation (including size) is already known to be stable but the data could change, the data can be read along with a time or version number that will let the reader make sure it didn't change while it was being read.

If the data can be read atomically, this isn't a problem. If the 'data' is just a pointer that is meant to transfer ownership to the reading thread this isn't a problem, etc.

The underlying principal here is that there are many different techniques and design trade-offs when it comes to concurrency and synchronization. Discounting one thing because it isn't a silver bullet is ridiculous, because there are not silver bullets. A system has to be architected as a whole.

Garbage collection just shifts some of the responsibility, which means that garbage collector dictates part of the algorithm.

When there are 'lock free' data structures that are just flipping pointers with compare and exchange instructions, they are essentially ignoring the storage of memory all together and hoping the garbage collection takes care of it, while optimally the data structure would confront the storage of data and organization of access in an integrated way.

> Garbage collection just shifts some of the responsibility, which means that garbage collector dictates part of the algorithm.

No. In this case it allows cheap, (mostly) unsynchronized deallocation. Otherwise you'll have to synchronize object deletion. Not only can that be tricky to get right (use-after-free hazard), but the required synchronization consumes a lot of CPU cycles. Ask any C++ programmer.

With (limited) GC, you only need to trace reachability to free the objects, avoiding almost all of the expensive synchronization operations.

You don't even need to use GC in general case, just for the data structures.

> No. In this case it allows cheap, (mostly) unsynchronized deallocation.

There isn't anything about garbage collection that enables something that couldn't be done before. If a lock free data structure is holding other non-trivial data structures that have no notion of concurrency, you are already in a position where you have to take what you can get.

> synchronization consumes a lot of CPU cycles. Ask any C++ programmer.

I've done a large amount of lock free programming in C++ and the best scenario is lock free data structure that takes responsibility for all the data stored in it and isn't just a container of pointers is the best approach to control the amount and types of synchronization. Atomic reference counting with the last reading thread responsible for deletion is just fine as an approach.

Ultimately, having granular but heap allocated data structures in a 'lock free' container needs to come with proper expectations, since it is a poor way to scale concurrency. Granular synchronization can be fine if typical usage isn't going to see much overlap or if the whole thing won't see much use in the first place, but for concurrency to scale, larger chunks of data separated out by their dependencies needs to be used.

> There isn't anything about garbage collection that enables something that couldn't be done before. ...

True, there's always another way. But you can say same about almost any subject. In my experience about lock-free (and wait-free) programming, safe and high performance deallocation of the objects is often a problem. (Especially in kernel mode when context switching is not an option.)

Whenever high performance is required and it's possible to avoid atomics (CAS, FAA, LL/SC, etc.), (false) cache line sharing or even memory operations altogether, I'll take it. Not to mention CAS (or equivalent) loops and mutexes...

> True, there's always another way. But you can say same about almost any subject.

This thread was started because the article focuses on lock free programming with garbage collection when garbage collection is not any sort of fundamental factor in lock free programming.

> In my experience about lock-free (and wait-free) programming, safe and high performance deallocation of the objects is often a problem.

Concurrency bottlenecks are fundamentally about synchronization. The solution is frequently to synchronize less and that means doing more in between synchronizing. This implies allocating and deallocating more at one time.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact