At the end of the day, how can something that is safely concurrent also be lock-free? At the lowest level, what is the primitive that enforces the safety, if it's not a lock or mutex?
My brain starts to run in circles when I think of this scenario: two threads trying to write to one piece of data. To do so safely, they need to take turns. Therefore, one has to wait for the other to complete. Therefore, one must acquire a mutually exclusive... hold on, that's a mutex!
Can someone please clear this up for me?
A simple example would be concurrently writing into a shared queue. To make it really simple, let's assume that this queue's buffer can only be written once, and then the program has to exit.
If we have a buffer in this queue that can hold 20 messages, and an atomic integer that represents the current index, then we could have two (or however many) threads writing into it at the same time by just doing "index = myAtomic.fetch_add(1)", which will atomically add one to the index, then return the previous value. Atomics are supported at a hardware level, so they're generally pretty efficient, and there definitely is no lock like a Mutex involved here. In the end, both threads are able to write into shared memory without conflicting with each other. Using one or two more atomic integers, we could support having one or more readers reading off of the queue at the same time that we're writing to it, and we could get to the point where we're able to start reusing the buffer in a circular fashion.
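To make that concrete, here's a minimal sketch in Rust (names and values are mine, just for illustration): each thread claims a slot with fetch_add and then writes into it without any mutex.

  use std::sync::atomic::{AtomicUsize, Ordering};
  use std::sync::Arc;
  use std::thread;

  const CAPACITY: usize = 20;

  fn main() {
      // Write-once buffer: every slot is claimed by exactly one thread.
      let index = Arc::new(AtomicUsize::new(0));
      let buffer: Arc<Vec<AtomicUsize>> =
          Arc::new((0..CAPACITY).map(|_| AtomicUsize::new(0)).collect());

      let handles: Vec<_> = (0..2usize)
          .map(|t| {
              let index = Arc::clone(&index);
              let buffer = Arc::clone(&buffer);
              thread::spawn(move || loop {
                  // fetch_add atomically bumps the index and returns the
                  // previous value, so no two threads get the same slot.
                  let i = index.fetch_add(1, Ordering::Relaxed);
                  if i >= CAPACITY {
                      break; // buffer is write-once: stop when it's full
                  }
                  buffer[i].store(t * 1000 + i, Ordering::Relaxed);
              })
          })
          .collect();

      for h in handles {
          h.join().unwrap();
      }
  }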
With a mutex you can lock for an entire "transaction" where you fetch a unique index and write your value to the queue. With just atomics you can get a unique index into the queue concurrently with other threads, sure, but how do you mark it as ready to be dequeued once you are done writing your data?
That's where lock-free data structures come into play: they have solved this intricate synchronization between multiple atomics so you don't have to.
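One common answer to the "how do you mark it ready?" question is a per-slot flag published with release/acquire ordering. A hypothetical sketch (not how any particular library actually does it):

  use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

  // A slot the writer fills and then marks ready; the reader only
  // trusts `value` after it has observed `ready == true`.
  struct Slot {
      value: AtomicUsize,
      ready: AtomicBool,
  }

  impl Slot {
      fn publish(&self, v: usize) {
          self.value.store(v, Ordering::Relaxed);
          // Release store: everything written before it becomes visible
          // to a reader whose Acquire load sees `ready == true`.
          self.ready.store(true, Ordering::Release);
      }

      fn try_consume(&self) -> Option<usize> {
          if self.ready.load(Ordering::Acquire) {
              Some(self.value.load(Ordering::Relaxed))
          } else {
              None // slot claimed but the writer isn't done yet
          }
      }
  }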
Eventually, when it comes down to the hardware level, the processor will have to assert #LOCK (or whatever mechanism it uses). So arguably even atomics aren't "lock"-free: somewhere along the line, some piece of hardware has to block all other hardware to do its thing. DRAM can only read one thing at a time (and has to write it back afterwards).
The hardware does that locking internally and exposes it as "compare-and-set", "compare-and-swap", etc. primitives. Programming languages that build on those primitives usually call them "atomics". The thing is that this hardware-level locking is far faster than locking at the user/kernel level, so to the program it looks like there are no locks. But atomics do still slow your program down, if only a little, because "compare-and-set" is still slower than a plain "set".
Much of the foundational theoretical work in this area was done by Maurice Herlihy in the 1980s and 1990s. For example, https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf I briefly paid for an ACM membership just so I could read his original work. (Well, I paid so I could read the literature in transactional memory, but I mostly ended up reading papers authored by Herlihy, co-authored by Herlihy, or analyzing Herlihy's work.)
From Wikipedia: "An algorithm is lock-free if, when the program threads are run for a sufficiently long time, at least one of the threads makes progress (for some sensible definition of progress)."
Note that there's also the distinction between lock-free algorithms and lock-free data structures. Crossbeam is the latter, I believe (but I could be wrong).
This environment only happens to exist in very specific and uncommon settings, so it's not a generally useful concept for lock-free programming. It's common in the linux kernel to see spinlocks taken in sections which have disabled all interrupts, for example, but this is not an environment that normal programs can live in.
I think the key paragraph from the Wikipedia article is:
"In particular, if one thread is suspended, then a lock-free algorithm guarantees that the remaining threads can still make progress. Hence, if two threads can contend for the same mutex lock or spinlock, then the algorithm is not lock-free. (If we suspend one thread that holds the lock, then the second thread will block.)"
There is an x86 instruction prefix named 'lock', but it only applies to a single instruction, and it's considered an atomic primitive rather than a mutex.
On that base, you can build lock-free data structures.
When people say "lock-free", they mean they rely on basic hardware primitives that end up doing pretty much what you would do with locks in your code, but you can't see it. Generally, the hardware can do it faster, but rarely as much faster as people think, or want. In particular, you will still have stalls -- 30 cycles, 100 cycles, sometimes longer -- whenever anything clashes, "lock-free" or no. Everything you do to cut contention at your level will pay just as it would if you were using regular locking. If you can cut your contention enough, lock-free ends up only a little faster (although it can still reduce latency spikes).
If the lock-free stuff is buried in a library well-maintained by somebody else who is terrifyingly competent, then use it. But it is very, very unlikely that maintaining your own will be worthwhile; the most likely outcome is extremely subtle bugs. And who cares how fast buggy code is? There are much easier ways to make wrong code fast.
Crossbeam provides lock-free data structures (e.g. ArrayQueue), thread synchronization (e.g. ShardedLock), memory sharing (e.g. AtomicCell), and utilities (e.g. Backoff).
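For example, a producer and consumer sharing an ArrayQueue might look roughly like this (crate version and values are assumptions on my part):

  use crossbeam::queue::ArrayQueue; // assumes a recent crossbeam (0.8-ish)
  use std::sync::Arc;
  use std::thread;

  fn main() {
      // Fixed-capacity, lock-free MPMC queue.
      let q = Arc::new(ArrayQueue::new(20));

      let producer = {
          let q = Arc::clone(&q);
          thread::spawn(move || {
              for i in 0..20 {
                  // push returns Err(value) when the queue is full
                  while q.push(i).is_err() {}
              }
          })
      };

      let consumer = {
          let q = Arc::clone(&q);
          thread::spawn(move || {
              let mut seen = 0;
              while seen < 20 {
                  if let Some(v) = q.pop() {
                      println!("got {}", v);
                      seen += 1;
                  }
              }
          })
      };

      producer.join().unwrap();
      consumer.join().unwrap();
  }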
Thank you to the author and team! Next I am very much looking forward to a Crossbeam implementation of the LMAX Disruptor ring queue: https://lmax-exchange.github.io/disruptor/
For readers who are interested: https://dzone.com/articles/ring-buffer-a-data-structure-behi...
Really looking forward to ConcurrentHashMap.
It worked fine, but a full ConcurrentHashMap would be much nicer.
It may very well be that my hack is actually a very reasonable way to go about this. If I need more concurrency I can just increase the number of bits of the hash and get more individual buckets. It does make me slightly uneasy that when two keys hash to the same prefix, I end up with a pathological worst case when I should get perfect scaling instead, because there's no actual contention between the keys.
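For the curious, the hack described above might look something like this sketch (names are hypothetical; plain mutex-guarded maps, nothing lock-free):

  use std::collections::hash_map::DefaultHasher;
  use std::collections::HashMap;
  use std::hash::{Hash, Hasher};
  use std::sync::Mutex;

  // 2^BITS independent maps, each behind its own Mutex. Two keys whose
  // hashes share a prefix contend on the same lock even though the keys
  // are unrelated -- the pathological case mentioned above.
  const BITS: u32 = 4;

  struct ShardedMap<K, V> {
      shards: Vec<Mutex<HashMap<K, V>>>,
  }

  impl<K: Hash + Eq, V> ShardedMap<K, V> {
      fn new() -> Self {
          let shards = (0..(1usize << BITS))
              .map(|_| Mutex::new(HashMap::new()))
              .collect();
          ShardedMap { shards }
      }

      fn shard_for(&self, key: &K) -> &Mutex<HashMap<K, V>> {
          let mut h = DefaultHasher::new();
          key.hash(&mut h);
          // Use the top BITS bits of the hash as the shard index.
          let idx = (h.finish() >> (64 - BITS)) as usize;
          &self.shards[idx]
      }

      fn insert(&self, key: K, value: V) {
          self.shard_for(&key).lock().unwrap().insert(key, value);
      }

      fn get_cloned(&self, key: &K) -> Option<V>
      where
          V: Clone,
      {
          self.shard_for(key).lock().unwrap().get(key).cloned()
      }
  }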
I always assumed that Ctries basically necessitated a GC, but I am very happy to be wrong about this!
But there are plenty of data types, like smart pointers, where multiple copies of the original smart pointer point to the same underlying data.
In the eyes of the compiler, these are different variables, and can be manipulated independently.
When writing said smart pointer, however, Rust won't allow you to share data between copies unless you explicitly opt out of some of the compile-time checks.
When you do this, you need to manually ensure (thread) safety. And this is where locks and other concurrency primitives generally come in handy.
The locking types make assurances with regards to the lifetimes of their parameters / contents / return values and these are still enforced by the compiler.
But there are common use cases where you need to have concurrent write access to the same structure from multiple threads. The way to implement this is "interior mutability": you write a method that takes an &self - a shared reference - but can still mutate it. A simple example of this is the std::sync::atomic types (https://doc.rust-lang.org/std/sync/atomic/). There's a method AtomicUsize::compare_and_swap that takes an &self, not an &mut self, as well as the values for comparing and swapping. Because it executes as an atomic machine instruction (or otherwise figures out how to make it atomic), it's okay for compare_and_swap to take only an &self: there's no risk of data races or memory corruption from multiple threads calling it at once. You're dynamically enforcing safety: if the compare doesn't match, the swap doesn't happen and you get the current value back (the newer compare_exchange reports this as an explicit Err). There's also a method AtomicUsize::get_mut(&mut self) -> &mut usize that takes a unique reference to the AtomicUsize type and gives you a unique reference to the underlying integer. As long as you hold that unique borrow, you can do normal memory reads and writes to the integer inside it, and at this point the compiler will statically make sure that nobody else can take shared references to the same AtomicUsize.
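A small illustration of both halves of that, using the non-deprecated compare_exchange (a sketch; the numbers are arbitrary):

  use std::sync::atomic::{AtomicUsize, Ordering};
  use std::sync::Arc;
  use std::thread;

  fn main() {
      let counter = Arc::new(AtomicUsize::new(0));

      let handles: Vec<_> = (0..4)
          .map(|_| {
              let counter = Arc::clone(&counter);
              thread::spawn(move || {
                  for _ in 0..1000 {
                      // CAS loop through a shared reference (&self):
                      // interior mutability in action.
                      let mut cur = counter.load(Ordering::Relaxed);
                      loop {
                          match counter.compare_exchange(
                              cur,
                              cur + 1,
                              Ordering::SeqCst,
                              Ordering::Relaxed,
                          ) {
                              Ok(_) => break,
                              Err(actual) => cur = actual, // lost the race; retry
                          }
                      }
                  }
              })
          })
          .collect();

      for h in handles {
          h.join().unwrap();
      }

      // With a unique reference, get_mut gives plain access: the borrow
      // checker guarantees no other thread can see the value right now.
      let mut counter = Arc::try_unwrap(counter).unwrap();
      assert_eq!(*counter.get_mut(), 4000);
  }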
You can always do what you want via message-passing between threads, but sometimes you want higher performance. (I guess that's why you're using threads in the first place; you can always write code in a single-threaded manner if you want.) std::sync::atomic and a few other standard library types help you with this; crossbeam really helps you with this. And they all integrate into the borrow checker pretty smoothly.
See also the Rust book's chapter on interior mutability https://doc.rust-lang.org/book/ch15-05-interior-mutability.h... and the "Too Many Lists" exercise https://cglab.ca/~abeinges/blah/too-many-lists/book/README.h... for more examples.
https://youtu.be/DJ4d_PZ6Gns?t=535 (it's one of the best Go performance videos on YouTube)
One example is high-performance routing on 10Gbps networks using user-space networking. Pretty much every cycle counts into your throughput numbers, because the rest of the hardware can barely keep up with the network card even before you're trying to do something. Go is a poor fit for this use case.
(This from someone who has been accused of being a Go shill. I use it a lot and like it a lot professionally, but it is no more suitable for every task than any other language.)
There are large-scale online services (in the millions of req/sec) that run on Go without problems.
Just spawn goroutines that select on the (optionally buffered) channel.
If you need to fan out even further, repeat this on the spawned routines.
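Since the thread is about crossbeam: the rough Rust equivalent of that fan-out shape uses a cloneable MPMC Receiver. A sketch (worker and job counts are arbitrary):

  use crossbeam::channel; // crossbeam's MPMC channel (crate version assumed)
  use std::thread;

  fn main() {
      // Bounded ("optionally buffered") channel; every worker shares a
      // clone of the same Receiver, so jobs fan out across threads.
      let (tx, rx) = channel::bounded::<u32>(64);

      let workers: Vec<_> = (0..4)
          .map(|id| {
              let rx = rx.clone();
              thread::spawn(move || {
                  // recv returns Err once all senders are dropped.
                  while let Ok(job) = rx.recv() {
                      println!("worker {} handled job {}", id, job);
                  }
              })
          })
          .collect();

      for job in 0..16 {
          tx.send(job).unwrap();
      }
      drop(tx); // close the channel so the workers exit

      for w in workers {
          w.join().unwrap();
      }
  }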
That's all relative. There are legitimate reasons to trade a bit of complexity for a bit of speed. Go is not special in this regard.
One challenge of writing this kind of code in Go is that goroutines are not pre-emptive, so you can't just port over an implementation that works in C.
I'm not sure why there is a focus on lock-free programming needing any kind of garbage collection. Garbage collection is just for heap allocation, and there are already lock-free heap allocators. Memory allocation isn't, in my experience, a major difficulty of lock-free data structures.
With a reference counted pointer, you cannot "read" it without incrementing the count, or else someone else might drop the count to zero and free the memory. You would need hardware support for atomically reading a pointer, dereferencing it, and then incrementing the value at that location.
GC (whether tracing or epoch based), hazard pointers or avoiding dynamic memory allocations in the first place (so all memory is pre-allocated, outside of the lock free parts) are the only solutions to this I'm aware of.
At the very least you can implement a lock-free n-CAS operation given a plain CAS.
Yeah. And atomic reference counting is expensive (20-80M contested atomic ops per second until all CPU cores are saturated), hazard pointers... are hard, and RCU can block.
If you want to make sure the data won't be touched and the memory won't be freed while you read it, reference counting can be a great technique.
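In Rust terms that's Arc; a tiny sketch (contents arbitrary):

  use std::sync::Arc;
  use std::thread;

  fn main() {
      // The Arc keeps the allocation alive until the last clone is
      // dropped, so a reader can never observe freed memory.
      let data = Arc::new(vec![1, 2, 3]);

      let reader = {
          let data = Arc::clone(&data); // atomic refcount increment
          thread::spawn(move || println!("sum = {}", data.iter().sum::<i32>()))
      };

      drop(data); // safe even if the reader is still running
      reader.join().unwrap();
  }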
If the memory allocation (including size) is already known to be stable but the data could change, the data can be read along with a time or version number that will let the reader make sure it didn't change while it was being read.
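That version-number scheme is essentially a seqlock. A minimal single-writer sketch, assuming two word-sized fields; the memory orderings are the subtle part, so treat this as an illustration rather than vetted code:

  use std::sync::atomic::{fence, AtomicU64, Ordering};

  // version is even when the record is stable, odd while a write is
  // in progress. Single writer assumed; readers retry on any change.
  struct Versioned {
      version: AtomicU64,
      x: AtomicU64,
      y: AtomicU64,
  }

  impl Versioned {
      fn write(&self, x: u64, y: u64) {
          let v = self.version.load(Ordering::Relaxed);
          self.version.store(v + 1, Ordering::Relaxed); // now odd
          fence(Ordering::Release); // keep data stores after the bump
          self.x.store(x, Ordering::Relaxed);
          self.y.store(y, Ordering::Relaxed);
          self.version.store(v + 2, Ordering::Release); // even again
      }

      fn read(&self) -> (u64, u64) {
          loop {
              let v1 = self.version.load(Ordering::Acquire);
              if v1 % 2 != 0 {
                  continue; // a write is in progress; retry
              }
              let x = self.x.load(Ordering::Relaxed);
              let y = self.y.load(Ordering::Relaxed);
              fence(Ordering::Acquire);
              if self.version.load(Ordering::Relaxed) == v1 {
                  return (x, y); // no write raced with our reads
              }
          }
      }
  }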
If the data can be read atomically, this isn't a problem. If the 'data' is just a pointer that is meant to transfer ownership to the reading thread this isn't a problem, etc.
The underlying principle here is that there are many different techniques and design trade-offs when it comes to concurrency and synchronization. Discounting one thing because it isn't a silver bullet is ridiculous, because there are no silver bullets. A system has to be architected as a whole.
When there are 'lock free' data structures that are just flipping pointers with compare-and-exchange instructions, they are essentially ignoring the storage of the data altogether and hoping the garbage collector takes care of it, while optimally the data structure would confront the storage of data and organization of access in an integrated way.
No. In this case it allows cheap, (mostly) unsynchronized deallocation. Otherwise you'll have to synchronize object deletion. Not only can that be tricky to get right (use-after-free hazard), but the required synchronization consumes a lot of CPU cycles. Ask any C++ programmer.
With (limited) GC, you only need to trace reachability to free the objects, avoiding almost all of the expensive synchronization operations.
You don't even need to use GC in the general case, just for the data structures.
There isn't anything about garbage collection that enables something that couldn't be done before. If a lock free data structure is holding other non-trivial data structures that have no notion of concurrency, you are already in a position where you have to take what you can get.
> synchronization consumes a lot of CPU cycles. Ask any C++ programmer.
I've done a large amount of lock-free programming in C++, and a lock-free data structure that takes responsibility for all the data stored in it, rather than just being a container of pointers, is the best approach to control the amount and types of synchronization. Atomic reference counting with the last reading thread responsible for deletion is just fine as an approach.
Ultimately, having granular but heap-allocated data structures in a 'lock free' container needs to come with proper expectations, since it is a poor way to scale concurrency. Granular synchronization can be fine if typical usage isn't going to see much overlap, or if the whole thing won't see much use in the first place, but for concurrency to scale, larger chunks of data separated out by their dependencies need to be used.
True, there's always another way. But you can say the same about almost any subject. In my experience with lock-free (and wait-free) programming, safe and high-performance deallocation of objects is often a problem. (Especially in kernel mode, when context switching is not an option.)
Whenever high performance is required and it's possible to avoid atomics (CAS, FAA, LL/SC, etc.), (false) cache line sharing or even memory operations altogether, I'll take it. Not to mention CAS (or equivalent) loops and mutexes...
This thread was started because the article focuses on lock free programming with garbage collection when garbage collection is not any sort of fundamental factor in lock free programming.
> In my experience about lock-free (and wait-free) programming, safe and high performance deallocation of the objects is often a problem.
Concurrency bottlenecks are fundamentally about synchronization. The solution is frequently to synchronize less and that means doing more in between synchronizing. This implies allocating and deallocating more at one time.