
Building a lock free continuous ring buffer in Rust - Argorak
https://ferrous-systems.com/blog/lock-free-ring-buffer/
======
dragontamer
Looking at this blog post more carefully... I'm not convinced that this ring
buffer is actually correct, unfortunately.

> buffer.write.store(buffer.write.load() + write_len)

This is... an atomic load, followed by an add, followed by an atomic store.

A TRUE lock-free queue would use buffer.write.AtomicAdd(write_len) (a single,
atomic read-modify-write).

There are many other issues here, but this is the most egregious one I was
able to find. The whole thing doesn't work; it's completely unsafe and
incorrect from a concurrency point of view. Once this particular race
condition is solved, there are at least 3 or 4 others that I was able to find
that also need to be solved.

EDIT: Here's my counter-example

    
    
        write = 100 (at the start)
        write_len = 10 for both threads.
    
        |-----------------------------------|
        |   Thread 1      |   Thread 2      |
        |-----------------------------------|
        | write.load (100)| write.load (100)|
        | 100+write_len   |                 |
        | write.store(110)| 100+write_len   |
        |                 | write.store(110)|
        |-----------------------------------|
    
        write = 110 after the two threads "added" 10 bytes each
    

Two items of size 10 were written to the queue, but the write pointer only
advanced by 10 bytes. The implementation is completely busted and broken. Just
because you're using atomics doesn't mean that you've created an atomic
transaction.

It takes great effort and study to actually build atomic transactions out of
atomic parts. I would argue that this thread should be a lesson in how easy it
is to get multithreaded programming dead wrong.
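
To make the lost update concrete, here's a minimal Rust sketch (illustrative
names; fetch_add is the single read-modify-write a multi-writer queue would
need):

    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Arc;
    use std::thread;

    fn main() {
        let write = Arc::new(AtomicUsize::new(100));

        // Racy version: both threads can load 100, both compute 110, and both
        // store 110 -- one increment is lost, exactly as in the table above.
        //   let w = write.load(Ordering::SeqCst);
        //   write.store(w + 10, Ordering::SeqCst);

        // Atomic version: a single read-modify-write can never lose an update.
        let handles: Vec<_> = (0..2)
            .map(|_| {
                let w = Arc::clone(&write);
                thread::spawn(move || { w.fetch_add(10, Ordering::SeqCst); })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
        assert_eq!(write.load(Ordering::SeqCst), 120); // always 120, never 110
    }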

---------

For a talk that actually gets these details right... I have made a recent
submission:
[https://news.ycombinator.com/item?id=20096907](https://news.ycombinator.com/item?id=20096907).
Mr. Pikus breaks down how atomics and lock-free programming need to be done.
It takes him roughly 3.5 hours to describe a lock-free concurrent queue. He's
not messing around either: it's a dense talk on a difficult subject.

Yeah, it's not easy. But this is NOT an easy subject by any stretch of the
imagination.

~~~
ekimekim
The article is discussing an exclusive-writer, exclusive-reader ring buffer.
Interleaved writers are intentionally not considered.

> A common strategy to reduce the amount of coordination that needs to happen
> between the two threads (writer, reader) is to associate each coordination
> variable (pointer) with a single thread that has exclusive write access to
> it.

~~~
dragontamer
> reduce the amount of coordination

Interesting premise, but it doesn't hold up to scrutiny.

1. They have three shared variables.

2. Those three shared variables are falsely shared in that data structure.

    
    
        struct ContiguousAsyncBuffer {
          buf: *mut u8,
          len: usize,
          read: AtomicUsize,
          write: AtomicUsize,
          watermark: AtomicUsize,
        }
    

Look at that struct: all of the atomics are inside the same 64-byte cache
line!! That's the worst sharing possibility imaginable!

3. Therefore: the dumb-and-simple "Mutex / Spinlock" would actually have less
sharing (sharing only ONE variable: the mutex), without any false sharing of
various atomic variables.

---------

I kid you not: a "lock/unlock" queue probably performs better than the atomic
thing they did here (and the atomic thing they're doing doesn't even work
correctly).

~~~
lalaland1125
You keep stating that their queue doesn't work, but you have yet to provide
any correct arguments or evidence that it doesn't.

The only argument you provided was simply incorrect, as the article is only
concerned with a single reader and a single writer.

You are right that adding some padding might improve performance.

~~~
dragontamer
Using a spinlock / mutex would improve performance if we're only talking about
2 threads. That's the main problem. Spinlocks are also much, much, much easier
to reason about.

Frankly, the only reason why this implementation "works" is because they're
taking advantage of SeqCst, and the article spends absolutely no time talking
about why SeqCst is the cornerstone of this implementation.

They actually don't need to use atomics at all when restricted to 2 threads.
They just need the seq-cst ordering. Atomics themselves are irrelevant in the
2-thread case.

----------

In any case, code like:

> buffer.write.store(buffer.write.load() + write_len)

This is horribly, horribly, horribly bad code. This is bad, really bad; no one
should be encouraged to write code like this.

Yeah, it ends up being correct because of SeqCst and the very specific
assumptions of the blog post. But is that really a victory worth talking
about?

This kind of code should NOT be encouraged. The reason why it's "correct" is
incredibly subtle and complicated. It's incredibly difficult for me to even
put into words why this seems like it'd work... without giving a ton of links
to various memory-consistency talks and papers.

~~~
forrestthewoods
I’d love to see a competing version with benchmarks.

I’m not saying you’re wrong. But your making a lot of strong arguments without
any evidence. Is 3 atomics on the same cache line a legit problem? I don’t
know! But you’re gonna have to prove negative behavior for me to integrate
that knowledge into my workflow.

~~~
dragontamer
> Is 3 atomics on the same cache line a legit problem?

This is a commonly known problem called false sharing.
[https://en.wikipedia.org/wiki/False_sharing](https://en.wikipedia.org/wiki/False_sharing)

To be fair: I don't know how these particular atomics work; maybe they're each
on their own cache line, and 192 bytes are used up for those three atomics so
that things are done properly (unlikely, but possible). But atomics, in my
experience, usually assume the "external" programmer figures out the padding
needed to prevent false sharing.
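
In Rust, the padding fix might look something like this (a hand-rolled
sketch; crates like crossbeam provide a battle-tested CachePadded type):

    use std::sync::atomic::AtomicUsize;

    // Align each counter to its own 64-byte cache line, so the reader's and
    // writer's counters never ping-pong the same line between cores.
    #[repr(align(64))]
    struct CachePadded<T>(T);

    struct ContiguousAsyncBuffer {
        buf: *mut u8,
        len: usize,
        read: CachePadded<AtomicUsize>,
        write: CachePadded<AtomicUsize>,
        watermark: CachePadded<AtomicUsize>,
    }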

2. These are atomics declared with SeqCst memory ordering. On ARM systems,
that means a full memory barrier every time you read or write them: the CPU
has to drain its store buffer and forbid reordering across the access. (x86
has stricter guarantees and almost "natively" implements SeqCst, so you pay
less there.) A full barrier on EVERY SeqCst operation is absurdly slow; it's
the most expensive memory ordering you can pick.

3. Well-written spinlocks are absurdly fast. Benchmark them if you don't
believe me. In pseudocode (each piece compiles down to a handful of x86
instructions):

    
    
        Lock:
        while (AtomicExchange(spinlock, true) != false) hyperthread_yield(); // loop until you grab false
        acquire_barrier(); // prevent reordering of memory

        Unlock:
        release_barrier(); // prevent reordering of memory; a NOP on x86 but important on ARM / POWER9
        AtomicStore(spinlock, false); // a plain store on x86, but it must stay an atomic store in the source to avoid a data race
    

Literally one assembly instruction to unlock, and about three to lock (the
atomic exchange plus the loop), plus the hyperthread_yield if you wanna play
nice with your siblings (it reduces performance for you, but makes the whole
system faster overall).
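
For reference, a minimal Rust rendering of the same idea (a sketch, not
production code; real implementations add backoff and a RAII guard):

    use std::sync::atomic::{AtomicBool, Ordering};

    pub struct SpinLock {
        locked: AtomicBool,
    }

    impl SpinLock {
        pub const fn new() -> Self {
            SpinLock { locked: AtomicBool::new(false) }
        }

        pub fn lock(&self) {
            // Loop until we swap out `false`; Acquire pairs with the Release in unlock().
            while self.locked.swap(true, Ordering::Acquire) {
                std::hint::spin_loop(); // hint the CPU to yield to a sibling hyperthread
            }
        }

        pub fn unlock(&self) {
            // A single store; Release keeps prior writes from sinking below it.
            self.locked.store(false, Ordering::Release);
        }
    }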

From a cache perspective, the spinlock will play nice with the L1 cache: if a
thread unlocks, then locks again before a new thread comes in, it all stays
local to L1 and no cross-thread communication occurs.

So pretty much every aspect of the hypothetical spinlock implementation is
faster in my mind. I don't really need to benchmark to see which one is
faster.

EDIT: At bare minimum: because the spinlock will be properly written using
Acquire / Release half-barriers, it will ABSOLUTELY be faster than SeqCst
atomics. You'd need to use Acquire/Release atomics just to "keep up" with an
Acquire/Release-based spinlock.
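
For the SPSC case in the article, a relaxed-down version of the index update
could look roughly like this (illustrative, not the article's actual code):

    use std::sync::atomic::{AtomicUsize, Ordering};

    // Writer side: only the writer ever stores to `write`, so its own load
    // can be Relaxed; the Release store publishes the bytes copied before it.
    fn publish(write: &AtomicUsize, write_len: usize) {
        let w = write.load(Ordering::Relaxed);
        // ... copy the payload into buf[w..w + write_len] here ...
        write.store(w + write_len, Ordering::Release);
    }

    // Reader side: the Acquire load pairs with the writer's Release store, so
    // the payload bytes are guaranteed visible before we read them.
    fn available(write: &AtomicUsize) -> usize {
        write.load(Ordering::Acquire)
    }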

~~~
forrestthewoods
> I don't really need to benchmark to see which one is faster.

I know that iterating an Array is radically faster than iterating a LinkedList.
But I also know this because I’ve seen the difference first hand. Countless
times over the years in fact. I even have a bit of a gut feeling for how much
slower it can be.

This case I don’t have the same level of experience. I’m sitting in an airport
on my tablet right now. So I can’t throw any quick benchmarks together.

Suffice to say, I very much would like to see an actual, for-real comparison.
I haven't dealt with this before, and even if I'm convinced one way is faster,
seeing is believing! Performance benchmarks are almost always surprising.

Edit: from Reddit, here is a different SPSC ring buffer that may be more to
your liking!
[https://github.com/polyfractal/bounded-spsc-queue](https://github.com/polyfractal/bounded-spsc-queue)

~~~
dragontamer
If you have spare time, the Pikus talks at CppCon cover the performance hits
associated with various primitives.

[https://www.youtube.com/watch?v=9hJkWwHDDxs](https://www.youtube.com/watch?v=9hJkWwHDDxs)

This is the one you want, with benchmarks associated with everything. Of
course, benchmark on your own system too, and learn hardware performance
counters (!!). This sort of stuff gets into nanosecond-level measurements, so
learning to use the hardware profiler of your CPU really, really helps to see
these details.

Honestly, because of how intricate the L1 / L2 caches can be on a modern
system, learning to use hardware performance counters while you're measuring
benchmarks like this is extremely helpful. (On Linux, something like "perf
stat -e cycles,instructions,cache-references,cache-misses ./your_benchmark"
is a typical starting point.)

------
ohazi
Although dragontamer's complaints are mostly correct, I think he's being a
little uncharitable and is largely missing the point.

Yes, writing lock-free code that's generic, that supports many-to-many
reads/writes, and that's correct on all architectures is hilariously hard, and
most people should not try to implement these from scratch at work. Other
approaches can be more performant, and can have acceptable trade-offs for most
use cases.

Are there issues with this one? Maybe. As dragontamer stated multiple times,
this stuff is pretty hard to reason about.

HOWEVER, this ring buffer was designed to run on embedded systems. As usual,
the constraints are often a little bit different when you're writing software
for a microcontroller.

As an example, let's imagine you have periodic bursts of data coming in on a
serial port at 2 Mbps, and your microcontroller is running at 32 MHz. Also,
your serial port only has a 4-byte hardware buffer, and the cost of missing a
byte is... I don't know, sending an endmill through the wall. (For scale: at
2 Mbps a byte arrives roughly every 130-160 CPU cycles depending on framing,
so that buffer gives you well under a thousand cycles to react.)

You can absolutely make this work, and you can even formally verify that
you'll always meet timing, but you're going to have a really hard time doing
this if you also have to analyze every other piece of code that will ever try
to acquire that lock. A single-producer, single-consumer, lock-free queue with
fixed message sizes can be implemented correctly in about 20 lines of C. It
has a lot of annoying constraints, and you still have to use atomics
correctly, and you still need a memory barrier, and don't even think about
resetting the queue from either thread, and... (etc).

But if you're careful, a queue like this can make an otherwise impossible task
relatively manageable. I can no longer count the number of times a construct
like this has saved a project I was involved with.

Would it automatically run correctly and at full performance on a Xeon running
Linux? Fuck no. But that's not the point.

The desirable quality of lock-free queues for embedded systems is correctness,
not performance.

~~~
sansnomme
Can you give the 20 lines of example code? Would really love to see some
modern atomics code.

~~~
ohazi
Okay, not quite 20 lines, but something like this:

[https://gist.github.com/ohazi/40746a16c7fea4593bd0b664638d7017](https://gist.github.com/ohazi/40746a16c7fea4593bd0b664638d7017)

I've dressed it up to look like C11, but vendor-specific macros from your
microcontroller support library are more common. I think you can relax some of
the memory ordering too, but I don't remember the details off-hand.

Also, sometimes you can have a modified version of queue_push / queue_pop
where a DMA engine handles the memcpy, and then you only have to update the
pointers in the cleanup interrupt (along with configuring the next or
next-next DMA transfer if the queue isn't full / empty).
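
For comparison, here's a rough Rust analogue of that pattern under the same
assumptions (fixed-size slots, exactly one writer thread and one reader
thread, power-of-two capacity, one slot always left empty):

    use std::cell::UnsafeCell;
    use std::sync::atomic::{AtomicUsize, Ordering};

    const SLOTS: usize = 16; // power of two

    pub struct SpscQueue<T> {
        buf: [UnsafeCell<Option<T>>; SLOTS],
        rd: AtomicUsize, // only the reader ever stores to this
        wr: AtomicUsize, // only the writer ever stores to this
    }

    unsafe impl<T: Send> Sync for SpscQueue<T> {}

    impl<T> SpscQueue<T> {
        pub fn new() -> Self {
            SpscQueue {
                buf: std::array::from_fn(|_| UnsafeCell::new(None)),
                rd: AtomicUsize::new(0),
                wr: AtomicUsize::new(0),
            }
        }

        pub fn push(&self, val: T) -> Result<(), T> {
            let wr = self.wr.load(Ordering::Relaxed);
            let next = (wr + 1) & (SLOTS - 1);
            if next == self.rd.load(Ordering::Acquire) {
                return Err(val); // full
            }
            unsafe { *self.buf[wr].get() = Some(val) }; // fill the slot...
            self.wr.store(next, Ordering::Release); // ...then publish it
            Ok(())
        }

        pub fn pop(&self) -> Option<T> {
            let rd = self.rd.load(Ordering::Relaxed);
            if rd == self.wr.load(Ordering::Acquire) {
                return None; // empty
            }
            let val = unsafe { (*self.buf[rd].get()).take() };
            self.rd.store((rd + 1) & (SLOTS - 1), Ordering::Release);
            val
        }
    }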

~~~
gpderetta
The atomic_signal_fence in queue_pop / queue_push seems suspicious. Unless you
are doing something like asymmetric fences, or actually synchronizing with
signal handlers, you probably want atomic_thread_fence. Still, I'm not sure
why you want a fence at all there. Publishing is done by the atomic_store to
queue_rd and queue_wr, which already includes (I'm being fast and loose here)
the equivalent of a seq_cst fence.
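
(For the Rust readers: the analogous distinction is compiler_fence vs. fence,
e.g.:)

    use std::sync::atomic::{compiler_fence, fence, Ordering};

    fn example() {
        compiler_fence(Ordering::Release); // like C11 atomic_signal_fence: restrains only the compiler
        fence(Ordering::Release);          // like C11 atomic_thread_fence: also orders the hardware
    }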

~~~
ohazi
The reason you want a barrier there is to make sure the memcpy and the pointer
update don't get reordered, so the other thread doesn't have a chance to
access that block before it's ready.

You don't actually need any synchronization with the other thread, which is
why I _think_ this can work instead of atomic_thread_fence. But you could be
right, anyone using this should check this carefully.

Edit: I think you're right. If it is needed, it should be a thread fence. I'm
not 100% clear on whether the pointer update takes care of this, so it may not
actually be needed.

~~~
gpderetta
The block is 'published' when the store to the write pointer is performed.
That's a store release which will have an implicit barrier _before_ the store.

~~~
ohazi
Okay yeah, you're definitely right. I didn't realize that the stdatomic
loads/stores are seq_cst by default. The libraries I'm familiar with have
macros for standalone atomic reads and writes, but you have to place any
barriers manually. If you were using a library like that, you'd place them
between the memcpy and the pointer update, but here you get that for free.

------
0xffff2
>This is the story of how Andrea Lattuada (PhD student at ETH Zurich) and
James Munns (from Ferrous Systems) designed and implemented (two versions!) of
a high-perf lock-free ring-buffer for cross-thread communication. _If any of
those words look scary to you, don't fret, we'll explain everything from the
basics._

This is not the message I want to see introducing such a complex topic.
Writing correct lock-free code is _incredibly_ hard, and you're _not_ going to
get it right or even understand it from a single post. I've done a bit of
reading on this subject, and I'm not even sure that it's possible to write a
correct lock-free queue (of which this is a variant) in a
non-garbage-collected language without some pretty substantial trade-offs
regarding memory management.

~~~
dragontamer
Pikus does it. I made a post about it.
[https://news.ycombinator.com/item?id=20096907](https://news.ycombinator.com/item?id=20096907)

The talk is 3.5 hours, about as long as you'd expect for "writing a concurrent
lock-free queue". It's an incredibly difficult topic, but I'm convinced that
Pikus's implementation is actually correct.

Although, maybe there's a subtle bug somewhere. And Pikus is REALLY stretching
the definition of a queue to make his implementation work. (This queue may
sometimes function like a stack, but that's okay because there's no way for
the end programmer to figure it out without a global lock.)

---------

EDIT: Oh, I guess you wouldn't consider it "truly lock free", because he has a
global lock for the rare situation of a full queue. He uses the double-checked
locking pattern for that case (causing an expensive, but temporary, global
lock on the data structure) and basically says "this situation only happens
once per hour, so it's probably not bad".

So Pikus's implementation is "mostly lock free", but still falls back to a
global spinlock in exceptionally rare corner cases.

~~~
0xffff2
Oh, that's actually a different lock-free talk from Pikus than I expected. I
haven't watched it yet. Thanks!

I probably wrote my original comment a little too quickly. Yes, it's possible
to write useful mostly lock-free data structures, but it's hard enough that I
consider anyone claiming to have done so in a way that beats the equivalent
locked structure (as you pointed out above) to be making an extraordinary
claim that requires extraordinary evidence and serious consideration of the
difficulty involved.

~~~
dragontamer
> I probably wrote my original comment a little too quickly.

Nah. Anyone who actually understands this subject recognizes that a truly
lock-free concurrent queue is probably one of the most difficult problems in
all of Comp. Sci.

Even if you're using absolutes, it's basically warranted given the difficulty
of the problem. No one will blame you for talking about how hard this problem
is, even with a bit of hyperbole.

------
kazinator
I implemented such a buffer in 2006. It was used for fast message passing from
user space to the kernel. When a message was too large to fit at the end of
the circular buffer, a zero-length message was placed into the remaining
space; that indicated "wrap to the beginning". It's so obvious, it must have
been implemented numerous times by others before me.

If I were to implement the same thing again, I would just split the messages
and use scattered writes (_struct iovec_) to process them. I cannot remember
exactly, but I think the only reason the messages were linear was just so the
kernel thread could write them out in a single call.

Oh wait, now I remember; I think the buffers were linear due to the production
side: sprintf being used to produce some of the message payloads, and that
requiring a linear buffer. But of course it's possible to support both wrapped
messages and the "tail space not used" indicator.
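
A sketch of that wrap convention in Rust, for the curious (framing only; the
synchronization and the writer side are omitted, and the names are
hypothetical):

    /// Each message is a u16 little-endian length prefix followed by `len`
    /// payload bytes; a length of 0 is the sentinel meaning "the space up to
    /// the end of the buffer is unused, wrap to offset 0".
    fn next_message<'a>(buf: &'a [u8], read: &mut usize, write: usize) -> Option<&'a [u8]> {
        if *read == write {
            return None; // queue empty
        }
        let len = u16::from_le_bytes([buf[*read], buf[*read + 1]]) as usize;
        if len == 0 {
            *read = 0; // hit the wrap sentinel: continue from the start
            return next_message(buf, read, write);
        }
        let start = *read + 2;
        *read = start + len;
        Some(&buf[start..start + len])
    }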

~~~
kmill
I was wondering about having a special message to indicate the end of the
buffer while reading the article, so thanks for pre-emptively answering my
question!

Another thing I was wondering about is using a simple buddy allocator with the
ring buffer being just a list of pointers. There would be a second ring buffer
to send memory back to the producer to be freed, and you would statically
allocate enough memory so fragmentation doesn't affect the particular
application---you know each message has a limited lifetime, for instance. (The
allocator would not be shared between different ring buffers.)

Then again, I don't see any benefits to doing this.

------
CUViper
You can make wrapping writes look contiguous by mapping the same memory twice,
one after the other. This is done in slice-deque such that the entire buffer
can be viewed as a contiguous slice, regardless of the start/end positions.

[https://crates.io/crates/slice-deque](https://crates.io/crates/slice-deque)
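
On Linux, the trick can be set up with a few mmap calls; a minimal sketch
(error handling and cleanup elided, `size` must be a multiple of the page
size, and the libc crate is assumed as a dependency):

    use std::ptr;

    unsafe fn mirror_map(size: usize) -> *mut u8 {
        // Anonymous shared memory object to back both mappings.
        let fd = libc::memfd_create(b"ring\0".as_ptr().cast(), 0);
        libc::ftruncate(fd, size as libc::off_t);

        // Reserve 2 * size bytes of contiguous virtual address space.
        let base = libc::mmap(ptr::null_mut(), 2 * size, libc::PROT_NONE,
                              libc::MAP_PRIVATE | libc::MAP_ANONYMOUS, -1, 0);

        // Map the same physical pages twice, back to back, over the reservation.
        for i in 0..2 {
            libc::mmap((base as *mut u8).add(i * size).cast(), size,
                       libc::PROT_READ | libc::PROT_WRITE,
                       libc::MAP_SHARED | libc::MAP_FIXED, fd, 0);
        }

        // Writes to base[x] are now visible at base[x + size], so a wrapped
        // region of up to `size` bytes is always virtually contiguous.
        base as *mut u8
    }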

~~~
pslam
This trick may break ordering and cache aliasing rules on many architectures.
If you're dipping into this kind of thing, it needs per-architecture
whitelisting.

~~~
BeeOnRope
I'm not sure about the ordering part, but as for "cache aliasing rules": can
you give some details? This situation can be trivially set up via mmap, so if
that breaks, it would seem to be a problem?

I understand that aliasing is an issue when it comes to hardware design, but
the practical need to support systems that allow this to occur means that all
major architectures I'm aware of support transparent V->P aliasing, either by
having their caches physically indexed, _effectively_ physically indexed like
VIPT or some strategy where V is used to look up the way in L1 but misses due
to aliasing will be resolved in the L2 (i.e., aliasing "works" but is slow),
or some other similar strategy.

I would be curious what systems don't work this way.

Note that I'm talking only about data - systems definitely have all sorts of
rules about modification of instruction-containing pages.

------
banachtarski
Also important: not using a power-of-two-sized ring leaves a lot of
optimization on the table.
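
e.g., with a power-of-two capacity the wrap becomes a single AND instead of a
division (illustrative):

    fn wrap_index(write: usize, capacity: usize) -> usize {
        // capacity is a power of two, so wrapping is a bitmask, not a modulo:
        write & (capacity - 1) // general case would be: write % capacity
    }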

~~~
monocasa
Eh, maybe/maybe not. This implementation requires a big-boy CPU family with a
real MMU; I'd be willing to bet that most of the extra latency of the divides
is hidden in the memory access latencies.
~~~
opencl
There are two implementations described in this blog post and one of them is
specifically designed to run on microcontrollers.

------
_Codemonkeyism
Reminds me of the LMAX architecture.

Slides of a talk I gave about LMAX

[https://www.slideshare.net/Stephan.Schmidt/lmax-architecture-jax-conference](https://www.slideshare.net/Stephan.Schmidt/lmax-architecture-jax-conference)

------
person_of_color
What is considered the bible on lock free programming?

~~~
queensnake
I don't think there is one, yet. 'The Art of Multiprocessor Programming' has
some algorithms, though:
[https://www.amazon.com/Art-Multiprocessor-Programming-Revised-Reprint/dp/0123973376/](https://www.amazon.com/Art-Multiprocessor-Programming-Revised-Reprint/dp/0123973376/)

------
amelius
I'm guessing this still uses locks, but at the CPU/cache level instead of
inside the algorithm.

~~~
taneq
I've always thought "lock free" was a misnomer for this kind of thing, because
these structures use multi-stage atomic instructions which obviously require
some kind of implied locking. I'd probably call them "non-blocking".

