

Lock-free Multi-producer Multi-consumer Queue on Ring Buffer - krizhanovsky
http://natsys-lab.blogspot.ru/2013/05/lock-free-multi-producer-multi-consumer.html
The work queue has always been one of the hottest points in server software.
  Here is how to scale it effectively to a multi-core environment.
======
akadien
How is this different from disruptor queues ([http://lmax-exchange.github.io/disruptor/](http://lmax-exchange.github.io/disruptor/))?

(Here is a plug for our C version:
[https://github.com/redjack/varon-t](https://github.com/redjack/varon-t)).

~~~
krizhanovsky
The algorithms are simply different and solve different problems: disruptor is
a messaging queue (if a producer P0 emits message M0 into the queue, then all
consumers C0..CN receive M0), while our queue is a classical work queue (if P0
and P1 emit messages M0 and M1, then only one consumer Ci receives M0 and only
one consumer Cj receives M1; i could be equal to j).

Our implementation competes with boost::lockfree::queue and is much faster,
since the Boost implementation uses heavier synchronization techniques. The
benchmark on GitHub also includes a Boost implementation, so you can compare
the two queues.
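
For illustration, here is a toy demonstration of the work-queue semantics with
boost::lockfree::queue (not our benchmark code): every pushed item is popped
by exactly one consumer.

```cpp
#include <boost/lockfree/queue.hpp>
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    boost::lockfree::queue<int> q(1024);   // bounded lock-free MPMC queue
    std::atomic<int> consumed{0};

    std::thread producer([&] {
        for (int i = 0; i < 1000; ++i)
            while (!q.push(i))             // retry if momentarily full
                ;
    });

    std::vector<std::thread> consumers;
    for (int c = 0; c < 4; ++c)
        consumers.emplace_back([&] {
            int v;
            while (consumed.load() < 1000)
                if (q.pop(v))              // each value lands in exactly one consumer
                    ++consumed;
        });

    producer.join();
    for (auto &t : consumers) t.join();
    std::cout << "consumed " << consumed << " items, once each\n";
}
```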

Unfortunately, there is no adequate algorithm description for the disruptor
queue, only some vague descriptions better suited to business people than to
engineers. So it was not easy to dig into its source code. However, I studied
it, and I have some notes on the implementation.

The implementation is a bit untidy: there are a lot of branches without
branch-prediction information available at compile time, and to avoid
cache-line bouncing it wastes two cache lines instead of simply aligning an
item on a cache line (I mean vrt_padded_int). I didn't pay too much attention
to memory barrier usage, but given that x86-64 provides relatively strict
memory ordering, some of them could probably also be eliminated. Disruptor
uses very clever ideas, but I believe its performance can be improved after a
good code review.
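
For context, the compile-time branch-prediction hints I mean are the usual
GCC/Clang __builtin_expect wrappers; q_full() below is just a hypothetical
stand-in for a queue's "full" check.

```cpp
// GCC/Clang static branch hints (the usual likely()/unlikely() wrappers):
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

// Hypothetical stand-in for the queue's "full" predicate.
static bool q_full() { return false; }

int try_push(int item)
{
    if (unlikely(q_full()))  // hint: keep the failure path off the hot trace
        return -1;
    /* ... fast path: actually enqueue `item` ... */
    (void)item;
    return 0;
}
```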

One more point: while our queue implementation is C++, it's still
self-sufficient and can easily be ported to C for use in kernel space. It's
questionable (in my humble opinion) for a generic container to depend on the
non-standard libcork and, moreover, on a logging library (clogger).

------
thrownaway2424
Interesting, and the author is clearly an expert on the topic. However, I'm
curious about how this fits into a sensible program architecture. When do you
need such a queue? It seems to me quite unlikely that production and
consumption operate at exactly the same rate and with the same CPU usage,
such that the queue runs in a steady state of neither full nor empty. It
seems much more likely that the queue will just be either full or empty all
the time, in which case this algorithm will be terrible, because it busy-waits
instead of having condition variables that can be used to wake the consumers
or block the producers. By the way, blocking the producers is also known as
flow control, which many people believe is a desirable property of software.

If your queue is normally empty, it would be fine for the producing threads to
just jump straight into the consuming routine and never mind the queue. If the
queue is normally full, it's fine for them to just take a lock and sleep on a
condition.

Normally what you want a p/c queue for is to paper over short-term variance in
a system where on long time scales the consumer side is faster than the
producer side. In that role, an ordinary queue with a mutex will work fine. If
the consumer is stalled for long enough to fill the queue, blocking the
producer is exactly what you want.
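
For concreteness, a minimal sketch of such a blocking queue with a mutex and
condition variables (illustrative names, plain C++11); the blocking push() is
exactly the flow control I mean.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

template <typename T>
class BlockingQueue {
    std::mutex mu;
    std::condition_variable not_full, not_empty;
    std::deque<T> items;
    const size_t capacity;
public:
    explicit BlockingQueue(size_t cap) : capacity(cap) {}

    void push(T v) {
        std::unique_lock<std::mutex> lk(mu);
        // Flow control: the producer sleeps while the queue is full.
        not_full.wait(lk, [&] { return items.size() < capacity; });
        items.push_back(std::move(v));
        not_empty.notify_one();   // wake one sleeping consumer
    }

    T pop() {
        std::unique_lock<std::mutex> lk(mu);
        // The consumer sleeps while the queue is empty; no busy-waiting.
        not_empty.wait(lk, [&] { return !items.empty(); });
        T v = std::move(items.front());
        items.pop_front();
        not_full.notify_one();    // wake one blocked producer
        return v;
    }
};
```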

Of course, some languages favor the use of this pattern. I'm thinking of Go
channels. Totally lock-free Go channels would be pretty awesome, but busy-
waiting in select would not be awesome.

~~~
krizhanovsky
This is a good question. Currently we've integrated the queue into two
projects in production: a user-space proxy server and a VPN traffic capturer
(a Linux kernel module).

In both cases we enhanced the queue with predicates such as "is the queue
empty" and "is the queue full", so we can drop packets when the consumers are
overloaded and there is no point in putting a packet into the queue.
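
Schematically, the producer-side check looks like this (full(), push() and
the names are illustrative, not our production code):

```cpp
#include <atomic>

std::atomic<unsigned long> dropped{0};

// Producer-side admission control: when consumers are overloaded (queue
// full), drop the packet rather than block the capture path. Queue is any
// container providing full()/push().
template <typename Queue, typename Packet>
void admit(Queue &q, Packet *pkt) {
    if (q.full()) {
        ++dropped;      // account for the drop; pkt is freed by the caller
        return;
    }
    q.push(pkt);
}
```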

Secondly, the queue is designed for multi-core environments, which on modern
hardware usually means multi-node NUMA systems. So we adjusted both
applications so that the administrator can assign different numbers of cores
to consumers and producers depending on the current workload. This keeps the
system more balanced, so the queue is rarely empty or full.

And finally, for kernel space we also implemented a lightweight, likewise
lock-less, condition wait for the queue
([http://natsys-lab.blogspot.ru/2013/08/lock-free-condition-wait-for-lock-free.html](http://natsys-lab.blogspot.ru/2013/08/lock-free-condition-wait-for-lock-free.html)).
It gives us lower CPU consumption when there is no workload (so the system
consumes less power) and even better performance due to reduced cache
bouncing.

------
colanderman
Note that "lock-free" doesn't mean "doesn't use mutexes"; it means "at least
one consumer/producer will always make progress". (Any algorithm using locks
fails this test, as one stubborn consumer/producer can hold up the entire
system.) The article doesn't seem to really touch on this property at all
(although at first glance the algorithm _does_ appear to be lock-free).

------
nano_o
I'm a bit skeptical about the code: it seems that it's using only the atomic
fetch-and-add synchronization instruction, and not compare-and-swap. However,
I think that whether queues are implementable with only fetch-and-add is an
open problem in distributed computing (in technical terms, whether queues are
in the Common2 family. See for example Afek et al., "Common2 Extended to
Stacks and Unbounded Concurrency".)
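
For readers unfamiliar with the distinction, here is how the two primitives
look in C++11 atomics (just an illustration): fetch-and-add always succeeds,
while compare-and-swap can fail under contention and must be retried.

```cpp
#include <atomic>

std::atomic<unsigned> head{0};

// Fetch-and-add: every caller unconditionally receives a distinct ticket.
unsigned claim_with_faa() {
    return head.fetch_add(1, std::memory_order_relaxed);
}

// Compare-and-swap: succeeds only if head still holds the value we read,
// so contended callers loop until they win.
unsigned claim_with_cas() {
    unsigned expected = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(expected, expected + 1,
                                       std::memory_order_relaxed))
        ;  // on failure, expected is refreshed with the current value; retry
    return expected;
}
```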

~~~
eloff
I've seen many queues implemented similarly to this, and it works fine. Every
thread that calls fetch-and-add receives a different result; therefore,
wrapping aside, only one thread can have access to each slot. If there is only
one consumer and one producer you don't need fetch-and-add at all (that is how
FastFlow is implemented).

You must check that your slot is populated. One way is to compare with the
other index (the producer's, if you're the consumer), which is stupid because
that cache line is highly contended. The better way is to set a slot to null
after you've consumed it; if you're the producer, you first check whether the
current slot is null before attempting to increment the index. A race
condition exists where the slot can be null before the increment but not
after it (e.g. only one slot was free, but several producers saw that and
incremented). In reality all that means is that the producer can skip slots,
and the consumer must keep incrementing if it sees a null slot rather than
assume an empty queue (this can be tempered by checking x slots and then
falling back to checking the producer index). See the sketch below.

Consumer and producer indexes must be on separate cache lines, and some
effort can be made to keep consumer and producer on separate cache lines in
the actual queue, e.g. each slot can be a whole cache line and pop() and
push() can be batched to work with 8 elements at a time. In other words OP is
a noob and I could write a queue that whoops his hiney (danger: attempt at
humor.)
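
A compressed sketch of that scheme (my illustrative names, not the article's
code); note I use a per-slot CAS to guard the wrap race described above,
which is a small concession compared to a pure fetch-and-add design.

```cpp
#include <atomic>
#include <cstddef>

// A slot holding nullptr means "empty". Indexes live on separate cache lines.
template <typename T, size_t N>   // N should be a power of two
struct FaaRing {
    alignas(64) std::atomic<size_t> head{0};  // touched by producers only
    alignas(64) std::atomic<size_t> tail{0};  // touched by consumers only
    std::atomic<T*> slot[N];

    FaaRing() {
        for (auto &s : slot)
            s.store(nullptr, std::memory_order_relaxed);
    }

    bool try_push(T *item) {
        // Pre-check: if the slot at head looks occupied, treat it as full.
        size_t h = head.load(std::memory_order_relaxed);
        if (slot[h % N].load(std::memory_order_acquire) != nullptr)
            return false;
        h = head.fetch_add(1, std::memory_order_relaxed);
        T *empty = nullptr;
        // Losing this CAS means we raced onto an occupied slot; the index
        // is skipped -- exactly the "producer can skip slots" case above.
        return slot[h % N].compare_exchange_strong(
            empty, item, std::memory_order_release);
    }

    T *try_pop() {
        size_t t = tail.fetch_add(1, std::memory_order_relaxed);
        // nullptr may mean "empty" or "slot skipped by a racing producer";
        // a real consumer would probe a few further slots before concluding
        // the queue is empty, as described above.
        return slot[t % N].exchange(nullptr, std::memory_order_acquire);
    }
};
```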

~~~
nano_o
OK, so it is relatively easy to obtain a safe implementation, and from there
you can use a heuristic to prevent a thread from looping forever trying to
find an available slot. Would that sum up the issue in practice?

~~~
eloff
I'd rephrase that as: it's easy to create a safe implementation, but the
dangers and the biggest payoffs are realized when you don't access both
producer and consumer counters in the same thread all the time. Cache lines
shared between the consumer and producer and written to by at least one of
them are the hottest points of contention. With careful design you can get
that to zero in the common cases.
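
E.g., in C++11 you can force the two counters onto separate cache lines like
this (a sketch; 64 bytes is the usual x86-64 line size):

```cpp
#include <atomic>
#include <cstddef>

// Each counter gets its own cache line, so producer writes to head never
// invalidate the line that consumers read tail from, and vice versa.
struct Indexes {
    alignas(64) std::atomic<size_t> head;  // written by producers only
    alignas(64) std::atomic<size_t> tail;  // written by consumers only
};
```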

------
tilt_error
In what way is this different from the wait-free circular buffer
implementation described by Sape Mullender [1] (of Plan 9 fame)? I attended a
presentation he gave last November, and one of the take-aways was that it
scales poorly with the number of cores.

He implemented this in Plan 9, but the performance gain was not tremendous due
to shared memory effects among the cores.

[1] [http://lore.ua.ac.be/Teaching/CapitaMaster/antwerpen-threads](http://lore.ua.ac.be/Teaching/CapitaMaster/antwerpen-threads)

~~~
krizhanovsky
Unfortunately, there is no good explanation of the algorithm you refer to,
and I didn't find source code for it either.

------
coldcode
The subject might be interesting but the formatting of the article is
horrible. Is it that difficult to remove the line terminators at the end of
each line so that you get actual paragraphs?

~~~
NLPsajeeth
You may prefer this version of the article, which is better formatted but
split across multiple pages:
[http://www.linuxjournal.com/content/lock-free-multi-producer-multi-consumer-queue-ring-buffer](http://www.linuxjournal.com/content/lock-free-multi-producer-multi-consumer-queue-ring-buffer)

