
A fast multi-producer, multi-consumer lock-free concurrent queue for C++11 - setra
https://github.com/cameron314/concurrentqueue
======
colanderman
Maybe someone can educate me. I've studied them for a while, but I don't get
the fascination with lock- or wait-free queues.

If you actually care about real-time guarantees enough to look into lock-free
solutions, you _already_ ought to have non-preemptive workers (i.e.
cooperative multitasking). This then allows you to have only one queue per CPU
(since exclusive access is guaranteed), simplifying the problem.

With one queue per CPU, multi-producer becomes trivial (atomic doorbell +
round robin), and multi-consumer is easy (CAS to acquire a work item). You
don't have fairness guarantees for consumers, but who cares: consumers can
only be unfair when they aren't saturated, and in that case the unfairness is
harmless (arguably a good thing).

Application stalls don't happen (one non-preemptible queue per CPU),
application crashes are managed at a higher level, and IRQs can be disabled
for the 5 cycles you spend in the critical sections.

I assume I'm missing something, since I'm somewhat new at this, and lock-
freedom is clearly in vogue, but I don't know what it is.

~~~
RossBencina
Your architecture proposal seems sound assuming that you have that much
control over the whole system, and you don't need to integrate with any legacy
components. In the common case, you need to deal with limited control (e.g. in
a userspace app) and you have legacy components (e.g. a consumer OS, a Java
VM, etc).

If you're doing best-effort soft-real-time on a consumer OS then you probably
don't have control over thread affinity to _guarantee_ one thread per CPU, and
you certainly can't disable IRQs.

Even on servers, you can get Linux to set up your thread-per-CPU thing, but
still no IRQ masking. Also, you'd need your whole stack to do cooperative
multitasking in a fully composable way before you could do as you propose.
Maybe that's plausible in Go, but if you're in Java (as a lot of the HFT
people seem to be, for reasons I don't understand) then probably only _some_
of your system is structured as one-thread-per-CPU cooperative multitasking,
and you may still need to offload high-latency work using lock-free queues.

In my domain (real-time audio on desktop OSes), neither Windows nor macOS has
working deterministic priority-inversion mitigation. So using lock-free queues
between priority domains (e.g. between GUI thread and real-time audio thread)
avoids priority inversion risk (compared to using mutexes). Granted, this is a
rather specialised use-case.

Another reason people are interested in these things is that there's a
perception that lock-free algorithms perform better for inter-thread queuing
than the alternatives. In particular, there are claims that they scale better
when there are a large number of producer and consumer threads/cores. I think
this is the most controversial area, since the queue algorithms may scale
better than e.g. a single mutex on a single queue, but it's not necessarily
the case that an application architecture that uses a MPMC queue is the best
fit for purpose.

~~~
valarauca1

         If you're doing best-effort soft-real-time on a consumer OS then 
         you probably don't have control over thread affinity to _guarantee_ one thread per CPU
    

Which consumer OS? FreeBSD, OpenBSD, Linux, and Windows all offer the
functionality. To my knowledge MacOS is the only consumer OS that doesn't.

    
    
         and you certainly can't disable IRQs
    

This has been in Linux for over a decade. You can modify IRQ affinity from the
root shell via the proc-fs [1].

[1] [https://www.kernel.org/doc/Documentation/IRQ-
affinity.txt](https://www.kernel.org/doc/Documentation/IRQ-affinity.txt)

~~~
Unklejoe
Hmm. Interesting. I would have never thought to use that within a program
during runtime.

I don't think that the IRQ affinity facility is meant to be used like that
("that" being the masking of interrupts for a given CPU to create critical
regions within a program).

For starters, it seems like you'll need to be running as root to be able to
change it. That could be an issue for some applications. It also says that you
can’t disable interrupts for all cores.

It also seems very non-portable.

~~~
valarauca1

        I would have never thought to use that within a program during runtime.
    

At runtime, unlikely. But it does have clear benefits, especially when dealing
with 10GbE NICs: with N cores servicing two NICs, cache thrashing can
bottleneck your maximum bandwidth, while dedicating a 1:1:1 NIC -> core -> IRQ
mapping solves this.

Generally, setting up IRQ masking _should be_ part of your pre-startup
configuration, not something you set dynamically at run-time.

    
    
        It also says that you can’t disable interrupts for all cores
    

Yes/No.

One can disable _all_ interrupts for a core via separate mechanisms. But this
means you need to implement your own scheduler on that core, and remove that
core from the scheduler. Also you lose the ability to fire syscalls, lose
kernel level Mutexes, Futexes, IO, etc. It gets very complicated very fast.
Having _some_ interrupts is generally a good thing.

Lastly, you'll still be subject to memory-bus stalls (they aren't called
interrupts, but they are non-trivial stalls) incurred to maintain cache
coherency and R/W ordering (on AMD64 and ARMv8). You can never opt out of
these.

If you want to disable _all interrupts_ on _all cores_ why are you even
running an OS?

    
    
        It also seems very non portable.
    

AMD64 Linux is literally the most used OS in data centers, but you are
nonetheless correct.

~~~
Unklejoe
Thanks. Lots of good information there.

As for disabling interrupts on all cores, the specific case which I was
thinking of was when you’re running a preemptive OS and you have a bunch of
threads (== num CPUs) which all need to temporarily disable interrupts to
create a critical region at the same time. I could be misunderstanding the
original goal though. My understanding of the parent comment was that
disabling IRQs would be used to prevent a thread from being scheduled out in
the middle of an operation (like a kernel spinlock on a UP system), so that
you can avoid the lock even on a non-cooperative multitasking system. I guess
by temporarily disabling IRQs, the system becomes temporarily cooperative.

~~~
valarauca1

         My understanding of the parent comment was that disabling
         IRQs would be used to prevent a thread from being scheduled
         out in the middle of an operation
    

1\. The scheduler already respects this. It won't suspend during the _body_ of
a function. Only at entry/exit.

2\. `chrt --sched-deadline X` can ensure your thread will run for `X` ns
without interruption (higher priority than all other tasks), and that you
consistently get the same time budget. [1] [2] [3]

3\. The original goal of decoding audio can _mostly_ be handled with DEADLINE
+ setting CPU affinity + masking interrupts on that CPU. Modifying IRQ Masks
dynamically hits a lot of internal locking, and hurts your cache coherency.
Any gains in _one process_ will reflect massively negatively on the entire
system.

4\. This really seems like overkill. Deadline scheduling will do 99% of what
you/parent seem to need/want: basically your process can't be interrupted for
its runtime (in nanoseconds), and will run once every period (in nanoseconds).
I suggest:

    
    
         chrt --sched-deadline [TIME_BETWEEN_RUNS_NS] --sched-runtime [TIME_WILL_RUN_NS] -p [PID]
    

(I may have the Runtime/Time between backwards).

FYI, most of my experience is in networking for HFT, not audio decoding.

[1]
[https://en.wikipedia.org/wiki/SCHED_DEADLINE](https://en.wikipedia.org/wiki/SCHED_DEADLINE)

[2] [http://man7.org/linux/man-
pages/man1/chrt.1.html](http://man7.org/linux/man-pages/man1/chrt.1.html)

[3] Consistently means _as close to absolutely perfect as possible_. Your OS
is a dynamic system, so it'll be +/- a few nanoseconds. Generally the delta is
very small. Caches stall, red-black trees are re-balanced, work is stolen, etc.

------
RossBencina
New algorithms are always welcome. But it isn't a general-purpose MPMC queue
(nor does it claim to be). The following constraints are listed in the
"Reasons not to use" section:

\- not linearizable wrt multiple producers

\- not NUMA aware

\- not sequentially consistent, quote: "things (such as pumping the queue
until it's empty) require more thought to get right in all eventualities"

These are subtleties (especially the first and the last) that may bite if you
don't know what you're doing and just want to "plug in a queue." It's going to
depend on how you use the queue.

Side note: There are other examples of novel lock-free algorithms that have
only been published by blog, powerpoint or github (e.g. Dmitry Vyukov's well-
known work, Cliff Click's concurrent hash table, Jeff Preshing's hash table.)
However, in general, lock-free algorithms are widely known to be very
difficult to get correct (not too dissimilar to getting a Distributed
Consensus Algorithm correct). I can't help thinking that we need a higher bar
of correctness than the author's claims and some unit tests. Would you use a
distributed algorithm that didn't come with a correctness proof? Personally
I'd like to see a formal proof, peer review, and a Spin model. Peer-review
need not be via academic channels, just something more than self-publication.

~~~
eternalban
> Would you use a distributed algorithm that didn't come with a correctness
> proof?

That happens all the time in practice. Prior to Kyle Kingsbury's influential
blogs on noSQL darlings, for example, it was not even on the geek-pop radar.

And then, quite frankly, there is the gap between implementation and formal
description of an algorithm. Incorrect implementation will obviate the
guarantees your formal proof is asserting.

~~~
RossBencina
> there is the gap between implementation and formal description of an
> algorithm. Incorrect implementation will obviate the guarantees your formal
> proof is asserting.

Agree. One aspect of this wrt implementing lock-free algorithms in C++:
Usually the academic papers assume a sequentially consistent memory model, so
when you're implementing the algorithm you have to work out how to place the
memory barriers correctly (a potentially non-trivial task, as other comments
on this page demonstrate).

------
vvanders
There's a reason that there aren't many lock-free implementations. If you
don't get one from your CPU vendor then there's a high chance that it's not
correct.

~~~
DSMan195276
I would agree, but I'd add that it has been made a little bit easier to handle
with the addition of `atomic` in C11 and C++11, since it means you can write
atomic code without having to drop to inline assembly to ensure you use the
right instructions to make it lock-free. _That said_ , that's only one piece
of the puzzle, and you really have to know what you're doing to ensure you
write it correctly.

For what it's worth, after looking at this code the author _appears_ to know
what they're doing. I'd have to read it a lot closer to really make sure,
though.

~~~
ape4
`atomic` may require a lock, depending on the CPU.

~~~
DSMan195276
True, though in that event there are few other options. Locking those
variables individually like that is probably worse than other locking schemes,
so it's worth keeping in mind. But while it would be slower, it would still be
just as 'correct' as using genuinely lock-free atomic variables.

------
std_throwaway
Could someone please explain for the uninitiated what lock-free actually means
and why it matters?

After reading [http://www.drdobbs.com/lock-free-data-
structures/184401865](http://www.drdobbs.com/lock-free-data-
structures/184401865) i got this:

* Normal locking means that the process which holds the lock can hold it arbitrarily long, thereby locking out all other processes. Also, live-lock and dead-lock can occur if processes conflict while trying to acquire the same set of resources but cannot acquire all of them at once.

* Wait-free means that no algorithm working with the data structure will be delayed arbitrarily. This is pretty strong. A simple example would be a ring-buffer with a single reader and writer.

* Lock-free means that no process can block the resource for longer than it takes to read/write it. There will always be at least one process that can make progress while the others may have to wait (weaker than wait-free, but stronger than ordinary locked).

Normal processes on most operating systems can be interrupted at any
instruction. This would make it impossible to carry out a multiple-instruction
sequence to lock-modify-unlock the data structure because it could leave the
data structure locked. Does this in turn mean that there must be a "commit"
instruction that is uninterruptible?

~~~
RossBencina
Lock-freedom and wait-freedom are formal terms in concurrent algorithm theory.
Best to check out Herlihy and Shavit, "The Art of Multiprocessor Programming,"
which I have unfortunately temporarily misplaced.

Informally:

"Lock-free" means that at each time step, at least one thread that is
interacting with a shared object makes progress towards completing the
operation (e.g. an enqueue or dequeue operation). Neither deadlock nor
livelock (where no thread makes progress) can happen (hence "lock free"). This
does not guarantee fairness or starvation-freedom (a fast thread could in-
theory DoS a slow thread).

"Wait-free" means that at each time step, every thread will make progress.
This might e.g. involve an algorithm where fast threads help slow threads
complete their operations.

Wait-freedom is indeed a stronger condition, but it's usually more expensive
to implement (although not always).

> Does this in turn mean that there must be a "commit" instruction that is
> uninterruptible?

Yes, lock-free algorithms make use of atomic instructions such as CAS
(compare-and-swap). Sometimes it's used as a "commit" but there might be
multiple atomic operations depending on the algorithm and the data structure
(so I guess, a kind-of multi-phase commit sequence). This is a nice intro
paper by one of the giants of the field:

Maged M. Michael, "The Balancing Act of Choosing Nonblocking Features" ACM
Queue vol. 11, no. 7
[http://queue.acm.org/detail.cfm?id=2513575](http://queue.acm.org/detail.cfm?id=2513575)

Update: I should add that "lock free" hasn't always been a formally defined
technical term, and some people use it informally to mean "doesn't use
mutexes," or even "only uses atomic operations." Under such a relaxed
definition a hand-coded spin-lock might be considered "lock free," but it
really isn't -- if a thread holding the spinlock crashes, the spinlock would
never be unlocked; such a situation could not arise with a formally lock-free
algorithm.

~~~
pzh
It's interesting that many people consider lock-free algorithms to be
appropriate for real-time programming, but by the formal definitions, lock-
free doesn't guarantee that a particular thread won't be starved or that an
operation would finish by a certain amount of time. Maybe in these cases,
wait-free would be more appropriate...

~~~
RossBencina
> Maybe in these cases, wait-free would be more appropriate...

In some cases wait-free algorithms are used (e.g. real-time Java queues).

There's ongoing research into whether lock-free queues are wait-free in
practice.[0] For example, under some reasonable scheduling assumptions, lock-
free operations have been shown to have bounded time execution.[1] That's a
result for uniprocessors. I'm not aware of a corresponding result for
multiprocessors, but I haven't gone looking for a couple of years. There are
some leads in the first citation.

[0] "Are Lock-Free Algorithms Practically Wait-Free?"
[http://tce.technion.ac.il/wp-
content/uploads/sites/8/2015/06...](http://tce.technion.ac.il/wp-
content/uploads/sites/8/2015/06/SC-2.1-K-Censor-Hillel.pdf)

[1] Anderson, J. H. et al. 1997. “Real-Time Computing with LockFree Shared
Objects.” ACM Transactions on Computer Systems. 15(2):134–165
[https://cs.unc.edu/~anderson/papers/tocs97.pdf](https://cs.unc.edu/~anderson/papers/tocs97.pdf)

------
drfuchs
From the documentation, it doesn't seem clear that there's any guarantee that
a particular queued item will ever eventually be dequeued (in a program that
runs forever).

Consider the case where each producer thread queues N items, and then waits
until at least one of its N items is dequeued before immediately topping back
up; while the consuming thread dequeues at a slower rate than the producers
are able to produce. Maybe no item from producer number 1 ever gets dequeued?
Or did I miss something in the documentation?

~~~
kasey_junk
The documentation says it's not serializable or linearizable in some cases,
nor does it have preemption or fairness guarantees, so you are correct.

------
loeg
If you like this, you may also like
[http://concurrencykit.org/](http://concurrencykit.org/) .

~~~
film42
Facebook's folly also has a MPMC queue:

[https://github.com/facebook/folly/blob/master/folly/MPMCQueu...](https://github.com/facebook/folly/blob/master/folly/MPMCQueue.h)

------
Unklejoe
Another useful lockless library - [http://liblfds.org](http://liblfds.org)

It's written in C though.

------
anti-thought
For what it's worth, the base algorithm is extremely similar in concept to an
Erlang queue implementation I saw recently [1], and I do like doing these
language-implementation comparisons. The Erlang one definitely suffers from
the base language being copy-on-write, though.

[1] -
[https://github.com/dstar4138/libemp/tree/develop/src/buffers...](https://github.com/dstar4138/libemp/tree/develop/src/buffers/multiq)

------
Blackthorn
I'd like to raise my hat to some absolutely fabulous documentation. That's one
of the best READMEs I've seen in a very long time.

~~~
pzh
The README is really good for users of the library. I was looking for a
description of the algorithm, though, and I couldn't find any. Does anyone
know what algorithm this library implements? (e.g. a literature reference
would be helpful). I'm familiar with a couple of PQ implementations based on
skip-lists: Sundell & Tsigas and Linden & Jonsson--but this library doesn't
seem to be based on any of them.

------
CyberDildonics
I've used this and relied on it, it works very well.

~~~
tomovo
I have also used this for queuing audio buffers in my game. It works great and
the author was nice enough to do a quick adjustment for me. How I used it:
[http://www.catnapgames.com/2016/01/19/new-sound-
mixer/](http://www.catnapgames.com/2016/01/19/new-sound-mixer/)

------
martincmartin
Facebook's folly library contains a general-purpose, multi-producer multi-
consumer queue[1]. It's linearizable, and it's used extensively inside
Facebook.

Disclaimer: I work at Facebook.

[1]
[https://github.com/facebook/folly/blob/master/folly/MPMCQueu...](https://github.com/facebook/folly/blob/master/folly/MPMCQueue.h)

------
dkersten
I found this earlier this year and was using it for a little toy game project
to communicate between the rendering thread and everything else. It's super
easy to use :)

I didn't develop the project to the point where I could comment on its
performance though and most of my processing was happening in shaders anyway.

------
hitlin37
OK, I have also been looking into concurrent vectors lately, to get something
like this running between two threads.

Does anyone know of (or can point to) a header-only implementation of a
vector/queue with locks and _really_ simple code with an explanation?
Something written as beautifully as Herb Sutter's or Scott Meyers' examples;
those are easy code to play around with and understand :)

*An implementation without a vendor concurrency library. I'm looking for an implementation on Linux/ARM, although I do have access to pthread threads and locks.

~~~
nlightcho
A really simple templated queue that uses std::mutex.

[https://gist.github.com/AlexanderDzhoganov/cffa9aaae5a325cdf...](https://gist.github.com/AlexanderDzhoganov/cffa9aaae5a325cdfab818abfc1fc16c)

~~~
hitlin37
Thanks, i will give it a try.

------
nathancahill
Can't wait to see the aphyr teardown.

~~~
marvy
I thought that was for distributed stuff?

