
A Fast General-Purpose Lock-Free Queue for C++ - cameron314
http://moodycamel.com/blog/2014/a-fast-general-purpose-lock-free-queue-for-c++
======
amelius
This is an interesting showcase of lock-free programming. Thank you for this!

However, I see the following concerns:

1\. The code is so complicated that (imho) an actual formal proof would be
needed to show that this actually works.

2\. How portable is this? Will this work on architectures other than x86,
which have different memory-ordering guarantees?

3\. This reduces the time for enqueueing and dequeueing by an immense factor
(near 100x or 1000x?), but we have to keep in mind that queue operations
usually account for only a small fraction of an application's total running
time. Therefore, it would be nice to see that this reduction is worthwhile in
a practical application.

4\. The code for this queue is so large (2800 lines) that a large amount of
instruction cache would be needed to keep it resident and running as fast as
it could.

Again, thanks, this will be interesting to study.

~~~
cameron314
1\. It is complicated, and, of course, with any complex code comes the
possibility of bugs (especially when threads are involved!). I'd love to have
a formal proof myself, but it's not a simple undertaking -- the sheer number
of possible codepaths at execution time would make an exhaustive check
infeasible. (And expressing the _entire_ thing as a formal model is beyond my
means.) What I've done instead is to try to test individual subsystems in
isolation (you can check out the tests if you're interested): There's a free-
list, a thread-local storage map, an add-only lock-free list, two block
mapping methods, and the core algorithm itself. Then I wrote a series of unit
tests (that are really more like integration tests, I suppose) that test
individual functions, and everything together (and boy did those tests ever
catch bugs during development!). There's never too many tests. There's never
too much model-checking. But I only have limited time :-) So in the meantime,
I made an initial release when, to the best of my knowledge, it was stable and
correct, with an amount of testing that I feel is reasonably thorough. I'd
gladly accept more tests contributed by somebody else.

2\. It is meant to be 100% portable to any system that has a C++11 compiler. I
was very careful about the orderings I wrote, so yes, it should work on CPUs
with weaker memory models.
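
To illustrate (a generic sketch, not code from the queue itself): with explicit C++11 orderings, a release store paired with an acquire load publishes data correctly even on weakly ordered CPUs such as ARM or POWER, where the hardware and compiler would otherwise be free to reorder the stores.

    #include <atomic>
    
    int payload = 0;
    std::atomic<bool> ready{false};
    
    // Producer: the release store guarantees the write to `payload` is
    // visible before `ready` becomes true, on any conforming platform.
    void producer() {
        payload = 42;
        ready.store(true, std::memory_order_release);
    }
    
    // Consumer: the acquire load pairs with the release store above, so
    // once `ready` is observed true, reading `payload == 42` is guaranteed.
    void consumer() {
        while (!ready.load(std::memory_order_acquire))
            ;  // spin until published
        // safe to read payload here
    }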

3\. The factor depends on which part of the API is used, and with how many
threads. I absolutely agree that a fast queue by itself probably won't buy you
much in terms of application performance. But if you're going to write a data
structure, why not write a high-performance one? Much more fun, IMO, and it
certainly can't hurt.

4\. Here I must disagree. The number of lines of code of a heavily templated
data structure is entirely unrelated to the compiled size :-) In particular,
if you use only producer tokens, or only token-free operations, then a large
portion of the code is unused and will be optimized out. Same for the bulk
methods.

------
byuu
Please don't be upset with me for asking ... but I am sincerely curious: I
know that an MPMC lock-free queue is a hugely challenging thing (the best I
could do lock-free myself was one producer thread, one consumer thread), but
is this really the minimum level of complexity needed to implement the idea?
There are a lot of comments (which is great), but even so, the code just seems
incredibly _massive_ to me for what it is (over 2800 lines and 100KB, though
much of that is, of course, the excellent commenting).

If there really is no way to reduce the complexity (at least not without
ruining the performance), then I understand. I know code size isn't a
particularly important metric; I just find code that's been reduced to its raw
essence a lot easier to understand and learn from than over-architected, over-
abstracted code.

I ask because this is something I'm interested in understanding (I want to
write a nice thread-safe allocator), but the code even with the comments and
article is fairly overwhelming to me at the moment.

Either way, great job on making this! Really impressive performance metrics,
too.

~~~
cameron314
You're certainly right that it's not the minimum amount of complexity. But
it's not that hard to implement a simple MPMC queue (e.g. if you can assume
DCAS on your platform) -- what's hard is making it fast. So part of the
complexity derives from a design that calls for blocks of elements instead of
a simple linked list, for example; but a lot of the speed comes from that
design too.
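
For a rough idea of what "assume DCAS" buys (a sketch of the general technique, not this queue's code): a double-width compare-and-swap lets you pair a pointer with a version tag, which sidesteps the ABA problem in the CAS loops of a simple linked-list design. Shown here on a Treiber-style stack head for brevity; whether this is actually lock-free depends on the platform providing a 16-byte CAS (e.g. cmpxchg16b on x86-64, often requiring -mcx16).

    #include <atomic>
    #include <cstdint>
    
    struct Node {
        int   value;
        Node* next;
    };
    
    // Pointer + version tag, swapped together with one double-width CAS.
    struct TaggedPtr {
        Node*    ptr;
        uint64_t tag;   // bumped on every successful swap to defeat ABA
    };
    
    std::atomic<TaggedPtr> head{TaggedPtr{nullptr, 0}};
    
    void push(Node* n) {
        TaggedPtr old = head.load(std::memory_order_relaxed);
        TaggedPtr desired;
        do {
            n->next = old.ptr;
            desired = TaggedPtr{n, old.tag + 1};
        } while (!head.compare_exchange_weak(old, desired,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }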

The rest of the code size (and complexity) arises from the feature set (which
I intended to be fairly complete): Users can provide tokens for better
performance -- but don't have to. There are bulk methods. There's a move
constructor and assignment operator (for the tokens too!). There are traits so
that you can override the malloc/free that's used, and all the constants.
There's memory debugging machinery (hidden in an #ifndef NDEBUG). It all adds
up, but not without reason, I think :-)
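
As a sketch of the traits idea (the member names -- ConcurrentQueueDefaultTraits, BLOCK_SIZE, and the malloc/free hooks -- are recalled from the README and may differ slightly in the actual header), overriding the allocator and one constant looks roughly like this:

    #include "concurrentqueue.h"
    #include <atomic>
    #include <cstddef>
    #include <cstdlib>
    
    // Track how many bytes the queue allocates, and use bigger blocks.
    static std::atomic<std::size_t> bytesAllocated{0};
    
    struct CountingTraits : public moodycamel::ConcurrentQueueDefaultTraits {
        static const std::size_t BLOCK_SIZE = 64;        // constant override
    
        static inline void* malloc(std::size_t size) {   // allocator override
            bytesAllocated.fetch_add(size, std::memory_order_relaxed);
            return std::malloc(size);
        }
        static inline void free(void* ptr) {
            std::free(ptr);
        }
    };
    
    moodycamel::ConcurrentQueue<int, CountingTraits> q;  // uses the overrides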

I will add a "how to read the code" section to the README later on when I have
a minute. I think it would be instructive for those curious.

------
Animats
OK, now demonstrate that this doesn't require fence instructions to prevent
problems on superscalar CPUs (anything bigger than a low-end ARM today).

~~~
cameron314
Of course it requires fences. It wouldn't be correct otherwise, and would
probably not pass the tests even on x86, thanks to compiler re-ordering.

------
davidtgoldblatt
This doesn't seem to correctly implement a queue to me?

Suppose thread A enqueues an item, signals somehow to thread B that thread B
may proceed, and then thread B enqueues an item. There's no guarantee that
thread A's item will be dequeued before thread B's.

Of course, non-linearizable doesn't necessarily mean useless (this is in fact
an active research area), but it makes me think the performance comparisons
aren't really fair; you're comparing genuinely linearizable queues with one
that isn't bound by that performance-hampering requirement.

~~~
cameron314
I consider it a queue, despite its non-linearizability. I'll quote from my
README where I addressed this concern:

One particular consequence of this design (which seems to be non-intuitive) is
that if two producers enqueue at the same time, there is no defined ordering
between the elements when they're later dequeued. Normally this is fine,
because even with a fully linearizable queue there'd be a race between the
producer threads and so you couldn't rely on the ordering anyway. However, if
for some reason you do extra explicit synchronization between the two producer
threads yourself, thus defining a total order between enqueue operations, you
might expect that the elements would come out in the same total order, which
is a guarantee my queue does not offer. At that point, though, there
semantically aren't really two separate producers, but rather one that happens
to be spread across two threads. In this case, you can still establish a total
ordering with my queue by creating a single producer token, and using that
from both threads to enqueue (taking care to synchronize access to the token,
of course, but there was already extra synchronization involved anyway). I
expect this use case to be fairly rare, though!
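
As a concrete sketch of that workaround (assuming the ProducerToken / enqueue(token, item) API described in the README, with a mutex standing in for whatever synchronization the two threads already had):

    #include "concurrentqueue.h"
    #include <mutex>
    
    moodycamel::ConcurrentQueue<int> q;
    moodycamel::ProducerToken ptok(q);   // one token shared by both threads
    std::mutex tokenMutex;
    
    // Called from either thread; the mutex is the "extra synchronization"
    // that already defined a total order between the enqueues.
    void enqueueOrdered(int item) {
        std::lock_guard<std::mutex> lock(tokenMutex);
        q.enqueue(ptok, item);           // single logical producer => FIFO
    }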

~~~
davidtgoldblatt
This isn't really a great solution though, since it pushes the problem of
ensuring FIFO ordering outside of the queue and onto the user; but being FIFO
is entirely the point of the queue data structure. If your queue doesn't
provide ordering across multiple threads, it can't meaningfully be called a
multi-threaded queue, so it's not very fair to compare it to genuine multi-
threaded queues. Unless I've misread the documentation, this is just a
collection of SPMC queues, together with a read operation that checks all the
queues for an item to dequeue, right? But there is no synchronization that
occurs to tie any of these SPMC queues together. Even if you wrapped every
single access to this data structure in a lock, for instance, you would not be
able to ensure FIFO ordering of elements inserted into it.

I think the estimation that having inter-thread ordering requirements is
uncommon is incorrect. Probably the most common use of MPMC queues is as a
concurrency primitive in systems using a CSP-like abstraction, where such
requirements are extremely common. But you can't implement, say, Go's channels
in terms of this data structure.

~~~
cameron314
Yes, it is essentially an abstraction over a collection of SPMC queues (when
using tokens to dequeue, it is actually a bit smart in that it tries to pair
producers with consumers where possible).

If you wrapped every enqueue operation in a lock and used the same producer
token, then you'd essentially have a single SPMC queue, and it would be fully
FIFO. But obviously this defeats the purpose somewhat, so yes, it's not a good
solution if you need that guarantee.

I honestly didn't consider external inter-thread ordering at all while I was
designing the queue. I guess I was picturing more of a producer-consumer model
than a message-passing system (where this dependence between messages across
threads/producers is far more common, it seems).

I'd change this constraint if I could, but I can't. It's the fundamental
assumption at the very heart of the entire design. I'd have to start from
scratch (which I'm not prepared to do at this point). But I've changed the
README to make this limitation crystal clear (you're not the only one with
reservations about this aspect).

I think it's still useful, perhaps not for all applications, but for many.
Sorry if I came across as trying to portray it as something it isn't; that was
not my intention at all.

------
krislee
This idea is very similar to the Java Disruptor pattern ([http://lmax-exchange.github.io/disruptor/](http://lmax-exchange.github.io/disruptor/)).
So was this C++ queue independently created by the author, rather than being
an implementation of that pattern?

~~~
cameron314
Yes. I hadn't looked at the disruptor pattern until yesterday, and I still
don't fully understand it. I designed this queue from scratch.

------
stinos
May I ask why you chose to use inline on the non-template class methods? Did
it make any difference in the code produced? For the rest: this seems very
well done. Love all the proper comments and the huge number of static_asserts.

~~~
robert_tweed
Not the author, but since the whole thing is defined in a header, everything
has to be declared inline to avoid violating the one definition rule.

Inline hasn't had anything to do with performance for years (compilers just do
what they want anyway), but it's somewhat confusing that the meaning was
changed.
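
For example (a generic illustration of the ODR point, not code from the queue): a function defined in a header must be marked inline, or defined inside the class body, to avoid multiple-definition link errors when the header is included from several translation units.

    // helpers.h -- illustration only
    #pragma once
    #include <cstddef>
    
    // Without `inline`, two .cpp files including this header would each emit
    // a definition and the link would fail with "multiple definition".
    inline std::size_t roundUpToPowerOfTwo(std::size_t x) {
        std::size_t p = 1;
        while (p < x) p <<= 1;
        return p;
    }
    
    struct Counter {
        void bump() { ++n; }   // defined inside the class: implicitly inline
        std::size_t n = 0;
    };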

~~~
byuu
I found a cool trick around that recently. Use anonymous namespaces when you
want to have "standard-library/boost"-like template header libraries. Anything
inside of an anonymous namespace has no external linkage, and it can
transparently use anything inside its parent namespace.

For instance:

    namespace nall { namespace {
      void foo() {
      }
    }}

You can include this in multiple compilation units, invoke it with
nall::foo(), and have no issues with multiple symbols.

(and of course as rav said, as long as you put the function definition inside
the class definition, that can also work. But my trick will work if you want
the functions declared outside the class, or want non-class functions.)
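
A quick usage sketch (file names hypothetical): the same header can be included from two translation units and linked without duplicate-symbol errors, because each unit gets its own internal-linkage copy.

    // nall.hpp
    #pragma once
    namespace nall { namespace {
      void foo() {}
    }}
    
    // a.cpp
    #include "nall.hpp"
    void a() { nall::foo(); }
    
    // b.cpp
    #include "nall.hpp"
    void b() { nall::foo(); }
    
    // Linking a.cpp and b.cpp together works: each unit's foo() has
    // internal linkage, so there is no multiple-definition conflict.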

~~~
T-zex
Wouldn't that become a separate function per translation unit?

~~~
byuu
I want to say that GCC will do the same thing with the inline trick. It has
the option to inline the function (where there won't be any actual functions
in the resulting binary), or declare it as a unit-unique function (where the
actual function will be repeated in the binary, one for each object that
includes our header and uses the function), or recognize it's the same
function in each unit and merge it back to just one function for the entire
binary (which seems even more likely to happen with LTO.)

Theoretically, the same thing should happen here. The only difference is that
you don't _have_ to hint that you want the function inlined (and maybe you
really don't), but you can still mark the function inline (or always_inline)
if you want to ask GCC to do that, assuming it doesn't ignore you.

However, I haven't actually analyzed the resulting binaries comparing the two
approaches, so I can't say for sure if GCC can optimize the multiple functions
to one in the former case (inline) but not the latter case (anonymous
namespace). It would seem like an oversight if that were true.

