Hacker News
Lock-free Multi-producer Multi-consumer Queue on Ring Buffer (natsys-lab.blogspot.ru)
57 points by krizhanovsky 1259 days ago | 18 comments

How is this different from disruptor queues (http://lmax-exchange.github.io/disruptor/)?

(Here is a plug for our C version: https://github.com/redjack/varon-t).

The algorithms are simply different and solve different problems: disruptor is a messaging queue (if a producer P0 emits M0 into the queue, then all consumers C0..CN receive M0), while our queue is a classical work queue (if P0 and P1 emit messages M0 and M1, then only one consumer Ci receives M0 and only one consumer Cj receives M1, where i may equal j).
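To make the contrast concrete, here is a minimal single-threaded sketch (all names illustrative, not taken from either project): in a work queue the consumers share one head cursor, so a fetch-and-add hands each message to exactly one consumer, whereas in a disruptor-style queue each consumer owns a private cursor and therefore sees every message.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// A tiny ring of pre-filled messages, just for illustration.
std::array<int, 4> ring = {10, 11, 12, 13};

// Work queue: consumers share one head cursor. fetch_add gives each
// caller a unique slot, so each message goes to exactly one consumer.
std::atomic<unsigned> shared_head{0};

int work_queue_pop() {
    unsigned slot = shared_head.fetch_add(1);
    return ring[slot % ring.size()];
}

// Disruptor-style broadcast: every consumer keeps a private cursor,
// so advancing one consumer does not consume the message for others.
struct Consumer { unsigned cursor = 0; };

int broadcast_read(Consumer &c) {
    return ring[c.cursor++ % ring.size()];
}
```

With two work-queue consumers, successive pops return different messages; with two broadcast consumers, each one reads the full stream independently.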

Our implementation competes with boost::lockfree::queue, and it is much faster, since the Boost implementation uses heavier synchronization techniques. The benchmark on GitHub also includes a Boost implementation, so you can compare the two queues.

Unfortunately, there is no adequate algorithm description for the disruptor queue, only some indistinct descriptions aimed more at business people than at engineers. So it was not easy to dig into its source code. However, I studied it, and here are some notes on the implementation.

The implementation is a bit inaccurate: there are a lot of branches without branch-prediction information available at compile time, and to avoid cache line bouncing it wastes two cache lines instead of simply aligning an item on a cache line (I mean vrt_padded_int). I didn't pay much attention to the memory barrier usage, but given that x86-64 provides relatively strict memory ordering, some of them could probably also be eliminated. Disruptor uses very clever ideas, but I believe its performance could be improved after a good code review.
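The difference between the two layouts can be shown in a few lines. This is a sketch, not vrt_padded_int's actual definition: padding on both sides of an item costs at least two cache lines per item, while simply aligning the item to a line boundary costs at most one.

```cpp
#include <cstddef>

constexpr std::size_t CACHE_LINE = 64;  // typical x86-64 line size

// Padding on both sides of the value (similar in spirit to what the
// comment describes): every item occupies at least two cache lines.
struct PaddedBothSides {
    char pad_before[CACHE_LINE];
    int  value;
    char pad_after[CACHE_LINE - sizeof(int)];
};

// Simply aligning the item to a line boundary: the compiler rounds
// the size up to one full cache line, no more.
struct alignas(CACHE_LINE) AlignedItem {
    int value;
};

static_assert(sizeof(PaddedBothSides) >= 2 * CACHE_LINE,
              "two-sided padding costs at least two lines");
static_assert(sizeof(AlignedItem) == CACHE_LINE,
              "alignment alone costs exactly one line");
```

Both layouts prevent two adjacent items from sharing a line; the aligned version just halves the memory cost.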

One more point: while our queue implementation is C++, it is still self-sufficient and can easily be ported to C for use in kernel space. It is doubtful (in my humble opinion) that a generic container should depend on the non-standard libcork, let alone on a logging library (clogger).

Yes! Thank you for the link to Varon-T, I've been looking for a C implementation of disruptor queues.

Interesting, and the author is clearly an expert on the topic. However, I'm curious about how this fits into a sensible program architecture. When do you need such a queue? It seems to me quite unlikely that production and consumption operate at the exact same rate and with the same CPU usage such that the queue runs in a steady state of neither full nor empty. It seems much more likely that the queue will just be either full or empty all the time, in which case this algorithm will be terrible, because it busy-waits instead of having condition variables that can be used to wake the consumers or block the producers. By the way, blocking the producers is also known as flow control, which many people believe is a desirable property of software.

If your queue is normally empty, it would be fine for the producing threads to just jump straight into the consuming routine and never mind the queue. If the queue is normally full, it's fine for them to just take a lock and sleep on a condition.

Normally what you want a p/c queue for is to paper over short term variance in a system where on long time scales the consumer side is faster than the producer side. In that role, an ordinary queue with a mutex will work fine. If the consumer is stalled for long enough to fill the queue, blocking the producer is exactly what you want.

Of course, some languages favor the use of this pattern. I'm thinking of Go channels. Totally lock-free Go channels would be pretty awesome, but busy-waiting in select would not be awesome.

This is a good question. Currently we've integrated the queue into two projects in production: a user-space proxy server and a VPN capturer (a Linux kernel module).

In both cases we enhanced the queue with conditions such as "is the queue empty" and "is it full", so we can drop packets if consumers are overloaded and there is no point in putting a packet into the queue.

Secondly, the queue is designed for multi-core environments, which on modern hardware usually means multi-node NUMA systems. So we adjusted both applications so that an administrator can assign different numbers of cores to consumers and producers depending on the current workload. This makes the system more balanced, so the queue is rarely empty or full.

And finally, for kernel space we also implemented a lightweight, lock-less condition wait for the queue (http://natsys-lab.blogspot.ru/2013/08/lock-free-condition-wa...). It also gives us lower CPU consumption when there is no workload (so the system draws less power) and even better performance due to reduced cache bouncing.

…in which case this algorithm will be terrible because it busy-waits instead of having condition variables…

I'm seeing sched_yield() calls in there. It looks like a blocked process will yield its CPU core to available, productive work.

If there isn't enough work to keep all the CPU cores busy, then it will spin around asking "Am I ready?" more often than required, but at that point you have a machine under less than full load and it doesn't really matter (power use aside, and assuming the threads stay in cache while doing so).
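The pattern being described is spin-with-yield: instead of burning the core in a tight loop, a stalled thread gives its timeslice back to the scheduler on every failed check. A minimal sketch (illustrative, not the article's code; std::this_thread::yield() maps to sched_yield() on Linux):

```cpp
#include <atomic>
#include <thread>

// Wait for a flag by spinning, but yield the core on every failed
// check so other runnable work can use it.
void spin_wait(const std::atomic<bool> &ready) {
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::yield();  // give up the timeslice, don't burn it
}
```

Under full load this behaves almost like blocking; under light load it still wakes up more often than a condition variable would, which is the power caveat above.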

Note that "lock-free" doesn't mean "doesn't use mutexes"; it means "at least one consumer/producer will always make progress". (Any algorithm using locks fails this test, as one stubborn consumer/producer can hold up the entire system.) The article doesn't seem to really touch on this property at all (although at first glance the algorithm does appear to be lock-free).

I'm a bit skeptical about the code: it seems that it's using only the atomic fetch-and-add synchronization instruction, and not compare-and-swap. However, I think that whether queues are implementable with only fetch-and-add is an open problem in distributed computing (in technical terms, whether queues are in the Common2 family. See for example Afek et al., "Common2 Extended to Stacks and Unbounded Concurrency".)

I've seen many queues implemented similarly to this, and it works fine. Every thread that calls fetch-and-add receives a different result; therefore, wrapping aside, only one thread can have access to each slot. If there is only one consumer and one producer you don't need fetch-and-add at all (that is how FastFlow is implemented).

You must check that your slot is populated. Doing that by comparing with the other index (the producer's, if you're the consumer) is stupid, because that cache line is highly contended. The better way is to set a slot to null after you've consumed it; if you're the producer, you first check whether the current slot is null before attempting to increment the index. A race condition exists where the slot can be null before incrementing but not afterward (e.g. only one slot was free but several producers saw that and incremented). In reality all that means is that the producer can skip slots, and the consumer must keep incrementing if it sees a null slot rather than assume an empty queue (this can be tempered by checking x slots and then falling back to checking the producer index).

The consumer and producer indexes must be on separate cache lines, and some effort can be made to keep the consumer and producer on separate cache lines in the actual queue; e.g. each slot can be a whole cache line, and pop() and push() can be batched to work with 8 elements at a time. In other words, OP is a noob and I could write a queue that whoops his hiney (danger: attempt at humor).
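The null-slot scheme described above can be sketched roughly as follows. This is a simplified, illustrative version (names are mine, not OP's): the indexes are advanced only with fetch_add, a nullptr slot means "empty", and the slot operations are kept safe here with an atomic exchange/CAS even though the comment argues the index side needs only fetch-and-add.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

constexpr std::size_t N = 8;                // power of two for cheap modulo

// Globals have static storage, so every slot starts zeroed (nullptr).
std::array<std::atomic<int*>, N> slots;
std::atomic<std::size_t> head{0}, tail{0};  // should live on separate cache lines

bool try_push(int *item) {
    std::size_t i = tail.fetch_add(1) & (N - 1);  // unique slot per producer
    int *expected = nullptr;
    // Only write if the slot is still free; otherwise this producer
    // has effectively "skipped" the slot, as described above.
    return slots[i].compare_exchange_strong(expected, item);
}

int *try_pop() {
    std::size_t i = head.fetch_add(1) & (N - 1);  // unique slot per consumer
    // nullptr means the slot raced or the queue was empty; a real
    // consumer would keep incrementing rather than assume "empty".
    return slots[i].exchange(nullptr);
}
```

The retry/skip policy (check x slots, then fall back to the producer index) is left out to keep the sketch short; that part is where the real implementations differ.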

Ok, so it is relatively easy to obtain a safe implementation and from there you can use a heuristic to prevent a thread from looping forever trying to find an available slot. Would that sum up the issue in practice?

I'd rephrase that: it's easy to create a safe implementation, but the dangers and the biggest payoffs appear when you don't access both the producer and consumer counters in the same thread all the time. Cache lines shared between the consumer and producer, and written to by at least one of them, are the hottest points of contention. With careful design you can get that to zero in the common cases.

As far as I can tell, your "hiney-whooping" queue isn't lock-free (i.e., if one producer dies, all consumers will block waiting for that entry).

There are different kinds of lock freedom, but in this case, if you read carefully, that's a normal operating state and does not block consumers. It's the race I mentioned. The consumers have to be written to skip over a limited number of null entries and then check the actual producer index. In most cases that means a little extra iteration; it would be very rare to need to check the producer index, which is what makes it fast.

Indeed, fetch-and-add can only solve consensus for up to 2 actors [1]. However, that only directly relates to wait-free (= guaranteed individual progress) data structures. I forget how that relates to lock-free data structures.

[1] http://en.wikipedia.org/wiki/Fetch-and-add

In what way is this different from the wait-free circular buffer implementation described by Sape Mullender [1] (of Plan 9 fame)? I listened in on a presentation he gave last November, and one of the take-aways was that it scales poorly with the number of cores.

He implemented this in Plan 9, but the performance gain was not tremendous due to shared memory effects among the cores.

[1] http://lore.ua.ac.be/Teaching/CapitaMaster/antwerpen-threads

Unfortunately, there is no good explanation of the algorithm you refer to, and I also couldn't find source code for it.

The subject might be interesting but the formatting of the article is horrible. Is it that difficult to remove the line terminators at the end of each line so that you get actual paragraphs?

You may prefer this version of the article which is better formatted but split across multiple pages: http://www.linuxjournal.com/content/lock-free-multi-producer...
