(Here is a plug for our C version: https://github.com/redjack/varon-t).
Our implementation competes with boost::lockfree::queue and is much faster, since the Boost implementation uses heavier synchronization techniques. The benchmark on GitHub also includes the Boost implementation, so you can compare the two queues.
Unfortunately, there is no adequate algorithm description for the Disruptor queue, only some vague overviews aimed more at business people than at engineers. So it was not easy to dig into its source code. However, I studied it, and here are some notes about the implementation.
The implementation is a bit rough: there are a lot of branches without branch prediction information available at compile time, and to avoid cache line bouncing it wastes two cache lines per item instead of simply aligning an item on a cache line boundary (I mean vrt_padded_int). I didn't pay too much attention to memory barrier usage, but given that x86-64 provides relatively strict memory ordering, some of them could probably also be eliminated. The Disruptor uses very clever ideas, but I believe its performance can be improved after a good code review.
One more point: while our queue implementation is C++, it's still self-sufficient and can easily be ported to C for use in kernel space. It's doubtful (in my humble opinion) that a generic container should depend on the non-standard libcork and, moreover, on a logging library (clogger).
If your queue is normally empty, it's fine for the producing threads to just jump straight into the consuming routine and skip the queue entirely. If the queue is normally full, it's fine for them to just take a lock and sleep on a condition variable.
Normally what you want a producer/consumer queue for is to paper over short-term variance in a system where, on long time scales, the consumer side is faster than the producer side. In that role, an ordinary queue with a mutex will work fine. If the consumer is stalled long enough to fill the queue, blocking the producer is exactly what you want.
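A minimal sketch of that "ordinary queue with a mutex" (a hypothetical class, not from any of the codebases discussed here):

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Plain bounded producer/consumer queue: the producer blocks when the
// queue is full, which is exactly the back-pressure you want when the
// consumer stalls for too long.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(T item) {
        std::unique_lock<std::mutex> lk(mu_);
        not_full_.wait(lk, [this] { return q_.size() < capacity_; });
        q_.push_back(std::move(item));
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lk(mu_);
        not_empty_.wait(lk, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return item;
    }

private:
    std::mutex mu_;
    std::condition_variable not_full_, not_empty_;
    std::deque<T> q_;
    std::size_t capacity_;
};
```

Nothing clever: the predicate forms of `wait()` handle spurious wakeups, and the two condition variables wake only the side that can actually make progress.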
Of course, some languages favor the use of this pattern. I'm thinking of Go channels. Totally lock-free Go channels would be pretty awesome, but busy-waiting in select would not be awesome.
In both cases we enhanced the queue with checks such as "is the queue empty?" or "is it full?", so we can drop packets when the consumers are overloaded and there is no point in putting a packet into the queue.
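A rough sketch of the drop-on-full policy (hypothetical names, not the actual kernel code): the push simply fails instead of blocking, so an overloaded consumer causes packets to be dropped at the queue edge.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

// Non-blocking bounded queue: try_push() drops the item when the queue
// is full, try_pop() reports when it is empty.
template <typename T>
class DroppingQueue {
public:
    explicit DroppingQueue(std::size_t capacity) : capacity_(capacity) {}

    // Returns false (caller drops the packet) if consumers are overloaded.
    bool try_push(T item) {
        std::lock_guard<std::mutex> lk(mu_);
        if (q_.size() >= capacity_)
            return false;               // queue full: drop
        q_.push_back(std::move(item));
        return true;
    }

    // Returns false if there is currently nothing to consume.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> lk(mu_);
        if (q_.empty())
            return false;               // queue empty
        out = std::move(q_.front());
        q_.pop_front();
        return true;
    }

private:
    std::mutex mu_;
    std::deque<T> q_;
    std::size_t capacity_;
};
```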
Secondly, the queue is designed for multi-core environments, which on modern hardware usually means multi-node NUMA systems. So we adjusted both applications so that an administrator can assign a different number of cores to consumers and producers depending on the current workload. This keeps the system better balanced, so the queue is rarely empty or full.
And finally, for kernel space we also implemented a lightweight, lock-less condition wait for the queue (http://natsys-lab.blogspot.ru/2013/08/lock-free-condition-wa...). It gives us lower CPU consumption when there is no workload (so the system draws less power) and even better performance due to reduced cache bouncing.
I'm seeing sched_yield() calls in there. It looks like a blocked process will yield its CPU core to available, productive work.
If there isn't enough work to keep all the CPU cores busy, then it will spin around asking "Am I ready?" more often than required, but at that point you have a machine under less than full load and it doesn't really matter (power use aside, and assuming the spinning threads stay in cache while doing so).
He implemented this in Plan 9, but the performance gain was not tremendous due to shared-memory effects among the cores.