
Simple, Fast, Practical Non-Blocking and Blocking Concurrent Queue Algorithms - luu
http://www.cs.rochester.edu/u/michael/PODC96.html
======
jhallenworld
You can get better performance in the one writer thread / one reader thread
case by using simple lock-free circular buffers. No special instructions are
needed for this. Cache line thrashing can be minimized by allocating the
pointers into four cache lines: one writer local, one for writer-to-read
communications, one reader local, and one for reader-to-writer communications.

You do need barriers: issue a write barrier before giving a copy of the write
pointer to the reader. Issue a read barrier before reading the data.

Avoid cache line thrashing in a mostly full case by never allowing the
circular buffer to become completely full (leave at least one cache line
free).

The reader thread can quickly poll many circular buffers for work. The
pointers will all reside in the reader's cache until someone has written some
data (and updates the reader's copy of the write pointer).

You can get more benefits by delaying the update of the other thread's
pointers: on the writer's side, until we have no more data available to write.
On the reader's side, until we have read all the data (or some larger chunk of
it). This allows the cache-line prefetching hardware to work (to prefetch
likely used data).

Anyway, if you really want to use a linked list, at the very least allocate
multiple items per cache line, and then link the cache lines together (so one
link pointer or less per cache line).

~~~
fmstephe
If you want to see an implementation along the lines that jhallenworld
describes Martin Thomson has some very high quality implementations here in
Java. These are just single producer, single consumer queues

[https://github.com/mjpt777/examples/tree/master/src/java/uk/...](https://github.com/mjpt777/examples/tree/master/src/java/uk/co/real_logic/queues)

I have a number of Go implementations of the same queues here

[https://github.com/fmstephe/queues](https://github.com/fmstephe/queues)

------
jleahy
This should really be tagged as 1996, especially as since this was written the
world of lock-free algorithms has moved forwards a lot. At the very least
Fober et al suggested an alternative that performs a tad better in 2002
([http://nedko.arnaudov.name/soft/L17_Fober.pdf](http://nedko.arnaudov.name/soft/L17_Fober.pdf)).

------
coder23
I see a problem with implementing the non-blocking version in C. Compare-and-
swap must operate on an entire struct. But for example gcc only allows integer
types. Quote from gcc docs: _The definition given in the Intel documentation
allows only for the use of the types int, long, long long as well as their
unsigned counterparts. GCC will allow any integral scalar or pointer type that
is 1, 2, 4 or 8 bytes in length._

C11 has a new header stdatomic.h which only allows atomic for integer types.

How do you perform CAS on a struct in C( without using a mutex of course )?

~~~
gosu
By storing the struct in union with an integral type - or, if you need a value
but not an address, casting to such a union. You might also think to just
abuse pointer casting, but IIRC that breaks C strict aliasing assumptions and
causes GCC's optimizations to do bad things to your program.

