
Yes, You Have Been Writing SPSC Queues Wrong - NotThe1Pct
http://www.vitorian.com/x1/archives/370
======
tbirdz
I know the author doesn't intend this code to be "production ready", but I
just wanted to point out a problem that may not be completely obvious if you
are trying to use this structure for multithreaded communication.

The structure declares the read and write indices like so:

    
    
        std::atomic<IndexT> read_idx;
        std::atomic<IndexT> write_idx;
    

These two variables will be stored next to each other in memory. If IndexT
is, say, a 32-bit int, then read_idx and write_idx will be 4 bytes each. A
cache line is 64 bytes on Intel, so these two variables will end up on the
same cache line.

The problem is that cache coherence protocols work at the granularity of
lines, not bytes. So when one core writes anywhere on the line (for example,
the producer writing write_idx), the other core (say, the consumer core) has
to invalidate its cache entry for that line and fetch the new value over the
bus, or maybe even from memory.

Regardless of the specifics, the point is that the consumer and producer
cores can no longer write independently to their respective variables.
Whenever one writes to its variable, cache coherence traffic is generated to
update that line in the other core.

This is false sharing: the producer and consumer don't actually need to
access each other's variable, but the cores are forced to share the line
because both variables live on it.

This can have a big impact on multithreaded workloads and really bring down
overall throughput.

One solution is to add some padding between the variables. Something like:

    
    
        std::atomic<IndexT> read_idx;
        unsigned char pad[64];
        std::atomic<IndexT> write_idx;
    

will do. That will put the variables on different cache lines. And you might
want to pad out some other variables to different cache lines as well.
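On a compiler with C++11 `alignas` (an assumption about the build environment
on my part), the same effect can be expressed more directly; the 64 here is
the assumed x86 line size, and C++17 adds
`std::hardware_destructive_interference_size` for exactly this purpose:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: force each index onto its own 64-byte cache
// line instead of inserting a manual pad array between them.
struct Indices {
    alignas(64) std::atomic<uint32_t> read_idx{0};   // line 0
    alignas(64) std::atomic<uint32_t> write_idx{0};  // line 1
};

// With both members over-aligned to 64, the struct occupies two
// full lines, so the producer and consumer never share one.
static_assert(sizeof(Indices) == 128, "one 64-byte line per index");
```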

~~~
NotThe1Pct
Yes, it's not meant to be production ready. Still, the padding you added is
specific to Intel/AMD CPUs, and this article is aimed more at constrained
memory/embedded use.

On Intel you would not care about the missing slot when you have 256GB of
memory available.

I've added the atomic<> on GitHub and updated the article as well.

However, atomic<> adds an mfence instruction, which is not really necessary
and might add a lot of latency.

~~~
gpderetta
You can specify the memory ordering of operations to get the desired fences.

For an SPSC queue, load-acquires and store-releases are sufficient, and they
have no overhead at all on x86.

Don't use operator++ to increment (it will emit an expensive lock xadd);
just code the explicit load + add + store sequence. That is safe in this
specific case.
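A sketch of what that looks like (the queue shape and names are my own,
following the thread's unbounded-index design, not code from the article):

```cpp
#include <atomic>
#include <cstdint>
#include <optional>

// Minimal SPSC sketch with explicit orderings. Indices run freely and
// wrap modulo 2^32, so N must be a power of two.
template <typename T, uint32_t N>
struct SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf[N];
    std::atomic<uint32_t> read_idx{0};
    std::atomic<uint32_t> write_idx{0};

    bool push(const T& v) {  // producer thread only
        // Relaxed load of our own index: no one else writes it.
        uint32_t w = write_idx.load(std::memory_order_relaxed);
        uint32_t r = read_idx.load(std::memory_order_acquire);
        if (w - r == N) return false;          // full
        buf[w & (N - 1)] = v;
        // Explicit load + add + store instead of operator++ /
        // fetch_add: a single writer needs no lock xadd. The release
        // publishes the slot write to the consumer.
        write_idx.store(w + 1, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {  // consumer thread only
        uint32_t r = read_idx.load(std::memory_order_relaxed);
        uint32_t w = write_idx.load(std::memory_order_acquire);
        if (w == r) return std::nullopt;       // empty
        T v = buf[r & (N - 1)];
        read_idx.store(r + 1, std::memory_order_release);
        return v;
    }
};
```

On x86 both the acquire loads and the release stores compile to plain mov
instructions, so there is no mfence anywhere on the fast path.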

------
rkevingibson
I'm curious what problems you think unbounded indices cause. Unsigned
overflow is well defined in C++, so I don't see the problem here. AFAIK, the
only issue would be with requiring power-of-2 sizes.

~~~
loeg
Yeah. It just requires power-of-2 sizes and handling of the write_idx <
read_idx case.

~~~
ashtuchkin
AFAIK the write_idx < read_idx case is handled automatically by the unsigned
subtraction.

Power of 2 requirement stands, though.
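To illustrate (the function name and the example values are my own, not from
the article):

```cpp
#include <cstdint>

// Element count for free-running (unbounded) 32-bit indices.
// Unsigned subtraction wraps modulo 2^32, so the result is correct
// even after write_idx has wrapped past zero while read_idx has not.
inline uint32_t queue_count(uint32_t write_idx, uint32_t read_idx) {
    return write_idx - read_idx;
}

// e.g. write_idx = 5 (really 2^32 + 5) and read_idx = 0xFFFFFFFB
// (2^32 - 5) gives a count of 10, not a huge bogus value.
```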

~~~
gpderetta
Yes, this exact trick is used for TCP sequence numbers, for example, which
are allowed to overflow; as long as the unack'd bytes number fewer than 2^31
and the computation is done with unsigned integers, everything works fine.


------
jessaustin
Why is push() checking for overlap? I thought not caring about that was the
point of a ring buffer?

[EDIT:] To clarify for those who didn't (couldn't?) read TFA's first sentence,
it specifically invokes "I've been writing ring buffers wrong all these years"
[0], recently featured on HN.

[0] [https://www.snellman.net/blog/archive/2016-12-13-ring-buffer...](https://www.snellman.net/blog/archive/2016-12-13-ring-buffers/)

~~~
LgWoodenBadger
I can't read the site because it's blocked for me, but isn't a ring buffer a
different class of data structure than a queue? AIUI, queues are for
non-lossy FIFO behavior.

~~~
loeg
No, a ring buffer is just one implementation of a queue. You can implement
lossy or lossless ring buffers, or linked-list queues.

------
decker
I didn't see the referenced article and initially thought this was going to be
a post about not using volatile instead of atomic load / store.

I suppose this could be a useful implementation; however, it's not clear how
useful, since the main place it has an impact is when you have a large
number of very small queues.

The overhead associated with the unused element is sizeof(T) / (sizeof(T) *
n + sizeof(RingBuffer) + malloc_overhead), where T is likely to be at most
the size of a pointer on the platform. If we assume a standard x86-64
platform with sizeof(T) equal to 8, we can eyeball the struct and guess that
GCC will probably spit out a 24-byte structure, since it's going to want
8-byte alignment on the T* after the uint32 values. On top of that, not
counting malloc overhead, the array will take up n * sizeof(T) bytes. For
regular malloc, the overhead on an 8-byte allocation is an extra 12 bytes
(according to malloc.c). Assuming we are allocating our ring buffer on the
stack, this brings the total to 48 bytes for the "correct" ring buffer vs.
56 bytes for the "wrong" one.

Actual savings will be at most 14% and go to 0% as the buffer size goes to
infinity, proportional to 1/n, which is less than one might assume.
Consequently, for any buffer longer than 2, the power-of-two size
requirement will likely waste more memory than it saves: we save at most 8
bytes but lose the difference between a power of two and the desired queue
size. So odds are it makes more sense to write a 1- or 2-element ring buffer
as a special case than to use this implementation for all ring buffers.

~~~
NotThe1Pct
Quite frankly, I wrote this more as an exercise. I have never used it in
practice other than in the GitHub test. I saw people picking up on the other
post and didn't really like the indices going unbounded, so I spent a few
minutes thinking about how to solve it.

While I think the implementation per se is not that useful (I agree with you),
I believe the actual trick can be reused in a different situation or type of
container.

Check out the GitHub repo now; I added a choice of allocator. That is nice
because the main use of this kind of tool is when you have a slab of mapped
memory that you partition up.

------
jstapels
I honestly don't see how any of these implementations are better than simply
using a bool to track whether the queue is empty (as suggested in one of the
comments on the original article). You could even use the highest bit of the
write index to store the bool if memory were an issue.
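A single-threaded sketch of the high-bit variant (my own code, not from
either article); note that pop() also writes write_idx, which is exactly why
this trick is awkward in the SPSC case, where each index should have a
single writer:

```cpp
#include <cstdint>

// Ring buffer of size N (not necessarily a power of two) that steals
// the top bit of write_idx as a "full" flag, so all N slots are usable.
template <typename T, uint32_t N>
struct FlagRing {
    static constexpr uint32_t FULL = 0x80000000u;
    T buf[N];
    uint32_t read_idx = 0;
    uint32_t write_idx = 0;  // top bit doubles as the full flag

    bool push(const T& v) {
        if (write_idx & FULL) return false;  // full
        uint32_t w = write_idx;
        buf[w] = v;
        w = (w + 1) % N;
        // If the writer caught up to the reader, mark the ring full.
        write_idx = (w == read_idx) ? (w | FULL) : w;
        return true;
    }

    bool pop(T& out) {
        uint32_t w = write_idx & ~FULL;
        if (w == read_idx && !(write_idx & FULL)) return false;  // empty
        out = buf[read_idx];
        read_idx = (read_idx + 1) % N;
        write_idx = w;  // any successful pop clears the full flag
        return true;
    }
};
```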

~~~
buzzybee
Some of these implementations also reduce the number of branches, which is
generally good for performance on modern pipelined, branch-predicting CPUs,
though it's not possible in all cases and not even a guaranteed win.

FWIW, I checked what I did in the last ring buffer I wrote: I used position
+ length to eke out that last cell. As the original article notes, it's not
a concurrency-friendly solution, but I wasn't writing for concurrency in
this case.

------
andars
This link: [http://www.sunburst-design.com/papers/CummingsSNUG2002SJ_FIF...](http://www.sunburst-design.com/papers/CummingsSNUG2002SJ_FIFO1.pdf)
was posted in the previous discussion on ring buffers. It provides what
seems to me the best solution of any I have seen. See page 2, section 2.2:
for a buffer of size 2^N, use (N+1)-bit indices. If all bits of the indices
are equal, the buffer is empty; if all but the most significant bit are
equal, the buffer is full.
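Translated from the paper's hardware setting into software terms (the
constants and helper names are mine):

```cpp
#include <cstdint>

// Cummings-style FIFO pointers: the buffer holds 2^N entries, but the
// indices are kept N+1 bits wide. The extra MSB tells full from empty.
constexpr uint32_t N    = 4;                      // 16-entry buffer
constexpr uint32_t MASK = (1u << (N + 1)) - 1;    // keep N+1 bits
constexpr uint32_t MSB  = 1u << N;

inline bool fifo_empty(uint32_t w, uint32_t r) {
    return (w & MASK) == (r & MASK);              // all N+1 bits equal
}

inline bool fifo_full(uint32_t w, uint32_t r) {
    // Low N bits equal but MSBs differ: writer is exactly one lap ahead.
    return ((w ^ r) & MASK) == MSB;
}
```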

~~~
danbruc
That is essentially what this implementation does too: wrapping the indices
after twice the size is just another way of saying you use one additional
bit.

