
They are specific. See paragraph 5 of 5.1.2.4 "Multi-threaded executions and data races" in N1570.

In terms of real hardware, it maps well onto MESI/MESI-like protocols on cache lines with out-of-order cores, without many additional constraints (of course, some architectures are weak enough that they still require special instructions, but on the other hand x86 doesn't need anything for acq/rel). If different cores never touch the same cache line, they don't need to interact at all, even when both execute unrelated acq/rel atomics.
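To make the acq/rel pairing concrete, here is a minimal C++ sketch of a release store paired with an acquire load (names are illustrative, not from anything above). On x86 both atomics typically compile to plain MOVs; weaker architectures may emit extra ordering instructions:

    // Minimal acquire/release example (C++11 and later).
    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                  // ordinary write
        ready.store(true, std::memory_order_release);  // publishes the write above
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) // pairs with the release store
            ;                                          // spin until published
        assert(payload == 42);                         // guaranteed visible here
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }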



Interesting. I bought C++ Concurrency in Action today and am learning all these details of the memory model.

Your mention of caches makes me realize that a single-reader single-writer queue can probably also be optimized by putting the head pointer and the tail pointer on different cache lines. The reader and writer can each keep a cached copy of the other's value next to their own, and only reload it when the queue otherwise looks empty or full, respectively. This should allow the reader and writer to run for many operations without needing to synchronize cache lines. Something like the sketch below.
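A rough C++ sketch of that layout (class and member names are illustrative, and the capacity handling is simplified): head and tail sit on separate cache lines, and each side keeps a private cached copy of the other's index, refreshing it only when the queue looks empty or full.

    // Sketch of an SPSC ring buffer with split, cache-line-aligned indices
    // and locally cached copies of the other side's index.
    #include <atomic>
    #include <cstddef>
    #include <optional>

    template <typename T, std::size_t N>
    class SpscQueue {
        T buf_[N];

        alignas(64) std::atomic<std::size_t> head_{0}; // written by consumer
        std::size_t cached_tail_ = 0;                  // consumer's copy of tail_

        alignas(64) std::atomic<std::size_t> tail_{0}; // written by producer
        std::size_t cached_head_ = 0;                  // producer's copy of head_

    public:
        bool push(const T& v) {                          // producer thread only
            std::size_t t = tail_.load(std::memory_order_relaxed);
            std::size_t next = (t + 1) % N;
            if (next == cached_head_) {                  // looks full: refresh cache
                cached_head_ = head_.load(std::memory_order_acquire);
                if (next == cached_head_) return false;  // really full
            }
            buf_[t] = v;
            tail_.store(next, std::memory_order_release); // publish the element
            return true;
        }

        std::optional<T> pop() {                         // consumer thread only
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h == cached_tail_) {                     // looks empty: refresh cache
                cached_tail_ = tail_.load(std::memory_order_acquire);
                if (h == cached_tail_) return std::nullopt; // really empty
            }
            T v = buf_[h];
            head_.store((h + 1) % N, std::memory_order_release); // free the slot
            return v;
        }
    };

Each shared index is only loaded with acquire when the cached copy makes the queue look empty/full, so in the common case neither side touches the other's cache line.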


I think that works on some architectures but wouldn't be guaranteed behavior. In theory, on a system with no memory ordering constraints, the pointer updates could become visible before the cache lines holding the underlying ring buffer. So without an acquire barrier at the beginning of aring_take (which, without an item count, would require writes to the shared head/tail pointers to sync anyway), you can't ensure that the data written into the ring buffer is visible to the consumer thread, even though the pointers indicate there's data there, and you may load torn or completely different data in the consumer.

That means that if your memory barriers are operating correctly and the pointers are split across two lines, they're just causing two cache lines to sync for the metadata instead of one.

With an item count, by contrast, only one line ever has to sync, even if the pointers are split across two, since the pointers themselves aren't shared between threads.

In practice, though, the compiler might not be smart enough to realize this and might enforce ordering for all side effects before the barriers, even if the other thread never reads them. So maybe this isn't a good approach.
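For comparison, a rough C++ sketch of the item-count variant described above (again, names are illustrative and this is only a sketch): the indices stay private to their owning threads, and the single shared atomic count is the only state that ever has to bounce between cores.

    // Sketch of an SPSC ring buffer where the only shared state is a counter.
    #include <atomic>
    #include <cstddef>
    #include <optional>

    template <typename T, std::size_t N>
    class CountedSpscQueue {
        T buf_[N];
        std::size_t head_ = 0;                          // consumer-private
        std::size_t tail_ = 0;                          // producer-private
        alignas(64) std::atomic<std::size_t> count_{0}; // only shared state

    public:
        bool push(const T& v) {                         // producer thread only
            if (count_.load(std::memory_order_acquire) == N)
                return false;                           // full
            buf_[tail_] = v;
            tail_ = (tail_ + 1) % N;
            count_.fetch_add(1, std::memory_order_release); // publish the element
            return true;
        }

        std::optional<T> pop() {                        // consumer thread only
            if (count_.load(std::memory_order_acquire) == 0)
                return std::nullopt;                    // empty
            T v = buf_[head_];
            head_ = (head_ + 1) % N;
            count_.fetch_sub(1, std::memory_order_release); // free the slot
            return v;
        }
    };

The acquire load / release RMW on count_ is what carries the visibility of the buffer contents across threads; head_ and tail_ never need atomics at all.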





