State observed by threads 2-4: counter = [4,5,1,2,3,4,5,6], state = [blue, green, blue]
Thread 1 wakes up, sees counter = 6, state = blue;
Logical conclusion: system state is counter++, state unchanged.
Actual scenario: counter+=6, state+=2
I believe this was first covered in my distributed computing class, but was probably most concretely illustrated in the case of concurrent garbage collection. Any unfairness in the thread scheduler can lead to some threads getting way more clock cycles than others. In low frequency scenarios you can convince yourself it's not a problem, only to be punished by success into finding out you're wrong when you have 10,000 active users instead of 10.
This is especially true with spinlocks, because a spinlock is a greedy algorithm that trades fairness for responsiveness. That can result in low throughput or even priority inversion. They are a very sharp knife: indispensable for very specific tasks, certain bloodshed for others.
But either way, the counter can never be 6 since the max counter is set to the number of threads, and each thread at the barrier decrements the counter. Once a thread is spinning, it doesn't care about the counter anymore, only the state. Once the last thread reaches the barrier, the state flips, and all threads awaken. Each thread maintains a local copy of what the "reversed state" should be to awaken. If one thread gets preference and advances to the next barrier, it will decrement the counter it reset and will spin for the global state to flip again. So even if it reaches the second barrier before any of the other threads awaken from the first, it will wait for the rest of the threads to catch up.
Multicore systems have hardware support for some consensus algorithms. Speed of light is practically a rounding error, but not always. And clock synchronization is... I'll use the word attenuated, since I hesitate to call it 'solved' or 'avoided'.
The basic problem is that for any physical core, there is no guarantee that any writes will ever be seen by any other core (unless the necessary extra magic is done to make this so).
(There's a matching problem when reading, too, in that in theory writes made by another core can be never seen by the reading core, even if they have reached the domain covered by cache coherency, unless the necessary extra magic, etc.)
So here with this code, in theory, the writes being made simply are never seen by another core, and so it doesn't work.
As it is, in practice, writes typically make their way to other physical cores in a reasonably short time, whether or not the extra magic is performed; but this is not guaranteed. Also, of course, the ordering of the emergence of those writes is not guaranteed (although on some platforms additional guarantees are provided - Intel is particularly generous in this regard, and performance suffers because of it).
They're using atomic read/writes with sequential-consistency.
This means that the compiler will automatically put memory-barriers in the appropriate locations to guarantee sequential-consistency, but probably at a significant performance cost.
The implementation is likely correct, but it's the slowest option available (seq-cst is the simplest / most braindead memory ordering, but the slowest, because you probably don't need all those memory barriers).
I'd expect that acq-release style barriers are possible for this use case, which would be faster on x86, POWER, and ARM systems.
In the C11 code? I'm not up to speed with that version of the spec, but unless there is behaviour invoked by the use of _Atomic, the code looks to me to be performing normal, non-atomic read/writes.
Another reply says there are full barriers being automatically used, but that doesn't address the problem I've described.
That is the behavior invoked by "atomic".
The general gist is that "manual" read/write barriers were too difficult to think about and led to bugs. Instead of having the programmer put the read/write barriers in the correct locations, modern compilers put a type-specification on variables... and then it's the compiler's job to put the barriers in the correct location.
This turned out to be necessary, because the compiler's optimizer kept rearranging code (!!!) around memory barriers anyway. So the "compiler" needs to participate in any of the barriers, especially at higher levels of optimization (-O3 rearranges a lot of code). If the compiler needs to participate, you might as well have the compiler handle the placement of those barriers.
"Atomic" variables by default will be sequentially-consistent (ie: the _maximum_ use of memory barriers for _maximum_ levels of consistency).
Atomic writes have nothing to do with memory barriers. On modern platforms atomic writes are available without memory barriers of any kind.
Memory barriers do not cause events to become visible.
Only atomic writes do this; and so the question is not whether barriers are in use due to _Atomic, but whether atomic writes are in use due to the use of _Atomic.
_Atomic types can be read-modified-written atomically (if using a compound-assignment operator) or with the post-increment/pre-increment operators.
In these circumstances, the memory-barriers used for those atomic-operations will be of the sequentially-consistent memory model.
So yes, that "++*barrier" operation is an atomic read-modify-write AND comes with memory barriers in sequential-consistency style.
I don't believe that "atomics" are enough to solve this problem. At a minimum, I expect acq-consume to be required (not that acq-consume model even works in practice... so acq-release for today's compilers). That's why I'm still talking memory model / fences here.
Not only is an atomic operation required, but so too is the "publishing" of this information needing to happen in the proper order. Fortunately, C11's _Atomic implementation solves both issues.
The quoted material does not describe this assertion.
Consider something basic like implementing a mutex: once a thread unlocks a mutex, it is possible in theory that no other thread _ever_ sees the mutex as unlocked, and spin on it forever, but such a system would be useless.
In practice, general purpose CPUs have writes become visible as fast as possible.
You won't find this behavior defined strictly in memory models, because how can you? These talk only about ordering and eschew things like "global clocks" and even if you had such a clock, how are you going to put a numerical bound on the delay?
Most will leave it as a quality of implementation issue and specify that writes become visible "eventually" if they say anything at all.
Barriers only influence the order in which events become visible if and when they DO become visible.
Barriers say nothing about whether or not events actually do become visible; and it may be that they never do (if the necessary magic is not performed, etc).
You mention an atomic increment. I don't see atomic increments in the C11 code, but I'm not up to speed with that version of the specification.
(Note that an atomic increment will solve the write-side problem, but not the read-side problem. The reader must still use a read barrier; but if there's an implicit full barrier, then that's being done.)
A fence or atomic operation guarantees that a store will be visible globally in a finite amount of time.
Edit: that's true for C11 _Atomic as well (I had to double check), i.e. x++ where x is an _Atomic int is both atomic and seq-cst.
Firstly, if the writer has used no barriers, writes can emerge in any order (and may never emerge at all, anyway, even if barriers were used). Secondly, if the reader has not used barriers, then even if writes have been written in some given order, the reader will by no means observe that order.
> A fence or atomic operation guarantees that a store will be visible globally in a finite amount of time
A fence does NOT guarantee this. A fence controls, and only controls, the order of events. It has NO influence on whether events actually emerge. An atomic operation does, but this only solves the write-side of the problem; the reader still needs to issue a read barrier to guarantee seeing atomically written values.
I may be wrong, but I would be shocked if this was so.
I think you may be mixing up non-tearing writes with atomic writes. Atomic writes are performed using either cache-line locking (Intel) or exclusive reservation granules (everyone else), and as such require the employment of a specific mechanism (typically a macro or function, which leads to the use of the particular special instructions for atomic writes).
The overhead of an atomic write is very large, and it is often not needed. It would make no sense for every write (to an _Atomic type) to be atomic.
As per the Intel docs, all aligned stores and loads are atomic, with release and acquire semantics respectively. XCHG (which is somewhat expensive) is used for SC stores, but plain loads still suffice for SC loads.
On other architectures, loads and stores are usually at least atomic, although with only relaxed ordering semantics.
Normally non-tearing + cache coherence is all that is required for relaxed atomic loads/stores. You are confusing them with general RMWs, which require specialized instructions on Intel or ll/sc on RISCs (although ARM did add a bunch of specialized atomic instructions as well). To be pedantic, as far as I know, Intel doesn't lock the cache line in any special way during an atomic RMW: it simply delays the read until all preceding stores have been flushed from the store buffer, then, if it has successfully acquired the line in exclusive mode, executes the load+store within whatever minimum exclusive cache hold period is guaranteed by the coherence protocol. Acquiring a cache line in exclusive mode is not specific to atomic RMWs but applies to any store, and it is not really a lock, as it can be taken away at any moment (i.e. the cc arbiter guarantees forward progress of the system as a whole).
For example, compare the thread counter part of the "v" value to the number of threads. If equal, do an atomic store which writes the opposite phase with a thread count of 0. If not equal, spin until the phase bit changes.
This approach avoids a division and can support threads that leave the barrier entirely. (And theoretically also entering the barrier, but that's always dicey regardless of the barrier implementation.)
It seems to me that a counter that can count to n is enough, and there is no need for an explicit phase bit:
Wait(atomic_int *b, int n):
    for (;;) {
        int x = atomic_load(b);
        if (x == n - 1) {
            atomic_store(b, 0);                       // last thread: reset, done
            return;
        } else if (atomic_compare_exchange_strong(b, &x, x + 1)) {
            while (atomic_load(b) != 0)               // spin until the reset
                ;
            return;
        }                                             // CAS failed: retry
    }
Here's one I wrote:
Now it is a lot larger and CAS based, rather than an unconditional increment, but that was to support the extra feature of having the "breaker" run a function and return the value, which I needed in this context, as well as reporting the number of spins, etc. The earlier iterations had an unconditional increment.
The idea is substantially the same as Chris' and uses only 1 word of state (though my implementation stores the n, the break count, as a member, you could pass it in as Chris does): but it uses the sign bit rather than the (log(n) + 1)th bit to indicate the phase: the count goes negative when in the "draining" phase after the barrier is broken. It supports arbitrary break counts without division and the fast path is comparable to the one in the post, though "fast path" is kind of meaningless when you expect to spin most of the time (as long as you don't invoke the scheduler, at least).
That said, Chris's bit manipulation approach is very compact and elegant!
I think the Disruptor is probably still the best approach for serializing multiple producers on an x86 platform. The core idea is to decouple the producer and consumer so that they do not contend over the same CAS variable (the position in a ring buffer). If you only have a single producer, the CAS goes entirely uncontended and you will get tens to hundreds of millions of items per second without issue. Even with multiple producers, the figures are fairly incredible.
The trade-off is that you basically have to sacrifice an entire core to busy-waiting if you want the world-class latency. Alternatives can yield or wait on conditions, but they would clearly be slower.
I am not advocating for this. I am simply demonstrating how the Disruptor leverages these kinds of techniques (memory barriers, et al.) to provide a high performance outcome for its intended use cases.
I would be exhilarated if someone could demonstrate an even faster way to do what the Disruptor does (for x86 platforms).
If you need a total message order and many-to-all messaging it is indeed hard to do better than the Disruptor, but that's kind of a very specialized need. If you can relax the message ordering for example, multiple 1-to-all queues can be significantly more efficient.
I somewhat disagree with this. I have been able to build a lot of business systems that have nothing to do with fintech on top of the exact same abstraction. Order matching engines are not the only thing that can benefit from being able to deal with millions of serialized events per second. Most video games, especially multiplayer ones, can benefit from this style of abstraction too.
If the consequences of any given event may impact the processing of any subsequent event, then yes. For many ordinary systems this constraint holds true.
Don't use spin locks unless you have profiled and have a clear understanding of the contention factors involved. Also be sure to test on single core and massively multicore.
Also look at whether you can use atomics. Atomics are better than lock guarded things in almost every way. The only time locks beat atomics is when you are updating a bunch of things in which case it may be faster to lock once than to update 10 atomic variables.
Message passing can also win sometimes over concurrent access even if it involves more copying. Profile and think.
Locks are mandatory when you're updating data that does not fit into a single atomic type. You cannot use N atomics to create atomic update of a data structure that must be semantically atomic across its N members.
But consider: the generation counter effectively acts as a lock that locks out reader threads while the update takes place. It also serves to lock out other writer threads.
So yes, one can call that "lock-free" if one really wants to, but I find it more honest to just admit that it's a custom lock with custom properties.
Edit: with double buffering I guess it would be lock free, but still not wait free as a fast producer can still starve consumers indefinitely.
With double buffering the reader can always fall back on the other buffer (hence it is lock-free, assuming somehow that writers always agree on the current write buffer), but then a fast writer can always prevent a slow reader from reading a consistent timestamp in a row which means it is not wait free.
Performance issues -> it's lock-free and scales profoundly better than non-lock-free.
It never took off. I think it didn't scale.
A lot of the "difficult part" of the barrier is the synchronization / memory-barrier work (not the same as a spin-lock barrier. Ugggh, we need someone to revamp the language of low-level multiprocessor programming).
Anyway, the memory-barrier part is "solved" by using atomics and then handwaving all the loads/stores as sequentially-consistent. Yeah, that works, but it leads to subpar performance on all systems: x86 has total-store ordering (weaker than sequential consistency), so x86 compilers would probably need to emit a memory barrier.
ARM / POWER8/POWER9/POWER10 systems are "even weaker", so they need an even more powerful barrier to be emitted by the compiler.
There's also the "hyperthread yield" instruction that should be placed somewhere on POWER/x86 systems. (Most ARM systems don't have hyperthreads/SMT, but the ARM E2 does...). Without "hyperthread yield", the pipeline is going to be uselessly full of load/store instructions from this tight inner loop.
EDIT: I had to search for it, but I now know the name of the instruction on x86. "PAUSE" is the hyperthread-yield instruction. https://www.felixcloutier.com/x86/pause
You may be "livelock starving" your brother-hyperthread because your current thread is using up all of the pipeline / decoding bandwidth. By using "hyperthread yield", your code will use the minimum amount of decoding bandwidth.
The article does talk about memory barriers, but seems to be missing the discussion on hyperthread yield. So it's probably not "as efficient" as it could get yet.
I can believe that the implementation works as discussed in the blogpost. But it's still incomplete: a faster implementation probably exists, maybe not with "fully relaxed" atomics, but probably with acquire/release atomics (which are free on x86, and significantly faster on ARM/POWER9 than seq-cst atomics).
Note: ARM8+ and POWER9+ now have load-acquire and store-release instructions, serving as an intermediate performance between fully relaxed atomics and seq-cst atomics. (These instructions may have existed earlier, but I know for sure they're in the modern version of these chips).
Once acquire/release barriers are figured out in the right location, the use of hyperthread-yield on x86 and POWER systems would be best on highly-loaded systems. I'm not sure how to test the efficacy / throughput of hyperthread-yield however, I only know of its theoretical benefits. Coming up with a good test (to make sure we're actually improving performance) is itself a difficult question.
TL;DR: Good first-pass implementation, and good discussion to get things started. But there's still a long way to go before I'd consider it a top-end implementation! For many projects, what is seen here is probably sufficient (especially professional projects where you need to move on to some other programming task). But a hobby project could keep optimizing a bit longer to squeeze out a few more cycles of efficiency.
Coming up with a test to ensure the acquire-release is properly done is also somewhat difficult, and would require ARM / POWER hardware.
See the Laws of Orders paper.
without using asymmetric synchronization, which is not appropriate for a general purpose barrier.
1. Seq-cst -- Strongest/slowest. (Aka: Java volatile)
2. Total Store order (aka: x86)
3. Acq-Release (aka: newer-ARM / newer-POWER)
4. Acq-Consume (aka: old-ARM / old-Power)
5. Relaxed -- Weakest/fastest, DEC Alpha.
That is to say, acquire-release is often compiled into no-op on x86, because x86 Total-store ordering is "stronger" than acquire-release.
Acquire-Consume was thought to be sufficient, but C++11 debate argued otherwise, and designed a stronger Acquire-release model. ARM/POWER designers then implemented acquire-release model.
I understand that things aren't perfectly linear, and that at the lower level the discussion is about read/read, read/write, write/read, and write/write barriers. But my brain has never really grok'd this level of detail. (Instead, I think in terms of the simplified list above). So please correct me if I'm wrong with the above assumptions.
ARM and POWER do allow for independent reads of independent writes (or maybe only POWER), so in principle an acq+rel fence could be cheaper than a full seq-cst one, although I don't know if it is in practice.
This summarises (https://community.arm.com/arm-community-blogs/b/architecture...):
"The Armv8.0 support for release consistency is based around the “RCsc” (Release Consistency sequentially consistent) model described by Adve & Gharachorloo in , where the Acquire/Release instructions follow a sequentially consistent order with respect to each other. This is well aligned to the requirements of the C++11/C11 memory_order_seq_cst, which is the default ordering of atomics in C++11/C11."
"Instructions are added as part of Armv8.3-A to support the weaker RCpc (Release Consistent processor consistent) model where it is permissible that a Store-Release followed by a Load-Acquire to a different address can be re-ordered. This model is supported by the use of memory_order_release/ memory_order_acquire /memory_order_acqrel in C++11/C11."
1. PAUSE should only be executed inside of the spinlock's spin loop.
2. Fast-path won't touch the "PAUSE" instruction. (Ex: lock is unlocked. So you grab the lock, bypass the while loop and carry on).
3. Slow-path is... well... the slow-path. You'll be spinning in there for thousands of cycles, waiting for "the other thread". Saving power by effectively "sleeping" a bit longer isn't a big issue.
I could very well assume that Intel did the measurements, and that PAUSE with a 100+ cycle latency gave higher performance / throughput than their original PAUSE implementation. It is a "slow-path" instruction, after all.
If the spin-lock has any contention what-so-ever, you're looking at 100+ cycles to even communicate to L3 cache (~40-nanoseconds == 160 cycles to talk to L3 cache at all).
It makes no sense to be "any faster" than this, because L3 cache hasn't changed yet. You might as well sleep until the next point-in-time where L3 cache had a chance to update (aka: the value of the spinlock has changed).
L2 cache and L1 cache are "private" to each core, so it's impossible for another core to change the values in L1 or L2. So the "closest" the spinlock's value can be is L3 cache. The only exception I can think of... is if both threads are SMT/hyperthreads on the same core. This case has probably gotten slower, but all other cases have probably gotten faster (or more precisely: higher throughput).