This seems very similar to the centralized counting barrier described in [1] where each thread, upon hitting the barrier, decrements the count and spins until a global "sense" boolean is reversed. The last thread to hit the barrier, resets the counter and flips the "sense" freeing up all other threads. I guess the difference is this one treats the bool as odd/even values of an integer, but does that work for n_threads > 2?
These boolean arrangements ("flipping") fail when the number of watchers is high and the interval is small. The toggle can switch twice while you're queued to be run again, and then all bets are off, because it looks to you like nothing has happened. Your messaging rate can only be half of the frequency guaranteed by the kernel for waking every thread. And many OSes offer no ironclad guarantee.
How? Only one thread, the last one when the counter is 0, flips the switch. Before the switch is flipped, the counter is reset, so the next flip will only happen when every thread once again hits the next barrier and the counter decrements to 0.
State observed by thread 1: counter = 5, state = blue;
State observed by threads 2-4: counter = [4,5,1,2,3,4,5,6], state = [blue, green, blue]
Thread 1 wakes up, sees counter = 6, state = blue;
Logical conclusion: system state is counter++, state unchanged.
Actual scenario: counter+=6, state+=2
I believe this was first covered in my distributed computing class, but was probably most concretely illustrated in the case of concurrent garbage collection. Any unfairness in the thread scheduler can lead to some threads getting way more clock cycles than others. In low frequency scenarios you can convince yourself it's not a problem, only to be punished by success into finding out you're wrong when you have 10,000 active users instead of 10.
This is especially true with spinlocks, because it's a greedy algorithm that trades fairness for responsiveness. It can result in low throughput or even priority inversion. They are a very sharp knife. Indispensable for very specific tasks, certain bloodshed for others.
The counting barrier in the paper I linked is meant specifically for shared memory multiprocessors, not distributed systems, so what you're saying makes more sense. Both the counter and the state are shared variables across all threads. This clearly doesn't scale well with number of threads/processors due to cache contention, but the idea is to have a "simple" barrier. There are a few barrier implementations that operate via message passing further in the paper.
But either way, the counter can never be 6 since the max counter is set to the number of threads, and each thread at the barrier decrements the counter. Once a thread is spinning, it doesn't care about the counter anymore, only the state. Once the last thread reaches the barrier, the state flips, and all threads awaken. Each thread maintains a local copy of what the "reversed state" should be to awaken. If one thread gets preference and advances to the next barrier, it will decrement the counter it reset and will spin for the global state to flip again. So even if it reaches the second barrier before any of the other threads awaken from the first, it will wait for the rest of the threads to catch up.
The differences between multiprocessing and IPC are smaller than you think, and they have a habit of disappearing and coming back in different hardware refresh cycles.
Multicore systems have hardware support for some consensus algorithms. Speed of light is practically a rounding error, but not always. And clock synchronization is... I'll use the word attenuated, since I hesitate to call it 'solved' or 'avoided'.
I may be wrong, but I don't see this working in theory.
The basic problem is that for any physical core, there is no guarantee that any writes will ever be seen by any other core (unless the necessary extra magic is done to make this so).
(There's a matching problem when reading, too, in that in theory writes made by another core can be never seen by the reading core, even if they have reached the domain covered by cache coherency, unless the necessary extra magic, etc.)
So here with this code, in theory, the writes being made simply are never seen by another core, and so it doesn't work.
As it is, in practise, writes typically make their way to other physical cores in a reasonably short time, whether or not the extra magic is performed; but this is not guaranteed. Also of course the ordering of the emergence of those writes is not guaranteed (although on some platforms, additional guarantees are provided - Intel is particular generous in this regard, and performance suffers because of it).
> The basic problem is that for any physical core, there is no guarantee that any writes will ever be seen by any other core (unless the necessary extra magic is done to make this so).
They're using atomic read/writes with sequential-consistency.
This means that the compiler will automatically put memory-barriers in the appropriate locations to guarantee sequential-consistency, but probably at a significant performance cost.
The implementation is likely correct, but its the slowest performance available (seq-cst is the simplest / most braindead atomic operation, but slowest because you probably don't need all those memory barriers).
I'd expect that Acq-release style barriers is possible for this use case, which would be faster on x86, POWER, and ARM systems.
> They're using atomic read/writes with sequential-consistency.
In the C11 code? I'm not up to speed with that version of the spec, but unless there is behaviour invoked by the use of _Atomic, the code looks to me to be performing normal, non-atomic read/writes.
Another reply says there are full barriers being automatically used, but that doesn't address the problem I've described.
> but unless there is behaviour invoked by the use of _Atomic
That is the behavior invoked by "atomic".
The general gist is that "manual" read/write barriers were too difficult to think about and led to bugs. Instead of having the programmer put the read/write barriers in the correct locations, modern compilers put a type-specification on variables... and then its the compiler's job to put the barriers in the correct location.
This turned out to be necessary, because the compiler's optimizer kept rearranging code (!!!) around memory barriers anyway. So the "compiler" needs to participate in any of the barriers, especially at higher levels of optimization (-O3 rearranges a lot of code). If the compiler needs to participate, you might as well have the compiler handle the placement of those barriers.
"Atomic" variables by default will be sequentially-consistent (ie: the _maximum_ use of memory barriers for _maximum_ levels of consistency).
You're talking here about memory barriers, but I was asking if atomic writes are in use.
Atomic writes have nothing to do with memory barriers. On modern platforms atomic writes are available without memory barriers of any kind.
Memory barriers do not cause events to become visible.
Only atomic writes do this; and so the question is not whether barriers are in use due to _Atomic, but whether atomic writes are in use due to the use of _Atomic.
_Atomic types can be read-modified-written atomically (if using a compound-assignment operator), or the postincrement/pre-increment operations.
In these circumstances, the memory-barriers used for those atomic-operations will be of the sequentially-consistent memory model.
------
So yes, that "++*barrier" operation is read-modify-write atomically AND with memory-barriers in sequential-consistency style
-----
I don't believe that "atomics" are enough to solve this problem. At a minimum, I expect acq-consume to be required (not that acq-consume model even works in practice... so acq-release for today's compilers). That's why I'm still talking memory model / fences here.
Not only is an atomic operation required, but so too is the "publishing" of this information needing to happen in the proper order. Fortunately, C11's _Atomic implementation solves both issues.
> Built-in increment and decrement operators and compound assignment are read-modify-write atomic operations with total sequentially consistent ordering (as if using memory_order_seq_cst). If less strict synchronization semantics are desired, the standard library functions may be used instead.
This is true in a strict sense, but essentially no concurrent code works if you don't assume that writes become eventually visible to all other threads.
Consider something basic like implementing a mutex: once a thread unlocks a mutex, it is possible in theory that no other thread _ever_ sees the mutex as unlocked, and spin on it forever, but such a system would be useless.
In practice, general purpose CPUs have writes become visible as fast as possible.
You won't find this behavior defined strictly in memory models, because how can you? These talk only about ordering and eschew things like "global clocks" and even if you had such a clock, how are you going to put a numerical bound on the delay?
Most will leave it as a quality of implementation issue and specify that writes become visible "eventually" if they say anything at all.
Barriers only influence the order in which events become visible if and when they DO become visible.
Barriers say nothing about whether or not events actually do become visible; and it may be that they never do (if the necessary magic is not performed, etc).
You mention an atomic increment. I don't see atomic increments this in the C11 code, but I'm not up to speed with that version of the specification.
(Note that an atomic increment will solve the write-side problem, but not the read-side problem. The reader must still use a read barrier; but if there's an implicit full barrier, then that's being done.)
> The fact that another thread sees the incremented value implies that that thread will see all stores that happened-before that incremented value.
Firstly, if the writer has used no barriers, writes can emerge in any order (and may never emerge at all, anyway, even if barriers were used). Secondly, if the reader has not used barriers, then even if writes have been written in some given order, the reader will by no means observe that order.
> A fence or atomic operation guarantee that a store will be visible globally in a finite amount of time
A fence does NOT guarantee this. A fence controls and only only controls order of events. It has NO influence on whether events actually emerge. An atomic operation does, but this only solves the write-side of the problem; the reader still needs to issue a read barrier to guarantee seeing atomically written values.
An atomic write will need to be issued, and on the face of it, that C11 code is not issuing an atomic write. I could be wrong, but I would expect a particular function or macro to be used for an atomic write, and when it is not, you have normal writes.
> So all operations on _Atomic variables in C11 are implicitly atomic
I may be wrong, but I would be shocked if this was so.
I think you may be mixing up non-tearing writes as opposed to atomic writes, which are performed using either cache-line locking (Intel) or exclusive reservation granules (everyone else) and as such require the employment of a specific mechanism to achieve (typically a macro or function, which leads to the use of the particular special instructions for atomic writes).
The overhead of an atomic write is very large, and it is often not needed. It would make no sense for every write (to an _Atomic type) to be atomic.
The standard does guarantee that all the operations on _Atomic variables are indeed atomic and, by default, sequentially consistent. Standard library functions like atomic_load can be used to specify more relaxed ordering.
As per intel docs, all aligned stores and loads are atomic with additional release and acquire semantics. XCHG (which is somewhat expensive) is used for SC stores, but plain loads still suffice for SC loads.
On other architectures load and stores are usually at least atomic, although with only relaxed ordering semantics.
Normally non-tearing + cache coherence is all that is required for relaxed atomic load/stores. You are confusing with general RMW that require specialized instructions on intel or ll/sc on RISCs (although ARM did add a bunch of specialized atomic instructions as well). To be pedantic, as far as I know, Intel doesn't lock the cacheline in any special way during an atomic RMW, it simply delays the read until all preceding stores have been flushed from the store buffer, then, if it has successfully acquired the line in exclusive mode, executes the load+store within whatever minimum exclusive cache hold period guaranteed by the coherence protocol. Acquiring a cacheline in exclusive mode is not specific to atomic RMWs but applies to any store and it is not really a lock as it can be taken away at any moment (i.e. the cc arbiter guarantees forward progress of the system as whole).
It's a nice write up. I did this once with a slight modification. The phase really only needs one bit, and an arbitrary number of threads can be supported in a number of ways by having the last thread to arrive at the barrier clean up the state based on a second word that holds the number of threads.
For example, compare the thread counter part of the "v" value to the number of threads. If equal, do an atomic store which writes the opposite phase with a thread count of 0. If not equal, spin until the phase bit changes.
This approach avoids a division and can support threads that leave the barrier entirely. (And theoretically also entering the barrier, but that's always dicey regardless of the barrier implementation.)
Now it is a lot larger and CAS based, rather than an unconditional increment, but that was to support the extra feature of having the "breaker" run a function and return the value, which I needed in this context, as well as reporting the number of spins, etc. The earlier iterations had an unconditional increment.
The idea is substantially the same as Chris' and uses only 1 word of state (though my implementation stores the n, the break count, as a member, you could pass it in as Chris does): but it uses the sign bit rather than the (log(n) + 1)th bit to indicate the phase: the count goes negative when in the "draining" phase after the barrier is broken. It supports arbitrary break counts without division and the fast path is comparable to the one in the post, though "fast path" is kind of meaningless when you expect to spin most of the time (as long as you don't invoke the scheduler, at least).
That said, Chris's bit manipulation approach is very compact and elegant!
I think the Disruptor is probably still the most ideal approach for serializing multiple producers on an x86 platform. The core idea is to decouple the producer and consumer so that they do not contend over the same CAS variable (position in a ring buffer). If you only have a single producer, the CAS goes entirely uncontended and you will get tens to hundreds of millions of items per second without issue. Even with multiple producers, the figures are fairly incredible.
The trade off is that you basically have to sacrifice an entire core to busy-waiting if you want the world-class latency performance. Alternatives can yield or wait on conditions, but clearly would be slower.
> I don't see how you can use the Disruptor as a barrier
I am not advocating for this. I am simply demonstrating how the Disruptor leverages these kinds of techniques (memory barriers, et. al.) to provide a high performance outcome for its intended use cases.
I would be exhilarated if someone could demonstrate an even faster way to do what the Disruptor does (for x86 platforms).
If you need a total message order and many-to-all messaging it is indeed hard to do better than the Disruptor, but that's kind of a very specialized need. If you can relax the message ordering for example, multiple 1-to-all queues can be significantly more efficient.
I somewhat disagree with this. I have been able to build a lot of business systems that have nothing to do with fintech on top of the exact same abstraction. Order matching engines are not the only thing that can benefit from being able to deal with millions of serialized events per second. Most video games, especially multiplayer ones, can benefit from this style of abstraction too.
Do you really need a total ordering of messages instead of the partial ordering that separate queues could give you? Also do you really need broadcasting or point-to-point or anycasting would be more appropriate? That's what I mean by specialized.
> Do you really need a total ordering of messages instead of the partial ordering that separate queues could give you?
If the consequences of any given event may impact the processing of any subsequent event, then yes. For many ordinary systems this constraint holds true.
Be careful with spin locks. We used to have one in the ZeroTier core. We pulled it due to a crazy insane bug where it would hang on single core machines and another one where it would run away on CPU usage on massively multicore machines. These bugs had the same cause. Think about it.
Don't use spin locks unless you have profiled and have a clear understanding of the contention factors involved. Also be sure to test on single core and massively multicore.
Also look at whether you can use atomics. Atomics are better than lock guarded things in almost every way. The only time locks beat atomics is when you are updating a bunch of things in which case it may be faster to lock once than to update 10 atomic variables.
Message passing can also win sometimes over concurrent access even if it involves more copying. Profile and think.
> The only time locks beat atomics is when you are updating a bunch of things in which case it may be faster to lock once than to update 10 atomic variables.
Locks are mandatory when you're updating data that does not fit into a single atomic type. You cannot use N atomics to create atomic update of a data structure that must be semantically atomic across its N members.
Technically any algorithm can be implemented in a lock free way and there are n-way CAS constructs that indeed allow updating N memory locations atomically. In practice they are very slow.
You can use RCU to do this, but I'd argue that everything else ends up being more or less a lock by another name (maybe you could do read/write where the read is optimistic, but at least the write side will have some sort of lock). I'm curious what you had in mind?
You can do this sort of thing with write-ordering a "generation counter" that is updated before and after writing a bunch of other data elements. The problem is that it makes the read slow - you have to use a CAS-style loop even to read anything (you must check that the generation counter has the same value before and after reading the other data). It also adds more memory barriers than a simple lock.
This is the mechanism that I had in mind. It's actually a really good one if you have lots of concurrent reads and few writes because there isn't going to be any cacheline ping pong. With a more traditional read-write lock, even the readers write to the cacheline holding the lock, and that's a shared cacheline so it's going to ping pong around. By contrast, the generation counter is read-only to readers.
But consider: the generation counter effectively acts as a lock that locks out reader threads while the update takes place. It also serves to lock out other writer threads.
So yes, one can call that "lock-free" if one really wants to, but I find it more honest to just admit that it's a custom lock with custom properties.
It's not lock-free, it's wait-free and therein lies the problem. The reader thread(s) are not locked out - they are free to continue running, endlessly looping waiting for the moment when they can consistently read the pseudo-atomic data. That's quite different behavior than if they used a lock, which they would find already held, and they would sleep.
But it is not wait-free, not even non blocking; if I understand your model correctly, a writer blocked in the right moment will prevent a reader to make progress exactly as you described.
Edit: with double buffering I guess it would be lock free, but still not wait free as a fast producer can still starve consumers indefinitely.
Neither writer nor reader needs to block. The writer does CAS on the generation counter to ensure it is safe against other writer threads. The reader just spins until it reads the same generation counter value before and after the read.
What happens if a writer halts in the middle of a write? How the reader can ever complete a consistent read?
With double buffering the reader can always fall back on the other buffer (hence it is lock-free, assuming somehow that writers always agree on the current write buffer), but then a fast writer can always prevent a slow reader from reading a consistent timestamp in a row which means it is not wait free.
If I'm understanding it correctly, you are describing a seq lock, which can be implemented with #StoreStore barriers on the writer side and #LoadLoad on the reader side. Topically these are significantly cheaper than the #StoreLoad required for a lock acquisition.
I misplaced my copy of The Art Of Multiprocessor Programming with the details, but with a simple CAS primitive you can build a general N-CAS algorithm and from that implement any algorithm ina a truly wait free way. It will be slow and won't scale, but will indeed be wait free
A lot of the "difficult parts" of the barrier is the synchronization / memory-barrier part (not the same as a spin-lock barrier. Ugggh, we need someone to revamp the language of low-level multiprocessor programming).
Anyway, the memory-barrier part is "solved" by using atomics and then handwaving all the loads/stores as sequentially-consistent. Yeah, that works, but would lead to subpar performance on all systems, as x86 has total-store ordering (less powerful than sequentially consistent), so x86-compilers would probably need to emit a memory barrier.
ARM / POWER8/POWER9/POWER10 systems are "even weaker", so they need an even more powerful barrier to be emitted by the compiler.
-------
There's also the "Hyperthread yield" instruction that should be placed somehwere in POWER/x86 systems. (Most ARM systems largely don't have hyperthreads/SMT, but the ARM E2 does...). Without "hyperthread yield", the pipeline is going to be uselessly full of load/store instructions from this tight inner-loop.
EDIT: I had to search for it, but I now know the name of the instruction on x86. "PAUSE" is the hyperthread-yield instruction. https://www.felixcloutier.com/x86/pause
You may be "livelock starving" your brother-hyperthread because your current thread is using up all of the pipeline / decoding bandwidth. By using "hyperthread yield", your code will use the minimum amount of decoding bandwidth.
-----
The article does talk about memory-barriers, but seems to be missing the discussion on hyperthread yield. So its probably not "as efficient" as it could get yet.
I can believe that the implementation works as discussed in the blogpost. But its still yet incomplete: a faster implementation probably exists, maybe not with "Fully Relaxed" atomics, but probably with acquire/release atomics (which are free on x86, and significantly faster on ARM/POWER9 than seq-cst atomics).
Note: ARM8+ and POWER9+ now have load-acquire and store-release instructions, serving as an intermediate performance between fully relaxed atomics and seq-cst atomics. (These instructions may have existed earlier, but I know for sure they're in the modern version of these chips).
Once acquire/release barriers are figured out in the right location, the use of hyperthread-yield on x86 and POWER systems would be best on highly-loaded systems. I'm not sure how to test the efficacy / throughput of hyperthread-yield however, I only know of its theoretical benefits. Coming up with a good test (to make sure we're actually improving performance) is itself a difficult question.
TL;DR: Good first pass implementation, and good discussion to get things started. But there's still a long way before I consider it a top-end implementation! For many projects, what is seen here is probably sufficient (especially professional projects where you need to move onto some other programming task). But a hobby project could keep optimizing a bit longer to squeeze out a few more cycles of efficiency here.
Coming up with a test to ensure the acquire-release is properly done is also somewhat difficult, and would require ARM / POWER hardware.
It seems to me that a barrier primitive needs at the very least acq+rel semantics [1], which means that TSO implicit acquire and release are neither necessary nor sufficient[2]. In practice on x86 this means an atomic RMW which is seq-cst. I'm not familiar enough with ARM/Power to know if acq+rel is significantly faster than seq cst.
[1] See the Laws of Orders paper.
[2] without using asymmetric synchronization which is not appropriate for a general purpose barrier.
That is to say, acquire-release is often compiled into no-op on x86, because x86 Total-store ordering is "stronger" than acquire-release.
Acquire-Consume was thought to be sufficient, but C++11 debate argued otherwise, and designed a stronger Acquire-release model. ARM/POWER designers then implemented acquire-release model.
--------
I understand that things aren't perfectly linear, and that at the lower level the discussion is about read/read, read/write, write/read, and write/write barriers. But my brain has never really grok'd this level of detail. (Instead, I think in terms of the simplified list above). So please correct me if I'm wrong with the above assumptions.
An acquire+release fence is equivalent to two indivisible acquire and release fences but can't be implemented by simply issuing the two fences in isolation, as they would be unordered with each other. x86 has no native acq+relase fence and it can't be simulated, for example, with a load plus a store (i.e. separate acquire and release fences), but you need an mfence or atomic rmw.
ARM and power do allow for indipendent reads of indipendent writes (or maybe only power) so in principle an acq+rel fence could be cheaper than a full seqctst one, although I don't know if it is in practice.
"The Armv8.0 support for release consistency is based around the “RCsc” (Release Consistency sequentially consistent) model described by Adve & Gharacholoo in [1], where the Acquire/Release instructions follow a sequentially consistent order with respect to each other. This is well aligned to the requirements of the C++11/C11 memory_order_seq_cst, which is the default ordering of atomics in C++11/C11."
"Instructions are added as part of Armv8.3-A to support the weaker RCpc (Release Consistent processor consistent) model where it is permissible that a Store-Release followed by a Load-Acquire to a different address can be re-ordered. This model is supported by the use of memory_order_release/ memory_order_acquire /memory_order_acqrel in C++11/C11."
right, but the point is, does ARM have acq-rel RMWs that are not sequentially consistent? I guess you could build one with separate acq/rel load and stores and ll/sc. edit: or not, as the store and load will still be unordered with each other and the ll/sc does nothing here.
1. PAUSE should only be executed inside of the spinlock.
2. Fast-path won't touch the "PAUSE" instruction. (Ex: lock is unlocked. So you grab the lock, bypass the while loop and carry on).
3. Slow-path is... well... the slow-path. You'll be spinning in there for thousands of cycles, waiting for "the other thread". Saving power by effectively "sleeping" a bit longer isn't a big issue.
-----
I very well could assume that Intel did the measurements, and that PAUSE with 100+ delay of latency was higher performance / throughput than their original PAUSE implementation. It is a "slow-path" instruction after all.
If the spin-lock has any contention what-so-ever, you're looking at 100+ cycles to even communicate to L3 cache (~40-nanoseconds == 160 cycles to talk to L3 cache at all).
It makes no sense to be "any faster" than this, because L3 cache hasn't changed yet. You might as well sleep until the next point-in-time where L3 cache had a chance to update (aka: the value of the spinlock has changed).
L2 cache and L1 cache are "private" to each core, so its impossible for another core to change the values in L1 or L2. So the "closest" the spinlock's value can be is L3 cache. The only exception I can think of... is if both threads are SMT/Hyperthreads on the same core. This case has probably gotten slower, but all other cases have probably gotten faster (or more precisely: higher throughput).
[1]http://web.mit.edu/6.173/www/currentsemester/readings/R06-sc...