A compare-and-swap loop is not a spinlock. (It is the primitive used to implement a spinlock, but that's different.)
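To make the distinction concrete, here's a minimal C++ sketch (names and types are just illustrative): a CAS loop retries an update and the system as a whole always makes progress, while a spinlock built from the same CAS primitive makes the loser wait for the holder:

```cpp
#include <atomic>

// A CAS loop: retries an update until it lands. A failed CAS means some
// other thread succeeded, so system-wide progress is always being made.
int fetch_double(std::atomic<int>& v) {
    int old = v.load();
    // On failure, compare_exchange_weak reloads 'old' with the current
    // value, and we simply retry with the freshly computed old * 2.
    while (!v.compare_exchange_weak(old, old * 2)) {}
    return old;  // value before doubling, like fetch_add returns
}

// A spinlock *built from* the same primitive: a thread that loses the CAS
// makes no progress at all until the holder releases the lock.
struct SpinLock {
    std::atomic<bool> locked{false};
    void lock() {
        bool expected = false;
        while (!locked.compare_exchange_weak(expected, true))
            expected = false;  // reset; compare_exchange wrote 'true' into it
    }
    void unlock() { locked.store(false); }
};
```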
Pure spinlocks are almost always a bad idea, at least in userland, because threads spinning trying to acquire the lock look "busy" to the scheduler, so the scheduler may run them instead of the thread that owns the lock. It does make sense to spin a limited number of times before going through the slow path to acquire a mutex, but most OS mutex implementations have that functionality built in, so there's usually no need to do that manually.
True, but they might still be appropriate for very tiny critical sections (a handful of instructions) protected by finely distributed locks, where the probability of both contention and preemption is very low.
Also, some applications have the luxury of exclusively dedicating one thread per core.
To save power and thermals (and enable higher frequencies for other cores doing useful work), if the latency requirements and CPU architecture allow it, be nice and put the spinning CPU in a lower power state.
Like x86 PAUSE or MONITOR/MWAIT instructions.
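For example, a spin-wait loop on x86 typically drops a PAUSE into its body. A sketch (the `cpu_relax` wrapper is my own name, not a standard API):

```cpp
#include <atomic>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
  #include <immintrin.h>
  static inline void cpu_relax() { _mm_pause(); }  // x86 PAUSE (REP NOP)
#else
  static inline void cpu_relax() {}  // no-op fallback on other ISAs
#endif

// Spin-wait that hints the CPU it is busy-waiting: PAUSE throttles
// speculation in the loop, saving power and freeing execution resources
// for a sibling hyperthread.
void spin_until_set(std::atomic<bool>& flag) {
    while (!flag.load(std::memory_order_acquire))
        cpu_relax();
}
```

MONITOR/MWAIT go further (the core can actually sleep until the cache line is written), but as noted below, those require ring 0.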
> Does this mean all spinlock operations are pinned to the low-power core?
That wouldn't make sense. It's more for cases where, for example, a core is dedicated to operating on some data the moment the spinlock is released.
> If yes, how is this usually supported by OS/language runtimes?
That really depends, as MONITOR and MWAIT require ring 0. That means executing those instructions in userland always causes an exception. (Yeah, I develop kernel drivers... :-))
PAUSE works at any ring level, so it works fine in usermode code. It's encoded simply as "REP NOP".
> And as soon as the spinlock operations succeed, the work will be moved out of the low-power core, which would involve the usual context switch costs. So I'm assuming this strategy would need a lot of workload-specific benchmarking before you decide to use it.
If you're doing this, you're actively avoiding context switches. That's the whole point of this kind of spinning. The kernel is not involved for usermode code.
Off topic, but I remember reading that the Linux kernel prefers spinlocks to mutexes. Is there a good technical reason for that?
Think about what a true mutex does. The true mutex switches into kernel mode (i.e. your program is no longer running; Linux is running). That means a Spectre/Meltdown guard is executed (your TLB may be flushed, along with various other memory guards to prevent Spectre from leaking data).
Once the guards are executed, kernel mode has to find more work to do. Traversing the kernel data structures can take 1 to 10 microseconds, depending on how cold the cache is. Finally, since another thread may be running (before your thread comes back), you probably lost all your data from L1 cache (and at minimum, your branch-predictor state, because of Spectre).
A spinlock without any contention takes less than 10 nanoseconds to run, maybe 50 nanoseconds with a bit of contention (!!). You're basically reading/writing data in L1 cache, maybe L3 cache under contention.
However, a scheduler invocation will be on the order of 5000 nanoseconds (~5 microseconds) or so, due to all of the work the scheduler has to do.
Windows' default spin count is on the order of 4000 cycles. Spinning for 4000 cycles (or less) is an advantage for a spinlock-like methodology. (4000 cycles at 4 GHz == 1 microsecond.) Just to give you an idea of the speed magnitudes being discussed here.
There are also tools where you can tell a mutex to spin only if the current lock holder is actively running (as opposed to blocked or preempted), to get the best of both worlds.
Have you ever waited at a door for someone, and after a long enough time, sat down on the floor? It's basically the same logic.
First, the "slow path" for acquiring a true mutex involves calling into the kernel, which is relatively expensive. At first, you hope that your waiting time will be very small, so you optimistically spin in user space waiting for the lock to be released. But this spinning is expensive: you're consuming the CPU doing nothing productive. Eventually, there's a basic principle to consider: the longer you have waited, the more likely you are to wait even longer. After a long enough wait, you assume you're going to wait even longer, so it's worth paying the cost of calling into the kernel, so that someone else with productive work to do can do it while you're waiting.
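A rough sketch of that spin-then-block logic in C++ (the spin count is an assumed placeholder; real implementations tune it and usually talk to futexes directly rather than wrapping std::mutex):

```cpp
#include <atomic>
#include <mutex>

// Hybrid "adaptive" lock sketch: optimistically spin a bounded number of
// times hoping the holder releases soon, then give up and block in the
// kernel. The constant 4000 is an illustrative guess, not a tuned value.
class HybridLock {
    std::mutex m_;
public:
    void lock() {
        // Fast path: bounded spinning avoids the kernel round trip
        // entirely when the critical section is short.
        for (int i = 0; i < 4000; ++i)
            if (m_.try_lock()) return;
        // Slow path: we've waited long enough that we'll probably wait
        // longer, so pay the syscall cost and let the scheduler run
        // someone productive.
        m_.lock();
    }
    void unlock() { m_.unlock(); }
};
```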
Kernels will use spinlocks in various places because the kernel implementors know that the critical sections will be short, or they know that scheduling another task over the current one would be problematic.
Atomics are the easy part to understand. Memory fences and memory ordering are the hard part that needs far more discussion. If you use atomics with the wrong memory fence, everything will break.
Memory fences are necessary to make the compiler, CPU, and caches put the data into main-memory in the correct order.
Do NOT write Atomics unless you understand fences. It is absolutely essential, even on x86, to put memory fences in the correct spot. Even though x86 automatically has acquire/release consistency... the compiler may move your data ordering to the wrong location. Even with "volatile" semantics, your data may be committed to memory in the wrong order.
Fedor Pikus has an amusing example where you can have a properly coded lock but the compiler may still mess you up: https://youtu.be/lVBvHbJsg5Y?t=811
EDIT: if you use locks (spinlocks, mutexes, condition variables, etc.), the author of the library will have put the memory fences in the correct places for you. Only if you do synchronization "manually" (and... you probably are doing that if you are writing atomics) do you have to think about memory fences.
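To make the fence discussion concrete, here is the classic message-passing pattern with C++11 atomics. With release/acquire, the consumer is guaranteed to see the payload; swap in memory_order_relaxed on both sides and the compiler or CPU could legally reorder things so the consumer reads a stale payload:

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // (A) write the data
    // (B) release store: nothing before B may be moved after it,
    // so (A) is visible to anyone who observes ready == true.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // (C) acquire load: nothing after C may be moved before it.
    while (!ready.load(std::memory_order_acquire)) {}
    return payload;  // guaranteed 42: (A) happens-before this read
}
```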
Not every article has to be an exhaustive description of everything you need to know.
Anywho, sequential consistency is the default for C++11 and C11 atomics, so you can pretty much use them without a complete understanding of memory consistency models.
Still, C11 and C++11 atomics aren't widely implemented in the GPU world yet. OpenCL 1.2 doesn't have them (and OpenCL 2.0 is not commonly implemented), CUDA doesn't have them yet, and AMD ROCm doesn't have them yet.
And anyone with older compilers will probably be working with fence intrinsics, instead of "innate" barriers per atomic instruction.
I was thinking about adding atomic pointer references to a partially persistent (1) AABB tree, then allowing multiple actors in multiple threads each implementing a gameloop to query the AABB tree. Do I need to think about memory fences in that application?
I've also been thinking about implementing the above system without atomic pointers. Instead, the system would have an update phase where all writes to the AABB tree happen at once, then all of the actors are allowed to run concurrently for one game loop tick.
In both cases, I'm thinking of placing the AABB tree into shared memory, but my brief experience with using shared memory makes me wary of performance problems.
(1) Persistent in this sense, of course: https://en.wikipedia.org/wiki/Persistent_data_structure
Let me give an attempt at what a memory-fence actually is, and why I think you may need to think about memory fences.
Atomics solve one PART of the synchronization problem. The second problem, which is far more complex, is that the compiler, CPU, and Memory / Cache system reorders your reads-and-writes in a non-obvious way. These reorderings can lead to obscure bugs.
These reorderings happen because of single-threaded optimization. Programmers expect the compiler, CPU, and L1 cache to reorder statements to go as fast as possible. Ever wonder what the -O2 and -O3 compiler flags do? They'll eliminate variables and change your code in subtle ways.
Even if you run at -O0 optimizations, these conceptual reorderings exist at the CPU (store buffer) and L1 cache levels. So you cannot escape this unfortunate, complex reality.
Therefore: it is the programmer's responsibility to define a "Happens-before" relationship through memory fences / memory barriers. A "happens-before" relationship will kill optimizations (be it a compiler-optimization, CPU store-buffer optimization, or L1 cache optimization) in just the right way to make your code correct.
With that out of the way, let's go into your AABB tree question.
Assume "node" is a node of a binary AABB tree, with "node->left" and "node->right" referring to its two children. Let's assume both "left" and "right" are atomic<> pointers.
Now think of the code:

    AABB_Node* tmp = malloc(sizeof(AABB_Node));
    tmp->initialize();
    node->left.store(tmp);

The compiler wants to perform variable-elimination on "tmp", which would effectively give you:

    node->left.store(malloc(sizeof(AABB_Node))); // This is a race condition!
    node->left.load()->initialize(); // Other threads can see the node before it's initialized
"Happens-before" is the key concept. What we want to say is:

    AABB_Node* tmp = malloc(sizeof(AABB_Node));
    tmp->initialize();
    // Memory fence: everything above happens-before the store below
    node->left.store(tmp);
The compiler will emit the low-level code needed to make the happens-before relationship hold, using lower-level memory fence mechanisms. GPUs will have to clear their L1 caches (due to incoherent memory), while CPUs will only have to wait for the store buffer to be flushed and MESI messages to be passed around (ARM has the "dmb ish" instruction, for example).
Regardless of the low-level details, the job of the programmer with memory fences remains the same: specifying the important "happens-before" relationships.
In the GPU world, where C++11 atomics do not exist yet, the happens-before relationship is created with a threadfence instruction. In CUDA, you would do:

    AABB_Node* tmp = malloc(sizeof(AABB_Node));
    tmp->initialize();
    __threadfence_block(); // Block-level synchronization
    node->left = tmp;
I'll butcher the explanations a bit (real code doesn't work like this, but it may help you to understand things). EDIT: Oh gosh, I got these backwards again in my first draft. I'm pretty sure I put them in the correct order now.
An "acquire-read" operation basically compiles into:

    register = node->left;
    memory_fence(); // No later reads or writes may be moved above the load
A "release-store" operation compiles into:

    memory_fence(); // No earlier reads or writes may be moved below the store
    node->left = register;
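In C++11 terms, the same pattern applied to the AABB node looks like this (a sketch; AABB_Node's fields here are stand-ins for the real payload):

```cpp
#include <atomic>

struct AABB_Node {
    int data = 0;  // stand-in for the real AABB payload
    std::atomic<AABB_Node*> left{nullptr};
    std::atomic<AABB_Node*> right{nullptr};
};

// Publish a fully initialized child. The release store guarantees that
// every write made before it happens-before any acquire load that
// observes 'tmp' in node->left.
void publish_left(AABB_Node* node, AABB_Node* tmp) {
    node->left.store(tmp, std::memory_order_release);
}

// Pairs with publish_left: if this returns non-null, the child's
// initializing writes are guaranteed to be visible to this thread.
AABB_Node* read_left(AABB_Node* node) {
    return node->left.load(std::memory_order_acquire);
}
```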
If you want scalability, you want to minimize inter-core and inter-socket traffic.
All in all, scaling up contentious operations is hard.
The formal definition IIRC is: a system is lock-free if at least one concurrent process makes progress towards finishing in any unit of time. (There's a harder-to-achieve version called "wait-free", which means every process finishes in a bounded number of its own steps.)
The important property of a lock free system is that pausing any thread/process does not stop the others from progressing; whereas with classical locks (mutex, semaphore, spin, whatever), if you pause a lock-holding process you risk starving the entire system.
With that being said, a thread/process actually getting stopped is rare in high-performance circumstances. In many cases it is easier to avoid getting your thread stopped (as simple as spinning up one thread per logical core) than to switch to a lock-free algorithm.
Besides, your typical OS mutex (CRITICAL_SECTION on Windows) handles the pause situation just fine. The OS will keep track of which threads are waiting on which locks, and restart the threads as resources become available. It's the "known problem", well studied by pretty much all programmers.
Lock-free data structures are exponentially more difficult to write, especially when you are handling "out of memory" edge cases. The general recommendation is to write "mostly lock free", but use a lock for corner-cases.
A CAS loop with pointer swapping gets you lock-free in most cases, and is very easy to understand. The only problem is that it doesn't work for all cases or data structures, so you inevitably end up using locks to finish your program.
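For example, the classic CAS pointer-swap gives you a lock-free stack push in a few lines (a sketch; pop is where it gets hard, due to ABA and memory reclamation, which is exactly where people end up reaching for locks):

```cpp
#include <atomic>

struct Node { int value; Node* next; };
std::atomic<Node*> head{nullptr};

// Lock-free push via CAS pointer swap: snapshot the head, link the new
// node to it, and swap it in. If another thread pushed in the meantime,
// the CAS fails, reloads n->next with the fresh head, and we retry.
void push(int v) {
    Node* n = new Node{v, head.load(std::memory_order_relaxed)};
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // n->next was refreshed with the current head; just loop.
    }
}
```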
EDIT: And it should be noted that locks can be faster. So if you're going for absolute performance, you should write both implementations and measure.
While that's true from a performance perspective, it is not true from a reliability perspective - it's always possible that the thread holding a lock has faulted, was swapped out, or otherwise had a bug (e.g. in a library you called beyond your control) that caused it to stall. Also, if more than one lock is involved, a deadlock is possible as well (not all of which can be detected by an OS, and most OSes suck at detecting deadlocks if they even try).
> The OS will keep track of which threads are waiting on which locks, and restart the threads as resources become available.
But causes for a pause are not necessarily under your control or the OS's purview: you can take a critical section, write to a local file... and wait 20 seconds because the drive has gone to sleep, someone else is competing for the disk, swapping in/out, a recoverable disk error, or the fact that the disk is not actually local but rather a virtualized SAN across the internet.
Also, someone might have paused a thread from the debug API for whatever reason (or kill -STOP on Linux), which might be expected to only pause that thread -- but in a system that isn't lock-free, that could pause anything and everything, in a non-reproducible way.
> Lock-free data structures are exponentially more difficult to write
For sure. Personally, I find that they keep me on my toes "the right way", that is - they force me to consider what and where actually needs synchronization and can be contended (whereas a lock allows me to be lazy. which is sometimes good).
> The only problem is that it doesn't work in all cases or data-structures,
Yes. ABA problems are not solved by CAS, which is why I personally prefer LL/SC, but ... whatever, CAS is harder but still doable.
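One common CAS-side mitigation for ABA, since most mainstream CPUs don't expose LL/SC, is to pack a generation counter next to the value so a stale compare fails even if the raw value has cycled back to what you saw before. An illustrative sketch (names are mine, and a 32/32 split into one 64-bit word is just one way to do it):

```cpp
#include <atomic>
#include <cstdint>

// Pack an index and a generation counter into one 64-bit word so a single
// CAS compares both. A slot that was popped and re-pushed (same index,
// i.e. the "ABA" value) carries a new generation, so the stale CAS fails.
static uint64_t make(uint32_t index, uint32_t gen) {
    return (uint64_t(gen) << 32) | index;
}

std::atomic<uint64_t> top{make(0, 0)};

// Succeeds only if BOTH the index and the generation still match what the
// caller observed; bumps the generation on every successful swap.
bool replace_top(uint32_t old_index, uint32_t old_gen, uint32_t new_index) {
    uint64_t expected = make(old_index, old_gen);
    return top.compare_exchange_strong(expected,
                                       make(new_index, old_gen + 1));
}
```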
> So if you're going for absolute performance, you should write both implementations and measure.
Upon a quick review of history, my use of lock free structures and algorithms is dominated by reliability requirements ("nothing must unexpectedly pause the system" - mostly seqlocks), and only a few cases for performance - mostly shared lists and shared (often reference) counters.
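For reference, a seqlock in miniature -- a sketch only: this assumes a single writer, and a strictly correct version needs the payload accesses themselves to be atomic or fenced to avoid a formal data race under the C++ memory model. The reliability property is that readers never block the writer:

```cpp
#include <atomic>

std::atomic<unsigned> seq{0};
int data_x = 0, data_y = 0;  // the protected payload (simplified)

// Writer: bump the sequence to odd (write in progress), mutate, then
// bump back to even. The writer never waits on readers.
void write_pair(int x, int y) {
    seq.fetch_add(1, std::memory_order_acq_rel);  // now odd
    data_x = x;
    data_y = y;
    seq.fetch_add(1, std::memory_order_acq_rel);  // even again
}

// Reader: snapshot the sequence, read, snapshot again; retry if the
// sequence changed or was odd (a write overlapped our read).
void read_pair(int& x, int& y) {
    unsigned s0, s1;
    do {
        s0 = seq.load(std::memory_order_acquire);
        x = data_x;
        y = data_y;
        s1 = seq.load(std::memory_order_acquire);
    } while (s0 != s1 || (s0 & 1));
}
```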
If the write to the local file needs to be protected by a critical section (seems unlikely to me, but... I can imagine some situations like that), the alternative is performing the write inside of a CAS loop instead.
When the CAS loop fails (because it's been 20 seconds and the compare-and-swap sees a grossly different value from the one you were waiting on), you have to perform a 2nd I/O operation and restart the entire computation.
What is more performant? Locking your other threads? Or potentially doing extremely expensive work repeatedly?
Locking your other threads (especially if there are some other threads that have work to do) could very much be the more performant answer. In fact, I would argue that the locking implementation is more likely to be work efficient.
> Upon a quick review of history, my use of lock free structures and algorithms is dominated by reliability requirements ("nothing must unexpectedly pause the system" - mostly seqlocks), and only a few cases for performance - mostly shared lists and shared (often reference) counters.
I think that is reasonable and typical. I think beginners see "lock free" as a potential improvement to efficiency. But the advantages of lock-free are largely independent of performance; it has more to do with requirements than performance in most cases.
If I could recommend one book on this topic it'd be "The Art of Multiprocessor Programming". The authors have contributed to this subject through original research and make the topic quite approachable.
According to Wikipedia, it's the definition of non-blocking, not the definition of lock-free... which I guess is a correction for both the parent comment and yours. 
As far as I recall ever seeing, people always talk about it in terms of guaranteed systemwide progress, which I don't find as enlightening as what happens when you pause a thread.
If you think context switches are expensive, then this is also expensive, but in a different way. Nothing is ever locked, but certainly the computer slows down from its maximum performance potential. (But there are many similar effects that happen even without atomic operations, like mis-predicting a branch.)
My experience in using both is that sometimes profiling dictates that a lock is too expensive. I used to maintain a program that incremented a counter from hundreds of threads. Using an OS-level mutex to achieve this resulted in way too many context-switches, to the point where we were only using about 30% of the CPU to run our program and the rest of the CPU time was spent on OS-level housekeeping. Using atomic load/store was much faster. Enqueuing all the writes in a thread-local counter and flushing them to the global structure in an OS lock was the fastest, but only provided eventual consistency.
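A sketch of that thread-local batching approach (the batch size is an arbitrary assumption; "eventual consistency" here means the global counter lags by up to one unflushed batch per thread):

```cpp
#include <atomic>

std::atomic<long> global_count{0};

// Per-thread batching: bump a plain local counter on the hot path and
// flush to the shared atomic only occasionally. This trades immediate
// visibility of the global total for far less cache-line contention
// (and no mutex, so no context switches at all).
struct BatchedCounter {
    long local = 0;
    static constexpr long kFlushEvery = 1024;  // assumed batch size
    void increment() {
        if (++local == kFlushEvery) flush();
    }
    void flush() {
        global_count.fetch_add(local, std::memory_order_relaxed);
        local = 0;
    }
    ~BatchedCounter() { flush(); }  // don't lose the unflushed tail
};
```

In practice each thread would own one BatchedCounter (e.g. thread_local), and readers of global_count accept that it lags slightly behind the true total.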
Atomic operations execute atop the cache consistency protocol, which typically looks like: https://en.wikipedia.org/wiki/MOESI_protocol
It is indeed true that atomic operations will execute in a bounded time, and processors generally provide fairness guarantees as well.
Right now you're asserting things about all this, while not being familiar with relatively basic aspects of how it works.
Back in the day, I wanted to implement a cache in our software product that used reference counting to update objects and when the object went to 0, free it from the cache. Pretty straightforward, no rocket science, but this was also cross-platform including Solaris, HP-UX, etc.
This was easy to do on Windows/Intel because they implemented a native CAS at the time (almost 20 years ago now), but the other systems didn't. The only way would be to add a mutex per object, which was a non-starter because it used real kernel memory and resources, so I had to abandon the feature entirely.
The above is truly lock-free because it doesn't require any additional resources that have to be tracked or freed.
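That pattern, sketched with C++11 atomics (which nowadays give you portable atomic counters everywhere, solving exactly the portability problem described -- no per-object mutex, no kernel resources):

```cpp
#include <atomic>

// Lock-free reference counting: an atomic counter per object; whichever
// thread drops the count to zero frees the object.
struct CachedObject {
    std::atomic<int> refs{1};  // creator holds the first reference
    // ... cached payload would live here ...
};

void acquire(CachedObject* o) {
    // Relaxed is enough for an increment: we already hold a reference,
    // so the object can't disappear underneath us.
    o->refs.fetch_add(1, std::memory_order_relaxed);
}

void release(CachedObject* o) {
    // acq_rel on the decrement ensures all earlier uses of the object
    // happen-before its destruction by whichever thread hits zero.
    if (o->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete o;
}
```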
The java.util.concurrent package was added in Java 5 and before that existed as a third-party library (by Doug Lea). It contains a lot of concurrency primitives and makes use of a lot of techniques, including lock-free instructions and optimistic locking: https://en.wikipedia.org/wiki/Java_concurrency
I imagine it also impacts real-world performance, but lock-free algorithms aren't assured to be faster anyway.
Annoyingly I recall reading a good article on exactly this question - the importance of hardware atomics vs locking - but can't recall what it was. These locks still aren't as bad as ad-hoc locking, in terms of deadlock risk, as you're guaranteed that they are 'leaf locks', with no order-of-acquisition hierarchies/orderings to worry about.
How much do you trust random ARM-based SoCs to get this right at all of the necessary levels of cache consistency and memory access queues? Lots? Great, I am happy for you. Now extend your confidence into some other chips, like PowerPC (various versions), MIPS, and maybe a couple of others.
Eventually you are going to hit some very odd bugs, the really difficult bugs that will make you tear your hair out, and eventually that lock-free stuff, too.
"But we'll never run our stuff on a MIPS, or a Fthaghn-V1000." That's good... but the chipset you trust might get an update with slightly different memory behavior and you'll be sunk.
The projects I've been on that have shipped lock-free structures have done so only when using them was (a) high value, (b) there was a way to choose an alternate method (usually at compile time), and (c) the use was very limited (e.g., just a handful of critical places).
Depending on the access mode, the compiler or the hardware can still change the order of operations. It adds a whole other level of WTF:
> It is also possible to move stores from before an Acquire load or read-modify-write operation to after it, and move non-Acquire loads from before an Acquire operation to after it.
Most multi-threaded code works this way. Even safe languages like Java permit this. I believe the exceptions tend to be interpreted languages like Python, using green threads.
At the assembly level, most modern CPUs are permitted to perform out-of-order execution. I believe the exceptions are pretty rare these days; the custom PowerPC chip in the Xbox 360 guaranteed in-order execution, for instance. GPUs are a different beast.
It does become observable when you try to do synchronization yourself with atomics, without setting appropriate ordering requirements. Lock-free algorithms are usually tricky, and implementing them with minimum ordering requirements makes them even more puzzling.
> A gentle introduction to multithreading — Approaching the world of concurrency, two steps at a time.
(I wonder if "multiple steps at a time" sounds better)