
Lock-free multithreading with atomic operations - signa11
https://www.internalpointers.com/post/lock-free-multithreading-atomic-operations
======
comex
> The spin until success strategy seen above is employed in many lock-free
> algorithms and is called spinlock: a simple loop where the thread repeatedly
> tries to perform something until successful. It's a form of gentle lock
> where the thread is up and running — no sleep forced by the operating
> system, although no progress is made until the loop is over. Regular locks
> employed in mutexes or semaphores are way more expensive, as the
> suspend/wakeup cycle requires a lot of work under the hood.

A compare-and-swap loop is not a spinlock. (It is the primitive used to
implement a spinlock, but that's different.)
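To make the distinction concrete, here is a minimal C++ sketch (illustrative,
not from the article or the comment): the first function is a lock-free CAS
loop; the second pair builds a spinlock out of the same primitive.

    #include <atomic>

    // A CAS loop: a lock-free increment. No thread ever holds a lock; a
    // failed CAS just means another thread made progress first.
    void increment(std::atomic<int>& counter) {
        int old = counter.load();
        while (!counter.compare_exchange_weak(old, old + 1)) {
            // 'old' was reloaded with the current value; retry.
        }
    }

    // A spinlock built on the same primitive: the thread that wins the CAS
    // owns the lock, and every other thread spins until it is released.
    std::atomic<bool> locked{false};

    void spin_lock() {
        bool expected = false;
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire)) {
            expected = false;  // CAS stored the value it saw; reset and retry.
        }
    }

    void spin_unlock() {
        locked.store(false, std::memory_order_release);
    }

In the first case, pausing a thread cannot block anyone else; in the second,
it can.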

Pure spinlocks are almost always a bad idea, at least in userland, because
threads spinning trying to acquire the lock look "busy" to the scheduler, so
the scheduler may run them instead of the thread that owns the lock. It does
make sense to spin a limited number of times before going through the slow
path to acquire a mutex, but most OS mutex implementations have that
functionality built in, so there's usually no need to do that manually.

~~~
archy_
Why would it make more sense to "spin a limited number of times before going
through the slow path to acquire a mutex"? Does a mutex have more overhead in
the short term than a spinlock but after a few cycles become more efficient?

Off topic, but I remember reading that the Linux kernel prefers spinlocks to
mutexes. Is there a good technical reason for that?

~~~
dragontamer
> Does a mutex have more overhead in the short term than a spinlock but after
> a few cycles become more efficient?

Think about what a true mutex does. A true mutex switches into kernel mode
(aka: your program is no longer running, Linux is running). That means the
Spectre/Meltdown guards are executed (your TLB may be flushed, as well as
various other memory guards that prevent Spectre from leaking data).

Once the guards are executed, kernel mode has to find more work to do.
Traversing the kernel data structures can take 1 to 10 microseconds,
depending on how cold the cache is. Finally, since another thread may run
before your thread comes back, you have probably lost all your data from L1
cache (and, at minimum, your branch-predictor state, because of Spectre).

A spinlock without any contention takes less than 10 nanoseconds to run, up
to maybe 50 nanoseconds with a bit of contention (!!). You're basically
reading/writing data in L1 cache, maybe L3 cache under contention.

However, a scheduler invocation will be on the order of 5000 nanoseconds (~5
microseconds) or so, due to all of the work that the scheduler has to do.

\--------

Windows' default spin count is on the order of 4000 cycles. Spinning for
4000 cycles (or fewer) favors a spinlock-like approach (4000 cycles / 4 GHz
== 1 microsecond). Just to give you an idea of the speed magnitudes being
discussed here.
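A minimal sketch of that idea (illustrative only; the 4000 and the yield()
fallback are stand-ins, since a real mutex parks the thread in the kernel
instead of yielding):

    #include <atomic>
    #include <thread>

    class HybridSpinLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            for (int spins = 0; ; ++spins) {
                if (!flag.test_and_set(std::memory_order_acquire))
                    return;                    // got the lock on the cheap path
                if (spins >= 4000)             // cf. Windows' default spin count
                    std::this_thread::yield(); // slow path: stop looking "busy"
            }
        }
        void unlock() {
            flag.clear(std::memory_order_release);
        }
    };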

~~~
lallysingh
That's 4000 cycles @ 4 GHz == 1us right?

~~~
dragontamer
Yes, sorry. I'll go edit that correctly really quick...

------
dragontamer
Yet another article discussing atomics without discussing the memory model or
memory fences.

Atomics are the easy part to understand. Memory fences and memory ordering
are the hard part that needs far more discussion. If you use atomics with the
wrong memory fence, everything will break.

Memory fences are necessary to make the compiler, CPU, and caches commit the
data to main memory in the correct order.

\----------

Do NOT write atomics unless you understand fences. It is absolutely
essential, even on x86, to put memory fences in the correct spot. Even though
x86 automatically has acquire/release consistency... the compiler may still
reorder your operations. Even with "volatile" semantics, your data may be
committed to memory in the wrong order.

Fedor Pikus has an amusing example where you can have a properly coded lock
but the compiler may still mess you up:
[https://youtu.be/lVBvHbJsg5Y?t=811](https://youtu.be/lVBvHbJsg5Y?t=811)
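In the same spirit, a minimal sketch (not the example from the talk) of the
ordering bug that fences prevent:

    #include <atomic>

    int data = 0;                        // plain, non-atomic payload
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;
        // The release store acts as the fence: 'data' must become visible
        // before 'ready' does.
        ready.store(true, std::memory_order_release);
    }

    int consumer() {
        // The acquire load pairs with the release store above.
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        return data;                     // guaranteed to read 42
    }

With memory_order_relaxed on both operations, the compiler or CPU could
reorder the two stores, and the consumer could observe ready == true while
data is still 0.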

\----------

EDIT: if you use locks (spinlocks, mutexes, condition variables, etc.), the
author of the library will have put the memory fences in the correct place
for you. Only if you do synchronization "manually" (and you probably are, if
you are writing atomics) do you have to think about memory fences.

~~~
rrss
> Topics like sequential consistency and memory barriers are critical pieces
> of the puzzle and can't be overlooked if you want to get the best out of
> your lock-free algorithms. I will cover them all in the next episode.

Not every article has to be an exhaustive description of everything you need
to know.

Anywho, sequential consistency is the default for C++11 and C11 atomics, so
you can pretty much use them without a complete understanding of memory
consistency models.
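For instance (a sketch of the defaults, nothing more):

    #include <atomic>

    std::atomic<int> hits{0};

    void record() {
        // No explicit ordering argument: this defaults to
        // std::memory_order_seq_cst, the strongest ordering.
        hits.fetch_add(1);
    }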

~~~
dragontamer
Ah, you're right. They call them "barriers" instead of "fences", so I didn't
pick up on that paragraph.

Still, C11 and C++11 atomics aren't widely implemented yet in the GPU world.
OpenCL 1.2 doesn't have them (and OpenCL 2.0 is not commonly implemented),
CUDA doesn't have them yet, and AMD ROCm doesn't have them yet.

And anyone with older compilers will probably be working with fence
intrinsics instead of "innate" barriers per atomic instruction.

------
jchw
Note that lock-free is not always faster because contention is complicated.

~~~
vardump
Also note that lock-free can scale worse, again due to contention.

If you want scalability, you want to minimize inter-core and inter-socket
traffic.

~~~
jchw
This is the more important point, realistically. I think I probably would've
been better off saying 'lock-free is not a panacea.'

All in all, scaling up contended operations is hard.

------
amelius
I think "lock free" is a misnomer because in many cases these atomic
operations still perform locks at the hardware level.

~~~
beagle3
"lock free" regards to being free from any process/thread/device holding a
lock for an undetermined amount of time.

The formal definition IIRC is "a system is lock free, at least one concurrent
process makes progress towards finishing in a unit of time" (There's a hard to
achieve version called "wait free" which means "every process makes amortized
progress towards finishing")
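To make that concrete, a classic illustrative sketch (not the commenter's): a
lock-free stack push. Suspend any thread mid-push and the others still
complete, because their CAS succeeds against whatever head they observe.

    #include <atomic>

    struct Node {
        int   value;
        Node* next;
    };

    std::atomic<Node*> head{nullptr};

    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        // If another thread pushed first, the CAS fails, reloads the
        // current head into n->next, and we simply retry.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }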

The important property of a lock free system is that pausing any
thread/process does not stop the others from progressing; whereas with
classical locks (mutex, semaphore, spin, whatever), if you pause a lock-
holding process you risk starving the entire system.

~~~
mehrdadn
I love your last paragraph. It's extremely intuitive and yet I don't think
I've ever seen it explained that way.

~~~
vishnugupta
Not to take anything away from the parent's thoughtful comment. That important
property is, in fact, the definition of lock-free. Lock-free and wait-free
(which has a stronger guarantee) algorithms are sub-classifications of non-
blocking algorithms.

If I could recommend one book on this topic it'd be "The Art of Multiprocessor
Programming". The authors have contributed to this subject through original
research and make the topic quite approachable.

~~~
mehrdadn
> Not to take anything away from the parent's thoughtful comment. That
> important property is, in fact, the definition of lock-free.

According to Wikipedia, it's the definition of non-blocking, not the
definition of lock-free... which I guess is a correction for both the parent
comment and yours. [1]

As far as I recall ever seeing, people always talk about it in terms of
guaranteed systemwide progress, which I don't find as enlightening as what
happens when you pause a thread.

[1] [https://en.wikipedia.org/wiki/Non-blocking_algorithm](https://en.wikipedia.org/wiki/Non-blocking_algorithm)

------
denormalfloat
I found it surprising that atomic operations in Java are converted into
locks when compiling for some versions of Android. Java's ordering guarantees
(before 9) are slightly too strong and result in locks being used under the
hood.

~~~
jillesvangurp
Interesting. This probably is because of some hardware limitations on some
older android hardware platforms.

The java.util.concurrent package was added in Java 5 and before that existed
as a third party library (by Doug Lee). It contains a lot of concurrency
primitives and makes use of a lot of things, including lock free instructions
and optimistic locking:
[https://en.wikipedia.org/wiki/Java_concurrency](https://en.wikipedia.org/wiki/Java_concurrency)

------
kabdib
Based on some experience on non-X86 systems, lock-free is where you start
finding bugs in compilers and runtime libraries, and also hardware.

How much do you trust random ARM-based SoCs to get this right at all of the
necessary levels of cache consistency and memory access queues? Lots? Great,
I am happy for you. Now, extend your confidence to some other chips, like
PowerPC (various versions), MIPS, and maybe a couple of others.

Eventually you are going to hit some very odd bugs, the really difficult bugs
that will make you tear your hair out, and eventually that lock-free stuff,
too.

"But we'll never run our stuff on a MIPS, or a Fthaghn-V1000." That's good . .
. but the chipset you trust might get an update with slightly different memory
behavior and you'll be sunk.

The projects I've been on that have shipped lock-free structures have done so
_only_ when using them was (a) high value, (b) there was a way to choose an
alternate method (usually at compile time), and (c) the use was very limited
(e.g., just a handful of critical places).

------
pornel
If multi-threaded code is too easy for you, try multi-threaded code which is
executed in a different order than written!

Depending on the access mode, the compiler or the hardware can still change
the order of operations. It adds a whole other level of WTF:

[https://llvm.org/docs/Atomics.html#atomic-orderings](https://llvm.org/docs/Atomics.html#atomic-orderings)

> It is also possible to move stores from before an Acquire load or read-
> modify-write operation to after it, and move non-Acquire loads from before
> an Acquire operation to after it.

~~~
MaxBarraclough
> If multi-threaded code is too easy for you, try multi-threaded code which is
> executed in a different order than written!

_Most_ multi-threaded code works this way. Even safe languages like Java
permit this. I believe the exceptions tend to be interpreted languages like
Python, using green threads.

At the assembly level, most modern CPUs are permitted to perform out-of-order
execution. I believe the exceptions are pretty rare these days. The custom
PowerPC chip in the Xbox 360 guaranteed in-order execution, for instance [0].
GPUs are a different beast.

[0]
[https://en.wikipedia.org/wiki/Xenon_(processor)#Specifications](https://en.wikipedia.org/wiki/Xenon_\(processor\)#Specifications)

~~~
pornel
The difference is that reordering is not observable with just one thread.
Neither language optimizations nor CPU out-of-order execution changes the
semantics within a single thread. If you use higher-level primitives like
mutexes, they come with fences that also preserve that illusion.

It does become observable when you try to do synchronization yourself with
atomics, without setting appropriate ordering requirements. Lock-free
algorithms are usually tricky, and implementing them with minimum ordering
requirements makes them even more puzzling.
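A compact way to see it become observable is the classic store-buffering
litmus test (an illustrative sketch, not from the comment):

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    void thread1() {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    }

    void thread2() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    }

    // Run the two functions on two threads and join them. With relaxed
    // ordering, r1 == 0 && r2 == 0 is a permitted outcome: each CPU's store
    // can sit in its store buffer past the other's load. Make all four
    // operations memory_order_seq_cst and that outcome is forbidden.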

------
microcolonel
This presentation was fun, especially the part where he shows his spinlock
beating atomic CAS.

[https://youtu.be/ZQFzMfHIxng](https://youtu.be/ZQFzMfHIxng)

------
pizza234
I see that, in a related article, they missed a nice in-joke:

> A gentle introduction to multithreading — Approaching the world of
> concurrency, two steps at a time.

(I wonder if "multiple steps at a time" sounds better)

:-)

------
NKCSS
The site is dead, unfortunately.

