The main use of spinlocks that I'm aware of is minimising latency in inter-processor communication.
That is, if you have a worker task which is waiting for a supervisor task to tell it to do something, then to minimise the time between the supervisor giving the order and the worker getting to work, use a spinlock.
For this to really work, you need to implement both tasks as threads pinned to dedicated cores, so they won't be preempted. You will burn a huge amount of CPU doing this, so it won't be "faster" from a throughput point of view. But the latency will be as low as it's possible to go.
The main use of spinlocks that I'm aware of is dealing with interrupt handlers in driver code. In this situation you generally can't go to sleep (sleeping with interrupts disabled is generally a good way of never waking up) so calling "mutex_lock" is simply out of the question. That's probably a niche use but that's literally the only situation I've ever used actual spin locks instead of mutexes, mainly for the reasons outlined by TFA.
Most of the time I would say if lock performance impacts your application you are likely doing something wrong. There are many ways of solving typical synchronization problems without locks and if you need locks there are many ways to amortize costs.
Mutexes are not faster than spin locks the same way slower cars are not faster than fast cars. You might still crash in a fast car and be slower to the end of the race (to the supermarket) but that is just your failure to use the power you were given responsibly.
The feature article; the freaking article; the friendly article; the f$%^&*( article, cf. RTFM.
And yeah, I believe that it derives from "RTFA" (read the fucking article) which would be what you told people who obviously commented without reading the article.
the F is usually something else though ...
Also, the disruptor design is based on a system where there is a 1:1 correspondence between threads and cores. If that's not the case, it will likely still interact somewhat poorly with the OS's scheduler even without any blocking, for the same reason as spinlocks (in fact, if you use the library you can customize this waiting behaviour).
Random Google search seems to validate that.
The highest performance design is to use a ringbuffer. To write to the ringbuffer, you atomic-increment the "claim" counter, giving you an index. Now take index modulo ringbuffer size. That's the slot to write to. After you're done writing to it, set your "publish" counter to the index.
Each writer has a "publish" counter, and the "claim" counter isn't allowed to move beyond the lowest "publish" counter modulo ringbuffer size.
Each reader uses a similar strategy: the reader has a current counter. You find the minimum of all publish counter values, then process each slot up to that value, and set your counter to that value. The "claim" counter isn't allowed to move past the minimum of all reader counters.
Hence, everyone is reading from / writing to separate cache lines, and there are no locks at all. The sole primitive is atomic increment, which can never fail. The only failure condition is "one of the slots hasn't been processed yet" (i.e. the publish counter is <= claim counter minus ringbuffer size), at which point everybody waits for the slot to be processed.
You can wait using multiple strategies: a spin loop (which isn't the same as a spinlock because you're only reading a value in memory, not calling any atomic primitives), yielding CPU time via sched_yield(), or calling sleep. All strategies have tradeoffs. Calling sleep is the least CPU intensive, but highest latency. Vice-versa for spin loop.
Takeaway: there are no locks. Just increments.
I'd be interested to see benchmarks showing that spinlocks with CAS have similar throughput to a ringbuffer with atomic increment.
Note that with the ringbuffer approach, each reader can process many slots at once, since you're taking the minimum of all published slot numbers. If you last processed slot 3, and the minimum publish count is 9, you can process slot 4 through 9 without doing any atomic operations at all. The design guarantees safety with no extra work.
It's not a minor trick; it's one of the main reasons throughput is orders of magnitude higher.
Beyond that, the ringbuffer approach also solves the problem you'll always run into: if you use a queue, you have to decide the max size of the queue. It's very hard to use all available resources without allocating too much and swapping to disk, which murders performance. Real systems with queues have memory usage patterns that tend to fluctuate by tens of GB, whereas a fixed-size ringbuffer avoids the problem.
Also, as far as I'm aware, at least on Intel all atomic operations take pretty much exactly the same number of clock cycles (except for CAS and DCAS, which are ~10% and ~50% more expensive, IIRC)
For context: before Pentium with its glueless 2x2 SMP/redundancy support there were various approaches to shared memory x86 multiprocessors with wildly different memory coherence models. (And some of the “let's design a board with eight 80386s” systems are the reason why Intel designed the i586 to be glueless, and such systems are probably still used to this day, although unsupported)
As forward progress is a pretty basic requirement, in practice even LL/SC platforms provide it; but instead of a single instruction with guaranteed forward progress, you have to use a few special (but sometimes underspecified) sequences of instructions between the LL/SC pairs.
FWIW RISC-V guarantees forward progress for reasonable uses:
> We mandate that LR/SC sequences of bounded length (16 consecutive static instructions) will eventually succeed, provided they contain only base ISA instructions other than loads, stores, and taken branches.
what happens if those 16 instructions touch 16 different cache lines? I'm not a hardware expert (and even less so on coherency protocols), but I think it would be extremely hard to make sure livelocking is avoided in all cases, short of having some extremely expensive and heavy-handed global 'bus' lock fallback.
That's quite the corner case.
What if your work queues are running on a multitasking operating system that runs services? And what about a hypervisor?
If you’re not sharing cores between VMs it’s typical to do the same at the hypervisor layer.
I wrote some software at an automated trading firm that would do the subset of "parsing ticks and maintaining order books" sufficient to map ticks to symbols, and no more work than that. My code was pinned to one core and spun waiting for an expensive network card to DMA some ticks into memory, then when ticks we cared about were found it would load them onto one or more queues. Each queue corresponded to a different thread spin waiting on its own pinned core.
So one use case is "transforming ticks into trading decisions slightly faster than you otherwise could"
Whether there is a use case for spinlocks outside low-latency trading, I have no idea!
I did a quick test on my Mac using 4 threads instead. At "heavy contention" the spin lock is actually 22% faster than parking_lot::Mutex. At "extreme contention", the spin lock is 22% slower than parking_lot::Mutex.
Heavy contention run:
$ cargo run --release 4 64 10000 100
Finished release [optimized] target(s) in 0.01s
Running `target/release/lock-bench 4 64 10000 100`
std::sync::Mutex avg 2.822382ms min 1.459601ms max 3.342966ms
parking_lot::Mutex avg 1.070323ms min 760.52µs max 1.212874ms
spin::Mutex avg 879.457µs min 681.836µs max 990.38µs
AmdSpinlock avg 915.096µs min 445.494µs max 1.003548ms
std::sync::Mutex avg 2.832905ms min 2.227285ms max 3.46791ms
parking_lot::Mutex avg 1.059368ms min 507.346µs max 1.263203ms
spin::Mutex avg 873.197µs min 432.016µs max 1.062487ms
AmdSpinlock avg 916.393µs min 568.889µs max 1.024317ms
$ cargo run --release 4 2 10000 100
Finished release [optimized] target(s) in 0.01s
Running `target/release/lock-bench 4 2 10000 100`
std::sync::Mutex avg 4.552701ms min 2.699316ms max 5.42634ms
parking_lot::Mutex avg 2.802124ms min 1.398002ms max 4.798426ms
spin::Mutex avg 3.596568ms min 1.66903ms max 4.290803ms
AmdSpinlock avg 3.470115ms min 1.707714ms max 4.118536ms
std::sync::Mutex avg 4.486896ms min 2.536907ms max 5.821404ms
parking_lot::Mutex avg 2.712171ms min 1.508037ms max 5.44592ms
spin::Mutex avg 3.563192ms min 1.700003ms max 4.264851ms
AmdSpinlock avg 3.643592ms min 2.208522ms max 4.856297ms
I remember a sort of sea change in my thinking on technical books during a period where I tended to keep them at work instead of at home. I noticed a curious pattern in which ones were getting borrowed and by whom. Reading material isn't only useful if it has something new to me in it. It's also useful if it presents information I already know and agree with, in a convenient format. Possibly more useful, in fact.
Basically the user mode mutex lock is implemented as:
// In user mode: if the lock flag is free (0), atomically set it to 1 and return
while !atomic_compare_and_swap(&lock, 0, 1)
    // Lock is held: sleep until woken. We pass the value we expect the lock
    // word to still hold (1, locked); the kernel only sleeps if it matches,
    // which closes the race with a release that happened in between.
    futex_wait(&lock, 1) // syscall to kernel
Edit: the atomic_compare_and_swap can be just a macro for the CMPXCHG instruction, so acquiring the lock is very fast when no one else is holding it.
Looks like it’s relying on glibc for lock elision.
Mind you, the parking_lot::Mutex in the article is not the stdlib implementation; it's documented here: https://docs.rs/parking_lot/0.10.0/parking_lot/type.Mutex.ht...
And that looks like it’s not using pthread, instead relying on its own primitives: https://github.com/Amanieu/parking_lot/blob/master/src/raw_m...
1. Does a CAS with compare_exchange_weak() at line 69.
2. Then calls lock_slow() at line 72 to do spin-locking (Guh!).
3. The call to parking_lot_core::park() at line 256 seems to sleep-wait.
Which matches the description of that library:
"This library provides implementations of Mutex, RwLock, Condvar and Once that are smaller, faster and more flexible than those in the Rust standard library"
"Uncontended lock acquisition and release is done through fast inline paths which only require a single atomic operation.
Microcontention (a contended lock with a short critical section) is efficiently handled by spinning a few times while trying to acquire a lock.
The locks are adaptive and will suspend a thread after a few failed spin attempts."
The only thing that I'm missing is how often the sleeping threads wake, as a bad constant can increase the CPU use.
I would imagine the unlocking call would wake up the sleeping thread so it won't keep waking up periodically.
The user-mode only implementation sounds interesting. The only downside is it won't work with inter-process locking.
Likely a hard linear-quadratic expanding threshold. They tend to waste a lot of CPU power in specific algorithms.
There are better, more intrusive ways of solving this problem on language level.
Now, the "zero syscalls for uncontended pthread mutices via futex" optimization is Linux specific and may not be replicated elsewhere. Or it may. It's not Posix, but I know for instance win32 critical sections look a lot like spinlocks when not contended but do syscalls to block, which sounds a lot like a futex. So that would put that technique as dating to the 1990s at the latest. Futex landed in 2002.
What’s your point here? Posix is portable across posix compliant systems, it is not portable beyond that.
There are many systems that are posix, but Rust targets more than that.
What is the non-posix target you have in mind? Windows? Conditionally compiling against either pthreads or win32 critical sections [or anything else] seems feasible and reasonable. Maybe even the spinlock as a last resort.
"So you might want to look into not the standard library implementation, but specific locking implementations for your particular needs. Which is admittedly very very annoying indeed. But don't write your own. Find somebody else that wrote one, and spent the decades actually tuning it and making it work.
Because you should never ever think that you're clever enough to write your own locking routines.. Because the likelihood is that you aren't (and by that "you" I very much include myself - we've tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).
There's a reason why you can find decades of academic papers on locking. Really. It's hard."
I was getting heckled recently about my allergy to trying to negotiate interprocess consensus through the filesystem. I've seen similar conversations about how hard it is to 'know' the state of files, especially in a cross platform or cross filesystem way (see also the decade old fsync bug in PostgreSQL we were talking about early last year). In our case several dozen machines all have to come to the same decision at the same time (because round robin) and I was having none of it. I eventually had to tell him, in the nicest way possible, to fuck off, I'm doing this through Consul.
The thing is that people who generally don't learn from their mistakes absolutely do not learn from their old mistakes. So for any bug they introduce (like a locking problem) that takes months or quarters to materialize, they will not internalize any lessons from that experience. Not only wasn't I gonna solve the problem the way he wanted, but if he tried to take over it, we'd have broken code at some arbitrary future date and he'd learn nothing. He could not understand why I in particular and people in general were willing to die on such a hill.
Anger is not the best tool for communication, but as someone once put it, it's the last signal you have available that your boundaries are being violated and this needs to stop. Especially if you're the one who will have to maintain a bad decision. As often as I critique Linus for the way he handles sticky problems, on some level he is not wrong.
Although, any particular thing happening to be one of those is a pretty rare event, so odds are good that this wasn't one.
There is really only a single place volatile actually works, and that is for memory-mapped hardware registers. Anybody who says it is useful for anything else is badly mistaken.
Except in MSVC, where it kinda/sorta means atomic.
This estimate is way too high for the uncontested mutex case. On a modern Linux/Xeon system using GCC, an uncontested mutex lock/unlock is well under 1 microsecond.
I have a lot of experience here from writing low-latency financial systems. The hot path we use is littered with uncontested mutex lock/unlock, and the whole path still runs under 20 microseconds. (With the vast majority of that time unrelated to mutex acquisition.)
The benchmark used in the blog post must be spending the vast majority of its time in some section of code that has nothing to do with lock/unlock.
If you have threads interacting, whether via mutexes or spinlocks, you have a high-latency system.
But the concept of short time does not even exist deterministically in userspace, usually, because it can always be preempted. So don't use pure spinlocks in userspace, unless you really really know what you are doing (and that includes knowing how your kernel works in great details, in the context of how you use it).
Oh there is one already. Meet PAUSE: https://www.felixcloutier.com/x86/pause
Edit: related post from 2018 https://news.ycombinator.com/item?id=17336853
The high waiting variance is because the benchmark randomly decides which locks to take, meaning that the amount of contention is variable.
I think the Linux futex can be implemented through the vDSO (can somebody correct me on this?), so that eliminates the worst of the syscall costs.
His benchmark is weird, but maybe I'm reading it wrong:
* Instead of reducing thread count to reduce contention, he appears to increase the number of locks available. This is still a bad scenario for spinlocks, since they will still interact badly with the scheduler (they will use a large amount of CPU time, get evicted from the run queue, and need to be rescheduled).
* Also, I didn't see him pin any of the threads, so all those threads will start sharing some CPUs, since the OS does need to be doing some work on them too.
* And Rust can be a little hard to read, but it seems he packed his locks on the same cache line? I don't see any padding in his AmdSpinlock struct. That would be a huge blow for the spinlock because of false sharing: he's still getting all the cache coherence traffic because of it.
The worst cases for the spinlock are not understanding scheduling costs and the cache thrashing that can occur.
What they call the AmdSpinlock (basically just a regular, sane spinlock that tries to prevent cache thrashing) has its best performance with a low number of threads, assigned to different cores under the same L3 segment.
(Does anybody know if AMD's new microarchitecture moved away from the MOESI/directory-based cache towards Intel's MESIF/snoop model?)
The MOESI model might have performed better in this regard under the worst-case scenario, since it doesn't need to write the cache line back and can just forward the dirty line around, as it keeps track of who owns it.
And if you run under a MESIF-based cache and you can keep your traffic local to your L3 segment, you are backstopped there and never need to go anywhere else.
A spinlock is a performance optimization and should be treated as one. You need to have intimate knowledge of the architecture you are running under and the resources being used.
(edit: answered my own question; apparently the vDSO is still pretty limited in what it exports, so no, it isn't happening at this time from what I can tell.)
futex cannot be implemented in the VDSO since it needs to call into the scheduler.
Another way to think about this: VDSO is used for syscalls that are (mostly) read-only and can avoid the kernel mode switch on a common fast path. The futex syscall is already the raw interface that is designed on the assumption that the caller only uses it in the slow path of whatever higher-level synchronization primitive they're implementing, so trying to use VDSO tricks to implement futex would be redundant.
I see that now. I was looking for it in the AmdSpinlock struct, but that kind of makes sense.
> The futex syscall is already the raw interface that is designed on the assumption that the caller only uses it in the slow path of whatever higher-level synchronization primitive they're implementing, so trying to use VDSO tricks to implement futex would be redundant.
Ah. Thanks. I didn't know how far you could get with MWAIT, but I guess you still need to deschedule. I also didn't realize futex was a direct syscall and there was no user level api going on around it.
Is he running 32 threads even in the low contention case? And not pinning? There's something about his numbers that just seems a little too high for what I would expect. I've seen this around a lot, and the reason the mutex usually wins is that it basically does a spin of 1 or more, then falls back to a futex (the pthread code appears to spin 100 times before falling back to a futex).
At work I use a spin lock on a shared memory region because it does test out to be lower latency than std::mutex and we're not under much contention. I've thought about replacing it with a light futex-based library, but it doesn't seem to be quicker.
He still seems to be getting some contention, and I'm trying to figure out how.
The semantic of a futex wait is a request to the kernel to put the thread to sleep until another thread issues a signal for the same "key".
The trick is that the key is the address of a 32 bit memory location (provided to the kernel in both the signal and wait syscalls), which is normally the mutex address.
So a process first tries to acquire the mutex as for a normal spin lock (with a CAS, exchange, xadd or whatever works for the specific algorithm), attempting to set the lock bit to 1; if it fails (possibly after spin-trying a few times), it invokes the futex wait. As you can see, the syscall is only done in the slow path.
On a futex release, the simple solution is to always invoke the futex signal syscall after setting the lock bit to zero. To fast-path the release, a waiter bit can be set on the acquire path just before calling futex wait; on the release path, when setting the lock bit to zero, signal would only be called if the waiter bit was set (and it is then cleared together with the lock bit).
As you can see, the futex syscall is already the slow path and never needs to be invoked in the uncontended case. In fact the kernel doesn't even need to know about the futex until the first contention.
Futexes are edge triggered, same as condition variables, so a signal only wakes threads blocked on a wait that happened-before the signal call. Thus there is a race condition if a lock release and signal happen between the failed acquire attempt and the wait call; to prevent this, futex wait has an additional parameter that is the expected value of the memory location: the kernel checks the futex address against this value and will put the thread to sleep only if it hasn't changed (this is done atomically).
As there could be multiple waiters, a broadcast is actually needed here, which leads to a thundering herd as all waiters race to acquire the lock. The original futex paper used a wait count instead of just a bit, but there are other options.
A mutex may try to spin more than a "couple of times" before it calls futex. Example: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_...
Calling this 'elision' is a bit confusing, since the glibc developers use the term 'lock elision' to refer to hardware lock elision via Intel TSX extensions.
What eliminates the syscall cost is optimistic spinning in userspace, so that locking only does futex_wait in cases of contention when the lock isn't released 'soon' after the thread starts trying to acquire it.
In a spin lock, the lock state is checked in a tight loop by all waiters, using some sort of memory fence. AFAIK, memory fences or barriers drain the CPU's store buffers and would initiate re-reading the variable (the spin lock state) for evaluation. I would expect spin locking overheads to increase with the number of cores.
On NUMA, I think flushing is more expensive.
Hence, spin locks have the additional overhead of having to load and evaluate the state on every spin, as opposed to being woken up, as with mutexes (like a callback)
I've been writing MP code since the early 90's; SPARC, MIPS, x86, PPC, SH-4, Alpha, and all have instructions to make this easier for you.
Spin locks are very useful for locks held for short durations, and only on MP systems, particularly in the kernel, since a spin lock there can deadlock the whole system on a single CPU machine. The very general rule of thumb is that you generally want blocking locks, unless there is a very specific reason to use a spin lock. Those very specific reasons might be waiting briefly on some device in a driver, or running a realtime system where wake latency is critical, etc. Most of the time, spin locks will burn CPU for no good reason, and you still have to implement the non-spinlock case anyway, in case you are running on a 1 CPU system. So, might as well start with blocking locks, and only add spin locks where performance would benefit from it.
The real issue with spin locks is fairness. There’s no assurance that any given thread will ever make progress. A thread could starve forever. Production-ready mutexes like absl::Mutex make efforts toward fairness, even if they don’t have hard guarantees.
Well, as long as you do the test-CAS instead of the pure-CAS approach, not every loop iteration results in cache line bouncing.
Plus Intel has introduced the MWAIT instruction to implement something similar to futex in hardware, i.e. the hyperthread can sleep until another core updates the cache line in question.
I agree that the naive implementations are not.
This actually does not have to be this way. You could have a linked list of spinlocks, one for each waiter. Each waiter spins on its own, unique spinlock. When the previous waiter is done, it unlocks the next spinlock, and so on. The implementation gets a bit complicated in non-GC languages, since there are races between insertion/removal on the linked list. If the number of threads is fixed and long-lived, it becomes easier, since instead of a linked list you can have an array of spinlocks.
Note: in some architectures (x86?) you could possibly do away with atomics, since (I believe) int updates are atomic by default. Not really sure though.
The release can be a compiler only barrier followed by a simple store, but you do need an atomic RMW in the acquire path.
It is technically possible to implement a lock with just loads and stores (see Peterson lock) even in the acquire path but it does require a #StoreLoad memory barrier even on Intel, which is as expensive as an atomic RMW so it is not worth it.
Edit:fixed barrier name typo.
What is that? x86 is TSO...
Do you have an example of the full acquire and release sections in x86 assembly?
Edit: here is a good (but long) article with a x86 Peterson Lock example (with an analysis of the critical race without the barrier):
The additional delay it added for checking the state resulted in the deadlock timer triggering a dump! Hence, checking a variable is very expensive.
Then the benchmark uses spinlocks in situations that spinlocks aren't great for. And, surprise, spinlocks are slower than spinlocks-with-mutexes.
Spinlocks are great in situations such as:
* There are far more resources than threads,
* The probability of actually having to spin is small, ideally when the time spent holding the lock is a few instructions,
* When you can't use optimistic concurrency.
There are plenty of linked list implementations, for example, that use optimistic concurrency. At that point you've got yourself a thread-safe message queue, and that might be better than mutexes, too.
That said, I see why Rust makes things so annoying. I want lots of code to work in no-std so we have a robust embedded and kernel ecosystem. It would be really nice to abstract over the locking implementation with type parameters, but that requires "higher-kinded types", which Rust doesn't have. The alternative is relying on carefully coordinated Cargo features, which is much more flaky and hard to audit (e.g. with the type system). Given that, I am not sure people over-use spin locks.
The kernel can run your code on a cooler core, giving it a higher clock, for example. Ultimately making it run faster.
Of course this won't show in a benchmark where all the threads do mostly calculation rather than contention, but that's not the typical case. That mostly shows up in compute such as multithreaded video where latency does not matter one bit.
Typically you have more of a producer/consumer pattern where consumer sleeps, and it's beneficial to run it on a cold CPU, assuming the kernel woke it up beforehand.
Source: hit some latency issues with an ancient kernel on a nastily hacked big.LITTLE ARM machine. It liked to overheat cores and put heavy tasks on the overheated ones for alleged power saving. (Whereas running a task quicker saves power.)
Also, there are better spinlock implementations, such as speculative spinlocks, and queued locks.