
Mutexes are faster than Spinlocks - erickt
https://matklad.github.io/2020/01/04/mutexes-are-faster-than-spinlocks.html
======
twic
The author has an implicit definition of "faster" which it is important to be
aware of.

The main use of spinlocks that i'm aware of is minimising _latency_ in inter-
processor communication.

That is, if you have a worker task which is waiting for a supervisor task to
tell it to do something, then to minimise the time between the supervisor
giving the order and the worker getting to work, use a spinlock.

For this to really work, you need to implement both tasks as threads pinned to
dedicated cores, so they won't be preempted. You will burn a huge amount of
CPU doing this, and so it won't be "faster" from a throughput point of view.
But the latency will be as low as it's possible to go.
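
For illustration, a minimal sketch of that handoff, assuming the worker simply
spins on an atomic flag; the pinning itself is left out, since it needs an OS
facility such as sched_setaffinity or a crate:

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::thread;

    fn main() {
        let go = Arc::new(AtomicBool::new(false));
        let worker_flag = Arc::clone(&go);

        let worker = thread::spawn(move || {
            // Busy-wait: burns a core, but reacts within nanoseconds of the store.
            while !worker_flag.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            // ... do the work the supervisor asked for ...
        });

        // Supervisor: some time later, give the order.
        go.store(true, Ordering::Release);
        worker.join().unwrap();
    }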

~~~
simias
>The main use of spinlocks that i'm aware of is minimising latency in inter-
processor communication.

The main use of spinlocks that I'm aware of is dealing with interrupt handlers
in driver code. In this situation you generally can't go to sleep (sleeping
with interrupts disabled is generally a good way of never waking up) so
calling "mutex_lock" is simply out of the question. That's probably a niche
use but that's literally the only situation I've ever used actual spin locks
instead of mutexes, mainly for the reasons outlined by TFA.

~~~
vba
TFA? Tried to google but unsuccessful

~~~
hyperion2010
The aForementioned Article

the F is usually something else though ...

~~~
saghm
My OS professor in college told us "RTFM means read the manual; the 'f' is
silent"

------
nathanielherman
This experiment is a bit weird. If you look at
[https://github.com/matklad/lock-bench](https://github.com/matklad/lock-
bench), this was run on a machine with 8 logical CPUs, but the test is using
32 threads. It's not that surprising that spin locks do badly when you run 4x
as many threads as there are CPUs.

I did a quick test on my Mac using 4 threads instead. At "heavy contention"
the spin lock is actually 22% faster than parking_lot::Mutex. At "extreme
contention", the spin lock is 22% slower than parking_lot::Mutex.

Heavy contention run:

    
    
      $ cargo run --release 4 64 10000 100
          Finished release [optimized] target(s) in 0.01s
          Running `target/release/lock-bench 4 64 10000 100`
      Options {
          n_threads: 4,
          n_locks: 64,
          n_ops: 10000,
          n_rounds: 100,
      }
    
      std::sync::Mutex     avg 2.822382ms   min 1.459601ms   max 3.342966ms  
      parking_lot::Mutex   avg 1.070323ms   min 760.52µs     max 1.212874ms  
      spin::Mutex          avg 879.457µs    min 681.836µs    max 990.38µs    
      AmdSpinlock          avg 915.096µs    min 445.494µs    max 1.003548ms  
    
      std::sync::Mutex     avg 2.832905ms   min 2.227285ms   max 3.46791ms   
      parking_lot::Mutex   avg 1.059368ms   min 507.346µs    max 1.263203ms  
      spin::Mutex          avg 873.197µs    min 432.016µs    max 1.062487ms  
      AmdSpinlock          avg 916.393µs    min 568.889µs    max 1.024317ms  
    

Extreme contention run:

    
    
      $ cargo run --release 4 2 10000 100
          Finished release [optimized] target(s) in 0.01s
          Running `target/release/lock-bench 4 2 10000 100`
      Options {
          n_threads: 4,
          n_locks: 2,
          n_ops: 10000,
          n_rounds: 100,
      }
    
      std::sync::Mutex     avg 4.552701ms   min 2.699316ms   max 5.42634ms   
      parking_lot::Mutex   avg 2.802124ms   min 1.398002ms   max 4.798426ms  
      spin::Mutex          avg 3.596568ms   min 1.66903ms    max 4.290803ms  
      AmdSpinlock          avg 3.470115ms   min 1.707714ms   max 4.118536ms  
    
      std::sync::Mutex     avg 4.486896ms   min 2.536907ms   max 5.821404ms  
      parking_lot::Mutex   avg 2.712171ms   min 1.508037ms   max 5.44592ms   
      spin::Mutex          avg 3.563192ms   min 1.700003ms   max 4.264851ms  
      AmdSpinlock          avg 3.643592ms   min 2.208522ms   max 4.856297ms

~~~
rwem
If you only have 4 threads it is likely that all your CPUs are sharing caches
and you won't see the real downside of the spinlock. They don't really fall
apart until you have several sockets.

~~~
nathanielherman
Note that I get a similar speedup with 6 and 8 threads on my Mac (which has 8
logical CPUs)

------
ww520
Modern day user-mode pthread mutex uses the futex kernel syscall [1] to
implement the lock, which avoids the syscall in the non-contended case, so it
can be very fast, acquiring the lock entirely in user-mode code. I'm not
sure whether Rust's mutex API is a wrapper over the pthread mutex or calls
older kernel locking syscalls directly.

Basically the user mode mutex lock is implemented as:

    
    
        // In user mode: if the lock flag is free (0), set it to 1 and return
        while !atomic_compare_and_swap(&lock, 0, 1)
            // Lock is held; sleep in the kernel while the flag still reads 1
            futex_wait(&lock, 1)  // syscall to kernel
    

When futex_wait() returns, it means the flag has been set back to 0 by the
other thread's unlock, and the kernel wakes my thread up so I can check it
again. However, another thread can come in and grab the lock in the meantime,
so I need to loop back to check again. The atomic CAS operation is the one
acquiring the lock.

[1]
[https://github.com/lattera/glibc/blob/master/nptl/pthread_mu...](https://github.com/lattera/glibc/blob/master/nptl/pthread_mutex_lock.c#L168)

Edit: the atomic_compare_and_swap can be just a macro to the assembly CMPXCHG,
so it's very fast to acquire a lock if no one else is holding the lock.
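
Expanding that pseudocode into a minimal Rust sketch, assuming Linux and the
libc crate (this illustrates the futex pattern only, not glibc's or Rust's
actual implementation; a real one also tracks a "contended" state so unlock
can skip the wake syscall when nobody is waiting):

    use std::sync::atomic::{AtomicU32, Ordering};

    fn futex_wait(word: &AtomicU32, expected: u32) {
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                word as *const AtomicU32 as *const u32,
                libc::FUTEX_WAIT,
                expected,
                std::ptr::null::<libc::timespec>(), // no timeout
            );
        }
    }

    fn futex_wake_one(word: &AtomicU32) {
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                word as *const AtomicU32 as *const u32,
                libc::FUTEX_WAKE,
                1, // wake at most one waiter
            );
        }
    }

    fn lock(word: &AtomicU32) {
        // Fast path: 0 -> 1 entirely in user mode.
        while word
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            // Slow path: sleep in the kernel while the word still reads 1.
            futex_wait(word, 1);
            // Another thread may grab the lock first, so loop and retry the CAS.
        }
    }

    fn unlock(word: &AtomicU32) {
        word.store(0, Ordering::Release);
        futex_wake_one(word); // this simple version always pays the wake syscall
    }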

~~~
bluejekyll
[https://github.com/rust-
lang/rust/blob/master/src/libstd/sys...](https://github.com/rust-
lang/rust/blob/master/src/libstd/sys/unix/mutex.rs#L26)

Looks like it’s relying on glibc for lock elision.

Mind you the parking_lot::Mutex in the article is not the stdlib
implementation, documented here:
[https://docs.rs/parking_lot/0.10.0/parking_lot/type.Mutex.ht...](https://docs.rs/parking_lot/0.10.0/parking_lot/type.Mutex.html)

And that looks like it’s not using pthread, instead relying on its own atomic primitives:
[https://github.com/Amanieu/parking_lot/blob/master/src/raw_m...](https://github.com/Amanieu/parking_lot/blob/master/src/raw_mutex.rs)

~~~
ww520
Then it's using the futex implementation. It's very efficient.

~~~
bluejekyll
I updated my comment. The fastest impl, parking_lot, appears to be a ground-up,
atomics-based mutex that doesn’t rely on pthread at all.

~~~
ww520
It seems to be doing similar logic.

1. It does a CAS with compare_exchange_weak() at line 69.

2. Then it calls lock_slow() at line 72 to do spin locking (Guh!).

3. The call to parking_lot_core::park() at line 256 seems to do a sleeping wait.

~~~
acqq
So it acquires fully in userspace if there's no contention, it even spins
if there is contention, and then, if that wasn't enough, it lets the thread
sleep with a timeout.

Which matches the description of that library:

[https://github.com/Amanieu/parking_lot](https://github.com/Amanieu/parking_lot)

"This library provides implementations of Mutex, RwLock, Condvar and Once that
are smaller, faster and more flexible than those in the Rust standard library"

"Uncontended lock acquisition and release is done through fast inline paths
which only require a single atomic operation.

Microcontention (a contended lock with a short critical section) is
efficiently handled by spinning a few times while trying to acquire a lock.

The locks are adaptive and will suspend a thread after a few failed spin
attempts."

The only thing that I'm missing is how often the sleeping threads wake, as a
bad constant can increase the CPU use.
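
A rough sketch of that shape (single-atomic fast path, a bounded spin, then
give up and park), with an invented spin budget; this is only an illustration
of the pattern, not parking_lot's actual code:

    use std::sync::atomic::{AtomicBool, Ordering};

    /// Returns true if the lock was taken; on false the caller would park the
    /// thread (e.g. with a futex wait, as sketched earlier in the thread) and
    /// wait for a directed wakeup from the unlocking thread.
    fn try_acquire_adaptive(locked: &AtomicBool) -> bool {
        // Fast inline path: a single atomic op when uncontended.
        if locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return true;
        }
        // Microcontention: a handful of spins with a CPU pause hint.
        for _ in 0..32 { // the spin budget here is an arbitrary, illustrative value
            std::hint::spin_loop();
            if !locked.load(Ordering::Relaxed)
                && locked
                    .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
                    .is_ok()
            {
                return true;
            }
        }
        false
    }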

~~~
temac
Does it really sleep for a specified period instead of doing directed wake-
ups? If so that's very far from ideal...

~~~
bdash
No, a thread that fails to acquire the mutex sleeps until the thread that is
releasing the mutex explicitly wakes it. On Linux this is achieved via
FUTEX_WAIT / FUTEX_WAKE.

------
sharken
This is pretty much the conclusion of this game-related post on spinlocks:
[https://probablydance.com/2019/12/30/measuring-mutexes-
spinl...](https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-
and-how-bad-the-linux-scheduler-really-is/)

~~~
jakswa
Oh wow/yikes, Linus Torvalds commented on that recently:
[https://www.realworldtech.com/forum/?threadid=189711&curpost...](https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723)

"So you might want to look into not the standard library implementation, but
specific locking implementations for your particular needs. Which is admittedly
very very annoying indeed. But don't write your own. Find somebody else that
wrote one, and spent the decades actually tuning it and making it work.

Because you should never ever think that you're clever enough to write your
own locking routines.. Because the likelihood is that you aren't (and by that
"you" I very much include myself - we've tweaked all the in-kernel locking
over decades, and gone through the simple test-and-set to ticket locks to
cacheline-efficient queuing locks, and even people who know what they are
doing tend to get it wrong several times).

There's a reason why you can find decades of academic papers on locking.
Really. It's hard."

~~~
hinkley
We've been arguing about concurrency primitives literally for decades and the
'worst' part is that for most of that time, all of the competing solutions
were documented by the same individual - Tony Hoare - within a narrow period
in the early 1970's. Soon that will be 50 years ago. Watching people argue is
like the People's Liberation Front of Judea scene in Life of Brian. As far as
I know, borrow checking may be the first real change in that arena in decades,
and I bet even that is older than most of us know.

I was getting heckled recently about my allergy to trying to negotiate
interprocess consensus through the filesystem. I've seen similar conversations
about how hard it is to 'know' the state of files, especially in a cross
platform or cross filesystem way (see also the decade old fsync bug in
PostgreSQL we were talking about early last year). In our case several dozen
machines all have to come to the same decision at the same time (because round
robin) and I was having none of it. I eventually had to tell him, in the
nicest way possible, to fuck off, I'm doing this through Consul.

The thing is that people who generally don't learn from their mistakes
absolutely do not learn from their _old_ mistakes. So for any bug they
introduce (like a locking problem) that takes months or quarters to
materialize, they will not internalize any lessons from that experience. Not
only wasn't I gonna solve the problem the way he wanted, but if he tried to
take over it, we'd have broken code at some arbitrary future date and he'd
learn nothing. He could not understand why I in particular and people in
general were willing to die on such a hill.

Anger is not the best tool for communication, but as someone once put it, it's
the last signal you have available that your boundaries are being violated and
this needs to stop. Especially if you're the one who will have to maintain a
bad decision. As often as I critique Linus for the way he handles sticky
problems, on some level he is not wrong.

~~~
ncmncm
Back around 2012 I worked with a guy, a FreeBSD kernel committer, who insisted
volatile was sufficient as a thread synchronization primitive. He convinced
our boss.

~~~
asveikau
Wouldn't that depend on the case? There is a nonzero number of things in the
universe for which volatile with no locking will do.

Although, any particular thing happening to be one of those is a pretty rare
event, so odds are good that this wasn't one.

~~~
legulere
Volatile doesn’t even guarantee that data is written atomically in one step
and not e.g. byte-wise. Also it allows both the compiler as well as the CPU to
reorder it with any read or write. I can’t think of anything that it could be
used for in a multithreaded environment.

~~~
ncmncm
Right.

There is really only a single place volatile actually works, and that is for
memory-mapped hardware registers. Anybody who says it is useful for anything
else is badly mistaken.

Except in MSVC, where it kinda/sorta means atomic.

------
dcolkitt
> Second, the uncontended case looks like
>
> parking_lot::Mutex avg 6ms min 4ms max 9ms

This estimate is way too high for the uncontested mutex case. On a modern
Linux/Xeon system using GCC, an uncontested mutex lock/unlock is well under 1
_microsecond_.

I have a lot of experience here from writing low-latency financial systems.
The hot path we use is littered with uncontested mutex lock/unlock, and the
whole path still runs under 20 microseconds. (With the vast majority of that
time unrelated to mutex acquisition.)

The benchmark used in the blog post must be spending the vast majority of its
time in some section of code that has nothing to do with lock/unlock.

~~~
gpm
You're misreading the benchmark, that's 6ms for 10,000 lock/unlocks per
thread, 320,000 lock/unlocks total. In other words 0.6 microseconds per thread
per lock.

~~~
rwem
That's still unreasonably high, isn't it? Even a Go sync.Mutex, not exactly a
hot-rod implementation, can be acquired and released in < 50ns on the garbage
hardware I have before me.

~~~
gpderetta
On Intel (and probably very similar on AMD) the cost of a completely
uncontended, cache-hit, simple spin lock acquisition is ~20 clock cycles, while
the release is almost free.

------
kcolford
This absolutely makes sense in userspace. The most important part of a
spinlock in an OS is that you can yield to the scheduler instead of taking up
CPU time on the core. But that defeats the purpose of using a spinlock when in
userspace because you still have to syscall.

~~~
temac
A spinlock in the kernel typically only spins. You use it for the cases when
you can't schedule... So it had better be held for a short time only.

But the concept of short time does not even exist deterministically in
userspace, usually, because it can always be preempted. So don't use pure
spinlocks in userspace, unless you _really_ _really_ know what you are doing
(and that includes knowing how your kernel works in great detail, in the
context of how you use it).

------
extropy
How about a new opcode that waits until a memory address reads a given value?
That would allow implementing a power-efficient spinlock.

Oh there is one already. Meet PAUSE:
[https://www.felixcloutier.com/x86/pause](https://www.felixcloutier.com/x86/pause)

Edit: related post from 2018
[https://news.ycombinator.com/item?id=17336853](https://news.ycombinator.com/item?id=17336853)

~~~
dbaupp
The benchmarked spin-locks are using it, via [https://doc.rust-
lang.org/std/sync/atomic/fn.spin_loop_hint....](https://doc.rust-
lang.org/std/sync/atomic/fn.spin_loop_hint.html)

Implementation: [https://doc.rust-
lang.org/src/core/hint.rs.html#64-93](https://doc.rust-
lang.org/src/core/hint.rs.html#64-93)

------
gok
Note that all of the locks tested here are unfair, which is why they all show
very high waiting variance. Until recently many mutex implementations aimed
for fairness, which made them much slower than spinlocks in microbenchmarks
like this.

~~~
Diggsey
Actually, the parking_lot mutex is fair:
[https://docs.rs/parking_lot/0.10.0/parking_lot/type.Mutex.ht...](https://docs.rs/parking_lot/0.10.0/parking_lot/type.Mutex.html#fairness)

The high waiting variance is because the benchmark randomly decides which
locks to take, meaning that the amount of contention is variable.

~~~
sanxiyn
Note that fair locking is still expensive. parking_lot achieves both speed and
fairness by starting with unfair locking and falling back to fair locking.

------
jnordwick
This comes around every so often, and it isn't very interesting in that the
best mutexes basically spin once or a couple of times and then fall back to a
lock. It isn't a true pure-spinlock vs. pure-lock (mutex/futex) fight.

I think the Linux futex can be implemented through the VDSO (can somebody
correct me on this), so that would eliminate the worst of the syscall costs.

His benchmark is weird, but maybe I'm reading it wrong:

* Instead of reducing the thread count to reduce contention, he appears to increase the number of locks available. This is still a bad scenario for spinlocks, since they will still have bad interactions with the scheduler (they will use a large amount of CPU time, get evicted from the run queue, and need to be rescheduled).

* Also, I didn't see him pin any of the threads, so all those threads will start sharing some CPUs, since the OS does need to do some work on them too.

* And Rust can be a little hard to read, but it seems he packed his locks on the same cache line? I don't see any padding in his AmdSpinlock struct. That would be a huge blow for the spinlock because of false sharing: he's still getting all the cache coherence traffic because of it. (A padded wrapper is sketched below.)
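
A hypothetical sketch of such padding, with 64 bytes assumed as the cache line
size (the benchmark's CachePadded wrapper, pointed out in a reply below, serves
the same purpose):

    use std::sync::atomic::AtomicBool;

    #[repr(align(64))] // force each lock onto its own cache line
    struct PaddedSpinFlag {
        locked: AtomicBool,
    }

    // An array of these keeps neighbouring locks from sharing a line, so threads
    // hammering one lock do not invalidate the lines holding the others.
    fn make_locks(n: usize) -> Vec<PaddedSpinFlag> {
        (0..n)
            .map(|_| PaddedSpinFlag { locked: AtomicBool::new(false) })
            .collect()
    }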

The worst cases for the spinlock are not understanding scheduling costs and
the cache thrashing that can occur.

What they call the AMD spinlock (basically just a regular sane spinlock that
tries to prevent cache thrashing) has its best performance with a low number of
threads, assigned to different cores under the same L3 segment.

(Does anybody know if AMD's new microarchitecture went away from the
MOESI/directory based cache towards Intel's MESIF/snoop model?)

The MOESI model might have performed better in this regard under a worst-case
scenario since it doesn't need to write the cache line back and can just
forward around the dirty line as it keeps track of who owns it.

And if you run under an MESIF-based cache and you can keep your traffic local
to your L3 segment, you are backstopped there and never need to go anywhere
else.

A spinlock is a performance optimization and should be treated as one. You
need to have intimate knowledge of the architecture you are running under and
the resources being used.

(edit: answered my own question; apparently the vDSO is still pretty limited
in what it exports, so no, it isn't happening at this time from what I can
tell.)

~~~
atq2119
The locks should be on separate cache lines; that's what the CachePadded::new
is for.

futex cannot be implemented in the VDSO since it needs to call into the
scheduler.

Another way to think about this: VDSO is used for syscalls that are (mostly)
read-only and can avoid the kernel mode switch on a common fast path. The
futex syscall is already the raw interface that is designed on the assumption
that the caller only uses it in the slow path of whatever higher-level
synchronization primitive they're implementing, so trying to use VDSO tricks
to implement futex would be redundant.

~~~
jnordwick
> The locks should be on separate cachelines, that's what the CachePadded::new
> is for

I see that now. I was looking for it in the AmdSpinlock struct, but that kind of
makes sense.

> The futex syscall is already the raw interface that is designed on the
> assumption that the caller only uses it in the slow path of whatever higher-
> level synchronization primitive they're implementing, so trying to use VDSO
> tricks to implement futex would be redundant.

Ah. Thanks. I didn't know how far you could get with MWAIT, but I guess you
still need to deschedule. I also didn't realize futex was a direct syscall and
there was no user level api going on around it.

Is he running 32 threads even in the low contention case? And not pinning?
There's something about his numbers that just seems a little too high for what
I would expect. I've seen this around a lot, and the reason the mutex usually
wins is that it basically does a spin of 1 or more and then goes into a futex
wait (the pthread code appears to spin 100 times before falling back to a futex).

At work I use a spin lock on a shared memory region because it does test out
to be lower latency than std::mutex and we're not under much contention. I've
thought about replacing it with a light futex-based library, but it doesn't
seem to be quicker.

He still seems to be getting some contention, and I'm trying to figure out how.

------
hackworks
Not an expert here.

In a spin lock, the lock state is checked in a tight loop by all waiters. This
will be using some sort of memory fence. FWIK, memory fences or barriers flush
the CPU cache and would initiate reading the variable (spin lock state) for
evaluation. I would expect spin-locking overheads to increase with the number of
cores.

On NUMA, I think flushing is more expensive. Hence, spin locks have an
additional overhead of having to load and evaluate on every spin, as opposed
to being woken up as with mutexes (like a callback).

~~~
elteto
> checked in a tight loop by all waiters

This actually does not have to be this way. You could have a linked list of
spinlocks, one for each waiter. Each waiter spins on its own, unique spinlock.
When the previous waiter is done it unlocks the next spinlock, and so on. The
implementation gets a bit complicated in non-GC languages, since there are
races between insertion/removal on the linked list. If the number of threads
is fixed and long-lived then it becomes easier, since instead of a linked list
you can have an array of spinlocks.
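
A toy illustration of the per-waiter idea for that fixed, long-lived case: a
ring of flags where each thread spins only on its own slot and hands the token
to the next one. It forces strict round-robin order, so it is not a general
lock (the MCS lock mentioned below is the real generalization), but it shows
why spinning on a private flag avoids everyone hammering one shared location:

    use std::sync::atomic::{AtomicBool, Ordering};

    const N: usize = 4; // assumed fixed number of long-lived threads

    struct TokenRing {
        flags: [AtomicBool; N], // thread i spins only on flags[i]
                                // (a real version would also pad each flag)
    }

    impl TokenRing {
        fn new() -> Self {
            // Thread 0 starts with the token.
            TokenRing { flags: std::array::from_fn(|i| AtomicBool::new(i == 0)) }
        }

        fn wait_turn(&self, me: usize) {
            while !self.flags[me].load(Ordering::Acquire) {
                std::hint::spin_loop(); // only our own flag's cache line is read
            }
        }

        fn pass_turn(&self, me: usize) {
            self.flags[me].store(false, Ordering::Relaxed); // take back our flag
            self.flags[(me + 1) % N].store(true, Ordering::Release); // hand off
        }
    }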

Note: in some architectures (x86?) you could possibly do away with atomics,
since (I believe) int updates are atomic by default. Not really sure though.

~~~
xtacy
Yep, that's the underlying principle behind the Mellor-Crummey and Scott (MCS) spinlock:
[https://lwn.net/Articles/590243/](https://lwn.net/Articles/590243/).

~~~
elteto
Yes! The MCS spinlock, the name eluded me. The paper is actually a pretty good
read (it is linked in the lwn article).

------
lowbloodsugar
TFA makes the point that modern "mutex" implementations actually use spinlocks
first and only fall back to heavy, kernel Mutexes if there is contention. So
the title is click-baity. Mutexes _are_ slower than spinlocks. The "faster
than spinlocks" mutexes in this article are actually spinlocks that fall back
to mutexes.

Then the benchmark uses spinlocks in situations that spinlocks aren't great
for. And, surprise, spinlocks are slower than spinlocks-with-mutexes.

Spinlocks are great in situations such as:

    
    
      * There are far more resources than threads,
      * The probability of actually having to spin is small,
        ideally if the time spent in the lock is a few instructions
      * When you can't use optimistic concurrency*
    

* because perhaps the number of memory locations to track is too complicated for my poor brain and I can't be arsed to remember how TLA+ works

There are plenty of linked-list implementations, for example, that use
optimistic concurrency. At that point you've got yourself a thread-safe
message queue and that might be better than mutexes, too.
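
For illustration, the optimistic push of a Treiber-style lock-free stack, the
classic linked-list example of that approach (a sketch only; pop is omitted,
because safe memory reclamation is where these structures get genuinely hard):

    use std::ptr;
    use std::sync::atomic::{AtomicPtr, Ordering};

    struct Node<T> {
        value: T,
        next: *mut Node<T>,
    }

    struct Stack<T> {
        head: AtomicPtr<Node<T>>,
    }

    impl<T> Stack<T> {
        fn new() -> Self {
            Stack { head: AtomicPtr::new(ptr::null_mut()) }
        }

        fn push(&self, value: T) {
            let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
            loop {
                let head = self.head.load(Ordering::Relaxed);
                unsafe { (*node).next = head; }
                // Optimistically assume nobody raced us; retry the CAS if they did.
                if self
                    .head
                    .compare_exchange(head, node, Ordering::Release, Ordering::Relaxed)
                    .is_ok()
                {
                    break;
                }
            }
        }
    }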

------
jakswa
Coming as a ruby developer who dabbles in async rust in my free time, these
posts/threads/links have been my best read of 2020 so far. My CS curriculum
barely covered real-world lock usage, much less the ties to linux/windows
schedulers + modern CPU caching. Big thanks all around to these internet
commentators.

------
Ericson2314
As the person who added a bunch of functionality to spin-rs to make it roughly
match the std API: yes, you should not use spinlocks in scheduled code.

That said, I see why Rust makes things so annoying. I want lots of code to
work in no-std so we have a robust embedded and kernel ecosystem. It would be
really nice to abstract over the locking implementation with type parameters,
but that requires "higher-kinded types", which Rust doesn't have. The
alternative is relying on carefully coordinated Cargo features, which is much
more flaky and hard to audit (e.g. with the type system). Given that, I am not
sure people over-use spin locks.

------
lilyball
I'm really curious how macOS's os_unfair_lock compares here.

~~~
LeoNatan25
Is that implementation open source? Don’t remember which dyld contains it.

~~~
gok
[https://github.com/apple/darwin-
libplatform/blob/master/src/...](https://github.com/apple/darwin-
libplatform/blob/master/src/os/lock.c)

------
AnanasAttack
Thread scheduling (or waking up cores) is slow. Because of this, mutexes will
look better on dumb benchmarks, as the contending threads keep going to sleep,
while the single successful owner has practically uncontended access.

~~~
AstralStorm
There are various degrees of slow, in addition to the kernel being smarter
about multiple cores and SMT siblings than your application.

The kernel can run your code on a cooler core, giving it a higher clock, for
example, ultimately making it run faster.

Of course this won't show in a benchmark where all the threads do mostly
calculation rather than contention, but that's not the typical case. That
mostly shows up in compute workloads such as multithreaded video, where latency does not
matter one bit.

Typically you have more of a producer/consumer pattern where the consumer sleeps,
and it's beneficial to run it on a cold CPU, assuming the kernel woke it up
beforehand.

Source: hit some latency issues with an ancient kernel on a nastily hacked
big.LITTLE ARM machine. It liked to overheat cores and put heavy
tasks on the overheated ones for alleged power saving. (Whereas running a task
quicker saves power.)

------
bubbleRefuge
Couldn't you eliminate the bad spinlock behavior by coding them to go into
an efficient wait if too much spinning is going on?

~~~
rwem
That’s what virtually all battle-hardened lock libraries do: spin for a bit
(but not too tightly, using a pause in the loop) then fall back to waiter
lists and futex.

------
shin_lao
Most mutex implementations spin before truly acquiring the lock.

Also, there are better spinlock implementations, such as speculative
spinlocks and queued locks.

------
CoolGuySteve
.

~~~
atq2119
This is the time for the whole benchmark run, not for an individual lock/unlock.
The article is quite clear on that.

------
paulintrognon
Fun explanation of what a Mutex is:
[https://stackoverflow.com/questions/34524/what-is-a-
mutex](https://stackoverflow.com/questions/34524/what-is-a-mutex)

