
Measuring mutexes, spinlocks and how bad the Linux scheduler is - bazzargh
https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/
======
PaulDavisThe1st
(reposting my comment from the original)

There’s some potentially misleading information here. Background: I’ve spent
the last 20+ years writing low-latency realtime audio applications,
technically cross-platform but focused on Linux.

If you care about low latency on any general purpose OS, you need to use a
realtime scheduling policy. The default scheduling on these OS’s is intended
to maximise some combination of bandwidth and/or fairness. Low latency
requires ditching both of those in favor of limiting the maximum scheduling
delay of a thread that is otherwise ready to run.

Measuring how long synchronization primitives take without SCHED_FIFO is
illustrative, but only of why, if you care about scheduling latency, you need
SCHED_FIFO. There are several alternative schedulers for Linux – none of them
remove the need for SCHED_FIFO if latency is important.

It is absolutely not the case that using SCHED_FIFO automatically starves non-
SCHED_FIFO threads. Scheduling policy is set per-thread, and SCHED_FIFO will
only cause issues if the threads that use it really do “burn the CPU” (e.g. by
using spinlocks). If you combine SCHED_FIFO with spinlocks you need to be
absolutely certain that the locks have low contention and/or are held for
extremely short periods (preferably just a few instructions). If you use
mutexes (which ultimately devolve to futexes at the kernel level), the kernel
will take care of you a little better, unless your SCHED_FIFO thread doesn’t
block – if it doesn’t do that, that’s entirely on you. Blocking means making
some sort of system call that will cause the scheduler to put the thread to
sleep – could be a wait on a futex, waiting for data, or an explicit sleep.
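As a concrete sketch of what "use a realtime scheduling policy" looks like in code (my own illustrative helper, not Paul's code; the function name and priority value are assumptions), a thread can request SCHED_FIFO like this; without CAP_SYS_NICE or an RLIMIT_RTPRIO allowance the call fails with EPERM:

```cpp
#include <sched.h>
#include <cerrno>
#include <cassert>

// Illustrative helper (names are mine): request SCHED_FIFO for the calling
// thread. Returns 0 on success, -1 with errno set otherwise; without
// CAP_SYS_NICE or an RLIMIT_RTPRIO allowance this fails with EPERM.
inline int request_sched_fifo(int priority) {
    sched_param param{};
    param.sched_priority = priority;  // 1..99 on Linux; higher preempts lower
    return sched_setscheduler(0, SCHED_FIFO, &param);
}
```

Per the comment above, only the threads that genuinely need bounded latency should run under this policy, and they must still block regularly.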

In particular, this: “This was known for a while simply because audio can
stutter on Linux when all cores are busy (which doesn’t happen on Windows)” is
NOT true. Linux actually has better audio performance than Windows or macOS,
but only if the app developer understands a few basic principles. One of them
is using SCHED_FIFO appropriately.

Pro-audio/music creation scheduling is MUCH more demanding than video, and a
bit more demanding than games. Linux handles this stuff just fine – it just
gives you enough rope to shoot yourself in the foot if you don’t fully
understand what you’re doing.

~~~
jcelerier
> Blocking means making some sort of system call that will cause the scheduler
> to put the thread to sleep – could be a wait on a futex, waiting for data,
> or an explicit sleep.

Speaking of which, do you know of a tool that would let you ensure that e.g.
your RT threads won't do bad things, short of putting a breakpoint on every
syscall-like thing? I wonder whether at that point the audio community
shouldn't look into developing a clang-based static analyzer to enforce that.

~~~
PaulDavisThe1st
There have been several such tools over the years - I regret that off the top
of my head I don't recall the names of any of them.

Clang can be used for this but AFAIK you need to annotate functions/methods to
indicate whether or not they are intended to be RT-safe. This gets complex if
you have code that can be used in an RT context and in a non-RT context. The
annotation part would be relatively simple in, say, a small audio player. In a
DAW, it would be a substantial and complex task.

------
devit
The claim that "the Linux scheduler is bad" seems wrong.

First of all, a program that calls sched_yield() for synchronization is simply
broken, since doing so can cause 100% CPU usage (if the other thread is
stopped at an unfortunate place), so the only correct method among the ones
benchmarked is std::mutex.

When using std::mutex (the only correct approach), Linux and Windows both seem
to have similar average performance, but Windows has 10x latency, so Linux is
better.

~~~
newnewpdro
sched_yield() isn't being used for synchronization, it's to yield the calling
thread before its timeslice is exhausted because it's potentially preventing
the lock-holder thread from getting scheduled to run and releasing it.

Spinlocks assume the critical section is fast and non-blocking, so when the
spinning lasts longer than expected, throwing in a yield can help by handing
some CPU time to something else.

The lock-holder might be getting scheduled on the same CPU and waiting for
this thread to yield, or there might just be excessive threads runnable right
now and burning cycles pointlessly spinning isn't helping get the lock-holder
to run and release the lock any sooner.

In a situation where you have exactly as many threads as there are CPUs, and
they've been assigned their own respective CPUs, and there's nothing else
running on the system, I wouldn't expect the yield to help. But that ideal
isn't how things generally look in practice.
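A minimal sketch of the "spin, then yield" pattern being described (hypothetical code, not from the article): after a bounded number of failed attempts, the lock hands the CPU away so the holder, which may be preempted on this very core, gets a chance to run and release it.

```cpp
#include <atomic>
#include <thread>
#include <cassert>

// Illustrative "spin, then yield" lock. The spin threshold of 1000 is an
// arbitrary assumption, not a tuned value.
class yielding_spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        int spins = 0;
        while (locked.exchange(true, std::memory_order_acquire)) {
            if (++spins >= 1000) {
                std::this_thread::yield();  // hand CPU time to someone else
                spins = 0;
            }
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};
```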

~~~
devit
You must not use sched_yield() and also must not spin indefinitely.

Instead, one should optionally spin for up to a constant number of iterations
and then call sys_futex.
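A rough, Linux-only sketch of that shape (illustrative only; real implementations such as glibc's pthread mutex are considerably more careful), using the raw futex syscall and the classic three-state encoding from Drepper's "Futexes Are Tricky":

```cpp
#include <atomic>
#include <thread>
#include <cassert>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Illustrative spin-then-sleep lock. State: 0 = unlocked, 1 = locked,
// 2 = locked with possible waiters. Spin count of 100 is an assumption.
class hybrid_lock {
    std::atomic<int> state{0};

    static long futex(std::atomic<int>* uaddr, int op, int val) {
        return syscall(SYS_futex, uaddr, op, val, nullptr, nullptr, 0);
    }
public:
    void lock() {
        // Phase 1: bounded spin, cheap when the holder releases quickly.
        for (int i = 0; i < 100; ++i) {
            int expected = 0;
            if (state.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire))
                return;
        }
        // Phase 2: mark "waiters may exist" and sleep in the kernel.
        while (state.exchange(2, std::memory_order_acquire) != 0)
            futex(&state, FUTEX_WAIT_PRIVATE, 2);
    }
    void unlock() {
        // Only enter the kernel if someone may be sleeping.
        if (state.exchange(0, std::memory_order_release) == 2)
            futex(&state, FUTEX_WAKE_PRIVATE, 1);
    }
};
```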

~~~
newnewpdro
That's how you implement a userspace mutex in linux, not a spinlock.

~~~
gpderetta
Sure, but sched_yield is strictly worse than using a futex (except that with a
futex you'll need to check for waiters on unlock, which makes it slightly more
expensive).

~~~
newnewpdro
Ok, so you're arguing that userspace should use mutexes and not spinlocks.

Which I agree with, most of the time userspace spinlocks don't fit well.

But TFA is clearly comparing the two, and observing a variety of spinlock
implementations with sched_yield() demonstrating an interesting positive
effect on the spinlocks as tested.

~~~
gpderetta
Actually no, I wouldn't make that claim; while a futex-based adaptive mutex is
a very good default, spinlocks can still be appropriate for some applications.

What I'm saying is that if your use case is such that you expect enough
contention to consider using TATAS (which is actually a pessimization in the
uncontended case) and to look into optimizing sched_yield, a spinlock is
probably not appropriate in the first place.

Edit: hence a spinlock shouldn't bother with yield and should just do a tight
xchg spin (I haven't measured it in a while, but I've heard rumors that PAUSE
can severely harm acquire latency on very recent CPUs, as it will quickly put
them in a deeper power-saving mode than in the past)
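For reference, a minimal sketch of what TATAS ("test and test-and-set") looks like, since it is named above (illustrative code, not from the thread): the lock spins on a plain read and only attempts the expensive atomic exchange once the lock looks free.

```cpp
#include <atomic>
#include <thread>
#include <cassert>

// Minimal TATAS sketch. The read-only inner loop keeps the cache line in
// shared state instead of bouncing it with failed exchanges; the initial
// load before the exchange is the "pessimization in the uncontended case"
// mentioned above.
class tatas_lock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            // "Test": read-only wait until the lock looks free.
            while (locked.load(std::memory_order_relaxed))
                ;
            // "Test-and-set": now attempt the actual atomic acquisition.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};
```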

------
amluto
I’m not convinced the measurements are really meaningful in this article. The
author seems to be running a bunch of threads, measuring the time from when
one thread starts unlocking the lock to when another thread gets it and
calling this “idle”. I can’t find any mention of how many threads there are,
how many CPUs, or what else the system is doing.

So this is measuring a mess of a few things:

For mutexes (assuming all the implementations actually make syscalls for
unlocking), this has something to do with how long it takes for the scheduler
to activate the waiter and run it, plus how long the scheduler waits because
the system is doing something else or it wants to try to avoid bouncing a task
between CPUs.

For anything that does _not_ involve the kernel in unlocking, this is really
just a measurement of the pauses in execution of long-running threads that
continuously hog the CPU. If the thread is also calling sched_yield(), the
results are erratic, since sched_yield() more or less means “I don’t want to
run now, but I’m not actually waiting for anything visible to the scheduler
and, um, run me again soon maybe?” This is almost always the wrong solution.
Other than that, this test has essentially nothing to do with _which_ spinlock
is in use, unless the spinlock itself wastes time failing to figure out that
it’s ready.

From old memory, the Windows scheduler is (was?) mostly based on per-thread
priorities, highest priority wins, where the priorities are occasionally
automatically adjusted. So, if you set a high-ish priority, you get to hog the
CPU. In contrast, Linux tries quite hard not to let normal (non-RT) threads
hog the CPU, so a thread that effectively runs continuously measuring pauses
will detect significant pauses in which Linux decides that the rest of the
system deserves to run too.

~~~
RossBencina
I agree that the post needs to clarify the number of threads, what thread
priorities were assigned, and what measures were taken to associate the
threads with logical CPUs (affinities).

I think your characterization of sched_yield() is a bit off. The man page is
clear:

"sched_yield() causes the calling thread to relinquish the CPU. The thread is
moved to the end of the queue for its static priority and a new thread gets to
run."

and

"If the calling thread is the only thread in the highest priority list at that
time, it will continue to run after a call to sched_yield()."

In particular, it does not say "I’m not actually waiting for anything visible
to the scheduler and, um, run me again soon maybe?".

So assuming that the threads are running at some relatively high priority, I'm
struggling to see how delays on the order of 60ms could arise.

EDIT: reading further:

"sched_yield() is intended for use with real-time scheduling policies (i.e.,
SCHED_FIFO or SCHED_RR). Use of sched_yield() with nondeterministic scheduling
policies such as SCHED_OTHER is unspecified and very likely means your
application design is broken."

So maybe that's it.

[1] http://man7.org/linux/man-pages/man2/sched_yield.2.html

~~~
amluto
Indeed. Linux’s SCHED_OTHER doesn’t have static priorities.

There was a half-baked design for locking that kind of worked 20 years ago:
try to acquire the spinlock and, if it’s contended, yield. On a single CPU,
that probably means that the lock holder gets to run, and if they drop the
lock quickly, then the yielding thread will (on an otherwise mostly idle
system) get to run soon.

This was a dubious design 20 years ago, although, admittedly, OS lockout
primitives were worse back then. But now, with multiple CPUs on a system, the
whole scheme falls apart — the lock holder might be on a different CPU and
yield() has no effect on how quickly the lock is released.

So, in most cases, yield() really does mean “I don’t know what I’m doing, but
I don’t want to run right now.” Anything the kernel does in response is, at
best, a wild guess.

------
pixel_fcker
I think the answer, as always, is to measure on your own workloads. I’ve seen
real-world production code get significant speed ups by switching to spinlocks
from mutexes.

Very interesting post though. I’ve never considered the idle latency before.

------
mehrdadn
Windows has process and thread priorities, including so-called "real-time"
ones... but there seem to be no mentions of them in the article, as opposed to
Linux. Why?

------
ezoe
My take on this article is that it isn't worth implementing your own spinlock.
The OS can take advantage of its implementation details while you can't
portably do so, and the OS implementation may change in the future or vary
between systems.

Even if you use _mm_pause() to let the CPU know it's spinning, in the eyes of
the OS it still looks like a busy thread; it can't be distinguished from a
non-spinlock, computation-heavy thread.

------
newnewpdro
I would like to see the measurements when the threads are explicitly assigned
to their respective CPUs.

It surprised me to not see even a mention of CPU affinity.

~~~
dpc_pw
For any sort of benchmarking like that you have to take care of affinities and
cpufreq governor at minimum.

~~~
newnewpdro
yep, I was assuming they at least didn't have any frequency scaling or
powersave governor junk going on but you're right - it wasn't even mentioned
so...

------
pmoriarty
Can someone knowledgable explain what a mutex is, what a spinlock is, and the
difference between mutexes and spinlocks?

~~~
ptr
A spinlock is a kind of mutex (“mutual exclusion”) wherein a condition is
checked as fast as possible in a loop. This is useful when the expected wait
times are short, as you don't have to pay the context-switch penalty. It is
typically used when you have real parallelism across multiple CPUs/cores.

“Mutex” can also refer to the particular group of mutual-exclusion methods
wherein the current process/thread is suspended until the condition is true.

Hope this helps.
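To make the distinction concrete, here is a minimal illustrative spinlock (a sketch, not production code): lock() busy-waits, consuming CPU, until the previous holder clears the flag. A sleeping mutex such as std::mutex has the same interface but parks the waiting thread in the kernel instead of spinning.

```cpp
#include <atomic>
#include <thread>
#include <cassert>

// Minimal spinlock for illustration only.
class spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire))
            ;  // spin: keep retrying instead of sleeping
    }
    void unlock() { flag.clear(std::memory_order_release); }
};
```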

~~~
jonny383
Additional note: mutual exclusion is a concept that pops up a lot when
dealing with concurrent processing (think threads, or multi-process
architectures).

Typically a mutex or spinlock is used to reduce or prevent race conditions.

~~~
edoceo
One can see this at a higher level in Go: when goroutines are
reading/updating a map and you run with -race, it shows where you have to use
a Mutex to lock that map for access, read or read/write. Higher level, but
pretty simple to see.

~~~
zozbot234
One can see this at an even higher level in Rust, where you don't even need a
"race checker" to avoid data races in safe code - the compiler will force the
use of patterns like Mutex<> or Rwlock<> (or others) for stuff that might be
accessed concurrently in ways that would otherwise break mutual exclusion.

~~~
jonny383
So what would you call the compiler enforcing non-race-condition patterns, if
it's not a "race checker"?

It's still there. It's just a compile-time "race checker", not a runtime one.

~~~
tsimionescu
Not really, in Rust's case. It's the general borrow checker which handles
this.

Since in Rust it is illegal for a value to be referenced for writing in one
place and referenced for anything else in any other place, regardless of
concurrency/parallelism, Rust code is data-race free by design.

I still think that GP's comment was a non-sequitur.

~~~
zozbot234
Rust has "interior" mutability, meaning that shared references are not true
'read only' references; they _can_ be used for writing if the type supports
such an operation. What makes Rust data-race free is the combination of
'interior' mutability and the Send and Sync traits, which control moving or
referencing objects across threads.

------
p0nce
A typical mutex has essentially two very different costs depending on whether
the lock is already taken by another thread:

\- Uncontended case + mutex: no syscall; the fast path is basically the cost
of the memory barrier.

\- Uncontended case + spinlock: no syscall; the fast path is basically the
cost of the memory barrier. Essentially the same cost as a mutex, sometimes a
tiny bit cheaper, but _if your lock isn't contended in the first place it will
be very hard to measure, since it isn't a bottleneck_.

\- Contended case + mutex: the thread waits in the mutex's wait list without
taking CPU. Hyper-threading may also immediately hand execution to another
hardware thread, perhaps one lucky enough to make progress. The OS mutex may
do a bit of capped spinning before waiting in the scheduler.

\- Contended case + spinlock: anything goes. CPU gets consumed just for
spinning. The wrong thread might get priority. If you are lucky you put a
PAUSE instruction in the loop to mitigate the complete disaster that spinlocks
are; perhaps it even does something. Your consolation: no syscall or thread
pausing... because they are spinning.

tl;dr: spinlocks are worse in the contended case and essentially the same
performance in the uncontended case compared to your typical OS mutex.
SPINLOCKS WON'T YIELD THE EXPECTED BENEFITS, USE REGULAR OS MUTEXES.

~~~
gpderetta
A problem with pthread mutexes and std::mutex is that they are often very
large, which is problematic for very fine-grained locking, while a spinlock
only needs one bit and can be embedded in other objects (say, a pointer) as
long as you can spare one bit.

Of course you can roll your own proper mutex if you want to get your hands
dirty with futexes. IIRC you only need two bits (one waiter bit and one locked
bit).

~~~
shaklee3
Not sure what you mean by a single bit for a spinlock. It's a CAS operation,
which doesn't operate on a single bit.

Edit: I think I see what you're getting at, but the point from the article
about cache invalidation for the naive implementation still is valid.

~~~
Matthias247
You can do it with a single bit: read the whole unit (e.g. 64 bits), flip the
bit you need, and then do a compare-exchange which tries to replace the whole
retrieved value with the modified one.

That obviously requires you to read the current state up front, whereas a
plain compare-exchange from 0 to 1 on 32/64 bits avoids that.

~~~
p0nce
That doesn't work since this isn't atomic anymore.

~~~
gpderetta
Of course it is. The mutating operation is the CAS which requires the existing
value anyway.

~~~
p0nce
Mmmm... I had to write this down.

    
    
        CAS_bit_n(adr, n):
          while true:
            old <- read_unit(adr)
            new <- compute-new-value-with-changed-bit-n(old, n)
            if CAS(adr, old, new):
              break
    

Indeed it seems it is correct if you retry.

~~~
gpderetta
In fact normally a failed CAS (known as IBM-style CAS) will update the
expected value anyway, so there is no need to reload old inside the loop; in
C++11:

    
    
      auto old = addr.load(std::memory_order_relaxed);
      while (not addr.compare_exchange_weak(old, old | 1))
         ;
    

In fact you can set old to some random value and it will still converge to
the correct value, but that is of course suboptimal.

~~~
p0nce
Interesting! Now I'm curious whether the single-bit spinlock has seen real use
in practice. It seems appropriate if writes aren't too numerous (well, leaving
aside speed concerns). Or for a case where std::mutex was too large?

I've always wondered how the wait list was implemented.

~~~
gpderetta
Think about a concurrent hashtable using chaining for collision resolution;
distinct bucket chains can be updated concurrently, so you need a per bucket
lock but you do not want a full std::mutex as it would increase the size of
the bucket table by a large integer multiple. Instead, borrowing one bit from
the chain pointer (you can easily arrange the least significant bit of the
pointer to be always zero) is pretty much free.

Note: I wrote the above off the top of my head; I haven't actually had the
need to implement a concurrent hashtable yet.
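A sketch of the borrowed-bit idea described above (hypothetical code, under the assumption that chain nodes are at least 2-byte aligned so bit 0 of a real pointer is always zero):

```cpp
#include <atomic>
#include <cstdint>
#include <cassert>

// Per-bucket head pointer whose least-significant bit doubles as a lock.
// All names here are illustrative.
struct bucket {
    std::atomic<std::uintptr_t> head{0};  // tagged pointer to first node

    void lock() {
        for (;;) {
            // Expected value: current head with the lock bit clear.
            std::uintptr_t unlocked =
                head.load(std::memory_order_relaxed) & ~std::uintptr_t(1);
            if (head.compare_exchange_weak(unlocked, unlocked | 1,
                                           std::memory_order_acquire))
                return;  // we set the bit; the bucket is now locked
        }
    }
    void unlock() {
        // Clear the lock bit, leaving the pointer bits untouched.
        head.fetch_and(~std::uintptr_t(1), std::memory_order_release);
    }
};
```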

------
RossBencina
I'm curious what is going on with the "Idle Time" measurements. Does anyone
have any theories?

My guess is that if the thread gets descheduled inside the lock() spin loop
(either by an explicit call to yield() or otherwise) then the "Four longest
idle times" reflect the fact that all of the worker threads have remained
descheduled for an extended period. But assuming that the threads have
elevated priority (why would you use spinlocks on non-high-priority threads?)
this seems implausible on an 8 core machine.

I would be curious to know what the results are if isolcpus was used and the
workers were explicitly assigned to those CPUs.

~~~
atq2119
The whole "idle time" vs. "longest wait" distinction is suspect and it's
surprising they don't go to more effort to justify it.

What "longest wait" measures is (b_i - a_i) in:

    
    
        Time a_0
        lock
        Time b_0
        unlock
        Time a_1
        lock
        Time b_1
        unlock
        ...
    

What "idle time" measures is (b_i - a_i) in:

    
    
        lock
        Time a_0
        unlock
        lock
        Time b_0
        Time a_1
        unlock
        lock
        Time b_1
        ...
    

At least for the spinlock variants, "unlock" should be an extremely cheap
operation (just a store for most variants), which means that those two
sequences ought to be essentially identical in terms of timing: it ought to be
possible to move the "Time a_i"s across the "unlock" operations with a timing
impact less than a microsecond, i.e. the two sequences ought to have
essentially the same timings.

The fact that the timings are supposedly different on a millisecond scale is
therefore quite suspicious.

Perhaps the chosen clock ends up calling a syscall, which can end up in the
scheduler on syscall return occasionally and throw off the measurements that
way? Or just the fact that there's one extra syscall in the critical section
sometimes causes other threads to make different yielding decisions which
throws off the scheduling?

More generally, it would have been great to see a deeper investigation into
scheduling behavior, which is possible on Linux using the tracing
infrastructure and visualizing with different tools like Hotspot.

Getting a worst-case delay of ~1.5ms with a naive ticket_spinlock that can
yield is really not surprising, to be honest. All it takes is for patterns
like:

1\. Three threads pull their tickets in order.

2\. The middle thread yields and something else is chosen to run by the
scheduler, or perhaps the scheduler decides not to run anything for a short
while on the off-chance that something else becomes runnable, which is a fair
choice because, after all, you yielded.

3\. The last thread is then blocked by the middle thread.

It just boils down to the fact that "yield" is a bad primitive. You should
fall back to futex-based synchronization instead so that the kernel has a
better idea of what you're trying to do. The article comes to the same
conclusion anyway, in the form of "just use std::mutex".

~~~
RossBencina
It's not entirely clear from the article, but my understanding was that
"longest wait" maintains per-thread timestamps, i.e. the time to lock() within
a single thread, whereas "idle time" uses a single global atomic timestamp
that measures the time from when the last thread unlocks until the next thread
locks -- even if the unlock() and subsequent lock() are on different threads.
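If that reading is right, the measurement would look roughly like this (a hypothetical reconstruction, not the article's actual code; all names are mine):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <cassert>

// Hypothetical reconstruction of the "idle time" metric: one global
// timestamp shared by all threads, written just before unlock and read just
// after the next lock, whichever thread performs it.
std::mutex lock_;
std::atomic<std::int64_t> last_unlock_ns{0};
std::atomic<std::int64_t> max_idle_ns{0};

std::int64_t now_ns() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
        steady_clock::now().time_since_epoch()).count();
}

void critical_section() {
    lock_.lock();
    std::int64_t prev = last_unlock_ns.load(std::memory_order_relaxed);
    if (prev != 0) {
        std::int64_t idle = now_ns() - prev;  // time the lock sat free/contended
        std::int64_t cur = max_idle_ns.load(std::memory_order_relaxed);
        if (idle > cur)
            max_idle_ns.store(idle, std::memory_order_relaxed);
    }
    // ... work under the lock ...
    last_unlock_ns.store(now_ns(), std::memory_order_relaxed);
    lock_.unlock();
}
```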

------
tonetheman
This is a GREAT post. Super technical. Lots of stuff to learn.

------
pedrocr
Why isn't the conclusion to just use the mutex at least on Linux? It seems to
have both consistently low latency and great throughput. What else can you get
from the spinlock?

~~~
jiveturkey
> What these benchmarks have in common is that they measure code in which
> there is nothing else to do except fight over the lock. In that environment
> the only thing that makes sense is to use a spinlock. Because if you go to
> sleep, all that will happen is that some other thread will wake up that will
> fight over the same lock.

That said, his conclusion is in fact just as you stated, if you read in
between the lines. (This paper is really about the scheduler, so it's not
exactly that conclusion.)

~~~
trasz
And that’s precisely what most operating systems other than Linux do: use
adaptive mutexes as the default mechanism, not spinlocks.

~~~
gpderetta
Pthread_mutex on linux is also an adaptive mutex, although the spinning count
might not be optimal.

------
cryptica
IMO mutexes, semaphores, spinlocks should be used only in special cases (e.g.
games or other high perf logic where you need to maximize performance by a
fixed percentage) because they are anti-patterns for scalability. They only
help to scale up to a point after which all the threads either spend all their
time waiting for each other to release the locks or they can never acquire the
locks because the demand for that shared memory is too high. There are
exceptions for example if you have unlimited reader threads/processes compared
to a fixed number of writer threads/processes but even in those cases, I feel
that shared memory locks add invisible bottlenecks to scalability as the code
evolves over time.

For most of my use cases, I prefer to keep processes/threads fully parallel
and communicate via IPC channels/sockets. This makes it easier to write code
that can scale without limit.

For most use cases, I don't see much point in writing scalable code that you
know only scales up to a certain point and then needs to be rewritten.

Avoid shared memory if you have an embarrassingly parallel problem.

~~~
tsimionescu
Approaches using IPC/sockets are significantly more complicated than shared-
memory parallelism in most programming languages. Sockets at least are also an
absurdly wasteful communication mechanism in low-concurrency situations,
compared to direct shared memory access.

If you have a simple REST web server, single-thread is simply not a realistic
option. The next simplest thing is to do shared memory parallelism, which
works perfectly for a few dozens to low hundreds of threads. That already
covers a lot of scale for many realistic problems. Moving from that to a
distributed system design is a significant jump in implementation complexity,
and will likely increase per-request latency with most straight-forward
implementations.

~~~
imtringued
In my experience it is very easy to get into situations where using only
mutexes is impossible. As soon as you need a lock per object and access two
objects, you have to make sure to only acquire one lock at a time. The end
result is that you are effectively using the stack as a message queue that
can only hold one element.

~~~
tsimionescu
The solution with multiple locks is to always take them in the same defined
order on any code path - that guarantees that you can't get into deadlock
situations.
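A common concrete version of that rule in C++ (illustrative sketch; the Account type is made up): either order the locks explicitly, e.g. by address, or let std::scoped_lock acquire both at once via std::lock's deadlock-avoidance algorithm.

```cpp
#include <mutex>
#include <cassert>

struct Account {
    std::mutex m;
    int balance = 0;
};

// Deadlock-free transfer: std::scoped_lock takes both mutexes using
// std::lock's avoidance algorithm, so two transfers running in opposite
// directions cannot end up waiting on each other forever.
void transfer(Account& from, Account& to, int amount) {
    std::scoped_lock guard(from.m, to.m);
    from.balance -= amount;
    to.balance   += amount;
}
```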

