
On mutex performance and WTF::Lock - jsnell
https://blog.mozilla.org/nfroyd/2017/03/29/on-mutex-performance-part-1/
======
saagarjha
For those wondering, WTF stands for Web Template Framework:
[https://webkit.org/blog/6161/locking-in-webkit/](https://webkit.org/blog/6161/locking-in-webkit/)

~~~
cperciva
My favourite mutex name is the "sex lock" -- also known as the
shared/exclusive or sx lock.

~~~
cpeterso
Windows NT has a kernel mutex type called a "mutant" that was added for
compatibility with OS/2's mutex (back when the NT team was still pretending to
be building a new OS/2 kernel). Dave Cutler thought the OS/2 mutex semantics
were "brain-damaged" and called NT's version "mutant". If an OS/2 thread
exited while holding a mutex, the mutex would remain inaccessible forever.

[https://blogs.msdn.microsoft.com/larryosterman/2004/09/24/cl...](https://blogs.msdn.microsoft.com/larryosterman/2004/09/24/cleaning-up-shared-resources-when-a-process-is-abnormally-terminated/)

~~~
pjmlp
They would have kept building OS/2, if a Microsoft employee hadn't figured
out a viable way to run Windows 3.x in protected (386 enhanced) mode.

------
pizlonator
WTF locks now use a policy we call eventual fairness. You can barge but the
amount of time anyone loses due to the barging race is bounded.
[https://trac.webkit.org/changeset/203350/webkit](https://trac.webkit.org/changeset/203350/webkit)

~~~
obstinate
Hard fairness guarantees for locking are very expensive and only
situationally needed, so it's hard to justify paying a high cost to have them
always on. It seems like the best thing to do is to use the fastest locking
primitive you can find and implement fairness on top of it in the rare cases
where you need it.
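
For example, here's a minimal sketch of layering FIFO fairness over an
ordinary (possibly unfair) pthread mutex with a ticket counter; the condvar
broadcast keeps the sketch short at the cost of a thundering herd:

    #include <pthread.h>

    /* FIFO "ticket" layer over an ordinary (possibly unfair) mutex. */
    typedef struct {
        pthread_mutex_t m;
        pthread_cond_t  cv;
        unsigned long   next_ticket;  /* next ticket to hand out */
        unsigned long   now_serving;  /* ticket allowed to proceed */
    } fair_lock;

    #define FAIR_LOCK_INIT \
        { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

    void fair_acquire(fair_lock *l) {
        pthread_mutex_lock(&l->m);
        unsigned long me = l->next_ticket++;
        while (me != l->now_serving)
            pthread_cond_wait(&l->cv, &l->m);
        pthread_mutex_unlock(&l->m);
        /* Caller now holds the logical lock, in strict arrival order. */
    }

    void fair_release(fair_lock *l) {
        pthread_mutex_lock(&l->m);
        l->now_serving++;
        /* Broadcast wakes every waiter; a per-waiter condvar would
           avoid the thundering herd. */
        pthread_cond_broadcast(&l->cv);
        pthread_mutex_unlock(&l->m);
    }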

~~~
pizlonator
Eventual fairness costs nothing.

~~~
obstinate
Right, I'm referring to the idea of immediate fairness. I'm agreeing w/ your
approach.

------
ajross
So the takeaway here is apparently that POSIX mutexes on macOS/iOS are really
slow due to an oddball design decision (fairness), and the browser vendors are
racing to see who can reimplement them best. Windows and Linux do this just
fine AFAICT.

Seems like it would be more effective for the WebKit folks to just work with
their Darwin compatriots to fix the C library locks, no?

~~~
Jweb_Guru
Fairness isn't an oddball design decision, it's an important assumption that
many realtime systems rely on.

~~~
ajross
It's oddball in the sense that literally no one else does it. So I don't know
what systems you mean, since by definition they don't run on Linux (glibc or
bionic) or Windows.

I'm not saying there's no use case for a fair mutex anywhere (though I'll come
really close and say that if you want fairness a lock is the wrong
abstraction), I'm saying that stuffing a context switch into every contended
call to pthread_mutex_unlock() is a terrible design decision.

~~~
Jweb_Guru
If you don't need fairness, don't use a fair lock. The default assumption
should be fairness because it doesn't require you to think (just like the
default transaction isolation level should be serializable). As the blog post
points out, OS X provides unfair mutexes too.

Windows locks are fair by default too, BTW (IIRC anyway). People just use the
unfair version when they need performance.
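
On Darwin this choice is exposed as a per-mutex attribute. Here's a minimal
sketch of opting into the unfair firstfit policy the article discusses; treat
the header and constant names as approximate, since Apple has shuffled them
between pthread.h and pthread_spis.h across macOS versions:

    #include <pthread.h>
    #include <pthread_spis.h>  /* Darwin-specific policy attribute */

    /* Create a mutex using the unfair "firstfit" policy instead of the
       default "fairshare" policy. */
    static void init_firstfit_mutex(pthread_mutex_t *m) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpolicy_np(&attr, _PTHREAD_MUTEX_POLICY_FIRSTFIT);
        pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
    }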

~~~
striking
[http://www.pcdoctor-community.com/wp/posts/2007/10/fairness-...](http://www.pcdoctor-community.com/wp/posts/2007/10/fairness-in-win32-lock-objects/)

Windows XP tried to implement some amount of fairness in its locks but did not
guarantee it. Windows Vista gave up on fairness entirely.

You should just be using unfair locks. Thread starvation is rare in most OSes
at this point.

~~~
eridius
Thread starvation is still very much a thing on a lot of devices. macOS and
iOS even have a Quality Of Service implementation in the kernel specifically
to ensure that high-priority threads get to run when they'd otherwise be
starved, and this can lead to low-priority threads being suspended for many
seconds at a time under high workload. In fact, it's now really dangerous to
use a regular spinlock on iOS and macOS, because you can get into a
priority-inversion livelock if threads at different QoS levels are contending
for the same lock. The Obj-C runtime was seeing delays of tens of seconds in
some cases when using spinlocks, so it now uses a private "os_lock": a
spinlock that uses SPI to "donate" the QoS of the waiting thread to the one
that holds the lock.

~~~
LeoNatan25
As of iOS 10 and macOS 10.12, it is no longer private: `os_unfair_lock`.

[https://developer.apple.com/reference/os/os_unfair_lock](https://developer.apple.com/reference/os/os_unfair_lock)
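
Usage is about as small as lock APIs get; a quick sketch:

    #include <os/lock.h>

    static os_unfair_lock counter_lock = OS_UNFAIR_LOCK_INIT;
    static long counter;

    void increment(void) {
        os_unfair_lock_lock(&counter_lock);
        counter++;                          /* critical section */
        os_unfair_lock_unlock(&counter_lock);
    }

Unlike the old OSSpinLock, the lock word records the owning thread, which is
what lets the system resolve the priority inversion described upthread.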

~~~
eridius
Oh good! I'll have to do some research at some point to find out if this is
literally the same thing as the os_lock that the Obj-C runtime uses, but
either way, I'm glad it's available now.

~~~
LeoNatan25
It uses the same private API (or at least, what you described in your previous
post).

------
quinnftw
I'd be interested in seeing the rationale behind the fairshare policy. I
suppose the firstfit policy could theoretically result in starvation, with
one thread constantly preempting another under lock contention, but I don't
imagine that would happen very often in practice.

~~~
alain94040
Starvation is a serious problem. Imagine two cores both trying to get a lock.
The lock sits in a cache line owned by one of the cores. If both cores try to
take the lock very often, the core that's further away may very well _never_
get it.

The synthetic benchmark results don't show that because they behave nicely:
each thread waits politely in between each attempt to get the lock. Real code
doesn't always do that.

~~~
ajross
> Imagine two cores both trying to get a lock. The lock sits in a cache line
> owned by one of the cores. If both cores try to take the lock very often,
> the core that's further away may very well never get it.

If that were true, then your problem is L2-cache-bound and trying to fit your
solution into SMP is the Wrong Thing. In fact, the single-threaded behavior
you end up with is going to be faster (by definition!) than the "fair"
architecture you seem to want.

No one serious thinks this behavior of the default locking primitive is a good
thing. Maybe ( _maybe_ ) it's a good fit for some particular problem
somewhere, but I'd want to see a benchmark. It's definitely not a consensus
opinion among people who do serious thought about synchronization.

~~~
eridius
> _No one serious thinks this behavior of the default locking primitive is a
> good thing._

Clearly some people do or OS X wouldn't do it.

~~~
59nadir
Would it be the first badly implemented thing in OS X?

~~~
eridius
Mutexes are heavily used and get a lot of scrutiny, and work was done to speed
them up when 10.11 (or was it 10.10? Whichever one added QoS) came out. I'm
positive the people who maintain it are aware of the tradeoffs between fair
and unfair locks and made the default fair intentionally.

Or more generally, just because Linux and Windows both behave in a certain
manner doesn't make that de facto correct. It's all tradeoffs, and different
people value different things. Correct-by-default (i.e. fair locks) is
valuable, the only question is whether it's worth the performance hit
(although you can always opt in to unfair locks to get better performance).

~~~
btschaegg
> I'm positive the people who maintain it are aware of the tradeoffs between
> fair and unfair locks and made the default fair intentionally.

I think you're neglecting a couple of less-technical factors here. Yes,
fairness was certainly made the default for reasons at _some point_ (but
still, even then, one might argue that an opt-in solution might have been
better). On the other hand, there's the likely possibility that those reasons
don't really hold true anymore. Think of Scene Graphs vs. Entity Component
Systems in high-performance video game design: there, the rise of caching
made a whole architecture outdated.

At the same time, like removing the GIL in Python, such decisions are not to
be taken lightly because of the things you _will_ break. It's very likely that
there _are_ applications that would still have problems with starving threads,
and just switching from opt-out to opt-in would make them break for no
apparent reason in the strangest of circumstances. I know Apple likes to break
things more often than MS, but I'd guess that's not a risk they're willing to
take. Imagine you're updating your OS and a dozen apps that have worked for a
decade and don't get updates anymore start behaving strangely.

So, it's not unreasonable to settle for a less-than-optimal solution that
still keeps things working and only makes things slower in the worst-case
scenario. That doesn't mean it's not open to criticism, though.

~~~
qb45
> Yes, fairness was certainly made the default for reasons at some point (but
> still, even then, one might argue that an opt-in solution might have been
> better).

I think an unfair or opt-in-fair solution was deemed worse than fair-by-
default for the simple reason that the most straightforward way to implement
mutexes is

    
    
      lock:
        enter kernel mode
        maybe grab a spinlock if on SMP
        add yourself to the waiting list
        sleep until woken up

      do critical section

      unlock:
        enter kernel mode
        grab the spinlock
        wake the first waiting thread
    

That's functionality every mutex must have, and it happens to give fair
semantics for free. Making the semantics weaker for the purpose of
optimization (rather than just to make application developers' lives harder,
or for future flexibility that may never be needed) actually takes additional
work on top of that.

Since we are talking about a uniprocessor desktop OS developed in the
nineties, it's plausible they didn't care about mutex performance as much as
today and giving this extra guarantee afforded by their simple implementation
seemed reasonable at the time.

~~~
gpderetta
Note that every half-decent mutex implementation _will not_ enter the kernel
(on either lock or unlock) unless the mutex is contended.
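
Roughly the shape that takes, simplified from Ulrich Drepper's "Futexes Are
Tricky" (futex_wait/futex_wake are hypothetical wrappers around a kernel
primitive like Linux's futex(2)):

    #include <stdatomic.h>

    /* 0 = free, 1 = locked, 2 = locked and contended. */
    static atomic_int state;

    /* Hypothetical wrappers around the kernel futex facility. */
    void futex_wait(atomic_int *addr, int sleep_while_value);
    void futex_wake(atomic_int *addr, int nwake);

    void lock(void) {
        int c = 0;
        /* Fast path: an uncontended acquire is one CAS, no syscall.
           (This CAS is also where late arrivals barge past sleepers.) */
        if (atomic_compare_exchange_strong(&state, &c, 1))
            return;
        /* Slow path: advertise contention, then sleep in the kernel. */
        if (c != 2)
            c = atomic_exchange(&state, 2);
        while (c != 0) {
            futex_wait(&state, 2);           /* sleep while state == 2 */
            c = atomic_exchange(&state, 2);
        }
    }

    void unlock(void) {
        /* Fast path: if nobody ever waited, one atomic op, no syscall. */
        if (atomic_fetch_sub(&state, 1) != 1) {
            atomic_store(&state, 0);
            futex_wake(&state, 1);           /* wake a single waiter */
        }
    }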

~~~
qb45
Sure, but OP's benchmark clearly shows that OS X's default mutex is at most 5%
decent ;)

~~~
eridius
I'd love to see them perform the same benchmark on macOS 10.12. Pthread
mutexes were sped up in "the new OSes" according to a tweet made last year
(sorry no link), though I'm not sure if that was referring to 10.11 or 10.12.
Either way, the benchmark should perform better now, and I'm curious how much.
I'd run it myself, except those numbers wouldn't be comparable to OP's Mac
mini benchmark. Of course, OP's Linux benchmark numbers aren't comparable
either, because they come from a different (and probably much more powerful)
machine.

------
ricardobeat
> I don’t know how firstfit policy locks compare to something like WTF::Lock
> on my Mac mini

As a layperson, I would have liked to see at least some simple benchmarks
substantiating what is being said, given that WTF::Lock is in the title of
the post and the article spends several paragraphs explaining why it's not a
good candidate.

------
leksak
> But fairness is not guaranteed for all OS mutexes; in fact, fairness isn’t
> even guaranteed in the pthreads standard

Is this true? I do not want to have to pay to read the standard.

~~~
gpderetta
Good thing it is available for free then!

~~~
leksak
Where? I didn't find it.

