
Locking in WebKit - pizlonator
https://webkit.org/blog/6161/locking-in-webkit/
======
eridius
Two things come to mind when reading this:

1. WTF::Lock is using sequential consistency for all of the slow path
operations. Surely this isn't necessary and it could still use acquire/release
for most of them (like it does in the fast path).

2. Ditching OS-provided mutexes has one possibly-important drawback on OS X
and iOS: lack of priority donation. Recent versions of OS X and iOS have a
Quality of Service implementation in the kernel that classifies threads (and
libdispatch queues) according to one of several QoS levels, and then handles
normal thread priorities within those levels. For this reason you actually
cannot safely implement your own spinlock, as I documented a few months ago
([http://engineering.postmates.com/Spinlocks-Considered-Harmful-On-
iOS/](http://engineering.postmates.com/Spinlocks-Considered-Harmful-On-iOS/));
it'll appear to work under many normal workloads, but once you start using the
spinlock from threads of different QoS levels at the same time, you risk
hitting a priority inversion livelock where a lower-QoS thread holds the
spinlock but is never scheduled, because higher-QoS threads spinning on the
lock are always given priority. The OS does actually have a safe spinlock
(which does priority donation), but it's not public API and so can't be used
outside of system code.

Of course, this doesn't apply to adaptive spinlocks, since they'll just end up
in the slow path of parking their thread, but it does still highlight the
issue they do have: if a high-QoS thread is parked waiting for a lock that's
owned by a low-QoS thread, the high-QoS thread effectively has its priority
lowered to the low QoS until the owning thread releases the lock. This is
obviously not what you want. The system-provided pthread_mutex_t does priority
donation specifically to fix this issue, but WTF::Lock doesn't.

~~~
pizlonator
Weakening the consistency of operations on the slow path is not going to be a
speed-up. Those paths are not dominated by pipeline effects inside the CPU.
They are dominated by context switches and memory contention. So, SC is
exactly right: it's harder to get wrong and exactly the same speed.

It's true that WTF::Lock doesn't support priority inheritance. That's fine for
WebKit's threads.

~~~
phabouzit
I disagree strongly: overly strong barriers on non-TSO hardware (such as, say,
armv7 or armv8) will just cause even more of that memory contention for no
good reason.

Also on a contended lock, there's a very strong possibility that you will
decide that the lock can be taken right before you block.

On Intel, of course, it doesn't matter.

~~~
pizlonator
It also doesn't matter on all other architectures. The slow path is dominated
by OS operations like yielding, which already execute the strongest barriers
possible.

------
cokernel_hacker
Spinning in locks is always tricky business.

The sharpest thorn, to me, is: what happens when you are running inside the
critical section and your thread gets preempted?

Well, now the next thread which tries to acquire the lock is stuck waiting.
But waiting for what? Waiting for the original thread to get scheduled.

Now, the OS has no idea that the original thread should get scheduled again
and is free to continue scheduling more and more work items.

Fortunately for the lock in this article, and most other locks which spin, it
is adaptive and will not spin for too long. But how long is long enough? If
the spin is timed for a few loads and stores, then all is probably well as
spins will not be attempted for very long.

I wonder how these locks figure out how long they should spin? In the nasty
case I previously mentioned, you'd want to spin for a very short while to
avoid large amounts of waste. But spins which are too short lead to higher
lock/unlock latency if the lock was held for any appreciable amount of time.

This leads me to the following conclusion: spinning inevitably leads to
_some_ number of wasted CPU cycles and therefore increased latency.

I'm curious as to how the amount of spinning was chosen.

~~~
pizlonator
The post details why we spin for the amount of time that we do. It turns out
that there is a wealth of measurements that support our 40-spins-while-
yielding rule. It worked for a strangely diverse set of workloads:

\- IBM Research found it to be optimal on the portBOB benchmark on a 12-way
AIX POWER box in 1999.

\- I found it to be optimal for DaCapo multi-threaded benchmarks on both AMD
and Intel SMPs in ~2007.

\- I again found it to be optimal for WebKit these days.

The point of spinning for a short time and then parking is that parking is
hella slow. That's why you can get away with spinning. Spinning is net
profitable so long as the amount you spin for is much smaller than the amount
of time the OS will waste while parking.

------
emehrkay
Random observation: WTF would be a great base exception in a new, hip
programming language.

    
    
       WTF.happened
       WTF.context
       ...and so on

~~~
amelius
Not a good idea, because now you cannot easily grep anymore for "WTF" in
source code comments :)

~~~
stephenr
Use a tool that lets you search specifically within comments, within string
literals, outside comments, outside string literals, etc.

All the JetBrains tools do this AFAIK.

------
Animats
It would be interesting to have more CPU support for critical sections. You
could have an option to prevent interrupts on a fence instruction during short
critical sections. Interrupts would be prevented for a maximum number of
instructions set by an operating system controlled register, so user threads
couldn't lock up a CPU. Running out the instruction count would cause an
exception interrupt. The limit on number of instructions would be small; maybe
10 or 20.

Side issue: what about page faults in critical sections? Should critical
sections be allowed to cross page boundaries?

------
netheril96
WebKit has its own lock and malloc implementations, and reimplements many
other parts of the standard library. Given that Apple's engineers have a great
deal of influence on all of WebKit, libc++ and their operating systems, I
wonder why they do not port those WebKit implementations onto OS X and iOS.

~~~
stephenr
Because webkit runs on more than just OS X and iOS?

------
occasionalreedr
Code here:

[https://github.com/WebKit/webkit/blob/master/Source/WTF/wtf/...](https://github.com/WebKit/webkit/blob/master/Source/WTF/wtf/Lock.h)

[https://github.com/WebKit/webkit/blob/master/Source/WTF/wtf/...](https://github.com/WebKit/webkit/blob/master/Source/WTF/wtf/Condition.h)

------
netheril96
WTF seems to be a very useful library, especially since the C++ standard
library is so criminally small. I hope the WebKit guy can release it as a
standalone library that will benefit many C++ developers. (Currently I can
copy WTF into my own project, but I don't know how to use it due to lack of
documentation).

~~~
occasionalreedr
> I hope the WebKit guy can release it as a standalone library that will
> benefit many C++ developers.

+1. Even just the WTF::Lock, WTF::Condition and supporting code would be
useful.

------
occasionalreedr
Please replace the very slow Mac OS X pthread_mutex locks with the WTF::Lock
implementation to benefit all applications. I can live with it still being 64
bytes and the scheduling being not completely fair. The OS X pthread mutexes
are several times slower than Linux on the same hardware.

~~~
phabouzit
Which OS X release? With which benchmark? Uncontended pthread_mutex was made
significantly faster in OS X El Capitan (almost twice as fast as before), and
on my own Mac it is only twice as slow as OSSpinLock. OSSpinLock unlock is a
store(lock, 0), whereas a mutex usually needs an xchg or cmpxchg, which
accounts for most of the 2x performance loss.

I very much doubt that the Linux implementation is that different (I'm
actually fairly confident pthread_mutex on Linux and OS X El Capitan+ are
mostly on par).

------
kazinator
> _WTF::Lock and WTF::Condition only require one byte of storage each.
> WTF::Lock only needs two bits in that byte._

Packing several unrelated, highly contended mutexes into the same 32 bit word
is probably not a good idea, though. :)

This reminds me of early Unix and its "wait channels", which were just
addresses, not even objects. A process could wait, specifying the address of
any object whatsoever as a "channel". A wakeup for that address would resume
it. This required no storage in the addressed object itself; the extra
paraphernalia required for it was associated by a hash or whatever. And of
course that paraphernalia is transient; an object that nobody is waiting on
doesn't need it.

------
exDM69
One byte per mutex sounds a bit suspicious given that atomic operations
typically work on a word or cache line granularity. If you have two mutexes in
the same cache line, false sharing occurs and the CPU interconnect gets busy.
Locking two uncontended mutexes in same cache line on different cores ends up
executing serially because the cores need to snoop each other's caches.

Some benchmarks might be useful to see how this affects performance and CPU
perf counters.

~~~
pizlonator
The point isn't to have two mutexes in the same cache line. The point is to
have a mutex plus the data that the mutex protects in the same cache line.
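
A hedged illustration of that layout (TinyLock is a stand-in for the one-byte
WTF::Lock, and RefCounted is an invented example type):

```cpp
#include <cstdint>

// With a one-byte lock, the lock and the data it guards fit in the
// same 64-byte cache line, so the line a core pulls in to take the
// lock is the line the critical section touches anyway. A
// pthread_mutex_t (typically 40+ bytes) would push the data onto
// another line.
struct TinyLock {
    uint8_t state; // stand-in for the one-byte WTF::Lock
};

struct RefCounted {
    TinyLock lock;     // 1 byte
    uint32_t refCount; // the protected field, right next to its lock
};

static_assert(sizeof(TinyLock) == 1, "lock is one byte");
static_assert(sizeof(RefCounted) <= 8, "lock and data share a cache line");
```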

------
fenesiistvan
I am wondering whether this is faster than the SRW Lock on Windows:
[https://msdn.microsoft.com/en-us/library/windows/desktop/aa904937%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396](https://msdn.microsoft.com/en-us/library/windows/desktop/aa904937%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396)

------
pron
The talk _Engineering Concurrent Library Components_, by Doug Lea, is one of
the best overviews of the issues surrounding the design and implementation of
low-level concurrency building blocks:
[https://www.youtube.com/watch?v=sq0MX3fHkro](https://www.youtube.com/watch?v=sq0MX3fHkro)

------
alexeiz
My understanding is that WTF::Lock works best when there is no lock contention
or contention can be resolved with spinning (microcontention). Otherwise
WTF::Lock ends up blocking on pthread_mutex (std::mutex) inside
ParkingLot::parkConditionallyImpl, and in such case I expect it to be worse
than pthread_mutex.

------
coldtea
> _with the new WTF::Lock (WTF stands for Web Template Framework_

That made me smile. It's nice to see some playfulness in business.

------
andrewvijay
Wish I could understand such articles someday! Learnin.

------
awinter-py
> rarely an excuse for not having one, or even multiple, fine-grained locks in
> any object that has things that need to be synchronized

'wtf' is right. locking sometimes adds as much risk as it removes.

~~~
pizlonator
I often hear this kind of claim. Do you have evidence? Because I just
presented evidence that locks are awesome.

~~~
awinter-py
race conditions

> Textbooks will tell you that if you always lock in the same order, you will
> never get this kind of deadlock. Practice will tell you that this approach
> doesn't scale: when I create a new lock, I don't understand enough of the
> kernel to figure out where in the 5000 lock hierarchy it will fit.

[https://www.kernel.org/pub/linux/kernel/people/rusty/kernel-locking/x441.html](https://www.kernel.org/pub/linux/kernel/people/rusty/kernel-locking/x441.html)

~~~
pizlonator
It's true that when you write code, that code can have bugs. When I write file
system code and I have a bug, we call it "data loss". When I write user
interface code and I have a bug, we call it "not user friendly". And when I
write code that uses locks and it has a bug we call it a "race condition".
Just because you can give it a name doesn't mean it's bad. Just because a
programming idiom can be misused doesn't mean it's bad. All programming idioms
can be misused!

WebKit's lock hierarchies don't involve 5000 locks. We try to do as little
work as possible while holding a lock, and that usually means that we don't
hold more than one lock at a time. It's true that when you hold multiple locks
at a time, it can get confusing. So, don't do that!

~~~
awinter-py
If you're just saying 'when locking is cheap, fine-grained locks can be easier
to deal with vs coarse locks' I can agree with that.

