1. WTF::Lock uses sequential consistency for all of the slow-path operations. Surely this isn't necessary, and it could still use acquire/release for most of them (like it does in the fast path).
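For reference, a lock fast path that gets by with acquire/release rather than seq_cst looks something like this minimal sketch (the type and method names are illustrative, not WebKit's actual code):

```cpp
#include <atomic>

// Minimal CAS-based lock fast path using acquire/release ordering.
// Names are illustrative, not WebKit's actual API.
struct TinyLock {
    std::atomic<bool> isLocked{false};

    // Returns true if the lock was acquired on the fast path.
    bool tryLock() {
        bool expected = false;
        // Acquire on success is enough: it orders the critical section
        // after the CAS. No ordering is needed on failure.
        return isLocked.compare_exchange_strong(
            expected, true,
            std::memory_order_acquire, std::memory_order_relaxed);
    }

    void unlock() {
        // Release publishes everything done in the critical section
        // to the next thread that acquires the lock.
        isLocked.store(false, std::memory_order_release);
    }
};
```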
2. Ditching OS-provided mutexes has one possibly-important drawback on OS X and iOS: lack of priority donation. Recent versions of OS X and iOS have a Quality Of Service implementation in the kernel that classifies threads (and libdispatch queues) according to one of several QOS levels, and then handles normal thread priorities within those levels.

For this reason you actually cannot safely implement your own spinlock, as I documented a few months ago (http://engineering.postmates.com/Spinlocks-Considered-Harmfu...). It'll appear to work under many normal workloads, but once you start using the spinlock from threads of different QOS levels at the same time, you risk hitting a priority-inversion livelock: a lower-QOS thread holds the spinlock but is never scheduled, because the higher-QOS threads spinning on the lock are always given priority. The OS does actually have a safe spinlock (which does priority donation), but it's not public API and so can't be used outside of system code.

Of course, this doesn't apply to adaptive spinlocks, since they'll just end up in the slow path of parking their thread. But it does highlight the issue they do have: if a high-QOS thread is parked waiting for a lock owned by a low-QOS thread, the high-QOS thread effectively has its priority lowered to the low QOS until the owning thread releases the lock. This is obviously not what you want. The system-provided pthread_mutex_t does priority donation specifically to fix this issue, but WTF::Lock doesn't.
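The hazardous pattern being described is pure spinning with no way to tell the scheduler who owns the lock. A deliberately naive spinlock like the following sketch (illustrative, not any real library's API) is exactly the shape that can livelock across QOS levels:

```cpp
#include <atomic>

// Deliberately naive spinlock: it burns CPU while waiting and gives
// the scheduler no hint about who owns the lock. On a QOS-aware
// kernel, a high-QOS thread spinning in lock() can starve the
// low-QOS owner indefinitely -- the livelock described above.
struct NaiveSpinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // Busy-wait: nothing here tells the kernel to run the
            // owner, so the owner may never be scheduled.
        }
    }

    void unlock() {
        flag.clear(std::memory_order_release);
    }
};
```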
It's true that WTF::Lock doesn't support priority inheritance. That's fine for WebKit's threads.
Also on a contended lock, there's a very strong possibility that you will decide that the lock can be taken right before you block.
On Intel, of course, it doesn't matter.
2) In addition to what Phil already said (which is that there's no priority inversion for Safari/WebKit to have, as all threads taking these locks have the same priority to boot), the private lock you're talking about, named the handoff lock, has terrible unlock patterns, which is why it performed so poorly when WTF::Lock was benchmarked against it.
The sharpest thorn, to me, is: what happens when you are running inside the critical section and your thread gets preempted?
Well, now the next thread which tries to acquire the lock is stuck waiting. But waiting for what? Waiting for the original thread to get scheduled.
Now, the OS has no idea that the original thread should get scheduled again and is free to continue scheduling more and more work items.
Fortunately for the lock in this article, and most other locks which spin, it is adaptive and will not spin for too long. But how long is long enough? If the spin is timed for a few loads and stores, then all is probably well as spins will not be attempted for very long.
I wonder how these locks figure out how long they should spin? In the nasty case I previously mentioned, you'd want to spin for a very short while to avoid large amounts of waste. But spins which are too short lead to higher lock/unlock latency if the lock was held for any appreciable amount of time.
This leads me to the following conclusion: spinning inevitably leads to _some_ number of wasted CPU cycles and therefore increased latency.
I'm curious as to how the amount of spinning was chosen.
- IBM Research found it to be optimal on the portBOB benchmark on a 12-way AIX POWER box in 1999.
- I found it to be optimal for DaCapo multi-threaded benchmarks on both AMD and Intel SMPs in ~2007.
- I again found it to be optimal for WebKit these days.
The point of spinning for a short time and then parking is that parking is hella slow. That's why you can get away with spinning. Spinning is net profitable so long as the amount you spin for is much smaller than the amount of time the OS will waste while parking.
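The spin-then-park shape is easy to sketch. Here's a minimal version of the idea; the 40-iteration spin limit matches the article's choice, but the condition-variable parking and all the names here are stand-ins, not WebKit's actual ParkingLot:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Spin-then-park lock: spin briefly (cheap if the owner's critical
// section is short), then fall back to parking (expensive, but it
// doesn't burn CPU). The condition variable stands in for a real
// parking lot; names and structure are illustrative.
class AdaptiveLock {
public:
    void lock() {
        for (int i = 0; i < spinLimit; ++i) {
            bool expected = false;
            if (isLocked.compare_exchange_weak(expected, true,
                                               std::memory_order_acquire))
                return; // fast path: got the lock while spinning
            std::this_thread::yield(); // give the owner a chance to run
        }
        // Slow path: park until the owner wakes us.
        std::unique_lock<std::mutex> guard(parkingMutex);
        while (isLocked.exchange(true, std::memory_order_acquire))
            unparked.wait(guard);
    }

    void unlock() {
        {
            // Taking parkingMutex here avoids a lost wakeup against a
            // waiter that is between its exchange() and its wait().
            std::lock_guard<std::mutex> guard(parkingMutex);
            isLocked.store(false, std::memory_order_release);
        }
        unparked.notify_one(); // wake one parked waiter, if any
    }

private:
    static constexpr int spinLimit = 40;
    std::atomic<bool> isLocked{false};
    std::mutex parkingMutex;
    std::condition_variable unparked;
};
```

A real implementation would avoid touching the parking mutex on the uncontended unlock path (e.g. by tracking "has parked waiters" in the lock word, as the article's lock does), but that bookkeeping is omitted here for clarity.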
The idea is to avoid spinning using hard-coded constants, but put a modicum of on-the-fly empirical science into it: measure the duration of the critical regions with a smoothed estimator, which is maintained in the mutex. If the actual wait exceeds this smoothed average by too much (say double) then conclude that the owner must be pre-empted and break the spin with rescheduling.
The smoothing is done very easily using a few bit operations that implement an IIR filter, similarly to the TCP RTO estimator.
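The estimator can be as cheap as TCP's srtt update: an exponentially weighted moving average implemented with a shift. A sketch (the 1/8 gain is the classic TCP choice, and the double-the-average break threshold follows the description above; everything else is illustrative):

```cpp
#include <cstdint>

// Shift-based IIR filter (exponentially weighted moving average),
// in the style of the TCP RTO estimator:
//     estimate += (sample - estimate) / 8
// The estimate is stored scaled by 8 so the fractional part isn't
// lost to integer truncation.
struct HoldTimeEstimator {
    int64_t scaledEstimate = 0; // smoothed hold time * 8

    void addSample(int64_t cycles) {
        scaledEstimate += cycles - (scaledEstimate >> 3);
    }

    int64_t estimate() const { return scaledEstimate >> 3; }

    // Spin-break heuristic: if we've already waited more than double
    // the smoothed hold time, assume the owner was preempted and
    // stop spinning in favor of rescheduling.
    bool shouldStopSpinning(int64_t waitedCycles) const {
        return waitedCycles > 2 * estimate();
    }
};
```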
...and so on
All the JetBrains tools do this AFAIK.
Side issue: what about page faults in critical sections? Should critical sections be allowed to cross page boundaries?
+1. Even just the WTF::Lock, WTF::Condition and supporting code would be useful.
I very much doubt that the Linux implementation is that different (I'm actually fairly confident pthread_mutex on Linux and OS X El Capitan+ are mostly on par).
Packing several unrelated, highly contended mutexes into the same 32-bit word is probably not a good idea, though. :)
This reminds me of early Unix and its "wait channels", which were just addresses, not even objects. A process could wait, specifying the address of any object whatsoever as a "channel". A wakeup for that address would resume it. This required no storage in the addressed object itself; the extra paraphernalia required for it was associated by a hash or whatever. And of course that paraphernalia is transient; an object that nobody is waiting on doesn't need it.
Some benchmarks might be useful to see how this affects performance and CPU perf counters.
That made me smile. It's nice to see some playfulness in business.
'wtf' is right. locking sometimes adds as much risk as it removes.
> Textbooks will tell you that if you always lock in the same order, you will never get this kind of deadlock. Practice will tell you that this approach doesn't scale: when I create a new lock, I don't understand enough of the kernel to figure out where in the 5000 lock hierarchy it will fit.
WebKit's lock hierarchies don't involve 5000 locks. We try to do as little work as possible while holding a lock, and that usually means that we don't hold more than one lock at a time. It's true that when you hold multiple locks at a time, it can get confusing. So, don't do that!