Locking in WebKit (webkit.org)
121 points by pizlonator on May 6, 2016 | hide | past | favorite | 40 comments



Two things come to mind when reading this:

1. WTF::Lock is using sequential consistency for all of the slow path operations. Surely this isn't necessary and it could still use acquire/release for most of them (like it does in the fast path).

2. Ditching OS-provided mutexes has one possibly-important drawback on OS X and iOS: lack of priority donation. Recent versions of OS X and iOS have a Quality Of Service implementation in the kernel that classifies threads (and libdispatch queues) according to one of several QOS levels, and then handles normal thread priorities within those levels.

For this reason you actually cannot safely implement your own spinlock, as I documented a few months ago (http://engineering.postmates.com/Spinlocks-Considered-Harmfu...). It'll appear to work under many normal workloads, but once you start using the spinlock from threads of different QOS levels at the same time, you risk hitting a priority-inversion livelock, where a lower-QOS thread holds the spinlock but is never scheduled because higher-QOS threads spinning on the lock are always given priority. The OS does actually have a safe spinlock (which does priority donation), but it's not public API and so can't be used outside of system code.

Of course, this doesn't apply to adaptive spinlocks, since they'll just end up in the slow path of parking their thread. But it does highlight the issue they do have: if a high-QOS thread is parked waiting for a lock owned by a low-QOS thread, the high-QOS thread effectively has its priority lowered to the low QOS until the owning thread releases the lock. This is obviously not what you want. The system-provided pthread_mutex_t does priority donation specifically to fix this issue, but WTF::Lock doesn't.


Weakening the consistency of operations on the slow path is not going to be a speed-up. Those paths are not dominated by pipeline effects inside the CPU. They are dominated by context switches and memory contention. So, SC is exactly right: it's harder to get wrong and exactly the same speed.

It's true that WTF::Lock doesn't support priority inheritance. That's fine for WebKit's threads.


I disagree strongly: overly strong barriers on non-TSO hardware (such as, say, ARMv7 or ARMv8) will just cause even more of that memory contention for no good reason.

Also on a contended lock, there's a very strong possibility that you will decide that the lock can be taken right before you block.

On Intel, of course, it doesn't matter.


It also doesn't matter on all other architectures. The slow path is dominated by OS operations like yielding, which already execute the strongest barriers possible.


1) indeed there are a couple of barriers that could be optimized

2) In addition to what Phil already said (which is that there's no priority inversion for Safari/WebKit to have, as all threads taking these locks have the same priority to boot), the private lock you're talking about, named the handoff lock, has terrible unlock patterns, which is why it performed so poorly when it was benchmarked against WTF::Lock.


Spinning in locks is always tricky business.

The sharpest thorn, to me, is: what happens when you are running inside the critical section and your thread gets preempted?

Well, now the next thread which tries to acquire the lock is stuck waiting. But waiting for what? Waiting for the original thread to get scheduled.

Now, the OS has no idea that the original thread should get scheduled again and is free to continue scheduling more and more work items.

Fortunately for the lock in this article, and most other locks which spin, it is adaptive and will not spin for too long. But how long is long enough? If the spin is timed for a few loads and stores, then all is probably well as spins will not be attempted for very long.

I wonder how these locks figure out how long they should spin? In the nasty case I previously mentioned, you'd want to spin for a very short while to avoid large amounts of waste. But spins which are too short lead to higher lock/unlock latency if the lock was held for any appreciable amount of time.

This leads me to the following conclusion: spinning inevitably leads to _some_ number of wasted CPU cycles and therefore increased latency.

I'm curious as to how the amount of spinning was chosen.


The post details why we spin for the amount of time that we do. It turns out that there is a wealth of measurements that support our 40-spins-while-yielding rule. It worked for a strangely diverse set of workloads:

- IBM Research found it to be optimal on the portBOB benchmark on a 12-way AIX POWER box in 1999.

- I found it to be optimal for DaCapo multi-threaded benchmarks on both AMD and Intel SMPs in ~2007.

- I again found it to be optimal for WebKit these days.

The point of spinning for a short time and then parking is that parking is hella slow. That's why you can get away with spinning. Spinning is net profitable so long as the amount you spin for is much smaller than the amount of time the OS will waste while parking.
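The spin-then-park structure can be sketched like this (a simplified illustration, using a condition variable in place of WebKit's ParkingLot; the class name and fallback mechanism are assumptions, not WTF::Lock's real code; the 40-iteration bound follows the rule quoted above):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simplified spin-then-park lock (illustrative sketch, not WTF::Lock itself).
// It spins while yielding for a bounded number of iterations, then falls back
// to blocking on a condition variable (standing in for ParkingLot's parking).
class SpinThenParkLock {
public:
    void lock() {
        // Fast path plus bounded spinning: 40 spins-while-yielding.
        for (int i = 0; i < 40; ++i) {
            if (!m_flag.exchange(true, std::memory_order_acquire))
                return; // acquired the lock while spinning
            std::this_thread::yield();
        }
        // Slow path: park until the holder releases the lock.
        std::unique_lock<std::mutex> guard(m_parkMutex);
        m_parked.wait(guard, [this] {
            return !m_flag.exchange(true, std::memory_order_acquire);
        });
    }

    void unlock() {
        m_flag.store(false, std::memory_order_release);
        std::lock_guard<std::mutex> guard(m_parkMutex);
        m_parked.notify_one();
    }

private:
    std::atomic<bool> m_flag{false};
    std::mutex m_parkMutex;
    std::condition_variable m_parked;
};
```

Note that the fast path and the spin loop use only acquire/release ordering; the expensive machinery is confined to the parking fallback.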


I attempted to address these concerns using an adaptive algorithm in glibc some 15 years ago:

http://stackoverflow.com/questions/19863734/what-is-pthread-...

The idea is to avoid spinning using hard-coded constants, but put a modicum of on-the-fly empirical science into it: measure the duration of the critical regions with a smoothed estimator, which is maintained in the mutex. If the actual wait exceeds this smoothed average by too much (say double) then conclude that the owner must be pre-empted and break the spin with rescheduling.

The smoothing is done very easily using a few bit operations that implement an IIR filter, similarly to the TCP RTO estimator.
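The shift-based IIR smoothing described here might look like the following (a sketch of the technique, not glibc's actual code; the struct name and the factor-of-two threshold follow the description above):

```cpp
#include <cstdint>

// Shift-based IIR (exponentially weighted moving average) smoothing, in the
// style of the TCP RTO estimator: smoothed += (sample - smoothed) / 8.
struct CriticalSectionEstimator {
    int64_t smoothed = 0; // smoothed critical-section duration estimate

    void addSample(int64_t sample) {
        smoothed += (sample - smoothed) >> 3; // gain of 1/8, no division needed
    }

    // Heuristic from the comment above: if we've already waited much longer
    // (say, double) than the typical critical section, conclude that the
    // owner was probably preempted and stop spinning.
    bool ownerLikelyPreempted(int64_t waited) const {
        return waited > 2 * smoothed;
    }
};
```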


See the Doug Lea talk I linked to in another comment. Turns out that even the process of spinning is itself not so simple on some new processors.


Random observation: WTF would be a great base exception in a new, hip programming language.

   WTF.happened
   WTF.context
   ...and so on


Android has a Log.wtf() method for logging "What a Terrible Failure" conditions:

http://developer.android.com/reference/android/util/Log.html...


Android has Log.WTF which of course stands for What a Terrible Failure (and totally not that other word, the similarity is a complete coincidence /s )



Not a good idea, because now you can no longer easily grep for "WTF" in source code comments :)


Use a tool that lets you search specifically within comments, within string literals, outside comments, outside string literals, etc.

All the JetBrains tools do this AFAIK.


There's a Log.wtf method in Android for errors that shouldn't have happened.


It would be interesting to have more CPU support for critical sections. You could have an option to prevent interrupts on a fence instruction during short critical sections. Interrupts would be prevented for a maximum number of instructions set by an operating system controlled register, so user threads couldn't lock up a CPU. Running out the instruction count would cause an exception interrupt. The limit on number of instructions would be small; maybe 10 or 20.

Side issue: what about page faults in critical sections? Should critical sections be allowed to cross page boundaries?


WebKit has its own lock and malloc implementations and many other redos of the standard library. Given that Apple's engineers have a great deal of influence on all of WebKit, libc++, and their operating systems, I wonder why they don't port those WebKit implementations to OS X and iOS.


Because WebKit runs on more than just OS X and iOS?



WTF seems to be a very useful library, especially since the C++ standard library is so criminally small. I hope the WebKit guy can release it as a standalone library that will benefit many C++ developers. (Currently I can copy WTF into my own project, but I don't know how to use it due to lack of documentation).


> I hope the WebKit guy can release it as a standalone library that will benefit many C++ developers.

+1. Even just the WTF::Lock, WTF::Condition and supporting code would be useful.


Please replace the very slow Mac OS X pthread_mutex locks with the WTF::Lock implementation to benefit all applications. I can live with it still being 64 bytes and the scheduling being not completely fair. The OS X pthread mutexes are several times slower than Linux on the same hardware.


Which OS X release? With which benchmark? Uncontended pthread_mutex was made significantly faster in OS X El Capitan (almost twice as fast as before), and on my own Mac it is only twice as slow as OSSpinLock; OSSpinLock unlock is just a store(lock, 0), where a mutex usually needs an xchg or cmpxchg, which accounts for most of the 2x performance loss.

I very, very much doubt that the Linux implementation is that different (I'm actually fairly confident pthread_mutex on Linux and OS X El Capitan+ are mostly on par).


> WTF::Lock and WTF::Condition only require one byte of storage each. WTF::Lock only needs two bits in that byte.

Packing several unrelated, highly contended mutexes into the same 32 bit word is probably not a good idea, though. :)

This reminds me of early Unix and its "wait channels", which were just addresses, not even objects. A process could wait, specifying the address of any object whatsoever as a "channel". A wakeup for that address would resume it. This required no storage in the addressed object itself; the extra paraphernalia required for it was associated by a hash or whatever. And of course that paraphernalia is transient; an object that nobody is waiting on doesn't need it.


One byte per mutex sounds a bit suspicious given that atomic operations typically work on a word or cache line granularity. If you have two mutexes in the same cache line, false sharing occurs and the CPU interconnect gets busy. Locking two uncontended mutexes in same cache line on different cores ends up executing serially because the cores need to snoop each other's caches.

Some benchmarks might be useful to see how this affects performance and CPU perf counters.


The points isn't to have two mutexes in the same cache line. The point is to have a mutex plus the data that the mutex protects in the same cache line.
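A layout sketch makes the point concrete (an assumption about how one might arrange things, not WebKit code): a one-byte lock leaves room for the guarded data in the same cache line, so acquiring the lock warms the very line the critical section is about to touch.

```cpp
#include <cstdint>

// Hypothetical layout: a one-byte lock plus the data it protects packed
// into a single 64-byte cache line (64 is a typical line size on x86).
struct alignas(64) GuardedCounter {
    uint8_t lock;   // one-byte lock (a WTF-style lock needs only two bits)
    uint32_t value; // the data the lock protects, in the same line
};

static_assert(sizeof(GuardedCounter) == 64,
              "lock and data share exactly one cache line");
```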


I am wondering whether this is faster than the SRW lock on Windows: https://msdn.microsoft.com/en-us/library/windows/desktop/aa9...


The talk Engineering Concurrent Library Components, by Doug Lea, is one of the best overviews of the issues surrounding the design and implementation of low-level concurrency building blocks: https://www.youtube.com/watch?v=sq0MX3fHkro


My understanding is that WTF::Lock works best when there is no lock contention or contention can be resolved with spinning (microcontention). Otherwise WTF::Lock ends up blocking on pthread_mutex (std::mutex) inside ParkingLot::parkConditionallyImpl, and in that case I expect it to be worse than pthread_mutex.


>with the new WTF::Lock (WTF stands for Web Template Framework

That made me smile. It's nice to see some playfulness in business.


Wish I could understand such articles someday! Learnin.


> rarely an excuse for not having one, or even multiple, fine-grained locks in any object that has things that need to be synchronized

'wtf' is right. locking sometimes adds as much risk as it removes.


I often hear this kind of claim. Do you have evidence? Because I just presented evidence that locks are awesome.


race conditions

> Textbooks will tell you that if you always lock in the same order, you will never get this kind of deadlock. Practice will tell you that this approach doesn't scale: when I create a new lock, I don't understand enough of the kernel to figure out where in the 5000 lock hierarchy it will fit.

https://www.kernel.org/pub/linux/kernel/people/rusty/kernel-...


It's true that when you write code, that code can have bugs. When I write file system code and I have a bug, we call it "data loss". When I write user interface code and I have a bug, we call it "not user friendly". And when I write code that uses locks and it has a bug we call it a "race condition". Just because you can give it a name doesn't mean it's bad. Just because a programming idiom can be misused doesn't mean it's bad. All programming idioms can be misused!

WebKit's lock hierarchies don't involve 5000 locks. We try to do as little work as possible while holding a lock, and that usually means that we don't hold more than one lock at a time. It's true that when you hold multiple locks at a time, it can get confusing. So, don't do that!


If you're just saying 'when locking is cheap, fine-grained locks can be easier to deal with vs coarse locks' I can agree with that.


It's just that lock-free (or wait-free, or even obstruction-free) data structures aren't exactly trivial to implement correctly either. Besides often having architectural impact (uh oh, I now need to declare quiescent periods?), they're extremely hard to debug, and there's even less automated verification for them than for locking.


There are many design alternatives to locking. Lock-free synchronization is one of them, but queueing is another. The best alternative is to simplify your control flow so that you don't need as much synchronization.


I have never seen lock-free synchronization touted as a simpler alternative to locking. (And I do lock-free synchronization.) Avoiding synchronization as much as possible is a good principle, one which I use often, and I actually contend it is the only way to write scalable code. But someone always needs to eventually synchronize. Even if it's not your code explicitly, somewhere, your code will have to rely on code someone else wrote that synchronizes. This article is aimed at those people.



