
The big problem I have with locks is that under heavy contention they slow parallel code down MORE than if you ran it serially. This is because of "false sharing": when one core grabs the lock, it invalidates the cache line in every other core that will grab the lock, and they have to fetch the lock from main memory. Fetching from main memory is very slow. I use compare-and-swap in those cases, and I've converted almost every lock I've used to compare-and-swap. I'd like to explore "restartable sequences" for parallel data structures (https://tinyurl.com/vzh3523y), but AFAIK they're only available on Linux and we develop on macOS and Linux.
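For concreteness, here's roughly what that lock-to-CAS conversion looks like (a minimal sketch in Java, since that's the language the rest of this thread is about; the class and method names are just illustrative):

    import java.util.concurrent.atomic.AtomicLong;

    class CasCounter {
        private final AtomicLong value = new AtomicLong();

        // Lock-free increment: read the current value, try to publish
        // current + 1, and retry if another thread won the race.
        long increment() {
            while (true) {
                long current = value.get();
                if (value.compareAndSet(current, current + 1)) {
                    return current + 1;
                }
            }
        }
    }

Instead of every thread queueing on one mutex, each thread only retries when it actually lost a race.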



What you describe (multiple cores sharing the same lock) sounds like true sharing, not false sharing. False sharing is cache line contention that occurs when cores are using separate structures (i.e. not actually shared) which happen to land on the same cache line.

Compare and swap still typically requires the cache line to bounce around between cores, so if that’s the primary cost of locks for you it seems like compare and swap doesn’t really fix the problem of frequent synchronization between cores.
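(For hot counters, one common mitigation on the Java side is to stripe the state across cache lines so threads rarely CAS the same word; java.util.concurrent.atomic.LongAdder does exactly this internally. A rough sketch:)

    import java.util.concurrent.atomic.LongAdder;

    class StripedCounterDemo {
        public static void main(String[] args) throws InterruptedException {
            LongAdder hits = new LongAdder();
            // Each increment CASes a per-thread padded cell, so the hot
            // cache line rarely bounces between cores under contention.
            Runnable task = () -> {
                for (int i = 0; i < 1_000_000; i++) hits.increment();
            };
            Thread t1 = new Thread(task), t2 = new Thread(task);
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println(hits.sum()); // folds the cells: 2000000
        }
    }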


Ah - thank you. "True sharing" makes way more sense as a description of the situation. It's a general problem with locks that I've seen again and again - maybe CAS doesn't actually improve it, but I thought it did. It is challenging to profile this stuff carefully, and I've been disappointed many times by my attempts at multithreaded programming when locks are involved. I've implemented a multithreaded Common Lisp programming environment (https://github.com/clasp-developers/clasp.git) that interoperates with C++. Two locks cause me massive headaches: (1) the one in the unwinder, when we do a lot of stack unwinding in multiple threads, and (2) the one in the memory manager, when we do a lot of memory allocation in multiple threads. In those cases, parallel code runs multiple times slower than serial code. In another situation, where we do parallel non-linear optimization without allocating memory, we can get 20 CPUs working in parallel with no problems.


If false sharing is truly the bottleneck in your code, can't you just make sure the lock reserves the entire cache line? Each core still needs to fetch from L3 cache to grab the lock itself, but that's roughly the same cost as a compare-and-swap.
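In Java you could do that by hand by padding the hot word out with dummy fields - a sketch assuming 64-byte cache lines (with the caveat that the JVM is free to reorder fields, which is exactly why a JVM-level mechanism also exists; the field names here are just illustrative):

    import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

    class PaddedSpinLock {
        @SuppressWarnings("unused")
        private long p0, p1, p2, p3, p4, p5, p6;  // padding before the hot word
        private volatile int state;               // 0 = free, 1 = held
        @SuppressWarnings("unused")
        private long q0, q1, q2, q3, q4, q5, q6;  // padding after the hot word

        private static final AtomicIntegerFieldUpdater<PaddedSpinLock> STATE =
            AtomicIntegerFieldUpdater.newUpdater(PaddedSpinLock.class, "state");

        void lock()   { while (!STATE.compareAndSet(this, 0, 1)) Thread.onSpinWait(); }
        void unlock() { state = 0; }
    }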


The concurrent queue code in Java has (or had) an array that the locks were placed in. The problem with putting locks in an array, instead of scattering them across the heap, is that you get false sharing immediately. So the array is made 8 or 16 times as big as it needs to be, to get one lock per cache line.


The @Contended annotation will pad the memory layout of fields to fill a cache line and avoid false sharing.
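For reference, that's sun.misc.Contended on JDK 8 and jdk.internal.vm.annotation.Contended from JDK 9 on, and user code needs -XX:-RestrictContended for it to take effect. A sketch of the two-hot-fields case (on JDK 9+ you'd also need --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED to compile this):

    import jdk.internal.vm.annotation.Contended;

    class QueueIndices {
        // Without padding these two hot counters would typically land on
        // the same cache line; @Contended asks the JVM to pad each field
        // so producer and consumer threads stop false-sharing them.
        @Contended volatile long head;
        @Contended volatile long tail;
    }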



