
It's "locking" if it's blocking - nathell
http://www.yosefk.com/blog/its-locking-if-its-blocking.html
======
rcoh
While lock-free code gets you part of the way to efficient parallelism (by
removing suspension-induced dead time), as the author mentions, the
performance impact of repeated cache line invalidation can be a significantly
bigger problem.

In many modern processor architectures (x86, for example), a cache-coherence
protocol ensures that every core's cache serves data consistent with the
architecture's memory model. On x86, that model is Total Store Ordering. (See
<http://en.wikipedia.org/wiki/Memory_ordering>).

This means that if two processors are contending for the same value (e.g.
trying to increment it or set it to true; even readers suffer, since each
write invalidates their copies), the CPU must invalidate the cache line
containing that value on all the other cores, leading to massive scalability
bottlenecks. If multiple cores are contending for the same location in
memory, lock free or not, performance will suffer.

More devious is the case of false sharing, where two different values just
happen to fall on the same cache line. Even though they don't logically
conflict, the processor must still invalidate the whole line on every core.
Modern compilers do their best to prevent this, but sometimes they need a
little help.

The takeaway is this: don't try to implement your own locks using CAS -- even
something as simple as a lock is very hard to get right (performance-wise)
when scaling to dozens or hundreds of cores / threads. People have solved
this problem (people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf). Writing
fast concurrent code (especially lock-free code) is a minefield of weird
architecture gotchas. Watch your step.

~~~
_yosefk
Actually I didn't claim that lock-free is more efficient - I explicitly said
that it isn't necessarily more efficient, though I didn't discuss the reasons
you do. I only said that "lock-based" is _defined_ by the need to deal with
blocking upon suspension - not that blocking is necessarily a bad property in
any way, in particular in its efficiency impact.

------
tdrd
This is a good academic post, but why are we still writing mutable-shared-
state concurrent code?

Either message-passing concurrency or data immutability would trivialize the
problems discussed here.

~~~
psobot
The actor model and data immutability are good for programmer productivity and
reducing errors, but not for writing highly performant or low-level code. Even
then, _someone_ has to write the message queue, and how do you plan to do that
without any mutable shared state?

~~~
tdrd
Obviously there's no avoiding this, but the author writes in the context of
high-level code where OS-provided locking mechanisms are available. Why are we
discussing this in a low-level (or embedded, where every ounce of performance
matters) context?

~~~
psobot
You don't need to be embedded to have performance matter - even if your OS
gives you concurrency primitives, there are many situations where jumping into
kernel code is still "too expensive."

~~~
justincormack
That's why Linux has the futex which is userspace only if not contended only
jumps to kernel code for the contended case.

