
An Introduction to Lock-Free Programming - preshing
http://preshing.com/20120612/an-introduction-to-lock-free-programming
======
haberman
Nice article, though I think that any intro-level material on lock-free
programming should always include a "don't try this at home for anything
important" warning. Until you have some experience with this stuff you will
almost certainly make mistakes, but these mistakes might only manifest
themselves as crashes in extremely rare circumstances.

I wrote my first lock-free code in 2004 based on reading some papers by Maged
Michael from IBM. I wrote a lock-free FIFO in PowerPC assembly, and was
convinced it was safe and robust. When I emailed Maged about it, he pointed
out that if a thread was suspended on one specific instruction and some
specific memory was unmapped before it could run again, the program could
crash. I was amazed; I had thought hard about this algorithm, but had
completely missed that possibility.

Some other specific notes about the article:

> Basically, if some part of your program satisfies the following conditions,
> then that part can rightfully be considered lock-free.

There are actually several levels of lock-freedom defined in the literature:
lock-freedom, wait-freedom, and obstruction-freedom. For more info see:
<http://en.wikipedia.org/wiki/Non-blocking_algorithm>

> Processors such as PowerPC and ARM expose load-link/store-conditional
> instructions, which effectively allow you to implement your own RMW
> primitive at a low level, though this is not often done.

One benefit of load-linked/store-conditional (often abbreviated LL/SC) is that
it avoids the ABA problem (<http://en.wikipedia.org/wiki/ABA_problem>). In
practice this doesn't matter that much since x86 doesn't support LL/SC, but I
just think it's an interesting factoid to know.

> For instance, PowerPC and ARM processors can change the order of memory
> stores relative to the instructions themselves, but the x86/64 family of
> processors from Intel and AMD cannot.

(I've edited my reply here since my original assertion was incorrect). It's
true that x86/64 won't reorder stores (see
<http://en.wikipedia.org/wiki/Memory_ordering> for details) but it _will_
reorder loads, so memory barriers are still required in some situations.
However, I believe that the atomic instructions ("lock cmpxchg" and "lock
xadd") imply full barriers on x86.

~~~
preshing
I totally agree with you about the dangers of lock-free programming, and that
every programmer will certainly make mistakes. I feel strongly enough about it
that I actually wrote several paragraphs about the importance of stress
testing lock-free algorithms, which I eventually decided to break off into a
separate post... I might get around to publishing that, we'll see. We're not
the first to feel this way :)

> There are actually several levels of lock-freedom defined in the literature:
> lock-freedom, wait-freedom, and obstruction-freedom. For more info see:
> <http://en.wikipedia.org/wiki/Non-blocking_algorithm>

It's true, and I had a hard time reconciling that Wikipedia page with how
others had been describing lock-free programming, which is why I spent so much
of the post on semantics.

> One benefit of load-linked/store-conditional (often abbreviated LL/SC) is
> that it avoids the ABA problem

Good point. Though I suspect if you access other cache lines between the LL &
SC, it may invalidate the reservation.

> I don't think this is true about x86/64.

It is true about x86/64, at least for the most common, non-SSE instructions
and normal, non-write-combined memory. I'll update the post to be more
specific. See volume 3, section 8.2.2 of the Intel 64 and IA-32 Developer's
Manuals:
(<http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html>)

\- Reads are not reordered with other reads.

\- Writes are not reordered with older reads.

\- Writes to memory are not reordered with other writes (with a few exceptions
for special instructions).

~~~
haberman
> It is true about x86/64

Yes, sorry I misspoke initially about stores being reordered, but it does
still appear that stores can be reordered after loads; see:
<http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/>.
(Your Intel manual reference is surely a more authoritative source, but I'm
not able to do a deep dive into it at the moment.)

~~~
preshing
There's no contradiction. Normally, the x86/64 memory model is quite strong,
preserving LoadLoad, LoadStore, and StoreStore ordering. But it doesn't
preserve StoreLoad ordering; e.g., stores can be reordered after loads, as you
point out. As I understand it, that's mainly because there's a limit to how
quickly each store can be seen by other processors.

~~~
ajross
I'm pretty sure it's because of the store forwarding hardware. Intel CPUs can
return a successful load of a recently-stored address with low latency
(basically by keeping a special purpose cache of recent store addresses in the
pipeline and returning their values before the actual commit). But that means
that the "store" is viewed from the issuing CPU to have committed long before
loads that might have been filled by caches on the other CPUs. There's no way
to preserve both this optimization and a unified order.

------
tptacek
The idea that made this click for me 10 years or so ago was that distributed
systems express concurrent processes but can't reasonably use locks; model
your threads/processes/whatever as if they were nodes in a distributed system,
and use distributed system techniques to ensure things converge.

~~~
Retric
That's not exactly a shared memory model. A better example would be a static
look-up table for fixed game state information and a bunch of AI worker
threads.

With the correct protocol you can even write to that shared memory. For
example, a sudoku solver where each thread looks for contradictions and then
overwrites the shared state with its findings. This works because there are no
conflicting writes, dirty reads are OK, and there is no contention: all
threads write 0's and nobody writes 1's. You can go even further, where any
thread can flip the error flag if a contradiction shows up, or the solution
flag if a solution is found; again, it works because nothing clears those
flags.

PS: The article's author uses a less strict definition of lock-free where it's
OK to use locks for individual atomic operations like counters. But that
still creates performance issues at scale.

~~~
preshing
Actually, I thought it was pretty clear in the article that it's not ok to use
locks (mutexes) under the definition given. I tried pretty hard to use the
most widely accepted definition of lock-free, even exchanging e-mails with
Maurice Herlihy about it.

~~~
Retric
_In this sense, the lock in lock-free does not refer directly to mutexes, but
rather to the possibility of “locking up” in some way, whether it’s deadlock,
livelock — or even due to hypothetical thread scheduling decisions made by
your worst enemy._

Suppose you had an operation that was guaranteed to either add your input to
some location in memory or fail gracefully in some fixed amount of time. It's
first come, first served, and implemented at a hardware level. This fits the
definition of lock-free in that another process can't prevent your process
from finishing, and it works just fine if you want, say, a counter of the
number of moves examined by your AI thread: a thread can always keep an
internal counter and try again after the next batch goes through. However, you
still get into resource starvation issues depending on how it's used, and with
enough threads you encounter a wide range of issues.

Lock-free programming is not a binary condition so much as a hierarchy of
conditions, where various types of resource starvation (e.g. at the cache or
bus level) are perhaps the most stringent. Scale up to, say, a billion cores
and 100 billion threads, and many 'safe' operations create significant issues.

PS: Wikipedia uses Wait-freedom > Lock-freedom > Obstruction-freedom all under
<http://en.wikipedia.org/wiki/Lock_free>

~~~
preshing
Based on your comment, I should probably be more precise and say: "the
possibility of 'locking up' _the entire application_ in some way". I'll update
the post. There's nothing about this definition that prohibits a single thread
from starving.

~~~
scott_s
I think it's easier to phrase it in the positive sense: at least one thread is
always making progress towards completion.

------
rdtsc
There is also a niche research area of lock-free data structures and
algorithms.

On a more practical side, check out <http://www.liblfds.org/> -- it is a
library of lock-free data structures in C. (I am not the author.) I have
successfully used this library in some realtime projects.

~~~
sbahra
<http://concurrencykit.org/> is also available, used in several production
systems.

------
sbahra
A while back I linked to the following:
<http://concurrencykit.org/presentations/lockfree_introduction/#/>

Please make sure to navigate all the way down, before navigating to the right.
Press the space bar to see the full layout of the presentation.

------
rossjudson
Very good overview. I've built lock-free hash maps, and this article covers
the areas you need to be aware of. I'd also suggest looking up false sharing,
and also making sure that you really understand exactly how the memory
barriers are operating.

------
cheatercheater
Make sure to check out pi calculus for a different model of concurrency
without locks:
<http://mainisusuallyafunction.blogspot.com/2011/09/lambda-to-pi.html>

