Nice article, though I think that any intro-level material on lock-free programming should always include a "don't try this at home for anything important" warning. Until you have some experience with this stuff you will almost certainly make mistakes, but these mistakes might only manifest themselves as crashes in extremely rare circumstances.
I wrote my first lock-free code in 2004 based on reading some papers by Maged Michael from IBM. I wrote a lock-free FIFO in PowerPC assembly, and was convinced it was safe and robust. When I emailed Maged about it, he pointed out that if a thread was suspended on one specific instruction and some specific memory was unmapped before it could run again, the program could crash. I was amazed; I had thought hard about this algorithm, but had completely missed that possibility.
Some other specific notes about the article:
> Basically, if some part of your program satisfies the following conditions, then that part can rightfully be considered lock-free.
> Processors such as PowerPC and ARM expose load-link/store-conditional instructions, which effectively allow you to implement your own RMW primitive at a low level, though this is not often done.
One benefit of load-linked/store-conditional (often abbreviated LL/SC) is that it avoids the ABA problem (http://en.wikipedia.org/wiki/ABA_problem). In practice this doesn't matter that much since x86 doesn't support LL/SC, but I just think it's an interesting factoid to know.
> For instance, PowerPC and ARM processors can change the order of memory stores relative to the instructions themselves, but the x86/64 family of processors from Intel and AMD cannot.
(I've edited my reply here since my original assertion was incorrect). It's true that x86/64 won't reorder stores (see http://en.wikipedia.org/wiki/Memory_ordering for details), but it will reorder loads, so memory barriers are still required in some situations. However, I believe that the atomic instructions ("lock cmpxchg" and "lock xadd") imply full barriers on x86.
I totally agree with you about the dangers of lock-free programming, and that every programmer will certainly make mistakes. I feel strongly enough about it that I actually wrote several paragraphs about the importance of stress testing lock-free algorithms, which I eventually decided to break off into a separate post... I might get around to publishing that, we'll see. We're not the first to feel this way :)
> It is true about x86/64, at least for the most common, non-SSE instructions and normal, non-write-combined memory
To embellish further and avoid jargon for those who didn't parse this: it basically means "it is true, except for the interfaces which explicitly document where it's not".
Intel CPUs have had an "uncached/write-combining" mode for years (originally specified through special MTRR registers, then in the Page Attribute Table, and most recently in the "non-temporal" MOV instructions in SSE) which officially relaxes the requirements for store ordering. The reason for having two sets of semantics is performance: when dealing with uncached memory (typically memory used by another device, e.g. a video framebuffer), combining a bunch of writes to the same DRAM burst unit (or set of such units) is much faster than doing each word-sized I/O individually.
Yes, sorry I misspoke initially about stores being reordered, but it does still appear that stores can be reordered after loads, see: http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fen... . (Your Intel manual reference is surely a more authoritative source but I'm not able to do a deep dive into it at the moment).
There's no contradiction. Normally, the x86/64 memory model is quite strong, preserving LoadLoad, LoadStore, and StoreStore ordering. But it doesn't preserve StoreLoad ordering; e.g., stores can be reordered after loads, as you point out. As I understand it, that's mainly because there's a limit to how quickly each store can be seen by other processors.
I'm pretty sure it's because of the store forwarding hardware. Intel CPUs can return a successful load of a recently-stored address with low latency (basically by keeping a special purpose cache of recent store addresses in the pipeline and returning their values before the actual commit). But that means that the "store" is viewed from the issuing CPU to have committed long before loads that might have been filled by caches on the other CPUs. There's no way to preserve both this optimization and a unified order.
To support your first point, I implemented Maged's lock-free allocator so we could compare against it (http://www.scott-a-s.com/files/michael.tar.gz), and I never got it working correctly on Itanium. I spent weeks trying to figure out what was wrong. I eventually concluded that I did not understand the Itanium architecture and the algorithms used in the allocator well enough to be confident I could fix it - and that it wasn't worth improving my understanding of either. (I asked Maged if he had any experience with his allocator on Itanium, and he did not.)
The idea that made this click for me 10 years or so ago was that distributed systems express concurrent processes but can't reasonably use locks; model your threads/processes/whatever as if they were nodes in a distributed system, and use distributed system techniques to ensure things converge.
That's not exactly a shared memory model. A better example would be a static look-up table for fixed game state information and a bunch of AI worker threads.
With the correct protocol you can even write to that shared memory. For example, a sudoku solver where each thread looks for contradictions and then overwrites the shared state with its findings. This works because dirty reads are OK and there is no contention: all threads write 0's and nobody writes 1's. You can go even further and let any thread flip an error flag if a contradiction shows up, or a solution flag if a solution is found, but again it works because nothing ever clears those flags.
PS: The article's author uses a less strict definition of lock-free where it's OK to use locks for individual atomic operations like counters. But that still creates performance issues at scale.
Actually, I thought it was pretty clear in the article that it's not ok to use locks (mutexes) under the definition given. I tried pretty hard to use the most widely accepted definition of lock-free, even exchanging e-mails with Maurice Herlihy about it.
In this sense, the lock in lock-free does not refer directly to mutexes, but rather to the possibility of “locking up” in some way, whether it’s deadlock, livelock — or even due to hypothetical thread scheduling decisions made by your worst enemy.
Suppose you had an operation that was guaranteed to either add your input to some location in memory or fail gracefully in some fixed amount of time. It's first-come, first-served and implemented at the hardware level. This fits the definition of lock-free in that another process can't prevent your process from finishing, and it works just fine if you want, say, a counter of the number of moves examined by your AI thread: a thread can always keep an internal counter and retry after the next batch goes through. However, you still get resource starvation issues depending on how it's used, and with enough threads you encounter a wide range of issues.
Lock-free programming is not a binary condition so much as a hierarchy of conditions, where various types of resource starvation (e.g., at the cache or bus level) are perhaps the most stringent. Scale up to, say, a billion cores and 100 billion threads, and many 'safe' operations create significant issues.
Based on your comment, I should probably be more precise and say: "the possibility of 'locking up' the entire application in some way". I'll update the post. There's nothing about this definition that prohibits a single thread from starving.
> The article's author uses a less strict definition of lock-free where it's OK to use locks for individual atomic operations like counters.
The author's definition does not allow one to use locks to make individual operations atomic. A process could always crash after obtaining the lock, but before releasing it, causing the entire application to become stuck.
Lock free algorithms still use primitives like memory barriers. A memory barrier in a shared memory system is expensive, but not nearly as difficult as a memory barrier in a distributed system.
Additionally, distributed systems not only experience failures of individual components, but they're often in a state where you can't reliably detect failures (e.g., tell a failed node apart from a node that's slow, or from a node that you temporarily can't talk to while other nodes can).
It's also possible to think of it as a continuum: for example, failure detection is simpler in a distributed system connected via Infiniband than in a system connected via commodity ethernet; locality matters in NUMA CPUs and so on. The idea that a commodity multi-core machine has some properties of distributed system is not without merit, but it isn't particularly useful or actionable as a working model.
There is also a niche research area of lock-free data structures and algorithms.
On a more practical side, check out http://www.liblfds.org/ -- it is a library of lock-free data structures in C. (I am not the author.) I have successfully used this library in some realtime projects.
Very good overview. I've built lock-free hash maps, and this article covers the areas you need to be aware of. I'd also suggest looking up false sharing, and also making sure that you really understand exactly how the memory barriers are operating.