
Common Pitfalls in Writing Lock-Free Algorithms - nikita
http://developers.memsql.com/blog/common-pitfalls-in-writing-lock-free-algorithms/
======
sounds
Probably the most useful observation in the article is hidden right before the
Results / Conclusion...

Any optimization should be measured. Otherwise, how do you know whether all your
hard work is worth it? In the author's example, the fastest implementation was
actually _not_ lock-free; it merely used a better spinlock than std::mutex.

It's a good idea to use existing implementations mostly so that you can
measure and (hopefully without too much trouble) change out the implementation
whenever you need to. Different hardware likely has different performance, so
measure often!
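
For illustration, here is a rough sketch of what "measure and swap out the
implementation" can look like: a tiny harness templated on the stack type, so
the same workload can be timed against different implementations. The push/pop
interface and anything you plug in for Stack are placeholders, not the
article's actual classes.

    #include <chrono>
    #include <thread>
    #include <vector>
    
    // Time nThreads threads doing opsPerThread push/pop pairs against any stack
    // type with a matching interface; swap the template argument to compare
    // implementations under the same workload.
    template <typename Stack>
    double measure(Stack& s, int nThreads, int opsPerThread) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (int t = 0; t < nThreads; ++t)
            workers.emplace_back([&] {
                for (int i = 0; i < opsPerThread; ++i) {
                    s.push(i);
                    s.pop();
                }
            });
        for (auto& w : workers) w.join();
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        return elapsed.count();  // seconds for this run
    }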

~~~
pacaro
Like the old carpenter's adage "Measure twice, cut once"

~~~
tlarkworthy
We recently bought a laser cutter and realised that adage is based on the
assumption that the cutting takes the longest time. Now if we want a _really
good_ fit we do a binary search on the cut dimensions.

Our new laser cutter adage is "cut loads and don't bother measuring because it
doesn't take into account the size of your cut"

~~~
stouset
The adage is based on the fact that you can't "undo" a cut that's too short. If
you need a 6' board and cut it to 5'10", you've probably just wasted a perfectly
good piece of lumber.

------
ajtulloch
If you found this interesting, you may also like the really clean
implementation of a single-producer single-consumer lock-free queue in
Facebook's folly C++ library.

[https://github.com/facebook/folly/blob/master/folly/Producer...](https://github.com/facebook/folly/blob/master/folly/ProducerConsumerQueue.h)
(code)

[https://github.com/facebook/folly/blob/master/folly/docs/Pro...](https://github.com/facebook/folly/blob/master/folly/docs/ProducerConsumerQueue.md)
(docs)
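
For a feel of the underlying idea, here is a minimal sketch of a single-
producer/single-consumer ring buffer - the same general technique, not folly's
actual code (class and member names here are made up for the example). Because
only one thread ever advances the write index and only one thread ever advances
the read index, no CAS loop is needed at all:

    #include <atomic>
    #include <cstddef>
    #include <vector>
    
    template <typename T>
    class SpscQueue {
    public:
        explicit SpscQueue(size_t capacity)
            : buf_(capacity + 1), readIdx_(0), writeIdx_(0) {}
    
        bool push(const T& v) {                       // producer thread only
            size_t w = writeIdx_.load(std::memory_order_relaxed);
            size_t next = (w + 1) % buf_.size();
            if (next == readIdx_.load(std::memory_order_acquire))
                return false;                         // full
            buf_[w] = v;
            writeIdx_.store(next, std::memory_order_release);
            return true;
        }
    
        bool pop(T& out) {                            // consumer thread only
            size_t r = readIdx_.load(std::memory_order_relaxed);
            if (r == writeIdx_.load(std::memory_order_acquire))
                return false;                         // empty
            out = buf_[r];
            readIdx_.store((r + 1) % buf_.size(), std::memory_order_release);
            return true;
        }
    
    private:
        std::vector<T> buf_;
        std::atomic<size_t> readIdx_, writeIdx_;
    };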

------
aray
First off, I'd call that barrier-less, which can be more important than lock-
less, with the important distinction that _it will break on ARM and 'weakly
consistent' memory model architectures_.

Android's SMP Primer does a really great job of covering this as well, with a
lot of race examples, so I recommend checking it out as a complement:
<http://developer.android.com/training/articles/smp.html>

The author of this article (and probably most readers) will be concerned with
x86, which has a "processor-consistent" memory model. What isn't mentioned is
that ARM and other architectures have an even weaker ("weakly consistent")
memory model. The SMP guide above describes these in detail and shows how the
resulting races can be observed:
[http://developer.android.com/training/articles/smp.html#data...](http://developer.android.com/training/articles/smp.html#datamem_barriers)

If you are dealing with low-level problems like this--either in libraries,
operating systems, or systems tools and utilities--design for "weakly
consistent" memory models, or _please_ leave a huge warning to future
developers that it may have subtle and difficult race conditions on those
platforms.
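
To make the failure mode concrete, here is a hedged sketch of the classic
message-passing race (variable names are made up for the example). On x86's
stronger model the naive version tends to work by accident; on ARM the reader
can see the flag set but still read a stale payload unless release/acquire
ordering (i.e. the barriers the SMP primer talks about) is used:

    #include <atomic>
    
    int payload = 0;
    std::atomic<bool> ready{false};
    
    void writer() {
        payload = 42;                                  // ordinary store
        ready.store(true, std::memory_order_release);  // publish: emits a barrier on ARM
    }
    
    void reader() {
        while (!ready.load(std::memory_order_acquire)) // pairs with the release store
            ;                                          // spin until published
        // payload is guaranteed to read 42 here; with relaxed ordering on a
        // weakly consistent machine it might not be
    }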

~~~
oofabz
CompareAndSwap implementations use a memory barrier on ARM. That should be
sufficient to guarantee correctness.

------
CoolGuySteve
It may be lock free, but it's not very wait free.

Maybe I'm just tired, but OS X and Windows already have atomic LIFOs that are
much better:
[http://www.opensource.apple.com/source/Libc/Libc-763.13/i386...](http://www.opensource.apple.com/source/Libc/Libc-763.13/i386/sys/OSAtomic.s)

(scroll to bottom, see declaration in /usr/include/libkern/OSAtomic.h)

[http://msdn.microsoft.com/en-us/library/windows/desktop/ms68...](http://msdn.microsoft.com/en-us/library/windows/desktop/ms684121\(v=vs.85\).aspx)

For linux, there's llist.h in the kernel, but I don't know of a userland
version and I haven't used it.

~~~
matthavener
The article demos a list, where threads are contending to append to the top of
it. Unlike a list, there is not much contention on a FIFO unless it's typically
empty.

~~~
CoolGuySteve
You are correct, I meant lifo but wrote fifo.

OSAtomicEnqueue/OSAtomicDequeue and InterlockedSList, by virtue of only being
able to push and pop the head, are effectively LIFOs.
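
For context, the general shape these primitives expose is a Treiber-style stack
where everything revolves around CAS-ing a single head pointer. A rough sketch
of the push side (illustrative only, not the OSAtomic or SList implementation):

    #include <atomic>
    
    struct Node {
        Node* next;
        int   value;
    };
    
    std::atomic<Node*> head{nullptr};
    
    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        // compare_exchange_weak reloads head into n->next on failure, so just retry
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed))
            ;
    }
    // pop() is where it gets hard: naively swinging head to head->next brings back
    // the ABA and memory-reclamation problems the article is about.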

------
frozenport
Another problem with lock free algorithms: sometimes locks are cheap.

This is certainly the case on NVIDIA's CUDA platform, where atomic operations
incur a 50% penalty - which can sound like a lot until you consider that the
locked region may be only 20% of the runtime. In practice this means a lock-free
version runs in 100 seconds while one with primitive locks runs in 110 seconds
(and the latter was easy to write, debug, and maintain).

------
starmole
Hm. Maybe I am missing a lot here, so I would be happy for people to correct
me. But in my book a linked list is close to a worst case to make "lock free".
For me, the first step in making something lock-free is that all the frequent
operations on a data structure only need one item to change - a fixed-size
stack in this case. Once you have multiple things to keep consistent, the old-
style lock is a necessity! You might invent one that works slightly better
(like his sleep(250)..) under some benchmarks. But to me lock free means an
algorithm where both producers and consumers only ever have to access one
single atomic value.

~~~
johnduui
"to me lock free means an algorithm where both producers and consumers have to
only ever access one single atomic value"

Well it may mean that to you, but unfortunately it doesn't mean that to anyone
else.

Lock free has a precise definition. If you don't know what it is, why get
involved in this discussion?

~~~
starmole
As I put in the first sentence, I might be missing a lot there. Could you
please point me to the definition you are referring to? I probably did not
express myself very well. And I already learned a bit by thinking about this
differently. You (and everybody else, so you are right!) define lock free as
dead lock free. Which can be done with a carefully chosen lock. I was always
thinking of lock free as a pure performance optimization which admittedly is
different to the algorithmic property. Am I on the right track? If yes, it
should answer your question why I got involved in the discussion: to learn.

~~~
edsrzf
Dmitry Vyukov keeps a great resource for learning about lock-free algorithms
(and other interesting things) here:
[http://www.1024cores.net/home/lock-free-algorithms/introduct...](http://www.1024cores.net/home/lock-free-algorithms/introduction)

~~~
dannywoodz
Isn't the very first example in that article flawed?

    void decrement_reference_counter(rc_base* obj)
    {
        if (0 == atomic_decrement(obj->rc))
            delete obj;
    }

This is classic test-then-act, isn't it? What happens if another thread bumps
obj->rc after the comparison to 0, but before the deletion? That other thread
could find itself referring to a suddenly-deleted object, or am I missing
something?

------
jmilkbal
We can get high performance coupled with safer ways to deal with locks. You
don't need to be afraid of locks so much as find something that handles them
well along with whatever else you need. In Ada there are "protected" objects
and types, a protected object being either a singleton or an instantiation of a
protected type. Protected types are one way to synchronize threads and provide
access control to data safely. There are protected entries with queues,
protected functions which can only read the protected data, and protected
procedures which can write to the protected data but are guaranteed to be
granted access only one at a time. There are plenty of great features of the
built-in tasking to go along with protected objects, like selective entries,
time-delayed aborts, and a ton of other stuff. Ada really hits the mark for
locking. If you don't want locks but still want synchronization, there's also
the Ada rendezvous. Good stuff, and even more robust with the new 2012 standard.

------
cjh_
I cannot upvote this enough, thank you for the very well written article.

------
Afforess
I found the observation about new/delete not necessarily being safe
interesting. I've been writing Java too long: Java has no delete (garbage
collection), and presumably new would have been creating a linked list entry
object, which is safe.

~~~
taeric
But could cause a garbage collection cycle, right? If so, I think this still
counts in the "not safe" category.

~~~
swift
Generally garbage collection algorithms more sophisticated than reference
counting can handle cycles just fine. Certainly this is the case for Java.

Edit: Or perhaps you mean, could cause the garbage collector to run? I don't
see why that'd be an issue either.

~~~
taeric
That is what I meant. And having a garbage collector run could potentially
freeze the JVM, causing a performance hit. (I mean, it is basically a global
lock on the entire system, right?)

Basically, my reading of "safe" meant "would not block." A garbage collection
cycle stops everything, so is effectively a block, right?

~~~
swift
I read "safe" differently than you did; now I'm on the same page. With or
without a GC, if you want to guarantee safety in that sense no matter what,
you're right - you just can't allocate memory. So in the worst case running in
a managed environment doesn't buy you much, but often the guarantees you need
are a bit looser than that, and in that case managed memory can be really
beneficial.

------
andrewcooke
[(thanks) you should probably make sure your main site is readable if Adobe
Typekit is blocked - it is blocked by default with Ghostery]

------
johnbellone
Multithreaded algorithms have always been on my list of things to master in
programming. This is an excellent article.

------
raverbashing
Well, lots of good insights, but I don't get this:

"We currently use boost::atomic for this. Our compiler is gcc 4.6, which
currently supports std::atomic, but its implementation is much less efficient
than boost’s. "

Really, never replace trust in a compiler with trust in a library.

They're getting down to all the nitty-gritty details and then saying 'a memory
barrier is expensive'. It seems contradictory to me.

GCC will get better, and other compilers will get better as well, but adding
another level of indirection (a library) is not usually a good idea.

Oh, and maybe replace the sleep with a sched_yield (rough sketch below).

And even better, don't have hundreds of threads contending on a single data
structure, especially a constrained one like a stack - that's probably the
easiest way to get speed gains.
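
For what it's worth, a rough sketch of the sched_yield suggestion - spin briefly,
then yield the CPU instead of sleeping for a fixed interval (the class name and
the spin threshold are made up for the example; this is not the article's code):

    #include <atomic>
    #include <sched.h>
    
    class YieldingSpinlock {
    public:
        void lock() {
            int spins = 0;
            while (flag_.test_and_set(std::memory_order_acquire)) {
                if (++spins > 100) {   // arbitrary threshold, tune by measuring
                    sched_yield();     // give up the time slice instead of sleeping
                    spins = 0;
                }
            }
        }
        void unlock() { flag_.clear(std::memory_order_release); }
    private:
        std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
    };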

