

Lock-free Data Structures: Atomicity and Atomic Primitives - adamnemecek
http://kukuruku.co/hub/cpp/lock-free-data-structures-basics-atomicity-and-atomic-primitives

======
jberryman
> It’s quite difficult, if possible at all, to contrive an algorithm of a
> lock-free container which uses reads/writes only (I don’t know of such data
> structures for an arbitrary number of threads).

I don't know either, but Peterson's Algorithm is a mind-melting and elegant
way to achieve mutual exclusion for two threads using only reads and writes
(and, in practice, some barriers), and it generalizes to N threads.

http://en.wikipedia.org/wiki/Peterson%27s_algorithm
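For the two-thread case, the algorithm can be sketched as follows (a hypothetical Java rendering; the class and method names are my own, and Java's `volatile` stands in for the barriers the algorithm needs). Note the busy-wait, which is also why the replies below are right that this is blocking, not lock-free:

```java
// Peterson's algorithm for two threads, sketched in Java. The volatile
// fields supply the sequentially consistent reads/writes (the barriers)
// the algorithm depends on; plain fields would not be correct.
public class Peterson {
    private volatile boolean flag0, flag1; // "I want to enter"
    private volatile int turn;             // who must wait on a tie
    private int counter;                   // plain field, protected by the lock

    void lock(int me) {
        int other = 1 - me;
        if (me == 0) flag0 = true; else flag1 = true; // announce intent
        turn = other;                                 // defer to the other thread
        // busy-wait while the other thread wants in and it is our turn to wait
        while ((me == 0 ? flag1 : flag0) && turn == other) { }
    }

    void unlock(int me) {
        if (me == 0) flag0 = false; else flag1 = false;
    }

    static int run() {
        Peterson p = new Peterson();
        Thread t0 = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) { p.lock(0); p.counter++; p.unlock(0); }
        });
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) { p.lock(1); p.counter++; p.unlock(1); }
        });
        t0.start(); t1.start();
        try { t0.join(); t1.join(); }
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return p.counter; // 200000 iff mutual exclusion held
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```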

~~~
bandwitch
Peterson's Algorithm isn't lock-free; it's a blocking algorithm.

~~~
jberryman
Hmm, are you sure that's correct? I believe the "progress" and "bounded
waiting" properties here

[http://en.wikipedia.org/wiki/Peterson%27s_algorithm#Progress](http://en.wikipedia.org/wiki/Peterson%27s_algorithm#Progress)

are enough to guarantee system-wide progress. Can you elaborate?

Edit: Nevermind. So of course we'd need the critical section and thread-
coordination mechanism to be a single atomic operation for this to be a lock-
free algorithm. I should RTFA.

~~~
virtue3
... if you look at the pseudocode, the "busy wait" comment (with its busy
while loop...) and the "critical section" comment should also be very large
flags that this is a blocking operation.

------
nkurz
Wow, that was a much better article than I was expecting. I'm eager to read
the others in the series.

 _Thus, for an Intel Xeon 3.06 GHz (2005 model) an atomic increment costs
about 400 nops, and CAS 850–1000 nops. For the IBM Power4 1.45 GHz: 180 nops
for an atomic increment and 250 nops for CAS. The measurements are quite old,
and processor architecture has made several steps forward since then, but the
order of magnitude is the same, I guess._

Does anyone know if his guess is correct? What are the numbers like for modern
processors?
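One rough way to check on whatever machine is at hand is a micro-benchmark (a hypothetical Java sketch; the class name is mine, these are uncontended single-threaded numbers only, and JIT warm-up and `nanoTime` resolution add noise):

```java
import java.util.concurrent.atomic.AtomicLong;

// Rough per-op cost of an atomic increment (LOCK XADD on x86) vs a
// CAS retry loop. Uncontended best case: contention from other cores
// makes both far slower, as the coherence traffic dominates.
public class AtomicCost {
    static final int N = 10_000_000;

    static long timeIncrement(AtomicLong a) {
        long t = System.nanoTime();
        for (int i = 0; i < N; i++) a.getAndIncrement();
        return (System.nanoTime() - t) / N; // ns per op, roughly
    }

    static long timeCas(AtomicLong a) {
        long t = System.nanoTime();
        for (int i = 0; i < N; i++) {
            long v;
            do { v = a.get(); } while (!a.compareAndSet(v, v + 1));
        }
        return (System.nanoTime() - t) / N; // ns per op, roughly
    }

    public static void main(String[] args) {
        AtomicLong a = new AtomicLong();
        System.out.println("atomic inc: ~" + timeIncrement(a) + " ns/op");
        System.out.println("cas loop:   ~" + timeCas(a) + " ns/op");
    }
}
```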

~~~
arielweisberg
It's a bad idea to treat these operations as constants. They get slower as
traffic and contention increase. Different cache topologies have different
timings, and different thread-scheduling arrangements do too.

I find it helpful to think in terms of cache lines and cache coherence as that
is the biggest factor in performance on cache coherent CPUs when you want to
move a lot of data between cores.

A CAS is no worse than a store in that scenario because they both generate the
same coherence traffic. One key difference is that the store can drain in the
background while the CAS will pause processing.

Once you start thinking of the CPU as an asynchronous instruction processor,
it gets more interesting. For instance, in Java a volatile write will stall
the CPU until the write completes, despite the fact that in most cases you
don't actually care. But you can use Atomic* with lazySet, or Unsafe, to make
the store non-blocking while still getting the necessary protection against
reordering.
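A minimal sketch of that difference, assuming an `AtomicLong` (`lazySet` is the pre-VarHandles name for what is essentially a release store):

```java
import java.util.concurrent.atomic.AtomicLong;

public class LazySetDemo {
    public static void main(String[] args) {
        AtomicLong a = new AtomicLong();

        a.set(1);      // volatile write: full fence; the CPU waits for the
                       // store buffer to drain before proceeding past it
        a.lazySet(2);  // release store: still cannot reorder with earlier
                       // writes, but drains in the background

        System.out.println(a.get()); // prints 2
    }
}
```

Same ordering guarantee with respect to prior writes, but only `set` pays for the full fence on the hot path.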

What I am getting at is that thinking about a CPU in terms of instructions is
the wrong way to go about it. Instructions have to actually execute somehow
and that is where the rubber meets the road. A single core can simultaneously
be performing several arithmetic operations, draining stores, fetching and
pre-fetching cache lines, and several other things I probably haven't learned
about. What hurts is causing stalls in this process or introducing
dependencies that prevent things from moving along quickly.

~~~
nkurz
I agree with what you are saying, but am still interested in how the numbers
have changed over time. This paper seems like a reasonable update, as well as
a good introduction to the variables involved:

[http://sigops.org/sosp/sosp13/papers/p33-david.pdf](http://sigops.org/sosp/sosp13/papers/p33-david.pdf)

Only having skimmed it, I think that Table 2 makes the same point that you
did: on x64, apart from instruction dependency, there is very little
difference in latency between a CAS and a store.

------
gengstrand
I don't know why lock-free concurrency control is so fashionable these days.
I realize that deadlock and livelock are hard to debug, but STM and
software-implemented CAS are CPU hogs that get worse as load increases.
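The behavior being described shows up in the classic optimistic CAS loop: every failed compareAndSet is work thrown away and retried, and failures grow with contention (a hypothetical Java sketch; the class and counter names are mine):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasRetry {
    static final AtomicInteger value = new AtomicInteger();
    static final AtomicInteger retries = new AtomicInteger();

    // Optimistic increment: read, compute, CAS, retry on conflict.
    static void increment() {
        while (true) {
            int v = value.get();
            if (value.compareAndSet(v, v + 1)) return;
            retries.incrementAndGet(); // wasted CPU; grows with contention
        }
    }

    static int run() {
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) increment();
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); }
            catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return value.get();
    }

    public static void main(String[] args) {
        System.out.println("value:   " + run());          // 400000 regardless
        System.out.println("retries: " + retries.get());  // load-dependent
    }
}
```

The result is always correct, but the retry count (and the CPU burned producing it) is a function of how hard the threads collide.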

~~~
sitkack
Doesn't it depend on the overlap and granularity of the work? The concurrency
mechanism should be treated as a failsafe on conflict; reducing conflict in
the algorithm should be the primary goal.

------
chaotic-good
This is a translation from Russian. The original article is here:
[http://habrahabr.ru/company/ifree/blog/195948/](http://habrahabr.ru/company/ifree/blog/195948/)

