

Lockfree protection of data structures that are frequently read - panagios
https://www.arangodb.com/2015/08/lockfree-protection-of-data-structures-that-are-frequently-read/

======
haberman
I believe the author may have re-invented RCU:
[https://en.wikipedia.org/wiki/Read-copy-
update](https://en.wikipedia.org/wiki/Read-copy-update)

This code follows the basic structure of RCU. The traditional RCU primitives
correspond to DataProtector primitives like so:

    
    
        - prot.use() is rcu_read_lock()
        - prot.unUse() is rcu_read_unlock()
        - prot.scan() is synchronize_rcu()
    
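To make the correspondence concrete, here is a minimal sketch of the
classic RCU publish/retire pattern expressed with these primitives
(DataProtector and its methods are from the article; `current`, `Node`,
and the function names are mine):

    #include <atomic>
    
    struct Node { int value; };
    
    DataProtector prot;            // the article's class
    std::atomic<Node*> current;    // the protected pointer (my name)
    
    // Reader -- the rcu_read_lock()/rcu_read_unlock() equivalent:
    int readValue() {
      prot.use();                  // enter read-side critical section
      Node* n = current.load(std::memory_order_acquire);
      int v = n->value;            // safe: scan() cannot finish while
      prot.unUse();                //   we are between use() and unUse()
      return v;
    }
    
    // Writer -- the synchronize_rcu() equivalent:
    void replace(Node* newNode) {
      Node* old = current.exchange(newNode, std::memory_order_acq_rel);
      prot.scan();                 // wait for all pre-existing readers
      delete old;                  // no reader can still hold `old`
    }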

RCU was originally a kernel-side concept, but it has since expanded to user-
space also, see:

[http://liburcu.org/](http://liburcu.org/)

[https://lwn.net/Articles/573424/](https://lwn.net/Articles/573424/)

liburcu is clearly not as simple as what is described in the article. One
reason is that they are using pre-C11 C, which has no standardized atomics or
barriers. It is also much more aggressive about reducing overheads: this
article's code incurs a test plus a (possibly contended) atomic
increment/decrement pair for every read critical section. liburcu offers
several variants of the implementation, but none of them are as expensive as
this.
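
For comparison, here is roughly what the same pattern looks like with
liburcu's default flavor (a sketch from memory; see the liburcu docs
for the exact headers and flavor-specific registration details):

    #include <urcu.h>            // liburcu, default flavor
    
    void readerThread() {
      rcu_register_thread();     // each reader thread registers once
      rcu_read_lock();           // cheap: no atomic read-modify-write
      // ... dereference RCU-protected pointers here ...
      rcu_read_unlock();
      rcu_unregister_thread();
    }
    
    void writer() {
      // ... publish the new version, unlink the old one ...
      synchronize_rcu();         // wait out all pre-existing readers
      // ... now safe to free the old version ...
    }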

I said "possibly contended" even though the article's code endeavors to
avoid contention. It can still run into contention, though, when there
are more threads than the compile-time template parameter of the
DataProtector class: the id space wraps around, multiple threads are
assigned the same slot, and those threads then contend on the same
counter. The further you exceed the limit, the more contention you get.
This is an unfortunate drawback of the code's simplicity.
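
To make the slot mechanics concrete, here is my reconstruction of the
reader bookkeeping the article describes (the member names follow the
article; the exact code is guesswork):

    #include <atomic>
    
    template <int Nr>                        // compile-time slot count
    class DataProtectorSketch {
      std::atomic<int> _counters[Nr] = {};   // one counter per slot
      std::atomic<int> _last{0};             // next id to hand out
      static thread_local int _mySlot;       // this thread's slot, -1 if unset
    
     public:
      void use() {                    // read-side entry
        if (_mySlot < 0)              // the "test" paid on every entry
          _mySlot = _last++ % Nr;     // id space wraps: with more than Nr
                                      // threads, slots are shared and the
        _counters[_mySlot]++;         // increment here becomes contended
      }
      void unUse() { _counters[_mySlot]--; }  // read-side exit
    };
    
    template <int Nr>
    thread_local int DataProtectorSketch<Nr>::_mySlot = -1;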

Also, it seems like there is a bug in the code: _mySlot is
thread-local, but _last is an instance variable. Two DataProtectors
(which will have independent _last values) could therefore assign the
same _mySlot to different threads, causing unnecessary contention even
when you don't exceed the thread limit. It seems like _last should be
static (global).
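
Concretely, the fix would be a one-line change to the sketch above
(assuming that sketch is faithful to the article):

    template <int Nr>
    class DataProtectorSketch {
      // ... as above, except the id source is shared by all instances:
      static std::atomic<int> _last;
      // ...
    };
    
    template <int Nr>
    std::atomic<int> DataProtectorSketch<Nr>::_last{0};

With _last shared, two different threads can no longer be handed the
same slot (until the id space wraps), no matter how many DataProtector
instances exist.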

I don't mean this commentary to come off negatively. I think there is a _lot_
of value in the way they have managed to factor this so that the DataProtector
class is short and simple. Lots of lock-free algorithms have been known for a
while, but in many cases they aren't practical because they aren't factored in
such a way that they have convenient APIs. SMR/Hazard Pointers is a great
example of this. I would love to see improvements to this class to address
these problems while still remaining simple.

------
k4st
A nice simplification would be to use the current CPU number as your
ID. That eliminates the dependence on thread-local storage and, with
high probability, avoids collisions between threads whose IDs are
equivalent modulo N.

You could use an instruction like `RDTSCP` to extract the CPU number
(plain `RDTSC` only reads the timestamp counter). There are also
efficient ways of getting at it through glibc/pthreads, such as
`sched_getcpu()`.
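
For instance (sched_getcpu() is a real glibc call; the RDTSCP path
assumes the Linux convention of storing the CPU number in the low bits
of IA32_TSC_AUX):

    #include <sched.h>           // glibc: sched_getcpu()
    #ifdef __x86_64__
    #include <x86intrin.h>       // __rdtscp
    #endif
    
    int currentCpu() {
    #ifdef __x86_64__
      unsigned aux;
      __rdtscp(&aux);            // reads TSC and IA32_TSC_AUX; Linux puts
      return aux & 0xfff;        // the cpu number in the low 12 bits
    #else
      return sched_getcpu();     // portable glibc wrapper around getcpu(2)
    #endif
    }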

~~~
haberman
> A nice simplification would be to use the current CPU number as your ID.

I don't think that works, unfortunately. The thread could be
rescheduled onto a different CPU in the middle of the read-side critical
section, so when the critical section is exited, it decrements a
different CPU's counter. That leaves one counter stuck at +1 and another
at -1, and Scan(), which waits until every counter is zero, will never
complete unless another critical section happens to migrate in the
reverse direction.

~~~
k4st
That is fine ;-) Sum up the counters.

~~~
haberman
I don't think that works. You can't get a consistent view of the counters
without a lock. Consider the following scenario:

    
    
           CPU1                   CPU2
           ----                   ----
                                  T1 rcu_read_lock (+1)
           read counter (0)
           T2 rcu_read_lock (+1)
                                  T2 rcu_read_unlock (-1)
                                  read counter (0)
    

We'll find that the counters sum to zero, even though T1 is still
inside its read-side critical section.

~~~
k4st
Indeed! I think I missed that the original algorithm doesn't wait for
all slots to be zero at the same instant, just for each slot to pass
through zero individually. In that way it is just like waiting for each
reader to go through a period of quiescence. This is a lot like the
version of RCU where the writer shuttles itself across CPUs, using
syscall returns as a proxy for knowing that a particular CPU is not in
any read-side critical section.
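
For reference, my guess at the scan() loop that makes this work, as a
member of the sketch posted above (again a reconstruction, not the
article's code):

    #include <thread>
    
    // Wait for each slot to pass through zero *individually*; no
    // consistent global snapshot of all counters is ever taken.
    void scan() {
      for (int s = 0; s < Nr; ++s) {
        while (_counters[s].load() != 0)
          std::this_thread::yield();   // slot s still has active readers
      }
    }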

If there were a nice way to swap out the counters, then summing would
work; unfortunately, that introduces an even bigger race :-(

Thanks for the correction!

