
Making algorithms lock-free with Read-Copy Update (RCU) - ingve
http://the-paper-trail.org/blog/make-any-algorithm-lock-free-with-this-one-crazy-trick/
======
gpderetta
RCU is a very general solution to the concurrent reclamation problem, but it
hardly makes any algorithm lock-free.

Also I find the following troubling:

"[in the kernel] Even a concurrency scheme that nominally used spinlocks to
protect critical sections would be lock-free, because every thread would exit
their critical section in bounded time"

This ignores the possibility of deadlocks and livelocks. Also, the lock-free
definition requires that the system makes progress even if other threads are
halted; you can't just handwave the requirement away by claiming that your
threads never halt. And even in practice you can't guarantee that a buggy NMI
handler won't take over the CPU, effectively halting any thread.

edit: are never not halted -> never halt

~~~
vardump
Indeed, spinlocks most definitely aren't lock-free.

"rcu_read_lock" doesn't ever block (as in wait for a spinlock), right?
Sometimes you just need readers to never block and can pay a higher price for
other operations.
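
For reference, the reader side in the Linux kernel looks roughly like this
(the struct and its field are invented for illustration):

    struct foo {
        int value;
    };

    struct foo __rcu *global_foo;  /* shared pointer, swapped by writers */

    int read_foo_value(void)
    {
        struct foo *p;
        int v;

        rcu_read_lock();                  /* never blocks; typically just
                                             disables preemption */
        p = rcu_dereference(global_foo);  /* dependent load of the pointer */
        v = p->value;
        rcu_read_unlock();
        return v;
    }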

~~~
amelius
From the article:

> So as soon as a thread yields its CPU, it’s guaranteed to be out of its
> critical section.

Therefore, the yielding can be seen as a blocking operation, meaning that
readers do in fact block!

~~~
gpderetta
Yielding per se is not a blocking operation; otherwise no algorithm would be
non-blocking under preemptive scheduling. It could be considered a
potentially blocking operation if scheduling is exclusively cooperative.

~~~
amelius
The CPU is a resource too, for which there is contention just as for other
resources such as files. Therefore, at _every_ machine instruction you are
_implicitly_ performing a potentially blocking operation. (In a pre-emptive
environment.)

Unless you are in a critical section, in which case your thread has exclusive
control of the CPU. So the blocking may start again at the end of the
section.

I hope this makes it clear.

~~~
mtanski
On Linux you can guarantee (with some effort) that your process / thread will
be the only thing running on a core.

There's a boot parameter (isolcpus) that blacklists CPUs so that nothing is
scheduled on them by default. Then you use taskset to schedule your process
there. Finally, you can configure IRQ affinity so that the core doesn't
handle interrupts either. It's not 100% non-preemptive, but pretty close.

I believe there was work being done to make it even closer to 100%, but I
haven't followed along.
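
A sketch of the pinning step from C (core number 3 is just an example and
assumes the machine was booted with isolcpus=3):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to CPU 3. With isolcpus=3 on the kernel
     * command line and IRQ affinity steered away from that core,
     * almost nothing else will run there. */
    int pin_to_isolated_core(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(3, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }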

------
mtanski
For any HN readers in NYC who have to care about stuff like this (system
level)... you should come to ACM's Applicative 2016 conference. It's June 1st
and 2nd at NYU. [http://applicative.acm.org/](http://applicative.acm.org/)

Paul McKenney is going to be presenting. He is one of the inventors of RCU
and the author of Linux's RCU code. His talk is about RCU and write rates.

I'm really curious to hear that talk, so that's the primary reason I'm going.
But there are going to be other great systems / application talks too.

~~~
abiox
this seems really interesting, but i won't have the chance to attend - any
idea if there will be recordings?

~~~
chrisseaton
The ACM almost never records any talks. Not sure why. Maybe it's to allow
people to debate more freely.

~~~
scott_s
I attended Applicative 2015, and they did record the talks. For each talk,
there is a video link:
[http://applicative.acm.org/2015/applicative.acm.org/speakers...](http://applicative.acm.org/2015/applicative.acm.org/speakers.html)

However, it is true that most ACM conferences do not record talks. That's
probably just because recording talks is expensive, and the conference
organizers don't want to spend their budget on it.

------
derefr
The generalized form of this concept is that there are advantages in having a
non-preemptive execution model. You can implicitly do all sorts of crazy
things without synchronization primitives in a non-preemptive system, because
nobody will try to mess with any of the resources _you're_ messing with until
you yield execution, which you don't have to do until you're done.

The degenerate case of this is cooperative scheduling, which is painful in
about the same way that manual memory management is painful. But the hybrid
case, reduction scheduling à la Erlang, gets a lot of the same advantages
(nothing pre-empts a function in the middle of a loop body) without the
ability to forget to add yields: they happen at tail-call sites, which, for a
language where tail recursion is the only loop primitive, means your code
always runs for O(1) time before yielding.
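
A toy sketch of the reduction idea in C (the names and the budget constant
are invented; Erlang's real scheduler is considerably more involved):

    #include <stddef.h>

    #define REDUCTION_BUDGET 2000  /* invented figure; Erlang uses a
                                      similar per-slice budget */

    /* A task runs uninterrupted until its budget of "reductions"
     * (roughly, units of work) is spent, then returns to the
     * scheduler, so nothing observes its state mid-iteration. */
    struct task {
        int budget;
        size_t next;  /* resume point in the work list */
    };

    /* Returns nonzero if the task yielded with work remaining. */
    int run_task(struct task *t, int *items, size_t n)
    {
        while (t->next < n) {
            if (t->budget-- <= 0) {
                t->budget = REDUCTION_BUDGET;
                return 1;  /* yield: the scheduler will resume us later */
            }
            items[t->next] += 1;  /* one unit of work */
            t->next++;
        }
        return 0;  /* done */
    }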

------
javitury
This ugly son of a b____ is optimizing super hot algorithms and you are
f______ stupid. How?... Just read this free paper >>>

I wonder how long it will take until academic papers are published with
clickbait titles.

~~~
jeffwass
>I wonder how long it will take until academic papers are published with
clickbait titles.

Seriously. It's the exact opposite of an abstract, whose job is to give a
quick TL;DR summary so you can decide whether the paper is worth more of your
time.

~~~
Chris2048
Add more pictures, remove any set-theory notations, rewrite in "explain like
i'm 5" basic English, and people might actually start reading academic papers.

Startup idea?

~~~
mtanski
I agree that the language of most papers is needlessly obtuse. In fact I doubt
that reviewers understand the full concept when evaluating paper submissions.

Having said that, even if most papers were vastly simplified, I doubt that
many topics could be brought down to an ELI5 level.

------
ww520
Two things.

1\. A misplaced comma adds substantial confusion: "during a critical section,
a thread may not block, or be pre-empted by the scheduler." This sounds like
the thread may not block, or, if it blocks, the scheduler would preempt it,
which doesn't make sense with regard to the readers using the critical
section. It should read: "during a critical section, a thread may not block
and it will NOT be preempted by the scheduler." That just means a thread in a
critical section is guaranteed to run to the end of the critical section
without other threads preempting it. It owns the processor until the end of
the critical section.

Then it makes sense. When the readers exit the critical section, they are
done dealing with the old shared data. The writer's scheduled thread won't
preempt the reader threads in their critical sections, and when it runs, it
means the readers have finished.

2\. A critical section is a form of lock. I don't see how it can be claimed
to be lock-free. Maybe it should say that critical sections help avoid
inter-processor locking.

------
simscitizen
You can do a variant of this in userspace as well:

[https://github.com/opensource-apple/objc4/blob/master/runtim...](https://github.com/opensource-apple/objc4/blob/master/runtime/objc-cache.mm#L37-L53)

Basically, issue a syscall that acquires the PC of all threads. When none of
the PCs is in the critical section, you can garbage-collect the old objects.

~~~
nkurz
Hmm, if I'm understanding right, rather than passively checking whether any of
the other threads are in the critical section, it's actually forcing all the
threads to undergo a context switch so that the PC (instruction pointer) can
be read. Since the context switch is prevented until the thread is out of the
critical section, the actual value of the PC is irrelevant as long as the
call succeeds:

    
    
      // Mach call that reads out a target thread's saved register
      // state; __rip is the saved program counter.
      #define PC_SENTINEL  1
      x86_thread_state64_t state;  // declaration elided in the excerpt
      unsigned int count = x86_THREAD_STATE64_COUNT;
      kern_return_t okay = thread_get_state(thread, x86_THREAD_STATE64, (thread_state_t)&state, &count);
      return (okay == KERN_SUCCESS) ? state.__rip : PC_SENTINEL;
    

[https://github.com/opensource-apple/objc4/blob/master/runtim...](https://github.com/opensource-apple/objc4/blob/master/runtime/objc-cache.mm#L639-L641)

If you have a lot of threads running, this could get really expensive. I
presume it works, but it doesn't seem like an efficient approach unless your
"writes" are extremely rare relative to the number of reads and the number of
threads.

I haven't been able to find the source for thread_get_state(), though. I don't
think there is any way to get the current PC from a core without an interrupt?
I presume it at least reads the PC for sleeping threads without needing to
wake them?

~~~
simscitizen
It absolutely can get expensive with lots of threads running. And you are also
right that this code ends up making each thread reach a synchronization point.
Luckily, Objective-C targets clients that generally don't have a bazillion
threads running, so in practice it's not a problem.

There was talk at some point of adding a single syscall that got all the PCs
of all the threads at once to cut down on the overhead, but it looks like that
still hasn't happened.

What's going on here is something like this:

1\. objc_msgSend is Obj-C's method dispatcher, and it depends on method caches
to make method lookup faster.

2\. Sometimes the method caches become out of date. For instance, maybe the
method cache filled up and the Obj-C runtime needs to allocate a larger method
cache.

3\. This old method cache has to be GC'd at some point, so it's added to a
freelist.

4\. This GC function (the "write" you're talking about) for these caches runs
once the size of outdated method caches grows beyond a certain threshold
(garbage_threshold in the code), so it shouldn't run too often. The GC
function works by checking whether any thread is currently within
objc_msgSend. If no thread is in objc_msgSend, then the runtime is sure that
none of the method caches on the free list is in use.

5\. It is definitely optimized for reads. The "read" in this case is literally
a method dispatch so it happens all the time, and it's important for the read
to be lock-free.
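
A sketch of the check in step 4, with invented helper names (the real logic
lives in objc-cache.mm, linked upthread):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical helpers: get_thread_pcs() stands in for the Mach
     * thread_get_state() loop quoted upthread, and the two markers
     * bracketing objc_msgSend are invented for illustration. */
    extern size_t get_thread_pcs(uintptr_t *pcs, size_t max);
    extern const char objc_msgSend_begin[], objc_msgSend_end[];

    /* Old caches on the freelist may be freed only if no thread's PC
     * currently falls inside objc_msgSend. */
    static bool safe_to_free_old_caches(void)
    {
        uintptr_t pcs[256];
        size_t n = get_thread_pcs(pcs, 256);

        for (size_t i = 0; i < n; i++) {
            if (pcs[i] >= (uintptr_t)objc_msgSend_begin &&
                pcs[i] <  (uintptr_t)objc_msgSend_end)
                return false;  /* someone may hold a cache pointer */
        }
        return true;
    }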

------
PaulRobinson
Click-bait titles will never, ever get my upvote, even if the content is OK.

Most of the web is a mess. Let's not make this corner like that stuff over
there.

~~~
limaoscarjuliet
The title seems like an obvious joke though.

~~~
bigredhdl
I'm sure it is. That phrase is like nails on a chalkboard for me.

------
ricksplat
_> The general pattern is that you can prepare an update to a data structure,
and then use a machine primitive to atomically install the update by changing
a pointer._

Interesting. Analogous to the "two-phase" commit used by Oracle, as opposed to
the global lock used by SQL Server.

(Sorry if 2PC is ubiquitous now, or if SQL Server doesn't use that locking any
more; I haven't worked with these in a while.)

This is "lock free" certainly, but it still requires atomic updates, which is
another concurrency primitive. Sorry, not an expert, but doesn't this still
require some kind of locking under the hood? More efficient than explicit
locking, certainly, but locking nonetheless.

EDIT: I just re-read. So he's proposing that rather than locking, you just
prevent a thread from being context-switched by the scheduler. So you're kind
of stepping back from strictly _preemptive_ multitasking and introducing a
_cooperative_ element.

So you can't really make "any" algorithm lock-free, only ones where the
scheduler lets you put things on hold, and where your code has the execution
privileges to do that.

I think he's speaking specifically about _Linux_ system code, but you have to
delve into the details of TFA before that becomes apparent.

It seems fairly obvious, then, that with a more cooperative multitasking
model, locking isn't really as necessary.

~~~
mtanski
> Interesting. Analogous to the "two phase" commit used by Oracle, as opposed
> to the global lock used by SQL server.

I don't think 2PC is the right analogy here (it's really about distributed
commit). If you want a database analogy, it's MVCC. For example, LMDB allows
many readers and one writer; a write finishes by atomically swapping the root
page, and all future readers see the updates via that root page (analogous to
the pointer to the new version).

In fact, the way LMDB is implemented is essentially RCU, with a reader table
and GCing of old pages once the oldest reader with access to a page goes away.

Yes, the author is talking about (Linux) kernel space where you have a lot
more control over the environment so it's "easier" to use / build various
concurrency primitives. Simple example: it's possible in the kernel to be in a
spinlock loop with interrupts disabled and the holder will not be pre-empted.
You can't make guarantees like that in user space.

~~~
gpderetta
The problem with RCU is that the name is extremely misleading.

The name Read-Copy-Update describes the general way to concurrently update a
shared data structure: make a local copy of the parts subject to change,
modify the local copy, then atomically substitute the original with the
updated copy.
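
For example, a minimal sketch of that pattern in C11 (names invented; note
that reclaiming the old node is deliberately left out, because that is
exactly the hard part):

    #include <stdatomic.h>
    #include <stdlib.h>

    struct node { int value; };

    /* Shared pointer to the current version of the data. */
    static _Atomic(struct node *) shared;

    /* Read-copy-update: copy the current node, modify the copy,
     * then atomically swing the shared pointer over to it. */
    void update_value(int new_value)
    {
        struct node *copy = malloc(sizeof *copy);
        struct node *old = atomic_load(&shared);

        do {
            if (old)
                *copy = *old;        /* read + copy */
            copy->value = new_value; /* update the private copy */
            /* The CAS publishes the copy; on failure, "old" is
             * reloaded and we retry against the newer version. */
        } while (!atomic_compare_exchange_weak(&shared, &old, copy));

        /* "old" must NOT be freed here: a concurrent reader may
         * still be using it. Deciding when it is safe to free is the
         * reclamation problem that RCU (and hazard pointers, etc.)
         * solve. */
    }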

This is absolutely not exclusive to RCU; in fact it's pretty much what any
lock-free update of a data structure looks like (and even non-lock-free ones;
see persistent data structures or MVCC).

A problem that lock-free algorithms have is that they need a way to dispose of
the now-stale original node, which might still be accessed concurrently by
other readers. There are many ways to handle the disposal: for example hazard
pointers, pass-the-buck, and full GC (this includes shared pointers).

RCU is one of these disposal algorithms: it uses epoch detection plus the
contract that an accessor thread won't hold a reference across an epoch.
Implementations of RCU can be very efficient for read-mostly data structures
because the reader side does not need expensive memory barriers (contrast
with the store-load barrier required for hazard pointer updates even on the
read side): a dependent load (the infamous load_consume) is enough.
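
A toy illustration of that epoch contract in C11 (all names invented;
memory-ordering details are glossed over, and real RCU makes the read side
far cheaper than even this):

    #include <stdatomic.h>

    #define MAX_THREADS 8

    atomic_uint global_epoch = 1;
    atomic_uint reader_epoch[MAX_THREADS]; /* 0 = not currently reading */

    /* Reader: record the epoch you entered in; clear it on exit.
     * The contract: never hold a reference across an epoch. */
    void read_enter(int tid)
    {
        atomic_store(&reader_epoch[tid], atomic_load(&global_epoch));
    }

    void read_exit(int tid)
    {
        atomic_store(&reader_epoch[tid], 0);
    }

    /* Writer: after unlinking the old node, advance the epoch and
     * wait until no reader is still inside an older epoch. By the
     * contract, the old node is then safe to free. */
    void wait_for_readers(void)
    {
        unsigned next = atomic_fetch_add(&global_epoch, 1) + 1;
        for (int i = 0; i < MAX_THREADS; i++)
            while (atomic_load(&reader_epoch[i]) != 0 &&
                   atomic_load(&reader_epoch[i]) < next)
                ; /* spin until this reader leaves its old epoch */
    }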

------
vardump
In kernel space, anything that helps to achieve (even limited) wait-freedom
is rather welcome. You often don't have the luxury of long-lasting locks.
Short-lived (at most microseconds) spinlocks are usually the most important
concurrency primitive you have.

So it's nice to read about techniques that can be implemented in kernel space.
Inability to do context switches often works against you there [1].

[1]: Many usermode wait-based constructs cannot be used, because they
require yielding or pre-emption. You can't yield if you can't schedule (or
block).

------
prewett
Does "lock-free algorithm" just mean "won't deadlock"? It sure sounds like a
standard locking algorithm to me, what with rcu_lock() and talk of critical
sections. In fact, it seems like this is similar to having a read lock and a
write lock. Am I missing something?

~~~
firethief
[https://en.m.wikipedia.org/wiki/Non-blocking_algorithm#Lock-...](https://en.m.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom)

The stronger guarantee you're expecting is probably "wait-free".

------
bnastic
Is RCU still patent encumbered, though?

~~~
noselasd
Yes, see e.g.
[https://www.kernel.org/doc/Documentation/RCU/rcu.txt](https://www.kernel.org/doc/Documentation/RCU/rcu.txt)
and
[https://www.kernel.org/doc/Documentation/RCU/RTFP.txt](https://www.kernel.org/doc/Documentation/RCU/RTFP.txt)

~~~
gpderetta
I believe the basic RCU patent expired a few years ago. Some extensions might
still be under patent. And no, I'm not going to do a Google search to figure
out what's patented or not (and probably neither should you).

------
graycat
Apparently RCU abbreviates "read-copy-update".

~~~
nkurz
Yes, it's an abbreviation for "read-copy-update", but unfortunately it doesn't
really mean that at all. Rather, it's shorthand for something like "Now that
I've written the updated version of the data to a new location and atomically
switched a pointer so that all new readers will use it, how will I know when
all previous readers of the old version have finished so that I can reclaim
the space that the old copy was using?" It's a terrible name, but an almost
magically efficient approach for certain problems.
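
In Linux kernel terms, that question is what the writer side has to answer; a
rough sketch (struct and names invented, and writers assumed to be serialized
by an external lock):

    struct foo { int value; };

    struct foo __rcu *global_foo;

    /* Writer: publish the new version, wait for pre-existing
     * readers, then reclaim the old one. */
    void update_foo(struct foo *new_foo)
    {
        struct foo *old = rcu_dereference_protected(global_foo, 1);

        rcu_assign_pointer(global_foo, new_foo); /* atomic publish */
        synchronize_rcu();  /* blocks until every reader that could
                               still see "old" has left its RCU
                               read-side critical section */
        kfree(old);         /* now safe to reclaim the old copy */
    }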

------
nialv7
I think by 'lock-free' this author means 'deadlock free'

------
huhtenberg
Crazy trick. I'll tell you what a crazy trick is.

It's when you are sitting there, optimizing some bulk throughput rate in a
router firmware, generally minding your own business, and then a guy pops in
and says that he just sped things up by a factor of 10. You get up, follow
him to his cubicle and, lo and behold, it _is_ 10x faster when blasted with a
SmartBit stream. You ask him how he managed to achieve such a remarkable feat
and he says: I just profiled the code, found a bottleneck and worked around it:

    
    
        // some_random_mutex.lock();
        
           ...
    
        // some_random_mutex.unlock();
    

Now THAT was a crazy trick.

Edit: to clarify - the lock was there for a reason. The whole thing whoopsed
in an instant outside that specific test.

~~~
MereInterest
Now I'm curious. Was the mutex actually necessary, or was the performance
increase purchased at the cost of race conditions?

~~~
huhtenberg
Yeah, the mutex was necessary, and it was pretty bloody obvious from looking
at a few lines of context around those calls.

