
Why is memory reclamation so important for lock-free algorithms? - qznc
https://concurrencyfreaks.blogspot.com/2017/08/why-is-memory-reclamation-so-important.html
======
rusanu
I once read an excellent book, The Art of Multiprocessor Programming [0].
All the examples in the book are in Java. After reading it, when thinking
about some of the solutions and algorithms presented, I quickly concluded
that deploying them in C/C++ would be an order of magnitude more complex,
because of memory reclamation. So yes, I think the OP makes a good point.

[0] [https://www.amazon.com/Art-Multiprocessor-Programming-Revised-Reprint/dp/0123973376](https://www.amazon.com/Art-Multiprocessor-Programming-Revised-Reprint/dp/0123973376)

~~~
morecoffee
I picked this book up from the library and it is definitely a useful book.
About 2/3rds of the book is actual runnable example code.

Also agree totally on Java making concurrency way easier with GC. Russ Cox
describes one of these problems here
[https://research.swtch.com/lockfree](https://research.swtch.com/lockfree)

------
haberman
I've found that a lot of the tricky edge cases in lock free programming can be
reduced to the following case:

Imagine that a thread reads a pointer, then immediately (before it
dereferences it) gets scheduled out and frozen for arbitrarily long.

To keep the thread from crashing or performing incorrect logic when it wakes
up, you have to make sure this pointer remains valid for arbitrarily long.

You have to accommodate the fact that a thread can get frozen basically
forever.
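Concretely, the hazard is that the load of the pointer and the dereference are two separate steps, and the thread can be frozen between them. A toy C sketch (names are mine, not from any particular implementation):

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct { int value; } node_t;

_Atomic(node_t *) shared = NULL;

/* Reader: the load and the dereference are two separate steps.
 * The thread can be descheduled for arbitrarily long between them. */
int reader(void) {
    node_t *n = atomic_load(&shared);   /* step 1: read the pointer */
    /* <-- the thread may be frozen here for seconds, minutes, ... */
    return n ? n->value : -1;           /* step 2: dereference (UB if
                                           another thread freed n) */
}

/* Writer: swaps the node out and frees it immediately -- unsafe if a
 * reader is still parked between its two steps. */
void writer(node_t *replacement) {
    node_t *old = atomic_exchange(&shared, replacement);
    free(old);  /* use-after-free hazard for a frozen reader */
}
```

Every reclamation scheme discussed below is, one way or another, a method for making that `free(old)` wait until no reader can still be parked in the gap.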

~~~
smegel
Why would you ever free an object if something else might be pointing at it?

~~~
xenadu02
>Why would you ever free an object if something else might be pointing at it?

Hi, welcome to programming /s

This is the crux of the entire issue for high-performance multi-threaded
programs: how to tell when memory can safely be freed without introducing
expensive synchronization?

The simple solutions are obvious:

1. Don't share memory
2. Use a lock

The real tricks are in forms of reference counting or garbage collection that
avoid taking locks as much as possible. For really high performance you also
need to avoid atomic operations and fences as much as possible. A "pause the
world" garbage collector is easy (relatively) to get right and you can be
confident it will work. Doing a concurrent pauseless GC is another matter
entirely. Inserting a memory fence on every write is not great for
performance.

For both GC and reference counting schemes, weak references are an additional
wrinkle. Swift's original weak reference scheme solved this by inverting
control and maintaining parallel reference counts: when the last strong
reference was released, it ran deinit and set a flag in the object header but
otherwise left the memory allocated. Only once every weak reference was
touched (or itself deallocated) was the memory actually freed. That trades
greater high-water memory usage for performance. (I don't know off-hand why
the implementation was later changed.)
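A toy model of the scheme as described above (not Swift's actual runtime code; the names, layout, and single-threaded simplification are mine):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Parallel strong/weak counts: the last strong release runs deinit and
 * flags the header, but the allocation survives until the last weak
 * reference has been touched or dropped. Not race-free as written;
 * illustration only. */
typedef struct {
    _Atomic int strong;
    _Atomic int weak;
    _Atomic bool dead;    /* set once deinit has run */
    int payload;
} obj_t;

static void maybe_free(obj_t *o) {
    if (atomic_load(&o->weak) == 0 && atomic_load(&o->dead))
        free(o);
}

void strong_release(obj_t *o) {
    if (atomic_fetch_sub(&o->strong, 1) == 1) {
        /* deinit would run here */
        atomic_store(&o->dead, true);
        maybe_free(o);    /* freed only if no weak refs remain */
    }
}

/* Touching a weak ref after death observes nil, drops its count,
 * and possibly triggers the real deallocation. */
obj_t *weak_load(obj_t *o) {
    if (atomic_load(&o->dead)) {
        atomic_fetch_sub(&o->weak, 1);
        maybe_free(o);
        return NULL;      /* the weak reference reads as nil */
    }
    return o;
}
```

The high-water-mark cost is visible here: a dead-but-weakly-referenced object keeps its whole allocation, not just a side table entry.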

~~~
smegel
> 1. Don't share memory 2. Use a lock

3. Have a concept of ownership and make sure all your code follows some basic
rules?

E.g. if a consumer reads a pointer from a lock-free queue, it is then
responsible for freeing that memory. OP seems to be saying to let some other
thread free it on some racy assumption that the consumer will be done with it
within N nanoseconds.

That's not how I would code it.

~~~
haberman
The problem is _inside_ the lock-free queue. How do you remove something from
a lock-free queue? Usually by reading a pointer to the queue node, then doing
a CAS to remove the node from the queue. But another thread can remove that
same node and free it before you can do the CAS!

If you're just _using_ a lock-free queue, then the hardest part of the problem
has been solved for you already. But the person who _wrote_ the lock-free
queue had to think about and solve these really tricky problems.
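The same gap is easiest to see in a Treiber-style lock-free stack, which is structurally the simplest case (a sketch; this is the textbook pattern, not any specific library's code):

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct node {
    struct node *next;
    int payload;
} node_t;

_Atomic(node_t *) top = NULL;

void push(int v) {
    node_t *n = malloc(sizeof *n);
    n->payload = v;
    do { n->next = atomic_load(&top); }
    while (!atomic_compare_exchange_weak(&top, &n->next, n));
}

/* Naive pop: between the load of `head` and the CAS, another thread
 * may pop and free() the very same node, so the read of head->next
 * can touch freed memory. This is exactly the gap that hazard
 * pointers, epochs, and GC exist to close. */
int naive_pop(int *out) {
    node_t *head, *next;
    do {
        head = atomic_load(&top);
        if (!head) return 0;
        next = head->next;          /* may read freed memory! */
    } while (!atomic_compare_exchange_weak(&top, &head, next));
    *out = head->payload;
    free(head);
    return 1;
}
```

Single-threaded this works fine; the bug only exists once a second thread can run `free(head)` inside another thread's load/CAS window.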

~~~
smegel
So how did they solve it?

~~~
haberman
There are a number of techniques for solving this:

    1. use a GC'd language
    2. hazard pointers
    3. epoch-based GC
    4. collect nodes when a user destroys the structure
       (user is required to guarantee that destruction is
       single-threaded).

There are probably others.
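For a feel of option 2, here is a heavily simplified hazard-pointer sketch (one shared object, one hazard slot; real implementations keep per-thread slot arrays and retire lists, and this version assumes sequentially consistent atomics):

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct { int value; } node_t;

_Atomic(node_t *) shared;
_Atomic(node_t *) hazard;   /* one published hazard slot */

/* Reader: publish the pointer as hazardous, then re-check that it is
 * still the current one before using it. */
node_t *acquire(void) {
    node_t *p;
    do {
        p = atomic_load(&shared);
        atomic_store(&hazard, p);          /* announce intent */
    } while (atomic_load(&shared) != p);   /* validate after announcing */
    return p;                              /* safe until release() */
}

void release(void) { atomic_store(&hazard, NULL); }

/* Reclaimer: may free a retired node only if no hazard slot points at
 * it; otherwise it must defer (real schemes keep a retire list and
 * rescan later). */
int try_free(node_t *retired) {
    if (atomic_load(&hazard) == retired)
        return 0;            /* still in use -- defer */
    free(retired);
    return 1;
}
```

The announce-then-validate loop is the heart of the technique: once validation succeeds, any reclaimer scanning the hazard slots is guaranteed to see the publication before it frees the node.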

------
emerged
I came to use an epoch based reclamation approach where a token is passed
between all threads in the system. Since this chain involves only relaxed
read/writes, it has practically zero overhead. The only downside is the
relatively high latency between retiring and recycling the memory. That and
all participating threads must occasionally process their epoch.

I hadn't seen this exact approach used elsewhere, all the other epoch
approaches were more complex and required atomic operations at certain points.
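A sketch of the shape of such a scheme (a min-scan cousin of the explicit token described above; the real scheme passes a token around a ring, but the reclamation condition is the same: every thread has visibly advanced past the retire epoch):

```c
#include <stdatomic.h>

#define NTHREADS 4

/* Each thread owns one counter slot, written only by itself with
 * relaxed stores, so the fast path has essentially zero overhead. */
_Atomic unsigned long slot[NTHREADS];

/* Called periodically by thread `id` at a point where it holds no
 * references into the shared structure. */
void on_epoch(int id) {
    atomic_fetch_add_explicit(&slot[id], 1, memory_order_relaxed);
}

/* Memory retired when the minimum epoch was `e` may be recycled once
 * every thread has advanced past `e` -- i.e. the token has completed
 * a full circuit since the retire. */
unsigned long min_epoch(void) {
    unsigned long m = atomic_load_explicit(&slot[0], memory_order_relaxed);
    for (int i = 1; i < NTHREADS; i++) {
        unsigned long v = atomic_load_explicit(&slot[i], memory_order_relaxed);
        if (v < m) m = v;
    }
    return m;
}
```

The two downsides mentioned above fall straight out of this structure: reclamation latency is bounded below by a full circuit, and a single thread that never calls `on_epoch` stalls reclamation for everyone.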

~~~
sbahra
This is quiescent-state-based reclamation; other implementations exist that
are cheap. I have seen the technique you're talking about used in K42 (after
I had also thought I'd invented something novel, when I used the same
approach as you :-P).

~~~
cwzwarich
Doesn't the strict definition of QSBR require a thread to keep no references
to shared data when it is in a quiescent state, whereas epoch-based
reclamation allows threads to simply retain no references to older versions
of shared data when advancing their epoch?

~~~
sbahra
I'm referring to the FIFO implementation. On the definition of QSBR though,
frankly, I think the term gets overloaded in some of the more recent
literature in this area. Both EBR and QSBR require that no hazardous
references get carried over across "protected sections" (EBR is explicit read-
side protected sections and in QSBR, across a quiescent point).

------
jnwatson
The last time I had to write a lock-free algorithm (in this case, a multi-
writer, multi-reader queue), the memory reclamation strategy was
straightforward:

Make _2_ lock-free queues. Pre-allocate all the memory into message-sized
chunks, and post them to the "upwards" queue. A writer would read/consume from
the upwards queue, write into the received buffer, and then post the same
buffer in the _other_ queue. Everything is passed by reference, so it is zero-
copy.

I guess I don't understand the reason for all the complexity. Perhaps for more
complicated algorithms than queues?
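A sketch of the two-queue recycling idea, simplified to single-producer/single-consumer rings (the comment describes multi-writer/multi-reader queues, which need a stronger queue algorithm; all names here are mine):

```c
#include <stdatomic.h>
#include <string.h>

#define SLOTS 8
#define MSG_SIZE 64

typedef struct {
    void *buf[SLOTS];
    _Atomic unsigned head, tail;   /* SPSC lock-free ring */
} ring_t;

static int ring_push(ring_t *r, void *p) {
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    if (t - atomic_load_explicit(&r->head, memory_order_acquire) == SLOTS)
        return 0;                       /* full */
    r->buf[t % SLOTS] = p;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return 1;
}

static void *ring_pop(ring_t *r) {
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    if (h == atomic_load_explicit(&r->tail, memory_order_acquire))
        return NULL;                    /* empty */
    void *p = r->buf[h % SLOTS];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return p;
}

ring_t upward, downward;                /* free buffers / filled buffers */
static char pool[SLOTS][MSG_SIZE];      /* all memory preallocated once */

void init(void) {
    for (int i = 0; i < SLOTS; i++)
        ring_push(&upward, pool[i]);    /* seed the free-buffer queue */
}

/* Writer: take a free buffer, fill it, post it -- zero-copy, no free() */
int produce(const char *msg) {
    char *b = ring_pop(&upward);
    if (!b) return 0;
    strncpy(b, msg, MSG_SIZE - 1);
    b[MSG_SIZE - 1] = '\0';
    return ring_push(&downward, b);
}

/* Reader: consume a message, then return the buffer to the free queue */
char *consume(void) {
    char *b = ring_pop(&downward);
    if (b) ring_push(&upward, b);       /* "reclaim" = recycle */
    return b;
}
```

Because no buffer is ever passed to `free()`, the reclamation question never arises: "freeing" a buffer is just posting it back on the upward queue, where the queue's own ordering makes the handoff safe.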

~~~
barrkel
If you preallocate memory, you're not reclaiming it, are you?

~~~
jnwatson
Sure, I'm reclaiming it for future writers.

------
convolvatron
huh. I expected this to be a discussion about the ABA problem, probably would
have been worth at least a mention

edit ref:
[https://en.wikipedia.org/wiki/ABA_problem](https://en.wikipedia.org/wiki/ABA_problem)
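For the curious, the classic mitigation is a tagged (versioned) CAS: pack a generation counter next to the value so a location that went A → B → A no longer compares equal. A sketch using a 32-bit index plus 32-bit tag in one 64-bit word (a common trick on hardware without double-width CAS; names are mine):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Pack a 32-bit node index and a 32-bit generation tag into one
 * 64-bit word so a single plain CAS covers both. */
_Atomic uint64_t top;

static uint64_t pack(uint32_t idx, uint32_t tag) {
    return ((uint64_t)tag << 32) | idx;
}

/* CAS that succeeds only if BOTH index and tag still match. A node
 * that was popped and pushed back (A -> B -> A) carries a new tag,
 * so a stale CAS fails instead of silently corrupting the structure
 * -- a plain CAS on just the index would have succeeded. */
int tagged_cas(uint64_t expected, uint32_t new_idx) {
    uint32_t tag = (uint32_t)(expected >> 32);
    uint64_t desired = pack(new_idx, tag + 1);
    return atomic_compare_exchange_strong(&top, &expected, desired);
}
```

Note that tagging only fixes the CAS-succeeds-wrongly symptom; it does nothing for the use-after-free side of the problem, which is why reclamation schemes are still needed.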

------
ianopolous
They claim there are no lock-free or wait-free GCs. But Azul have had a
pauseless one for the best part of a decade:
[http://www.artima.com/lejava/articles/azul_pauseless_gc.html](http://www.artima.com/lejava/articles/azul_pauseless_gc.html)

~~~
bitcharmer
C4 (the GC implementation you are referring to) is not truly pauseless. It
does an enormous amount of work with the assistance of the "user" threads and
has a concept of local (instead of VM-wide) safepoints, but it still (rarely)
pauses.

[http://dl.acm.org/citation.cfm?id=1993491&dl=ACM&coll=DL](http://dl.acm.org/citation.cfm?id=1993491&dl=ACM&coll=DL)

[http://static.usenix.org/events/vee05/full_papers/p46-click.pdf](http://static.usenix.org/events/vee05/full_papers/p46-click.pdf)

~~~
ianopolous
So they are wrong to repeatedly call it pauseless? Or are you making a
distinction between their implementation and their algorithm? Or are there
different definitions of pauseless?

~~~
cwzwarich
To quote their earlier paper:

"HotSpot supports a notion of GC safepoints, code locations where we have
precise knowledge about register and stack locations [1]. The hardware
supports a fast cooperative preemption mechanism via interrupts that are taken
only on user-selected instructions, allowing us to rapidly stop individual
threads only at safepoints. Variants of some common instructions (e.g.,
backwards branches, function entries) are flagged as safepoints and will check
for a pending per-CPU safepoint interrupt. If a safepoint interrupt is pending
the CPU will take an exception and the OS will call into a user-mode
safepoint-trap handler. The running thread, being at a known safepoint, will
then save its state in some convenient format and call the OS to yield. When
the OS wants to preempt a normal Java thread, it sets this bit and briefly
waits for the thread to yield. If the thread doesn't report back in a timely
fashion it gets preempted as normal.

The result of this behavior is that nearly all stopped threads are at GC
safepoints already. Achieving a global safepoint, a Stop-The-World (STW)
pause, is much faster than patch-and-roll-forward schemes [1] and is without
the runtime cost normally associated with software polling schemes. While the
algorithm we present has no STW pauses, our current implementation does. Hence
it's still useful to have a fast stopping mechanism."

To quote their later paper:

"The C4 algorithm is entirely concurrent, i.e. no global safepoints are
required. We also differentiate between the notion of a global safepoint,
where all the mutator threads are stopped, and a checkpoint, where
individual threads pass through a barrier function. Checkpoints have a much
lower impact on application responsiveness for obvious reasons. Pizlo et al
[16] also uses a similar mechanism that they refer to as ragged safepoints.

The current C4 implementation, however, does include some global safepoints at
phase change locations. The amount of work done in these safepoints is
generally independent of the size of the heap, the rate of allocation, and
various other key metrics, and on modern hardware these GC phase change
safepoints have already been engineered down to sub-millisecond levels. At
this point, application observable jitter and responsiveness artifacts are
dominated by much larger contributors, such as CPU scheduling and thread
contention. The engineering efforts involved in further reducing or
eliminating GC safepoint impacts will likely produce little or no observable
result."

~~~
ianopolous
Thank you. My reading of that is that the algorithm is theoretically
pauseless, but the implementation has a fixed sub millisecond pause that
doesn't scale with heap size or allocation rate. So they could make the
implementation truly pauseless, but it wouldn't make any practical impact
because of other factors like CPU scheduling jitter dominating. So to all
intents and purposes it is pauseless.

------
none_to_remain
I recall explaining to a guy that the lock free hashmap was very easy to use
as long as you never free()'d anything you put in it, in which case I had some
literature and code on Hazard Pointers for him to take a look at...

------
sbahra
Great write-up! A few notes worth mentioning (IMO) below:

Garbage collection: This is only true in the absence of garbage collection.
If you're paying the garbage collection cost to begin with, this is not an
issue. Also, note things such as
[http://web.mit.edu/willtor/www/res/threadscan-spaa-15.pdf](http://web.mit.edu/willtor/www/res/threadscan-spaa-15.pdf)

This only applies to dynamic lock-free data structures: Or, data structures
requiring memory allocation. If you're using bounded buffers and don't require
memory allocation, this isn't an issue.

Taxonomy: Not all passive schemes are quiescent-state-based. In QSBR,
quiescent points are a first-class entity while that is _not_ the case in EBR
(you have only 3 distinct states). Absent extensions you are unable to
differentiate one quiescent point from another, which has real implications on
being able to implement things like deferred / non-blocking reclamation
efficiently. There are some advantages to this: a while ago, Paul Khuong
contributed a reference counting scheme to epoch reclamation allowing for
overlapping protected sections (bounded epoch makes reference counting
practical, something you can't do with unbounded quiescent points). It's
pretty great and note, it doesn't require a strict notion of quiescence!
You'll find this in Concurrency Kit.

It is also worth noting that HP, etc... (pointer-based schemes) can be used
to implement QSBR.

For these reasons, I prefer to classify these techniques into two primary
families (based on the semantics of the interface rather than the
implementation): "passive schemes" and "active schemes". Passive schemes do
not require cooperation from the algorithms requiring safe memory reclamation
(EBR, QSBR, etc...) while "active schemes" do (HP, PTB, etc... in their
textbook form require modification to the algorithm).

On blocking for passive schemes: It is worth noting that QSBR, EBR and other
"passive schemes" do have "non-blocking" (informally) interfaces (call_rcu,
text-book EBR utilizes limbo lists, etc...) such that writers do not have to
block on the fast path, but of course, there is the cost of memory
accumulating until a grace period is detected (so, not non-blocking in the
formal sense but if you've the memory to spare and sufficient forward
progress, becomes a non-issue). In my implementations, I typically use a
dedicated thread for forward progress / epoch detection, ensuring the writer
has forward progress.

Hazard pointers give you much finer granularity and the lock-free progress
guarantee, because the scheme is pointer-based (tied to the forward progress
guarantees of the underlying algorithm).

Under-appreciated recent development: An interesting thing to note here is
there are schemes such as
[https://www2.seas.gwu.edu/~seotest/publications/eurosys16ps.pdf](https://www2.seas.gwu.edu/~seotest/publications/eurosys16ps.pdf)
which do not necessitate the same heavy-weight barriers in the presence of
invariant TSC+TSO and with a sufficiently high granularity TSC, can provide
the same forward progress guarantees as hazard pointers.

On hazard pointers being slow: One interesting thing to note is hazard
pointers can also be used to implement "passive" schemes such as proxy
collection, to give similar performance properties as EBR and QSBR.

On real world implementations: It is worth mentioning
[http://concurrencykit.org](http://concurrencykit.org) :-P

~~~
hyperpape
I'm having trouble understanding your point about GC. The post points out that
GC solves some of these problems, at the expense of creating waits. Do you
think there's something wrong with the way it describes the issue?

~~~
wbl
Not GP, but I think what he's getting at is that the GC cost is already paid
and doesn't get paid again for reclaiming memory with lock-free algorithms.

------
rootw0rm
the best way to do it: don't require memory reclamation in your design.

I coded a lock-free IPC interface from standard C to C#. Luckily in my case
the data I was moving originated on the stack. So all I do is copy from stack
to shared IPC buffer and obviously keep using that same buffer until exit.

HOWEVER, if you're competent enough to handle raw threads and lock free design
to begin with, you shouldn't have the slightest issue with queuing up your
calls to free(). My workstation is old, but I still have 6 cores/12 threads @
4.2 GHz. That's more than enough headroom to have a dedicated GC thread that
has zero effect on performance, and a barely noticeable effect on memory.

Proper concurrency is not that difficult.

~~~
anarazel
The difficulty isn't calling free; it's knowing when it's safe to do so
without incurring significant overhead.

------
otabdeveloper1
> Btw, even if you have a GC, no known GC is lock-free (much less wait-free)

There are lots of lock-free memory allocators for C (and C++).

They're not at all hard to write; I've written them myself.

