
Biased Reference Counting: Minimizing atomic operations in GC (2018) [pdf] - luu
http://iacoma.cs.uiuc.edu/iacoma-papers/pact18.pdf
======
NightMKoder
This is pretty cool! As they mention in the paper, this is similar to biased
locking in Java, where “synchronized” blocks only ever get promoted to a full
mutex when a second thread tries to lock one. It’s the same idea here, except
with a counter.
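To make the analogy concrete, here is a minimal sketch of the split-counter idea: the owning thread updates a plain counter with ordinary loads and stores, and every other thread falls back to an atomic counter. The names and the release logic are my simplification, not the paper's (real BRC merges the two counters with an explicit handshake rather than the racy read shown here):

```cpp
#include <atomic>
#include <thread>

struct BrcObject {
    std::thread::id owner_tid;   // thread the counter is biased toward
    int biased = 1;              // non-atomic count, owner-only
    std::atomic<int> shared{0};  // atomic count for all other threads

    BrcObject() : owner_tid(std::this_thread::get_id()) {}

    void retain() {
        if (std::this_thread::get_id() == owner_tid)
            ++biased;                                        // fast path: no atomics
        else
            shared.fetch_add(1, std::memory_order_relaxed);  // slow path
    }

    // Returns true when both counters have dropped to zero.
    bool release() {
        if (std::this_thread::get_id() == owner_tid) {
            --biased;
            return biased == 0 && shared.load() == 0;
        }
        int prev = shared.fetch_sub(1, std::memory_order_acq_rel);
        return prev == 1 && biased == 0;  // racy read of biased; real BRC
                                          // resolves this with a merge step
    }
};
```

The point of the bias is that a single-threaded retain/release pair never touches an atomic, which is exactly the biased-lock trick applied to a counter.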

This made me think of two other approaches I wonder if anyone has tried. The
first is using something like per-CPU counters. The number of threads is
usually large, but maybe the number of CPUs is bounded enough to contain the
memory pressure? Linux implements restartable sequences
[https://blog.linuxplumbersconf.org/2016/ocw/system/presentations/3873/original/presentation-rseq-lpc2016.pdf]
for this, but I don’t know whether that’s any faster than the BRC described
here. The other thought was - in a similar light to the static analysis the
paper mentions - is there any benefit to somehow marking objects that are
“about to escape” so they can switch their owner thread id? One common
pattern in Swift is bouncing a sequence of tasks off different dispatch
queues to get on and off the UI thread, but it’s really a sequence of
operations on the same object.
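The per-CPU/per-thread counter idea can be sketched without rseq itself (whose kernel interface I won't reproduce here): give each slot its own cache line so increments don't contend, and only sum the slots when the true count is needed. The slot count and layout below are assumptions for illustration:

```cpp
#include <array>
#include <atomic>

constexpr int kSlots = 64;  // assumed upper bound on CPUs/threads

struct alignas(64) PaddedCount {  // one cache line per slot, no false sharing
    std::atomic<long> n{0};
};

struct SlottedRefCount {
    std::array<PaddedCount, kSlots> slots;

    void retain(int slot)  { slots[slot].n.fetch_add(1, std::memory_order_relaxed); }
    void release(int slot) { slots[slot].n.fetch_sub(1, std::memory_order_relaxed); }

    // Expensive, and only meaningful once new references can no longer
    // appear (e.g. at a quiescent point before deallocation).
    long total() const {
        long sum = 0;
        for (const auto& s : slots) sum += s.n.load(std::memory_order_relaxed);
        return sum;
    }
};
```

This also illustrates the memory-pressure concern above: every object carries `kSlots` cache lines of counter state unless the slots live in an external, indexed structure.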

Great stuff in either case - I’m a fan of no-effort memory management with
minimal performance implications and this is one step closer.

~~~
vnorilo
I played with per-thread counters. The problem is you don't want them sharing
a cache line, so having them inside the object is problematic. Having external
per-thread counter arrays involves bookkeeping and indirection I could not
make fast enough. YMMV, and maybe someone has done a better job.

A known variant is deferred counting, where a single thread mutates refcounts
and you have per-thread lightweight queues from other threads to the mutator.
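A minimal sketch of that deferred-counting variant, with a mutex standing in for the lightweight per-thread queues (a real implementation would use lock-free SPSC queues; the structure is what matters here):

```cpp
#include <deque>
#include <mutex>

struct DeferredRefCount {
    long count = 1;           // touched only by the designated mutator thread
    std::mutex qlock;         // stand-in for a lightweight per-thread queue
    std::deque<int> pending;  // +1 / -1 deltas enqueued by other threads

    // Called from non-mutator threads: record the delta, touch no refcount.
    void enqueue(int delta) {
        std::lock_guard<std::mutex> g(qlock);
        pending.push_back(delta);
    }

    // Called periodically on the mutator thread; returns true on zero.
    bool drain() {
        std::lock_guard<std::mutex> g(qlock);
        for (int d : pending) count += d;
        pending.clear();
        return count == 0;
    }
};
```

The appeal is that the count itself is never contended; the cost is that an object's death is only observed at the next drain.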

------
gottesmm
One problem with the type of approach here is that many modern UI programs
allocate objects on side threads and then migrate those objects to the main
thread (which is the only thread allowed to touch UI state). So one really
needs an approach that can recognize that steady state (the steady state
being when the object has landed in its final destination thread).

Once Swift has some sort of concurrency model, it would be interesting to see
if we could use per-thread local ref counts that manage the lifetime of a
single atomic ref count bump. One would be able to convert between the two
types of ref counts upon escape if needed.

This type of approach would eliminate most of the atomic ref count operations
and allow all threads to use these non-atomic, thread-local ref counts in hot
code.
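My reading of the proposal above, as a sketch: each thread keeps a plain local count that "pins" exactly one +1 on the shared atomic count, so the atomic is only touched when a thread's local count crosses between zero and one. All names here are hypothetical:

```cpp
#include <atomic>

struct SharedCount {
    std::atomic<long> strong{0};  // the single atomic count per object
};

struct LocalHandle {          // one per (thread, object) pair
    SharedCount* obj;
    long local = 0;           // non-atomic, confined to one thread

    void retain() {
        if (local++ == 0)             // first local reference:
            obj->strong.fetch_add(1); // take one atomic bump, total
    }
    void release() {
        if (--local == 0)             // last local reference:
            obj->strong.fetch_sub(1); // give the bump back
    }
};
```

In hot code a thread's retains and releases stay entirely on `local`; the atomic is hit once per thread per episode of ownership, and escape would convert a local count into a fresh handle on the destination thread.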

~~~
blt
I only skimmed the paper, but it seems like the proposed method could support
handing over ownership from the worker thread to the GUI thread via an
explicit statement in the code.

~~~
liuliu
Yeah. Even at runtime it seems possible: when the owner thread gives up
ownership (by setting owner_tid = 0), it can use an extra flag to let other
threads compete for the ownership. As long as the flag is on the 2nd half-
word, (it seems) there is no atomicity requirement for the 1st half-word.
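That hand-off could look something like the following sketch, which packs the owner tid and an "up for grabs" flag into one word and lets threads CAS to claim the bias. The field layout is my assumption, not the paper's:

```cpp
#include <atomic>
#include <cstdint>

struct OwnerWord {
    // High 32 bits: owner tid; low bit: bias is up for grabs.
    std::atomic<uint64_t> word{0};

    // Owner gives up the bias: tid cleared, flag set, in one release store.
    void abandon() {
        word.store(0x1, std::memory_order_release);
    }

    // Other threads race to claim the bias; exactly one CAS can win.
    bool try_claim(uint32_t my_tid) {
        uint64_t expected = 0x1;  // unowned and flagged
        return word.compare_exchange_strong(
            expected, static_cast<uint64_t>(my_tid) << 32,  // tid set, flag clear
            std::memory_order_acq_rel);
    }
};
```

Because the CAS covers the whole word, a claimer can never observe the flag set while a stale tid is still visible, which is what makes the two half-words safe to treat asymmetrically.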

------
blt
This is a nice result. It follows a similar pattern to generational garbage
collection: divide everything into a fast, common path and a slow, fully
general path.

