This made me think of two other approaches I wonder if anyone has tried. The first is using something like per-cpu counters. The number of threads is usually large, but maybe the number of CPUs is bounded enough to contain the memory pressure? Linux implements restartable sequences [https://blog.linuxplumbersconf.org/2016/ocw/system/presentat...] for exactly this kind of per-cpu update, though I don't know whether that's any faster than the BRC described here. The other thought, in a similar light to the static analysis the paper mentions: is there any benefit to somehow marking objects that are "about to escape" so the owner thread id can be switched ahead of time? One of the common patterns in Swift is bouncing a sequence of tasks off different dispatch queues to get on/off the UI thread, but it's really a sequence of operations on the same object.
Great stuff in either case - I’m a fan of no-effort memory management with minimal performance implications and this is one step closer.
A known variant is deferred counting, where a single thread mutates the refcounts and other threads send their increments and decrements to it through lightweight per-thread queues.
Once Swift has some sort of concurrency model, it would be interesting to see if we could use per-thread local ref counts that manage the lifetime of a single atomic ref-count bump. One could convert between the two kinds of ref counts on escape if needed.
This kind of approach would eliminate most of the atomic ref-count operations and let all threads run hot code on these non-atomic thread-local ref counts.