Great to see people acknowledge this. Garbage collection is full of tradeoffs; be skeptical of any claims to the contrary.
I also like the way the author emphasizes the importance of inline bump allocation in the nursery. On the fast path, allocating memory needs no more than 5 or 6 instructions. This speed advantage is huge, and GC schemes that throw it away need a very good reason for doing so.
I have some experience writing garbage collectors, and in the situations I was targeting, that handful-of-instructions fast path quickly fades once multiple threads enter the equation.
1. Using atomics to bump the pointer worked to a degree, but contended atomics on the platforms I was targeting meant the low-instruction-count allocation was still slow.
2. Using locks (futex-style) was slow, as expected.
3. The best results for my use case (a precise garbage collector for a runtime targeting video game consoles) came from per-thread nursery-style allocation arenas, selected via TLS, with no locks in the fast path. This was still slower than the ideal single-threaded fast path because of the TLS overhead.
(TLABs are the recommended solution for multithreaded malloc implementations like jemalloc and tcmalloc as well.)
Not sure if LLVM has an equivalent of global register variables, though.
> the allocation rate goes down at least 5x, and time to execute goes up 10x! And this is not even starting to touch what a collector has to do when multiple threads are asking for memory (probably contended atomics)
> For Epsilon, the allocation path in GC is a single compare-and-set - because it issues the memory blocks by pointer-bumps itself.
So both the TLABs and the Epsilon GC are doing pointer-bumps, right? Why does TLAB outperform no-TLAB even in the single-thread case? Just because the pointer-bump instructions are inlined?
On SBCL, it's pretty normal to dynamically allocate a cons-cell, which is the size of two pointers. The garbage collector pays zero cost for any garbage in the nursery when it is run, so the performance overhead versus stack allocation is approximately zero.
I've seen language implementations where there is no stack at all (it's one of the most straightforward ways to implement Scheme). Chicken Scheme works this way, for example: it cleverly uses the C stack as the nursery, since functions never return in Chicken, and a copying-nursery allocation is just a pointer increment, which is exactly what a C stack allocation is.
If I recall correctly, though, because our compiler generated DLLs we weren't able to use the fast-path TLS options; we had to go through the TLS API instead (i.e. TlsAlloc/TlsGetValue on Win32), which slowed things down a bit.
Turbo Pascal on MS-DOS already provided a pluggable API to customize the allocator.
And I remember companies living off selling optimized versions of malloc(), for different kinds of scenarios.