Hacker News new | comments | show | ask | jobs | submit login

> I hope I explained why “one size fits all” does not really do it in garbage collection.

Great to see people acknowledge this. Garbage collection is full of tradeoffs; be skeptical of any claims to the contrary.

I also like the way the author emphasizes the importance of inline bump allocation in the nursery. In the fast path, allocation of memory doesn't need to be any more than 5 or 6 instructions. This speed advantage is huge, and GC schemes that throw it away need to have a very good reason for doing so.






> In the fast path, allocation of memory doesn't need to be any more than 5 or 6 instructions. This speed advantage is huge, and GC schemes that throw it away need to have a very good reason for doing so.

I have some experience writing garbage collectors and with the situations I was targeting the handful of instructions quickly fades when multiple threads come into the equation.

1. Using atomics for bumping the pointer had some luck but contended atomics on the platforms I was targeting meant that the low-instruction-count allocation was still slow.

2. Using locks (futex-style) was slow, as expected.

3. The best results I found for my use case (precise garbage collector for a runtime targeting video game consoles) resulted in per-thread nursery-type allocation arenas, selected by TLS, with no locks in the fast-path. This was slower than the ideal single-threaded fast path because of the TLS overhead.


Yeah, the canonical solution for multithreaded GC is the third option (TLABs). The TLS overhead is annoying, but on some architectures you can get away with burning a register to save the TLS load. It might well be worth it on AArch64, with its 32 GPRs...

(TLABs are the recommended solution for multithreaded malloc implementations like jemalloc and tcmalloc as well.)


AArch64 has a dedicated register (TPIDR_EL0) for TLS.

Didn't know that, thanks!


What is the TLS overhead? On x64 you presumably could spare two registers, one for the current heap pointer and one for the end of the heap. How could you implement this more efficiently with a single threaded version?

Not sure if LLVM has an equivalent of global register variables, though.


#3 is a thread-local allocation buffer, which is what HotSpot is using. https://shipilev.net/jvm-anatomy-park/4-tlab-allocation/

I must be missing something. From the article:

> the allocation rate goes down at least 5x, and time to execute goes up 10x! And this is not even starting to touch what a collector has to do when multiple threads are asking for memory (probably contended atomics)

...

> For Epsilon, the allocation path in GC is a single compare-and-set - because it issues the memory blocks by pointer-bumps itself.

So both the TLABs and the Epsilon GC are doing pointer-bumps, right? Why does TLAB outperform no-TLAB even in the single-thread case? Just because the pointer-bump instructions are inlined?


Aside from the contention overhead, per-thread nurseries also has a fringe benefit for SMP platforms with exclusive L1 caches; short lived data is unlikely to be shared, so having threads allocate from different cache lines prevents thrashing the lines across cores.

Why would any allocator split a cache line to multiple threads, and why would a program use dynamic memory allocation for something so small it would fit in the L1 cache?

If your dynamic allocator is a pointer-increment (as it is in most copying generational GCs), then if you have a single global nursery, then two threads allocating after each other would get adjacent locations in memory when allocating.

On SBCL, it's pretty normal to dynamically allocate a cons-cell, which is the size of two pointers. The garbage collector pays zero cost for any garbage in the nursery when it is run, so the performance overhead versus stack allocation is approximately zero.

I've seen language implementations where there is no stack at all (it's one of the most straightforward ways to implement scheme). Chicken scheme works this way, for example (it cleverly uses the C stack as the nursery, since functions never return in chicken and since a copying nursery allocation is just a pointer-increment, which is what a C stack allocation is).


What was the TLS overhead in practice? On most architectures it's just indirecting through one register isn't it?

I can't remember the exact numbers, I last worked on the project in 2011.

If I recall though, because our compiler generated DLLs, we weren't able to use the TLS fast path options, we needed instead to use the TLS API (ie: TlsAlloc/TlsGetValue on Win32) which slowed things down a bit.


Yep, given that it also applies to manual memory management, it could be hardly otherwise on automatic memory management.

Turbo Pascal on MS-DOS already provided a pluggable API to customize the allocator.

And I remember companies living off selling optimized versions of malloc(), for different kinds of scenarios.


Honestly I think this is something that will become less and less important as time goes on. The JDK, for example, has 4 separate garbage collectors that can be selected via configuration. But with runtime information (whether via JIT or PGO), couldn't they be better selected automatically for a given runtime profile? I think as compilers continue to build better optimization capabilities, that the gap between GC and manual memory management will shrink to the same significance as the gap between C and assembly today.



Applications are open for YC Summer 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: