Great to see people acknowledge this. Garbage collection is full of tradeoffs; be skeptical of any claims to the contrary.
I also like the way the author emphasizes the importance of inline bump allocation in the nursery. In the fast path, allocating memory needn't take more than 5 or 6 instructions. This speed advantage is huge, and GC schemes that throw it away need a very good reason for doing so.
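For concreteness, here is a minimal sketch of what that fast path looks like (names and sizes are illustrative, not taken from any particular runtime; the slow path that triggers a collection is elided):

```c
#include <assert.h>
#include <stddef.h>

/* A hypothetical nursery: just a bump pointer and a limit. */
typedef struct {
    unsigned char *bump;   /* next free byte */
    unsigned char *limit;  /* end of the nursery */
} nursery_t;

static unsigned char heap[64 * 1024];
static nursery_t nursery = { heap, heap + sizeof heap };

/* Fast-path allocation: align, compare, bump.  When inlined, this is
 * the handful of instructions the author is talking about. */
static inline void *nursery_alloc(size_t size) {
    size = (size + 7) & ~(size_t)7;            /* 8-byte alignment */
    if ((size_t)(nursery.limit - nursery.bump) < size)
        return NULL;                           /* slow path: run the GC */
    void *p = nursery.bump;
    nursery.bump += size;
    return p;
}
```

The compare-and-add is the whole cost; everything else (sweeping, copying survivors) is deferred to the collection itself.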
I have some experience writing garbage collectors, and in the situations I was targeting, the handful-of-instructions advantage quickly fades once multiple threads come into the equation.
1. Using atomics to bump the pointer worked to some extent, but contended atomics on the platforms I was targeting meant that the low-instruction-count allocation was still slow.
2. Using locks (futex-style) was slow, as expected.
3. The best results I found for my use case (a precise garbage collector for a runtime targeting video game consoles) came from per-thread nursery-type allocation arenas, selected via TLS, with no locks in the fast path. This was slower than the ideal single-threaded fast path because of the TLS overhead.
(TLABs are the recommended solution for multithreaded malloc implementations like jemalloc and tcmalloc as well.)
Not sure if LLVM has an equivalent of global register variables, though.
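A minimal sketch of the per-thread-arena approach (option 3 above), with hypothetical names and sizes; the one TLS load per allocation is exactly the overhead mentioned:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Each thread owns its own nursery arena, reached through a
 * thread-local pointer, so the fast path needs no atomics or locks. */
typedef struct {
    unsigned char *bump;
    unsigned char *limit;
} arena_t;

static _Thread_local arena_t *my_arena;  /* the TLS lookup is the extra cost */

/* Slow path: hand a fresh arena to the calling thread. */
static arena_t *arena_new(size_t size) {
    arena_t *a = malloc(sizeof *a);
    a->bump = malloc(size);
    a->limit = a->bump + size;
    return a;
}

/* Fast path: one TLS load, then the usual compare-and-bump. */
static inline void *tlab_alloc(size_t size) {
    arena_t *a = my_arena;
    size = (size + 7) & ~(size_t)7;
    if (a == NULL || (size_t)(a->limit - a->bump) < size)
        return NULL;                     /* slow path: fetch a new arena */
    void *p = a->bump;
    a->bump += size;
    return p;
}
```

This is the same idea as the TLABs mentioned above: contention only happens on the rare slow path where a thread requests a new arena, which can then afford a lock or a CAS.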
> the allocation rate goes down at least 5x, and time to execute goes up 10x! And this is not even starting to touch what a collector has to do when multiple threads are asking for memory (probably contended atomics)
> For Epsilon, the allocation path in GC is a single compare-and-set - because it issues the memory blocks by pointer-bumps itself.
So both the TLABs and the Epsilon GC are doing pointer-bumps, right? Why does TLAB outperform no-TLAB even in the single-thread case? Just because the pointer-bump instructions are inlined?
On SBCL, it's pretty normal to dynamically allocate a cons-cell, which is the size of two pointers. The garbage collector pays zero cost for any garbage in the nursery when it is run, so the performance overhead versus stack allocation is approximately zero.
I've seen language implementations where there is no stack at all (it's one of the most straightforward ways to implement Scheme). Chicken Scheme works this way, for example: it cleverly uses the C stack as the nursery, since functions never return in Chicken, and a copying nursery allocation is just a pointer increment, which is exactly what a C stack allocation is.
If I recall correctly though, because our compiler generated DLLs, we weren't able to use the fast TLS paths; instead we had to use the TLS API (i.e. TlsAlloc/TlsGetValue on Win32), which slowed things down a bit.
Turbo Pascal on MS-DOS already provided a pluggable API to customize the allocator.
And I remember companies living off selling optimized versions of malloc(), for different kinds of scenarios.
1) LLVM doesn't place any restrictions on how a language runtime allocates memory.
2) LLVM doesn't "expect" a stack map -- it provides infrastructure to compute them if the front end wants to, but the front end is completely free to ignore that infrastructure.
BoehmGC uses the stack for its root set (where it starts looking for live memory). With LLVM, you don't ever need to explicitly use the stack; you can use registers for everything and let LLVM figure out whether a value should go on the stack or not. If you want to use Boehm with LLVM, you are forced to explicitly allocate everything on the stack, not just in a register, so that Boehm is guaranteed to find the memory.
So I wouldn't say restriction, but you definitely need to think about how LLVM interacts with the GC and other runtime components of your language.
> First, a description of how SBCL’s generational GC works:
> Being conservative with regards to the stack. That creates a major clash with LLVM, which expects a stack map. A stack map declares which data type can be found where on the stack for a function (as in, every invocation of that function needs to comply with it), whereas SBCL's compiler can and does generate code that just places anything anywhere. This makes the stack a lot more compact in some call patterns.
So those were intended as corrections.
No, this is UB. It can cause a segfault, but it can also allow bad things™ to happen.
It is not unusual for runtimes and GCs to rely on triggering a hardware exception, up to and including segfaults. For example, a safepoint check might be implemented by attempting to read a particular address. When the GC needs to stop the world, it changes the protection bits on the page that contains that address. This technique minimizes the amount of code in the safepoint test and doesn't require any branching logic.
See e.g. https://stackoverflow.com/questions/46394575/safepoints-in-j...
Another technique: a runtime can support dynamic growth of the stack by keeping an unmapped or protected page at the limit, and extending it when hit. This is how stacks work on Windows, and it relies on help from codegen: compilers targeting Windows need to generate code that performs a loop touching every page in its stack frame allocation one at a time in allocation order, if the stack frame is larger than the page size.
See e.g. https://geidav.wordpress.com/tag/stack-probing/
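The probe loop itself is trivial; a C-level approximation of what the compiler inserts (real codegen like `__chkstk` works on the stack pointer directly, so this is only a sketch of the touch-one-byte-per-page pattern):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096  /* assumed page size */

/* Touch one byte in every page of a large frame, in order, so each
 * access hits the guard page before skipping past it and the OS can
 * extend the stack one page at a time.  Returns the number of pages
 * probed (handy for testing). */
static size_t probe_frame(volatile char *frame, size_t frame_size) {
    size_t pages = 0;
    for (size_t off = 0; off < frame_size; off += PAGE_SIZE) {
        frame[off] = 0;
        pages++;
    }
    return pages;
}
```

Skipping the probes and writing only to the far end of a multi-page frame could jump clean over the guard page into unrelated memory, which is exactly the bug the codegen requirement prevents.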
That's one of the common options in Erlang/Elixir: spawn a worker process for each task with a `min_heap_size` high enough that most workloads would not trigger a collection (the default is fairly low), let it die after it's handled the request. More complex/memory intensive tasks will fall through to normal GC-ing behaviour once they breach the `min_heap_size` limit.
I have a basic understanding of card tables and promotion, but I couldn't find anything about hole punching. I'm pretty sure I've heard the term before and was just as confused then; could someone point me in the right direction?
From context I'd guess that it means the gc doesn't copy unless x% of the block is unused?
You cannot move the possibly (conservatively) pointed-to thing, because you cannot adjust the pointer to it (it might be a non-pointer thing such as an integer).
Now you have some GC unit's worth of space occupied by one unmovable object; otherwise it's empty space backed by physical pages. What do you do with the rest of the space? In a C/malloc scheme you are aware of such holes and fill them from new allocations. With a fast allocation scheme that doesn't involve complex code to find holes, you will keep these "holes" for as long as the conservative-pointer-looking thing exists. You do wipe all the other pointer-ish things in that GC area, though, so that they don't hold down additional space. Still, you now "waste" a whole GC card's worth of physical RAM on a single object, the tradeoff being that you do not want to move to an allocation scheme that spends time thinking about fragmentation.
You could use the empty space in those GC cards as a target for the next moving GC, but that has drawbacks, as you then have to keep special-treating regular objects co-located with possibly-conservatively-pointed-to objects.
If there is a better term than "punching holes" for this I would be interested.
ETA: now that I think about it, you could give back all physical pages in that GC card that do not contain the held down object. This assumes that GC card size is more than one VM page.
Please keep corrections to my post coming; the time to determine and influence GC design restrictions in LLVM is now, before a popular GCed language comes along and entrenches whatever LLVM's GC support happens to be at that point as the status quo.