
Conservative GC: Is It Really That Bad? - dleskov
https://www.excelsiorjet.com/blog/articles/conservative-gc-is-it-really-that-bad/
======
barrkel
The problematic code in the article is, AFAICT, using an object finalizer to
free manually allocated memory; such approaches seldom work well, even with
precise GCs.

Thread stacks are effectively manually allocated blocks of memory. You create
a thread, which allocates the stack, and as long as the thread lives, the
stack is kept alive - it's self-sustaining. The thread must die by explicit
programmatic action, which in turn will free its allocated block of stack
memory.

Using finalizers at all is usually an anti-pattern in a GC world. The presence
of finalizers is a very strong hint that the GC is being used to manage
resources other than memory, something that GC is a poor fit for, because
other resources almost certainly have no necessary correlation with memory
pressure; and GCs usually only monitor GC heap memory pressure.

That's not to say that there aren't plenty of edge cases where you can end up
with lots of false roots that artificially lengthen object lifetimes with a
conservative GC. Putting a thread stack in your cycle of object references and
relying on GC pressure to break the cycle isn't a strongly motivating one to
my mind, though.
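For concreteness, the pattern being criticized can be sketched like this (names are hypothetical, loosely modeled on java.util.Timer's internal thread reaper; not the actual JDK code):

```java
// Sketch of the anti-pattern: a worker thread whose only shutdown path,
// besides an explicit close(), is the finalizer of an otherwise-unreferenced
// "reaper" object. Under a conservative GC, a false root can keep the reaper
// alive forever, so the thread and its stack never get freed.
final class FinalizerManagedWorker {
    final Thread thread;
    volatile boolean stopped = false;

    // Reachable only from user code; when it becomes unreachable, its
    // finalizer is *supposed* to stop the thread. The GC tracks heap
    // pressure, not threads, so this may run late or never.
    @SuppressWarnings("removal")
    final Object reaper = new Object() {
        @Override protected void finalize() { close(); }
    };

    FinalizerManagedWorker() {
        thread = new Thread(() -> {
            while (!stopped) {
                try { Thread.sleep(10); } catch (InterruptedException e) { return; }
            }
        });
        thread.setDaemon(true);
        thread.start();
    }

    // The robust alternative: deterministic, explicit shutdown.
    void close() {
        stopped = true;
        thread.interrupt();
    }
}
```

The explicit close() is the "explicit programmatic action" the comment describes; the finalizer path is the part that seldom works well.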

~~~
dbg_nsk
Absolutely agree with you about finalizers! However, please note that this
"threadReaper" code comes from a JDK class, so the problem can appear in any
application that merely uses the Timer class.

Of course, there are many other examples of false roots, but this particular
class caused unexpected OOMs in several of our clients' applications, so we
made this small sample and used it for sanity checking while implementing the
precise GC (and then mentioned it in the post).

~~~
yxhuvud
I take it you are one of the people behind the article? Did you ever see the
papers about a conservative variant of Immix, which would be both compacting
and conservative?

~~~
dbg_nsk
Yes, I wrote this article.

No, I hadn't read those papers before, but after your comment I took a look
at one of them, and it is very interesting! Thanks for mentioning it.

But, if I understood correctly, excess memory retention is still a problem
in conservative Immix?

~~~
yxhuvud
It is mentioned in one of the papers (or possibly the doctoral thesis, which
is nice as it gives an overview of the whole chain of improvements) that it
can still happen, but that it is much better than what they benchmarked it
against (Boehm, IIRC). Hard to tell if it delivers what it promises, though.

~~~
dbg_nsk
It would be very interesting to run our sample with Timers on this GC. In that
paper I found a link to their implementation, but unfortunately it is
unavailable.

~~~
yxhuvud
Link is dead, but the code is on GitHub somewhere. Dunno where I found the
link to it, and I am away from my computer for a few days, so I can't really
look it up for you.

------
aidenn0
Interestingly enough, SBCL on x86/x64 has a conservative, but moving, GC. It
can know some, but not all roots precisely, so it pins any objects that are
reachable through conservative roots.

Its earlier implementations were on RISC chips with 24 or more GPRs, so the
implementation was simple: two stacks, and divide the local registers in half
between boxed and unboxed values. This obviously didn't work when porting to
x86, which had far fewer registers.

The ARM port, I believe, uses the non-conservative approach, despite having
one fewer register than x64 (the x64 port was derived from the x86 port, so
it uses the same register assignments).

~~~
rwmj
> divide the local registers in half for boxed and unboxed values

Fondly remembers the separate address and data registers on the 68000. Why
didn't they go back to this approach for x86_64 (16 registers), now that no
one really cares about 32-bit x86?

~~~
aidenn0
The conservative GC approach has worked well enough in practice that nobody is
going to do the work. Also, there is a performance tradeoff in non-allocating
code: sometimes you need more unboxed registers, other times you need more
boxed registers, so with only ~6 of each[1] you will run into register
pressure.

1: Two stacks means two stack pointers and two frame pointers, leaving only 12
registers for values; it's also possible that the SBCL ABI uses a global
register for something else as well, which would leave only 11. PowerPC is a
really luxurious platform: you have 32 GPRs, so even if you use 8 of them for
various bookkeeping purposes, that leaves 24, which is enough for pretty much
everyone.

------
dmytrish
I am not an expert in garbage collection techniques, but this article does not
even mention locality of reference (copying GCs improve locality on each
compaction) and how many cache misses are introduced by increased
fragmentation. Are there any benchmarks on this?

~~~
aidenn0
On SBCL the bigger win from using a copying collector isn't the locality of
reference (which helps with some loads but hurts with others), but rather the
fact that an allocation can be about two instructions in the non-GC case
(a pointer increment plus a bounds check).
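That fast path can be sketched as a toy bump allocator over a byte array (illustrative only, not SBCL's actual allocator; -1 stands in for "take the slow path and run the GC"):

```java
// Toy bump-pointer allocator: the fast path is one addition plus one bounds
// check, mirroring the ~two-instruction sequence described above.
final class BumpAllocator {
    private final byte[] heap;
    private int free = 0;

    BumpAllocator(int size) { heap = new byte[size]; }

    // Returns the offset of the new block, or -1 to signal the slow path.
    int allocate(int bytes) {
        int newFree = free + bytes;           // pointer increment
        if (newFree > heap.length) return -1; // bounds check -> slow path / GC
        int result = free;
        free = newFree;
        return result;
    }
}
```

Compare this with a general-purpose malloc, which typically has to search free lists and write block headers even on its fast path.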

I hadn't spent a lot of time thinking about how much faster this is than
malloc/free until a question came up the other day here on HN to the effect of
"why would anyone dynamically allocate an object that is smaller than a cache
line?" In Lisp, a commonly allocated structure is a CONS cell, which is two
pointers in size and thus often smaller than a cache line. It would be very
wasteful to malloc/free 8 (or 16) bytes in C, but with SBCL's allocator the
throughput is approximately the same as stack-allocating them.

~~~
bjourne
It never gets as fast as stack-allocating variables. Consider a loop:

    while (1) { Integer n = new Integer(...); }   // vs. int n = ...;

A C compiler would allocate the n variable on the stack, making the
"allocation" completely free. But in a GC:ed language, the n variable would be
bump allocated once every loop. That wouldn't in itself be so costly, but
every so often a GC cycle would be needed as the garbage accumulates.
Furthermore, in C the address of the n variable stays in place while in a
GC:ed language it moves a bit for each loop.

That is why escape analysis is a fruitful optimization: it takes
heap-allocated objects and attempts to stack-allocate them, similar to what a
C compiler would do.

~~~
ScottBurson
> But in a GC:ed language, the n variable would be bump allocated once every
> loop.

Citation needed. I don't know of any GCed language that heap-allocates local
variables in this way (well, maybe SML/NJ does, but I doubt it). Certainly not
Java or any Common Lisp I've ever used.

But I agree with your larger point: heap allocation is never quite as fast as
stack allocation, once you factor in the additional GC load. I don't actually
know how close it gets with modern collectors; would love to see some numbers.

~~~
bjourne
C# does, and Java did before the escape analysis optimization became the
default. You can find numbers in this old article from 1999:
[https://www.cc.gatech.edu/~harrold/6340/cs6340_fall2009/Read...](https://www.cc.gatech.edu/~harrold/6340/cs6340_fall2009/Readings/choi99escape.pdf)

"and the overall execution time reduction ranges from 2% to 23% (with a median
of 7%) on a 333 MHz PowerPC workstation with 128 MB memory."

The benchmark is 20 years old so it is kind of out of date. I don't know of
any modern benchmarks. I suspect that the difference would be much bigger
nowadays because programmers don't avoid allocating small local objects as
much.

~~~
ScottBurson
Oh, now I understand what you're saying. I got thrown off when you said "the
address of the variable n [...] moves a bit for each loop". This isn't quite
right; it's not the address of the _variable n itself_ that changes, it's the
address of the allocated object that it points to.

------
bjourne
A safepoint poll on x86 is nothing more than a single instruction reading a
polling page, e.g. HotSpot's test eax, [rip+0x1234]. That shouldn't cause a
major slowdown? Also, safepoints are useful for features other than GC. For
example, you can inspect a running thread's call stack, which is useful when
debugging and when reifying a thread's state.

Stack maps can be made a bit smaller by pushing and popping all registers to
and from the stack during GC. That way, the maps only need to describe stack
locations, not individual registers.

Btw, the article is really good. This is the kind of stuff I keep coming back
to HN for!
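The polling mechanism can be modeled in plain Java (a sketch of the cooperative idea only; the names are made up, and HotSpot actually implements the poll via a page-protection trick rather than a shared flag):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Model of cooperative safepoint polling: the worker checks a flag on every
// backward branch; the "VM" raises the flag and waits for the worker to park.
final class SafepointDemo {
    static final AtomicBoolean safepointRequested = new AtomicBoolean(false);
    static final CountDownLatch parked = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            long work = 0;
            while (true) {
                work++;                          // "real" work
                if (safepointRequested.get()) {  // the per-iteration poll
                    parked.countDown();          // report: I'm at a safepoint
                    return;                      // (a real thread would block here)
                }
            }
        });
        worker.start();
        safepointRequested.set(true);            // VM requests a stop-the-world
        parked.await();                          // wait until the thread parks
        worker.join();
        System.out.println("all threads parked");
    }
}
```

The per-iteration check is exactly the cost dbg_nsk describes below: cheap, but paid on every backward branch.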

~~~
dbg_nsk
Right, but even a single instruction placed on every backward branch and in
the epilogue of every method has a noticeable performance impact. This is
especially important in highly optimized code, which is why optimizing
compilers like HotSpot's C2 (and JET as well) try to remove as many safepoints
as possible. Sometimes this even causes trouble, as in this case:

[https://bugs.openjdk.java.net/browse/JDK-5014723](https://bugs.openjdk.java.net/browse/JDK-5014723)

Good point about inspecting a thread's call stack. Indeed, with the
conservative GC we had a problem: many popular profilers were incompatible
with JET because they inspect threads at safepoints, and we had none. When we
implemented the precise GC this problem disappeared, which was an additional
benefit for us. However, there are alternative ways to gather a thread's call
stack outside of safepoints. We use them to avoid safepoint bias in our
profiler. You can read more about it here:

[https://www.excelsiorjet.com/blog/articles/portable-profilers-and-where-to-find-them/](https://www.excelsiorjet.com/blog/articles/portable-profilers-and-where-to-find-them/)

\----

Thanks for the kind words! I'm glad you liked the post!

------
naasking
Conservative GC would probably work well enough for the JVM because there are
no value types or inline arrays, which more easily masquerade as roots; i.e.
a random sequence of bytes, as used in crypto or hashing, would yield a lot of
false positives.

By comparison, the CLR is a much worse fit, because value types and
stack/inline/fixed arrays mean that false positives would be much higher for
some applications.
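A toy model shows how the false-positive rate scales with address-space size (the heap range, seed, and names here are invented; only the scaling matters):

```java
import java.util.Random;

// How often does a random word on the stack happen to land inside the heap's
// address range, and so get treated as a root by a conservative scan?
final class FalsePositiveModel {
    // Fraction of random words falling inside [0, heapSize) when words are
    // drawn uniformly from an addressSpaceBits-bit address space.
    static double falsePositiveRate(long heapSize, int addressSpaceBits,
                                    int trials, long seed) {
        Random rnd = new Random(seed);
        long mask = addressSpaceBits == 64 ? -1L : (1L << addressSpaceBits) - 1;
        int hits = 0;
        for (int i = 0; i < trials; i++) {
            long word = rnd.nextLong() & mask;
            if (word >= 0 && word < heapSize) hits++;
        }
        return (double) hits / trials;
    }
}
```

With a 1 GB heap, a 32-bit space gives a rate around 1/4 (2^30 / 2^32), while a 64-bit space gives essentially zero, which is also the point PaulHoule makes further down the thread.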

~~~
dbg_nsk
You are right: things like value types and inline arrays have not been
introduced into the Java language (yet), but the JVM can still allocate
objects, including arrays, on the stack if those objects do not escape to the
heap. Of course, not all objects meet this condition, but the problem remains.

------
SteveJS
The Chakra JavaScript engine uses a conservative generational mark-and-sweep
collector with many phases running in parallel with code execution. It looks
like Chakra is now on GitHub (with an MIT license). In Chakra the GC is called
the 'Recycler', which can throw one for a loop when searching for the GC
implementation.

------
willvarfar
This is just off the top of my head, but it made me wonder: are there any VMs
that put a stack-map header of some sort as a literal on the stack?

E.g. for each frame, the compiler orders roots first and other primitives
after. Then, as you enter the frame, it writes the number of roots to the
stack. When the GC walks the stack, it can see precisely which slots are
roots.
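The idea can be modeled like this (an illustrative sketch, not any real VM's frame layout; all names are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Each frame stores, as a literal in the frame itself, how many of its slots
// are roots; references are ordered first, unboxed primitives after.
final class Frame {
    final int rootCount;  // written to the frame on entry
    final Object[] slots; // slots[0..rootCount) are references

    Frame(int rootCount, Object[] slots) {
        this.rootCount = rootCount;
        this.slots = slots;
    }
}

final class StackScanner {
    // A precise scan: visit exactly the slots each frame declared as roots,
    // ignoring the unboxed slots after them.
    static List<Object> collectRoots(List<Frame> stack) {
        List<Object> roots = new ArrayList<>();
        for (Frame f : stack) {
            for (int i = 0; i < f.rootCount; i++) {
                if (f.slots[i] != null) roots.add(f.slots[i]);
            }
        }
        return roots;
    }
}
```

The reply below points out the catch: rootCount is fixed at frame entry, while a real variable's liveness changes from safepoint to safepoint within the method.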

~~~
dbg_nsk
The problem is that it all depends on the liveness of the variables. The same
value on the stack can be a root at the beginning of the method, while the
corresponding variable is still live, and become useless later, once the
variable is dead. So you still need to know the location of each live
reference at every safepoint (and that means several stack maps for each
method).

~~~
willvarfar
You can zero them when they are not live.

~~~
dbg_nsk
Yeah, we've tried that. It was one of our attempts at improving the
conservative GC (no dead values on the stack => no false roots).
Unfortunately, it causes noticeable performance degradation, so it is easier
to consult stack maps about the liveness of the variables.

------
le-mark
So does this analysis extend to libgc/Boehm's GC, since it's conservative as
well?

------
PaulHoule
The pain from conservative GC depends on how much of your address space you
are using.

In the 32-bit age, you ran into problems more and more as your heap approached
the GB range. At some point the probability that you end up with a false root
that keeps a lot of garbage alive goes to 1.

In the 64-bit age we get a respite, although many systems don't really use all
64 bits of a pointer.

------
kazinator
A hybrid is possible and useful in some circumstances: a conservative scan
for roots, which point into a precisely traced heap.

------
jacksmith21006
This is something that I just love about Go. It gets rid of the GC stalling
you have with Java.

~~~
imtringued
The primary difference is that Go creates less garbage than Java. Of course
this results in significantly shorter GC pauses, but they still exist.

