
JVM Anatomy Park – 11: Moving GC and Locality - hyperpape
https://shipilev.net/jvm-anatomy-park/11-moving-gc-locality/
======
pizlonator
I think that the main reason why compacting GCs improve performance is that
they obviate the need for a sweep phase, which always has some cost even if
you optimize the hell out of it. I believe that from a scientific standpoint,
it is still an open question exactly how much of the compacting GC speed-up is
due to not sweeping versus improved locality. It's hard to construct a good
experiment for this, because non-compacting GCs always have at least some kind
of sweep phase. For example, if the non-compacting collector experiences
fragmentation, then is its reduced performance due to the cache misses
experienced by the sweeper (which have nothing to do with locality per se -
they are just an inherent cost of sweeping) or because of cache misses
experienced by accesses within the program due to the changed ordering of
objects?

The experiments in this post are super interesting, but it's not a good idea
to extrapolate from microbenchmarks. In my experience, big programs always have
some chaotic subtlety, which can sometimes make their behavior the opposite of
what the microbenchmark tells you.

It's worth noting that GCs are tuned so that the typical program's allocation
pattern maximizes the locality of object accesses. Non-compacting GCs tend to
allocate objects of the same type next to each other (so if you allocate O, P,
Q where O and Q have type T and P has type S, then Q will be right next to O
but P will be elsewhere), while compacting GCs tend to allocate objects in
sequence (so O and P will be adjacent, and then after P will be Q). There's no
guarantee that any particular program will prefer one ordering over the other.
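
If you want to eyeball this ordering on a particular VM, something like the
following sketch prints where three freshly allocated objects landed. (It uses
the jol-core library that the article's samples already use; T and S are just
stand-in classes, and the result depends on the collector and heap state.)

    import org.openjdk.jol.vm.VM;

    public class OrderProbe {
        static final class T { long x; }
        static final class S { int y; }

        public static void main(String[] args) {
            T o = new T();
            S p = new S();
            T q = new T();
            // A bump/compacting allocator tends to place these in allocation
            // order; a size/type-segregated allocator may put p in a different run.
            System.out.println("o @ " + VM.current().addressOf(o));
            System.out.println("p @ " + VM.current().addressOf(p));
            System.out.println("q @ " + VM.current().addressOf(q));
        }
    }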

~~~
panic
_It's hard to construct a good experiment for this, because non-compacting GCs
always have at least some kind of sweep phase. For example, if the non-
compacting collector experiences fragmentation, then is its reduced
performance due to the cache misses experienced by the sweeper (which have
nothing to do with locality per se - they are just an inherent cost of
sweeping) or because of cache misses experienced by accesses within the
program due to the changed ordering of objects?_

What about adding a dummy sweep phase to the compacting collector? Then you
could compare compacting + dummy sweep with non-compacting + real sweep.

~~~
pizlonator
There is no way to implement a dummy sweep that has the behavior of a real
sweep if the collector is doing compaction.

Maybe it isn't obvious: compaction isn't a plug-in that you put into a
collector. It's a different way of writing a collector. For example,
compacting collectors often avoid having any mark bits. You need mark bits to
sweep (among other things).
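
To make that concrete: a sweep is roughly the following loop (a toy sketch, not
any real collector's code). It has to visit every cell and consult the mark
bit, which is exactly the work a purely compacting collector never does.

    final class Cell {
        boolean marked;   // set by the mark phase
        Cell nextFree;    // used while the cell sits on the free list
    }

    final class Sweeper {
        Cell freeListHead;

        void sweep(Cell[] heap) {
            for (Cell c : heap) {              // touches every cell: misses included
                if (c.marked) {
                    c.marked = false;          // live: clear the bit for the next cycle
                } else {
                    c.nextFree = freeListHead; // dead: push onto the free list
                    freeListHead = c;
                }
            }
        }
    }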

------
fauigerzigerk
_> Be ever so slightly wary when someone sells you the non-moving GC or no-GC
solutions_

And you should also be wary whenever someone uses the JVM to make a general
point about GC strategies.

The importance of a moving/compacting GC varies hugely depending on whether or
not a language/runtime lets developers define the memory layout of objects.

In a language like Java (and other JVM-based languages) that doesn't let you
do that, relying on the GC is the only way to fix locality. But that is not the
case in languages that have structured value types (C#, Go, Swift, ...).
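
For example (a Java-only sketch to make the contrast concrete): a Point[] is an
array of references to separately allocated objects, so traversal locality is
whatever the allocator/GC produced, whereas an array of a structured value type
in C#/Go/Swift is one contiguous block of payloads.

    final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    public class LayoutSketch {
        public static void main(String[] args) {
            // 1000 references; each Point is a separate heap object placed by
            // the allocator, and possibly moved later by the GC.
            Point[] pts = new Point[1000];
            for (int i = 0; i < pts.length; i++) {
                pts[i] = new Point(i, -i);
            }
            // Traversal locality depends on where the GC left (or moved) the
            // Points, not on anything declared in the source.
            double sum = 0;
            for (Point p : pts) sum += p.x + p.y;
            System.out.println(sum);
        }
    }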

~~~
geodel
You might have checked already, but here is IanT's reply to "Why golang
garbage-collector not implement Generational and Compact gc?". There is also an
interesting discussion in the thread.

https://groups.google.com/d/msg/golang-nuts/KJiyv2mV2pU/wdBUH1mHCAAJ

"However, modern memory allocation algorithms, like the tcmalloc-based
approach used by the Go runtime, have essentially no fragmentation issues. And
while a bump allocator can be simple and efficient for a single-threaded
program, in a multi-threaded program like Go it requires locks. In general
it's likely to be more efficient to allocate memory using a set of per-thread
caches, and at that point you've lost the advantages of a bump allocator. So I
would assert that, in general, with many caveats, there is no real advantage
.."

~~~
pcwalton
> And while a bump allocator can be simple and efficient for a single-threaded
> program, in a multi-threaded program like Go it requires locks. In general
> it's likely to be more efficient to allocate memory using a set of per-
> thread caches, and at that point you've lost the advantages of a bump
> allocator.

Ian is extremely confused here. Every production multithreaded generational GC
uses thread-local allocation buffers (edit: thanks to Filip for correcting my
terminology).

~~~
pizlonator
Very few production multithreaded generational GCs use per-thread nurseries.

But probably all of them use thread-local caches, or what you might call per-
thread caches. I think that's what you might have meant.

Per-thread nurseries are a shit idea unless your language has extra-super-high
levels of infant mortality. Remember that in a per-thread nursery genGC, the
write barrier has to infect the pointed-to object with the global heap so that
future stores to that object also trigger the barrier. In a normal genGC, the
pointed-to object is unaffected; the stored-to object is merely shaded (or
card-marked, or logged, or whatever) for revisiting.
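
Roughly, and hand-waving a lot (this is not any real VM's barrier; all the
names are invented for illustration):

    final class Obj {
        Obj field;
        boolean inGlobalHeap;  // as opposed to sitting in some thread's nursery
        boolean cardMarked;    // stand-in for a card table / remembered-set entry
    }

    final class Barriers {
        // Normal genGC: only the stored-to object is shaded/card-marked/logged.
        static void storeNormalGen(Obj target, Obj value) {
            target.field = value;
            target.cardMarked = true;       // revisit 'target' later; 'value' untouched
        }

        // Per-thread-nursery genGC: a global->local store has to infect the
        // pointed-to object with the global heap, so that later stores to it
        // keep triggering the barrier.
        static void storePerThreadNursery(Obj target, Obj value) {
            if (target.inGlobalHeap && value != null && !value.inGlobalHeap) {
                value.inGlobalHeap = true;  // the pointed-to object gets promoted/pinned
            }
            target.field = value;
        }
    }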

Humorously for parent's post, to my knowledge the first instance of a thread-
local cache was the idea that a multithreaded compacting GC can chop off a
slice of memory from the global bump allocator and then use it to service a
thread-local allocator. Not 100% sure though because maybe Boehm and friends
beat everyone to it.

~~~
pcwalton
> Very few production multithreaded generational GCs use per-thread nurseries.

> Per-thread nurseries is a shit idea unless your language has extra-super-
> high levels of infant mortality.

Isn't that what TLABs are? HotSpot uses TLABs by default, unless my info is
out of date… For example, [1] shows the generated assembler for object
allocation, which uses nonatomic pointer bumps.
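
The fast path is essentially just this (a schematic Java sketch, not HotSpot's
actual assembler or C++; the field names are made up):

    final class Tlab {
        long top;   // next free address inside this thread's buffer
        long end;   // end of the buffer

        long allocate(long size) {
            long obj = top;
            if (obj + size > end) {
                return slowPath(size);  // refill the TLAB or allocate in the shared heap
            }
            top = obj + size;           // plain store: no atomics, no locks
            return obj;
        }

        long slowPath(long size) {
            throw new UnsupportedOperationException("out of scope for this sketch");
        }
    }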

I think we might be talking about different things by "per-thread nurseries".
All I mean by a "per-thread nursery" is a TLAB.

[1]: https://shipilev.net/jvm-anatomy-park/4-tlab-allocation/

~~~
pizlonator
Like I said, you're thinking of thread-local caches. TLABs are thread-local
caches, not per-thread nurseries.

There's a HUGE difference between a thread-local cache (which everyone uses
and which you cite) and per-thread nurseries (which are an obscure idea that
hardly anybody likes).

~~~
pcwalton
Yes. Apologies for using the wrong terminology :)

------
BenoitP
I ran the 3 visualizations at the end of the article and made imgur albums:

Images are in order.

JOLSample_22_Compaction: http://imgur.com/a/agJaR

JOLSample_23_Defragmentation: http://imgur.com/a/A74k9

JOLSample_24_Colocation: http://imgur.com/a/RTrhL

In JOLSample_22_Compaction, Aleksey writes: "It happens because many temporary
objects are allocated while populating the list." I'm having trouble working
out exactly what the temporary objects are. Is it the implicit c.toString()?
Maybe also the "int minCapacity" in ensureCapacityInternal in ArrayList[1]? Is
there more?

Also, in JOLSample_22_Compaction it is kind of nice to see some gaps spaced by
growing powers of 2 in http://imgur.com/4T5dBi8; these must be the backing
reference arrays of the ArrayList, successively replaced as the list grows,
with the current one shown in green. Or am I mistaken?
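
For what it's worth, my guess (a sketch, not the sample's actual code) is that
something like the loop below produces exactly that pattern: every add() that
exceeds the current capacity makes ArrayList allocate a larger Object[] via
Arrays.copyOf, leaving the old backing array behind as garbage, and autoboxing
piles more short-lived objects on top.

    import java.util.ArrayList;
    import java.util.List;

    public class GrowthDemo {
        public static void main(String[] args) {
            List<Integer> list = new ArrayList<>();  // default capacity 10
            for (int i = 0; i < 100_000; i++) {
                // Boxing creates an Integer; when capacity is exceeded, a larger
                // Object[] is allocated and the old one becomes garbage.
                list.add(i);
            }
            System.out.println(list.size());
        }
    }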

Fascinating.

[1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/ArrayList.java#222

------
notamy
Love these. Even after reading over the JVM spec, these writeups still show
how much I don't know.

~~~
pvg
This might be because these are about (important) implementation details the
spec quite deliberately does not specify.

------
openasocket
OT, but I'm curious whether there's any manual memory management analogue of
moving GC. Generally in that setting, if you want to allocate a bunch of
objects with good locality, you use a slab allocator. But what if you want to
compact that slab after a certain period of time because of fragmentation, or
you have a bunch of objects already on the heap that you want to move together
onto a single slab? The complication is that you need a way to update all the
references to those objects.
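
For concreteness, one shape of an answer I can imagine is a handle-based arena:
callers hold stable handles, and only the arena's table has to be patched when
the storage moves (a toy Java sketch; every name here is made up).

    final class HandleArena {
        private final byte[] heap = new byte[1 << 20];  // the movable storage
        private final int[] offsets = new int[1024];    // handle -> offset into heap
        private final int[] sizes = new int[1024];
        private final boolean[] live = new boolean[1024];
        private int top = 0;          // bump pointer into heap
        private int nextHandle = 0;

        int allocate(int size) {      // callers keep the handle, never a raw offset
            int h = nextHandle++;
            offsets[h] = top;
            sizes[h] = size;
            live[h] = true;
            top += size;
            return h;
        }

        void free(int handle) { live[handle] = false; }

        int offsetOf(int handle) { return offsets[handle]; }  // re-read after compact()

        // Slide live allocations to the front and patch only the table; the
        // handles held by callers stay valid without a global pointer fixup.
        void compact() {
            int newTop = 0;
            for (int h = 0; h < nextHandle; h++) {
                if (!live[h]) continue;
                System.arraycopy(heap, offsets[h], heap, newTop, sizes[h]);
                offsets[h] = newTop;
                newTop += sizes[h];
            }
            top = newTop;
        }
    }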

I'm curious if anyone knows of such work that has been done?

~~~
gwbas1c
Doubtful, because compaction requires updating the pointers to the new
locations. In manual memory management, the compactor would have to stop the
world, know enough about each structure's internal layout to know where the
pointers are, and then update all of them.

That would be impossible when trying to interoperate among different
languages.

Manual memory management has its place, but if you need compaction, you should
use a garbage collector. Manual memory management (and reference counting)
isn't always faster than GC. (Malloc has overhead too, and can run slower than
GC.)

