The experiments in this post are super interesting, but it's not a good idea to extrapolate from microbenchmarks. In my experience big programs always have some chaotic subtlety, which can sometimes even make their behavior the opposite of what the microbenchmark suggests.
It's worth noting that GCs are tuned so that the typical program's allocations end up maximizing the locality of object accesses. Non-compacting GCs tend to allocate objects of the same type next to each other (so if you allocate O, P, Q where O and Q have type T and P has type S, then Q will be right next to O but P will be elsewhere), while compacting GCs tend to allocate objects in sequence (so O and P will be adjacent, and Q will come after P). There's no guarantee that any particular program will prefer one ordering over the other.
What about adding a dummy sweep phase to the compacting collector? Then you could compare compacting + dummy sweep with non-compacting + real sweep.
Maybe it isn't obvious: compaction isn't a plug-in that you put into a collector. It's a different way of writing a collector. For example, compacting collectors often avoid having any mark bits. You need mark bits to sweep (among other things).
And you should also be wary whenever someone uses the JVM to make a general point about GC strategies.
The importance of a moving/compacting GC varies hugely depending on whether or not a language/runtime lets developers define the memory layout of objects.
In a language like Java (and other JVM-based languages) that doesn't let you do that, relying on the GC is the only way to fix locality. But that is not the case in languages that have structured value types (C#, Go, Swift, ...).
Far from varying "hugely", the programmer-specified memory layout of objects empirically barely has any effect at all on locality. This meme is cited a lot but doesn't actually hold up in reality.
We've probably done one of the few actual studies of this in Rust, because we started reordering struct fields in the compiler instead of using the field order the developer wrote (though this can be turned off), and it barely affected locality at all. Memory usage was slightly improved (which argues against your point that developers, left to their own devices, specify optimal layouts).
The presence of struct types reduces allocation churn, so all memory management things become less important.
But C# has value types, and the generational hypothesis certainly holds for C#. So while good memory management becomes less important, it doesn't change what the optimal MM approach is.
The higher the allocation rate, the larger the throughput advantage of compacting collectors.
But compacting collectors have all sorts of other costs - complexity, hard to make them concurrent, potential for high barrier overheads. So, if your allocation rate is low enough, then the throughput advantage of the compacting collector might be low enough that you might think to use some other kind of collector.
JSC is a great example. I should say that the allocation rate that matters is really the allocation rate scaled against the GC's speed; in classic terms we're talking about the cons/mark ratio. JS has a low cons/mark ratio (low relative allocation rate) because the JS collector is fundamentally no slower than any other one but the JS mutator is significantly slower than something like Java. So JS's allocation rate relative to the GC is lower than Java's allocation rate relative to the GC. This means that all GC throughput effects are dampened, which is one of the big reasons why JSC can get away with a collector that doesn't compact.
Also, the generational hypothesis has nothing to do with whether compaction is profitable. These are orthogonal things.
"Also, the generational hypothesis has nothing to do with whether compaction is profitable. These are orthogonal things."
It is my understanding that these two are strongly linked, at least in practice.
The generational hypothesis states that most (even overwhelmingly most) allocations are unreachable very quickly.
So we employ moving/compacting collectors so that we no longer need to sweep all those dead allocations. These dead allocations are freed at no cost and we only pay for copying the live set.
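That cost asymmetry can be sketched as a toy semispace evacuation (the `obj` type and `evacuate` function are illustrative, not any real collector): work is proportional to the survivors, and dead objects are never visited at all.

```go
package main

import "fmt"

// Toy object: a real collector would trace reachability; here we just
// tag each object with whether it survived.
type obj struct {
	val  int
	live bool
}

// evacuate copies live objects from from-space into to-space. Dead
// objects get no per-object free and no sweep; the whole from-space is
// simply reused afterwards.
func evacuate(fromSpace []obj) (toSpace []obj, copied int) {
	for _, o := range fromSpace {
		if o.live { // only survivors cost anything
			toSpace = append(toSpace, o)
			copied++
		}
	}
	return toSpace, copied
}

func main() {
	from := make([]obj, 1000)
	for i := range from {
		from[i].live = i%50 == 0 // ~2% survive, per the generational hypothesis
	}
	_, copied := evacuate(from)
	fmt.Println(copied) // 20 of 1000 objects touched
}
```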
Having written that, it occurs to me that the sentence I quoted read oddly to me because I conflated 'compaction' with 'moving'.
Happy to be corrected on anything written above.
It’s true that those generational GCs that copy objects also sometimes use the address of the object to track the object’s generation. JSC’s GC uses a GC state byte in the object header to tell which generation an object is in.
Eh, "nothing" is a strong word there. Technically you're right, but most generational GCs in practice use bump allocation in the nursery. The most common object copying occurs through minor collections, since compaction of the tenured generation is expensive and is done infrequently. So introducing generational GC in the usual way often does coincide with introducing copying.
You seem to be treating “generational GC” as a historical concept rather than an algorithmic concept. You’re right that from a historical perspective, generational GC and copying are related. But they are not technically or theoretically related.
Um, I'm well aware that generational GC doesn't require copying, and I mentioned this from the start. I don't even know what we're arguing about anymore.
And it's not just that I don't need the GC to help me with that in a language that has value types. It's also that the GC might get it completely wrong if I'm not accessing the objects in creation order.
The generational hypothesis will hold in general, but I don't think it is very relevant when we're talking about processing large arrays filled with small objects. They're all going to be old gen.
Most apps aren't doing that. The generational hypothesis is one of the strongest empirical observations we've ever seen in computer science, and those are very hard to come by. Every attempt to deny the generational hypothesis that I've ever seen has been proven to be wrong in the end.
The observation as I understand it is that if you take the average over a large set of programs, then the hypothesis holds.
The observation is _not_ that if you take any program, then the hypothesis holds.
I'm hedging that it's the average over a large population of programs. I'm not making a claim about every program, only most programs, or rather - "most programs most of the time". I think that this is important because without a doubt, some programs are not generational at all.
Great example: go navigate from this page to another one. You want a full GC, not an eden GC, at this point. If your primary activity when browsing is navigating around and you're not spawning a new process for each navigation, then you are violating the generational hypothesis big time.
Note that this is somewhat orthogonal to the question of whether you should implement a generational GC. Generational GCs are great in part because they are rarely a regression on non-generational workloads.
Classic example: in the MMTk project they built this goofy "only for testing" non-generational collector that allocates in a bump nursery and then during GC it simultaneously evacuates the nursery and marks the "old" space. This non-generational GC, which has some traits of a generational GC, was slightly faster than their mark-sweep collector. It's often the case that generational GCs are really about a lot more than just exploiting the generational hypothesis.
I'd be more than happy to discard my speculation if it happens to be wrong!
"However, modern memory allocation algorithms, like the tcmalloc-based approach used by the Go runtime, have essentially no fragmentation issues. And while a bump allocator can be simple and efficient for a single-threaded program, in a multi-threaded program like Go it requires locks. In general it's likely to be more efficient to allocate memory using a set of per-thread caches, and at that point you've lost the advantages of a bump allocator. So I would assert that, in general, with many caveats, there is no real advantage .."
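The tradeoff Ian describes can be sketched with a toy shared bump allocator (the `BumpArena` type is mine, purely illustrative): because all threads bump the same pointer, every allocation has to take a lock (or use atomics), which is exactly what per-thread caches avoid.

```go
package main

import (
	"fmt"
	"sync"
)

// BumpArena: a single shared bump allocator. The pointer bump itself is
// trivial, but sharing it across threads forces synchronization on the
// hot path of every allocation.
type BumpArena struct {
	mu   sync.Mutex
	buf  []byte
	next int
}

func (a *BumpArena) Alloc(n int) []byte {
	a.mu.Lock() // contention point: taken for every allocation, by every thread
	defer a.mu.Unlock()
	if a.next+n > len(a.buf) {
		return nil // out of space; a real runtime would GC or grow the heap
	}
	p := a.buf[a.next : a.next+n]
	a.next += n
	return p
}

func main() {
	arena := &BumpArena{buf: make([]byte, 1024)}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			arena.Alloc(16)
		}()
	}
	wg.Wait()
	fmt.Println(arena.next) // 128: 8 goroutines x 16 bytes each
}
```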
Mostly because the Go GC does fragment, which can lead to crashes when it shouldn't. The Go team seems to believe escape analysis means there is no need for the generational hypothesis, while Gil and Shipilev think that escape analysis means less allocation (which does help collection, but not enough).
Shipilev these days works on the Red Hat Shenandoah GC team, which builds a low-pause-time collector (on the order of what the Go team promises) that is compacting and fully parallel. So he knows a thing or two about modern GCs, JITs, and compiler tech ;) Not to say that the Go team doesn't; it's just that they might assume some things which are not true for some apps.
Here is Gil's view:
"Go's GC story is evolving. And it's doing so at a fairly quick pace. Judging it based on a current implementation and current choices is premature, I think.
I think that the Go team's implementation priority focus on latency is a good one. And unlike others, I don't actually think there is a strong latency vs. throughput tradeoff in the long run."
And kind words about HotSpot:
"The overall approach with HotSpot can be thought of as "start with a STW generational collector and increment your way to lower average/typical pauses in the old generations". E.g. all practical collectors in HotSpot use a monolithic stop-the-world copying newgen collector (not concurrent or incremental, monolithic, run-to-completion STW, always). And all collectors in HotSpot have a fallback to a full STW oldgen pause. The various non-monolithic-STW "cool modes" (Partially Concurrent non-compacting Oldgen in CMS, Incremental STW compaction in G1) are filters in the oldgen that occur between full STW newgen pauses and full STW oldgen pauses. They are "happy" when the filters work well enough to keep the STW newgen pauses in the tens of msec and delay the full-STW pauses for hours or days. And they consider it to be "ok and still working" even when those levels are not met."
You can check Shenandoah pause times (20-45ms): decent for the Java world, but not really competing with Go.
To add, they seem very critical of generational GC:
"Generational GC pay a steep penalty for copying Data."
You neglected to add the rest of the comment, which refutes the rest of your post:
> I would predict that as Go's use application cases expand it's GC implementations will evolve to include generational moving collectors.
> To add they seem very critical of Generational GC:
That's their LRU cache benchmark, which is an artificial benchmark testing the worst case for generational GC, as they point out later in the slide deck. The takeaway from this should not be that generational GC is bad in general.
"It is common to see early GC-based runtime implementations that do not move objects around succumb to an architectural "hole" that creates a reliance on that fact early in the maturity cycles. This usually happens by allowing native code extensions to assume object addresses are fixed (i.e. that objects will not move during their lifetime) and getting stuck with that "assumed quality" in the long run due to practical compatibility needs. From what I can tell, it is not too late for Go to avoid this problem."
Previously I was not concerned about the long-term outlook for Go's GC. It's low pause (with some pathological cases) and currently very inefficient. The long term plan had previously mentioned moving to a generational/moving collector in the future. Gil's endorsement was cheering.
But Ian's comments on non-moving collectors on the golang-nuts mailing list were alarming (and seemed technically confused). Time will tell.
Inefficient compared to what? Does it lead to Go apps using more memory? More CPU? Longer execution time? Because I haven't seen much difference in those respects. In fact, in general Go apps use 2-10 times less memory than Java apps.
Specifically because it uses a non-moving collector, it has a sweep phase which frees each of the dead allocations.
In Java, or similarly .NET, the use of a moving collector means that dead objects aren't individually freed. The live objects are moved out of the current memory region and the whole region is then 'free'. If you have few live allocations and lots of dead allocations in that region, then your GC cycle is much more efficient.
I can't comment on memory usage experiences. I have written very careful Java programs that ran on the 64-bit HotSpot server in ~40MB of memory. I've written Java programs that used gigs, and I've written the same range in Go.
I would like to quickly note that I actually like the Go GC. I think they are on a very promising path to a potentially great GC. But they are also given to some public hyperbole which I find awkward.
Go will not be able to fix this without introducing generational GC (or global two-space copying collection, but nobody would do that). It's unfortunate that the Go developers seem to be digging in their heels over this choice. They should admit that their approach is wrong for usual apps and introduce generational GC (as Gil suggests).
As a GC developer you might be more impressed by Java's cutting-edge work, but as an application developer, Java's (HotSpot) approach to GC doesn't buy me anything over Go's.
In my development experience Java apps use more memory, and Java's GC is less predictable.
The production problems I commonly run into are rather unpredictable and very long GC pauses in Java.
At this point I'd just be repeating Mike Hearn's excellent blog post, so I'll just link you there: https://blog.plan99.net/modern-garbage-collection-911ef4f8bd...
Ian is extremely confused here. Every production multithreaded generational GC uses thread-local allocation buffers (edit: thanks to Filip for correcting my terminology).
But probably all of them use thread-local caches, or what you might call per-thread caches. I think that's what you might have meant.
Per-thread nurseries is a shit idea unless your language has extra-super-high levels of infant mortality. Remember that in a per-thread nursery genGC, the write barrier has to infect the pointed-to object with the global heap so that future stores to that object also trigger the barrier. In a normal genGC, the pointed-to object is unaffected; the stored-to object is merely shaded (or card-marked, or logged, or whatever) for revisiting.
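The "normal genGC" barrier described above can be sketched as card marking (the `heap` type, card size, and addresses are all illustrative): on a pointer store into the old generation, only the stored-to object's card is dirtied so the collector revisits it at the next minor GC; the pointed-to nursery object is untouched.

```go
package main

import "fmt"

// Each card covers a fixed-size chunk of old-gen memory; one dirty bit
// per card is all the barrier maintains.
const cardSize = 512

type heap struct {
	cards []bool // dirty bits, indexed by address / cardSize
}

// writeBarrier records that the old-gen slot at addr may now hold a
// pointer into the nursery. Note it shades the stored-to location only.
func (h *heap) writeBarrier(addr uintptr) {
	h.cards[addr/cardSize] = true
}

func main() {
	h := &heap{cards: make([]bool, 16)}
	h.writeBarrier(1300)    // store into an old-gen object at "address" 1300
	fmt.Println(h.cards[2]) // true: card 1300/512 = 2 is now dirty
}
```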
Humorously for parent's post, to my knowledge the first instance of a thread-local cache was the idea that a multithreaded compacting GC can chop off a slice of memory from the global bump allocator and then use it to service a thread-local allocator. Not 100% sure though because maybe Boehm and friends beat everyone to it.
> Per-thread nurseries is a shit idea unless your language has extra-super-high levels of infant mortality.
Isn't that what TLABs are? HotSpot uses TLABs by default, unless my info is out of date… For example,  shows the generated assembler for object allocation, which uses nonatomic pointer bumps.
I think we might be talking about different things by "per-thread nurseries". All I mean by a "per-thread nursery" is a TLAB.
There's a HUGE difference between a thread-local cache (which everyone uses and which you cite) and per-thread nurseries (which are an obscure idea that hardly anybody likes).
"Thread-local allocation buffers (TLABs) are widely used in memory allocators of garbage-collected systems to speed up the fast-path (thread-local allocation) and reduce global heap contention yet at the expense of increased memory fragmentation. Larger TLABs generally improve performance and scalability but only up to the point where more frequent garbage collection triggered by increased memory fragmentation begins dominating the overall memory management overhead. Smaller TLABs decrease memory fragmentation but increase the frequency of executing the slow-path (global allocation) and thus may reduce performance and scalability..."
Whoever wrote that must not be very familiar with how memory management works.
Images are in order.
In JOLSample_22_Compaction, Aleksey writes: "It happens because many temporary objects are allocated while populating the list.". I'm having trouble knowing exactly what the temporary objects are. Is it the implicit c.toString()? Maybe also the int minCapacity in ensureCapacityInternal in ArrayList? Are there more?
Also, in JOLSample_22_Compaction it is kind of nice to see some gaps spaced by growing powers of 2 in http://imgur.com/4T5dBi8; these must be the actual pointer arrays of the ArrayList, successively replaced as they grow in size, with the current one shown in green. Or am I mistaken?
I'm curious: does anyone know of any such work that has been done?
That would be impossible when trying to interoperate among different languages.
Manual memory management has its place, but if you need compaction, you should use a garbage collector. Manual memory management (and reference counting) isn't always faster than GC. (malloc has overhead too, and can run slower than a GC.)