Hacker News new | comments | show | ask | jobs | submit login
JVM Anatomy Park – 11: Moving GC and Locality (shipilev.net)
138 points by hyperpape 189 days ago | hide | past | web | favorite | 51 comments



I think that the main reason why compacting GCs improve performance is that they obviate the need for a sweep phase, which always has some cost even if you optimize the hell out of it. I believe that from a scientific standpoint, it is still an open question exactly how much of the compacting GC speed-up is due to not sweeping versus improved locality. It's hard to construct a good experiment of this, because non-compacting GCs always have at least some kind of sweep phase. For example, if the non-compacting collector experiences fragmentation, then is its reduced performance due to the cache misses experienced by the sweeper (which have nothing to do with locality per se - they are just an inherent cost of sweeping) or because of cache misses experienced by accesses within the program due to the changed ordering of objects?

The experiments in this post are super interesting, but it's not a good idea to extrapolate from microbenchmarks. In my experience big programs always have some chaotic subtlety, which can sometimes even make its behavior be the opposite of what the microbenchmark tells you.

It's worth noting that GCs are tuned so that the typical program will typically allocate objects so as to maximize the locality of object accesses. Non-compacting GCs tend to allocate objects of the same type next to each other (so if you allocate O, P, Q where O and Q have type T and P bas type S, then Q will be right next to O but P will be elsewhere) while compacting GCs tend to allocate objects in sequence (so O and P will be adjacent, and then after P will be Q). There's no guarantee that any particular program will prefer one ordering over the other.


It's hard to construct a good experiment of this, because non-compacting GCs always have at least some kind of sweep phase. For example, if the non-compacting collector experiences fragmentation, then is its reduced performance due to the cache misses experienced by the sweeper (which have nothing to do with locality per se - they are just an inherent cost of sweeping) or because of cache misses experienced by accesses within the program due to the changed ordering of objects?

What about adding a dummy sweep phase to the compacting collector? Then you could compare compacting + dummy sweep with non-compacting + real sweep.


There is no way to implement a dummy sweep that has the behavior of a real sweep if the collector is doing compaction.

Maybe it isn't obvious: compaction isn't a plug-in that you put into a collector. It's a different way of writing a collector. For example, compacting collectors often avoid having any mark bits. You need mark bits to sweep (among other things).


>Be ever so slightly wary when someone sells you the non-moving GC or no-GC solutions

And you should also be wary whenever someone uses the JVM to make a general point about GC strategies.

The importance of a moving/compacting GC varies hugely depending on whether or not a language/runtime lets developers define the memory layout of objects.

In a language like Java (and other JVM based languages) that doesn't let you do that relying on the GC is the only way to fix locality. But that is not the case in languages that have structured value types (C#, Go, Swift, ...).


Totally! Java is in a class of languages that encourages a high allocation rate. I think that each language has a kind of "average" allocation rate that is an indirect outcome of the various design decisions in the language. I think you're right that C# probably has a lower average allocation rate because of structs. It's true that languages with lower average allocation rates benefit less from compaction, because they benefit less from any GC optimization.


> The importance of a moving/compacting GC varies hugely depending on whether or not a language/runtime lets developers define the memory layout of objects.

Far from varying "hugely", the programmer-specified memory layout of objects empirically barely has any effect at all on locality. This meme is cited a lot but doesn't actually hold up in reality.

We've probably done one of the few actual studies of this in Rust, because we started reordering struct fields in the compiler instead of using the field order the developer wrote (though this can be turned off), and it barely affected locality at all. Memory usage was slightly improved (which argues against your point that developers, left to their own devices, specify optimal layouts).


I thought he was talking about the existence (or lack) of struct types, not the ability to lay things out in a struct.

The presence of struct types reduces allocation churn, so all memory management things become less important.


Yeah, if "letting developers define the memory layout of objects" means "value types" then that's obviously true.

But C# has value types, and the generational hypothesis certainly holds for C#. So while good memory management becomes less important, it doesn't change what the optimal MM approach is.


Of course it changes the MM approach if allocation rate goes down.

The higher the allocation rate, the larger the throughput advantage of compacting collectors.

But compacting collectors have all sorts of other costs - complexity, hard to make them concurrent, potential for high barrier overheads. So, if your allocation rate is low enough, then the throughput advantage of the compacting collector might be low enough that you might think to use some other kind of collector.

JSC is a great example. I should say that the allocation rate that matters is really the allocation rate scaled against the GC's speed; in classic terms we're talking about the cons/mark ratio. JS has a low cons/mark ratio (low relative allocation rate) because the JS collector is fundamentally no slower than any other one but the JS mutator is significantly slower than something like Java. So JS's allocation rate relative to the GC is lower than Java's allocation rate relative to the GC. This means that all GC throughput effects are dampened, which is one of the big reasons why JSC can get away with a collector that doesn't compact.

Also, the generational hypothesis has nothing to do with whether compaction is profitable. These are orthogonal things.

(EDIT: replaced "fundamentally now slower" with "fundamentally no slower", above)


Can you expand on

"Also, the generational hypothesis has nothing to do with whether compaction is profitable. These are orthogonal things."

It is my understanding that these two are strongly linked, at least in practice.

The generational hypothesis states that most (even overwhelmingly most) allocations are unreachable very quickly.

So we employ moving/compacting collectors so that we no longer need to sweep all those dead allocations. These dead allocations are freed at no cost and we only pay for copying the live set.

Having written that, it occurs to me that the sentence I quote read oddly to me because I conflated 'compaction' with 'moving'.

Happy to be corrected on anything written above.


Generational GC had nothing to do with moving objects. It’s possible to implement generational GC without moving anything. That’s how BDWGC, Edge’s GC, and JSC’s GC all work and there are probably others.

It’s true that those generational GCs that copy objects also sometimes use the address of the object to track the object’s generation. JSC’s GC uses a GC state byte in the object header to tell which generation an object is in.


> Generational GC had nothing to do with moving objects.

Eh, "nothing" is a strong word there. Technically you're right, but most generational GCs in practice use bump allocation in the nursery. The most common object copying occurs through minor collections, since compaction of the tenured generation is expensive and is done infrequently. So introducing generational GC in the usual way often does coincide with introducing copying.


Generational GC is a reusable concept. It’s very important for those who go to apply this concept to understand what you need to build a generational GC. My point is that you don’t need copying to do generational GC. The only thing that generational GC has to do with copying is that this is how the first generational GCs happened to be implemented and some people don’t know that there is another way, probably because people like you refuse to acknowledge that copying has nothing to do with generations.

You seem to be treating “generational GC” as a historical concept rather than an algorithmic concept. You’re right that from an historic perspective, generational GC and copying are related. But they are not technically or theoretically related.


> probably because people like you refuse to acknowledge that copying has nothing to do with generations.

Um, I'm well aware that generational GC doesn't require copying, and I mentioned this from the start. I don't even know what we're arguing about anymore.


pizlonator is correct. I didn't mean the layout of individual structs (that is its fields). What I meant is the layout of large numbers of objects in memory (relative to each other) which is what the article was benchmarking.

And it's not just that I don't need the GC to help me with that in a language that has value types. It's also that the GC might get it completely wrong if I'm not accessing the objects in creation order.

The generational hypothesis will hold in general, but I don't think it is very relevant when we're talking about processing large arrays filled with small objects. They're all going to be old gen.


> The generational hypothesis will hold in general, but I don't think it is very relevant when we're talking about processing large arrays filled with small objects. They're all going to be old gen.

Most apps aren't doing that. The generational hypothesis is one of the strongest empirical observations we've ever seen in computer science, and those are very hard to come by. Every attempt to deny the generational hypothesis that I've ever seen has been proven to be wrong in the end.


I think you're overstating the meaning of the empirical observation.

The observation as I understand it is that if you take the average over a large set of programs, then the hypothesis holds.

The observation is _not_ that if you take any program, then the hypothesis holds.

I'm hedging that it's the average over a large population of programs. I'm not making a claim about every program, only most programs, or rather - "most programs most of the time". I think that this is important because without a doubt, some programs are not generational at all.

Great example: go navigate from this page to another one. You want a full GC, not an eden GC, at this point. If your primary activity when browsing is navigating around and you're not spawning a new process for each navigation, then you are violating the generational hypothesis big time.

Note that this is somewhat orthogonal to the question of whether you should implement a generational GC. Generational GCs are great in part because they are rarely a regression on non-generational workloads.

Classic example: in the MMTk project they built this goofy "only for testing" non-generational collector that allocates in a bump nursery and then during GC it simultaneously evacuates the nursery and marks the "old" space. This non-generational GC, which has some traits of a generational GC, was slightly faster than their mark-sweep collector. It's often the case that generational GCs are really about a lot more than just exploiting the generational hypothesis.


Yes, there are some workloads that violate the generational hypothesis. But the vast majority of real-life programs adhere to it. I'm not arguing that the generational hypothesis holds over the entire universe of all possible programs; that would obviously be wrong. Rather I'm arguing against the implication expressed by some in this thread that generational GC isn't worth implementing because the generational hypothesis is wrong for most programs in X language.


I'm super puzzled why you think that the generational hypothesis should hold in every language, to the point that you're willing to reject empirical findings in other languages that contradict this presumption. If some language did not obey the generational hypothesis then this would not contract any single - not even one - prior empirical finding, unless there had also already been experiments on this subject in that particular language. The field of GC is way too young to start extrapolating findings from some languages to all languages.


I'm not rejecting empirical findings that contradict the generational hypothesis. I simply don't believe anybody has actually done any rigorous measurements of object mortality in Go and how it compares to other languages. In the absence of such measurements, I'm going to speculate similar allocation patterns to C#, which has a very similar model and the generational hypothesis certainly does hold for it.

I'd be more than happy to discard my speculation if it happens to be wrong!


Most apps aren't doing that, which is exactly why I said that the generational hypothesis will hold in general. I don't need convincing that optimizing for that case is generally a good idea. It just has very little to do with the point I was trying to make (apparently in a less than clear way)


Field order is uninteresting compared to, say, coalescing allocations or using SoA (two things Rust isn't particularly good at, IMO). In the sense of those, memory layout is important.


You might have checked but here is IanT's reply on "Why golang garbage-collector not implement Generational and Compact gc? There is also interesting discussion in thread.

https://groups.google.com/d/msg/golang-nuts/KJiyv2mV2pU/wdBU...

"However, modern memory allocation algorithms, like the tcmalloc-based approach used by the Go runtime, have essentially no fragmentation issues. And while a bump allocator can be simple and efficient for a single-threaded program, in a multi-threaded program like Go it requires locks. In general it's likely to be more efficient to allocate memory using a set of per-thread caches, and at that point you've lost the advantages of a bump allocator. So I would assert that, in general, with many caveats, there is no real advantage .."


I think that shipilev wrote this blog post after seeing that post. He and Gil Tene, of Azul Zing/C4 pauseless GC fame, don't seem to be impressed by the GO team GC claims.

Mostly because the GO gc does fragment, which can lead to crashes when it shouldn't. The GO team seems to believe escape analysis means no need for generational hypothesis, while Gil and Shiplev think that escape analysis means less allocation (which does help collection, but not enough).

Shiplev, these days works with the RedHat Shenadoah GC team. Which is low pause time (in the order of the Go Team promise) while compacting and fully parallel. So he knows a thing or two about modern GCs and JIT and compiler tech ;) Not to say that the Go team doesn't, it just that they might assume somethings which are not true for some apps.


Don't know about him but Gil seems happy by Go's choices GC area.

Here are Gil's view:

https://groups.google.com/forum/#!searchin/mechanical-sympat...

About Go:

"Go's GC story is evolving. And it's doing so at a fairly quick pace. Judging it based on a current implementation and current choices is premature, I think.

I think that the Go team's implementation priority focus on latency is a good one. And unlike others, I don't actually think there is a strong latency vs. throughput tradeoff in the long run."

And Kind words about Hotspot:

"The overall approach with HotSpot can be thought of as "start with a STW generational collector and increment your way to lower average/typical pauses in the old generations". E.g. all practical collectors in HotSpot use a monolithic stop-the-world copying newgen collector (not concurrent or incremental, monolithic, run-to-completion STW, always). And all collectors in HotSpot have a fallback to a full STW oldgen pause. The various non-monolithic-STW "cool modes" (Partially Concurrent non-compacting Oldgen in CMS, Incremental STW compaction in G1) are filters in the oldgen that occur between full STW newgen pauses and full STW oldgen pauses. They are "happy" when the filters work well enough to keep the STW newgen pauses in the tens of msec and delay the full-STW pauses for hours or days. And they consider it to be "ok and still working" even when those levels are not met."

You can check Shenandoah pause times (20-45ms) :

https://www.slideshare.net/RedHatDevelopers/shenandoah-gc-ja...

Decent for Java world but not really competing with Go.

To add they seem very critical of Generational GC:

"Generational GC pay a steep penalty for copying Data."


> About Go:

You neglected to add the rest of the comment, which refutes the rest of your post:

> I would predict that as Go's use application cases expand it's GC implementations will evolve to include generational moving collectors.

(Emphasis mine.)

> To add they seem very critical of Generational GC:

That's their LRU cache benchmark, which is an artificial benchmark testing the worst case for generational GC, as they point out later in the slide deck. The takeaway from this should not be that generational GC is bad in general.


I would like to add another choice quotation from that discussion.

"It is common to see early GC-based runtime implementations that do not move objects around succumb to an architectural "hole" that creates a reliance on that fact early in the maturity cycles. This usually happens by allowing native code extensions to assume object addresses are fixed (i.e. that objects will not move during their lifetime) and getting stuck with that "assumed quality" in the long run due to practical compatibility needs. From what I can tell, it is not too late for Go to avoid this problem."

Previously I was not concerned about the long-term outlook for Go's GC. It's low pause (with some pathological cases) and currently very inefficient. The long term plan had previously mentioned moving to a generational/moving collector in the future. Gil's endorsement was cheering.

But, Ian's comments on non-moving collector's on the Golang-nuts mailing list were alarming (and seemed technically confused). Time will tell.


> and currently very inefficient.

Inefficient compared to what? Does it lead to Go apps use more memory? more CPU? longer execution time? Because I haven't seen much difference in those respect. In Fact in general Go apps use 2-10 times less memory than Java.


Good question, my claim was _very_ vague. The current GC in Go is inefficient in its use of CPU.

Specifically because it uses a non-moving collector it has a sweep phase which frees each of the dead allocations.

In Java, or similarly .net, the use of a moving collector means that dead objects aren't freed. The live objects are moved out of the current memory region and the whole region is then 'free'. If you have few live allocations and lots of dead allocations in that region then your GC cycle is much more efficient.

I can't comment on memory usage experiences. I have written very careful Java programs that ran on the 64bit hotspot server in ~40mb or memory. I've written Java programs that used gigs, and I've written the same range in Go.

I would like to quickly note that I actually like the Go GC. I think they are on a very promising path to a potentially great GC. But they are also given to some public hyperbole which I find awkward.


More CPU time. By not having generational collection, Go pays an enormous throughput cost, not only in how long marking and sweeping takes, but also in how slow allocation is. Go tries to make up for this by using escape analysis, but HotSpot's allocation runs circles around Go's. It's literally like 3 instructions in HotSpot; for Go it's probably over 100.

Go will not be able to fix this without introducing generational GC (or global two-space copying collection, but nobody would do that). It's unfortunate that the Go developers seem to be digging in their heels over this choice. They should admit that their approach is wrong for usual apps and introduce generational GC (as Gil suggests).


Gil seems to be fine with Go's approach to low latency and making a prediction about future direction. And IMO regarding hotspot he doesn't seem too impressed with their lack of emphasis on latency.

As a GC developer someone might be more impressed by Java's cutting edge work but as an application developer Java (hotspot) approach to GC does not buy me anything more over Go.

In my development experience Java apps use more memory and its GC is less predictable.


I can't speak for Gil, but Go's approach to garbage collection—low pause times at the expense of absolutely every other thing you might want in a GC—runs contrary to the wisdom of pretty much everybody else who works in this space. That's why nobody else does what Go does.


What is absolutely every other thing? My Go programs run more or less in same time for similar processing, and use quite a bit less memory than Java along the way.

The production problems I commonly run into is rather unpredictable and very long GC pauses in Java.


Sweeping throughput, allocation performance, pause frequency, mark time, compaction (Java can move objects around instead of triggering OOM), cache behavior, etc. etc.

At this point I'd just be repeating Mike Hearn's excellent blog post, so I'll just link you there: https://blog.plan99.net/modern-garbage-collection-911ef4f8bd...


Thanks. At this point it is clear you are not a Java developer and have no idea how Java actually behave in enterprise applications. You would just quote an article with no numbers but unsubstantiated claims.


The takeaway for me is low latency is way more difficult to achieve in Java. Also Shenandoah claims to be useful for trading/e-commerce/interactive/qos guarantees based service. This covers large range of applications typically developed in Java. And here Shenandoah is suppose to be better than other hotspot GCs.


> And while a bump allocator can be simple and efficient for a single-threaded program, in a multi-threaded program like Go it requires locks. In general it's likely to be more efficient to allocate memory using a set of per-thread caches, and at that point you've lost the advantages of a bump allocator.

Ian is extremely confused here. Every production multithreaded generational GC uses thread-local allocation buffers (edit: thanks to Filip for correcting my terminology).


Very few production multithreaded generational GCs use per-thread nurseries.

But probably all of them use thread-local caches, or what you might call per-thread caches. I think that's what you might have meant.

Per-thread nurseries is a shit idea unless your language has extra-super-high levels of infant mortality. Remember that in a per-thread nursery genGC, the write barrier has to infect the pointed-to object with the global heap so that future stores to that object also trigger the barrier. In a normal genGC, the pointed-to object is unaffected; the stored-to object is merely shaded (or card-marked, or logged, or whatever) for revisiting.

Humorously for parent's post, to my knowledge the first instance of a thread-local cache was the idea that a multithreaded compacting GC can chop off a slice of memory from the global bump allocator and then use it to service a thread-local allocator. Not 100% sure though because maybe Boehm and friends beat everyone to it.


> Very few production multithreaded generational GCs use per-thread nurseries.

> Per-thread nurseries is a shit idea unless your language has extra-super-high levels of infant mortality.

Isn't that what TLABs are? HotSpot uses TLABs by default, unless my info is out of date… For example, [1] shows the generated assembler for object allocation, which uses nonatomic pointer bumps.

I think we might be talking about different things by "per-thread nurseries". All I mean by a "per-thread nursery" is a TLAB.

[1]: https://shipilev.net/jvm-anatomy-park/4-tlab-allocation/


Like I said, you're thinking thread-local caches. TLABS are thread-local caches, not per-thread nurseries.

There's a HUGE difference between a thread-local cache (which everyone uses and which you cite) and per-thread nurseries (which are an obscure idea that hardly anybody likes).


Yes. Apologies for using the wrong terminology :)


And TLABs causes more memory fragmentations and which lead to more GCs and more need for memory compaction. So finally Java solved the problem it created itself by implementing compacting and generational GC.


What? TLABs allow a bump allocator with copying collection, which results in the least kind of fragmentation you can possibly have—zero! How can TLABs increase fragmentation?


You can read at:

https://www.thinkmind.org/download.php?articleid=icons_2012_....

"Thread-local allocation buffers (TLABs) are widely used in memory allocators of garbage-collected systems to speed up the fast-path (thread-local allocation) and reduce global heap contention yet at the expense of increased memory fragmentation. Larger TLABs generally improve performance and scalability but only up to the point where more frequent garbage collection triggered by increased memory fragmen- tation begins dominating the overall memory management overhead. Smaller TLABs decrease memory fragmentation but increase the frequency of executing the slow-path (global allocation) and thus may reduce performance and scalability..."


That isn't relevant to what we're discussing here. The problem of fragmentation across arenas that that paper describes is a problem with any system that uses thread-local caching, including Go's tcmalloc. The type of fragmentation we're discussing is fragmentation within a memory arena. That is the kind of fragmentation that only a copying/compacting collector can eliminate.


Bump allocators don't require locks. Bump allocators are especially effective in the presence of thread-local caches.

Whoever wrote that must not be very familiar with how memory management works.


I ran the 3 visualizations at the end of the article and made imgur albums:

Images are in order.

JOLSample_22_Compaction: http://imgur.com/a/agJaR

JOLSample_23_Defragmentation: http://imgur.com/a/A74k9

JOLSample_24_Colocation: http://imgur.com/a/RTrhL

In JOLSample_22_Compaction, Aleksey writes: "It happens because many temporary objects are allocated while populating the list.". I'm having trouble knowing exactly what are the temporary objects. Is it the implicit c.toString()? Maybe also the "int minCapacity" in ensureCapacityInternal in ArrayList[1]? Is there more?

Also, in JOLSample_22_Compaction it is kind of nice to see some gaps spaced by growing powers of 2 in http://imgur.com/4T5dBi8 ; it must be the actual pointer arrays of the ArrayList, replaced successively to grow in size. And the current one being in green color. Or am I mistaken?

Fascinating.

[1] http://grepcode.com/file/repository.grepcode.com/java/root/j...


Love these. Even after reading over the JVM spec, these writeups still show how much I don't know.


This might be because these are about (important) implementation details the spec quite deliberately does not specify.


OT, but I'm curious if there's any manual memory management analogue for moving GC. Generally in that setting if you want to allocate a bunch of objects with good locality you use a slab allocator. But what if you want to compact that slab after a certain period of time because of fragmentation, or you have a bunch of object already on the heap you want to move together onto a single slab? The complexity is you have to have a way to update all the references to those objects.

I'm curious if anyone knows of such work that has been done?


Doubtful, because compaction requires updating the pointers to the new locations. In manual memory management, the compactor would have to stop the world, know enough about each structures' internal layouts to know where the pointers are, and then update all of them.

That would be impossible when trying to interoperate among different languages.

Manual memory management has its place, but if you need compaction, you should use a garbage collector. Manual memory management, (and reference counting,) aren't always faster than GC. (Malloc has overhead too, and can run slower than GC.)




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: