A presentation on Shenandoah by one of the principal developers (focuses on the how): https://youtube.com/watch?v=4d9-FQZZoVA
Another presentation (focuses on the how, why, and when) at devoxx: https://www.youtube.com/watch?v=VCeHkcwfF9Q
Tangent: Go GC (generational, non-copying, concurrent mark and concurrent sweep) and the improvements it saw from 1.5 -> 1.8 https://blog.golang.org/ismmkeynote
The mutator/pacer scheme, the tri-color marking scheme, and the use of read/write barriers during the sweep phase are interesting points of contrast between the two GCs.
I think that Go GC is not generational. If I'm not mistaken, the keynote you linked explains they tried it and the gains were not there.
In Go, most of those objects actually do get allocated directly on the stack, or as fields directly inside other objects. So they are never seen by the GC. This is one of the aspects of the language design that I really admire.
Or, you could look at it as saying that Go does have a generational memory manager, and the first generation is "on the stack".
Most Go code I've seen still relies on pointers to objects, so unless Go's escape analysis is better than, say, what the JVM or .NET offer, it's not really that different from them. .NET already has value types, and the JVM should be getting them soonish. So there really isn't that much difference in this respect.
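FWIW, the JVM does some of this too. A toy sketch of the kind of allocation HotSpot's escape analysis may elide (class names are made up; scalar replacement is a JIT heuristic, not a guarantee):

```java
// Hypothetical example: a small object that never escapes its method.
// HotSpot's escape analysis (-XX:+DoEscapeAnalysis, on by default) may
// scalar-replace it, so no heap allocation happens at all.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

final class EscapeDemo {
    static int lengthSquared(int x, int y) {
        Point p = new Point(x, y);    // candidate for scalar replacement:
        return p.x * p.x + p.y * p.y; // p never leaves this stack frame
    }
}
```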
As you say, value types are coming soon, but they don't exist now, so right now there is a lot of difference (and in practice will be for a long time, because not everything is going to migrate to value types immediately).
Whether the language uses value types or heap allocated types changes the constant factor, but the core hypothesis isn't affected - you can't allocate everything on the stack. Memory copying costs start overtaking GC overhead if you try too hard.
If you want to know more about the current progress, there is a great presentation -> blog here: https://blog.golang.org/ismmkeynote
And you can see the most recent generational GC code here:
Though my guess is it won't get merged into mainline for Go 1.13, as it is already in code freeze.
That something is a much higher memory requirement for the same size of workload compared to Go.
Non-generational GCs don't have this problem, and it's one of the reasons why Shenandoah suited us well there.
< tenures all those requests to OldGC
> tenures all those requests to OldGen
Keep your caches out of GC memory for extra goodness.
Could you expand on this please?
Caches that are in scalar data forms (e.g. byte arrays) or off-heap aren't too bad - bytes and off-heap memory don't need to be traced. If you're caching an object graph dense with pointers, then not so great.
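To make that concrete, here's a toy sketch (not any particular cache library) of the byte-array flavor: the collector sees one opaque array per value instead of a pointer-dense graph it has to trace on every mark:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy cache keeping values as serialized bytes: each entry's value is a
// single byte[] with no internal pointers for the GC to follow.
final class BytesCache {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    void put(String key, String value) {
        cache.put(key, value.getBytes(StandardCharsets.UTF_8));
    }

    String get(String key) {
        byte[] raw = cache.get(key);
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }
}
```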
Could you please clarify this question? Do you mean that if cached objects are a small part of the total allocation rate, then generational GCs work well with that?
My answer meant that Shenandoah would work well in a program where cache occupies like 70-80% of the heap, and generational GCs might not. But surely, neither are going to break from a 1%-heap cache.
Our team is going to be targeting ZGC for the main reason that it's included by default in JDK11. Yeah, builds of JDK11 with Shenandoah are available but it's more work.
We have throughput-intensive loads where pause times do not matter, so we do not use ZGC or Shenandoah; in these cases it seems that even ParallelGC is better than G1, but that is another story.
I've tried to summarize this experience in this post (incl. GraalVM and different JDK versions)
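For reference, switching collectors is just a flag swap; these are the stock HotSpot switches (MyApp is a placeholder):

```
java -XX:+UseParallelGC MyApp                                # throughput collector
java -XX:+UseG1GC MyApp                                      # the default since JDK 9
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC MyApp      # ZGC (experimental in JDK 11)
java -XX:+UseShenandoahGC MyApp                              # on builds that include it
                                                             # (experimental unlock may be
                                                             # needed depending on the build)
```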
It looks like the maximum pauses of Shenandoah are still well over 1ms, which will still cause a lot of tail latency in services. (Go reached ~5ms max pauses a few releases ago, but there was still a significant improvement in the behavior of services when pauses were clamped at 100μs in more recent releases.)
Definitely the right direction. I hope future versions of Shenandoah will clamp GC pauses even lower.
Also, I think Go's pause times are misleading, because I believe Go uses throttling, and throttling pauses are not counted as GC pauses. One could easily write a GC with an absolute zero max pause: make each allocation a super-fast, simple pointer bump from a thread-local allocation buffer, and when that runs out, throttle (i.e. block) the thread forever. Of course, the throughput of such a "collector" will converge to zero, too. So pause times without throughput numbers are meaningless.
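To illustrate the pointer-bump half of that (a toy, nothing like real TLAB code): the fast path is an add and a compare, and the entire throttling policy hides in what happens when the buffer runs out:

```java
// Toy thread-local bump allocation: the fast path is an add + compare.
final class BumpBuffer {
    private final byte[] buf;
    private int top = 0;

    BumpBuffer(int size) { buf = new byte[size]; }

    // Returns the offset of the new "object", or -1 when the buffer is
    // exhausted -- the point where a real runtime requests a fresh TLAB,
    // blocks, or triggers a GC (i.e. where the throttling policy lives).
    int allocate(int size) {
        if (top + size > buf.length) return -1;
        int offset = top;
        top += size;
        return offset;
    }
}
```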
Yes. All GCs I know employ negative feedback loops between allocators and GCs. Stop-the-world GCs are at the far end of that spectrum: there, allocation failure induces "throttling" that does not unblock the thread until the GC is over. Even contending on a shared allocation lock when requesting a new thread-local allocation buffer can be thought of as accidental throttling.
Concurrent GCs rely on collecting faster than the application allocates, but sometimes the application allocates way too much anyway. I believe ZGC does the simplest thing in this case: if an allocation fails, it blocks until memory is available (i.e. until the current GC cycle manages to recover some memory). Is this also throttling? I think so. Shenandoah does a few other tricks, including diving into STW when concurrent mode fails. To me, that is also a form of throttling.
At the end of the day, I would say that GC pause durations are non-ideal proxies for end-to-end latency. That said, GC pauses do contribute to end-to-end latency very significantly, so they are not completely meaningless. Their durations should be taken with a grain of salt, though, whether for Go's GC, ZGC, Shenandoah, or anything else.
I would not agree with the blanket "The stop-the-world pauses in the JVM are typically from compaction". Concurrent marking is done in CMS, G1, Shenandoah, and ZGC [in the first two, there are nitty-gritty details about young collections that distort the story], and that removes a significant portion of stop-the-world time. Concurrent copying and updating of references is done in Shenandoah and ZGC; that removes the rest of it.
Of course, you can skip the compact part and just do a sweep, which frees the implementation from dealing with the second part completely. This is not without drawbacks, though: the allocation path gets more complicated, fragmentation needs to be dealt with, etc. How far you can get with that depends on what the use cases really are. As far as I can tell, the damned "CMS concurrent mode failure" caused by heap fragmentation, plus unwillingness to part with uber-fast bump-ptr allocation paths, nudged most JVM people to accept compaction as the go-to answer for reliable GC.
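A toy sketch of why the allocation path gets more complicated without compaction (first-fit free list, nothing like a production allocator):

```java
import java.util.Iterator;
import java.util.LinkedList;

// Toy first-fit free list: allocation is a search rather than a pointer
// bump, and a request can fail even when enough total space is free,
// just not in one contiguous block -- i.e. fragmentation.
final class FreeListToy {
    record Block(int offset, int size) {}
    private final LinkedList<Block> free = new LinkedList<>();

    FreeListToy(int heapSize) { free.add(new Block(0, heapSize)); }

    int allocate(int size) {
        for (Iterator<Block> it = free.iterator(); it.hasNext(); ) {
            Block b = it.next();
            if (b.size() >= size) {
                it.remove();
                if (b.size() > size) // return the unused remainder to the list
                    free.addFirst(new Block(b.offset() + size, b.size() - size));
                return b.offset();
            }
        }
        return -1; // fragmentation: no single block fits
    }
}
```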
Each language has its own conventions for how much to rely on stacks versus specialized memory pools versus garbage collection. For example, in Go, you often have a stack per request, and can use it for some per-request data as an alternative to GC. That's going to change required throughput for an equivalent amount of work. As will differences in escape analysis.
I don't know how much that matters, but it seems like this makes cross-language comparisons pretty subjective. You need to decide what an equivalent app would be in each language. Are you comparing idiomatic or optimized code and what does that really mean? External validity seems pretty limited.
This is pretty good news for various big data tools and NoSQL databases that run on the JVM.
Edit: Ah, I should have mentioned I am one of those Shenandoah devs, and I have never said "your mileage may vary" that often until diving into GC development work :D
Pony uses a similar heap-per-actor, doesn't copy data, and uses background messages to update GC state between actors.
Loom/JVM fibers don't impose the same limitation; references can be shared across threads. So I would expect that, without a concurrent collector, you'd encounter global pauses.
outside of some special-case, natively implemented data structures
The devs are extremely responsive on the mailing list.
> Garbage collection is by no means a solved problem.
I was reading articles like this one, about new ideas in JVM garbage collection, literally two decades ago. And it's still "not a solved problem".
I mean, it's mostly solved. Managed heaps work great for lots of applications and have been very successful. But that last 5% has stretched out so long it seems almost like a joke. We'll never reach the mythical "there" where GC overhead and latency aren't issues that need to be tuned in deployment. And IMHO it's time to start recognizing that fact instead of trying to make the JVM and .NET do what we get from explicit heaps in C and Rust. They just aren't going to get there.
I think that for the most part we're "there" already. Such improvements in GCs, like Shenandoah, C4 and ZGC, are more than keeping up with most applications' evolving requirements, and yield tremendous benefits for the relatively low cost of added RAM footprint.
> it's time to start recognizing that fact instead of trying to make the JVM and .NET do what we get from explicit heaps in C and Rust.
To do better than the JVM, especially in concurrent applications, you have to work hard. What we "get" from manual memory management we actually buy for a rather steep price (and remember that for shared data structures, which are extremely important in concurrent applications, Rust also uses a GC, just a particularly primitive one -- reference counting).
And even what we buy is not universally "better." I just saw this nice benchmark the other day, that compares allocation costs (even without concurrency). As always, it gives a small part of the picture, but a very real one: https://github.com/rbehrends/btree-alloc. I don't think anyone can reasonably claim that manual memory management is "generally better" than modern GCs, even once the effort is invested. It's better in some respects and worse in others. So it's not even clear which of them has that "last 5%" (and sometimes more) advantage.
The outrageous amount of work that's gone into new GC ideas over the last half century has, almost exclusively, been trying to address the latency excursions inherent in trying to collect a giant heap. And it's a completely manufactured problem, because refcounting (and manual allocation strategies more generally) never had it to begin with!
So... I just don't see the point you're trying to make. I mean, yeah, there are a handful of workloads where a good GC will beat naive heap allocation (though the heap solution can always be optimized to beat a GC, by definition). But everything has its sweet spot.
My point is that Java puts its bet on the idea that the GC will work better for effectively all useful workloads. And that's not true. It hasn't been true for 20 years (c.f. all the GC work we're discussing), and it won't be true in the next 20 years either. "We won't get there", as it were.
What space is that? The big applications that the JVM targets have very large working sets (hundreds of GBs or even a couple TBs).
> The outrageous amount of work that's gone into new GC ideas over the last half century has, almost exclusively, been trying to address the latency excursions inherent in trying to collect a giant heap.
But that work has been successful! We're now at 1.5ms max pause with an acceptable cost to throughput.
> And it's a completely manufactured problem, because refcounting (and manual allocation strategies more generally) never had it to begin with!
This is just false. Ref-counted deallocation of a large data structure also introduces latency. Unless you're also controlling the allocator (which is more work), it's also not entirely deterministic.
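A toy illustration (in Java only for convenience; real refcounting lives in the language runtime): dropping the last handle to a long refcounted list cascades down the whole structure, so that single release does O(n) work:

```java
// Toy refcounted list: when the last reference to the head is dropped,
// the decrements cascade, so the final release does O(length) work.
final class RcNode {
    int refCount = 1; // one owner at construction
    RcNode next;

    static void drop(RcNode n) {
        while (n != null && --n.refCount == 0) {
            RcNode dead = n;
            n = n.next;       // the child just lost its only reference
            dead.next = null; // "free" dead here: deterministic, but not O(1)
        }
    }
}
```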
> there are a handful of workloads
Those "handful of workloads" are the ones that interest those companies running on the JDK, like Google, Netflix, Apple, Twitter, Amazon and many, many others. I would agree that some use cases benefit more from manual/refcount-collection and some benefit more from tracing, but the market for tracing GCs is humongous.
> It hasn't been true for 20 years (c.f. all the GC work we're discussing), and it won't be true in the next 20 years either. "We won't get there", as it were.
Well, I strongly disagree, and the millions of happy JDK users will likely disagree, as well.
Further, manual heaps tend to scatter your small blocks all over your address space, making your heap work harder as well as doing bad things to your caches. (Unless you go to the work of building cache-aware data structures, which most developers are not capable of, considering that they have problems even with the discipline needed to make manual management work at all.)
There's no free lunch, but if you know heap behavior is a performance issue in a giant C++ app or whatever, you have a big list of tools with which to address it in software. And in practice this works. Large deployments of, say, PostgreSQL don't require that the admins tune the allocation strategy of the process or carefully size the heap memory internal to the process (obviously there is external tuning at the OS level, though).
And I'm a little unclear what the difference in practice is between "GC tuning voodoo" and "a big list of tools". (And isn't "external tuning at the OS level" just another item on your big list of tools?)
I really like the Rust approach, but (I believe) it is impossible to have heap compaction at run time in practice, and similar things.
There was an article here about SQL that pointed out how Oracle will give you estimates of run time for a query, choose the estimated-best data structures and algorithms, and also switch on the fly if it realizes the query is taking too long. Another example of the kind of techniques that are possible only from a higher level of abstraction.
I am not sure what the thread situation is for GCs and Java, but Arc is indeed reference counting, and I believe it is used quite often in Rust.
This meme that reference counting and garbage collection are even remotely the same needs to die in a fire. Reference counting is not, and never has been, "primitive garbage collection", and garbage collection is not, and never has been, "super sophisticated reference counting".
Then you learn why it sucks, and not because of the space the counter takes.
Reference counts are used to manage resources, one of which is memory (and I'm assuming you are referring to no-longer-referenced memory as "garbage").
But that's not the only use for reference counts! You can, and should, use reference counts for external resources, such as files, or memory buffers on a GPU, or….
Garbage collection is COMPLETELY USELESS for managing non-memory resources because you need to be able to deterministically release non-memory resources (and oftentimes, memory resources—e.g., on the GPU). And since all a GC can do is examine memory, well, it doesn't work.
Go ahead: point me to a 3D game engine using garbage collection for GPU resources. I'll wait.
So yeah: "reference counting is a subset of garbage collection" is absolutely not true, and even the most basic knowledge of how reference counting is used in practice should be enough to convince anyone of that.
Again: this meme needs to die. Garbage collection is not "super sophisticated reference counting", and in fact, can't even be used in most places where reference counting is used. It's not a substitution, or a superset, of reference counting.
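And note that even GC'd languages bolt deterministic disposal on beside the collector. Java's try-with-resources, for instance, is scope-based, not GC-based (a minimal sketch):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

final class ReadFirstLine {
    static String firstLine(String path) throws IOException {
        // close() runs deterministically when this block exits, by scope --
        // not whenever the collector eventually notices the object is dead.
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            return r.readLine();
        }
    }
}
```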
GC is a compelling alternative to reference counting for resources just in the main memory heap in some situations where performance (esp. total memory usage) isn't an important concern relative to, say, programmer productivity.
I never said that garbage collection is sophisticated reference counting, so if you've seen that "meme," it wasn't here. But reference counting and tracing are two classical approaches to garbage collection, the garbage collection literature does treat reference counting as a garbage collection algorithm, and refcounting is used for garbage collection in languages such as Rust. Refcounting has other uses, too, but this discussion is about garbage collection and not about 101 uses for refcounting.
Anyway, reference counting GC is not intrinsically more primitive or worse-performing than tracing, but in practice, tracing GCs tend to be sophisticated, while languages that rely on refcounting usually employ a rather simple algorithm. In fact, this is one of the attractions of refcounting for GC: it is easier to get OK performance with primitive refcounting than with primitive tracing.
Refcounting detects new garbage more quickly than periodic tracing, which is why it costs so much more in bus cycles. But both are dynamically detecting live objects and garbage. I've also seen systems where the only reference count values are "one" and "more than one", with the latter going uncollected until tracing.
The real issue is not that these resources are expensive. The real issue is that they're limited. The usual example is the file descriptor table: once all 1024 (or whatever your ulimit is) slots are in use, trying to open any new file or file-like resource will fail. The same can happen for instance with disk space (once I had a system which could only release a disk file when the corresponding in-memory object was closed, and someone else forgot to close the objects after using them, so all disk space allocated to that system was used up and it stopped working), database connections, and many others.
In fact, the only reason this isn't an issue with memory (which is also limited) is that the garbage collector notices when it's getting full, and starts trying to collect garbage. It could in theory do so for other resources like file descriptors, but not all (how could it know that a remote database, shared with other users, is nearing its connection limit?), and it would probably be very inefficient (like having to do a full scan of the memory just to have a chance to find an open file handle which isn't referenced by any live object and therefore could be closed).
The main advantage of reference counting is that it's deterministic: it doesn't just detect new garbage "more quickly", it detects new garbage immediately after it becomes garbage, which keeps the resource usage at a minimum.
Sort of. It's only deterministic in the sense that as soon as an object is not referenced it is freed, but it's not entirely deterministic because you don't know exactly when (or even in what thread) that "as soon as" happens.
> it detects new garbage immediately after it becomes garbage, which keeps the resource usage at a minimum.
Yes, but that comes at a cost to performance. One of the observations that tracing GCs make is that you may not need to do any work to deallocate memory at all. For example, a simple copying algorithm used to manage the young generation in many generational collectors is based on the fact that more objects die young than survive, and it only does work to keep the (fewer) surviving objects alive and does not do any work to free (the many) dead objects. As a result allocation itself becomes much faster. Tracing indeed requires more memory, but it automatically uses that memory to reduce the cost of memory management. In fact the (amortized) cost of memory management can be reduced arbitrarily by enlarging the heap and using a tracing collector, so much so that it could, in principle, be made cheaper than even stack allocation.
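A toy sketch of that copying idea (forwarding pointers as in Cheney's algorithm, with recursion standing in for the scan queue): the work done is proportional to the survivors, and garbage is reclaimed without ever being visited:

```java
// Toy evacuation with forwarding pointers: only reachable nodes are
// touched; dead ones are reclaimed "for free" when from-space is dropped.
final class Node {
    Node left, right; // outgoing references
    Node forwarded;   // set once this node has been copied to to-space

    static Node evacuate(Node n) {
        if (n == null) return null;
        if (n.forwarded != null) return n.forwarded; // already copied
        Node to = new Node(); // "allocate in to-space"
        n.forwarded = to;     // install forwarding pointer first (handles cycles)
        to.left = evacuate(n.left);
        to.right = evacuate(n.right);
        return to;
    }
}
```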
But you do know when and where: it's at the exact moment the last reference to the resource was released, and in the exact thread that did this last release. In many cases, this will be a single place and thread; in other cases, it will be a well defined set of places and their corresponding threads, with the one coming temporally last being the one where the resource is freed (that is, the resource is held for as long as it's needed, and no longer).
> > it detects new garbage immediately after it becomes garbage, which keeps the resource usage at a minimum.
> Yes, but that comes at a cost to performance.
I was talking not only about memory, but also about other (more limited) resources like file descriptors. But yes, for memory there is a time/space tradeoff: reference counting minimizes memory usage, but for that it has to do more work by constantly freeing and allocating memory.
> Tracing indeed requires more memory, but it automatically uses that memory to reduce the cost of memory management.
Which occupies memory which could be used instead to reduce other costs. For instance, this extra memory could have been used by the operating system's disk cache.
But that's also true for GC, only the set of places and threads is larger :) So I agree that refcounting is a more deterministic garbage collection than tracing, but it's still not deterministic. Deterministic is when the "set of places and threads" is of size 1.
> Which occupies memory which could be used instead to reduce other costs. For instance, this extra memory could have been used by the operating system's disk cache.
This is not quite true. A given machine, with a given set of processors P, running application X, has a certain processing capacity beyond which it cannot go. To reach that maximum capacity it has a certain working set size, M(P, X), that it can put in RAM. It can cache it from disk or from the network, but increasing that cache beyond that point won't help you with performance any more (especially for distributed caching, where the distributed workload would just cause lots of cache invalidation). Caching anything beyond M would not help (and perhaps could even hurt). But you can increase RAM by any amount beyond M to reduce the cost of memory management almost arbitrarily low with a tracing GC (the increased throughput may increase M, too, so that more memory could be put to other uses, not less).
This is why so many people want a low-latency GC. You can get machines with 6TB RAM (probably even more now). Very few applications can benefit from an effective working set of that size (unless they have 500 cores or something).
Of course, in environments where RAM is constrained tracing may not be the best choice.
It's not common because it doesn't actually solve deterministic resource disposal, which is what is actually required, and a problem reference counting trivially solves.
Tracing and reference counting are two GC approaches that the paper shows are dual in some interesting ways and can have similar characteristic -- when both are sophisticated enough.
So here's how reference counting and garbage collection are duals: they both manage the memory lifetimes for objects not on the stack. If that blows your mind, you also might enjoy the above referenced paper and the meme that reference counting is "GC done badly".
Where the duality is in terms of seeing reference counting as a form of negative tracing. The paper also analyzes various techniques and claims that they mostly just mix the two approaches.
Also, if I can be facetious:
> So here's how reference counting and garbage collection are duals: they both manage the memory lifetimes for objects not on the stack.
that would be an equivalence, not a duality.
Java's concurrency features are extremely rich. Parallel streams are basically magic: I add .parallel() and pow, near-linear speedups. No channels, no select statements, no realising I messed up a mutex somewhere. It just works.
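For instance (a minimal sketch; actual speedup depends on cores and workload):

```java
import java.util.stream.LongStream;

final class ParallelDemo {
    public static void main(String[] args) {
        // One added call fans the pipeline out over the common
        // fork/join pool -- no locks, channels, or selects to get wrong.
        long sum = LongStream.rangeClosed(1, 1_000_000L)
                             .parallel()
                             .map(i -> i * i)
                             .sum();
        System.out.println(sum);
    }
}
```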
Then there's reactive programming. The JVM can chew through truly absurd amounts of requests when a reactive approach is taken.
The JVM has a lot of problems, mostly due to its concentration on Java and bytecode/JITting. Given those constraints, memory management isn't really one of them.
Large companies don't pick technologies for hiring reasons, small companies do. Large tech companies can afford to train their engineers in whatever they use internally and have the money (and brand recognition) to not worry about needing technologies to entice candidates. They pick technologies to solve business problems.
A fair part of "solving" GC is optimising for context-specific tradeoffs as application domains and environments change. That will continue forever unless we all agree to stick with one architecture to build all of our programs. That's the reason different GCs fare better now than in the '90s, for example, when nobody cared about concurrency.
The linked article is literally about a complete rewrite of the JVM garbage collector!
"Mature" technologies are things like compilers, where e.g. gcc was last significantly reworked 25 years ago and the "new upstart kid" competitor clang is in its second decade of life. Garbage collection isn't mature, it it was no one would be rewriting it and blogging about the results.
Also, GCC and Clang are not the only compilers in existence. There's plenty of new compiler research going on.
The GCs represent "good enough" for most use cases.
[Edit] Also, 1993: "The Measured Cost of Conservative Garbage Collection" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.1...) [Anyone seen any work on conservative GC for existing languages since Boehm?]
Throughput-style garbage collection has mostly been solved; I believe the only outstanding problem is to reduce the necessary memory footprint towards the minimum required by the application. (On the other hand, the best memory management scheme is still an old-style Unix proggy that does bump-allocations in an arena and never frees memory until it terminates. Just don't let it run too long.)
Most of the work in GC has been towards reducing latency, which typically means different GC algorithms for different tasks. The ideal of a fast, real-time GC is still an open research project.
Explicit heaps are still not a solved problem, either, and they've been around longer than GC work (ahem*). With no static checking for leaks, use-after-free, and other issues, C memory management can be a nightmare unless you have iron-willed development discipline. The only real tools for dealing with those issues are run-time testing tools like valgrind, which among other things are horribly slow if you want to run them against a real workload.
Rust is an interesting case: it does an amazing job, if you use it for C++-like RAII idioms. If you don't, or can't, you either can't get there from here, or you're in deep unsafe voodoo land.
(One issue I brought up the last time this came up was the perennial graph structure: the best representation of edges is a numeric index into a vector of nodes. It works, mostly. However, that numeric index is acting almost exactly like a raw pointer into the vector; the only difference is that the vector will ensure the index is currently valid when you dig out a node. The index, specifically, doesn't track the lifetime of the node or the overall graph, so you may get a node, just not the node you were looking for.)
Now, if you wanted to bring up explicit linear or affine memory management systems, ala ATS, where the allocation comes with a statically checked certificate that the pointer is valid and which must be explicitly destroyed by freeing the memory for the program to compile, I would be singing in the chorus with you. But those are research problems, too.
In the mean time, unless you are working on a problem that explicitly requires managing your own memory, you would probably be advised to use a garbage collector, and one of the many choices that works best with your problem, for a fair number of reasons.
Exactly. Long tail latency is often a more important server metric than throughput.
This is why I’m excited about Swift on the server. Unlike most other languages it uses automatic reference counting instead of pause the world garbage collection. That means consistent latency.
These days my smallish async Python crawler runs for 24 hours and then starts having 5 second gc-related delays once every couple of minutes. Diagnostics says it's during gc of the longest-lived generation. Sucks to be me. I probably need to stop keeping my long-lived queues as Python objects.
Does it? Based on my admittedly limited knowledge of refcounting, either freeing an object can take a non-predictable amount of time (decrementing the reference counters of all the objects pointed to by a freed object), or allocating new memory can take a non-predictable amount of time (by lazily deferring the above-mentioned decrements until that memory is needed upon allocation).
Maybe Swift will gain it, one day.
“We have discussed in the past using hybrid approaches like introducing a cycle collector, which runs less frequently than a GC would. The problem with this is that if you introduce a cycle collector, code will start depending on it. In time you end up with some libraries/packages that work without GC, and others that leak without it (the D community has relevant experience here). As such, we have come to think that adding a cycle collector would be bad for the Swift community in the large.”
Easy is good, particularly if it works well. Which it does.
Also much more frugal with memory than typical tracing GCs.