Tangent: Go GC (non-generational, non-copying, concurrent mark and concurrent sweep) and the improvements it saw from 1.5 -> 1.8 https://blog.golang.org/ismmkeynote
The mutator/pacer scheme, the tri-color mark scheme, and the use of read/write barriers during the concurrent phases are interesting to contrast between the two GCs.
This makes sense for Go. A big part of the reason the generational hypothesis is true for most applications is that they allocate short-lived objects inside function bodies that are no longer needed when the function returns.
In Go, most of those objects actually do get allocated directly on the stack, or as fields directly inside other objects. So they are never seen by the GC. This is one of the aspects of the language design that I really admire.
Or, you could look at it as saying that Go does have a generational memory manager, and the first generation is "on the stack".
> In Go, most of those objects actually do get allocated directly on the stack, or as fields directly inside other objects.
Most golang code I've seen still relies on pointers to objects, so unless escape analysis is better than say what the JVM or .NET offer, it's not really that different from them. .NET already has value types, and the JVM should be getting them soonish. So there really isn't that much difference from this aspect.
There is a vast difference in this aspect; a []SomeStruct makes two allocations (one for the control structures, one for the data), which might be on the stack or not. Conversely an ArrayList<SomeClass> makes a linear number of allocations; each object in it is separately allocated.
As you say, value types are coming soon, but they don't exist now, so right now there is a lot of difference (and in practice will be for a long time, because not everything is going to migrate to value types immediately).
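To make the difference concrete, here is a rough Java sketch (the Point class is made up for illustration): a List of ordinary objects costs one heap allocation per element, all of which the GC must trace, while a manually flattened primitive array is a single allocation -- roughly what a Go []SomeStruct, or a future Java value type, buys you.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: contrasts per-element allocation with a flattened layout.
    public class LayoutDemo {
        static final class Point {          // a plain Java class: always a separate heap object
            final int x, y;
            Point(int x, int y) { this.x = x; this.y = y; }
        }

        public static void main(String[] args) {
            int n = 1_000_000;

            // Boxed layout: ~n separate object allocations, each traced by the GC.
            List<Point> boxed = new ArrayList<>(n);
            for (int i = 0; i < n; i++) boxed.add(new Point(i, i));

            // Flattened layout: one allocation holding all the data, with no
            // per-element objects for the GC to trace.
            int[] flat = new int[2 * n];
            for (int i = 0; i < n; i++) { flat[2 * i] = i; flat[2 * i + 1] = i; }

            System.out.println(boxed.size() + " boxed points, " + flat.length / 2 + " flat points");
        }
    }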
And even after value types arrive it will be years before the library ecosystem leverages them. Typical Java projects use dozens of 3rd party libraries, and those are not going to see much effect in the next 5 years.
But most of the value of value types would be in your own code, to remove boilerplate. From the outside it will still look like just another class, so libraries and frameworks do not need to know about your value types.
The generational hypothesis works best when most memory allocations over time are rooted by call stacks that sooner or later return to a core loop. Start rooting things in long-lived stack frames, or in static things like a cache, and things get worse.
Whether the language uses value types or heap allocated types changes the constant factor, but the core hypothesis isn't affected - you can't allocate everything on the stack. Memory copying costs start overtaking GC overhead if you try too hard.
They tried it without going as far as bump-pointer allocation in the nursery, so I wouldn't say that Go's GC journey away from generational was due to measuring it and it not working. I don't think anyone has high expectations for the performance of non-copying generational GC. IIRC the presentation just said they weren't willing to try copying because they didn't think it would be possible to maintain good latency. Which I think Azul, and perhaps a future evolution of the new JVM collectors, would have something to say about.
I have adblock and I don't see an ad. Most people commenting at HN use adblock. It's a pretty good illustration of why you might want to prevent 3rd parties from fucking with the content you view without your permission, while also using adblock to fuck with the content you view with your permission.
A quick note: Shenandoah is not generational, according to the article. Most bog-standard web apps (including REST thingys; not sure why the author calls out those) do strongly obey the generational hypothesis. For most web apps, in my experience, if you can tune your GC to serve the vast majority of your requests from a young generation, your latencies will be good, your performance will be good, your pauses will be infrequent and short, and plump unicorns and bunny rabbits will gather in your cubicle to share their rainbows.
Hi, author here. You are saying exactly what I was thinking before. But it turns out generational GCs have nasty failure modes when things don't go as expected. E.g., if an upstream experiences its own difficulties and returns responses more slowly, our service has to keep all the requests in memory longer, so the heap runs out, G1 performs a few fruitless young GCs (without freeing much) and then tenures all those requests into the old generation, and now you have a big old-GC pause bomb waiting for you.
Non-generational GCs don't have this problem, and it's one of the reasons why Shenandoah suited us well there.
If practically everything is collected in the young gen GCs like most request/response applications, do you even gain anything from GC being generational?
DirectByteBuffers allow java programs to use unmanaged memory without needing to drop to JNI or similar. There are open source and commercial libraries that wrap that API with caching code. Using one of those solutions keeps your cache out of GC memory.
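A minimal sketch of that idea (not taken from any particular library): the cached payload lives in memory allocated outside the Java heap, so the collector never traces it; only the small buffer wrapper object stays on the heap.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    // Sketch: keep cached bytes off the Java heap via a direct buffer.
    // Real off-heap cache libraries add eviction, indexing, etc. on top of this.
    public class OffHeapSlot {
        private final ByteBuffer slot = ByteBuffer.allocateDirect(1024); // off-heap memory

        void put(byte[] value) {
            slot.clear();
            slot.put(value);          // copy the payload out of the GC-managed heap
        }

        byte[] get(int length) {
            byte[] out = new byte[length];
            slot.flip();
            slot.get(out);            // copy back in when the entry is needed
            return out;
        }

        public static void main(String[] args) {
            OffHeapSlot s = new OffHeapSlot();
            byte[] v = "cached value".getBytes(StandardCharsets.UTF_8);
            s.put(v);
            System.out.println(new String(s.get(v.length), StandardCharsets.UTF_8));
        }
    }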
Caches violate the generational hypothesis. Entries die in middle age: long enough to have survived multiple young generation collections, so that they are promoted to older generations. The problem is that older generations are (a) not collected as frequently, (b) are often larger than newer generations, and (c) have a lower proportion of dead space to live objects, so the effort of tracing has lower value.
Caches that are in scalar data forms (e.g. byte arrays) or off-heap aren't too bad - bytes and off-heap memory doesn't need to be traced. If you're caching an object graph dense with pointers, then not so great.
It's the other way around: Shenandoah is much better for caching precisely because it is NOT generational. LRU caches go against the generational hypothesis because the oldest elements are evicted first.
I understand what you mean, but wouldn't the majority of allocations still happen during a request? For example, generational GC works really well with Elixir and Erlang caches.
> wouldn't the majority of allocations still happen during a request?
Could you please clarify this question? Do you mean that if cached objects are a small part of the total allocation rate, then generation GCs work well with that?
Well, if caching takes a small part of the overall workload, then you can't really say it's a "cache workload" or "cache-heavy workload", right?
My answer meant that Shenandoah would work well in a program where cache occupies like 70-80% of the heap, and generational GCs might not. But surely, neither are going to break from a 1%-heap cache.
Does anyone have experiences with both ZGC and Shenandoah? It seems like they both have very similar goals (10 ms max pause) even though the implementation is quite different. So with both landing at about the same time, when would you prefer one or the other? Both seem like such a huge advance over G1 and so similar to each other that between them it doesn't look like it matters much.
Our team is going to be targeting ZGC for the main reason that it's included by default in JDK11. Yeah, builds of JDK11 with Shenandoah are available but it's more work.
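For reference, and hedging on the exact JDK build and version: around JDK 11/12 both collectors are opt-in behind experimental flags, roughly like this (heap size is just an example):

    # ZGC ships in stock JDK 11+ (Linux/x64 initially), behind an experimental flag
    java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx16g -jar app.jar

    # Shenandoah needs a JDK build that bundles it (e.g. the JDK 11 builds mentioned above)
    java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -Xmx16g -jar app.jar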
We have tried them both in production, and regarding pause times, both reduce them a lot in our case. In the end our use case was better served by ZGC: lower pause times and smaller latency.
We also have throughput-intensive loads where pause times do not matter, so there we do not use ZGC or Shenandoah; it seems that in these cases even ParallelGC is better than G1, but that is another story.
I've tried to summarize this experience in this post (incl. GraalVM and different JDK versions)
Great to see the JVM experimenting with low-pause GC.
It looks like the maximum pauses of Shenandoah are still well over 1ms, which will still cause a lot of tail latency in services. (Go reached ~5ms max pauses a few releases ago, but there was still a significant improvement in the behavior of services when pauses were clamped at 100μs in more recent releases.)
Definitely the right direction. I hope future versions of Shenandoah will clamp GC pauses even lower.
ZGC, the other low latency OpenJDK GC, is now (as of JDK 12) at 1.5ms max pause, 0.5ms average, with a 128GB heap, on SPECjbb2015 (https://www.jfokus.se/jfokus19-preso/ZGC-Concurrent-Class-Un... slides 36-40) with throughput that's, as always, much better than Go's.
Also, I think Go's pause times are misleading because I believe Go uses throttling and throttling pauses are not counted as GC pauses. One could easily write a GC with absolute 0 max pause: you make each allocation a super-fast, simple pointer bump from a thread local allocation buffer, and when that's out you throttle (i.e. block) the thread forever. Of course the throughput of such a "collector" will end up converging to zero, too. So pause times without throughput numbers are meaningless.
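To make the "super-fast pointer bump" part concrete, here is a toy Java sketch (not how HotSpot or Go actually implement it): the fast path is just advancing a cursor in a thread-local buffer, and all the interesting policy, including the throttling discussed here, lives in what happens when the buffer runs out.

    // Toy model of thread-local bump allocation. Entirely illustrative:
    // real TLABs hand out raw heap memory, not byte[] slices.
    public class BumpAllocator {
        private final byte[] buffer;   // the thread-local allocation buffer
        private int cursor = 0;        // next free offset

        BumpAllocator(int size) { this.buffer = new byte[size]; }

        // The fast path: a bounds check and a pointer bump.
        int allocate(int bytes) {
            if (cursor + bytes > buffer.length) {
                // Slow path. A real runtime would refill the TLAB, start/help a GC cycle,
                // or -- the "throttling" discussed above -- block this thread until the
                // collector frees memory. The hypothetical zero-pause "collector" from
                // the comment above simply never unblocks it.
                throw new OutOfMemoryError("TLAB exhausted (sketch)");
            }
            int addr = cursor;
            cursor += bytes;
            return addr;               // "address" = offset into the buffer
        }

        public static void main(String[] args) {
            BumpAllocator tlab = new BumpAllocator(64);
            System.out.println(tlab.allocate(16)); // 0
            System.out.println(tlab.allocate(16)); // 16
        }
    }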
Yes. All GCs I know employ negative feedback loops between allocators and collectors. Stop-the-world GCs are on the far side of that spectrum: there, allocation failure induces "throttling" that does not unblock the thread until the GC is over. Even contending on a shared allocation lock when requesting a new thread-local allocation buffer can be thought of as accidental throttling.
Concurrent GCs rely on collecting faster than the application allocates, but sometimes the application allocates way too much anyway. I believe ZGC does the simplest thing in this case: if allocation fails, the allocation is blocked until memory is available (i.e. until the current GC cycle is able to recover some memory). Is this also throttling? I think so. Shenandoah does a few other tricks, including diving into STW when concurrent mode fails. To me, this is also a form of throttling.
At the end of the day, I would say that GC pause durations are non-ideal proxies for end-to-end latency. That said, GC pauses do contribute to end-to-end latency very significantly, and so they are not completely meaningless. Their duration should be taken with a fair grain of salt, though, for either Go GC, ZGC, Shenandoah, or whatever else.
Oh, my point isn't about whether throttling is good or bad, just about how it's measured. Go's GC's throughput is particularly bad in comparison to OpenJDK's GCs, so talking just about its GC pauses (which do not include throttling, AFAIK) without mentioning throughput gives a false impression.
I agree with you, and would present an even stronger statement: talking about GC pauses without measuring the actual end-to-end latencies may give a false impression! The overall throughput might look good, but threads might experience bad local latencies all over the place anyway. In the Shenandoah project, that is my recurring argument for capping the pacing stalls and going to STW mode when bounded pacing does not help.
Also, Go just doesn't compact the heap, ever. The stop-the-world pauses in the JVM are typically from compaction. It's not really fair to compare a compacting GC to a non-compacting one without acknowledging the huge tradeoff.
In tracing collectors, marking (and generally walking the heap in possibly random order) takes a significant amount of time. After marking is done, you may decide to move only a few objects, so the overhead of the copying itself is not that large. Updating the references to all those moved objects might take another bulk of time. These stories get somewhat better with attempts to segregate the heap (generational collection, recording inter-region references, etc.). That comes with associated runtime costs to maintain the metadata supporting those partial collections, but on the upside it minimizes the amount of work done, as it only walks/marks/copies/updates the sub-heaps.
I would not agree with the blanket "The stop-the-world pauses in the JVM are typically from compaction". Concurrent marking is done in CMS, G1, Shenandoah, ZGC [in the first two, there are nitty-gritty details about young collections that distort the story] -- and that resolves a significant portion of stop-the-world time. Concurrent copying and updating of references is done in Shenandoah and ZGC -- that resolves the rest of it.
Of course, you can skip the compact part and just do a sweep, which frees the implementation from dealing with the second part completely. This is not without drawbacks, though: the allocation path gets more complicated, fragmentation needs to be dealt with, etc. How far you can get with that depends on what the use cases really are. As far as I can tell, the damned "CMS concurrent mode failure" caused by heap fragmentation, plus unwillingness to part with uber-fast bump-ptr allocation paths, nudged most JVM people to accept compaction as the go-to answer for reliable GC.
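A toy sketch of why the allocation path gets more complicated once you give up compaction (purely illustrative, nothing to do with CMS's actual free lists): instead of a single pointer bump, allocation has to search a free list, and fragmentation shows up as "plenty of free bytes, but no single hole big enough".

    import java.util.ArrayList;
    import java.util.List;

    // Toy first-fit free-list allocator over a non-compacting heap.
    public class FreeListAllocator {
        static final class Hole { int offset, size; Hole(int o, int s) { offset = o; size = s; } }
        private final List<Hole> freeList = new ArrayList<>();

        FreeListAllocator(int heapSize) { freeList.add(new Hole(0, heapSize)); }

        // Unlike a bump allocator, this has to search -- and can fail even when the
        // total free space is sufficient, if it is fragmented into small holes.
        int allocate(int size) {
            for (Hole h : freeList) {
                if (h.size >= size) {
                    int addr = h.offset;
                    h.offset += size;
                    h.size -= size;
                    return addr;
                }
            }
            return -1; // fragmentation: no single hole large enough
        }

        // Sweeping just puts dead ranges back on the free list; nothing moves.
        void free(int offset, int size) { freeList.add(new Hole(offset, size)); }

        public static void main(String[] args) {
            FreeListAllocator heap = new FreeListAllocator(100);
            int a = heap.allocate(60);
            heap.allocate(30);
            heap.free(a, 60);                       // 70 bytes free in total now...
            System.out.println(heap.allocate(65));  // ...but split 60 + 10: prints -1
        }
    }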
Throughput numbers seem misleading as well for cross-language comparisons?
Each language has its own conventions for how much to rely on stacks versus specialized memory pools versus garbage collection. For example, in Go, you often have a stack per request, and can use it for some per-request data as an alternative to GC. That's going to change required throughput for an equivalent amount of work. As will differences in escape analysis.
I don't know how much that matters, but it seems like this makes cross-language comparisons pretty subjective. You need to decide what an equivalent app would be in each language. Are you comparing idiomatic or optimized code and what does that really mean? External validity seems pretty limited.
The pause time behavior depends a lot on what is going on in application itself. Shenandoah wiki has a section on that: https://wiki.openjdk.java.net/display/shenandoah/Main#Main-G.... It also depends on the hardware, because the same amount of GC work instruction-wise may translate to drastically different cycle/wall-clock time. Try the same 1 GB heap on large x86 desktop and then on Raspberry Pi! This, unfortunately, makes point benchmark numbers not very relevant, and you would want to see how the particular deployment works. With many applications you can push concurrent GCs down to sub-millisecond range without even trying. With some applications, adjusting the application to avoid pitfalls that inflate pauses helps -- the report mentions a few adjustments like that.
Edit: Ah, I should have mentioned I am one of those Shenandoah devs, and I have never said "your mileage may vary" that often until diving into GC development work :D
Depends on what you are trying to do, but I'll happily trade one-to-ten-millisecond maximum pauses in exchange for better throughput. It's the hundreds-of-millisecond pauses that no amount of throughput improvement is worth.
But the per-process GC is pretty slow, and STW pauses are still proportional to the size of the working set in that process. If you have a multi-GB data set, you have to choose between a single process with long pauses, or spreading the data across multiple processes and paying for copying, locking and cache contention. Not to mention the extra code complexity of spreading operations on a single data structure across multiple asynchronous sections. In many cases, for large working sets you're still better off with one of the JVM GCs.
https://www.techempower.com/benchmarks/ It's pretty bad. We're talking 3-20x slower than Java. FWIW, you can also look at the benchmarks game (the old shootout). Erlang/HiPE/BEAM is very cool, but it's just slooooooooow. Now, I'm not trying to say that OpenJDK is "better" than BEAM -- it's a great VM and it's good for lots of stuff -- but comparing performance-related features with Java doesn't make much sense; it's not really in the same game. Yes, Erlang never has STW pauses, but you'll still get lower latency (and much higher throughput) with OpenJDK and ZGC or Shenandoah, despite the shared heap and occasional (short) STW pauses.
Erlang gets away with no global pauses because there is no shared state [1] between Erlang processes (their term for fiber).
Loom/JVM fibers don't impose the same limitation; references can be shared across threads. So I would expect that, without a concurrent collector, you'd encounter global pauses.
[1] outside of some special case, natively implemented datastructures
Curmudgeon opinion, riffing off what I think is the most important takeaway from the linked article:
> Garbage collection is by no means a solved problem.
I was reading articles like this one, about new ideas in JVM garbage collection, literally two decades ago. And it's still "not a solved problem".
I mean, it's mostly solved. Managed heaps work great for lots of applications and have been very successful. But that last 5% has stretched out so long it seems almost like a joke. We'll never get there, for the mythical "there" where GC overhead and latency isn't an issue that needs to be tuned in deployment. And IMHO it's time to start recognizing that fact instead of trying to make the JVM and .NET do what we get from explicit heaps in C and Rust. They just aren't going to get there.
> We'll never get there, for the mythical "there" where GC overhead and latency isn't an issue that needs to be tuned in deployment.
I think that for the most part we're "there" already. Such improvements in GCs, like Shenandoah, C4 and ZGC, are more than keeping up with most applications' evolving requirements, and yield tremendous benefits for the relatively low cost of added RAM footprint.
> it's time to start recognizing that fact instead of trying to make the JVM and .NET do what we get from explicit heaps in C and Rust.
To do better than the JVM, especially in concurrent applications, you have to work hard. What we "get" from manual memory management we actually buy for a rather steep price (and remember that for shared data structures, which are extremely important in concurrent applications, Rust also uses a GC, just a particularly primitive one -- reference counting).
And even what we buy is not universally "better." I just saw this nice benchmark the other day, that compares allocation costs (even without concurrency). As always, it gives a small part of the picture, but a very real one: https://github.com/rbehrends/btree-alloc. I don't think anyone can reasonably claim that manual memory management is "generally better" than modern GCs, even once the effort is invested. It's better in some respects and worse in others. So it's not even clear which of them has that "last 5%" (and sometimes more) advantage.
See... calling reference counting "primitive GC" is I think symptomatic of the wrong thinking here. Reference counting has, for the small heap blocks typical in apps that tend to run in this space, non-ignorable (not "high", but real) space and CPU overhead relative to traditional GC. But it's 100% deterministic!
The outrageous amount of work that's gone into new GC ideas over the last half century has, almost exclusively, been trying to address the latency excursions inherent in trying to collect a giant heap. And it's a completely manufactured problem, because refcounting (and manual allocation strategies more generally) never had it to begin with!
So... I just don't see the point you're trying to make. I mean, yeah, there are a handful of workloads where a good GC will beat naive heap allocation (though the heap solution can always be optimized to beat a GC, by definition). But everything has its sweet spot.
My point is that Java puts its bet on the idea that the GC will work better for effectively all useful workloads. And that's not true. It hasn't been true for 20 years (c.f. all the GC work we're discussing), and it won't be true in the next 20 years either. "We won't get there", as it were.
> for the small heap blocks typical in apps that tend to run in this space
What space is that? The big applications that the JVM targets have very large working sets (hundreds of GBs or even a couple TBs).
> The outrageous amount of work that's gone into two new GC ideas over the last half century has, almost exclusively, been trying to address the latency excursions inherent in trying to collect a giant heap.
But that work has been successful! We're now at 1.5ms max pause with an acceptable cost to throughput.
> And it's a completely manufactured problem, because refcounting (and manual allocation strategies more generally) never head it to begin with!
This is just false. Ref-counted deallocation of a large data structure also introduces latency. Unless you're also controlling the allocator (which is more work), it's also not entirely deterministic.
> there are a handful of workloads
Those "handful of workloads" are the ones that interest those companies running on the JDK, like Google, Netflix, Apple, Twitter, Amazon and many, many others. I would agree that some use cases benefit more from manual/refcount-collection and some benefit more from tracing, but the market for tracing GCs is humongous.
> It hasn't been true for 20 years (c.f. all the GC work we're discussing), and it won't be true in the next 20 years either. "We won't get there", as it were.
Well, I strongly disagree, and the millions of happy JDK users will likely disagree, as well.
You realize the latency excursion needed to deterministically free a large data structure is also problematic, exacerbated by the background data structure needed to manage allocation and deallocation. (Not all data structures can be "small".)
Further, manual heaps tend to scatter your small blocks all over your address space, making your heap work harder as well as doing bad things to your caches. (Unless you go to the work of building cache-aware data structures, which most developers are not capable of, considering they have problems even with the discipline needed to make manual management work at all.)
Of course, but it's treatable. Large applications deployed on traditional heap implementations do not, as a general rule, suffer from the kind of GC tuning voodoo that plagues the JVM and which inspired ZGC and Shenandoah (and two decades of their ancestors!).
There's no free lunch, but if you know heap behavior is a performance issue in a giant C++ app or whatever, you have a big list of tools with which to address it in software. And in practice this works. Large deployments of, say, PostgreSQL don't require that the admins tune the allocation strategy of the process or carefully size the heap memory internal to the process (obviously there is external tuning at the OS level, though).
If you haven't measured the difference between different allocators for your C++ programs you are probably using the wrong one. The system allocator can be 2-5x slower than an allocator that more closely matches your workload.
There are always tools that help you improve performance, be it with code changes or tuning. The question is how much effort is required to achieve the required performance. I think that the past twenty years have shown that for a huge portion of server applications, GCs are the approach most developers prefer.
It's treatable, but not while maintaining deterministic behavior.
And I'm a little unclear what the difference in practice is between "GC tuning voodoo" and "a big list of tools". (Isn't "external tuning at the OS level" just another one of those tools?)
This is a point I very much like. One of the benefits of more abstraction is that "sufficiently smart" compilers/interpreters can do a lot of reordering/optimizations at a lower level that would otherwise need to be (unsafely) done by hand.
I really like the Rust approach, but (I believe) it makes things like heap compaction at run time impossible in practice.
There was an article here about SQL that pointed out how Oracle will estimate the run time of a query, choose the estimated-best data structures and algorithms, and even switch on the fly if it realizes the query is taking too long. That is another example of the kind of techniques that are possible only at a higher level of abstraction.
Liveness is a global property. A cache-aware data structure requires understanding the code that maintains it, but safely freeing an escaped object requires understanding all code everywhere that could possibly be using it, and we can't seem to find anyone who can actually do it without a hell of a lot of proof automation (or falling back to runtime checking).
Rust doesn't need to use reference counting for shared data structures, which I'm pretty sure you already know. The use of reference counting is about dealing with unknown lifetimes, not dealing with sharing.
If there is no sharing then you just need a box, reference counting is for when there is not a clear owner and many references all still want to share.
I am not sure what the threading situation is for GCs and Java, but Arc is indeed reference counting, and I believe it is used quite often in Rust.
You can just use a shared reference to achieve sharing in Rust; you don't need a Box or an Arc. Pretty much every shared core data structure in Rust works fine with references with no Arc in sight, and this makes plenty of sense when you consider that the main way you interact with a type protected by an Arc is by extracting a shared reference from it.
Yes, we are in agreement about that. But they are certainly not necessary for shared data structures (even concurrent ones) and that is what the post to which I was responding said. This also isn't a purely pedantic point; scoped threads and libraries like Rayon are explicitly designed to allow concurrent sharing of values where the sharing has a bounded lifetime, and they do not require reference counting in order to support this (indeed, some usecases of these libraries, such as splitting up arrays into disjoint slices, don't really work with reference counting).
> Rust also uses a GC, just a particularly primitive one -- reference counting
This meme that reference counting and garbage collection are even remotely the same needs to die in a fire. Reference counting is not, and never has been, "primitive garbage collection", and garbage collection is not, and never has been, "super sophisticated reference counting".
The latter is true, the former isn't: ref counting is the first automatic memory management technique ("automatic memory management" being the actual name of the field) you learn in some of the seminal works.
Then you learn why it sucks, and not just because of the space taken by the counter.
Garbage collection means a way to automatically collect garbage. Reference counts automatically collect garbage. Therefore reference counting is a subset of garbage collection.
Reference counts are used to manage resources, one of which is memory (and I'm assuming you are referring to no-longer-referenced memory as "garbage").
But that's not the only use for reference counts! You can, and should, use reference counts for external resources, such as files, or memory buffers on a GPU, or….
Garbage collection is COMPLETELY USELESS for managing non-memory resources because you need to be able to deterministically release non-memory resources (and oftentimes, memory resources—e.g., on the GPU). And since all a GC can do is examine memory, well, it doesn't work.
Go ahead: point me to a 3D game engine using garbage collection for GPU resources. I'll wait.
So yeah: "reference counting is a subset of garbage collection" is absolutely not true, and even the most basic knowledge of how reference counting is used in practice should be enough to convince anyone of that.
Again: this meme needs to die. Garbage collection is not "super sophisticated reference counting", and in fact, can't even be used in most places where reference counting is used. It's not a substitution, or a superset, of reference counting.
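For what it's worth, the "refcounts for external resources" pattern is perfectly expressible on a GC'd runtime too; here is a hand-rolled Java sketch (real libraries such as Netty's buffer pooling ship a more careful version of the same idea). The point is that the resource is released at the exact moment the count reaches zero, which a tracing GC cannot promise.

    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of manual reference counting around a non-memory resource.
    public class RefCounted implements AutoCloseable {
        private final AutoCloseable resource;          // e.g. a file, socket, or GPU buffer handle
        private final AtomicInteger refs = new AtomicInteger(1);

        RefCounted(AutoCloseable resource) { this.resource = resource; }

        RefCounted retain() { refs.incrementAndGet(); return this; }

        @Override
        public void close() {                          // "release"
            if (refs.decrementAndGet() == 0) {
                try { resource.close(); }              // deterministic: exactly here, exactly now
                catch (Exception e) { throw new RuntimeException(e); }
            }
        }

        public static void main(String[] args) {
            try (RefCounted r = new RefCounted(() -> System.out.println("resource closed"))) {
                r.retain();   // a second owner appears...
                r.close();    // ...and releases; the try block's implicit close drops it to 0
            }
        }
    }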
GC is a compelling alternative to reference counting only for resources in the main-memory heap, and only in situations where performance (esp. total memory usage) isn't an important concern relative to, say, programmer productivity.
> this meme needs to die. Garbage collection is not "super sophisticated reference counting", and in fact, can't even be used in most places where reference counting is used. It's not a substitution, or a superset, of reference counting.
I never said that garbage collection is sophisticated reference counting, so if you've seen that "meme," it wasn't here. But reference counting and tracing are two classical approaches to garbage collection, the garbage collection literature does treat reference counting as a garbage collection algorithm, and refcounting is used for garbage collection in languages such as Rust. Refcounting has other uses, too, but this discussion is about garbage collection and not about 101 uses for refcounting.
Anyway, reference counting GC is not intrinsically more primitive or worse-performing than tracing, but in practice, tracing GCs tend to be sophisticated, while languages that rely on refcounting usually employ a rather simple algorithm. In fact, this is one of the attractions of refcounting for GC: it is easier to get OK performance with primitive refcounting than with primitive tracing.
You can close a non-memory resource when an object that owns it becomes garbage (Java calls this a "finalizer"), it's just not common. Those resources tend to be too expensive to leave unused indefinitely, and explicit closing risks IOExceptions but not disastrous undefined behavior.
Refcounting detects new garbage more quickly than periodic tracing, which is why it costs so much more in bus cycles. But both are dynamically detecting live objects and garbage. I've also seen systems where the only reference count values are "one" and "more than one", with the latter going uncollected until tracing.
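In current Java the non-deprecated spelling of that idea is java.lang.ref.Cleaner (JDK 9+): you register a cleanup action that runs some time after the collector notices the owner is unreachable -- on some thread, at some point, which is exactly why it's only a backstop. A minimal sketch, with made-up names:

    import java.lang.ref.Cleaner;

    // Sketch: release a native/non-memory resource when its owner becomes garbage.
    public class NativeHandle implements AutoCloseable {
        private static final Cleaner CLEANER = Cleaner.create();

        // The cleanup state must not reference the owning object, or it would keep it alive.
        private static final class ReleaseAction implements Runnable {
            final long handle;                 // stand-in for a file/socket/GPU handle
            ReleaseAction(long handle) { this.handle = handle; }
            public void run() { /* release the native resource identified by 'handle' */ }
        }

        private final Cleaner.Cleanable cleanable;

        NativeHandle(long handle) {
            this.cleanable = CLEANER.register(this, new ReleaseAction(handle));
        }

        // Explicit close is still the preferred, deterministic path.
        @Override public void close() { cleanable.clean(); }

        public static void main(String[] args) {
            try (NativeHandle h = new NativeHandle(42L)) {
                // use the handle; close() runs the cleanup deterministically at the end
            }
        }
    }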
> Those resources tend to be too expensive to leave unused indefinitely
The real issue is not that these resources are expensive. The real issue is that they're limited. The usual example is the file descriptor table: once all 1024 (or whatever your ulimit is) slots are in use, trying to open any new file or file-like resource will fail. The same can happen for instance with disk space (once I had a system which could only release a disk file when the corresponding in-memory object was closed, and someone else forgot to close the objects after using them, so all disk space allocated to that system was used up and it stopped working), database connections, and many others.
In fact, the only reason this isn't an issue with memory (which is also limited) is that the garbage collector notices when it's getting full, and starts trying to collect garbage. It could in theory do so for other resources like file descriptors, but not all (how could it know that a remote database, shared with other users, is nearing its connection limit?), and it would probably be very inefficient (like having to do a full scan of the memory just to have a chance to find an open file handle which isn't referenced by any live object and therefore could be closed).
The main advantage of reference counting is that it's deterministic: it doesn't just detect new garbage "more quickly", it detects new garbage immediately after it becomes garbage, which keeps the resource usage at a minimum.
Sort of. It's only deterministic in the sense that as soon as an object is not referenced it is freed, but it's not entirely deterministic because you don't know exactly when (or even in what thread) that "as soon as" happens.
> it detects new garbage immediately after it becomes garbage, which keeps the resource usage at a minimum.
Yes, but that comes at a cost to performance. One of the observations that tracing GCs make is that you may not need to do any work to deallocate memory at all. For example, a simple copying algorithm used to manage the young generation in many generational collectors is based on the fact that more objects die young than survive, and it only does work to keep the (fewer) surviving objects alive and does not do any work to free (the many) dead objects. As a result allocation itself becomes much faster. Tracing indeed requires more memory, but it automatically uses that memory to reduce the cost of memory management. In fact the (amortized) cost of memory management can be reduced arbitrarily by enlarging the heap and using a tracing collector, so much so that it could, in principle, be made cheaper than even stack allocation.
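A heavily simplified illustration of that point (real young-generation collectors use forwarding pointers, remembered sets, and so on): the collector only touches the survivors it copies into the empty half of the space; the dead objects are reclaimed wholesale, without ever being visited.

    import java.util.ArrayList;
    import java.util.IdentityHashMap;
    import java.util.List;
    import java.util.Map;

    // Toy semi-space ("copying") collection over a tiny object graph.
    public class CopyingSketch {
        static final class Obj { List<Obj> refs = new ArrayList<>(); }

        // Copy everything reachable from the roots into a fresh "to-space" list.
        // Work is proportional to the survivors; dead objects are never touched --
        // reclaiming them is just dropping the old from-space wholesale.
        static List<Obj> collect(List<Obj> roots) {
            List<Obj> toSpace = new ArrayList<>();
            Map<Obj, Obj> forwarded = new IdentityHashMap<>();
            for (Obj root : roots) copy(root, toSpace, forwarded);
            return toSpace;
        }

        static Obj copy(Obj o, List<Obj> toSpace, Map<Obj, Obj> forwarded) {
            Obj copy = forwarded.get(o);
            if (copy != null) return copy;          // already evacuated
            copy = new Obj();
            forwarded.put(o, copy);                 // record forwarding before recursing (handles cycles)
            toSpace.add(copy);
            for (Obj child : o.refs) copy.refs.add(copy(child, toSpace, forwarded));
            return copy;
        }

        public static void main(String[] args) {
            Obj live = new Obj();
            live.refs.add(new Obj());               // reachable child
            new Obj();                              // garbage: never visited by collect()
            List<Obj> roots = new ArrayList<>();
            roots.add(live);
            System.out.println("survivors copied: " + collect(roots).size()); // 2
        }
    }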
> but it's not entirely deterministic because you don't know exactly when (or even in what thread) that "as soon as" happens.
But you do know when and where: it's at the exact moment the last reference to the resource was released, and in the exact thread that did this last release. In many cases, this will be a single place and thread; in other cases, it will be a well defined set of places and their corresponding threads, with the one coming temporally last being the one where the resource is freed (that is, the resource is held for as long as it's needed, and no longer).
> > it detects new garbage immediately after it becomes garbage, which keeps the resource usage at a minimum.
> Yes, but that comes at a cost to performance.
I was talking not only about memory, but also about other (more limited) resources like file descriptors. But yes, for memory there is a time/space tradeoff: reference counting minimizes memory usage, but for that it has to do more work by constantly freeing and allocating memory.
> Tracing indeed requires more memory, but it automatically uses that memory to reduce the cost of memory management.
Which occupies memory which could be used instead to reduce other costs. For instance, this extra memory could have been used by the operating system's disk cache.
> in other cases, it will be a well defined set of places and their corresponding threads
But that's also true for GC, only the set of places and threads is larger :) So I agree that refcounting is a more deterministic garbage collection than tracing, but it's still not deterministic. Deterministic is when the "set of places and threads" is of size 1.
> Which occupies memory which could be used instead to reduce other costs. For instance, this extra memory could have been used by the operating system's disk cache.
This is not quite true. A given machine, with a given set of processors P, running application X, has a certain processing capacity beyond which it cannot go. To reach that maximum capacity it has a certain working set size, M(P, X), that it can put in RAM. It can cache it from disk or from the network, but increasing that cache beyond that point won't help you with performance any more (especially for distributed caching, where the distributed workload would just cause lots of cache invalidation). Caching anything beyond M would not help (and perhaps could even hurt). But you can increase RAM by any amount beyond M to reduce the cost of memory management almost arbitrarily low with a tracing GC (the increased throughput may increase M, too, so that more memory could be put to other uses, not less).
This is why so many people want a low-latency GC. You can get machines with 6TB RAM (probably even more now). Very few applications can benefit from an effective working set of that size (unless they have 500 cores or something).
Of course, in environments where RAM is constrained tracing may not be the best choice.
It's not common because it doesn't actually solve deterministic resource disposal, which is what is actually required, and a problem reference counting trivially solves.
What needs to die is random devs, without a compiler-design background, assuming they know better than CS researchers and advocating reference counting.
There was an article here a few months ago describing how refcounting and liveness/reachability tracing are the two dual basic GC approaches, and how most simple GCs are a combination of the two techniques.
Tracing and reference counting are two GC approaches that the paper shows are dual in some interesting ways and can have similar characteristic -- when both are sophisticated enough.
All programming languages are "duals" of each other in the Turing complete sense, though that doesn't tell you much beyond: they're programming languages.
So here's how reference counting and garbage collection are duals: they both manage the memory lifetimes for objects not on the stack. If that blows your mind, you also might enjoy the above referenced paper and the meme that reference counting is "GC done badly".
The duality is in terms of seeing reference counting as a form of negative tracing. The paper also analyzes various techniques and claims that they mostly just mix the two approaches.
also if I can be facetious:
> So here's how reference counting and garbage collection are duals: they both manage the memory lifetimes for objects not on the stack.
You can always do better than a GCed shared-memory VM, especially in concurrent applications. And in general, shared data structures always slow down concurrent applications. There is no future for the JVM in concurrency at this point. Even a toy VM/language designed for concurrency will be able to beat it.
> There is no future for JVM in concurrency at this point. Even a toy VM/language designed for concurrency will be able to beat it.
I disagree.
Java's concurrency features are extremely rich. Parallel streams are basically magic: I add .parallel() and pow, near-linear speedups. No channels, no select statements, no realising I messed up a mutex somewhere. It just works.
Then there's reactive programming. The JVM can chew through truly absurd amounts of requests when a reactive approach is taken.
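A minimal example of the parallel-stream point, using only the standard library:

    import java.util.stream.LongStream;

    public class ParallelDemo {
        public static void main(String[] args) {
            // Sequential pipeline
            long seq = LongStream.rangeClosed(1, 2_000_000).map(i -> i * i).sum();
            // Same pipeline, spread across the common fork/join pool by one extra call
            long par = LongStream.rangeClosed(1, 2_000_000).parallel().map(i -> i * i).sum();
            System.out.println(seq == par);   // same result, computed in parallel
        }
    }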
"Even a toy VM/language designed for concurrency will be able to beat it."
Citation needed.
The JVM has a lot of problems, mostly due to its concentration on Java and bytecode/JITting. Given those constraints, memory management isn't really one of them.
We are not in the dark ages anymore. Denying reality doesn't help anyone here. Shared memory is dead for concurrency; it can be made neither fast nor easy to use.
I don't know what reality I'm denying. These post-dark-ages days Amazon, Apple, Google, Netflix and Twitter -- among many, many others -- are betting heavily on OpenJDK, with its shared heap GCs, for many/most of their big, concurrent backend applications.
I thought we were talking about engineering challenges and research, not the bets that execs of large companies make for very different reasons, mostly hiring reasons.
Oh, sorry, I didn't realize you're a Google/Apple/Netflix/Amazon executive who's intimately familiar with their technical decisions. It's certainly true there is a lot of research on non-shared memory, but also a lot of research on shared memory. If non-shared has so decisively won, then not only the executives but also the researchers didn't get that memo. If and when non-shared "wins", the popular industry platforms will adjust as well.
> I thought we were talking about engineering challenges and research, not the bets that execs of large companies make for very different reasons, mostly hiring reasons.
Large companies don't pick technologies for hiring reasons, small companies do. Large tech companies can afford to train their engineers in whatever they use internally and have the money (and brand recognition) to not worry about needing technologies to entice candidates. They pick technologies to solve business problems.
A fair part of "solving" GC is optimising for context-specific tradeoffs as application domains and environments change. That will continue forever unless we all agree to stick with one architecture to build all of our programs. That's the reason different GCs fare better now than in the 90's, for example, when nobody cared about concurrency.
That's just how technology works. It's like internal combustion engines. Garbage collection is mature. Mark and sweep was invented in 1960. Concurrent collection was developed in the late 1970s. We are currently in the plateau stage of the technology, where we are exploring the universe of trade-offs between basic constraints (e.g. throughput versus latency). There is no "magic bullet" forthcoming.
The linked article is literally about a complete rewrite of the JVM garbage collector!
"Mature" technologies are things like compilers, where e.g. gcc was last significantly reworked 25 years ago and the "new upstart kid" competitor clang is in its second decade of life. Garbage collection isn't mature; if it were, no one would be rewriting it and blogging about the results.
You could design a new internal combustion engine (and companies do) but that doesn’t mean the technology isn’t mature. You’re working within a well understood theoretical framework, and the low hanging fruit has been discovered so you’re at a point of diminishing returns.
Throughput-style garbage collection has mostly been solved; I believe the only outstanding problem is to reduce the necessary memory footprint towards the minimum required by the application. (On the other hand, the best memory management scheme is still an old-style Unix proggy that does bump-allocations in an arena and never frees memory until it terminates. Just don't let it run too long.)
Most of the work in GC has been towards reducing latency, which typically means different GC algorithms for different tasks. The ideal of a fast, real-time GC is still an open research project.
Explicit heaps are still not a solved problem, either, and they've been around longer than GC work (ahem*). With no static checking for leaks, use-after-free, and other issues, C memory management can be a nightmare unless you have iron-willed development discipline. The only real tools for dealing with those issues are run-time testing tools like valgrind, which among other things are horribly slow if you want to run them against a real workload.
Rust is an interesting case: it does an amazing job, if you use it for C++-like RAII idioms. If you don't, or can't, you either can't get there from here, or you're in deep unsafe voodoo land.
(One issue I brought up the last time this came up was the perennial graph structure: the best representation of edges is a numeric index into a vector of nodes. It works, mostly. However, that numeric index is acting almost exactly like a raw pointer into the vector. (The only difference is that the vector will ensure the index is currently valid when you dig out a node.) The index, specifically, doesn't track the lifetime of the node or the overall graph, so you may get a node, just not the node you were looking for.)
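For what it's worth, the same stale-index hazard exists on a GC'd runtime; the index is memory-safe but not identity-safe. A small Java illustration with made-up names:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: index-based graph edges, and the "stale index" hazard.
    public class IndexGraph {
        static final class Node {
            final String name;
            List<Integer> edges = new ArrayList<>();
            Node(String name) { this.name = name; }
        }

        public static void main(String[] args) {
            List<Node> nodes = new ArrayList<>();
            nodes.add(new Node("a"));           // index 0
            nodes.add(new Node("b"));           // index 1
            nodes.add(new Node("c"));           // index 2
            nodes.get(0).edges.add(2);          // a -> c, stored as a plain index

            nodes.remove(1);                    // delete "b"; "c" shifts to index 1

            int edge = nodes.get(0).edges.get(0);
            // No crash, no GC involvement -- but the stored edge now points past the
            // end (or at the wrong node), much like a dangling pointer would.
            System.out.println("a's edge resolves to index " + edge +
                    (edge < nodes.size() ? " = " + nodes.get(edge).name : " = out of bounds"));
        }
    }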
Now, if you wanted to bring up explicit linear or affine memory management systems, ala ATS, where the allocation comes with a statically checked certificate that the pointer is valid and which must be explicitly destroyed by freeing the memory for the program to compile, I would be singing in the chorus with you. But those are research problems, too.
In the mean time, unless you are working on a problem that explicitly requires managing your own memory, you would probably be advised to use a garbage collector, and one of the many choices that works best with your problem, for a fair number of reasons.
The index-based approach can be refined to guarantee that the index is in the heap and avoid the bounds check, or even updated to directly use raw pointers if you use an arena. The main disadvantage compared to garbage collection is that it's not moving, but for many workloads the whole graph will die before a more mature GC would do a major collection anyway--all the movement is in the young generation. So, the main advantage of GC at that point is mostly about memory consumption since most objects die young (and you don't have to trace them if they do). Rust could get back almost all performance against state of the art GCs even for graph workloads by just using an arena for the old generations and focusing on young generation optimization.
> Throughput reduction is predictable, and it's easy to plan for that — if your program runs 10% slower, bring up ~10% more servers; that's about it. But long GC pauses are rapid and volatile; you can't "autoscale" out of them, so in order to not fall over, you must allot extra resources to handle them.
Exactly. Long tail latency is often a more important server metric than throughput.
This is why I'm excited about Swift on the server. Unlike most other languages it uses automatic reference counting instead of stop-the-world garbage collection. That means consistent latency.
I've worked on a couple of million-line codebases that used reference counting. They were a joy, and their latency behavior was beautiful.
These days my smallish async Python crawler runs for 24 hours and then starts having 5 second gc-related delays once every couple of minutes. Diagnostics says it's during gc of the longest-lived generation. Sucks to be me. I probably need to stop keeping my long-lived queues as Python objects.
Does it? Based on my admittedly limited knowledge of refcounting, either freeing an object can take a non-predictable amount of time (decrementing the reference counters of all the objects pointed to by a freed object), or allocating new memory can take a non-predictable amount of time (by lazily deferring the above-mentioned decrements until that memory is needed upon allocation).
The amount of effort required when a refcount reaches zero is unpredictable in general, yes, but at any particular site, the shape and lifetime of objects tend to follow a pattern. So if a particular bit of code drops a refcount to zero and takes a long time deallocating objects, it will with high probability always do so and is hence still fairly predictable/consistent.
There are plenty of reasons why reference counting has kept losing to tracing GC since the 70s. Swift is not bringing any breakthrough over the variants that came before it.
There exist runtimes that use reference counting and mark/sweep together. Mark/sweep is used to collect cycles that reference counting misses. PHP and Firefox's XPCOM both turn up in a google search for Cycle Collector.
“We have discussed in the past using hybrid approaches like introducing a cycle collector, which runs less frequently than a GC would. The problem with this is that if you introduce a cycle collector, code will start depending on it. In time you end up with some libraries/packages that work without GC, and others that leak without it (the D community has relevant experience here). As such, we have come to think that adding a cycle collector would be bad for the Swift community in the large.”[1]
With a good architecture, and the right tools to detect cycles caused by bugs, reference counting works even for complex applications. Essentially this is a trade-off between development time and headaches at deployment.
A presentation on Shenandoah by one of the principal developers (focuses on the how): https://youtube.com/watch?v=4d9-FQZZoVA
Another presentation (focuses on the how, why, and when) at devoxx: https://www.youtube.com/watch?v=VCeHkcwfF9Q