For promising modern and parallel GC techniques, please check out MPL, or MaPLe, with its novel Automatic Management of Parallelism. It won a distinguished paper award at POPL 2024 and the ACM SIGPLAN dissertation award in 2023 for proposing these two main things [1],[2]:
a) Provably efficient parallel garbage collection based on disentanglement
b) Provably efficient automatic granularity control
Standard ML and the community around it have been pretty impressive as far as contributions to the memory management literature go.
There is of course the paper you linked, and there's also the MLKit which was among the first users, and one of the pioneers, of region-based memory management.
I'm one of the authors of this work -- I can explain a little.
"Provably efficient" means that the language provides worst-case performance guarantees.
For example in the "Automatic Parallelism Management" paper (https://dl.acm.org/doi/10.1145/3632880), we develop a compiler and run-time system that can execute extremely fine-grained parallel code without losing performance. (Concretely, imagine tiny tasks of around only 10-100 instructions each.)
The key idea is to make sure that any task which is *too tiny* is executed sequentially instead of in parallel. To make this happen, we use a scheduler that runs in the background during execution. It is the scheduler's job to decide on-the-fly which tasks should be sequentialized and which tasks should be "promoted" into actual threads that can run in parallel. Intuitively, each promotion incurs a cost, but also exposes parallelism.
In the paper, we present our scheduler and prove a worst-case performance bound. We specifically show that the total overhead of promotion will be at most a small constant factor (e.g., 1% overhead), and also that the theoretical amount of parallelism is unaffected, asymptotically.
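To make the promotion idea concrete, here is a toy C++ sketch (not MaPLe's actual scheduler -- the interval constant and the names are made up for illustration): each potential fork runs sequentially unless enough work has accumulated since the last promotion, so the cost of spawning real threads is amortized against sequential execution.

    #include <cstdint>
    #include <future>
    #include <utility>

    constexpr std::uint64_t kPromotionInterval = 50000;  // arbitrary "heartbeat" threshold
    thread_local std::uint64_t work_since_promotion = 0;

    // Called at every potential fork point with a rough estimate of the work done
    // since the last check; returns true only occasionally.
    inline bool should_promote(std::uint64_t recent_work) {
        work_since_promotion += recent_work;
        if (work_since_promotion < kPromotionInterval) return false;
        work_since_promotion = 0;
        return true;
    }

    // Run two tasks "in parallel": promote the first to a real thread only if the
    // scheduler says so; otherwise execute both sequentially on this thread.
    template <typename F, typename G>
    auto par(F f, G g, std::uint64_t estimated_cost)
        -> std::pair<decltype(f()), decltype(g())> {
        if (should_promote(estimated_cost)) {
            auto left = std::async(std::launch::async, f);  // promoted: real parallelism
            auto right = g();
            return {left.get(), std::move(right)};
        }
        return {f(), g()};  // too tiny: stay sequential
    }

In MaPLe this bookkeeping is done by the compiler and run-time system rather than by user code, which is what makes the worst-case bound provable.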
* Separate compilation vs whole-program compilation. OCaml uses separate compilation and therefore has a very constrained heap object model which makes it possible to get polymorphism across separately compiled modules. In contrast, MaPLe uses whole-program compilation and therefore is able to monomorphize and optimize much more aggressively. Whole-program compilation can be slow for large projects.
* The multicore OCaml effort was driven by backwards compatibility, especially in terms of performance -- they wanted to ensure that the performance of existing sequential OCaml code would be completely unaffected by the new run-time system. In contrast, MaPLe focuses on efficiency and scalability for parallel code.
* Multicore OCaml will let you implement your own scheduler, as a library, on top of coarse-grained threads. In contrast, MaPLe comes with a built-in scheduler, and it's not easy to change it.
We did a comparison with multicore OCaml in https://dl.acm.org/doi/10.1145/3591284, and found that MaPLe can be significantly faster, but that comes with all of the tradeoffs above. And, it's a cross-language comparison, so take it with a grain of salt. Our comparison in particular emphasized similar source code, but typically, fast code in OCaml just looks different from fast code in MaPLe. For example, in OCaml, you often need to manually unbox certain data structures to get better memory efficiency (but MaPLe will often do this for you, automatically).
One of the people who helped optimise the multicore implementation for OCaml said it was the way to go, but that was in 2020. Don’t know where things are now. https://news.ycombinator.com/item?id=23776609
Question for people who are more qualified: How applicable is this to other languages? Could this approach significantly speed up garbage collection in Go for example?
Or do we run into design issues with existing languages?
The RCU use case is convincing, but my experience with GCs in other situations has been poor.
To me, this reads more like an argument that bespoke memory management solutions can yield the best performance (I agree!), which is a totally different claim from the more general one that static lifetimes outperform dynamic lifetimes (especially when a tracing step is needed to determine liveness).
> Lies people believe... Calling free() gives the memory back to the OS.
I believe calling `free()` gives the memory back to the allocator, which is much better than giving it to the OS; syscalls are slow.
Perhaps not immediately; mimalloc only makes frees available to future `malloc`s periodically.
Trying a simple benchmark where I allocate and then immediately `free` 800 bytes, 1 million times, and counting the number of unique pointers I get:
glibc's malloc: 1
jemalloc: 1
mimalloc: 4
Julia's garbage collector: 62767
62767, at about 48 MiB, isn't that bad, but it still blows out my computer's L3 cache.
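For reference, here is a minimal C++ sketch of that kind of benchmark (my own illustration, not the original code; which allocator serves malloc depends on what you link against, e.g. jemalloc or mimalloc):

    #include <cstdio>
    #include <cstdlib>
    #include <unordered_set>

    int main() {
        std::unordered_set<void*> unique_ptrs;
        for (int i = 0; i < 1000000; ++i) {
            void* p = std::malloc(800);   // allocate 800 bytes...
            unique_ptrs.insert(p);
            std::free(p);                 // ...and immediately free them
        }
        // With most mallocs the freed block is reused, so very few distinct
        // addresses ever appear; a bump-allocating GC hands out fresh ones.
        std::printf("unique pointers: %zu\n", unique_ptrs.size());
        return 0;
    }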
Using a GC basically guarantees every new allocation is from RAM, rather than cache. This kills performance of any heavily allocating code; we don't care only about how fast memory management can work, but how quickly we can work with what it gives us.
I gave a benchmark in Julia showcasing this: https://discourse.julialang.org/t/blog-post-rust-vs-julia-in...
Malloc/free gives you a chance at staying hot, if your actual working memory is small enough.
Allocators like mimalloc are also designed (like the compacting GC) to have successive allocations be close together. The 4 unique pointers I got from mimalloc were 896 bytes apart.
My opinions might be less sour if I had more experience with compacting GCs, but I think GCs are just a vastly more complicated solution to the problem of safe memory management than something like Rust's borrow checker.
Given that the complexity is foisted on the compiler and runtime developers, that's normally not so bad for users, and an acceptable tradeoff when writing code that isn't performance sensitive.
Similarly, RAII with static lifetimes is also a reasonable tradeoff for code not important enough for more bespoke approaches.
The article's example is evidently one of those deserving a more bespoke solution.
It's really not enough to just say that a GC gave you more pointers = it has worse cache locality. A compacting GC almost always has better cache utilization than malloc, because heap fragmentation over long-running programs wastes TLB entries and slack space between objects. A bump allocator from a compacting GC will give you a new pointer for each allocation because free doesn't reclaim the memory... but those allocations will be sequential, and if you are in the case where you are churning through your heap and only ever touch the most recent object, they will always still be in cache. Benchmarking the knock-on effects of allocators and GCs is insanely difficult and I'm very skeptical of basically any synthetic benchmarks like this.
I think the fact that it is complicated to reason about is precisely why systems developers don’t trust GC’s.
It’s far easier to write a threadsafe bump (slab) allocator than to locate and diagnose the code that someone wrote two years ago and, as of last week, started causing the GC to blow up the cache, contend on a lock, fragment the heap, add unbounded pauses, burn an extra cpu, etc, etc.
(Though, at this point, most mallocs are so good that the slab allocator loses anyway and there’s no need to bother.)
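For illustration, a minimal sketch of such a thread-safe bump allocator (illustrative only; real slab allocators add size classes, per-thread chunks, and recycling):

    #include <atomic>
    #include <cstddef>
    #include <cstdlib>

    class BumpArena {
        char* base_;
        std::size_t capacity_;
        std::atomic<std::size_t> offset_{0};
    public:
        explicit BumpArena(std::size_t capacity)
            : base_(static_cast<char*>(std::malloc(capacity))), capacity_(capacity) {}
        ~BumpArena() { std::free(base_); }

        // Thread-safe: the only shared state is a single atomic bump pointer.
        void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
            std::size_t rounded = (size + align - 1) & ~(align - 1);   // align must be a power of two
            std::size_t old = offset_.fetch_add(rounded, std::memory_order_relaxed);
            if (old + rounded > capacity_) return nullptr;             // out of space
            return base_ + old;
        }

        // Individual frees are no-ops; the whole arena is reclaimed at once.
        void reset() { offset_.store(0, std::memory_order_relaxed); }
    };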
FWIW, that synthetic benchmark was reflective of some real world code we were deploying.
Using malloc/free for one function led to something like a 2x performance improvement of the whole program.
I think it's important to differentiate between malloc implementations/algorithms, just like it's important to differentiate between GCs.
E.g., mimalloc "shards" size classes into pages, with separate free lists per page. This way, subsequent allocations all come from the same page. Freeing is not eager; only when an entire page is freed, or when a new allocation finds the page empty, does it hit a periodic slow path to do deferred work.
https://www.microsoft.com/en-us/research/uploads/prod/2019/0...
Good malloc implementations can also employ techniques to avoid fragmentation.
It's unfortunate that the defaults are bad.
But I confess, compacting GCs and profiling the effects of heap fragmentation (especially over time in long running programs) are both things I lack experience in. Microbenchmarks are unlikely to capture that accurately.
If we are talking long-running multithreaded processes, then libc malloc has issues that aren't even fragmentation. There are many workloads where it seems to forget to return whole empty pages and just accumulates allocated areas despite them being totally empty.
Compaction really does help runtimes a lot, but I'm not sure how much it really has to do with line-level locality. In general we don't try to batch related objects together except in a coarse form by generation.
I think the measurable benefit comes from page-level savings, both reducing the number of trips to the kernel to get zeroed pages and reducing pressure on the TLB.
But I have definitely seen numbers like 20% on some workloads for turning on compaction.
> in general we don't try to batch related objects together except in a coarse form by generation.
It greatly depends on the GC algorithm, right? Copying collectors naturally bunch up related objects (though of course, in which order is less well defined, if one object has multiple pointers to others).
> Compacting GC almost always has better cache utilization than malloc
...only if you put each object into its own heap allocation, which is a dumb thing to do in manual memory management. Typically you'd group many related objects of the same type tightly packed into the same allocation and loop over them in sequence to make proper use of prefetching.
The programmer will always have a better idea how objects are going to be used compared to an automatic memory management solution, and even with a GC it makes a lot of sense to use arrays of value types for better memory layout control.
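As a small illustration of that point (my own sketch, not from the thread): one contiguous allocation of value types versus one heap object per element.

    #include <memory>
    #include <vector>

    struct Particle { float x, y, vx, vy; };

    // One allocation, elements packed contiguously: prefetch-friendly.
    void step_packed(std::vector<Particle>& ps, float dt) {
        for (auto& p : ps) { p.x += p.vx * dt; p.y += p.vy * dt; }
    }

    // One heap object per element: every iteration chases a pointer.
    void step_boxed(std::vector<std::unique_ptr<Particle>>& ps, float dt) {
        for (auto& p : ps) { p->x += p->vx * dt; p->y += p->vy * dt; }
    }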
The post explains why this works in the RCU context, why it sucks in general, and then just writes it off and ignores it.
> At this point, some folks fire back with non-arguments about how this isn’t “real” garbage collection. Like, uh, because you manually mark the garbage!
Yeah. People's concerns are that the process of figuring out what memory is no longer used is inefficient and non-deterministic relative to simply telling the allocator when you're done with a resource. I've never met someone who's been concerned with deferring deallocation.
Sure traversing the whole live set is rare and we've spent 30 years tweaking GC algorithms to make them better, and now they're practically sentient. However this statement either willfully or unintentionally writes off the thing people actually have an issue with. If you run into GC issues in your services you have to bring in a shaman to tweak things here and there hoping it sends the angry spirits back to the shadow realm.
If you're just marking the garbage and being notified when it's no longer used, that entire process is gone.
Yes, it can be very fast to allocate memory in a GC. But this ignores the cost of marking and compaction, which needs to be amortized in to get a fair comparison.
The other big issue people have with GC is that in general it requires significantly more memory than manual memory management to achieve equivalent performance. And you have to have a bunch of extra CPU power to throw at redundantly checking if things are still referenced over and over. And you have to be okay with a bunch of extra memory copies for optimistic compaction.
Finally the post criticizes manual memory management (Rust's Arc/Rc) as being necessary when you have unclear lifetimes - but ignores that you basically build the exact same infrastructure in GC'd languages to close external resources as you can't rely on finalizers ever being called.
Anyways this has been beaten to death for the last 20-30 years and this article doesn't seem to bring anything new to the table besides ignoring the legitimate concerns of a GC using memes, which is fine because memes are fun IMO.
The correct answer is exactly what you say - there is no general correct answer. You use the tool appropriate to the job to meet the design constraints of the system.
> I've never met someone who's been concerned with deferring deallocation.
Realtime audio processing (instruments and effects), where malloc/free can never be called on the realtime audio thread. Deallocations have to be deferred.
If it matters for realtime audio, I cannot imagine that it would not matter for twitch games as well.
In non-GC languages the audio thread can be kept running even when changing audio plugins, as long as allocations and deallocations are not performed on the audio thread.
Realtime audio synthesis is possible in .net; but no memory allocations can occur anywhere in the entire realtime audio process. It is possible to write allocation-free message queues between a UI process and a realtime process. If allocations do occur in the realtime audio process, the realtime thread has to be suspended at some point in order to grab object references on the stack of the realtime thread -- a process that can take 2ms or more in .net GCs (which is more than enough to cause audio dropouts).[1]
> > I've never met someone who's been concerned with deferring deallocation.
> Realtime audio processing (instruments and effects), where malloc/free can never be called on the realtime audio thread. Deallocations have to be deferred.
I think you misunderstood GP, they are saying “I’ve never met someone who thinks deferred deallocation is a bad thing”, ie. you’re agreeing with them.
Many of the systems I have seen that need to be deterministic and fast but still need to allocate will use pool allocators or slab allocators. The edge cases mean that malloc/free is not in the conversation. I suppose that pre-allocating objects and deferring deallocation would also work, but that seems roughly equivalent to pool allocation.
The null allocator is the fastest allocator, though!
The principal application is making changes to audio plugin chains without stopping the real-time audio thread. The new chain has to be pre-allocated, and the old chain has to be sent off-thread to be deallocated on a deferred basis. You can't pool-allocate a VST or an LV2 audio plugin.
Data subscriptions (Audio VU data and control port values) also use a similar deferred deallocation scheme. Doing so allows use of mutexes and shared ptrs in dependent non-realtime data structures.
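A rough sketch of what that deferral can look like (illustrative C++ only, not the actual plugin host code): the audio thread retires old objects into a fixed-size lock-free single-producer/single-consumer queue, and a background thread deletes them.

    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <thread>

    template <typename T, std::size_t N>
    class RetireQueue {                        // SPSC ring buffer of raw pointers
        T* slots_[N] = {};
        std::atomic<std::size_t> head_{0};     // advanced by the consumer
        std::atomic<std::size_t> tail_{0};     // advanced by the producer
    public:
        bool push(T* p) {                      // realtime thread: no locks, no allocation
            std::size_t t = tail_.load(std::memory_order_relaxed);
            if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
            slots_[t % N] = p;
            tail_.store(t + 1, std::memory_order_release);
            return true;
        }
        T* pop() {                             // cleanup thread
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire)) return nullptr;    // empty
            T* p = slots_[h % N];
            head_.store(h + 1, std::memory_order_release);
            return p;
        }
    };

    struct PluginChain { /* preallocated plugin graph */ };

    RetireQueue<PluginChain, 64> retired;

    void cleanup_loop(std::atomic<bool>& running) {   // ordinary (non-realtime) thread
        while (running.load()) {
            while (PluginChain* p = retired.pop()) delete p;   // frees happen off the audio thread
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    }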
This makes no sense to me. In a generational GC gen-0 is more likely than not to be cached — and that’s where the ‘churn’ is. Outside of that, any longer lived allocations are by definition not easy to control cache-wise.
Locality is one of the big wins for GCs, the only issue I’m aware of is the ‘stop the world’ mark/sweep (yes, I know modern GCs have a background thread — but you still get stop-the-world events afaik)
Modern collectors have stop-the-world pause times in the millisecond range (aiming for less), even for very large heaps. Allocating threads may also incur allocation stalls, also in the millisecond range. However, these collectors need additional barriers in the application, and the concurrently running collector competes with the application for resources even if the application is not paused. Meaningful comparisons are difficult.
The problem is not only how long the pause takes but also the fact it pauses all the things. In manual memory management even if you have to spend some time in allocation / deallocation, it affects only the allocating thread. A thread that doesn’t allocate doesn’t pause.
With current concurrent collectors, the stop-the-world pauses are so short that they approach the pause times processes encounter for other reasons on untuned systems.
But I meant something else: it's not always clear whether low-pause collectors are a win. A totally made-up example: Is it worthwhile to add 2ms processing time to every request to avoid that every 7 to 10 minutes, there is a 20% chance that there is a garbage collection that delays all requests currently in flight by 40ms?
> I believe calling `free()` gives the memory back to the allocator, which is much better than giving it to the OS
Having to deal with memory fragmentation in long running servers is no fun at all, especially internal fragmentation of pages maintained by slab allocators.
This is not a very common problem, but it is a hard one to deal with.
Fragmentation rarely wastes more than 20% of memory, and 50% is extremely unlikely. But with a tracing GC you're often wasting 4-5x right from the start. Maybe it's not called fragmentation, but the headroom needed for the GC to run efficiently is also waste. Plus all the headers in the objects to keep the marking flags also add up.
If cache usage is that major of a concern, arena allocation works just as well as it does with manual memory allocation. Thankfully there aren't too many areas where garbage collection has to compete with such conveniently contrived examples.
>> My opinions might be less sour if I had more experience with compacting GCs
I have quite a bit of experience with the GC in .NET. For projects that deal with large data structures, the GC is something that you are always thinking about, even though its behavior is conceptually transparent. I think I would ultimately prefer a more explicit approach.
Except in the special case where all memory can be easily handled in arenas, good tracing GCs have long ago surpassed manual memory management in throughput, and more recently their latency impact is more than acceptable for the vast majority of applications (OpenJDK's ZGC has typical pause times measured in double/triple-digit microseconds, and the worst case rarely exceeds 1ms for a reasonable allocation rate -- the pauses are in the same ballpark as OS-induced ones). The only real and significant tradeoff is in memory footprint, and outside of specialty niches (where arenas just work for everything and worst-case latency is in the low microseconds range) that is the only high order question: is my application running in a memory-constrained environment (or it's really worth it to sacrifice other things to keep down RAM consumption) or not?
IME it's the other way around, per-object individual lifetimes is a rare special case, in most real world code there will be many related objects of the same or very similar lifetimes. In such code, tracking individual object lifetimes is overkill (in the end, memory management is all about lifetimes, and fewer individual lifetimes instead of many is always better because it means less work, both in manual and automatic memory management).
Not having to think about object lifetimes is just very convenient, that's why GC languages were successful despite the significant under-the-hood complexity of a good garbage collector.
That's not at all a rare special case in most server applications.
One way to see that is to consider how much of a program's working set is in threads' stacks vs the heap. If most of the working set is in the heap, there are usually some non-trivial object lifetimes involved (i.e. cases where lifetimes can be encapsulated and abstracted away from client code). Yes, sometimes all of these can be taken care of by arenas (and there are languages, like Zig, that strive to be very arena-friendly), but that -- i.e. the case where all objects are easily arena-able -- is not the more common scenario.
Depends on the type of server I guess. I can easily imagine a situation where all allocations of a request are handled through a simple bump allocator, and once the request is done, the bump allocator is reset instead of 'freeing' each individual allocation.
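For what it's worth, C++17's polymorphic allocators make that pattern fairly direct (a sketch with placeholder sizes and types):

    #include <memory_resource>
    #include <string>
    #include <vector>

    void handle_request(/* request data */) {
        // Bump allocator backing all per-request containers; grows in chunks if needed.
        std::pmr::monotonic_buffer_resource arena(64 * 1024);
        std::pmr::vector<std::pmr::string> parts(&arena);
        parts.emplace_back("parsed");
        parts.emplace_back("fields");
        // ... build the response from `parts` ...
    }   // arena destroyed here: every per-request allocation is released in one step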
This would only work for fairly trivial applications. The moment you start adding things like http clients or databases you have to start considering having connection pools with lifetimes that don't strictly match a request lifecycle.
Not saying such an application doesn't exist, it certainly does.
How did we go from "this is the most common case" to "I can imagine it?" Sure there are cases where a request can be handled by the stack but the point is that the more complex case is extremely common.
Any kind of background/asynchronous work spawned from a request and your plans are shot.
Yes and that is yet another case for encapsulation.
For me, it is an antipattern for a short lived task to be directly creating long lived resources. Async work ideally is scheduled using message passing to the task manager which may have its own memory management strategy (heck, the background task may well be written in an entirely different language).
I just feel that it is very common to have large portions of the application that fit the model of read input/write output, free all intermediate data upon completion. Due to a lack of encapsulation, however, we put unnecessary pressure on the allocator or garbage collector by mixing the short and long lived parts of our applications.
> we put unnecessary pressure on the allocator or garbage collector by mixing the short and long lived parts of our applications.
That is not how modern tracing GCs work, though. First, modern tracing GCs don't do any work to actively free any dead object (they don't even need to know about them). They work to compact the live ones. Second, modern tracing GCs automatically separate short-lived from long-lived objects and preserve them in different ways.
While it's true that in principle a programmer could find a technique to optimally manage memory in a given program and that even a good GC would never be quite as optimal, in practice a GC would do a better job than you could unless you spend an inordinate amount of time figuring out the optimal strategy; furthermore, the optimal strategy has to contend with non-local effects, i.e. a change in one subroutine might have a global effect on what the optimal global strategy is.
So as a first approximation, a good tracing GC will give you better memory management (performance-wise) per unit of programmer effort, and if you want even better performance -- give the GC more RAM. You're trading off RAM for less work on your part.
That “simple bump allocator” is basically allocating everything on the thread's stack, as the GP mentioned.
It's what you put on the heap that has complex lifetimes. Sometimes you can fix that with an arena. If you can't, you probably can't figure out when the last reference to it dies either.
In generational GCs there are two or more allocation regions. New objects are put in the "younger" generation, which is garbage collected separated from the other generations. This sort of resolves the issue of tracking individual object lifetimes by having all the short-lived objects subject to rapid GC. This means most of the effort tracking lifetimes is reserved for the fewer long-lived objects.
It’s “medium term” objects that cause all the trouble. Async operations. That’s why .net invested heavily into zero-alloc tasks, as the gc would kill scaling in async heavy code.
> IME it's the other way around, per-object individual lifetimes is a rare special case
It depends on your application domain. But in most cases where objects have "individual lifetimes" you can still use reference counting, which has lower latency and memory overhead than tracing GC and interacts well with manual memory management. Tracing GC can then be "plugged in" for very specific cases, preferably using a high performance concurrent implementation much like https://github.com/chc4/samsara (for Rust) or https://github.com/pebal/sgcl (for C++).
Reference counting is neither lower latency nor lower memory overhead than basically everything else.
Reference counting requires atomics on every single object. Per object atomics are quite unfriendly to modern microprocessors.
There is either very little contention with lots of atomics that you don't need or you have high contention with atomics that are blowing out your cache lines repeatedly.
In addition, managing reference counts practically requires RAII semantics and all of the baggage that goes along with that. Doing reference counting in C, for example, is extremely error prone.
The reason Rust has both Arc and Rc types for reference counting is precisely because most of the time when you need reference counting, you do not need thread safety. This is something I think about all the time in C++ when I use std::shared_ptr: it gives me thread safety using atomics but I don't need it.
More languages should distinguish between thread safe reference counting and single-threaded reference counting.
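A rough C++ analogue of that distinction (illustrative only, class name made up): std::shared_ptr always pays for atomic reference counts, while a single-threaded handle can use a plain integer.

    #include <cstddef>
    #include <utility>

    template <typename T>
    class LocalRc {                    // non-atomic refcount: single-threaded use only
        struct Box { T value; std::size_t count; };
        Box* box_ = nullptr;
    public:
        explicit LocalRc(T value) : box_(new Box{std::move(value), 1}) {}
        LocalRc(const LocalRc& other) : box_(other.box_) { if (box_) ++box_->count; }
        LocalRc& operator=(const LocalRc&) = delete;   // kept minimal for the sketch
        ~LocalRc() { if (box_ && --box_->count == 0) delete box_; }
        T& operator*() const { return box_->value; }
    };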
> The reason Rust has both Arc and Rc types for reference counting is precisely because most of the time when you need reference counting, you do not need thread safety.
Hmmm, I'm curious about your use cases as this is almost precisely opposite my experience. Normally, I regard Rc as a footgun as I've always found that I'm shortly going to have to change everything to Arc.
I almost never use atomic reference counters, when I use them at all. A major point of using thread-per-core software architectures is that it eliminates almost all atomics even in highly concurrent contexts that may require reference counting to manage resource lifetimes.
In these cases, your concurrency primitive is essentially a coroutine, either with or without its own stack. You may have thousands of these executing concurrently similar to as if they were multithreaded and using locks or atomics, but the execution is always on a single thread so locks are unnecessary and atomics can be elided entirely. In the rare cases where atomics are used, there are almost always at most two threads accessing them.
Basically, the only place you see atomics are the very thin and lightly accessed interfaces between threads, which are about resource control rather than true thread concurrency.
I have used these types of architectures on diverse high-scale server apps. The places I’ve had to use atomic reference counters were always extremely limited and largely outside the hot path.
Then why are you using Rc at all? This seems like "arenas" would be a much better match where you can reclaim everything in one fell swoop as soon as the thread finishes its task.
Is the issue simply that Rust until recently didn't support memory allocators other than the standard one?
The resource management doesn't fit an arena model, that is an overly simplistic view of what is required. The resources have a lifecycle independent of the references any execution context has with them. A reference isn't ownership, it just keeps the resource from disappearing or pins it in a particular context.
Memory allocations may not actually exist in memory. They can be paged to disk in user space because, frankly, the virtual memory hardware on modern CPUs is often incapable of handling large storage and when it does the performance is poor. Sure, you can do better than the default allocators in Rust and elsewhere but that is far from the only reason you might want a reference counter on a resource. DMA requires managing reference counts for threads of execution that only exist in silicon, not in software.
Arenas solve none of this. The lifecycle of the resource has nothing to do with the lifecycle of the execution context of a coroutine.
I use Rc inside the implementation of data structures such as persistent trees, and also to implement functionality such as Undo. The synchronization happens on a higher level.
Reference counting does not have lower latency than a tracing GC nor is it more generally predictable. In terms of performance, it is usually worse than a modern tracing GC by most performance metrics. Its benefits are, indeed, lower memory footprint and that it can achieve reasonable latency even using a crude implementation. In other words, you'd reach for refcounting either if you mostly care about footprint or if you don't have a lot of resources to invest in implementing the GC.
Modern concurrent tracing has very little impact on latency. First, compared to refcounting, the barriers are rarely invoked. Second, they compact the heap concurrently in the background, and like all mark-compact GCs, their amortized cost drops to zero as the RAM increases (the work is linear with the size of the working set, which is roughly constant for a given program, but it needs to be done less frequently the more RAM you have). That's why high-performance languages that rely heavily on a GC -- Java, C#, Go -- prefer tracing.
The state-of-the-art in refcounting[1] greatly improves the barrier situation over a naïve implementation: no read barrier, and the write barrier only uses atomics when mutating an (apparently) unmodified field in an old generation object.
Certainly improvements in refcounting can bring its performance closer to that of tracing (and there are constant improvements in tracing, too), but one of the main reasons some languages choose refcounting is simplicity of implementation, and these improvements bring the complexity of tracing and refcounting closer to each other. Currently, tracing leads in performance, but if that changes we'll see a shift in the algorithm for languages that depend heavily on GC performance.
BTW, our GC team investigated the implementation in that paper, and it is still significantly behind that of tracing in too many relevant workloads to be considered for adoption in production.
When the reference count can only be zero or one, reference counting has different performance than when it can be more, and it is much easier to reason about. This is most cases.
And yet, the reason tracing GCs are chosen by virtually all high-performance languages that heavily rely on GC is that they've been found to be faster in practice for the common workloads.
One of the reasons why your intuition is not so straightforward is that a tracing GC needs to do no work whatsoever when the number of references is zero. One of the common ways to teach the basics of GC is by starting out looking at tracing and refcounting as duals: refcounting needs to work to free objects, while tracing works to keep them alive. If you think in terms of what work needs to be done to promptly determine when an object becomes garbage, then you're already not thinking in terms of tracing, because tracing never actually needs to learn when an object becomes garbage (this isn't actually true when they do reference processing, but that's another story).
Or, if you want to think about it another way, in a tracing collector there are already only two cases no matter how many pointers there are to an object: reachable or not, i.e. the same one and zero as in your case, only there isn't even a need to ever set the counter to zero.
However, in principle tracing and refcounting can be quite similar (https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/unifie...) in their behaviour, but in practice most refcounting GCs in industry use are crude, and don't match the performance of tracing GCs in common use, which are quite sophisticated.
> And yet, the reason tracing GCs are chosen by virtually all high-performance languages that heavily rely on GC is that they've been found to be faster in practice for the common workloads.
I disagree.
IMO the reason is simplicity, most languages don’t want to burden programmers with lifetimes and/or soft references.
Rust and Swift, both high-performance languages targeting specific niches, chose differently. Both largely for reasons of predictability, lower latency, lower memory use (which you admitted above) as well as lower memory trashing.
> IMO the reason is simplicity, most languages don’t want to burden programmers with lifetimes and/or soft references.
I'm talking about languages that already use a GC as the primary heap management mechanism. Any kind of GC algorithm, whether tracing or refcounting, frees developers from thinking about lifetimes, but languages that primarily rely on GC and care about performance almost always pick tracing.
The only languages that choose a refcounting GC are either those that care little about performance (Python) or those that don't rely heavily on GC (Rust).
> Both largely for reasons of predictability, lower latency, lower memory use
Refcounting offers worse latency, worse predictability [1] (compared to modern low-latency tracing GCs, like ZGC), and worse throughput; it does offer a significantly lower memory footprint. The reason Rust chose refcounting is primarily because it doesn't rely on the GC heavily and so it might as well use a crude GC (and enjoy the lower effort of implementing it as well as the lower footprint), as a crude refcounting GC is usually better than a crude tracing GC.
Furthermore, a sophisticated tracing GC is not a great fit for languages like Rust or C++ because it requires more elaborate code generation and some "magical" mutator thread behaviour, whereas low-level languages want the generated code to be as close to the source code as possible. This isn't so much predictability of performance, but predictability of the generated code and the behaviour of mutators (in terms of lowering the surprise of "why is my thread doing _that_ now?" when observed at a low level). For example, with ZGC, a mutator thread may find itself moving an object when accessing it; this is surprising at a low level -- which may not be a good fit for a low-level language -- even though it ends up offering better performance predictability overall. On the other hand, a mutator thread doing work freeing memory with a refcounting GC when it drops a reference may offer worse latency and worse predictability overall, but it's not as surprising at a lower level, in the sense that it's easy to understand why it's doing that.
As for Swift, I think it picked a refcounting algorithm because it cares about memory footprint (most Swift applications run in memory-constrained environments) and because some unique language features allow to reduce the cost of refcounting sufficiently for a simple refcounting GC to be acceptable.
[1]: Again, not in principle (as both refcounting and tracing can converge to a similar behaviour) but as practiced by most refcounting GCs in industry use.
> The only languages that choose a refcounting GC are either those that care little about performance (Python) or those that don't rely heavily on GC (Rust).
Not true. The creators of Swift chose RC exactly for performance reasons: low memory footprint AND low latency.
//edit: ah, you actually admitted it at the end. So you’re contradicting yourself.
Refcounting's latency is worse, not better than modern tracing. Swift chose refcounting not because it performs as well as tracing, but because they managed to get it to perform acceptably for their needs (with special language features), and they really care about footprint.
If someday there's some advancement in refcounting that makes it better (or even as good but with significantly lower footprint), you'll see all languages that currently use tracing switch quickly. For the time being, though, tracing is winning.
Modern concurrent collectors like ZGC only pause threads to synchronize state transitions -- there is no collection work done in STW pauses (or any work that grows with the size of the working set or total heap) -- and not for a longer duration than the OS does for various things. Certainly the average impact on latency is lower for concurrent tracing collectors.
These days, tracing collectors offer better throughput, latency and predictability than refcounting, while refcounting offers better footprint. That's the main tradeoff, and I mentioned some secondary ones (re ease of implementation and low-level "surprises") before.
A tracing GC, by definition, traces. (Not to mention that tracing keeps the live objects alive.)
My important-to-me application has a large amount of "live forever" memory and, at any one time, a fairly small amount of memory used by short-lived objects. However, it allocates and then discards those objects very frequently.
Let's say that it has 100MB of "live-forever" and, at any one time, 1MB of transient objects. Moreover, the vast majority of the pointers in the 100MB are to other parts of "live-forever."
A GC-based approach will allocate Nx1MB of transient objects and then trace the pointers in that 100MB to free all but the currently in use 1MB of transient objects. The vast majority of the traces are a waste of time, but GC can't know that. Moreover, it will tromp through the address space, trashing the cache.
A ref-counting approach will have two RMWs for every pointer assignment and the objects modified will almost certainly be in the cache.
It's not clear that those RMWs will cost more than tracing that 100MB.
Note that the 1MB is probably in L3. If N is large enough to keep the number of traces down, the Nx1MB will cache-fault like crazy. (Remember that the 100MB is also subject to caching.) That 1MB won't use more pages than the Nx1MB.
> The vast majority of the traces are a waste of time, but GC can't know that
It can, it does, and you've basically described generational collection. Generational GCs try to only trace through recently allocated objects, and only bother with old ones if they can't free enough memory. Not wasting time on unnecessary traces is much of the point of generational GCs.
More modern GCs like OpenJDK's G1 go some steps further. They're region-based, and they keep track of inter-region pointers that tell them the likelihood that a region may benefit from compaction.
> A ref-counting approach will have two RMWs for every pointer assignment and the objects modified will almost certainly be in the cache.
A tracing GC does that without any bookkeeping. An object born and dead between young-gen collections will never even make its existence known to the GC (generational GCs usually only need to learn about an object's existence once it's promoted); it will never have been visited and traced. The old objects would not have been traced either (in that interim) if few enough young objects survive.
In fact, in such ideal situation where some fixed number of objects live forever and some large number of objects are continuously allocated and die, a generational GC will effectively need to do nothing: the old objects will have been promoted into the old generation, and the GC won't have to ever look at them again if there's always sufficient RAM, while the young objects will be allocated by a pointer bump, and when it reaches the end the GC will wake up and want to start tracing, but will find that all the roots still point into the old generation, so the young generation will simply disappear (i.e. the allocation pointer will be reset to the beginning). The GC in this case requires zero operations (well, except for some fixed number required to notice it has nothing to do).
> It's not clear that those RMWs will cost more than tracing that 100MB.
It's clear that they cost more than not tracing it, which is what a generational GC won't do.
> Note that the 1MB is probably in L3. If N is large enough to keep the number of traces down, the Nx1MB will cache-fault like crazy
That prompt reclamation has some benefits on locality is true, but so does compaction. The problem isn't so much cache faults (because we're talking about writes, which fill up the write buffer, not reads that cause stalls) but exhausting the memory buffer. This can and does happen with super-high allocation rates (which makes it an interesting problem), and indeed the memory bus's bandwidth is sometimes the limiting factor on throughput, but we're talking about allocation rates that far exceed those that common refcounting GCs can deal with anyway.
It's been well established by now that tracing GCs offer better performance than (crude) refcounting GCs, which is why they're the choice of every high-performance language that heavily relies on GC. The price you pay for modern tracing GCs is in memory footprint and in cost of implementation. There is, however, constant research in GC, and some advances in refcounting might make it more appealing, but we're talking about sophisticated implementations that are comparable in complexity to tracing.
Generational hypothesis fails pretty miserably for some very practical and common workloads - e.g. caching. The real world is way more complex than short lived vs forever. There are many objects with midterm lifetimes as well.
Hence, generational collection is in practice only a constant-factor improvement. That's why not even all state-of-the-art collectors use it; e.g. Golang doesn't.
Generations do not fundamentally fix the property of GC mentioned by the parent poster you’re replying to, because eventually the full heap will be scanned anyways, as it is virtually impossible to keep the promotion rate down to zero. It only delays that moment.
There is also a non-zero price to pay for maintaining generations. And also tuning generational GCs is much harder than non generational. It's not a clear win.
> because eventually the full heap will be scanned anyways, as it is virtually impossible to keep the promotion rate down to zero. It only delays that moment.
But the whole point of tracing collectors is that even then scanning the whole heap is proportional in cost to the size of the working set, and so the amortized cost goes to zero with increase in RAM. Scanning the working set is still cheaper than tracking all references constantly. In fact, throughput wise it's a winning algorithm -- a doubling of RAM leads to a halving of cost.
I'm not claiming that tracing collectors always lead to optimal memory management, but the ones that are in common use are, in practice, much better than the refcounting collectors in common use except for footprint.
> And also tuning generational GCs is much harder than non generational.
There's no more tuning on the collectors of the last couple of years (I'm talking about ZGC and generational ZGC).
In languages like C++ and Rust the majority of objects have trivial lifetimes and don't need reference counting at all. It is less than 0.1% of heap objects that need to be tracked by Rc/Arc, and even for them the reference updates are usually rare - like only at the moment of creating a new thread / task / request etc.
Hence, even if per object the cost of RC may be higher than tracing, the fact that there are 1000x fewer objects to manage makes RC a huge win. I guess Swift is also not that stupid and doesn't update the references every time a temporary is created, etc.
> In languages like C++ and Rust the majority of objects have trivial lifetimes and don't need reference counting at all.
Sure, that's why they can get away with an inefficient GC.
> Hence, even if per object the cost of RC may be higher than tracing, the fact that there are 1000x fewer objects to manage makes RC a huge win.
The "win" here is not RC at all but the fact that there's little reliance on GC and, of course, it's not really a "win" but something you buy by giving up on something else (i.e. a tradeoff), in this case abstraction ability (memory management details cannot be hidden as implementation details), and giving that up increases maintenance costs. So to get the footprint savings you have to pay -- either with performance, if you rely heavily on a refcounting GC, or with maintenance cost, if you don't. To get the performance you can choose to pay for RAM or for more development hours, but either way -- you have to pay.
Tracing GCs manage not to do any work for the vast majority of objects, those with trivial lifetimes -- they, too, don't manage the vast majority of objects (the allocator just bumps a pointer and the GC is never even aware of the existence of most of objects) -- by not freeing dead objects promptly which comes at the cost of spending more RAM.
In other words: good footprint, good performance, good abstraction (i.e. lower maintenance costs) -- pick two and pay by giving up on the third. C++/Zig/Rust pick the first two, Java/C#/Go pick the last two, and I guess Python picks the first and last.
There are, of course, other nuances; e.g. most refcounting GCs in common use leak memory as they are too crude to handle cycles, and not all tracing GCs are state-of-the-art.
That "pause-free" GC engine performs worse than modern tracing collectors, so by using C++ you've already chosen lower footprint, and then you have to choose between worse performance or more work. I think many people are unaware of the advancements in tracing GCs over the past few years, but modern concurrent collectors don't do any collection in stop-the-world pauses.
Putting aside the fact that the compilation time is not included in the C++ result but is included in the Java result, that the Java implementation allocates many more objects than the C++ implementation, that the GC used is old, and that there seems to be no interesting GC activity (shown by the fact that the results are similar with a 50MB heap, not to mention that in the C++ implementation there is only a small number of objects, all of the same size, and the code is single-threaded), I have to wonder: you see the result of a self-proclaimed "completely unscientific" benchmark and conclude that a 100-line GC using a decades-old algorithm beats decades of research and five generations of 100KLOC GCs done by the top garbage collection experts in the world, and think, "this must be right"?
All the people working on Java at Sun/Oracle, on Go at Google and at C# at Microsoft (who, BTW, have written most of these languages' runtimes in C++) spent years and years designing elaborate and sophisticated algorithms that are worse than a crude implementation of a 50-year-old GC algorithm because we want to spend tens of millions of dollars and decades of work to give our users worse results? In your mind, what was it that possessed everyone designing high-performance language runtimes for non-memory-constrained environments over the past three decades, at unrelated projects that compete with one another, to all choose variants of a worse algorithm despite all of those variants being several orders of magnitude more costly to develop than what you believe is the better algorithm?
When someone comes up with a more performant GC algorithm than tracing -- we'll all switch to it. Who knows? It may be some sophisticated form of refocounting, but it ain't simple refcounting.
Java needs to generate a lot of garbage, whereas C++ does not. Therefore, GC for C++ can be simpler and generate lower overhead than GC for Java.
I'm not interested in the amount of money companies invest in developing garbage collection systems, I'm interested in code performance. Are you aware of any other absolutely zero pause GCs?
> Java needs to generate a lot of garbage, whereas C++ does not. Therefore, GC for C++ can be simpler and generate lower overhead than GC for Java.
You mean lower footprint overhead; higher performance overhead. But that was my point: Higher-level, high-performance languages that rely on good abstraction -- like Java, C# and Go -- need a better-performing GC, which is why they choose tracing, and pay for it with more footprint. In terms of performance, tracing used in industry beats refcounting as used in industry.
> I'm interested in code performance
Sure, but you have to pay for: either with a higher footprint or with higher maintenance costs.
> Are you aware of any other absolutely zero pause GCs?
The ones I know of are ZGC and Azul's C4, which preceded ZGC by some years but was not open-sourced. ZGC isn't some obscure technology, BTW. It is currently running most of Netflix's workload, and probably other very large companies.
(Also, ZGC, does no collection work in pauses; it does pause threads to synchronise a state transition but for durations similar to those that the OS also pauses threads for).
As their goal was a language that would be very popular, and given that languages that heavily rely on GC have many times more users than languages that don't, it doesn't seem that they were wrong, at least not about the centrality of GC (although, "value types" and arrays-of-structs are coming to Java).
> SGCL never pause threads, so it can collect garbage more often and more efficiently.
I can't examine the algorithm right now and I couldn't find a paper describing it (so I can't tell you if it's one of the algorithms we've tried), but if you believe you have a more efficient GC than those in the JDK, feel free to plug it in and test. If you're right, you'll have many millions of users and the overall impact of that algorithm on the software industry would be much greater than it would as a C++ GC.
> Scanning the working set is still cheaper than tracking all references constantly.
That is why you need to be careful not to needlessly update your reference count for temporaries. Managing memory is not free. Generational garbage collection avoids a lot of thought, but it demands more memory.
Yes. As I said repeatedly, the one thing that tracing sacrifices is memory footprint. It effectively turns RAM into performance. If you want a smaller footprint you either have to sacrifice performance or pay with more development effort.
What happens if a large std::unordered_map<std::string, std::string> has its destructor called?
The maximum number of references is a red herring. While having a RC capped at 1 allows you to elide the actual reference count and makes pointer assignment cheaper, it does not materially affect the primary source of latency in a reference counting implementation, namely cascading deletions.
1. Such data structures (or more generally, std::unordered_map<std::string, std::vector<std::string>> or something like that) are the natural way to represent e.g. dictionaries. So, you absolutely often need to do that and "don't do that" doesn't help here.
2. This general issue extends to virtually all collections. The idea that you should avoid large collections is not a practical solution to real world problems.
3. An alternative solution would be lazy destruction, but that comes with its own issues, such as a really bad worst-case memory overhead or making RC sufficiently more complex that it's not really a win over tracing GC anymore [1].
Still no, because STW GC can pause innocent code in arbitrary unexpected moments.
If you’re pausing the thread because the thread is doing some work eg calling into system to do I/O, this is not considered a pause (even though technically it is, as the thread may be simply paused and waiting for data) but rather just the cost of the operation. If the thread is being interrupted and paused because some unrelated operation has to run - now that’s bad and considered a true pause.
> Still no, because STW GC can pause innocent code in arbitrary unexpected moments.
Well, yes, that's the problem, isn't it? And that's the point I was making. Pauses in the main thread absolutely count for latency purposes. Note that in most stop-the-world GCs you also can have pretty good control over when the GC is invoked, and that doesn't make things better.
The idea that you can always predict when cascading deletes happen in RAII code is also misleading. They can easily be hidden behind an abstraction barrier. Do you exactly know what's happening under the hood in all third-party libraries that you use, just for an example?
> If you’re pausing the thread because the thread is doing some work eg calling into system to do I/O, this is not considered a pause
In real-time scenarios (soft or hard), it absolutely is. "Regular" operations are absolutely part of your latency budget, and if "regular" operations can exceed it, you absolutely have a problem. See e.g. pretty much any of Gil Tene's talks on the subject.
The kernel (as long as it's not realtime) also pauses innocent code in arbitrary, unexpected moments, and modern concurrent collectors like ZGC pause such "innocent" code for no longer than the OS does. These synchronisation pauses are of roughly constant duration (regardless of working set/heap size) and overall introduce unexpected latency to a lesser degree than the freeing work done by refcounting collections.
There is a lot in that statement and it deserves challenge. Note two things: there is no universal right answer here (sometimes you are correct, sometimes you are wrong), and my responses will be contradictory (and also contradict things I didn't think to write); each project will need to figure out which of the combinatorial points is important to it in order to come up with the right answer. Not doing this is premature pessimization - getting your data structures wrong up front will force slow runtime on you in a way that no micro-optimization (which is what premature optimization really refers to) can ever fix, and getting out will cost a very expensive rewrite.
> 1. Such data structures (or more generally, std::vector<std::string, std::vector<std::string>> or something like that) are the natural way to represent e.g. dictionaries.
They are natural, and easy but that doesn't mean they are the right way.
Often I find with a little thinking that there is an enum key under it all that I can hard code - and the enum key is also type safe so that the compiler can prove my code is correct against some class of bugs in a way a string as key cannot.
Why are your keys and values std::string, which (barring small string optimization) will allocate more? Often you can place a maximum length on one or both that is small enough that you can replace std::string with a struct containing a char array (or std::string_view - I still have to support C++14 so I haven't been able to try it) and avoid a lot of allocations.
Failing that, often the correct data structure is a database (I reach for sqlite first, but there are plenty of options with pros and cons), which gives fast lookup, allows for more complex structures easily, and persists the data to disk so that next startup I don't have to spend all the time reconstructing all that data (realistically, the performance of creating that large data dictionary is of far greater concern than getting rid of it).
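A small sketch of the enum-key idea (hypothetical names): when the key set is known at compile time, a fixed-size array indexed by the enum replaces a string-keyed map and allocates nothing per lookup.

    #include <array>
    #include <string>

    enum class Setting { Hostname, Port, LogLevel, Count };

    // One contiguous block instead of a hash map with heap-allocated string keys.
    using Settings = std::array<std::string, static_cast<std::size_t>(Setting::Count)>;

    inline std::string& get(Settings& s, Setting k) {
        return s[static_cast<std::size_t>(k)];
    }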
> 2. This general issue extends to virtually all collections. The idea that you should avoid large collections is not a practical solution to real world problems.
This article is about systems programming. In systems programming you rarely have such a large dictionary in that format. You often will in something like the filesystem, but there the point is to have the data persisted from disk and typically you only read the subset you care about. You may have a copy in cache (and cache invalidation becomes an issue), but the in memory portion of data itself is never in a dictionary of that format.
> 3. An alternative solution would be lazy destruction, but that comes with its own issues, such as a really bad worst-case memory overhead or making RC sufficiently more complex that it's not really a win over tracing GC anymore [1].
That is one option, there are more.
Intentionally leak that whole dictionary. There is no reason to care whether all the destructors run for any of the examples you presented, and realistically, if you are getting rid of the whole thing you are probably in shutdown anyway, so who cares if you leak memory? Many systems programs and libraries do this.
Move the data to a different thread that only handles reclaiming memory and then don't worry about it. (in discussion we talk about making that thread idle time priority so it won't affect performance - but in practice all schedulers I'm aware of do this poorly in some way)
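A minimal sketch of that second option (my illustration; thread reuse and error handling omitted):

    #include <string>
    #include <thread>
    #include <unordered_map>

    using Dict = std::unordered_map<std::string, std::string>;

    // The lambda takes ownership by move; all the node-by-node frees run on the
    // helper thread when the captured map is destroyed, not on the caller's thread.
    void discard_off_thread(Dict&& d) {
        std::thread([m = std::move(d)] { (void)m; }).detach();
    }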
Finally, if after reading all that and considering all your options (including things I didn't think of!) you still conclude a large dictionary of strings to strings is your correct answer, then you probably should demand a good tracing garbage collector. I'm not against using them where they are appropriate - I'm just arguing that if you think about your data a little more you will discover they are not needed nearly as much as it seems - and the time spent thinking is usually worth it because it results in much faster programs anyway (even if there is one large dictionary in the code, how many others did you get rid of?).
> They are natural, and easy but that doesn't mean they are the right way.
> Often I find with a little thinking that there is an enum key under it all that I can hard code - and the enum key is also type safe so that the compiler can prove my code is correct against some class of bugs in a way a string as key cannot.
There's a fundamental misunderstanding here, I think. By dictionaries I mean language dictionaries, e.g. English -> French. You won't find an underlying "enum key" here. Nor am I sure how an enum would ever get large.
> This article is about systems programming. In systems programming you rarely have such a large dictionary in that format.
First, the article is, but the subthread that started this discussion was more general than that.
Second, even in systems programming, you will commonly have arrays or other collections. Caches are a very common thing in systems programming, after all.
> That is one option, there are more.
It is trivially true that there are alternative options, but all of these are workarounds for the problem, i.e. you have to complicate your design or make silent assumptions that may not hold in the future (e.g. when disposing of memory at process exit no longer works because you now need the code as a library).
> By dictionaries I mean language dictionaries, e.g. English -> French. You won't find an underlying "enum key"
That is a niche case, not a common one though. There are a lot of niches, each with different requirements.
>It is trivially true that there are alternative options, but all these are workarounds around the problem, i.e. you have to complicate your design
This attitude of non-systems programmers is why people argue garbage collection is slow. Garbage collection can be faster, but people who work in garbage-collected languages think that solves all problems, and so they don't build efficient data structures. Sure, they are not leaking memory, but garbage collection is not enough more efficient than reference counting to make up for thousands of destructions when the more complex reference-counted version only had a couple. Yes the code is more complex, but it is also faster and that is a trade off I'll take.
It was an example meant as an illustration. The general case of "collection of objects" is hardly niche.
> This attitude of non-systems programmers is why people argue garbage collection is slow.
I started programming writing Z80 assembler in the 1980s, counting T-states in order to make hard realtime code work. I wrote a graphics driver for Atari ST/TT hardware not long after. I think I have a pretty good idea what working in a real-time and/or resource-constrained environment means.
> This attitude of non-systems programmers is why people argue garbage collection is slow.
That is an incorrect generalization. In fact, I see plenty of inefficient code in C++ and Rust (e.g. because a lot of the workarounds for not having GC require additional copying).
> Sure, they are not leaking memory, but garbage collection is not enough more efficient than reference counting to make up for thousands of destructions when the more complex reference-counted version only had a couple.
This is a really unclear statement. If you're trying to make a connection between the absence of (tracing) GC and having value types, they are not inherently related. You can have tracing GC and value types (e.g. C#) or reference counting and a lack of value types (e.g. Python).
What is true is that in general memory allocation is relatively expensive, so you want to avoid unnecessarily allocating objects, but that's true regardless of whether you use malloc()/free() or a garbage collector and the strategies for dealing with that are the same in both cases.
> Yes the code is more complex, but it is also faster and that is a trade off I'll take.
C# has structs (with generics, explicit layout, object references, fixed buffers, etc.), byrefs -- special pointers that can point to managed memory, the stack, or unmanaged memory, and that are still tracked by the GC despite allowing pointer arithmetic on them -- and regular C pointers, yet it successfully uses a GC.
Much damage has been dealt to non-JVM languages by stereotypes and perception that the way JVM languages work is the way all GC languages work.
That isn't a choice though. The high performance garbage collector is implemented by the language implementers and comes for free, while the delayed cascading deletion has to be done by the individual programmer not the language designers.
This sounds intuitively true. So...what if GC languages could introduce an optional annotation for an object to say that only one reference to it can exist, and use that as a hint to the GC?
I don't see how this could be implemented in a safe and performant way - either you check for existing references at runtime, or you risk some use-after-free bug. But perhaps your project is already in a GC language and you're happy with that, but just want to optimise GC for this critical component. And we already have the concept of "unsafe" code blocks in many languages.
Does anything like this exist? I Googled "smart pointers in Java" but just got a bunch of mid-quality answers where people explained that I'm stupid for even asking this and they're unnecessary because Java manages its own memory. But hasn't someone smarter thought of this?
> This sounds intuitively true. So...what if GC languages could introduce an optional annotation for an object to say that only one reference to it can exist, and use that as a hint to the GC
Tracing GCs fundamentally look at what is still reachable, while ref counting tracks deadness. They are actually the “yin-and-yang” of each other, with different tradeoffs, tracing being more practical for current hardware and usage patterns.
There is no practical difference in whether a tracing GC can reach an object from multiple places, or just a single one - but there might be some other, analogous feature in tracing corresponding to a 0-1-only ref count.
"Can only be one" is easy on current hardware because that means it lives on the stack and in turn is destructed at the end of the current scope using the standard machine code to change the stack size. The only part that is hard is if the system supports exceptions: now the compiler needs to somehow jump back into each scope to run those destructors - I can think of several ways to do this, but none are hardware supported (exceptions are not hardware supported - at least not on any architecture I know of - so it is up to the compiler to deal with this either way - it ends up looking like a catch-and-rethrow block).
> There is no practical difference in whether a tracing GC can reach an object from multiple places, or just a single one
The real gain would be deterministically running finalizers in this case. Many languages have looked at C++'s RAII and realized they need a deterministic way to do things like close file handles. As such they already have all the parts in place.
It doesn't gain anything else though, as the tracer still needs to go through all the pointers in the 0-1 reference-counted object, since they might point to something that is sometimes shared but currently is not (and thus the tracer wouldn't reach that memory otherwise - if the memory is never shared it could also be marked 0-1).
I'm not convinced it is a good idea though. Manual memory management is hard. I write C++ every day; std::unique_ptr makes my life much easier, but it has limits (shared_ptr is, in my experience, a sign you don't understand your memory, and so I will eventually have to figure out some rare bug where you stored a reference to the underlying thing). If you always have garbage collection it is easy, as you don't have to worry (though you can run into problems: either over-allocating when a reference would work, or storing a reference in some data structure after the data isn't needed). If you never have garbage collection you know you need to think about object lifetimes. The 0-1 makes it too easy to have a situation where you didn't think enough and now the 0-1 memory is cleaned up even though it is still reachable.
> That's why high-performance languages that rely heavily on a GC -- Java, C#, Go -- prefer tracing.
And yet, despite 30+ years of compiler and GC research, they are still in the second performance league. All languages in the first league (C, C++, Rust, Zig) use manual management and occasional reference counting.
In theory, refcounting and tracing can behave similarly [1], but assuming we're speaking about their implementations in the industry (rather elaborate tracing GCs; rather crude refcounting GCs) then a refcounting GC would do some work as soon the refcount for a particular object drops to zero. When exactly that happens and how much work there is to do (and sometimes, what the fragmentation effect on heap is) are local properties that are not very predictable. In contrast, the amount of work a tracing GC needs to do when compacting is directly proportional to the program's working set and the frequency in which it needs to do that work is proportional to the allocation rate -- both of which are fairly regular and predictable global properties for a given program. For stop-the-world GCs there was then the matter of the highly unpredictable exact timing of a large STW pause, but modern concurrent GCs don't collect anything in STW pauses anymore. They only have very short (sub 1ms) and constant-time pauses and no surprise throttling as long as the allocation rate is within an acceptable range. So all in all, you pay a fairly fixed tax on the CPU (and if you want to pay less, just add more RAM) and virtually no impact on latency.
Thanks for responding. My experience with tracing GC at scale is exclusively in the realm of .Net, and RC exclusively with C++ smart pointers. That matches your “sophisticated vs crude” contrast.
The experience with .Net is that GC impact was difficult to profile and correct, and also “lumpy”, although that may have been before GC tuning. GC would dominate performance profiling in heavy async code, but these days can be corrected by value tasks and other zero alloc methods.
For C++ style ref counting, the impact was a continuous % load and simple to profile (and therefore improve). Although here, the ref counting needed to be stripped from the hot paths.
The biggest issue I’ve hit between the two modes though, is how they behave when hitting memory limits. Tracing GC appears to have an order of magnitude perf hit when memory becomes scarce, while ref counting does not suffer in this way. This is enough for me to personally dislike tracing GC, as that failure state is particularly problematic.
Same experience doing database system programming in Java. The number of performance issues caused by GC and the amount of code complication introduced in order to avoid them is mind-blowing. We’re managing memory manually in critical parts. But Java ergonomics around manual memory management (aka off-heap) is just terrible. There are moments I think I’d be more productive in pure C because of that. Not C++, not Rust, but pure C.
When you hit memory limits, .NETs GC implementation would perform much more frequent, invasive and aggressive collections, including LOH compaction to reduce memory watermark which leads to greater GC pauses, though this is rarely seen in such a bad way on modern versions with e.g. SRV GC.
The most scaling way to address this is usually to just allocate less and use valuetasks with pooling where applicable (frequent asynchronous yields), I'm certain if you built a .NET 8 based solution you would see user-written code dominate heap allocations profile, as most hot internal paths of async utilize said state machine box pooling+ValueTask<T>[0] and may be entirely allocation-free.
> When you hit memory limits, .NETs GC implementation would perform much more frequent, invasive and aggressive collections, including LOH compaction to reduce memory watermark which leads to greater GC pauses, though this is rarely seen in such a bad way on modern versions with e.g. SRV GC.
The trouble with server GC mode is that then there is no natural back pressure. If the processing is not CPU bound, then memory allocation can grow unbounded. This is not something that happens with RC as, again, the GC performance hit is inlined with task processing. The service may not be capable of as much throughput, but it doesn’t take out the entire server either.
> The most scaling way to address this is usually to just allocate less and use valuetasks with pooling where applicable (frequent asynchronous yields), I'm certain if you built a .NET 8 based solution you would see user-written code dominate heap allocations profile, as most hot internal paths of async utilize said state machine box pooling+ValueTask<T>[0] and may be entirely allocation-free.
Absolutely; I think it’s relatively simple to write servers that scale using modern .net; the memory allocation foot-guns when dealing with asynchronous code are now well understood, and tooling is good. I am compressing ~15 years of experiences in that previous post.
It’s probably the case that a tracing GC is the better choice for most modern applications, excepting memory constrained devices (like phones), and so long as you design with memory in mind.
You are correct, sustained load heap size of SRV GC has been a known pain point that had been particularly exacerbated after beefy Windows Server hosts fell out of fashion and got replaced by 512Mi Linux containers.
There has been work conducted on this each release throughout Core 2.1, 3.1, and then 5, 6, 7 and 8 versions to make it play nicer with more constrained memory limit systems.
The two major features that address this are Regions[0] (.NET 6/7) and DATAS[1] (.NET 8). The former is enabled by default everywhere except macOS, and the latter is available opt-in either via the env var DOTNET_GCDynamicAdaptationMode=1 or the msbuild property GarbageCollectionAdaptationMode: 1 (see more in [1]).
The latter has shown to significantly reduce sustained (or, especially, idling) heap size for some particularly problematic workloads (but not all, sometimes you just have a lot of live objects). I definitely recommend giving it a try if this is something still relevant to you.
TLDR of what DATAS does is dynamic heap count scaling and much smarter heap up/downsizing depending on allocation rate/frequency and anticipated throughput impact of adjusting those.
That goes for manual memory management -- and certainly languages with a reference-counting GC, like Rust -- as well. The main difference is by far footprint overhead.
> and certainly languages with a reference-counting GC, like Rust
It's a mistake to say that the Rust language has reference counting. There's a pair of reference-counting wrapper types in its standard library (Rc and Arc), or you can roll your own, but there's no special support for these types in the language, and their use is optional. Most of the time, you won't be using these reference-counting types. Box (the generic heap-allocated or "boxed" type) doesn't use reference counting. String (and its several specialized variants) doesn't use it. Vec (the generic heap-allocated array type) doesn't use it. HashMap, HashSet, BTreeMap, BTreeSet, none of them use reference counting. And so on. You can write a lot of Rust code without using reference counting even once.
What the Rust language has is just C++-style RAII: when a value goes out of scope, if it implements Drop, its Drop::drop is called.
> It's a mistake to say that the Rust language has reference counting.
Having these types in the standard library is the language having those types.
Perhaps it's not integrated to the level that a language like Swift is. However, I think it's reasonable to say the language supports Rc when the standard library supports it. I'd say the same thing about C++ with `shared_ptr`.
Otherwise you end up in weird pedantic notions about what a language has or does not have. Does C have a heap? Well, technically no since malloc and free are just function calls in the standard library and you can write valid C programs without calling those functions.
> Having these types in the standard library is the language having those types.
It depends on whether you consider the standard library an indivisible part of the language or not. For Rust, it's clearly not the case, since you have the #![no_std] mode in which only a subset of the standard library is available, and this subset does not include these reference counted wrapper types (or any heap-allocated type at all).
> Perhaps it's not integrated to the level that a language like Swift is. However, I think it's reasonable to say the language supports Rc when the standard library supports it. I'd say the same thing about C++ with `shared_ptr`.
It's one thing to say a language "supports reference counting", which only means you can use reference counting with it, and another thing to say "[...] languages with a reference-counting GC", which implies that the language uses a GC for everything, and that GC is a reference-counting GC.
> Does C have a heap? Well, technically no since malloc and free are just function calls in the standard library and you can write valid C programs without calling those functions.
It's actually the same thing: C can run on either a "hosted" environment or a "freestanding" environment, and on the latter, most of the standard library is not available, including malloc and free. So C does not necessarily have a heap when running on a freestanding environment.
It's not part of std exactly, it's part of alloc. It's re-exported by std.
It would still be available in a #![no_std] environment using `extern crate alloc`.
This crate generally abstracts over the concept of allocation too, so relying on it doesn't require you to also have an allocator - it just requires that someone, at some point, specify one with #[global_allocator].
It is not, if you have objects with dynamic lifetimes, and allocating them for the whole duration of the program is not an option.
Sure, their use can be much less than in a managed language that can only do automatic memory management, but RC is objectively worse from most perspectives than tracing GC, except for the fact that it doesn't need runtime support and has a slightly lower memory overhead.
It goes even a step further. You can write a lot of Rust code with exactly zero heap allocations in the critical paths. While at the same time we’re struggling to get our Java code to stay below a 1 GB/s heap allocation rate. Rust refcounting could be 100x slower than Java GC and it would still win.
I think the important thing to understand is that reference counting isn't any better (and often worse) than "regular" garbage collection.
The point of manual memory management is to come up with problem-specific strategies to avoid or at least reduce dynamic memory allocation, not to insert manual release/free calls for individual objects ;)
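A trivial example of that kind of strategy - reusing one buffer instead of allocating a fresh one per iteration (names are made up):

    #include <string>
    #include <vector>

    void process_lines(const std::vector<std::string>& lines) {
        std::string scratch;
        scratch.reserve(4096);        // allocate once, up front
        for (const auto& line : lines) {
            scratch.assign(line);     // reuses the existing capacity (unless a line is huge)
            // ... transform scratch in place, write it out, etc. ...
        }
    }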
Reference counting is regular garbage collection. The two broad classes of GC algorithms are tracing and refcounting, and while they can converge to similar behaviour, usually the former optimises for throughput while the latter optimises for memory footprint; latency is similar these days.
> ...while I agree, for many C++ and Rust coders statements like this are pure heresy ;)
It's a matter of definitions. For many people, "garbage collection" refers only to tracing GC, and reference counting is a separate category. In my experience, that's the common usage; insisting that "reference counting is formally (in some paper from the last century) also defined as a form of GC" will not magically change the opinions "many C++ and Rust coders" have about tracing GC. In fact, I'd say that insisting on this nomenclature point only weakens the whole argument; tracing GC should stand on its own merits, and not depend on some nomenclature equivalence to be accepted (if quibbling about nomenclature is your strongest argument, your arguments are weak).
There's no need to "change opinions". People who work on GCs know that reference counting and tracing are the two general GC strategies. The only people who don't think of refcounting as a GC are people who simply aren't familiar with GCs and how they work. If they also think refcounting has lower latencies (let alone higher throughput) than tracing, then they're also just wrong. No one needs to "insist" on the GC nomenclature. You're either familiar with it or you're not (and since most people are not, they commonly make mistakes on the subject). Also, given that tracing GCs are used by ~90% the market, they hardly require justification anymore; they've won by a large margin over the application space (which constitutes most of software). However, it's nice to occasionally educate those unfamiliar with the subject on GC algorithms and nomenclature.
I have to wonder whether some of this is semantic drift over time or context. My recollection since undergrad (a few decades ago) involves treating “garbage collection” as referring to tracing garbage collection, and “reference counting” as a separate mechanism. There is still a term for the category including both, only that term is not “garbage collection” but “automatic memory management”. But what I see nowadays is closer to what you describe.
If anything, the drift is towards GC being only tracing as it is so dominant in the languages that are normally considered having GC. But before C++ (via boost) introduced shared pointers, and Swift ARC, I'd expect the separation to basically not exist.
I agree; I meant “including” in the non-restrictive sense, not “including only”. Stack allocation is a special case where the lifetimes are arranged in a convenient way—see also escape analysis in languages where stack allocation isn't explicitly supported at the language level but can be added by the compiler.
> Also, given that tracing GCs are used by ~90% the market, they hardly require justification anymore; they've won by a large margin over the application space (which constitutes most of software).
Tracing GCs have clearly proven themselves and are everywhere (JVM, CLR, Go, Dart, OCaml, etc.) but we can't ignore that the Apple ecosystem (Swift) is using ARC. That's a significant share of the "market". Python and Ruby also use reference counting, but I don't think anyone is considering them state-of-the-art GC.
Except that obligate ARC ala Swift has even lower throughput than obligate tracing GC. It's the worst possible choice unless you really care about low-latency and deterministic freeing of resources (and even then, using RAII for common tree-like allocation patterns like Rust does will perform better).
You're right, I should have said "languages where GC is the primary means of managing heap memory are used by 90% of the market" rather than focused on a specific algorithm.
Ruby does not use reference counting. It has a mark-and-sweep generational GC that is incremental and uses compaction. I doubt it would be state of the art, but it is not too bad nowadays.
> Tools to automate the “actually freeing the memory” part, like lifetimes in Rust and RAII in C++, don’t solve these problems. They absolutely aid correctness, something else you should care deeply about, but they do nothing to simplify all this machinery.
Rust is more similar to C++, in that the compiler inserts calls to free as variables exit scope. Runtime reference counting is limited to those objects wrapped with Rc or Arc.
I agree with pron’s larger point. GC is fine for most applications. It’s just factually inaccurate to compare Rust’s memory management with languages like Python and PHP.
> (OpenJDK's ZGC has typical pause times measured in double/triple-digit microseconds, and the worst case rarely exceeds 1ms for a reasonable allocation rate -- the pauses are in the same ballpark as OS-induced ones)
We've been benchmarking ZGC and Shenandoah at work, and the p100 pause time is usually below 500us (micro-seconds). ZGC seems to be performing a bit better, as it seems to be performing less pauses than Shenandoah (hence doing more work/pause).
We still have to run tests-in-production, but so far it seems that GC pauses are largely a solved issue when using ZGC (and Generational ZGC since Java21).
FYI, ZGC doesn't perform any collection work in the stop-the-world pauses. They are only required to get all mutator threads to atomically observe the increment of the "GC epoch". All actual work, both marking and compaction, is done by GC threads running concurrently with the mutator threads or by the mutator threads themselves as they run. It is only when the allocation rate is very, very high that an allocating mutator thread will be paused ("throttled") for a significant amount of time to allow the GC to catch up by freeing up memory (and if you hit these cases, then you might be better off using a throughput-oriented collector).
Strangely in our workloads we have noticed generational ZGC latencies are better than G1 at ~99th-99.9th percentile or below but worse at percentiles above that. The allocation rate is moderate.
> We've been benchmarking ZGC and Shenandoah at work, and the p100 pause time is usually below 500us
> so far it seems that GC pauses are largely a solved issue when using ZGC
What? 500us is abysmally slow.
Most projects I work on have a latency budget of less than 10us, the average being 2us. That is the budget for the whole wire-in/wire-out for a packet. Even for less latency-sensitive workloads, 500us is a no-go for most networking applications.
We're talking about occasional hiccups, not an average-case response-latency overhead. You can't get worst-case latency of 2-10us with a non-realtime kernel. Even a page fault could take longer than that.
> You can't get worst-case latency of 2-10us with a non-realtime kernel. Even a page fault could take longer than that.
You obviously can, and this has nothing to do with the kernel being real-time or not.
There is no situation I can think of where a page fault should occur on a properly setup system running a production networking software, meaning no swap, huge TLB, and proper memory management.
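For what it's worth, one common recipe for that kind of setup looks roughly like this Linux-specific sketch (return-value checks elided):

    #include <cstddef>
    #include <cstdlib>
    #include <sys/mman.h>

    // Lock all current and future pages into RAM, then pre-touch the working
    // set at startup so the hot path never takes a page fault later.
    char* make_pinned_buffer(std::size_t bytes) {
        mlockall(MCL_CURRENT | MCL_FUTURE);              // never paged out to swap
        char* buf = static_cast<char*>(std::malloc(bytes));
        for (std::size_t i = 0; i < bytes; i += 4096)
            buf[i] = 0;                                  // fault every page in now, not on the hot path
        return buf;
    }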
If you can think of "no situation" where a server may incur a page fault, forced preemption, or need to perform any I/O to a service/database, then I hope you at least recognise that your world in no way represents the state of server software at large because none of these things is true for the vast majority of server software.
In a former life I worked on some safety-critical onboard avionics software for an ultrasonic platform, and 2us was around the upper-limit worst-case latency (i.e. you'll kill someone if you miss that deadline); still, it's not the kind of requirements the vast majority of software finds itself under.
When working over the internet, some of the very best services are at >10ms ping latency anyway, where a 500us hiccup is imperceptible.
> If you can think of "no situation" where a server may incur a page fault, forced preemption, or need to perform any I/O to a service/database, then I hope you at least recognise that your world in no way represents the state of server software at large
I won't deny that the majority of software out there is not latency sensitive, but the context of this article is specifically targeting the software that is _not_ using garbage collection, arguing that GC is undeservedly overlooked. OP even adds that GC is a "solved problem" because some GC implementation has a 500us worst-case latency.
My point is that the article author, and OP, are mistaken. Because if you are in the category of "I write server side software without GC" (e. g. C/C++), then 500us is horribly wrong.
Your point being that 500us is fine for most software out there is surely true, but not relevant, because if that is the case, you're probably not using C/C++, thus this article is not targeting you.
In _my world_ as you phrase it, traffic is unshaped. You need to be able to support line rate, otherwise packets are dropped, and hell breaks loose.
Your "properly set up system" is apparently doing nothing other than running your single job. The vast majority of real-world systems have to deal with antagonists.
All of the characteristics you mention are true on production systems used in large scale fleets...and yet "bumps" happen...because there's never one thing happening. It's all the things, and it's always changing.
I'm gonna guess you do finance. A six microsecond fiber oneway is a thing of beauty. There are certain technical luxuries associated with that domain, and rarely any requirement to exhibit high utilization rates...or deal with antagonists running on the same hardware.
Finance is one of such use cases, but there's a lot more, and that's the use case for people not using GC, thus why I find this article (and the comment saying 500us is a solved problem) pedantic.
I wrote code profilers for instance, which also need perfectly predictable latency. I worked on L2 and L3 networking applications (bridging, routing, forwarding) that need line rate support. People working on audio sampling, or codecs have the same constraints, etc.
There's a whole world of applications where 500us is ridiculously slow. The article takes the OS as example, but if my OS has 500us random latency spikes, I would be horrified.
> because if that is the case, you're probably not using C/C++
The point is that this claim is wrong more often than it should be, e.g. C/C++ is still used more than it should be, partly because these GC myths persist - hence the article.
I think at the end we're debating if the glass is half full or half empty.
I claim that people are not ignorant and if they use C/C++, they are aware of what a GC implies, and cannot afford it.
The article claims that people are somehow misled into thinking they _need_ C/C++ when a GC'ed language would be alright.
I don't think people are dumb. I think, given the choice, any sane person would pick Python or similar to write an application, and thinking that they only don't because they don't know better is pedantic.
It's a mistake to conclude that people make rational choices based on the premise that they aren't dumb. Counterexamples abound.
It's literally true that people who have never programmed in a language with GC don't know what it's like, but they absorb folklore about its difficulties or unsuitability from places like HN. Except such content is almost always from extremes that aren't applicable to most development, thus producing an unintentionally skewed picture.
Articles exactly like this one are needed for balance, and to encourage more experimentation with safer choices.
> special case where all memory can be easily handled in arenas
That seems to be an unfair bar to set. If _most_ objects are easily allocated by an arena, then that still removes most of the need for GC.
I like Jai's thesis that there are four types of memory allocation, from most common to least common:
1. Extremely short lived. Can be allocated on the function stack.
2. Short lived + well-defined lifetime (per frame/request). Can be allocated in a memory arena.
3. Long lived + well-defined owner. Can be managed by a subsystem-specific pool.
4. Long lived + unclear owner. Needs a dynamic memory management approach.
If you want to make the claim that tracing GCs surpass manual memory management in general, you should compare to a system written with this in mind, not one that calls malloc/free all over the place. I guess it might be more fair if you compare tracing GC with modern c++/rust practices.
I agree that for most systems, it's probably much more _practical_ to rely on tracing GC, but that's a very different statement.
I agree that all might be too high a bar, but most is too low, because even if most of your objects fall into categories 1-3, sufficiently many objects in category 4 would still make your life miserable. Furthermore, it's not like arenas take care of too much because they still require some careful thinking. E.g. Rust's lifetimes don't automatically ensure a correct use of arenas in all cases. A language like Zig is very friendly to arenas, but you still have to be careful about UAF.
Now, I know very little about Jai, but bear in mind that its author doesn't have much experience at all with servers or with concurrent software in general, which is 1. where objects with uncertain lifetimes are common enough to be a serious problem, and 2. a huge portion of software being written these days. Games are a domain where it's unusually common for nearly all objects to have very clear, "statically analyzable" lifetimes.
> 2. Short lived + well-defined lifetime (per frame/request). Can be allocated in a memory arena.
You now have an "arena management" problem. Figuring out the correct arena to allocate from can be a problem. (You want to put all objects that become unused at the same time into the same arena, but that's not the same as all temporary objects allocated between time t and t+delta.)
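The mechanism itself is tiny - a sketch of a per-frame bump arena (not production code); the hard part is exactly the grouping decision described above:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Everything placed in one arena must be safe to free at the same time.
    class FrameArena {
        std::vector<std::uint8_t> buf_;
        std::size_t used_ = 0;
    public:
        explicit FrameArena(std::size_t bytes) : buf_(bytes) {}
        // Allocation is just a pointer bump; align must be a power of two.
        void* alloc(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
            std::size_t p = (used_ + align - 1) & ~(align - 1);
            if (p + n > buf_.size()) return nullptr;  // arena exhausted
            used_ = p + n;
            return buf_.data() + p;
        }
        void reset() { used_ = 0; }  // "frees" the whole frame at once
    };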
> Except in the special case where all memory can be easily handled in arenas, good tracing GCs have long ago surpassed manual memory management in throughput
Where are the measurements comparing throughput of tracing GCs and manual memory management? I'm aware of how incredibly hard this area is to measure, but it's a shame that the state of mainstream discussion is "most people just assume GC implies slow, but then again a handful of people say it's not."
Given that the no-GC-by-default market is ~10% of the global software market [1] with no signs of shift in either direction over the past couple of decades, which sounds about right to me (considering the percentage of programs that need to run in memory-constrained environment or must have precise control over memory), it seems that the number of those who may benefit significantly from a different choice is small and so it doesn't look like anyone wants or needs to be convinced of anything. "GC languages" already command ~90% of the market and have little to gain from such a small market of potential converts, and the others aren't trying or succeeding in increasing their market share, so who cares given the small stakes?
> good tracing GCs have long ago surpassed manual memory management in throughput, and more recently their latency impact is more than acceptable for the vast majority of applications
There is a very subtle problem you seem to be missing: the GCs with high throughput are not the same ones as those with low latency. So while technically it is possible that tracing could beat manual management on one metric - throughput, latency or memory overhead - it doesn't do so on all of them together.
The article motivates RCU and then does a u-turn and starts making a general argument for general-purpose GC. Not quite a trojan horse but a bit whiplash provoking.
I definitely wouldn't call the RCU thing a GC, since at no point is the object garbage. It is in one of 3 states, and changes between them as quickly as possible:
* active
* obsolete but alive for old readers
* deallocated
Note that depending on how you write your code, it may be possible to reuse an "obsolete-but-alive" object for a "new" allocation safely, though I haven't analyzed performance for this in full.
As usual for GC discussions, it is very handwavy about when you have to "fall back to `shared_ptr/Arc`", when in fact avoiding refcounts (either because you can prove you already have ownership (which does have implications for tail calls, but you shouldn't use those anyway), or by avoiding indirection entirely) is the whole point of serious refcount-based systems. Doing nothing at all is obviously better than GC's "do something, sometime".
RCU objects are garbage when they're held by old readers that will no longer use them in the remaining code path before they exit. This code path is virtually always of non-zero length, because ensuring they are deallocated in the absolute shortest path is an NP-class problem for sure. I don't see why the fact that this code path might sometimes be longer with a GC would preclude RCU from being a GC.
For the kind of software I write there are two cases: (1) the hot path for which I will always have custom allocators and avoid allocations and (2) everything else.
For (1) GC or not it doesn’t make a difference, I’ll opt-out. For (2) GC is really convenient and correct.
Agreed - I come from a Java/C++ shop where we tried to tackle this dichotomy with interop, but it ended up causing more problems than it solved. A lot of the work that Java has done with modern garbage collectors is impressive, but even they admit (indirectly, via Valhalla) that no/low-alloc code has its place.
This skirts around the edge of an observation I want to dive into, which is that modern user OSes (anything which isn't a specialized RTOS) have built-in garbage collection. We just don't call it that: we call it memory management. What do we call languages with a built-in GC? Memory-managed languages!
You see this in a lot of older "top-to-bottom" C programs: they allocate, they clean up system resources (using longjmp to one label), but they don't bother with free. When the program exits the OS gets all that memory back anyway, so why bother?
There's a missed opportunity here, to have an OS with a garbage collector with less isolation from programs, one which handles resources more like a language's runtime GC. It will probably stay missed, because in the typical GCed language, the GC is intricately built into practically every line of the runtime, so it's not really practical to make a distribution of that language for one OS which hands that control over to the operating system.
But it's a pity, because there's a lot of room to improve some of the chronic problems we see from this artificial isolation of program-level memory management from OS level.
> You see this in a lot of older "top-to-bottom" C programs: they allocate, they clean up system resources (using longjmp to one label), but they don't bother with free. When the program exits the OS gets all that memory back anyway, so why bother?
You don't need to bother with releasing other types of resources either though (files, sockets, threads, ...), since the operating system will take care of that on process exit (unless you are on AmigaOS). The only reason to free memory is to recycle that memory in other allocations in long running applications without grabbing new memory from the OS. For one-shot command line tools it's usually not needed.
The OS only knows it can free memory when your process exits (same as file handles and other resources.) If your process is designed to exit once it's done its job, you can use the OS as a garbage collector.
Having an operating system know when memory is unused within your running program is not something that has ever existed though (except for perhaps some esoteric research OS's.) I wouldn't say we're missing an opportunity, because the thing we're missing doesn't exist in any meaningful sense. On the other hand, a programming methodology that uses very simple, short-lived programs is a totally legitimate way to do things... it's how CLI tools and the scripting languages that script them work, it's how web servers used to work (with CGI/etc), and it's a perfectly reasonable approach even today.
> I wouldn't say we're missing an opportunity, because the thing we're missing doesn't exist in any meaningful sense
That's a not-uncommon sort of missed opportunity! But I believe Inferno does this, the Dis VM which runs the entire (hosted) operating system has a GC which is reused by Limbo, the systems language. But it didn't catch on.
There's no technical reason for the division between a process GC and an OS GC, and it would be nice to see what could be done by eliminating it. There are good practical reasons it hasn't happened, but it's hard for me to imagine it making anything worse, and it's easy to see things which could improve, including safe(r) shared-memory between processes.
(1) the pivot from rcu to general purpose tracing gcs is bait-and-switch.
(2) Manual memory management is more than just malloc/free calls -- it's about layout (e.g. struct-of-arrays, inlining, implicit offsets, packing, etc)
For (2) Virgil has several features that allow you to lay out memory with various levels of control. I assume you mean "array of structs", and you can do that with arrays of tuples, which will naturally be flattened and normalized based on the target (i.e. will be array-of-structs on native targets). You can define byte-exact layouts[1] (mostly for interfacing with other software and parsing binary formats), unbox ADTs, and soon you can even control the exact encoding of ADTs.
Skimming [1], Virgil Layouts resemble ArrayBuffers in JavaScript -- a boxed view of some native span of memory. If I'm reading that right, it's basically an escape hatch from the GC for "mixed" environments -- the buffer itself is owned by a GC'd proxy, but it doesn't e.g. contain GC references internally.
That's useful for lots of low-level interfacing (e.g. communicating with serial interfaces or device drivers), but one couldn't build an "allocator" in a Layout for GC'd objects the way, e.g., in native code you can make a "bump allocator" for arbitrary structs which are "free"d with just a pointer reset (pretty important, e.g., in game engines, which is my field).
The last part is correct; layouts are not GC'd objects. They are views over byte arrays (or mmap'd memory, or unsafe regions like the execution stack). The first part is partially incorrect; layouts are not "boxed"--they are represented underneath as a pair of a (potentially null) GC object and a pointer-sized offset into that object. So with that representation you can have a reference to an off-heap layout as well.
Layouts are a little underpowered right now, owing mostly to the conservatism in that they could always be aliased by a byte array or any other alias, thus their bytes are chaos-bits that cannot be trusted. Adding the ability to have pointers between layouts is another level; for that, I am envisioning reads of pointers requiring a second capability which is the expected region into which the pointers lie.
But all of the layout stuff is fundamentally not about performance, it's about interfacing with external software and hardware. In Virgil the expectation is that you should use the good, type- and memory-safe constructs like classes, arrays, tuples, ADTs, etc. These are plenty efficient and are getting new annotations to control their representation in a more fine-grained way.
I think struct-of-arrays was quite deliberate wording. It stands out as a strategy that is at odds with object orientation and is sometimes more efficient.
I disagree with 2 being manual memory management. There is definitely a lack of control in contemporary managed languages (though it’s also far from perfect in low-level languages) for memory layout, but there are definitely ways to affect it.
I agree with your disagreement. As a case in point, Nim has packed bitfields but various choices in automatic memory management. As a concrete example, a spell-check custom-data store uses them here: https://github.com/c-blake/suggest/blob/04e313f8f8d3adf4cb55... (there the memory is in a memory mapped file and so that code has to manage the space itself.. so, maybe not the perfect example).
But I also agree there tends to be a correlation in PLang design in avoiding both low-level memory layout and in manual memory management. But it's just a tendency, not fundamental.
Not mentioned in this article is one thing that goes very well with GC is async/await
I am an async/await hater for personal, idiosyncratic style reasons that I will not bore you with.
I use it a lot in Type/Java script. Have done so in Dart. Works as it should
I have used it in Rust. IMO it is a disaster there. Shoehorning the sort of memory management required to use async/await with a multi-threaded runtime is a hellscape.
A point that seems to get lost in many of the pro-GC articles, this one included from what I can see, is that memory is just one kind of resource.
Correct code, especially in systems programming, will need to manage external resources as well, be it file handles, sockets or whatnot. GC only solves the application memory part, thus doesn't help at all for handling those external resources. In fact it can make it much more complicated; just look at what it takes to correctly implement a non-trivial IDisposable in .Net.
Other approaches like RAII or reference counting make it much easier, in my experience, to handle both memory and external resources in a unified way, thus making it easier to write correct code and reason about it.
That said, I'm not blatantly against GC's. It's a tool and it has some pros and cons, like everything else. The "manual GC" RCU approach mentioned in the article is interesting for certain tasks.
There is a big difference between memory and other resources, and that is that memory -- like processing -- is fundamental for any computation; not for nothing do most theoretical models of computation assume unbounded memory. Very few languages require manual allocation of processing -- even though that is done in some software like OS kernels and hard realtime applications -- and for similar reasons automatic memory management is very useful for abstracting computation, i.e. not leaking the details of a subroutine's memory use to its clients, just as this is rarely done for CPU use.
So while every non-trivial computation involves some non-constant amount of processing and memory, I/O is usually done at the edges of the system. Management of I/O is very important, of course, but it's not as central to the notion of computation (and therefore to abstracting computation) as processing and memory.
Yes, and the "Memory Safety" arguments apply to the other resources too. For example Rust eventually grew I/O safety, so your File Handles (in Unix, OwnedFd) or just straight up Handles (in Windows, OwnedHandle) are owned objects, not just integers like the number 4
At the surface this looks like it's just about avoiding dumb mistakes like using arithmetic on handles or mistakenly using reserved values as sentinels -- but the ownership model also means anything tricky we're doing with handles has explicit ownership, transparent to future maintainers.
So true, moving from C++ to mostly C# I'm liking memory management by GC but hating tracking file handles, sockets, etc. RAII is something I really appreciated
Use a Roslyn analyzer that errors when a using is forgotten for an IDisposable type, for example.
By the way, in modern .NET, using makes use of structural typing. Any class or struct with a Dispose() method can be used; there's no need to implement the interface, and it can also be retrofitted via extension methods.
This is strange advice. Why not just implement the IDisposable interface like normal C# code? Using extension methods for this is strange as well. Doing this even somewhat correctly would mean creating another class that implements IDisposable which can only work with public members of the original. In general best not to do weird stuff for no reason imo.
Using another class or struct adds to memory usage, and it turns out some C and C++ folks are itchy about such solutions, not to mention the added issue of passing wrapper classes/structs around.
If the hypothetical is: 1) you don't own the type, 2) it holds onto unmanaged resources, 3) the author didn't bother implementing IDisposable. Furthermore, 4) the unmanaged resources are publicly exposed. This seems like an extreme edge case but ok. As soon as you are done implementing it, please contact the author and ask them to implement IDisposable like every other class that holds onto unmanaged resources.
(SafeFileHandle is internally reference counted, but yes, worst case when you forget to .Dispose it, it waits in the finalizer queue for the GC to trigger the finalizer thread to process the items in it)
Yet, a point that tends to get lost when criticising GC languages, is that most of them have features for deterministic resource management, that many keep failing to learn.
- Some do have RAII
- Some do offer keywords
- Some do arena like management, lambdas with implicit management
- Some have a little help from the type system
- Some do a bit of everything listed above
Additionally, just like system developers have to rely on static analysers, the static analysers for those languages can also provide validation when something is forgotten, when the type system alone isn't enough.
My point though is that by moving memory management into "don't care" territory while still, obviously, requiring explicit handling of other resources, it's easier to forget or miss when you need to explicitly handle something.
When instantiating objects in C#, say, I need to check the documentation or the source code to see if a class implements IDisposable to know if I need to handle it. Let's say that for a given class X in library Y, it does not. So I don't worry; I just instantiate and don't do anything about the cleanup because the GC will handle it.
Later, the implementation of X changes and IDisposable is added to it. I now have to change my code, and not doing so could lead to serious production issues. Yet the compiler happily compiles my code without any warning.
Sure some static analyzers might catch it, but they're not perfect, and you need to run them. A stock Visual Studio 2022 will not complain about the above scenario for example.
In my experience it's much less error prone to unify memory and external resource management. If instead I had been using a different language which has a more uniform resource handling, the above change in class X would likely not have been a big deal.
Also, my code will already be written with resource handling in mind. It can be non-trivial having to change a hierarchy of classes just because a dependency deep down suddenly had IDisposable added to it.
I guess what I'm trying to say is that I think just focusing on memory management, that is, to GC or not to GC, is having a myopic view of things. I feel it's like arguing about what kind of pickle to have on your burger without considering the other ingredients. Sure, the pickle is a crucial ingredient, but there's a lot more to it.
Just like a stock C or C++ compiler won't complain about the endless possibility of getting things wrong.
Or to pick on IDisposable, you can repeat exactly everything you said regarding actually providing a destructor, properly implemented, taking into account class hierarchies, multiple inheritance, heap allocation, and being exception safe.
RAII is of course great but any language with a GC and proper exception handling can handle resources safely. Eg look at Java's try-with-resources statement which guarantees that the resource will be disposed safely if an exception is raised: https://docs.oracle.com/javase/tutorial/essential/exceptions...
You can build up quite resilient and resource-safe systems using these basic building blocks.
Having garbage collection (GC) in your toolkit doesn't mean you're limited to choosing between GC and deterministic destruction. You can integrate GC, stack allocation, and manual memory management within the same application. It's possible to leverage GC in a manner similar to how `shared_ptr` is used, providing both automated cleanup and precise control over object lifetime. For a practical example, consider SGCL: https://github.com/pebal/sgcl
First, I have no desire to handle both memory and external resources in a unified way, because memory management and resource management have different needs.
Memory is not just "one kind of resource", it's a very specific type of resource that if it has to be managed manually, inherently creates crosscutting concerns. And memory allocation is pervasive, often implicit in other language constructs. Garbage collectors get to cheat here, because they have a global view that ignores module boundaries and information hiding.
The classical example is that introducing a caching mechanism usually introduces API breaks. Where a function normally returns a pointer/reference/unique pointer and makes the caller responsible for freeing memory (whether through convention such as in C or enforced/automated through language mechanisms such as in Rust), the moment you cache it, you need to return a reference-counted pointer, because now the memory can only be freed if both the caller and the cache don't use it anymore. And that change from a non-reference-counted pointer to a reference-counted pointer is a breaking API change.
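A sketch of the kind of break being described (hypothetical names):

    #include <memory>
    #include <string>
    #include <unordered_map>

    // Before caching: the caller owns the result outright.
    std::unique_ptr<std::string> load_v1(const std::string& name) {
        return std::make_unique<std::string>("...loaded for " + name);
    }

    // After adding a cache, ownership must be shared between the caller and
    // the cache, so the signature -- and every caller -- has to change.
    class Loader {
        std::unordered_map<std::string, std::shared_ptr<const std::string>> cache_;
    public:
        std::shared_ptr<const std::string> load_v2(const std::string& name) {
            auto it = cache_.find(name);
            if (it != cache_.end()) return it->second;
            auto value = std::make_shared<const std::string>("...loaded for " + name);
            cache_[name] = value;
            return value;
        }
    };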
There are plenty more situations where manual memory management interacts poorly with modularity, such as filter() style functions, or the various complications that arise from closures capturing their local environment.
Conversely, it is absolutely possible to have pretty straightforward resource management with guaranteed and predictable lifetimes in a GCed language (though, alas, there's a lack of direct language support for that).
The general approach is as follows: Each resource's constructor takes an explicit or implicit owner argument (implicit being the current scope, whether defined through a language construct). You can also transfer a resource to a different owner (reparenting) [1].
Owners of resources can be lifetime managers such as scopes (but those do not need to correspond to lexical scopes and are more like transactions), that have more complex lifetime logic (such as a pool of resources) or objects that themselves are owned (e.g. if you have resources dependent on other resources). When the lifetime of the owner finishes, it calls a dispose function in all owned objects.
Because an owner is required in order to construct such a resource object (unlike a C# using clause or Java's try-with-resources) by virtue of its constructor requiring it, it is impossible to accidentally create such a resource without a controlled lifetime [2].
Note that this is not the equivalent to RAII. You can have a number of non-owning references to such resource objects, essentially the equivalent of a weak pointer. In my experience, this is generally a good thing, because you do not want to have a hidden pointer secretly extending the lifetime of a potentially expensive resource. I prefer resource lifetimes to be explicit and to get an error if they are used past their intended lifetime.
Introducing caching is a semantic API change, regardless of the way how memory is managed, so it should be breaking. After introducing caching, now you also accidentally introduced sharing, because two consecutive API calls can return the reference to the same object when earlier they couldn’t. This could lead to a correctness issue.
This is not a problem if the returned object is immutable. If you're returning mutable objects, then that already needs to be documented as part of the API and not be an incidental discovery from the object being reference counted.
In any event, that is hardly the only case of manual memory management breaking abstractions.
It may not be a problem in a language like Rust, where the compiler understands the concept of immutability. But most (if not all) mainstream languages except Rust don’t. Your object may be immutable in January but in February someone makes a change in a class used 10 layers below and suddenly your class is no longer immutable. Add invisible sharing to that and your code explodes.
I will happily pay a small price in breaking some abstractions and get the protection from fuckups like that every single time.
Abstractions understood as „I can change only X without changing Y” (aka GoF OOP patterns or most Clean Code OOP patterns) are overrated anyways. Readability and understandability of code is more important than ability to add something without changing something else. If code is readable and constraints are enforced by the compiler, it is easy and safe to change. Optimizing for the ability to change is better than optimizing for not having to make changes. Because changes will happen no matter how beautiful an abstraction you make.
> It may not be a problem in a language like Rust, where the compiler understands the concept of immutability. But most (if not all) mainstream languages except Rust don’t. Your object may be immutable in January but in February someone makes a change in a class used 10 layers below and suddenly your class is no longer immutable. Add invisible sharing to that and your code explodes.
No offense, but this strikes me as a strawman argument. What software design methodology leads to immutable types suddenly being made mutable? Outside of dynamic languages, where is support for immutability not there?
Note that any language with proper information hiding can expressly do immutability (literally going back to the days of Modula-2), plus of course several mainstream languages that have first-class language features to represent immutability for added convenience.
Finally, this is only one example. One underlying problem is that if all reference counts need to be capped at one, you have to either switch to full reference counting or to copy the underlying data.
You can see this play out in the std::collections::HashSet interface. Operations like intersection, union, difference, and symmetric difference return iterators rather than sets. There are also operators that do return sets, such as bitor, but those are implemented by iterating and cloning every element into a fresh set.
Because the result and the argument can't in general share references to elements, you end up essentially doing a deep copy for e.g. a set of strings, which you want to minimize as much as possible. Thus, limitations on the sharing of references dictate aspects of the API.
> Abstractions understood as „I can change only X without changing Y” (aka GoF OOP patterns or most Clean Code OOP patterns) are overrated anyways. Readability and understandability of code is more important than ability to add something without changing something else. If code is readable and constraints are enforced by the compiler, it is easy and safe to change.
So, you never work with third-party libraries where you do not control the API and have never written libraries to be consumed by other teams/third parties?
Java cannot express immutability. `final` is not transitive, so nothing stops an unrelated change in code from breaking something that was immutable earlier. Same with Golang.
> GC only solves the application memory part, thus doesn't help at all for handling those external resources.
This is far from entirely true. Most languages with a garbage collector have finalizers, which will clean that sort of thing up. Generally, and unlike memory allocation, one can call those finalizers from within user code, so as not to have to rely on the GC running.
The distinction between reference counting and garbage collection is an artificial one. Reference counting is an approach to garbage collection, one with different tradeoffs from the tracing algorithms more central to the concept. I agree with you that it's a more unified approach to resource management: finalizers require user attention in a way that ordinary allocation doesn't, so yes, system resources get treated differently. I don't see that as a slam-dunk against them, however.
> Most languages with a garbage collector have finalizers, which will clean that sort of thing up.
Using finalizers for cleanup of non-memory resources is bad because they are only called when there's memory pressure. If you have used up all of a non-memory resource but there's still plenty of free memory, the next allocation of that resource will fail; and if you instead force a garbage collection at that point, not only do you cause a pause, you also end up collecting memory while there's still plenty of it available.
Finalizers will be non-deterministic when called as part of GC. One advantage of reference counting is that it is deterministic (and yes, this means that sometimes freeing the last reference that's keeping a bunch of objects around will lead to a spike of extra deallocations. Guess what, that's what determinism is. It's implied by the need to free resources promptly.)
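To make that deterministic spike concrete, here is a tiny C++ sketch (TempFile is a made-up type) using shared_ptr as the reference-counting stand-in: nothing is freed while another reference exists, and dropping the last reference frees everything it kept alive, right there.

    #include <cstdio>
    #include <memory>
    #include <vector>

    struct TempFile {
        explicit TempFile(int id) : id_(id) {}
        ~TempFile() { std::printf("deleting temp file %d\n", id_); }  // prompt and deterministic
        int id_;
    };

    int main() {
        auto batch = std::make_shared<std::vector<TempFile>>();
        batch->reserve(3);                                   // avoid reallocation noise
        for (int i = 0; i < 3; ++i) batch->emplace_back(i);

        std::shared_ptr<std::vector<TempFile>> keeper = batch;   // a second reference
        batch.reset();                // not the last reference: nothing is freed yet
        std::puts("dropping the last reference now");
        keeper.reset();               // last reference: all three destructors run here, in a spike
        std::puts("done");
        return 0;
    }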
One problem with finalizers, try-finally, try-with-resources, Python-`with`, etc. is that they don't actually guarantee the code will be called even in common cases.
In particular, any function that returns a (possibly newly-constructed) open file handle has not yet informed the caller that it's actually a thing to keep track of, unlike with destructors which are active immediately once the object's lifetime begins, and only rely on `move` being relatively atomic.
How is a try-with-resources not guaranteed? At least in Java, it is guaranteed to run - plus, you can get your closing/destructor logic wrong in RAII languages as well.
> How is a try-with-resources not guaranteed? At least in Java, it is guaranteed to run
There's a subtle detail you have to be careful with, however: if you try to allocate memory or call a method between allocating your resource and actually doing the try-with-resources, you might get an OutOfMemoryError or a StackOverflowError, and your resource will leak. That is, if you do something like:
try (MyHolder holder = new MyHolder(allocateResource())) { ... }
You can have a leak if the memory allocation in `new MyHolder()` fails, or if there's not enough space on the stack to call the MyHolder constructor.
> plus, you can get your closing/destructor logic wrong in RAII languages as well.
For instance, in C++ you can accidentally put your resource in a temporary which is immediately destructed at the end of the line, when you wanted it to last until the end of the enclosing scope. Rust makes it harder for this to happen, but it's still possible.
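The classic concrete instance is an unnamed lock guard; a minimal sketch:

    #include <mutex>

    std::mutex m;
    int shared_counter = 0;

    void buggy_increment() {
        // Intended: hold the lock for the rest of this scope.
        // Actual: this is an unnamed temporary, destroyed (and the mutex unlocked)
        // at the end of this very statement.
        std::lock_guard<std::mutex>{m};
        ++shared_counter;                       // runs without the lock held
    }

    void correct_increment() {
        std::lock_guard<std::mutex> guard{m};   // named object lives until end of scope
        ++shared_counter;                       // protected as intended
    }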
I mean, it's possible to write bad code in C++. But at least it's possible to write good code too. C++ defines a careful sequence that's hard to get wrong and easy for the compiler to enforce:
* if a given subobject's (field or base class) constructor runs, its destructor is guaranteed to run (see the sketch just below).
* in particular, in low-level ownership classes, you can arrange for field initializers to be `noexcept`, so their destructors run regardless of subsequent manipulation of the fields; just be sure not to assume invariants from the fully-constructed case. In most classes, deferring all ownership logic to the fields is simpler (rule of zero beats rule of five). And if you do add a manual try-catch during construction, it will actually work.
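A small sketch of that guarantee (the types are invented): when a later field's constructor throws, the already-constructed fields are destroyed automatically before the exception escapes.

    #include <cstdio>
    #include <stdexcept>

    struct FileHandle {
        FileHandle()  { std::puts("open file"); }
        ~FileHandle() { std::puts("close file"); }   // guaranteed to run because its ctor ran
    };

    struct Socket {
        Socket() { throw std::runtime_error("connect failed"); }
        ~Socket() { std::puts("close socket"); }     // never runs: its ctor never completed
    };

    struct Session {
        FileHandle log;    // constructed first
        Socket     conn;   // constructed second, throws
    };

    int main() {
        try {
            Session s;     // FileHandle ctor runs, Socket ctor throws, then ~FileHandle()
                           // runs automatically before the exception leaves the constructor
        } catch (const std::exception& e) {
            std::printf("construction failed: %s\n", e.what());
        }
        return 0;
    }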
Languages other than C++ generally fail in several ways:
* Allow exceptions to be thrown by the runtime for reasons unrelated to the code being executed
* No support for subobjects: every object has to be allocated separately, so the runtime doesn't know it needs to poke them, since finalizers apply only to objects, not to pointers. In particular, Python's `contextlib.ExitStack` can narrow the window in which leaks are possible, but not eliminate it.
Rust does slightly better than C++ by enforcing trivial moves, at the cost of making many useful programs impossible to write.
Good luck implementing "peer pointers" in Rust. Zero allocation involved, trivial to do safely in C++.
class B;  /* forward declaration: A and B refer to each other */

class A
{
    /* can be NULL if detached */
    B *peer;
    /* other fields generally exist in one of the classes. */
    /* useful objects are typically created in pairs, but can be constructed detached if needed */
    /* move ctor and swap keep the (value) objects pointing at each other */
    /* move assignment and dtor call detach() on the overwritten/expiring object */
    /* Often separate "detach and cancel" and "detach without canceling" are useful */
    void detach();
    /* other functions specific to the peerage */
};

class B
{
    A *peer;
    /* same API as `class A`, but usually most of it is only used via one owner. */
    /* In my experience, generally one class's instance goes in some collection of centralized objects like an event loop, and the other is used as a handle to it. */
};
I guess we could always use the usual Rust solution of "screw ownership and efficiency, just shove everything in an `Rc<RefCell<>>`".
Using RCU as the example to motivate GC is interesting. It essentially transfers the responsibility for freeing from the writer to the last reader, which cannot be determined at compile time. It makes a lot of sense.
But this makes me think: if I want more performance, should I transfer the freeing further, from the reader to a dedicated batch process? The readers would only set a mark or write into a queue or something. Every once in a while, a batch process collects all the garbage and compacts. That way the readers don't pay random additional overheads.
> Lies people believe about memory management
> The programmer knows the best times to pause for memory management.
In my experience, there are many programs in which the programmer does know the best times to pause for memory management. For example, in games and crypto trading programs, I want to classify all computations into two priorities. I need to do an update / render a frame / compute a trading action in every time period. If that finishes before the deadline, I have nothing else to do and would happily collect some garbage. If the high-priority work is using all the available time, for example when the market is very active, I don't care much about collecting garbage and would like to defer everything that is not strictly necessary for as long as possible.
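As a sketch of that idea, here is a hedged C++ version that uses deferred deletion rather than a real GC (Entity, retire, and the frame structure are all hypothetical): the hot path only moves dead objects into a graveyard, and destructors run only in whatever time is left before the frame deadline.

    #include <chrono>
    #include <memory>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    struct Entity { /* ... game or order-book state ... */ };

    std::vector<std::unique_ptr<Entity>> graveyard;    // deletions deferred into here

    void retire(std::unique_ptr<Entity> e) {           // called from the high-priority path
        graveyard.push_back(std::move(e));             // amortized O(1); no destructor, no free() here
    }

    void drain_graveyard_until(Clock::time_point deadline) {
        while (!graveyard.empty() && Clock::now() < deadline)
            graveyard.pop_back();                      // destructor and deallocation run now
    }

    void frame(Clock::time_point deadline) {
        // update(); render();  or  compute_trading_action();   // high-priority work first
        drain_graveyard_until(deadline);               // cleanup only with leftover time
        // whatever is left in the graveyard simply waits for a quieter frame
    }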
You're saying "for example in games". The article is saying "for example in games, during a loading screen". You're saying not in loading screens. Maybe you should give more specific examples then, instead of leaving people guessing what you mean?
> For example, in games and crypto trading programs, I want to classify all computations into two priorities. I need to do an update / render a frame / compute a trading action in every time period.
Sorry if that wasn't clear enough. The very next sentence of my comment gave "do an update" and "render a frame" as the examples about games. I'm talking about the typical game loop. In almost all games the program has to do an update and render a frame every 1/30s or 1/60s. Missing the deadline degrades the user experience a lot. And this happens constantly and applies to almost all games. I mention these examples because I believe they are completely different from loading screens, which only happen on occasion. These examples make the batch, low-priority garbage collection I was thinking of necessary; loading screens do not.
I quoted this in another comment here, but just to highlight one of the best couple of sentences in TFA:
> Tools to automate the “actually freeing the memory” part, like lifetimes in Rust and RAII in C++, don’t solve these problems. They absolutely aid correctness, something else you should care deeply about, but they do nothing to simplify all this machinery.
I use Nim in production. It defaults to RC. The biggest benefit of runtime automatic memory management that rarely gets mentioned: You can easily eliminate almost all memory semantics from a typical server-side program. My code is 99.9% business logic, with the 0.1% being a couple procedures which interface to a C library.
Hardcore manual memory people seem to have completely misaligned priorities relative to real-world concerns. Maintainability, safety, productivity, and iteration speed seem to be bottom-of-list to the egotistic holy grail of being able to say they write their own memory routines. If they're not in an embedded context, I'm really unsure why anyone paying them to write code should care what they have to say. They're not talking about increasing robustness and reducing costs, so what the hell are they even rambling about? I've had too many of these debates in person, face to face, with people who can't match my ability to deliver.
The vast majority of us are not smarter than the compiler, and are not smarter than automatic MM routines. Those who are write compilers and GCs; they don't work on the same programs the rest of us write, just memory-managed. It's the worst-kept secret that the remaining contexts where manual management is used often have the worst-maintained spaghetti codebases, leading to disasters, whistleblown scandals, and hush-hush covered catastrophes waiting in the wings while people get the hell out of dodge before they blow. It's all duct tape and prayers. Having had to inspect and translate procedures from a deliberately obfuscated spaghetti C codebase, my position on this is going to be hard to budge. Experience is an unbeatable Bayesian prior.
> Hardcore manual memory people seem to have completely misaligned priorities relative to real-world concerns. Maintainability, safety, productivity, and iteration speed seem to be bottom-of-list to the egotistic holy grail of being able to say they write their own memory routines.
I think you've gotten very worked up over a theoretical bogeyman who doesn't exist. Most natively compiled programs are written in C++, with Rust a distant second, and both put a lot of effort into the selling point that you don't have to manually manage memory the way you do in C; you can rely on ownership instead. I've never seen anyone like you're describing.
Most systems programming languages don't have fully manual memory management these days - they use RAII. Manual deallocation is possible, but not used often.
Have you ever really used modern C++? This is trivial to avoid and almost never a problem. If it does happen it's immediately clear what went wrong and where since it is in the same scope.
It would be better at compile time, but this is one of those things that people who don't use modern C++ think will be a problem, and it basically never is. Something happening immediately at run time isn't that far removed from happening at compile time, iteration-wise.
This is a 10-line example. In real-world code you will get a segfault and then have to hunt through the codebase with a debugging session to figure out where it triggered.
Actually it would trigger the first time you use something after moving it, and it's pretty obvious what happened, because the data structure will have null pointers.
This is very rare and usually just due to a typo in practice, because moving a variable in the middle of a function is mostly done to pass it to another function as an argument, so it will be cleaned up when that function is done.
How often does this happen to you? Do you work in modern C++? I've never heard of this actually tripping anyone up.
It's kind of funny to link to a static analysis check developed expressly to catch this issue and then claim that the issue isn't a problem in practice, don't you think?
It isn't a problem in practice. I think the static analysis just comes from seeing if they can mimic Rust and really nail it down. Just because someone writes a static analysis check doesn't mean they are solving a problem that's plaguing people.
How's Nim in production? I really enjoyed working with the language and I've been bitten too many times by Python's package management. Do you think it's suitable for scripting too?
I use it in a context it's extremely well suited for: As a CLI executable invoked by other server-side scripts. It's a program that does lots of mathematical and statistical data processing, along with HTML report generation. Like I said, I use the default RC and don't even think about memory. The type system is very simple; there is only one nominal string type, for example, instead of the 12 in Rust or C++, and it's a fleshed-out string type that can be mutable or immutable. I also take big advantage of its preference for functional-style referential transparency with only local mutations. I have only a single module that could be construed as "OO". Oh, and the module system works exactly like you probably intuit a module system should, so you probably already know how to use it.
If you want to script with Nim, it's actually quite nice; it has a "run" command which compiles and runs without depositing artifacts in the run directory, and has an incredibly robust standard library, comparable to that of Python.
I have no experience making a live always-running Nim application, so I can't speak to that. But in the context I use it, it's incredible.
Thanks for the detailed response :)
Sounds like I'm going to be experimenting with Nim in the next few days. Creating CLI executables with Python is always a pain; you cannot properly create a self-contained executable without going through hoops.
> I have no experience making a live always-running Nim application, so I can't speak to that. But in the context I use it, it's incredible.
I have done this several ways quite successfully. Firstly, instead of "%cpu" I tend to measure "10s..100s of parts per billion" of a CPU, using a roughly 100-line "cron library" that I use instead of system cron daemons: https://github.com/c-blake/cron -- I'm seeing 40 ppb on one box and 73 ppb on another at the moment.
Another example might be https://github.com/c-blake/bu/blob/main/doc/dirq.md which is a kind of ad hoc daemon that monitors however many directories on Linux with inotify and then launches user commands against dropped-in files. This can be a sort of Plan 9 "plumb"-style thing. E.g., one of the directories I monitor is a browser download directory. So, one click to save and then boom: a potential cascade of activity. (EDIT: This is clocking in at about 9572177ns/(48*3600s) =~ 55 parts per billion for me right now.)
As a final example, I got annoyed with the unwanted complexity of modern syslog junk, so I whipped up https://github.com/c-blake/kslog which, looking it up for this post, has according to /proc/PID/schedstat only accumulated about 27357590/(24*3600+24*60) =~ 311 parts per billion of 1 CPU on a 4-core system since boot about 24h:24m ago. And this in about 25% of the lines of Nim code that busybox spends on less useful C code (to your point of "almost all the code is the actual logic, not other junk").
Also, while it is not as full-featured as stdlib `string`, there is a zero-copy `cligen/mslice.MSlice` type (https://github.com/c-blake/cligen/blob/master/cligen/mslice....) that does have basic features like splitting and number parsing. There are probably others out in the Nimbleverse. If view types ever migrate from experimental status to relied-upon that might become less interesting.
I should also mention: Be careful analyzing forum or example Nim code. A LOT of Nim programmers are coming from the C or C++ side of things and trying to use it like C or C++. You can write most Nim programs essentially like Python. Keep it clean, keep it simple. Use iterators, use map(), use universal function call syntax, use the full semantic power of sequences (very comparable to Python lists).
Off topic but what is going on in the picture of the leaking pipe? I don’t think it’s AI but there’s a third arm in there that I don’t understand. There’s at least two guys in the picture but I can’t assign one of the arms to either of them.
Saying this as a big hardware-atomics enjoyer: static atomic variables are not fast, are not parallel, and are definitely the worst kind of cache miss.
Atomics are also atomic only for that one single operation, such as manipulating a single number in a struct or swapping out a pointer from one version of an object to another. To atomically expose a set of changes to a data structure, you either use a lock or swap the old version for the new, which is the driving example in the article.
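As a minimal sketch of that "swap the old for the new" pattern, assuming C++20's std::atomic<std::shared_ptr> (the Config type and field names are invented): readers take one atomic snapshot, the writer publishes a complete replacement in a single store, and the old version is reclaimed by the reference count once the last reader lets go.

    #include <atomic>
    #include <memory>
    #include <string>

    struct Config {
        std::string endpoint;
        int timeout_ms;
    };

    // The current version; one atomic object holds the pointer to it.
    std::atomic<std::shared_ptr<const Config>> current_config{
        std::make_shared<const Config>(Config{"https://example.invalid", 500})};

    void reader() {
        // A single atomic load yields a consistent snapshot of *all* fields at once.
        std::shared_ptr<const Config> snapshot = current_config.load();
        // ... use snapshot->endpoint and snapshot->timeout_ms ...
    }

    void writer() {
        // Build the complete new version off to the side, then publish it with one
        // atomic store; readers never observe a half-updated Config.
        auto next = std::make_shared<const Config>(Config{"https://example.invalid", 250});
        current_config.store(std::move(next));
        // The old Config is reclaimed when the last reader drops its snapshot,
        // courtesy of the reference count rather than a tracing GC.
    }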
This is false. Depending on the CPU architecture, updating a shared static atomic variable may cause cache invalidation. How fast it is depends on how hot and tight your critical sections are.
[1] MaPLe (MPL): https://github.com/MPLLang/mpl
[2] Automatic Parallelism Management: https://dl.acm.org/doi/10.1145/3632880