I have been using the Boehm-Demers-Weiser GC for jank [1], the native Clojure dialect on LLVM with C++ interop. Since jank is a Clojure dialect, and Clojure is built primarily on persistent, immutable data structures, there can be a lot of references to objects, potentially cyclical, across many threads, and a lot of garbage being churned. I originally started with reference counting and RAII, using boost::intrusive_ptr and an atomic count. The GC was actually 2x faster in the sequence benchmark I was trying to optimize.
At this point, jank is generally several times faster than the equivalent Clojure JVM code. I'm sure there are better GCs out there, in terms of performance, and I have my eye on MMTK [2] for a future upgrade, but the fact remains: the Boehm GC is stupid simple to integrate and it's surprisingly fast. Compare it to MPS, MMTK, and others, and both the documentation and the actual dev work required are worlds apart.
For a project which needs a GC but doesn't need to pick the absolute fastest one first, it seems like the best option, based on my research and my experience.
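To give a concrete idea of the "stupid simple to integrate" part, here is a minimal sketch of plain Boehm usage (the header is <gc.h>, sometimes installed as <gc/gc.h>; link with -lgc). This is just the generic library API, not jank's actual integration code:

  #include <gc.h>      // sometimes <gc/gc.h>; link with -lgc
  #include <cstring>

  int main() {
      GC_INIT();  // recommended before the first allocation

      // Collectable allocation: no free() needed; unreachable blocks are reclaimed.
      char* s = static_cast<char*>(GC_MALLOC(64));
      std::strcpy(s, "hello");

      // Pointer-free data can use GC_MALLOC_ATOMIC so the collector never scans it.
      void* buffer = GC_MALLOC_ATOMIC(1 << 20);
      (void)buffer;
      return 0;
  }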
Just want to say, I'd never heard of jank before and it looks awesome. Great work!
As more of a CL guy who's dabbled in Clojure but found it a bit too slow (though conversely I also miss Clojure's data structures and transactional state in CL; multithreaded programming really isn't CL's strong suit, for obvious historical reasons), this might reinvigorate me. I may sit down one day soon and churn out some AOC or something in jank just to get a feel.
The shameless plugging pays off! One more HN reader converted. :D
Check it out, join our Slack, and stay tuned. jank's still under heavy development right now, and not ready for AOC this year, but it's getting close. The blog is the best way to watch the progress: https://jank-lang.org/blog/
Well, in that case, if you have any issues that are somehow well suited for a reasonably experienced C/Lisp programmer with almost no C++ experience (unless shaking my head over the code in the Arduino stdlib counts...), then maybe I'll sit down with one of those!
There will be a lot of room for this, once I build out more of the features. In particular, there will be a lot of Clojure libraries which need to gain jank support. Clojure doesn't require "porting", so to speak, since it has a special .cljc file which can use reader conditionals to check the host that it's in (clj, cljs, cljr, jank, etc). So anywhere those libs are using Java interop, we'd need to wrap it to use native interop instead.
The vast majority of it is heavy C++ work, though. Outside of that, the biggest gains will come from time spent on packaging, distribution, and testing on various platforms.
And if none of that sounds interesting or applicable, don't worry. Just be sure to join the Slack channel and hang out with us. :)
Exactly as nlitened mentioned, Clojure has a few different reference types, which are boxes that hold pointers to immutable values. The values within those boxes can be changed transactionally. This allows for cyclical references.
What I love about the Boehm collector is the hack that enables it to work in a language with no runtime typing, like C. It literally guesses what's a pointer. In C, pointers are basically integers, but they're unusual numbers, often quite large, and you know the entire set of valid ones if you control malloc(). Surprisingly, this worked even on 32-bit architectures.
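A toy illustration of the guessing (hypothetical names, not Boehm's actual code): during marking, every word found on the stack, in registers, or in scanned heap blocks is compared against the bounds of the heap the allocator controls, and anything that falls inside is treated as a live pointer.

  #include <cstdint>

  static char* heap_lo;   // lowest address the collector's allocator has handed out
  static char* heap_hi;   // highest such address

  static bool looks_like_pointer(std::uintptr_t word) {
      char* p = reinterpret_cast<char*>(word);
      // Real collectors also consult per-block/page metadata and alignment,
      // but the core idea is just this range check.
      return p >= heap_lo && p < heap_hi;
  }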
I don't see that this is something to "love", and what you describe doesn't capture the full extent of the approximation.
What scared me off of Boehm GC is that there are a bazillion config options -- e.g. it requires tuning and blacklisting to avoid memory leaks -- and probably even correctness issues, because register scanning isn't portable.
The codebase is large and a bit scary, and there are seemingly lots of ways to use it poorly. People kept saying it's "drop in", but that seems like a myth.
At least personally it doesn't seem like the kind of thing I want to debug.
The opinions on Boehm generally seem very split -- a bunch of bad stories resulting in ripping it out of the codebase, and also people who think it's surprisingly good, and codebases that have used it for a very long time.
I think the Nix evaluator uses it and is carrying around patches, and Guile Scheme uses it. Though I think there is some desire to move away from it in both cases. But again people have used it for a very long time, which isn't nothing!
Recent versions of the collector use an approximate best fit algorithm by keeping free lists for several large block sizes. The actual implementation of GC_allochblk is significantly complicated by black-listing issues (see below).
...
The collector implements black-listing of pages, as described in Boehm, ``Space Efficient Conservative Collection'', PLDI '93, also available here.
During the mark phase, the collector tracks ``near misses'', i.e. attempts to follow a ``pointer'' to just outside the garbage-collected heap, or to a currently unallocated page inside the heap. Pages that have been the targets of such near misses are likely to be the targets of misidentified ``pointers'' in the future. To minimize the future damage caused by such misidentifications they will be allocated only to small pointerfree objects.
The collector understands two different kinds of black-listing. A page may be black listed for interior pointer references (GC_add_to_black_list_stack), if it was the target of a near miss from a location that requires interior pointer recognition, e.g. the stack, or the heap if GC_all_interior_pointers is set. In this case, we also avoid allocating large blocks that include this page.
"love" in the sense of someone loving their snaggletooth mutt. What's remarkable to me is despite the obviously terrible idea at the core of it, it seems to work pretty well for some practical use. (IIRC it was at the core of GNU Java as well, and I could have sworn the earliest Sun JVMs used it.)
I don't doubt it caused all sorts of problems too. It's just an interesting reminder that sometimes a really wacky hack can be pretty useful.
Certainly not, but it wasn't particularly hard to implement either; I managed to do it with some simple inline assembly. Here's my programming language's register spilling code:
It walks the entire native stack and marks every pointer it finds! Handles pointers in registers by spilling them on the stack and just letting the stack scanner find them like all the others. It guesses pointers but in a way that makes false negatives impossible.
I assume everything must be properly aligned to work. Interestingly, it uses jmpbuf to spill the registers. There's also a compiler builtin to get the current stack frame which can be used to obtain the stack root if called at the beginning of the program. Really neat, and educational.
Ah the spectre of the 80s, most famously used by the ancient fork of Mono that Unity uses, leading to GC-pause induced stutters that have plagued gaming for over a decade.
I'd estimate that the bad rap GC languages get can be traced back to this piece of code.
The above problem is about the latency of stop-the-world collectors in a domain that requires extremely low latency. And if you think that stop-the-world collection is representative of garbage collection as a whole (i.e. the "bad rap"), that's simply not up to date with the state of the art. (It's just that generational/incremental collection is hard to impossible without support for write barriers/read barriers/snapshotting on the compiler or hardware side, which makes it a practical no-go for a language that was never designed to have GC support.)
But in terms of throughput, the BDW GC actually performs pretty well, competitive with or beating malloc()/free() libraries. This is plenty enough for a number of applications, especially when it comes to batch processing. In fact, even the stop the world latency is (combined with parallel marking) plenty good enough for a number of types of regular GUI applications, where you don't have the fairly extreme latency requirements of standard video games.
It is also worth noting that manual deallocation (especially via RAII) isn't a panacea for latency, as that is just as prone to large pauses due to cascading deletes [1].
[1] While in theory you can do that lazily, in this case you're losing timeliness of deletion and may actually have worse worst case latency than a modern GC that can stagger its work to have bounded pause times. The benefit of RAII is that you may be able to plan these deletions, but even that can be surprisingly difficult at times, requiring extra management to keep data structures alive longer than normal. [2]
[2] Note that lazily deleting objects in RC to avoid pause times is not necessarily the answer; for one thing, you're introducing much of the complexity associated with incremental GC again (especially having to do X amount of deallocation work for each allocation), or risking considerable space overhead. https://dl.acm.org/doi/abs/10.1145/964001.964019 It also remains a difficult challenge when you're dealing with objects that aren't of bounded size (e.g. large arrays).
> It's just that generational/incremental collection is hard to impossible without having support for write barriers/read barriers/snapshotting on the compiler or hardware side, which makes that a practical no-go for a language that was never designed to have GC support.
It was demonstrated in the mid 80s that the MMU could be used to implement the write barrier (when I saw that I knew lisp machines were a dead end, though I kept using them for as long as was feasible). You can use this and other compiler support to implement it in a language not designed for GC support, no problem.
The issue I think you meant is that you can’t have a transporting collector (which also compacts your working set, even allowing you to return memory to the system). A language like C++ doesn’t anticipate an object’s address ever changing, so that’s a no go.
I suppose you could write special shared_ptr / unique_ptr that supported address mutation but you’d need a compiler flag to check for an object’s address being taken… but to some degree the destructor paradigm renders a lot of this moot.
> It was demonstrated in the mid 80s that the MMU could be used to implement the write barrier (when I saw that I knew lisp machines were a dead end, though I kept using them for as long as was feasible). You can use this and other compiler support to implement it in a language not designed for GC support, no problem.
Yes, that's why I wrote about support "on the compiler or hardware side" (i.e. MMUs). That said, while this might work in theory, the BDW GC does have support for it, and in practice it doesn't actually work well. Throughput suffers and pause times can actually get worse.
By the way, another alternative way of having incremental collection without compiler support is to use fork() for snapshotting (there's an option in D for that), but that has its own issues, of course.
> The issue I think you meant is that you can’t have a transporting collector (which also compacts your working set, even allowing you to return memory to the system). A language like C++ doesn’t anticipate an object’s address ever changing, so that’s a no go.
No, I wasn't talking about a compacting collector. I very specifically meant generational and/or incremental collection.
The compiler support you need in practice is quite limited. There is an implementation of incremental cycle collection in Rust: https://github.com/chc4/samsara It's made possible because Rust can tell apart read-only and read-write references (except for references to interior mutable objects, but these are known to the compiler and references to them can be treated as read-write). This avoids a global stop-the-world for the entire program.
Cascading deletes are rare in practice, and if anything they are inherent to deterministic deletion, which is often a desirable property. When they're possible, one can often use arena allocation to avoid the issue altogether since arenas are managed as a single allocation.
> The compiler support you need in practice is quite limited.
This is true, but it's absent especially in C. Of course, as long as you can intercept writes and/or reads through pointers, you are good to go in principle. However, that solution may not be efficient.
> Cascading deletes are rare in practice
An array of objects (e.g. std::vector<std::string>) already has unbounded deallocation cost. People mostly think of tree-like structures here, but such structures only add the risk of stack blowout to the lack of an upper bound. But in principle, any container that does not have a size limit can give rise to unbounded pause times.
I have been writing low latency code for a decade and never seen cascading deletes be a problem. GC on the other hand is a constant latency problem for people.
> I have been writing low latency code for a decade and never seen cascading deletes be a problem.
I don't know what your domain is, but the domains I'm familiar with (especially hard real-time) deal with the underlying constraints by restricting how you write programs. A side effect of these restrictions is that you generally avoid using constructs that give rise to cascading deletes. But ultimately, having to deal with these restrictions is suboptimal.
I remember a talk by Gil Tene (CTO and co-founder of Azul Systems) about how the problem with writing real-time Java was that it historically required you to write non-idiomatic Java code, which came at a serious cost to reuse and where he argued that one of the benefits of Azul's GC was that you could write (mostly) idiomatic Java and still have GC noise be less than OS noise.
There is of course the actual problem that the stricter your latency requirements are, the fewer GCs actually meet them. Plenty of modern off the shelf GCs meet typical soft real time requirements, but the harder your real time requirements get, the fewer implementations actually meet them (and are increasingly more expensive).
I'll also add that depending on how strict your latency requirements are, off the shelf allocators may not meet them, either, even for a single malloc() (though there are allocators that offer hard upper bounds for allocation costs, such as TLSF).
Running a throughput-oriented GC instead of a latency/realtime-oriented one is a common error those people make (also, the throughput-oriented ones are usually the default).
For all of the reasons you note above, this is why I've almost exclusively dealt with arena-style allocations of fixed-object-sizes when doing ultra low latency work.
Hi, I'm just starting a (long) PhD in computer music languages, and one of my hoped-for projects is improving the GC for the Scheme implementation I work in for soft realtime use. If you (or anyone else) wouldn't mind sharing suggestion on resources for learning about low-latency modern GCs, it would be most appreciated. :-)
You can look at the SGCL garbage collector for C++: https://github.com/pebal/sgcl. It works in a separate thread, is lock-free and never stops the world.
Thanks, that's good to know. I'm working through his first GC book right now and plan to order the new edition of the gchandbook when I'm done with the other (which I read was a gentler intro to the topic).
Absolutely not. While it obviously should not be used for use cases that it isn't designed for, as a conservative mark-and-sweep GC it is extremely well engineered and has very good throughput.
Another big reason is the lack of value types in languages like Java. Modern languages with a GC tend to have those, and where it makes sense to use them they help, a lot.
Note about "modern languages with GC", this was already a thing in Common Lisp, Eiffel, Modula-2+, Modula-3, Mesa/Cedar, Oberon (and its linage),...
Java's mistake was ignoring them, as it was designed as a set-top box programming language (as Oak), pivoted to applets, and then got pushed all over the place.
> "Another big reason is the lack of value types in languages like Java. Modern languages with a GC tend to have those, and where it makes sense to use them they help, a lot."
It might be interpreted as saying that older languages with GC did not have value types, which is flatly false; they go back to the 1970s.
My remark was to make clear this isn't something new and modern; rather, it was a design mistake in languages like Java, which is now trying really hard to fix it, while others designed after it (like C#) thankfully didn't fall into the same trap.
How modern a 2001 language (C#), using ideas from Mesa/Cedar (1985), is an exercise in the semantics of "modern".
Take your brownies, I guess, that wasn't my point.
As if me mis-remembering Common Lisp capabilities invalidates the rest regarding "modern" GC languages.
Here is an array of structs in Oberon, 1987,
  TYPE
    Point = RECORD
      x, y : INTEGER
    END;
    Triangle = RECORD
      v1, v2, v3 : Point
    END;
    Data = ARRAY 100 OF Triangle; (* static sized *)
    DynData = ARRAY OF Triangle; (* dynamically sized *)
  VAR
    dynData : DynData; (* Allocated via NEW(dynData, elems) *)
No, it doesn't invalidate the rest of your comment. It just made it harder to figure out what point you were making in your original post (i.e. what the this referred to).
There's a little bit of irony here in how you react to people picking at details in your own posts, when your original post is precisely the same kind of pedantry!
If you nest an object within another object you often want to embed the child directly in its parent, sharing the memory and avoiding the indirection that you get if you nest with a reference as is the typical approach with GC languages. Also embedding makes it clear that it's a 1:1 relation, i.e. non-optional, fixed identity etc.
AFAICT few GC languages allow you to do this, basically you have to buy into indirection and end up with webs of objects.
This is quite cache unfriendly, but the bigger issue for me is that it makes the code less clear and requires additional complexities since you can't just jump from a child to its parent by subtracting an offset from its address.
> makes the code less clear and requires additional complexities since you can't just jump from a child to its parent by subtracting an offset from its address.
Wait, so you're saying that code that subtracts an offset from a child node to get to a parent node is clearer and less complex than alternatives?
Not having the address relationship requires additional complexities. To get from the child to the parent, you need to spend an additional reference field, and need to keep it consistent with the other reference. If you embed directly, there's nothing in the relationship of the objects' identities that could ever get inconsistent. It's a strong guarantee.
How do you get from the child to the parent in, say Java? And how do you tell that a child is never swapped out?
How do you copy an object that holds, say, 4 vec4s, vec4 being a struct that itself holds 4 floats? With value types, say in C, it's 1 memcpy and you're done.
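Roughly the kind of layout being described, in C-style code (illustrative names): everything is embedded, so the whole thing is one contiguous block and a copy is a single memcpy (or just a struct assignment).

  #include <string.h>

  struct vec4 { float x, y, z, w; };

  struct mat4 { struct vec4 rows[4]; };   /* 4 vec4s embedded directly: 64 contiguous bytes */

  void copy_mat(struct mat4 *dst, const struct mat4 *src) {
      memcpy(dst, src, sizeof *dst);      /* one flat copy; *dst = *src does the same */
  }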
In an archaic language like C, it may be clearer. In a modern language, in no way is subtraction of an offset clearer and simpler than almost any alternative. Even in C, as you've mentioned elsewhere, you've had to use macros to make it at all clear, and macros can do anything and are inherently opaque.
Many other languages also have newer/different approaches to solve some of these problems, be they subclassing, generics, macros (often safe ones), JITs, or other. They mostly cannot do this as efficiently, but that seems less like a clarity/complexity concern and more like a low-level-programming concern.
> How do you get from the child to the parent in, say Java?
In Java, you do indeed have to embed mutual pointers and just be careful. As a C user, you must be used to being careful. Even C#, which does allow structs, doesn't have references-to-members that can be turned into references-to-owners. The aforementioned language features like subclassing (for lists) and generics can help.
> How do you copy an object that holds, say 4 vec4, vec4 being a struct that itself 4 holds floats?
Copying structs as a block works fine in Rust, C++ and C#, along with many other languages. Again, Java is a little more limited in this regard, though I believe they are working on adding value types.
I think we have quite different ideas what complex means. You seem to see a language (C or machine code), which, while based on simple principles, is complex in its behavior, especially if something goes wrong (which should not happen in the first place).
I see literally a subtraction operation versus a huge memory safe language framework set up over the same machine, plus a whole lot more user facing surface level source code, just to achieve the same thing.
Sure, C and machine code can be simple (or simplistic) but in a world in which 70% of severe security vulnerabilities are caused by memory safety issues, I'll generally prefer languages that are safe against them, even if I can't then use text macros to get from a child pointer to a parent pointer. But that's off-topic, really.
In the message I originally replied to, I guess it's only the "since you can't ..." part that I really object to, as it's never going to be clearer to use &(par->child) and CONTAINER_OF(ch, some_type, child) than to use par.child and ch.parent.
Please, I did not want to offend you personally. It just took me by surprise. I understand your points, even though I'm with your direct reply, all things considered.
The address offset subtraction thing specifically is used quite a bit in the Linux kernel, for example. It allows you to combine abstract code with concrete data. For example, they define a red-black tree as a graph of just plain nodes. These nodes are totally abstract; they contain only node pointers but no useful data. The code is totally abstract too -- no need to instantiate anything a dozen times with different offsets and such. You simply wrap the abstract nodes with whatever concrete useful data you fancy.
The same approach works for lists and a variety of other data structures.
You can get from the abstract link node to the containing struct with a macro: CONTAINER_OF(ptr, containertype, membername). It's easy to use and hard to get wrong.
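For anyone who hasn't seen it, this is roughly the pattern (simplified; the kernel's real container_of adds type checking):

  #include <stddef.h>

  struct list_node { struct list_node *next, *prev; };   /* abstract node, no payload */

  #define CONTAINER_OF(ptr, type, member) \
      ((type *)((char *)(ptr) - offsetof(type, member)))

  struct task {
      int id;
      struct list_node link;   /* node embedded in the concrete struct */
  };

  /* Given a pointer to the embedded node, recover the containing task. */
  static struct task *task_from_link(struct list_node *n) {
      return CONTAINER_OF(n, struct task, link);
  }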
Mono is nowhere close to being the major use case for it; it has been successfully deployed in maaaany projects.
C++ crazies denigrating GC even when it visibly fixed their issues, or maybe especially when it did, has been a phenomenon since 1990. Even when code with the Boehm GC ran faster - and, most importantly, didn't crash all the time.
Manually managed code doesn't crash all the time. It probably crashed way more in the 90s. There is an art to writing manually managed code that is maintainable and fast, sure, and it doesn't involve malloc() and free(). And then there is RAII, automating lots of the work.
The C++ crazies are AFAICS still most everywhere where there is lots of temporary objects and serious requirements on throughput & latency. There are frameworks and languages that support garbage collection that get used in this space too, but I don't know how much can be inferred from that.
Seems like C and C++ guys are constantly looking for ways to automate their garbage cleanup.
And seems like GC guys are constantly looking for ways to fix their web-of-objects memory layout, and to prevent GC from ruining their performance. Plus, GC isn't deterministic, it mostly works for memory but otherwise you'll see _lots_ of manual cleanup code (not making use of the GC), and it always feels a bit tricky with the GC still being part of the game. Not that I take a huge issue with C-like cleanup code -- I think RAII can be tricky too because it's hard to "read" what happens, and I often prefer the manual cleanup code. But compared to GC, at least RAII is deterministic.
The determinism myth only works out for basic types; see Herb Sutter's CppCon 2015 talk on the matter, and how the performance workarounds turn out to be a poor man's GC.
That is a one-and-a-half-hour-long talk; can you point to the timestamp where he talks about determinism?
> and how performance workarounds turn out a poor man's GC.
If you need performance then it doesn't matter if it is a "poor mans GC", what matters is that it meets the space/time requirements, most general purpose GC tend to care about neither.
The only deterministic thing is the call point. Herb points out that, just like calling malloc()/free(), it is not possible to deterministically assert the execution time, even more so with complex data structures, where not only is there a domino effect of destructors being called, there is also the possibility of causing stack overflows if the destructors are written incorrectly. He uses this to build an argument for presenting his deferred_ptr class, which is basically a GC in disguise.
That is an obvious thing, and totally besides the point. Also, probably GC only helps little against this. Sure, actually calling the destructor can be delayed in order to even out latency. You can do the same with RAII.
Sometimes you can't do that though because you have certain constraints on the execution order of whatever routines you need to run. This is completely independent of whether you use manual cleanup or GC or RAII, however with GC it is weird/hard/impossible to synchronize the GC with the non-cleanup code.
I had an AAA project hitting 30ms pauses in the platform vendor's free() implementation, on a quite popular gaming platform, in 2016. Of course it's anecdata, but getting actual deadline guarantees out of implementations is really hard. I had to replace the malloc/free implementation about 3 months before the ship date to avoid the issue.
What? RAII is totally deterministic; not sure what you want to contend here. CppCon 2015, Herb Sutter, "Writing Good C++14... By Default": not sure where I'm going to find my answer in there.
Anyway, for complex state that must be cleaned up, maybe flushed before etc., you'll always find a lot of manual cleanup code, or otherwise find code with lots of random problems, often unreported. That is true both for GC languages as well as C++ and other languages with RAII or RAII-likes.
Unless you can statically assert how long each destructor takes, it is not deterministic, even less so if ownership is handed to background threads for cleanup, like in C++/WinRT.
What I want to say here is the most basic thing. The order in which the destructors get called (also in the context of the other code) is deterministic. GC doesn't give you that. It should be obvious what I said.
I can't imagine what LTO should have to do with that either, but if there's any substance here please enlighten me.
The order in which destructors get called is well defined (and generally useful insofar as it's the reverse order of construction). Can't you just accept a simple argument? I'm not even a huge fan of RAII, but it has its uses.
This isn't really about constructors/destructors. Expressions like function calls with multiple arguments have always been "unsequenced" with respect to each other. In other words the order is left to decide for the compiler. It's always been like that, going back to C (and probably other languages). If you call f(x++, x++) what values get passed to f is unspecified.
I suppose the destruction of whatever the expressions constructed still happens in reverse order of construction.
But either way I might not even care, I'm aware that at the level of a single statement the execution sequences aren't much defined, so I rarely put more than one mutating thing per expression, or otherwise I'm sure that I don't care about the order -- I could live with any sequence as well as totally parallel execution.
Example: buf[i++] = 5 has 2 mutating sub-expressions, but I know it's not messing up anything. I don't care whether i gets incremented before 5 gets assigned or the other way around.
In modern C++, you no longer need to manually manage your memory (except when writing custom containers and low-level libraries). Memory is managed automagically, no garbage is generated and no collection is necessary.
Of course, you can also do manual allocations if you really wanted to, but - with the changes in the language since C++11 - you just don't have the motivation to do it.
I guess doing it this way allows for mostly deterministic collection (which is then just an implementation feature, not language semantics) while the actual tracing GC doesn't need to be that efficient, as it presumably needs to be executed less often.
Ref counting is a GC algorithm, and the slowest at that (especially with multi-core sync, but that's not the case with Python), as a tradeoff for less memory consumption.
But you must understand, Python is safe and C is difficult. Therefore, to be fast we need Python (fast development, that is, because nobody _really_ cares about performance (or maintainability)).
Sadly it was not just the GC, there were plenty of problems in Mono itself that adversely affected GC performance.
A significant one was a problem in the way that the Mono C# compiler generated foreach loops, which caused it to allocate an object for the enumerator that the regular .NET compiler was able to avoid. This caused foreach() to produce unnecessary garbage and resulted in a lot of Unity developers writing manual for() indexing loops instead of a more straightforward foreach() loop.
Another problem was a bug in the Mono standard library where one of the array containers -- I forget whether it was ArrayList or List<T> -- would fail to clear elements at the end on removals. This was innocuous for arithmetic types, but for reference types it would result in hidden entries in unused capacity holding onto stale references. In some cases this caused rather large trees of objects to persist unnecessarily, pinned invisibly by containers that had already been deleted from or were even empty.
I assume that real-time computing should have used real-time-compatible techniques. But luckily those exposed to something other than games and game development might feel less aversion towards GC, probably even the other way around.
Another reason being that many people don't get GC theory, nor do they get that just because a language has some form of automatic resource management, it doesn't mean the language doesn't offer other mechanisms.
I have a library which has an extremely slow free, around 2 minutes for large files, because of unnaturally scattered allocation patterns, but this old conservative GC didn't help at all; it was about 40% slower with libgc. mimalloc was a bit better. Best would be an arena allocator without free, or at least a properly fast GC, like MPS https://github.com/Ravenbrook/mps, but this would be too much work.
If you have a program that takes 2 minutes to free allocations, that implies 100 million to 1 billion small allocations. An arena is not the answer, a normal data structure is (most likely a vector). I'm not sure an arena is ever really the answer, because it seems like a poor work around to multiple small allocations, which shouldn't happen in the first place.
Maybe that time is coming from something even more ridiculous like scanning an entire linked list to find each element, but if you can open up that library there are without a doubt terrible memory allocation techniques that should never happen going on.
> An arena is not the answer, a normal data structure is (most likely a vector). I'm not sure an arena is ever really the answer, because it seems like a poor work around to multiple small allocations, which shouldn't happen in the first place.
The only way I can make sense of this claim is if you're saying the items he needs to operate on should be inlined into a vector. But that's basically just an arena.
My point was that a vector is like an arena but better in practicality because if someone is 'allocating' from an arena, they are then storing that pointer somewhere. In a vector you can store the index somewhere, but you have the much better option of looping through everything and dealing with it contiguously instead of dealing with each element in some isolated way.
Also a lot of arena implementations are basically linked lists inside a vector and freeing up the memory means jumping around in memory while dereferencing the pointers to unwind the individual allocations. The advantage is that you can free individual cells of the arena, but the cost of allowing that fragmentation is high. It's much better to avoid that and with hundreds of millions of elements it's likely that the data should be dealt with by just allocating huge chunks of memory and looping through them.
That's not an arena. The whole point of an arena is deallocation of all elements allocated from that arena is a constant time bulk operation.
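A minimal bump-pointer arena sketch to make the constant-time point concrete (illustrative, not a production allocator): each allocation is a pointer bump, and "freeing everything" is resetting a single offset.

  #include <cstddef>
  #include <cstdlib>

  struct Arena {
      char*       base;
      std::size_t capacity;
      std::size_t used;
  };

  Arena arena_create(std::size_t capacity) {
      return Arena{ static_cast<char*>(std::malloc(capacity)), capacity, 0 };
  }

  void* arena_alloc(Arena& a, std::size_t size, std::size_t align = alignof(std::max_align_t)) {
      std::size_t p = (a.used + align - 1) & ~(align - 1);  // align the bump offset
      if (p + size > a.capacity) return nullptr;            // out of space; no growth in this sketch
      a.used = p + size;
      return a.base + p;
  }

  void arena_reset(Arena& a)   { a.used = 0; }              // drop everything: O(1)
  void arena_destroy(Arena& a) { std::free(a.base); a.base = nullptr; }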
I've seen both, you can use empty cells to link to the next empty cell and the head pointer to link to the first empty cell in the chain so that the cells can be arbitrarily freed, but the option is always there to do what you said and do a single bulk deallocation of the actual heap memory allocation instead of dealing with individual cells.
My point in all this is the same thing you have been implying. There aren't many differences in techniques so when people are talking about 'arena allocators' most of the time they should just be using a vector.
I once wrote a non-conservative generational GC for C++ just as an exercise, with the constraint that I could only use standard C++ (it was in C++17).
It worked based on a template type I called "gc_ptr<T>". So you could create one of these with the function template "make_gc(...)". This was the only way you could allocate a new garbage-collected object.
"make_gc" would check if a type descriptor for the type was already initialized and if not it would allocate a new one. It would then set a thread_local with the descriptor and the address of the current object being created.
Then it would call the object constructor.
If the type being instantiated had gc_ptr members, these would be initialized by their constructor. The gc_ptr constructor would, in turn, check the descriptor of the parent object and add a record to it representing the member. The record would store the member offset calculated from its address and the one of the parent object.
Garbage-collected objects would also use reference counting for gc_ptr(s) on the stack or inside objects that were not garbage-collected. It's easy to know if a gc_ptr is being constructed inside a garbage-collected object or not: If it is, then the code is also executing inside make_gc, so I just need a thread_local flag.
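A minimal sketch of the registration scheme described so far (C++17; my reconstruction with made-up names, not the original code):

  #include <cstddef>
  #include <new>
  #include <utility>
  #include <vector>

  struct type_descriptor {
      bool initialized = false;
      std::vector<std::ptrdiff_t> member_offsets;  // offsets of gc_ptr members inside the object
  };

  struct construction_context {
      type_descriptor* desc = nullptr;
      void*            object = nullptr;           // object currently being built by make_gc
      std::size_t      size = 0;
  };
  inline thread_local construction_context current_construction;

  template <typename T>
  struct gc_ptr {
      T* raw = nullptr;
      gc_ptr() {
          auto& ctx = current_construction;
          char* self = reinterpret_cast<char*>(this);
          char* obj  = static_cast<char*>(ctx.object);
          if (obj && self >= obj && self < obj + ctx.size && !ctx.desc->initialized)
              ctx.desc->member_offsets.push_back(self - obj);  // we are a member: record our offset
          // otherwise: a root (stack or non-GC object) -> handled via reference counting
      }
  };

  template <typename T>
  type_descriptor& descriptor_for() { static type_descriptor d; return d; }

  template <typename T, typename... Args>
  gc_ptr<T> make_gc(Args&&... args) {
      type_descriptor& desc = descriptor_for<T>();
      void* mem = ::operator new(sizeof(T));        // stand-in for the custom GC arena
      construction_context saved = current_construction;
      current_construction = { &desc, mem, sizeof(T) };
      T* obj = new (mem) T(std::forward<Args>(args)...);
      current_construction = saved;
      desc.initialized = true;
      gc_ptr<T> p;
      p.raw = obj;
      return p;
  }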
For the mark and sweep, I would use an atomic flag to mark the GC as running. If the flag was set, then all gc_ptr would stop the thread (by waiting on a mutex) as soon as some code would try to swap/reassign them.
This means that code that didn't touch them would be kept running during garbage collection.
I would start marking objects from those allocated in the custom arena I used that had a reference count different from zero.
I wanted to add containers that could allocate contiguous GC'ed objects by wrapping the standard containers but never finished that.
GC'ed containers of gc_ptr(s) just worked fine.
I just re-ran the benchmarks I wrote to compare it against the Boehm-Demers-Weiser GC. It is quite a bit faster at the mark-and-sweep (up to an order of magnitude faster with thousands of objects and a bit less than 100 MB of GC'ed memory in total).
However, because of the atomic check, it was slower in swapping pointers by roughly an order of magnitude.
> In our experience, the only examples we have found of a failure with the current collector, even in multi-threaded code, were contrived.
I once found a real incompatibility. This collector assumes that a pointer to a block must point to somewhere in the block, or, in later versions, at least somewhere really close.
The original Numerical Recipes in C was a conversion of Numerical Recipes in FORTRAN. FORTRAN arrays start at 1. So, NR in C had an array allocator which returned an array address computed such that subscripts starting at 1 would work.[1] If you allocated a 100x100 array of 4-byte f32 values, the array address would be offset downward by (100+1)*4 bytes. That would not look like a pointer to the array to the garbage collector.
By the third edition, in 2007, the Numerical Recipes people had converted to C++, dropped FORTRAN array compatibility, and the problem went away.
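For anyone curious, the idiom looked roughly like this (from memory, not the book's exact code), and it's easy to see why a collector that only recognizes pointers into (or near) the block never sees the shifted pointer:

  #include <stdlib.h>

  /* Allocate a vector usable as v[nl..nh], FORTRAN-style. */
  float *nr_vector(long nl, long nh) {
      float *block = (float *)malloc((size_t)(nh - nl + 1) * sizeof(float));
      return block - nl;   /* e.g. block - 1 for 1-based indexing; the returned
                              pointer no longer points into the allocated block */
  }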
My experience with C++ isn't that extensive, but what is the use-case of a garbage collector in this language? I always had the impression that, with the well-defined lifetimes of objects, you wouldn't really create garbage to begin with, but I guess there are some use-cases I don't yet know about?
It's pretty useful. Chrome uses one extensively for example (called Oilpan). So does Unreal Engine. GC in C++ is much more widely used than people realize.
The problem is that big programs in the real world often don't have well defined lifetimes for everything. The idea that everything in your app can have its lifetime worked out in advance isn't true when you're modelling the world (a world), or lifetimes are controlled by other people (e.g. website authors).
Generally what you see is that these apps start out trying to do manual memory management, decide it's too difficult to do reliably at scale, and switch to a conservatively GCd heap with a custom collector or Boehm.
Note that Rust doesn't fix this problem. Rust just encodes the assumption of pre-known lifetimes much more strongly in the language. If you're not in such a domain then you have to fall back to refcounting, which is (a) slow and (b) easy to get wrong such that you leak anyway. Early versions of Rust used refcounting for everything and iirc they found anywhere between 10-15% of the resulting binaries was refcounting instructions!
> Note that Rust doesn't fix this problem. Rust just encodes the assumption of pre-known lifetimes much more strongly in the language. If you're not in such a domain then you have to fall back to refcounting, which is (a) slow and (b) easy to get wrong such that you leak anyway. Early versions of Rust used refcounting for everything and iirc they found anywhere between 10-15% of the resulting binaries was refcounting instructions!
Well, modern idiomatic Rust only uses Arc/Rc on the few objects where it's needed, so the overhead of reference count adjustment is so tiny as to never show up. You typically only see reference count traffic be expensive when either (a) everything is reference counted, as in ancient Rust; or (b) on super-inefficient implementations of reference counting, as in COM where AddRef() and Release() are virtual calls.
Right, but that's what I mean by encoding the assumption of known lifetimes in the language. The design is intended for cases where most lifetimes are known, and only a few need ref counting, and you can reason about lifetimes and relationships well enough in advance to avoid cycles. At least, that's my understanding (my Rust-fu is shamefully poor).
> Early versions of Rust used refcounting for everything and iirc they found anywhere between 10-15% of the resulting binaries was refcounting instructions!
Do you happen to have a citation for this? I don’t remember ever hearing about it, but it’s possible this was before my time, as I started in the smart pointers era.
The Rust compiler does this. Even so, 19% of the binary size in rustc is adjusting reference counts.
I am not exaggerating this. One-fifth of the code in the binary is sitting there wasted adjusting reference counts. This is much of the reason we're moving to tracing garbage collection.
It's interesting how many strategies Rust tried before settling on linear types.
> It's interesting how many strategies Rust tried before settling on linear types.
Rust doesn’t actually have linear types. I’m not sure what Rust’s types are called (affine?), but linear types are the “must be consumed” (can’t leak) types, and Rust doesn’t have any support for this.
Rust’s guarantee is that you MUST NOT use an object after dropping it. Linear types would add the additional requirement that you MUST drop the object.
Due to the nature of web engine workloads, migrating objects to being GC'd isn't performance-negative (as most people would expect). With care it can often end up performance-positive.
There are a few tricks that Oilpan can apply. Concurrent tracing helps a lot (e.g. instead of incrementing/decrementing refs, you can trace on a different thread), in addition when destructing objects, the destructors typically become trivial meaning the object can just be dropped from memory. Both these free up main thread time. (The tradeoff with concurrent tracing is that you need atomic barriers when assigning pointers which needs care).
This is on top of the safety improvements you gain from being GC'd vs. smart pointers, etc.
One major tradeoff is that UAF bugs become more difficult to fix, as you are just accessing objects which "should" be dead.
> Are you referring to access through a raw pointer after ownership has been dropped and then garbage collection is non deterministic?
No - basically objects sometimes have some state of being "destroyed", e.g. an Element detached from the DOM tree[1]. Other parts of the codebase might have references to these objects, and previously, accessing them after they were destroyed would be a UAF. Now it's just a bug. This is good! It's not a security bug anymore! However, it's much harder to determine what is happening, as it isn't a hard crash.
Not sure early versions of rust is the best example of refcounting overhead. There are a bunch of tricks you can use to decrease that, and it usually doesn't make sense to invest too much time into that type of thing while there is so much flux in the language.
Yeah I was thinking the same thing. "10 years ago the Rust compiler couldn't produce a binary without significant size coming from reference counts after spending minimal effort to try and optimize it" doesn't seem like an especially damning indictment of the overall strategy. Rust is a language which is sensitive to binary size, so they probably just saw a lower-effort, higher-reward way to get that size back and made the decision to abandon reference counts instead of sinking time into optimizing them.
It was probably right for that language at that time, but I don't see it as being a generalizable decision.
Swift and ObjC have plenty of optimizations for reference counting that go beyond "Elide unless there's an ownership change".
it's worth noting that percent of instructions is a bad metric since modern CPUs have lots of extra compute, so adding simple integer instructions that aren't in the critical path will often not affect the wall time at all.
The problem is that once threading gets involved they have to be interlocked adjustments and that's very slow due to all the cache coherency traffic. Refcounts are also branches that may or may not be well predicted. And you're filling up icache, which is valuable.
You could have a hybrid refcount where you use basic integers when adjusting the ref count on the current thread and atomics to adjust the global count when you hit 0 (hybrid-rc is the crate you can use, but something Swift-like where the compiler does ARC for specific values when you opt in may not be the worst). Also, when the type isn't Send you don't need to do atomic refcounts, although the interaction with unsafe does get a little more complex.
But yeah, at this point Rust's niche is as a competitor to C/C++, and in that world implicit refcounting doesn't have a lot of traction; people favor explicit GC and "manual" resource management (RAII mechanisms like Drop and destructors are OK).
> what is the use-case of a garbage collector in this language?
Same as other languages.
> I always had the impression that with the well-defined lifetimes of objects, you wouldn't really create garbage to begin with
There's no well defined lifetime of objects when it comes to dynamic allocation. For example, if you allocate something with the new keyword, there are no language guarantees that this won't leak memory.
I'm using C++ to build jank, a native Clojure dialect on LLVM with C++ interop. All of Clojure's runtime objects are dynamically allocated and the churn of reference counting is far too slow, compared to garbage collection. I had originally started with an intrusive_ptr and atomic count and the Boehm GC was about 2x faster for that benchmark (and at least as much for every later benchmark).
Even outside of writing languages on top of C++, if you're using something like immer, for persistent immutable data structures in C++ (as jank does), it has memory policies for reference counting or garbage collection. This is because immutable data generally results in more garbage, even when transients are used for the most pathological cases. That garbage is the trade off for knowing your values will never be mutated in place. The huge win of that is complete thread safety for reading those values, as well as complete trust on reproducibility/purity and trivial memoization.
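For a feel of why the churn happens, here's a tiny immer example (real library: https://github.com/arximboldi/immer); every "update" returns a new value, and the old version becomes garbage as soon as nothing references it:

  #include <immer/vector.hpp>

  int main() {
      immer::vector<int> v0;
      auto v1 = v0.push_back(1);   // v0 is untouched; structural sharing under the hood
      auto v2 = v1.push_back(2);   // each superseded version can become garbage
      return v2.size() == 2 ? 0 : 1;
  }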
"I don't like garbage. I don't like littering. My ideal is to eliminate the need for a garbage collector by not producing any garbage. That is now possible."
I think a mix of methods is best. We can use non-GC methods for whatever is easiest to prove safe. In Python, they then use reference counting by default. Then, a GC when that would be a problem. I also prefer when the GC lets me tell it to immediately delete something, which it then just checks incrementally. Then, performance-sensitive modules can be coded without GCs using tools like Meta's Infer, which work on large modules.
For the GC component, there's two options I'd like to see more exploration of. One is a modern, concurrent GC to reduce the pauses people here are worried about. Immix [1] is an interesting design I remember.
The other option is a runtime supporting multiple, app-specific GC's without much tuning. JX OS was the first variation of that I saw. I just stumbled on this other paper. So, we have a global, memory manager divided up into chunks managed differently across apps or components.
In C++ you don't need a GC. Use RAII. Objects only need a constructor, destructor and a copy-constructor for a very efficient reference-counting GC that's almost impossible to leak.
In C you can construct your hackish tracing GC yourself. For root objects save the stack address at the beginning of main() by saving the address of a dummy stack variable.
int dummy=0;
void* stack_head=&dummy;
and later during GC get the current stack address again and trace from the old one to the new one. You also need to save the registers, and you can do that in a semi-portable way using the Unix setjmp()/longjmp() functions, which save the register values into a structure that you can then trace later. There's also the issue of split stacks to keep in mind. But the positive thing is that memory alignment ensures pointers are all 8-byte aligned, so you only need to check addresses where address % 8 == 0 in between. It's a bit hackish regardless.
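A sketch of the whole thing put together (scan_range and mark_if_heap_pointer are hypothetical hooks into your collector, not standard functions):

  #include <csetjmp>
  #include <cstdint>

  void* stack_head;                                  // &dummy saved at the start of main()
  void mark_if_heap_pointer(std::uintptr_t word);    // hypothetical: mark word if it's in our heap

  static void scan_range(void* lo, void* hi) {
      // Treat every pointer-sized word in [lo, hi) as a potential pointer.
      for (char* p = static_cast<char*>(lo);
           p + sizeof(void*) <= static_cast<char*>(hi); p += sizeof(void*))
          mark_if_heap_pointer(*reinterpret_cast<std::uintptr_t*>(p));
  }

  void gc_scan_roots() {
      std::jmp_buf regs;
      setjmp(regs);                                  // spills register values into regs
      scan_range(&regs, reinterpret_cast<char*>(&regs) + sizeof regs);

      int dummy = 0;
      void* stack_tail = &dummy;                     // current stack position
      if (stack_tail < stack_head)                   // the stack usually grows downward
          scan_range(stack_tail, stack_head);
      else
          scan_range(stack_head, stack_tail);
      // ...then trace from whatever got marked through the heap.
  }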
I built a few apps using the above technique but now am more comfortable with old styled manual memory management. Valgrind will tell you exactly which line you allocated memory that leaked.
In practice, if you care about performance, one ends up creating a pseudo-GC like Herb Sutter has shown in his CppCon talk about deferred_ptr, or why hazard pointers are coming to C++26.
Then there is the point that RAII handles have to be written in an exception-safe way, and also be movable if used as part of other data structures.
I wouldn't call hazard pointers garbage collection.
Also, I don't think that's necessarily as critical as you suggest. I've been working with performance-critical software for many years and never saw a strong need for general-purpose concurrent reclamation techniques.
The RAII/ref-counting technique often works great, but not always. Cyclic structures are a problem -- so, if you've got, say, a global tree structure with bidirectional parent/child links, you'll need some kind of special-casing to make ref-counting clean up an entire abandoned subtree. A full GC does take care of cases like this.
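The usual special-casing in C++ looks like this: own the children with shared_ptr and make the back link a weak_ptr, so an abandoned subtree's counts can actually hit zero.

  #include <memory>
  #include <vector>

  struct Node {
      std::weak_ptr<Node>                parent;    // non-owning back link: breaks the cycle
      std::vector<std::shared_ptr<Node>> children;  // owning links
  };

  std::shared_ptr<Node> add_child(const std::shared_ptr<Node>& parent) {
      auto child = std::make_shared<Node>();
      child->parent = parent;
      parent->children.push_back(child);
      return child;
  }
  // If both directions were shared_ptr, dropping the last external reference to the
  // root would leak the entire subtree; a tracing GC has no such problem.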
This thread brings to mind an issue I’ve wondered about:
For hard real-time systems, do people use specialized CPUs and hardware? Caches can introduce hard-to-predict timing issues. I'm sure there are other aspects of modern hardware that present similar problems; large pipelines? Speculative instruction execution? Does Intel Time-Coordinated Computing address some of these hardware issues? Are there embedded CPUs from TI, Intel, AMD, etc. that are specifically designed to support predictable and understandable "lockstep" hard real-time behavior?
It's an interesting question and one I'm unqualified to answer. From a brief survey I gather at least caching is enabled for some hard real time systems. The name of the game seems to be employing caches in ways that provably decrease worst case execution time. For example, ignoring associativity, if your task's worst case working set fits into a given cache level, you can reason that the task misses each line at most once. That's almost certainly a dramatic improvement in wcet over cache disabled.
Worst case execution time with pipelining should be similarly measurable given a worst case execution (which seems in any case necessary to demonstrate a bound on the task, even in a lockstep system).
I guess garbage collection has its uses, but it's very niche. Since C++11 (i.e. 2011) (STL, unique_ptr, shared_ptr), any need for it in C++ has all but disappeared. For the most part you are using STL data structures, and if you need something custom then just use a class with a smart pointer member as the owner, and you get memory management for free (and a lot more performant than a garbage collector).
There are a lot of cases where unique and shared pointers are no good solution, and where a GC would help (e.g. when there is no clear ownership hierarchy but a lot of mutual dependencies between objects). And not to forget that smart pointers come with a non-negligible cost.
shared_ptr essentially provides reference counted garbage collection. Sure there's a small overhead to smart pointers, but in 2024 it's a very odd application where you are compute bound and pointer dereferencing is the reason!!
Garbage collection has an overhead too of course, since there is a major cost to discovering what can be freed, while the cost of doing this with smart pointers is zero.
Not so small, and it has the potential to significantly slow down an application when not used wisely. Here are e.g. some measurements where the programmer used C++11 and did everything with smart pointers: https://github.com/smarr/are-we-fast-yet/issues/80#issuecomm.... There was a slowdown of between 2x and 10x compared with the bare-bones C++98 implementation. Also remember that smart pointers create memory leaks when used with circular references, and there is an additional memory allocation involved with each smart pointer.
> Garbage collection has an overhead too of course
The Boehm GC is surprisingly efficient. See e.g. these measurements: https://github.com/rochus-keller/Oberon/blob/master/testcase.... The same benchmark suite as above is compared with different versions of Mono (using the generational GC) and with the C code (using the Boehm GC) generated by my Oberon compiler. The latter is only 20% slower than the native C++98 version, and still twice as fast as Mono 5.
If your application is fighting the language you've selected so hard, then maybe the language choice was a poor one to begin with. Modern C++ is a very high level language - very powerful when it's a good match for your needs.
If you need zero overhead memory management then garbage collection isn't going to be the answer either - you need application specific arena allocators, reuse pools, etc. You seem to be focusing on a very odd set of performance requirements where the overhead of garbage collection is just right, C++ too inefficient, and the need to go full custom not needed.
> You seem to be focusing on a very odd set of performance requirements
It's a standardized benchmark suite used to compare different languages, thus focusing on similarity and comparability of the implementation. It's especially suited to compare GC and non-GC implementations.
> Garbage collection has an overhead too of course, since there is a major cost to discovering what can be freed, while the cost of doing this with smart pointers is zero.
Run-time cost for GC can be minimized, if not completely obviated, by having the collector run on a processor other than those allocated for system use. One such example of this strategy is the JVM Concurrent Mark Sweep[0] GC algorithm.
The other poster in this thread is advocating for GC as an alternative to C++ smart pointers in cases where the overhead of these is unacceptable, so giving up an entire processor/core to run GC would be a non-starter (in this performance-driven use case).
> where a GC would help (e.g. when there is no clear ownership hierarchy but a lot of mutual dependencies between objects)
Not really. This is a straw man argument.
Having a large graph of objects with mutual dependencies represented as references to each other is almost always a code smell and a design problem.
It is terrible for cache locality, terrible for memory fragmentation, and will bring even the best GC to its knees.
The way to handle that is to use an Entity Component System [^1] or an SoA structure with indices.
In fact, every mature game engine or serious graph-representation library (which are specifically designed to deal with these cases) will do exactly that.
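A tiny sketch of what "indices instead of pointers" looks like (illustrative field names):

  #include <cstdint>
  #include <vector>

  struct Nodes {
      // One array per field (SoA); relations are stored as indices into these arrays,
      // not as pointers, so the whole graph lives in a few contiguous allocations.
      std::vector<float>         mass;
      std::vector<std::uint32_t> parent;       // index of the parent node
      std::vector<std::uint32_t> first_child;  // index of the first child (or a sentinel)
  };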
> I guess garbage collection has its uses, but it's very niche. Since C++11 (i.e. 2011) (STL, unique_ptr, shared_ptr), any need for it in C++ has all but disappeared.
At least one situation where GC can greatly simplify C++ logic is when implementing lock-free collections. Having a GC available is not a requirement, however, as witnessed by Boost.Lockfree[0].
As others have pointed out, the smart pointer types you identified can be employed in many situations, but are not a universal replacement for all memory management concerns.
1: https://jank-lang.org/
2: https://www.mmtk.io/code