Garbage Collection (2017) (craftinginterpreters.com)
230 points by tosh 12 days ago | 108 comments

An interesting story about generational garbage collectors: a couple of versions ago, Lua experimented with a generational GC. It didn't improve performance, so they kept the old incremental collector. A long time later, they looked at it more closely and realised that the actual problem was that the generational collector was moving objects out of the nursery too eagerly. Objects on the stack are live and will survive if the GC runs, but many of them stop being live as soon as their function returns. They fixed the issue by letting stack objects that survive one round of GC stay in the nursery. The new GC is now much faster than the old one, and is one of the big features of the upcoming Lua 5.4.
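A toy sketch of that aging policy (an illustration, not Lua's actual implementation): objects must survive a minimum number of minor collections before being promoted, so values that merely happen to be live during one collection aren't promoted too eagerly.

```python
MIN_AGE = 2  # minor GCs an object must survive before promotion

class ToyGenGC:
    def __init__(self):
        self.nursery = {}    # object name -> age (minor GCs survived)
        self.old_gen = set()

    def allocate(self, name):
        self.nursery[name] = 0

    def minor_collect(self, live):
        survivors = {}
        for name, age in self.nursery.items():
            if name not in live:
                continue                   # dead: reclaimed
            if age + 1 >= MIN_AGE:
                self.old_gen.add(name)     # old enough: promoted
            else:
                survivors[name] = age + 1  # stays in the nursery, one GC older
        self.nursery = survivors

collector = ToyGenGC()
collector.allocate("a")
collector.allocate("b")
collector.minor_collect(live={"a", "b"})  # both survive once: still in nursery
collector.minor_collect(live={"a"})       # "a" promoted; "b" died young
print(sorted(collector.old_gen), list(collector.nursery))  # ['a'] []
```

With MIN_AGE set to 1 instead, "b" would have been promoted on the first collection and never reclaimed by a minor GC — the eager-promotion problem described above.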

When using generational GCs it is of utmost importance to match the policies for moving things between generations to the actual pointer interconnection pattern.

This can go really badly, for example in systems that keep caches/repositories of preallocated things for faster re-use. If they are anchored at global, or near-global, variables, then you need to treat those preallocated things specially. You can't just go and move them every time, even when you know you will never collect them.

Generally, GC works better if you don't do tricks/optimizations with memory allocation and just let everything flow freely into the heap. If you do have to optimize allocation, you generally have to teach your GC about your hack.

One consequence of this is that when you do in-process caching, you trade improved average-case performance for degraded worst-case behavior, unless the GC authors have taken some pains to deal with pointer chasing on writes to the old generation.

I'm kind of a fan of out-of-process, same-system caches for this reason.

Interesting! Is there a more detailed post about this?

I wonder how they decide which objects are rooted only by the stack. Is it part of the write barrier, or computed at mark time?

I think it just forces objects to survive more than one collection before moving out of the nursery, without any special logic for whether they were rooted on the stack or not. But I am not 100% sure, so don't quote me on that.


Great article, but I'm curious why automatic reference counting (ARC) and smart pointers never seemed to really catch on outside of Objective-C and Swift:



They almost "just work", except for circular references:


I'd like to see some real-world studies on what percentage of unclaimed memory is taken up by orphaned circular references, because my gut feeling is that it's below 10%. So that really makes me question why so much work has gone into various garbage collection schemes, nearly all of which suffer from performance problems (or can't be made realtime due to nondeterministic collection costs).
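For what it's worth, CPython makes the cycle problem easy to observe: reference counting alone never reclaims a cycle, and the separate tracing cycle detector has to step in. A small demonstration (CPython-specific behavior):

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.other = None

# A two-node cycle: each node keeps the other's refcount above zero,
# so pure reference counting can never free them.
a, b = Node(), Node()
a.other, b.other = b, a
alive = weakref.ref(a)  # observe liveness without holding a reference

del a, b
assert alive() is not None  # refcounts never reached zero: still alive

gc.collect()                # the tracing cycle detector steps in
assert alive() is None      # now reclaimed
```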

Also I can't prove it, but I think a major source of pain in garbage collection is mutability, which is exacerbated by our fascination with object-oriented programming. I'd like to see a solid comparison of garbage collection overhead between functional and imperative languages.

I feel like if we put the storage overhead of the reference count aside (which becomes less relevant as time goes on), then there should be some mathematical proof for how small the time cost can get with a tracing garbage collector or the cycle collection algorithm of Bacon and Paz.

> They almost "just work", except for circular references

Well, and they're often slower, since they require mutating the reference count every single time a reference is stored. You can optimize that using lazy reference counting, or by not updating refcounts for references on the stack and instead scanning the stack before a collection. But at that point... you're halfway to implementing a tracing collector.

Every refcounting implementation eventually gets "optimized" to the point that it has most of a tracing collector hiding inside of it. I think it's simpler, cleaner, and faster to just start with tracing in the first place.

There's a reason almost no widely used language implementation relies on ref-counting. CPython is the only real exception and they would switch to tracing if they could. They can't because they exposed deterministic finalization to the user which means now their GC strategy is a language "feature" that they're stuck with.

That being said, ref-counting is fine for other kinds of resource management where resources are pretty coarse-grained and don't have cyclic references. For example, I think Delphi uses ref-counting to manage strings, which makes a lot of sense. Many games use ref-counting for tracking resources loaded from disc, and that also works well. In both of those cases, there's nothing to trace through, and the overhead of updating refcounts is fairly low.
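That coarse-grained pattern can be sketched with a hypothetical asset cache (the class and the fake "load" are made up for illustration): handles are shared by name, and the last release frees the resource. No tracing is needed because the assets don't point at each other.

```python
class AssetCache:
    """Hypothetical refcounted cache of resources loaded from disc."""

    def __init__(self):
        self._assets = {}  # name -> (data, refcount)

    def acquire(self, name):
        if name in self._assets:
            data, count = self._assets[name]
            self._assets[name] = (data, count + 1)  # share the existing load
        else:
            data = f"<contents of {name}>"          # stand-in for a disc load
            self._assets[name] = (data, 1)
        return data

    def release(self, name):
        data, count = self._assets[name]
        if count == 1:
            del self._assets[name]                  # last user: free it
        else:
            self._assets[name] = (data, count - 1)

cache = AssetCache()
cache.acquire("hero.png")
cache.acquire("hero.png")               # second user shares the same load
cache.release("hero.png")
assert "hero.png" in cache._assets      # still one user left
cache.release("hero.png")
assert "hero.png" not in cache._assets  # freed on the last release
```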

There's a classic paper that shows that refcounting and GC are in fact duals of each other, where GC is effectively tracing live objects and refcounting is tracing dead objects:


This has a number of important corollaries. One is that they have opposite performance characteristics: GCs make creating new references fast but then create a large pause when RAM is exhausted and a collection happens, while refcounts make creating new references slow and create a large pause when large object graphs are freed all at once. Another is that nearly all high-performance memory management systems are hybrids of the two: lazy refcounting is basically like implementing a local garbage collector and then feeding the live set of that into the refcounts, while generational garbage collection is like implementing a local refcount (the write barrier) and feeding the results of that into the root set.

> There's a reason almost no widely used language implementation relies on ref-counting. CPython is the only real exception and they would switch to tracing if they could. They can't because they exposed deterministic finalization to the user which means now their GC strategy is a language "feature" that they're stuck with.

Reference counting and address-semantics (which prohibits moving objects, so even if you work around the refcount, you can only do a tracing GC, but you don't get to compact) are deeply ingrained into CPython's C API as well, which is very widely used.

For what it's worth, Perl (5) has the same problem as CPython. Accordingly, there are not only heaps of allocation- and ref-counting-avoidance optimizations (e.g. the stack isn't refcounted) but also heaps of bugs from those optimizations. Again, the (main) stack not owning references is the biggest source of these bugs.

The problem with predictable destruction: we typically called it "timely destruction", but "immediate destruction" would have been more accurate. If your language guarantees it, it tends to be great for automatically releasing resources (memory, file handles, ...) in a lexically predictable manner. Softening any such guarantee tends to make entertaining resource races magically appear in the real world. It occurs to me now that I never checked what pypy guarantees, if anything.
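In CPython the guarantee is easy to demonstrate (PyPy explicitly does not promise this; there, the finalizer may run at some later collection):

```python
import weakref

log = []

class FileLike:
    """Stand-in for a resource that should be released promptly."""
    pass

def open_and_drop():
    f = FileLike()
    weakref.finalize(f, log.append, "released")
    # f's refcount drops to zero the moment this function returns,
    # so CPython runs the finalizer right then, deterministically.

open_and_drop()
assert log == ["released"]  # already ran, no gc.collect() needed
```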

We (Perl) never really came up with a great strategy for working around this (you tended to get the worst of both worlds in most naive reimplementations of the behavior in a GC-based interpreter), and while I certainly haven't tried to work out a clean argument to that effect, I'm fairly convinced it won't fly at all.

Raku (née Perl 6) has phasers that will be called when exiting a scope. One such idiom is:

        # do stuff
        my $dbh = DB.connect(...);
        # do stuff
        LEAVE .disconnect with $dbh;
        # do stuff
Whenever the scope is left (this could be because of an execution error, or a `return`), the code of the LEAVE phaser will be executed. The `with $dbh` checks whether the $dbh variable contains an instantiated object. If so, it topicalizes it (sets $_ to it) and calls the disconnect method (`.disconnect` being short for `$_.disconnect`).
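For comparison, roughly the same "run this when the scope is left" idiom can be sketched in Python with contextlib (the connection class here is a made-up stand-in): the callback fires on normal exit, early return, or an exception.

```python
import contextlib

class FakeConn:
    """Hypothetical stand-in for a database handle."""
    closed = False
    def disconnect(self):
        self.closed = True

def query(connect):
    with contextlib.ExitStack() as stack:
        dbh = connect()
        stack.callback(dbh.disconnect)  # registered only once dbh exists
        # do stuff; disconnect runs however we leave this block
        return "result"

conn = FakeConn()
assert query(lambda: conn) == "result"
assert conn.closed  # disconnect ran on scope exit
```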

For more complex uses, and more tuneable finalizing, there is the `FINALIZER` module in the ecosystem: https://modules.raku.org/dist/FINALIZER

Sure, but "schedule an action at scope exit" is an easy problem to solve (assuming your runtime already knows how to unwind the stacks). "Perform escape analysis and guarantee immediate destruction, efficiently, without full GC or refcounting" is not, and they're not the same problem at all.

"There's a reason almost no widely used language implementation relies on ref-counting. CPython is the only real exception"

Perl is also ref counted. Swift too.

Edit: Also Tcl, and PHP is mostly refcounted... the "GC" just deals with circular refs.

Just for the record: Raku (née Perl 6) most definitely does not do refcounting. If you want to do serious async programming, refcounting becomes a nightmare to get right and/or a source of potential deadlocks.

Objective-C and then Swift are surely the most serious goes at fast widely used RC languages. Neither requires mutating the reference count on every store, and neither seems at risk of growing a tracing GC.

ObjC had some elision tricks like objc_retainAutoreleasedReturnValue, but more importantly the optimizer was taught about RC manipulation. Swift then extended this with a new ABI that minimizes unnecessary mutations.

The big advantages of this scheme are efficient COW collections and a simpler FFI (very important with Swift). More generally RC integrates better. Imagine teaching the JS GC to walk the Java heap!

> ObjC had some elision tricks like objc_retainAutoreleasedReturnValue, but more importantly the optimizer was taught about RC manipulation.

Cough. Actually, before ARC the ownership rules really kept RC traffic very low, and you could drive it lower still if you knew a bit about how ownership worked in your code ("these will balance and the object is already owned elsewhere...").

ARC just went mad with trying to give guarantees it really had no business making, and then hoping the compiler would be able to remove them again (difficult with message sending).

This caused some amusing bugs, for example a crash in the following method:

    - someMethod:arg and:arg2
    {
        return 0;
    }
How could this possibly crash? All it could possibly do is clear register AX and return.

Well, with ARC enabled and in debug mode, it inserted code to first retain all the arguments, and then immediately release all the arguments. Apparently one of the args was a bogus pointer (framework-provided, so out of our control) and so it crashed.

So the result of your instructions that have undefined behavior turned out to be not what you wanted? That’s hardly a problem with the language.

Interesting that

    return 0 
is now considered "undefined behavior".


Obviously the problem is with using objects in the message which are invalid. The compiler may or may not work as you expect and that’s your problem, not the compilers.

You can try to defend yourself by stating the call is done by some framework but that just means the problem is still not in the language, it’s in the framework.

First, the compiler and framework come from the same vendor, so it's perfectly fine to blame them. Unless you consider things like compilers and frameworks to be independent subjects capable of being blamed.

Second, whether the behavior is defined or undefined, the compiler has no business touching arguments that the code didn't tell it to touch. If it does so, it isn't fit for purpose and can be criticised as being unfit for purpose.

And the C standard is very clear and adamant that being in compliance with the standard is not, and in many ways cannot be, equivalent to being fit for purpose.

Last but not least, I think we are probably not going to agree as to whether dereferencing unused arguments is a valid response to the presence of undefined behavior. See


So efficient that Swift's RC implementation loses against all major tracing GC implementations.


That project doesn’t compare GC implementations, so it’s probably not that useful here.

Also, the Swift implementation is a bit questionable if performance is a goal. That is, why not try to remove the memory management from the inner loop? Probably the first thing to try is value types instead of reference types, which are more generally preferred anyway.

Sure it does: memory management impacts the performance of the task being achieved, writing a high-speed network driver.

By that measure, every project is a good measure of GC performance. Is that really a good argument?

I believe all general purpose languages let the code allocate memory and therefore will let you allocate memory in a way inefficient to your task.

The argument is that to eventually write Swift code whose performance matches the competition, one needs to work around ARC's implementation.

Hence why there is so much emphasis on value driven programming alongside protocols at WWDC talks.

My claim was about efficient COW collections and the FFI. The point of COW collections is enabling a simpler programming model.

ARC is just reference counting; nothing about reference counting requires you to write the increments/decrements yourself (that is, just because the compiler inserts them doesn't make it an entirely different thing).

The reason reference counting is not more common is because of its performance. Naive reference counting works in some cases (e.g. when the counts are not modified often), but doesn't work too well for most programming languages, as the reference counts will be modified frequently.

Deferred and coalesced reference counting are two techniques for improving performance, but they come at a cost: they are quite complex to implement, and require additional memory to keep track of what/when to increment/decrement. You will also end up facing similar nondeterministic behaviour and pauses, as an optimised RC collector may still need to pause the program to scan for cycles. You can handle cycles by supporting weak and strong references, but this puts the burden on the developer, and it's not always clear whether cycles may appear when writing code.
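The weak/strong split can be sketched like this in Python: children hold their parent through a weakref, so a parent/child structure never forms a cycle that plain counting would leak (in CPython the parent is freed immediately here).

```python
import weakref

class Parent:
    def __init__(self):
        self.children = []  # strong refs downward

    def add(self, child):
        child.parent = weakref.ref(self)  # weak ref back upward
        self.children.append(child)

class Child:
    parent = None

p = Parent()
c = Child()
p.add(c)
assert c.parent() is p  # back-pointer works while the parent lives

del p
# The parent's refcount hit zero (the child's weak back-pointer
# doesn't count), so CPython freed it at once, no cycle detector.
assert c.parent() is None
```

The burden mentioned above is exactly the choice of which direction gets the weakref: pick wrong and either the back-pointer dangles too early or you're back to a leaked cycle.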

Combining this with a good allocator you can achieve performance that is on par with a tracing collector (http://users.cecs.anu.edu.au/~steveb/pubs/papers/rcix-oopsla...), but it's not easy. If you are going down the path of implementing a complex collector, I think you're better off focusing on a real-time or concurrent collector.

> Also I can't prove it, but I think a major source of pain in garbage collection is mutability, which is exacerbated by our fascination with object-oriented programming. I'd like to see a solid comparison of garbage collection overhead between functional and imperative languages.

I can only speak for Haskell's GC, but it's pretty conventional. You might think that GC is simpler for Haskell without mutability but you'd be wrong, because (1) Haskell actually gives you plenty of safe and unsafe ways to have mutability, safe ways such as the ST monad (not to be confused with the State monad), (2) laziness effectively is mutation: evaluating a value is effectively overwriting the closure to compute the value with the value itself. The lazy list is a pretty common sight in most Haskell code. So basically Haskell has a pretty conventional stop-the-world, generational GC not unlike imperative languages.
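The "laziness is mutation" point can be sketched with a hand-rolled thunk: forcing it computes the value once and then mutates the cell in place, dropping the closure — exactly the kind of write a GC for a lazy language has to cope with.

```python
class Thunk:
    """A delayed computation that overwrites itself with its value."""

    def __init__(self, compute):
        self._compute = compute
        self._forced = False
        self._value = None

    def force(self):
        if not self._forced:
            self._value = self._compute()  # evaluate once...
            self._compute = None           # ...then mutate in place,
            self._forced = True            # dropping the closure
        return self._value

calls = []
t = Thunk(lambda: calls.append("ran") or 42)
assert t.force() == 42
assert t.force() == 42
assert calls == ["ran"]  # evaluated exactly once; the second force is a read
```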

The issue here is that while GC does take some time to perform, it's easy to make the assumption that removing the GC will reclaim the performance lost by the GC. This is a fallacy.

We can compare memory strategies like so:

Garbage Collection:

- Allocations: Very fast (often a single instruction)

- Ownership transfer: Free

- Pointer release: Zero

- Post-free: Garbage collection phase (this is where the time is spent)

Reference Counting:

- Allocations: Similar to malloc(); requires at least a binary search

- Ownership transfer: At least one instruction when single-threaded; multi-threading requires a lock

- Pointer release: Slow; locking issues, memory housekeeping

- Post-free: Zero. Good for realtime.

The first point might require some clarification. When you have a compacting garbage collector, the free memory is usually in a single block. This means that allocations merely require a single pointer to be updated. If the pointer hits the end of the free block, you trigger a GC. You don't even have to check for the end of the free block if the subsequent memory page is marked no-access.
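That fast path can be sketched in a few lines (a toy model: a real collector hands out raw memory, and the limit check may be a page trap rather than a compare):

```python
class BumpHeap:
    """Toy bump-pointer allocator over a single free block."""

    def __init__(self, size):
        self.heap = bytearray(size)
        self.top = 0
        self.limit = size
        self.gc_runs = 0

    def alloc(self, n):
        if self.top + n > self.limit:
            self.collect()   # out of nursery space: trigger a GC
        addr = self.top
        self.top += n        # the whole fast path is this one bump
        return addr

    def collect(self):
        # Stand-in for a real compacting GC: here we just reset.
        self.gc_runs += 1
        self.top = 0

h = BumpHeap(64)
first = h.alloc(16)
second = h.alloc(16)
assert (first, second) == (0, 16)  # consecutive allocations are adjacent
h.alloc(40)                        # doesn't fit: triggers a "collection"
assert h.gc_runs == 1
```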

One can spend a lot of time measuring the performance impact of all these different steps, and I am not going to try to prove that GC is always faster than refcounting, but at least it should be clear that it's not a simple matter of assuming that having no GC means that you will have no memory management overhead.

There's another difference: memory usage. For garbage collection, memory usage is at the minimum only right after a full GC (and there's a memory/performance tradeoff, in that using more memory allows doing less GC), while for reference counting it's always at the minimum. Using more memory for the GC means less memory for other things like the disk cache (which once caused a performance problem at work; we solved it by giving the GC less memory to work with).

(Also, "ownership transfer" can be free in reference counting, since the number of references is the same before and after so there's no need to modify the reference count. What isn't free is sharing ownership, since it needs reference count manipulation.)

Indeed. Memory usage is a point I never addressed in my comment. When managing large Java applications, that is indeed one of the most important issues one has to be aware of. Some runtimes try to be more automatic than others, and there are books' worth of discussion to be had about that.

As for the second paragraph, thank you for clarifying that. I used poor terminology. I should probably have said pointer sharing, or copying.

For ownership transfer to be zero cost, the compiler has to be clever enough to figure out that the original reference isn't used after the copy. This can be handled by the compiler itself, or be enforced by the language (as is the case with Rust, as far as I understand).

Ref counting uses additional memory to store the refcounts. And possibly more for the suspected set of the cycle collector.

GC uses additional memory to amortize collections over allocations. (Zero overhead means constantly collecting, using 2x steady state memory size is not uncommon.)

Overall, I think you're right, refcounting comes out ahead on memory usage. But it's not completely straightforward to determine.

That depends pretty much on how the programming language allows for value types, stack allocation, static global allocation and off-heap access, even in the presence of a GC.

The reason they are not used is not the circular reference problem, it's performance. Single-threaded ARC is passable, but thread-safe ARC is really, really slow. It basically hits all the worst performance pitfalls of modern CPUs.

Performance is really bad if you use Arc for practically everything, as with Swift. If you can deal with the simplest (tree-like) allocation patterns statically as Rust does, and use refcounting only where it's actually needed, you can outperform tracing GC.

> Performance is really bad if you use Arc for practically everything, as with Swift.

Do you have examples where Swift suffers compared to other languages solely because of ARC?

Also, is this something that Apple might theoretically make up for by optimizing their custom CPUs for it, without changing Swift?

> Do you have examples where Swift suffers compared to other languages solely because of ARC?

Literally all of them. Every single program where the ARC system is actually widely used.

Swift otherwise generates really good code, and should be meeting/beating Java most of the time, but if the ARC system is used, performance instantly takes a massive hit. This is very visible in the small synthetic benchmarks at:


In every benchmark where it was possible to write the program so that no references are updated in the hot loop, Swift comes out really impressive, beating Java. In benchmarks where reference updates are unavoidable, performance is more in line with Python.

> Also, is this something that Apple might theoretically make up for by optimizing their custom CPUs for it, without changing Swift?

There are various theoretical ideas that have been bounced about for decades, but so far none of those has panned out. While Apple might do that too, at this point they seem to be chasing the opposite approach:


Essentially, nicking the ownership/lifetimes system out of Rust, to avoid updating references wherever possible. The main difference from Rust will be that the current way of doing things will still be possible and the default, but programmers can opt into more performance by decorating their types with lifetimes.

I am cautiously optimistic for this approach: I love Rust, but learning it did definitely feel like being tossed right into the deep end. Which was filled with sharks. If Swift can succeed in a softer approach of introducing lifetimes, that might help the wider introduction of the concept.

> Every single program where the ARC system is actually widely used.

Interestingly enough, the semi-automatic reference counting mechanism used previously was reasonably fast if used as-is and could be made very fast with a bit of thought.

Yes, here it always arrives last.


Since the Xerox PARC workstations and Genera Lisp Machines, hardware-specific instructions for memory management have consistently proven worse than eventually doing it in software.

How does that compare to SwiftNIO?


SwiftNIO is not a high-speed network driver to start with.

As for memory management impact in overall performance, a similar set of benchmarks would be needed.

What is the difference between Clang's ARC implementation and Rust's std::rc::Rc implementation? Rust's ownership system provides the scaffolding for the Drop implementation that manages reference counting instead of Clang doing it as part of the Swift/ObjC frontend but otherwise, they seem to be very similar.

It's just another tool in Rust's box (no pun intended) but I wouldn't say it hasn't caught on. It's used quite often, along with its multithreaded cousin std::sync::Arc.

One difference is optimization. Rust's Arc is runtime-only, while Clang's ARC is understood by the compiler and optimizer. This allows eliding balanced mutations, and also weird ABIs - functions which return a +1 object, or functions which accept a +1 object. See Arc#Optimization [1].

More generally there's an ABI. While to Rust "Arc is just another tool", reference counting is deeply pervasive on iOS and macOS. You would not attempt to pass a Rust Arc to C, and expect something useful to happen. But Clang's ARC works well with C, and Swift, and with ObjC and even with ObjC-pre-ARC. This isn't a critique of Rust, just reflecting different priorities.

1: https://clang.llvm.org/docs/AutomaticReferenceCounting.html#...

Those compiler optimizations need some work I guess.


Personally, I appreciate you posting this link once, but seeing it this many times gets tedious.

Unfortunately it is hard to give an answer somewhere that all advocates of "Swift ARC performance" will read it, so urban myths keep being spread.

ARC makes sense given Objective-C compatibility, as otherwise a layer similar to RCW on .NET for dealing with COM/UWP would be needed.

And Objective-C only dropped its tracing GC, because it was too unstable while mixing GC enabled frameworks with classical ones, while making the compiler automatically call retain/release would provide better productivity without being unstable.

I understand the urge to reply in multiple places. I think it would be better to write up some more detailed reasoning, like you did here, and link to this from multiple subthreads. Better than just repeating the same ixy link with a new snide comment each time.

Makes sense.

> too unstable while mixing GC enabled frameworks with classical ones

Actually, you couldn't mix them. The GC simply never really worked properly.

True, however Apple's documentation suggested otherwise, that it was possible "with care".

But as you say, it never worked properly.

I think what you are referring to is "mixed mode" frameworks, which were able to work in both GC and RC environments. You had to do a bit of work to get that.

Which is similar but not quite the same thing. Mixing pure GC and pure RC frameworks was never possible. In fact, I think there were some flags and the (dynamic) linker would balk.

Which urban myth do you see being spread in this thread?

ARC outperforming modern tracing GCs.

Literally nobody here has said that.

I could collect statements about Swift compiler performance, how it removes retain/release calls, or how reference counting is so much better than tracing GC, but yeah, let's leave it at that.

Another major user of reference counting is COM and WinRT. A major feature of WinRT projections over just using the raw C APIs is that you get reference counting implemented for you using whatever your higher level language uses for object lifetimes.

> and smart pointers never seemed to really catch on outside of Objective-C and Swift:

Smart pointers are a central feature of C++ even before they were added to the C++ standard in 2011.

Indeed, OWL, MFC and VCL already supported them for COM programming since its early days.

I'd agree that mutability is an issue with GC, in fact some LISPs like Losak are deterministic simply by being entirely functional:


It is true that you can get cyclical refs higher up in the language even with functional code though, and at some level you need a cycle checker.

There is another type of managed memory not talked about much, and that's the model used by Composita- RAII but with defined message interfaces between modules, such that you can deterministically allocate memory for that:


It looks to be the best way of doing deterministic performant memory management, though you'll need a cycle checker in the language runtime.

Whenever the topic of garbage collection comes up, I am reminded of the following excellent reference, https://www.memorymanagement.org/ which puts various different garbage collection (and memory management) techniques into a wider context. It perhaps doesn’t explain some of the newer tricks (eg using overlapping memory mappings to put the colour bits into the high(ish) bits of pointers and get a hardware write barrier only when necessary without needing to move lots of objects and have forwarding pointers).

The reference is provided by Ravenbrook, a consulting company formed from the ashes of Harlequin (who made lispworks, a CL implementation and IDE; MLWorks, the same for SML; and ScriptWorks, a postscript rasteriser which made them all their money). I don’t know when the reference was created.

Why was the title changed to include (2017)? It was published today.

please mail the mods at hn@ycombinator.com for such corrections

This comment seems like something others will want to see. Before the title is corrected (five hours so far), the comment is useful for people who didn't notice the wrong date, or did notice and wondered if anyone had emailed the mods. After the title is corrected, the comment is a useful place to explain why the mistake happened and learn a little about the process that changes the titles.

And if the user is interested in that change to actually happen, contacting the moderators is the best way of doing it. They might not be aware of that, which is why I mentioned it, so that they and others reading the comment know in the future.

Why is this marked 2017? This is the newest chapter of a book that isn't finished yet, and was released literally today. It couldn't be more (2019) if it tried!

Tech moves fast; this morning's bleeding edge is this afternoon's yesteryear, and you don't want to know what happens if you bookmark it to read tomorrow.

Was this a subtle reference to a sweep and mark algorithm?

If you are interested in optimizing generational GC with operating system facilities:


And a review of LLVM's GC facilities: https://medium.com/@MartinCracauer/llvms-garbage-collection-...

Overall message is that there are many different ways to get to the destination, and creative solutions are mixed in.

Do you know if LLVM got good at garbage collection support recently? My understanding was always that its optimisations would mangle your stack/registers so much that a decent incremental exact GC becomes super hard, and that something like keeping one or two pointers to the tip of the heap in dedicated registers is super hard/annoying too.

I don't have anything new. I might have to re-write a GC for Clasp. Then I know :)

A better question would be whether anybody has a moving, precise GC running on LLVM. Clasp with MPS runs, but that might be coincidence/luck.

I'm a bit surprised to find no discussion of reference counting versus tracing GC, only a couple of passing mentions. On the one hand, I suppose this has already been discussed to death. And if the author felt it would be more fun to write and read about actually implementing mark-and-sweep GC, fair enough. On the other hand, reference counting definitely has its place, and I'd be curious to know more about the author's opinion on it, given his background in game scripting languages. Some scripting languages that have been put to good use in games, such as AngelScript [1] and Squirrel [2], use reference counting.

Edit: My mistake; he discussed it back in chapter 3.

[1]: https://www.angelcode.com/angelscript/

[2]: http://squirrel-lang.org/

FWIW, in re: reference counting vs. tracing GC

"A Unified Theory of Garbage Collection" https://researcher.watson.ibm.com/researcher/files/us-bacon/...

> Tracing and reference counting are uniformly viewed as being fundamentally different approaches to garbage collection that possess very distinct performance properties. We have implemented high-performance collectors of both types, and in the process observed that the more we optimized them, the more similarly they behaved— that they seem to share some deep structure.

> We present a formulation of the two algorithms that shows that they are in fact duals of each other. Intuitively, the difference is that tracing operates on live objects, or “matter”, while reference counting operates on dead objects, or “anti-matter”. For every operation performed by the tracing collector, there is a precisely corresponding anti-operation performed by the reference counting collector.

> Using this framework, we show that all high-performance collectors (for example, deferred reference counting and generational collection) are in fact hybrids of tracing and reference counting. We develop a uniform cost-model for the collectors to quantify the trade-offs that result from choosing different hybridizations of tracing and reference counting. This allows the correct scheme to be selected based on system performance requirements and the expected properties of the target application.

This is one of my favorite papers. It's a very accessible article that dispels the myth that reference counting is somehow not a form of garbage collection.

> This is one of my favorite papers.

Mine too. It's a gem.

While reading I was expecting to hit a discussion of moving/copying/compacting GC at some point. But no.

IMO, one of GC's benefits is that it maintains data locality. Otherwise a given GC is not that different from reference-counting memory management (loops in ownership graphs aside): it inherits the same memory fragmentation problems from malloc/free.

Some memory allocators allocate objects of different sizes from different blocks, which can be far away from each other. This reduces fragmentation, but I'm wondering how much data locality we typically get, even in the best case, for composite objects composed of different-sized parts?

It seems like it will vary a lot depending on details of the language implementation. An array of value types should be laid out linearly, but beyond that it seems hard to tell what's going on?

I am a friend of simple garbage collectors. Let initial placement be decided by running code.

Let later placement be decided by the order in which the heap is scavenged. That is what, e.g., re-unifies the target instances of an array of pointers: you scavenge the array of pointers, and in the default case the instances they point to line up beautifully one after another, even if the initial allocation had other memory interleaved. In that case your code actually gets faster after the data is first GCed.
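A toy sketch of that scavenging effect (names are made up; forwarding pointers, root scanning, and a real Cheney queue are all omitted): copying the targets of a pointer array in scan order packs them contiguously in to-space, regardless of where they originally lived.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: a one-field "object" and a statically sized
   to-space. A real copying collector also installs forwarding
   pointers so shared objects are copied once. */
typedef struct { int value; } Node;

static Node to_space[256];
static size_t to_top = 0;

/* Bump-copy one object into to-space and return its new address. */
static Node* evacuate(const Node* obj) {
    Node* copy = &to_space[to_top++];
    *copy = *obj;
    return copy;
}

/* Scavenge an array of pointers: afterwards, arr[0..n-1] all point
   into to-space, packed one after another in scan order. */
static void scavenge_array(Node* arr[], size_t n) {
    for (size_t i = 0; i < n; i++) {
        arr[i] = evacuate(arr[i]);
    }
}
```

After `scavenge_array`, iterating the array touches adjacent memory even if the original allocations were interleaved with unrelated objects.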

I don't see how we can know whether ordering things in memory by breadth-first or depth-first traversal will make things faster, though? It seems like it depends whether it's more common to iterate over the array or access single elements of it. And when iterating over the array, how deep does the code go when looking at each array element?

When you say locality, you mean moving memory allocations next to each other so that there is no wasted space between them right?

Not wasted space. Unrelated space would do.

Keep in mind that you only have a tiny amount of L1 data cache lines. They are gone so quickly. If you can get a couple more struct instances in an array into those cache lines (without the cache lines holding unrelated nonsense as a byproduct of a memory fetch) that is a huge win.

The issue of L1 cache lines is more important than the size of the L1 cache. The granularity of the cache lines uses up the size of the cache very quickly if every cache line is padded with 3/4 nonsense that you don't need right now.

> no wasted space between them

That too, but also: if you have an array of objects, a compacting GC will place those objects in a contiguous area of memory, so conceptually it optimizes access to them. Cache prefetching by modern processors, all that.

This is great. Garbage collection is one of those things that tends to scare people, even though a basic garbage collector can be very simple (though a naive approach is usually also horribly slow..). Great to see the amount of illustrations to make it easier to grasp.

> Garbage collection is one of those things that tends to scare people

That's exactly why I was excited to write this chapter. They have a reputation for being hard, but the core algorithm is really just a simple graph traversal.
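For anyone who finds that surprising, the mark phase really is just a worklist graph traversal. A stripped-down sketch (illustrative names, not the book's clox code; bounds checks omitted for brevity):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Each object carries a mark bit and up to two outgoing references. */
#define MAX_CHILDREN 2
#define MAX_GRAY 64

typedef struct Obj {
    bool isMarked;
    struct Obj* children[MAX_CHILDREN];
} Obj;

static Obj* grayStack[MAX_GRAY];
static int grayCount = 0;

/* White -> gray: mark the object and queue it for tracing. */
static void markObject(Obj* obj) {
    if (obj == NULL || obj->isMarked) return;
    obj->isMarked = true;
    grayStack[grayCount++] = obj;
}

/* Gray -> black: pop each queued object and mark what it references.
   The mark bit also terminates cycles. */
static void traceReferences(void) {
    while (grayCount > 0) {
        Obj* obj = grayStack[--grayCount];
        for (int i = 0; i < MAX_CHILDREN; i++) {
            markObject(obj->children[i]);
        }
    }
}
```

Mark the roots with `markObject`, call `traceReferences`, and everything still white is garbage.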

Designing and implementing a low-latency concurrent garbage collector is actually at least as difficult as writing an optimizing compiler. Many languages today still struggle with concurrency+GC (see e.g. Python, OCaml). Good to see this chapter on GC, but I think the subject deserves an entire book!

I recently read this piece, which I think has been posted here before, which walks through the process in an isolated fashion:


I've implemented a couple of scripting languages, but thus far I've ignored GC. It is something I'd like to experiment with in the future though.

A GCed piece of nontrivial code can also be faster than a malloc/free solution.

The reason being that allocation can be much faster. The GCed code can allocate memory with as little as a pointer increment (and that is atomic, so no thread locking needed).

malloc/free always do full function calls, and they might/will descend into dealing with fragmentation (aka finding free space). Likewise, free() isn't free, and cross-thread allocation/deallocation can further complicate things.
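To make the comparison concrete, here's a minimal single-threaded bump-allocator sketch (names and sizes are made up for illustration): the hot path is one bounds check and one pointer increment, with no free-list search at all.

```c
#include <assert.h>
#include <stddef.h>

/* A fixed nursery; when it fills up, a real collector would run a
   GC cycle and reset/evacuate instead of returning NULL. */
enum { NURSERY_SIZE = 4096 };
static unsigned char nursery[NURSERY_SIZE];
static size_t bump = 0;

static void* gc_alloc(size_t size) {
    size = (size + 7) & ~(size_t)7;   /* round up to 8-byte alignment */
    if (bump + size > NURSERY_SIZE) {
        return NULL;                  /* would trigger a collection here */
    }
    void* result = &nursery[bump];
    bump += size;                     /* allocation is just this bump */
    return result;
}
```

Compare that with a general-purpose malloc, which has to consult size classes or free lists before it can hand out memory.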

> increment of a pointer (and that is atomic, so no thread locking needed).

Nitpick (and mostly for others reading): incrementing the address of a pointer if done locally (as in, using a local variable of sorts) may be atomic, but storing the updated pointer somewhere may not be. If a bump allocator wants to support concurrent allocations, it should make sure to use the right atomic operations (typically this is just a compare-and-swap of the old pointer with the locally incremented one).
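A sketch of that compare-and-swap loop using C11 `stdatomic.h` (names are illustrative): each thread computes the new top locally and only publishes it if no other thread got there first; on failure, the CAS reloads the current top and the loop retries.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { HEAP_SIZE = 4096 };
static unsigned char heap[HEAP_SIZE];
static _Atomic size_t heap_top = 0;

static void* concurrent_alloc(size_t size) {
    size_t old_top = atomic_load(&heap_top);
    for (;;) {
        size_t new_top = old_top + size;
        if (new_top > HEAP_SIZE) return NULL;   /* heap full: collect */
        /* On failure, old_top is refreshed with the current value
           and we retry with that. */
        if (atomic_compare_exchange_weak(&heap_top, &old_top, new_top)) {
            return &heap[old_top];
        }
    }
}
```

(Alignment handling is omitted; a real allocator would round `size` up first, as in the single-threaded version.)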

Is this universally true? Redis, for example, vendors jemalloc, so it's entirely possible for malloc to be mostly inlined? IIUC malloc isn't a syscall like the underlying sbrk and mmap calls that malloc implementations use to get memory from the kernel?

Sure, you can change any of the individual properties.

But if you cannot move memory (adjust pointers like most GCs do) then you will have to deal with fragmentation, which slows down allocation (or causes other drawbacks).

A moving GC can also make the program faster due to memory compaction, which makes it more efficient wrt the CPU cache and TLB.

Fragmentation issues depend greatly on the language. E.g. I'm (perpetually, it seems) working on a Ruby compiler, and I slotted in a naive GC as a starting point and instrumented it.

Turned out the vast majority of allocations are tiny object instances of a handful of different sizes. As a result minimizing fragmentation is as easy as allocating pools of a fixed size for small objects, and round up larger objects to a multiple of block size. There are still truly pathological patterns possible, and a compacting / generational gc may still be worth it later both to deal with that and to reduce the cost of a collection, but for a lot of uses you can get away with simple options like that.
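A sketch of that fixed-size-pool idea (hypothetical names; only one size class shown for brevity): requests are rounded up to a block size, and freed blocks go on a free list of that class, so a freed slot can always be reused by any later same-class allocation and never fragments the pool.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK = 16, POOL_BLOCKS = 256 };

/* A freed block is reused to store the free-list link itself. */
typedef struct FreeBlock { struct FreeBlock* next; } FreeBlock;

static unsigned char pool[BLOCK * POOL_BLOCKS];
static size_t pool_used = 0;          /* blocks handed out so far */
static FreeBlock* free_list = NULL;

static size_t round_to_block(size_t size) {
    return (size + BLOCK - 1) / BLOCK * BLOCK;
}

static void* pool_alloc(size_t size) {
    if (round_to_block(size) != BLOCK) return NULL; /* other classes elsewhere */
    if (free_list != NULL) {          /* reuse a previously freed block */
        FreeBlock* b = free_list;
        free_list = b->next;
        return b;
    }
    if (pool_used == POOL_BLOCKS) return NULL;
    return &pool[BLOCK * pool_used++];
}

static void pool_free(void* p) {
    FreeBlock* b = (FreeBlock*)p;
    b->next = free_list;
    free_list = b;
}
```

A real allocator would keep one such pool per size class; jemalloc and friends work on a similar size-class principle.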

Malloc is not a syscall, you're right. But it's a quite complicated function, so it wouldn't make sense for the compiler (or linker) to inline it wherever it's called.

jemalloc is deliberately structured so that the fast path is all inlined and all unlikely paths which would cause it to blow the inlining complexity budget get pushed out.

Interesting. I didn't know that, but it sounds plausible.

Garbage collection isn't that slow. I find allocating out of the fragmented memory left after a collection to be slower.

A good GC isn't slow, but a naive one tends to be awful. E.g. common problems involve thrashing the cache by touching the entire heap regularly.

I'm definitely a newbie at this topic, but I wonder if we can have some type of static GC? I'm not speaking about a radical change to the model of the programming language (like Rust), but about a compile-time analyzer that can detect when an object will go out of scope and insert code to deal with it, or make some assumptions about the likely generation of the object. Wouldn't this decrease the amount of work required of the GC at runtime?

Isn't a compile-time analyzer that can detect when an object will go out of memory precisely what Rust (and C++ RAII) does? It's just a form of escape analysis.

Imagine creating a Rust program entirely with `Rc`. It's basically a GC'd program at the point where the "roots" are managed by the reference counter. The "list of roots" is only ever messed with whenever a `Rc` is dropped/created, and one can optimize functions to take `&Rc` to reduce "GC" pressure. I do not believe it's possible to automate this process in general because if you could, I have a hunch the solution can be used to decide the Halting Problem.

So sure, in general a GC can perform some heuristics to predict the lifetime of an object, but usually the point of using a GC is that one does NOT know the lifetime or it is insanely complex.

Such analysis also requires language design that limits the legal statements in a program to make the analysis tractable.

Without these restrictions, this problem is isomorphic to the halting problem. (Proof: assignment of a given memory object to a field within another unrelated object creates another reference. The job of automatic memory management is to determine when no such references exist. Now replace that assignment with HALT. Any such automatic memory manager that operates statically would be able to find all HALT statements within the program and so solve the halting problem for an arbitrary program.)

That's why languages that manage memory statically like Rust & C++ must be able to reject some programs as "not passing the borrow-checker", and everything else requires run-time support via either GC or refcounting.

Doing it perfectly requires a suitable language design, but even a pathologically static-analysis-unfriendly language like Ruby still allows you to determine it in many cases. You just need to accept that for such languages it is an optimisation, and you still need to fall back on a full GC.

Systems programming languages with GC (any form of automatic memory management) also provide features for manual memory management: explicitly releasing resources, or allocating on the stack or in the global memory segment.

Depending on what one is trying to do, those constructs can still be allowed on safe code, or require explicit unsafe modules (or code blocks).

Examples: D, Nim, Swift, Mesa/Cedar, Modula-2+, Modula-3, Sing#, System C# (aka M#), .NET (version- and language-dependent).

Depending on the language, you can do this for some things. The fundamental challenge is escape analysis: when this function returns, what might have been retained elsewhere?

If you can answer that, then you can optimize accordingly.

E.g you can choose to allocate objects that won't be retained on the stack instead of the heap, or in separate blocks.

You can even potentially benefit from this even if you can't 100% know. If you use a generational collector, an object that almost certainly escapes could bypass the nursery, for example.

There are some optimizations that do this. When you know the object won't outlive the function, allocate it on the stack instead of on the heap.

And in the more extreme case you can even replace the object with a bunch of local variables. For example, replace a Point object allocated on the stack with a pair of variables for the x and y coordinates.
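A hand-written before/after illustration of that scalar replacement (a compiler with escape analysis would do this automatically; the names here are made up):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double x, y; } Point;

/* Before: heap-allocates a Point that never outlives the function. */
static double distance_squared_heap(double x, double y) {
    Point* p = malloc(sizeof(Point));
    p->x = x;
    p->y = y;
    double d = p->x * p->x + p->y * p->y;
    free(p);
    return d;
}

/* After scalar replacement: the "object" is just two locals, which
   can live entirely in registers. No allocation, no GC pressure. */
static double distance_squared_scalars(double x, double y) {
    double px = x, py = y;
    return px * px + py * py;
}
```

Both compute the same result; the second version never touches the heap.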

The Standard ML implementation called ML Kit added memory tracking to the type system in order to infer the lifetime of objects. I ran into a safe C dialect that built on the work in ML Kit to avoid memory allocation overhead in C as well, but I can't remember the name of the project. The Cyclone project mentioned in the link might have been it, but it has less automation than ML Kit.


  > if (previous != NULL) {
  >   previous->next = object;
  > } else {
  >   vm.objects = object;
  > }
Ergh. That's much more painful than it needs to be. Maybe try:

  static void sweep() {
    Obj** ref = &vm.objects;
    while (*ref != NULL) {
      if ((*ref)->isMarked) {
        (*ref)->isMarked = false;
        ref = &(*ref)->next;
      } else {
        Obj* unreached = *ref;
        *ref = unreached->next;
        freeObject(unreached);
      }
    }
  }

Funny you might say that.

Baby’s First Garbage Collector[0] (from the same author) uses this approach.

[0]: https://journal.stuffwithstuff.com/2013/12/08/babys-first-ga...

The long awaited garbage collection chapter is finally here! I'm not sure the discussion about white/gray/black objects is strictly necessary for a simplest possible garbage collector, but it definitely will help those who want to read more about the topic in the future.

> I'm not sure the discussion about white/gray/black objects is strictly necessary for a simplest possible garbage collector

I wondered about that too. If you aren't doing an incremental collector, it's not really necessary. But I think it helps build a visual intuition for the algorithm (and other graph traversals for that matter), so I felt it was worthwhile to put in there.

Garbage collection solves many problems in a virtual machine, such as memory leaks and overwrites of live memory. Many programming languages now implement something like garbage collection in their runtime, so the compiler and runtime can manage memory for the user.
