Writing a Simple Garbage Collector in C (illinois.edu)
162 points by webkike on Aug 25, 2014 | 47 comments

This article teaches you how to write a conservative GC, which is what you'd do if you wanted to garbage collect in C or some other system where you don't know the shape of objects at runtime.

If you do know that -- like you're writing a GC for a higher-level programming language -- you can write a precise GC instead and avoid all of the (very interesting) muckiness this article goes through.

I wrote an article on that a while back:


Also, note the muckiness this article skips over. In particular:

"Local variables can also be stored in registers, but we won't worry about this because registers are usually dedicated to local variables, and by the time our function is called they'll probably be saved on the stack anyway"

Ignoring things that 'usually' or 'probably' do not happen is not a good idea when programming, doubly so when doing systems programming. It is one thing when your application crashes, another when the OS crashes or becomes exploitable due to a bug.

If you want to implement a robust conservative garbage collector you will have to study the calling conventions of the language you write it in, will have to study the source of the compiler, and will have to study the release notes of every OS or compiler upgrade. For example:

- Rust moved from segmented stacks to conventional ones. That probably broke any code walking the stack.

- on some systems, locals can get stored below the stack pointer (http://en.m.wikipedia.org/wiki/Red_zone_(computing))

(Yes, this example code appears to be broken on x64)

Checking the registers is easy enough: just push them onto the stack, where the same scan of consecutive memory addresses covers them, and you are done.

The other way to improve GC if you are using C is to make sure your local allocations are in fact local; use alloca() and friends (but only for sub page sizes, possibly less if you are doing lots of local allocations). This will reduce the amount of GC required.

As far as the red zone goes, if you are in the GC then you can't be in another function using it; unless you are running your GC pre-emptively (signals etc.), in which case you need to be much, much more careful.

> Local variables can also be stored in registers, but we won't worry about this because registers are usually dedicated to local variables, and by the time our function is called they'll probably be saved on the stack anyway

Heh. No wonder it's 32 bit! You can only hope to get away with this on a register-poor ISA like i386.

I wonder if you could use setjmp to portably get access to register values?

It isn't portable. jmp_buf should be treated as an opaque data type (just like FILE or va_list), even if it isn't. Also, its contents can be encrypted (glibc does pointer mangling).

1. Writing a simple anything isn't hard.

2. Especially if simple means you don't have any interrupts or threads.

3. Writing anything is a lot easier if you carefully exclude "debugging" from "writing".

When you have GC bugs, they don't always occur in the collector itself, but rather when the code which relies on GC breaks the conventions that allow GC to work correctly (because that code is generated by a C compiler which doesn't understand your conventions, and so they are ensured by hand).

There is a nasty source of "false retention" in conservative, stack-walking GC: pointers themselves! It's not just integer variables (and other data) which "look" like pointers.

Here is the scenario: a local variable in the C program holds the only pointer to a heap object. The object is no longer needed, so the program overwrites it with NULL. Now it is garbage, right? Wrong! The C compiler is optimizing, and has no clue about garbage collection (unlike, say, a proper Lisp compiler). The C compiler performs a data flow analysis and comes to conclusion that "ptr = NULL" is a dead assignment because the variable has no "next use" in the program graph. The assignment is optimized away, and so a memory location, or a register, continue pointing to the object. Sometimes this happens in some top-level function, and so during some long-lived execution of lower level functions, memory is occupied. Only if you're lucky, the memory location or register get reused for something else and so the stale pointer is overwritten.

If you're using infinite lazy lists, and are counting on the prefix of these lists to be garbage-collected, as you extend the tail, spurious retention of the list head is the kiss of death to your program.

Large, portable, robust programs are hard to write. But that's not what I'm addressing. I'm trying to show how what is generally considered a complex problem is simple when you strip it down to its essence.

"Local variables can also be stored in registers, but we won't worry about this because registers and usually dedicated to local variables, and by the time our function is called they'll probably be saved on the stack anyway"

This is wrong; you have to take care of registers. The good news is that an easy way to do this that will work on a broad range of platforms is to store the machine context to a jmp_buf that is on the stack. (You never use this jmp_buf for a longjmp, obviously.)

Relying on "probably" is completely the wrong foot to start out on, IMO. It leads to this kind of craziness: http://timetobleed.com/the-broken-promises-of-mrireeyarv/

I think the article does not really articulate the root cause properly. The issue is that a Ruby object goes out of scope, while a low-level pointer to its internals continues to be used (in the middle of code that allocates and can trigger GC). How that problem translates to code generation on amd64 is secondary.

What is happening is that the string object has been reduced to its C fundamentals (pointer to data, and length), and the one and only variable ("str") which holds the original object from which these fundamentals are derived has no next use after that.

The fix in the patch is not very good; the real fix is to get rid of the nasty macro zstream_append_input2, and make it a real function.

The function still has to solve the problem of pinning the object while the low level pointer is used, but it's more localized: we don't have to think about the interaction of the macro with the site where it is expanded (and will the hack always work).

In other words, don't write macros that break apart garbage-collected objects into low-level C pointers to their internals that are invisible to the collector. The macros spread dangerous code everywhere they are used.

> The issue is that a Ruby object goes out of scope, while a low-level pointer to its internals continues to be used (in the middle of code that allocates and can trigger GC). How that problem translates to code generation on amd64 is secondary.

"Out of scope" as defined how? Not by the C code, it can be in-scope in C but optimized away by the compiler... which seems best illustrated by dissecting the compiler's output.

> The fix in the patch is not very good; the real fix is to get rid of the nasty macro zstream_append_input2, and make it a real function.

That "real fix" doesn't actually fix anything, though. Don't get me wrong, macros are ugly (especially lowercase ones) and that's a fine refactoring to suggest, but it's not a fix.

> In other words, don't write macros that break apart garbage-collected objects into low-level C pointers to their internals that are invisible to the collector. The macros spread dangerous code everywhere they are used.

In this context, there's extremely little difference between a macro and a function that the compiler decides to inline.

So am I reading this correctly, in that if I get allocated a segment at 0x12340 and then a random variable on the stack happens to have the value of 0x12340 for a different reason, then that segment would never be freed? Is this strategy still sound?

Also, while this is a nice explanation, does anyone actually use GCs while writing code in pure C? I never found the idea of calling free() that troublesome. On the other hand, writing a GC in C for another language is obviously a good use case.

The issue isn't that calling free() is troublesome. It's that in a large C program with various libraries and modules and callbacks, it isn't clear who should call free() on a given chunk of memory or when it is safe to do so.

Indeed, the garbage collector is not precise. However, in practice, the situation you describe does not tend to occur that often. Intuitively there is a 1/4 chance on 32-bit machines, but the chances are actually much lower due to the heap being located at a high memory address, and most local integer variables tend to have small values.

And yes there are definitely uses for a pure C garbage collector. For example, one can use it to detect memory leaks, if for some reason valgrind isn't around.

Sorry, that's wrong.

The D programming language currently uses a conservative garbage collector. Before version 1.0, false pointers were a serious problem. The most common causes of false pointers were:

- ASCII text;

- floating-point values;

- high-entropy (compressed/encrypted) data, e.g. image files.

Just 1 MB of random-ish data is enough to pin down allocations of 16 KB or bigger. And of course, it cascades through any false pointers in any pinned memory blocks, and so on.

In D 1.0, the garbage collector now remembers whether any memory block contains any pointers or not. So, now it can tell whether any memory address either definitely contains no pointers, or may contain pointers. Even so, memory leaks due to false pointers on 32-bit are not unheard of. For personal use, I've had to write a tool to diagnose these memory leaks: https://github.com/CyberShadow/Diamond

Rainer Schuetze is working on improving the precision of the GC from one heap allocation block to one machine word, but there's still work to be done on this.

So what are some strategies to really solve this problem? Is it basically impossible to implement a GC for a C-like language that will not leak at least a little memory?

It's not possible as long as your language has:

- non-discriminated unions mixing pointers and non-pointers

- linking with C or other languages (as you can't scan their stack frames precisely)

- void*, memcpy, or other ways to move around memory which might have pointers.

A language without all of the above would be of limited utility. However, it's possible to dramatically improve the situation using:

- precise heap scanning - by knowing the type of each object allocated;

- precise stack scanning - by knowing the stack layout of all functions at all points throughout their execution;

- GC hooks for user-defined union types.

You are wrong.

I made a comment to that effect higher up before noticing this.

When you're dealing with real world data - timings, statistics, sizes, dates, and then if you add floating point numbers whose representations look like big ints, you will be hitting this a lot, and leaking memory, due to how your mark and sweep works.

It is absolutely critical to have a sure way of disambiguating pointers from data.

I've fixed bugs in a memory allocator that's running in an app deployed on several hundred million PCs. It had a really fancy allocator that put a header with a magic number around malloc output, in addition to what you do, in an attempt to identify specific kinds of allocations. We'd see several tens of thousands of crashes a day due to magic number collision with real-world data.

Your intuition is wrong. You can never assume that most local integer variables tend to have small values. Assign -3.0 to a float, and map that to an int, or read the current epoch time into an int, and see what happens.

A C garbage collector would certainly help write some code more easily, but it has to work, and it has to be "precise". An imprecise GC is worthless in production, but fun as an academic exercise.

There are quite a few collectors in production that are conservative at least for the stack. D, Boehm, JSC, SBCL, etc. It can be a problem, but it often isn't, especially when you have precise knowledge of heap object layouts.

Boehm-Demers-Weiser does see 'real world' usage.

Tens of thousands of memory blocks that can't be freed daily, on an install base of several hundred million PCs, sounds like an acceptable compromise, at least for non-critical applications.

For such a simple proof-of-concept this is good enough. Indeed, it's not going to break anything, just leak memory. If I recall correctly, more advanced C garbage collectors like Boehm's have some ability to describe the layout of a struct to indicate that certain items are pointers and others are just data.

If you're constrained on memory (either address space or the limits of your physical + paging memory spaces) this is going to break everything. This is not a common situation for everyone, but it's a common situation for me.

Do you think this is appropriate to cover in a learning exercise that is intended to be simple?

I think it's appropriate to cover in a discussion about the merits - and limitations - of a proof of concept. I believe it's an appropriate counterpoint to "it's not going to break anything" as the discussion opens up to GCs in general, although that quote was likely meant only in the context of the proof of concept.

In most such environments, stop-the-world is just as problematic.

"People seem to think that writing a garbage collector is really hard [...] Well it's not. In fact, it's rather straight forward."


"Thirdly. Please don't use this code. I did not intend for it to be wholly correct and there may be subtle bugs I did not catch."

I am not sure what you're pointing at. I think this article aims to do the same as any other "compiler for a toy language" article; namely remove the fear barrier and explain that at their core, many of our black boxes utilize simple concepts.

I've always thought a compiler was difficult to write until I wrote my first one. Sure, it is nothing when compared to a production-level compiler, but I definitely no longer think that there's something magical about compilers, even the more advanced ones.

I'm sure that I don't need to tell you that there are many things in this world that look easy, but are in fact very hard. Even things that are very simple at the core turn out to be very difficult in practice.

I mean, to be a little flippant, all you need is abstraction and application, and you can compute all computable problems, right?

More to the point: what I was trying to do was to point out the irony of saying "building a garbage collector is easy" and then, only a few sentences later, saying "this code may have subtle bugs." Saying "this code is not fast enough for production" would be one thing; saying "this code is not correct" seems to put the lie to the assertion that building it is "easy."

I personally have never built a garbage collector in anger. But I suspect that the problem of building a passable garbage collector (that is, one that somebody else would be willing to use) is quite a bit harder than it is made out to be in this article.

The point of this post is education. The author is trying to de-mystify garbage collectors. His approach - which I think is a good one - is to present code that is quite simple, but still meets the criteria for being a garbage collector.

This is a pedagogical technique. Your points are valid, but they also ignore that the author is trying to teach concepts through code. That often requires simplification.

Yeah, and this is true of webapps (and many other kinds of programs) as well. Writing a webapp is (or at least can be) easy, but writing a production-quality webapp isn't.

Not sure if webkike is Matt or not, but either way it's a nice little mark-and-sweep GC. It reminded me a bit of the first one that was written for Oak (the precursor language to Java). I have always been a fan of building versions of technology like this to experiment with, because it helps you figure out where the hard parts of 'simple' programs are. One of my favorite interview questions is to start with traversing a list to arrive at an answer, now the list is getting insertions from multiple sources ...

So too with GC: you start with a hook into your allocator and you end up with coloring references and aging them and disabling compiler optimizations and all sorts of other things that lead to really unexpected behavior in your environment.

I'm Matt (officially). And webkike is not me (also officially).

What do you mean by "officially" in this sense? I'm a native English speaker and all, but I'm not sure what you're implying. How would you unofficially be Matt? Or unofficially not be webkike?

It's a joke that is probably only funny to me. Because I am the authority on who I am, adding (officially) is superfluous. Sorry for the confusion.

It would be helpful to note that some of the variables used in the first two function examples are defined further down the page. I was trying to understand how his first loop in add_to_free_list worked and couldn't figure out how because *freep was never declared anywhere and that was his list of free blocks. Took me a page of scribbling pointer diagrams and 5 minutes to figure out that it was declared and set further down the page as a global.

re preciseness: You should really note the collector's precision limitations. Also add that pointers located in loaded shared libs are not detected with this method. Boehm has special methods to detect the ranges of loaded shared libs.

Maybe you should also mention weak refs and volatile: volatile to keep pointers on the stack, so that you don't have to inspect the registers as well. And some easy ways to make it precise, e.g. int tagging, float boxing, ...

I really appreciate seeing articles like this. This is not a topic most developers typically encounter in their every day work, and is often not a topic taught in college. Articles like this really help stretch my thinking and let me look at a problem I'd otherwise never encounter. Thanks for posting!

This code won't work.

It scans heap memory, and if a word looks like it points into allocated memory, it's assumed to point into allocated memory, and that memory is marked as used.

When dealing with numbers, such as times, file sizes, etc, you will inevitably store an integer value that looks like a pointer into a valid block.

The reason you can do mark and sweep in Java is that there is a distinct wrapper for all pointers, so you can disambiguate pointers from data. You can't do that in this algorithm; it's impossible. This is also the same reason that disassembly of executables is so hard: it's not always clear what's data and what's instructions.

This has been discussed before. It's the distinction between the collector being precise and conservative. Conservative collectors are fine because they err on the side of retaining memory rather than freeing it. As I stated before, intuitively the odds of an integer containing a pointer-like value are 1/4 (on a 32-bit machine), but they actually tend to be much lower.

Even libgc has this issue and it is impossible to resolve in C.


Thank you.

"there is a distinct wrapper for all pointers"

That's not why the JVM can do this - the JVM knows the location of pointers in objects on the stack, and it only scans them. They aren't wrapped in any way.

Your point about the VM being aware of stack layout is of course correct. That aside, Java references can be more complex than a simple pointer to the referenced object. In particular:

In some of Oracle’s implementations of the Java Virtual Machine, a reference to a class instance is a pointer to a handle that is itself a pair of pointers: one to a table containing the methods of the object and a pointer to the Class object that represents the type of the object, and the other to the memory allocated from the heap for the object data.


That's not really to enable garbage collection. In C++, for example, objects with dynamic storage duration will be referenced off the stack by a pointer to your object, which in turn, if the object is of polymorphic type, contains a pointer to the class vtable and runtime type information. That's not so much wrapping as required infrastructure for runtime polymorphism, interfaces etc... even if you replicate the same features in C.

Presumably the major difference in Java with re: GC is that there's enough introspection left in the RTTI after compilation for the GC to identify internal references.

I wrote a simple GC for C once. It wasn't simple. I gave up after I hit what I felt like was the five-thousandth edge case and about 6000 LoC.
