If you can look at a word of memory and differentiate pointers from values, garbage collection can become extremely simple. It's a shame that tagged architectures have largely died out.
As an experiment, I tried writing a garbage collector which used the high-order bits of a word to identify pointers. The result is about a page of code in Forth:
http://hastebin.com/raw/gabunowelo.fs
This example works exclusively with fixed-size "cons pair" allocations, but generalizing to arbitrary-sized allocations only increases the complexity of the system slightly. Obviously this bitflag technique is not "safe" in general, as arbitrary values on the stacks could produce false positives, but it's easy to imagine a 33-bit or 65-bit architecture that provided the necessary hardware support without such caveats.
> It's a shame that tagged architectures have largely died out.
Judging by benchmarks, it's a good thing they did. Byte-addressed but word-aligned pointers, plus pipelining and superscalar execution (branch prediction probably helps too), give you almost the same performance as explicit instructions for tag support, and the flexibility and instruction-set simplification you get in exchange are well worth it, IMO. SPARC had such instructions, and on benchmarks (see http://www.cs.ucsb.edu/~urs/oocsb/papers/oo-hardware.html and http://www.cs.cmu.edu/~ram/pub/arch.ps) the difference in speed between using them and not was less than 5%.
On a 64-bit machine, you have 14 different tags available: since the smallest thing you'd want to point at is two 64-bit words big, pointers are 16-byte aligned and their low 4 bits are free for tags. Of the 16 possible 4-bit patterns, two are reserved for positive and negative fixnums, giving you 60-bit immediate integers.
The problem is that a lot of algorithms (especially in crypto) are designed around 32- or 64-bit words, which typically means you have to make contortions and add declarations to get your compiler to operate on immediate representations. Typically, the compiler will set aside some CPU registers for dealing with these immediate values, so the garbage collector doesn't get confused. On x86 this might mean you don't have enough registers to run the algorithm efficiently, even if the compiler knows exactly what needs to be unboxed.
I do agree on one thing - slapping an extra byte onto word size would be awesome, as long as that extra byte is general purpose. Other things besides tagging that you could do with it would be error-correcting codes (the Symbolics Ivory actually had ECC for every word), and all kinds of metadata. Maybe it would even make sense to keep this extra byte out of the memory bus and just have it on registers; that's where most of the trouble with tagged values on modern machines seems to come from. This would make sense for load/store instruction sets.
In IEEE 754 standard, floating point numbers are represented using 1 bit for the sign, 11 bits for the exponent, and the remaining 52 bits for the fraction (number between 0 and 1). The exponent 7FF (in binary: all ones) is reserved for special values: NaN and infinity. If the fraction is 0 (all zeroes), then the number is +/- inf, otherwise it is a NaN.
NaN-tagging works by using the unused variations of NaN for pointers. The trick is that most processors only generate a single binary pattern for NaN, so all other binary patterns are never used to represent floating point numbers, and can thus be used to represent pointers. This works since the current x64 implementations only use 48 bits for the pointers, so all pointers can "fit" into the remaining 52 bits of NaN.
> In IEEE 754 standard, floating point numbers are represented using 1 bit for the sign, 11 bits for the exponent, and the remaining 52 bits for the fraction (number between 0 and 1).
On reading this I immediately read the standard, hoping to discover that the word "fraction" wasn't used, only to find that "fraction" is defined as the part of the significand (mantissa) lying to the right of the implied binary point. In essence then (because the bit to the "left" of the binary point is only ever implied, never explicit), "fraction" stands as a synonym for significand.
Unfortunate terminology. And why? Here's why:
A fraction: 1/3
A significand, base ten: 0.3333333333333333
IEEE 754 can provide the second, but it cannot provide the first.
Remember, this is binary, so the only possible digits are 0 and 1. For example, 1/3 in binary is 0.01010101010101...;
Encoding that in "floating point", we first get:
1.0101010101 * 2^-2;
We can see that in this format, the first digit of the number is always 1; thus, the first digit of the significand, for any number, is ALWAYS 1 (in binary)! IEEE 754 makes the optimization of not storing this leading 1 as part of the fraction, so the approximation to 1/3 is actually encoded as:
0 01111111101 0101010101010101010101010101010101010101010101010101
^ `---------' `--------------------------------------------------'
|   exponent                        fraction
sign bit

exponent = 01111111101b - 1023 = 1021 - 1023 = -2
fraction = the 52 stored bits; with the implied leading 1, the
           significand is 1.0101...01b = 1.33333..., so the value
           is 1.33333... * 2^-2
Hey, you may be interested to know that in the upcoming ARM v8's AArch64 mode, the upper 8 bits of all pointers can be used freely for tagged pointers.
The issue isn't adding extra data to pointers; it's distinguishing a pointer from an integer of the same size. The extra bit needs to be "meta".
Some x64-targeted languages already use tagged pointers without specific hardware support; e.g., I recall reading about the JVM using several bits of 64-bit object pointers to encode a class index, so that polymorphic inline caches can be checked without an indirection.
Right, any language could do this -- `malloc` and friends return pointers aligned to 16 bytes, giving you 4 bits at the bottom that are totally free.
It can go even further than that; most systems only use 48 bits for addresses -- this leaves more than enough space to "NaN-box" objects by putting the address inside the unused payload bits of a 64-bit floating-point NaN.
Right, any language could do this -- `malloc` and friends return pointers aligned to 16 bytes
How does having the bottom 4 bits free in a pointer help if you can't use them to tell the difference between a pointer and an integer of the same size? He is asking for a reserved set of bits on every data type so that he can store minimal type information: pointer vs. not-pointer.
Sure. This was just meant as an interesting tidbit. On ARM v8, you'll still need the compiler/VM to help you out as far as type precision goes, of course.
I was under the impression that the collector that's in place right now is just a cycle collector, not a full-blown garbage collector. But please correct me if I'm wrong!
Oh yes, you're right -- I was assuming you were grouping cycle collection under garbage collection. In any case, the cycle collector is task-local, not global.
I think a really cool tactic is Racket's places, which basically create isolated zones, each running its own module with its own GC (though objects shared between cores don't take up extra space; there's a global table or something).
This sounds very much like how Erlang does it - each process gets an isolated garbage collected heap and communication between processes happens through message passing.
How lightweight are Racket places? I'm not very familiar with the language, so I don't know if they're really comparable to Erlang processes.
How does that work? Either you have shared objects and you need a global garbage collector, or you don't. Using shared objects with a local GC results in memory leaks.
Nice article. What I didn't understand: How does a conservative GC without type information know where the references are in an object? E.g. given this object:
struct {
    double a;     // 64 bit
    short  b;     // 16 bit
    int    c, d;  // 32+32 bit
    FooPtr e;     // 64 bit
    int    f;     // 32 bit
    BarPtr g;     // 64 bit
}
Does it assume that all fields are aligned to 64 bit boundaries? Does it potentially consider a double to be a pointer? And how does it know where to stop looking for references without knowing the size of the object?
A conservative GC requires the application to obey certain "typical" behaviors. One of those behaviors is "natural" alignment of structures. Most C compilers will align pointers in structures to the word size of the architecture, inserting padding if necessary. In your example, padding would be inserted between "b" and "c" to align "c" to 32 bits, and (assuming this is a 64-bit machine) between "d" and "e" and between "f" and "g" to align the pointer fields to 64 bits. So the GC can scan an object a word at a time and be sure it'll hit all pointers. Note that using compiler directives that pack structures as tightly as possible, ignoring natural alignment, can break this assumption.
As for your other questions, remember that a conservative GC manages objects allocated through it, not just random objects in the system. It can use allocator metadata to keep track of where objects start and how long they are.* It aligns objects to some multiple of the word size, so it can ignore any bit pattern that doesn't point to a multiple of 4 or 8 bytes. It also knows the extents of its own heap. As it's scanning potential pointers, it checks whether a particular bit pattern points outside the heap or doesn't point to the beginning of an object. If a possible pointer passes all of these tests, it definitely points to an object: it might not actually be a real pointer, but the target really is an object managed by the GC.
On a 64-bit machine, it's actually fairly rare for a random "int" or "double" to be mistaken for a pointer. Most "ints" are small integers, and on a 64-bit machine the heap is almost always allocated above 4GB. Similarly, a random 64-bit double's bit pattern is unlikely to land inside the heap, given the exabytes of address space it could fall in.
*) This is similar to how a malloc implementation might use allocator metadata to know how big a chunk is for a free() call.
I'll start with your last question: It does know the size of objects - this size is passed to the GC when you allocate memory from it. Root ranges (such as the stack, global variables, TLS areas, etc) also all have a static size.
(Not strictly true - some runtimes have dynamically growing stacks, but the GC knows the size regardless.)
Most garbage collectors assume pointers to be aligned on a word boundary; that is, on a 4-byte boundary on 32-bit machines and an 8-byte boundary on 64-bit machines. This is a reasonable assumption because accessing pointers that are not word-aligned is extremely slow on most architectures. It does not care how fields that don't contain pointers are aligned, because their contents are irrelevant (so this translates to the compiler being able to pack some fields together without worrying about breaking the GC).
So, a conservative GC will simply scan over every word in an object and interpret it as a potential pointer, regardless of what it actually is. So yes - even a double will be considered a pointer.
It can't make that assumption: in a conservative GC, any pointer-sized field in an object could be a pointer. So assuming you're on a 32-bit machine, the double in the above example would look like 2 pointers to the GC. One way of dealing with this is for the GC to keep a set of all the address ranges that belong to its objects (these are the only objects that could be garbage collected). Then, when it sees what could be a pointer, it first looks up whether this "maybe pointer" actually falls in its address ranges, and only if it does will it add it to the mark stack for further scanning. As for how it knows when to stop looking for references: the GC knows the object size for any object it has allocated, so it knows when to stop scanning its own objects. For objects it hasn't given out, it doesn't matter, since it won't scan those. And for the stack, it'll just scan the entire stack looking for pointers to its own objects.