

Practical Garbage Collection - Part 1: Introduction - cgbystrom
http://worldmodscode.wordpress.com/2011/12/26/practical-garbage-collection-part-1-introduction/

======
bitops
A very good writeup, but one thing always confuses me.

What is meant specifically by the "heap" and "stack"? I know what a stack is,
but "heap" gets thrown around in many different contexts and I've yet to find
any explanation that made it clear.

If anyone has a good explanation or good links for those two terms in this
context, I'd be very grateful. Thanks!

[EDIT: thanks everyone for the answers so far!]

~~~
lmkg
This looks pretty well-answered already, but I'll throw in my $0.02.

The _stack_ and _heap_ are the two main regions of memory where data can live
in a program. They are (somewhat unhelpfully) named after the data structures
originally used to represent them. Pretty much all modern languages have a
concept of a stack and a heap, although there are differences both in how
explicit they are, and in implementation details.

 _The stack_ is the function call stack, used to store data local to a
function call. It behaves like a traditional stack, with _push_ and _pop_
operations. When you call a function, a new _stack frame_ is pushed onto the
stack, which contains all of the local variables that are only accessible
inside that function call. When that function returns, it pops its stack frame
(and sends its return value, somehow). The language scoping rules determine
how one function may access the contents of an earlier stack frame that called
it, but stack frames are basically deleted and inaccessible after they are
popped (a source of errors in C).
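A minimal C sketch of those frames (`frame_demo` is a made-up name): each recursive call pushes a fresh frame with its own `local`, and the classic error is keeping a pointer into a frame after it's popped.

```c
/* Each call gets its own stack frame holding its own `local`;
   deeper recursive calls do not disturb the caller's copy. */
int frame_demo(int n) {
    int local = n * 2;                /* lives in this call's frame */
    if (n == 0)
        return local;
    return local + frame_demo(n - 1); /* pushes a new frame */
}

/* The C error mentioned above, returning the address of a local
 * whose frame is popped on return:
 *
 *     int *dangling(void) {
 *         int x = 42;
 *         return &x;   // BUG: invalid once this frame is popped
 *     }
 */
```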

 _The heap_ is the region of memory where you put data that lives longer than
the function that created it. Heap behavior is far less specified, and more
variable, between languages than stack behavior is (and involves the OS more),
but if you're passing around pointers or references to an object you're
probably using the heap. In languages that have them, the _malloc_ and _new_
operators usually mean "put this on the heap." In languages with a notion of
reference types and value types, reference types are usually heap-allocated by
default.
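The "lives longer than the function that created it" part, as a sketch in C (`make_counter` is just an illustrative name):

```c
#include <stdlib.h>

/* Heap allocation outlives the frame that created it. */
int *make_counter(int start) {
    int *p = malloc(sizeof *p);  /* heap: lives until free() */
    if (p)
        *p = start;
    return p;  /* still valid after this function's frame is popped */
}
```

The caller owns the result and must eventually `free()` it -- exactly the manual bookkeeping that a GC automates.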

The distinction between the two types of storage is more important in some
languages that others. It's really important in C, where you have to manually
manage all your memory, but some of the more dynamic languages (especially
everything-is-an-object ones) use the heap for all user data, and the call
stack only for implementation details.

Languages also vary greatly in how much control you have over where things get
put. On one extreme, C lets you put any type in any location, with _malloc_ ,
while on the other I am not aware of any way in JavaScript to control object
lifetime. Java and C# both distinguish between value-types (stack-allocated by
default) and reference-types (heap-allocated by default), but C# allows user-
defined value-types while Java does not. Common Lisp has a standardized way to
tell the compiler that it is safe to stack-allocate a value (of any type). And
so on.

~~~
nhebb
There are several things I've never had a good understanding of regarding the
stack. How big is the stack? Is it a fixed size dictated by the architecture
of the processor? I know I've written recursions that overflow it. Other than
gut feel and catching an exception after the fact, how do programmers know
when their program might overflow it?

~~~
derleth
> How big is the stack? Is it a fixed size dictated by the architecture of the
> processor?

It's more to do with the OS or language runtime; I think the language runtime
has a bigger impact for most programs.

> Other than gut feel and catching an exception after the fact, how do
> programmers know when their program might overflow it?

In general, you don't. You might be able to do some system-specific thing on
specific systems or just happen to know what it is on some given system, but
there's no way to find out in the general case. Worse, there's generally no
way to recover from blowing the stack; this has always made programs written
using alloca() or, more recently, variable-length arrays potentially unstable.
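One example of the "system-specific thing": on POSIX systems you can at least ask the OS for the process's current stack limit. This is a sketch, not portable beyond Unix-likes, and the helper name is made up.

```c
#include <sys/resource.h>

/* Query the current stack size limit (POSIX only).
   Returns bytes, -1 on error, or -2 if the limit is "unlimited". */
long stack_limit_bytes(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return -2;
    return (long)rl.rlim_cur;
}
```

Knowing the limit still doesn't tell you whether a given recursion will fit, since frame sizes vary, so this only turns "gut feel" into a slightly better-informed guess.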

------
pixie_
The complexity of generational garbage collection vs. the speed of manual
memory management makes me feel like the happy medium of speed and simplicity
is reference counting, like that found in Objective-C. iPhone apps are fast,
but take a bit longer to design, develop, and debug due to memory management
issues. Though with experience these can be minimized.

It probably isn't possible without a ton of modification, but I wish the
JVM/CLR had an option to garbage collect through reference counting.

~~~
rwmj
Unlikely to happen. Reference counting interacts poorly with the CPU memory
hierarchy: it makes all objects larger, making the working set
correspondingly larger, so your data cache is less effective. It makes
code larger because of the extra ref-count twiddling code. Larger code
means a less effective instruction cache, as well as being slower because of
all the extra operations performed. Reference counts must be frequently
modified, and they are scattered all over memory (next to each object), so you
get poorer locality and more atomic writes, which have all sorts of negative
implications for cache-coherent SMP architectures.
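To make the "twiddling" concrete, here is a minimal intrusive-refcount sketch in C11 (all names hypothetical): every object carries a count word, and every pointer assignment turns into two atomic read-modify-writes on cache lines scattered across the heap.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Every object carries a count word: objects get larger. */
typedef struct {
    atomic_int refs;
    int payload;
} Obj;

Obj *obj_new(int v) {
    Obj *o = malloc(sizeof *o);
    atomic_init(&o->refs, 1);
    o->payload = v;
    return o;
}

/* Every pointer assignment costs two atomic RMW ops on (usually)
   two different cache lines -- the "twiddling" overhead. */
Obj *obj_assign(Obj **dst, Obj *src) {
    if (src)
        atomic_fetch_add(&src->refs, 1);
    Obj *old = *dst;
    *dst = src;
    if (old && atomic_fetch_sub(&old->refs, 1) == 1)
        free(old);  /* last reference dropped */
    return src;
}
```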

So while you may not notice the penalty of reference counting since it is
spread out over every operation, it's unlikely that it is better than proper
GC for most applications. Maybe just for ones which are extremely sensitive to
latency, yet don't mind running much slower overall (hard real time?
aeronautics and space?).

As usual you'd need to measure it.

~~~
aaronblohowiak
GC pauses suck the "magic" out of interactive programs. The great thing about
obj-c is that it is trivial to have a C array of vanilla C structs, so your
performance-sensitive code is still slim and hot (size and cache locality).

My iOS audio programming mixes c for audio processing and obj-c for ui and
control quite effectively.

~~~
rwmj
I didn't say that GC was free ... malloc has overhead too.

For interactive programs, I would recommend two things to avoid (or hide) the
pauses. Firstly schedule calls which run the GC when it won't be noticed:
between frames in an arcade-style game, or just before accepting user input in
a turn-based game. Secondly, size the minor heap large enough so that all
computation between the scheduled GC runs can happen without invoking a minor
sweep (but not so large that the minor heap is wasting memory -- some
experimentation and tuning required).

I've written a couple of interactive games in OCaml that used this strategy
and avoided visible GC pauses.

------
stcredzero
What are the long term implications of the increasing amount of parallel
computing resources available to programmers? Are we getting to a point where
there is enough excess CPU available, that the extra instruction on every
assignment for reference counting is no big deal? (In certain contexts. In
some contexts extra instructions are always a big deal, but these don't span
all of computing.) Combine that with incremental algorithms for cycle
reclamation, and you'd have great low-latency GC.

~~~
larsberg
It's still not trivial even with massive parallelism. Performance is easily
killed if you:

1) Completely saturate the memory interconnect between the processors and RAM
chips

2) If your parallel collector threads are touching the same memory that your
worker threads are, you're going to have some cache contention

3) Cache lines that are being written to become a big problem if they are
shared between processors. For example, create two ref cells and have
two parallel threads set them -- if the compiler did not happen to pad them
out to cache line size, they'll be fighting over the same 64 bytes (8 64-bit
words/pointers) of memory
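Point 3 (false sharing) comes down to layout. A sketch in C11, assuming the common 64-byte cache line:

```c
#include <stddef.h>

/* Two adjacent long cells share one 64-byte cache line: parallel
   writers on different CPUs will ping-pong that line between caches. */
struct unpadded {
    long a;
    long b;
};

/* Aligning each cell to its own cache line avoids the false
   sharing, at the cost of extra memory. */
struct padded {
    _Alignas(64) long a;
    _Alignas(64) long b;
};
```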

We (Manticore) have done a lot of work to isolate the per-CPU work, heap
pages, etc. (<http://arxiv.org/abs/1105.2554> , MSPC 2011 proceedings should
be out on acm.org soon-ish). But it's not easy, and it requires rearchitecting
your whole compiler and runtime system. GHC is also in the middle of some
similar work, and they have a much harder problem, both due to the challenge
of implementing laziness and because they have a full-featured system and a
real user base, so they can't just change things willy-nilly.

Further, once you get the parallelism right, you end up in a different
situation: Amdahl's Law bites. Even if 90% of your program is amenable to
parallel work, if you make that portion go infinitely fast, you still have a
sequential portion that takes 10% of the time and therefore limits you to a
maximum of 10x speedup.
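That arithmetic is just Amdahl's Law; a one-liner makes the 10x ceiling easy to check:

```c
/* Amdahl's Law: overall speedup when a fraction p of the work is
   sped up by a factor s. With p = 0.9, even s -> infinity only
   approaches a 10x overall speedup. */
double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```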

We are hitting that limitation right now in our research and going back to do
more sequential performance work. I've got a small army of undergraduates
implementing compiler optimizations and analyses :-)

------
ruggeri
I enjoyed this basic introduction to GC; upvoted. What I'd look forward to is
further discussion of incremental and concurrent GC algorithms.

Until then, does anyone know what makes concurrent GC non-trivial? It seems
like it shouldn't be too hard to trace concurrently with program execution. And if
you don't compact, collection seems to just involve updating some structure
tracking free blocks. I'd imagine it's possible to write a thread-safe version
of that structure where every "free" request doesn't need to block every
"malloc" request. But I must have missed something.

I'd also be interested to read how compaction works; how are references
remapped from the old address to the new one? Is it possible that a reference
value is a pointer to a reference "object" which contains the pointer to data,
which needs to be updated? Then you only need to update a single pointer when
moving data, but every dereference incurs an extra layer of indirection.

~~~
dumael
Concurrent GC is somewhat non-trivial because the mutator (i.e. the program you
write) can delete references to objects from an area of the heap that has not
yet been examined by the GC, and introduce references to those same objects in
an area that has already been examined. Left unchecked, those objects would be
wrongly reclaimed, since the GC never saw the new references.

To cope with this the mutator is modified at compile/run time to inform the GC
of object updates that could lead to this situation.

During compaction of any sort, the first time the GC encounters an object that
is to be moved, it copies the object somewhere else and overwrites the old
copy's header with its forwarding address. Every time the GC encounters a
reference to the old object after that, it rewrites the reference with the
object's new location (conveniently found in the old copy's header).
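A minimal sketch of that first-encounter protocol in C (all names hypothetical; a real collector works on raw heap words, not typed structs):

```c
#include <stdlib.h>
#include <string.h>

/* An object's header holds NULL until the object is moved, and the
   forwarding address of the new copy afterwards. */
typedef struct Obj {
    struct Obj *forward;   /* NULL until the object is moved */
    int payload;
} Obj;

/* First encounter: copy the object into to-space and stash the
   forwarding address in the old copy's header. Later encounters
   just follow the pointer already stored there. */
Obj *forward_obj(Obj *old, Obj *to_space, size_t *alloc) {
    if (old->forward == NULL) {
        Obj *copy = &to_space[(*alloc)++];
        memcpy(copy, old, sizeof *copy);
        copy->forward = NULL;     /* fresh copy is not yet moved */
        old->forward = copy;      /* header now holds new location */
    }
    return old->forward;          /* GC rewrites references through this */
}
```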

> Is it possible that a reference value is a pointer to a reference "object"
> which contains the pointer to data, which needs to be updated?

Look up the 'Brooks-style forwarding pointer'; it is essentially
what you've described.

~~~
ruggeri
Thanks; upvoted! I think I'm beginning to appreciate that concurrent marking
is tougher than I thought; specifically, I can see how it can be hard to prove
an object is unreachable. So I imagine the marking side is where the
difficulties are.

Your explanation for compaction makes perfect sense. Of course, this won't
work trivially concurrently. Only if you stop the world can you complete
examination of the entire live heap and know you've updated all references to
the moved object and can collect the original space.

~~~
dumael
> I can see how it can be hard to prove an object is unreachable.

The difficulty with concurrent GC is not proving an object is unreachable but
proving it is likely to be reachable. For instance, Yuasa-type (or snapshot-at-
the-beginning) concurrent collectors have a tendency to keep dead objects alive.
This 'floating garbage' is a small price to pay for a fairly simple
correctness proof of never dropping a reachable object.

It's no harm for a GC to keep some dead objects alive but dropping something
reachable is completely unacceptable.

> Of course, this won't work trivially concurrently. Only if you stop the
> world [..] can collect the original space.

Yes, moving objects concurrently requires some serious thinking to get right.
Sun's G1 collector does this by recording the location of pointer updates that
point into regions which are to be compacted. This pushes the pause times down
to the length of time to move objects and update references to them.

If you're really serious about moving objects concurrently, see "A Study of
Concurrent Real-Time Garbage Collectors", Pizlo et al., 2008.

------
valyala
While the article is interesting, it skips important things that have a big
influence on practical GC speed: write barriers and finalizers. The following
older article from Microsoft has better coverage of GC internals
<http://msdn.microsoft.com/en-us/library/ms973837.aspx> (somewhat biased
towards .NET :) ).

------
rubashov
> The default choice of garbage collector in Hotspot is the throughput
> collector, which is ... entirely optimized for throughput

I just want to confirm this is true. Say you're doing a long-running
simulation. You don't care about pauses at all. You just want it to finish
fast. The default GC with no particular options is the way to go?

~~~
oconnor0
I'd say, odds are, that if you're just interested in throughput, the standard
Hotspot GC without any options isn't going to give you optimal performance.
I've found that, like gtani mentions, you'll have to spend time tuning to get
the performance you want.

In a similar situation, I found that the Parallel Old was, by far, the fastest
(throughput) collector.

~~~
gtani
I'd say the low-hanging fruit are -server, -Xmx, -Xms and -XX:MaxInlineSize,
and looking for boxing/unboxing. At the other extreme, is it worth it to your
company to spend a few weeks reading the Hunt/John book and benchmarking
thoroughly?

~~~
oconnor0
Interesting. I've definitely looked for & removed boxing/unboxing. I haven't
touched MaxInlineSize though.

What performance improvements/degradations have you seen in altering
MaxInlineSize?

