Hacker News new | past | comments | ask | show | jobs | submit login
A Unified Theory of Garbage Collection (2004) [pdf] (virginia.edu)
151 points by mpweiher on July 22, 2017 | hide | past | favorite | 22 comments



This is a classic paper that's a must-read for anyone interested in GC. The basic idea is that not only is reference counting a form of garbage collection, there's actually a continuum of GC approaches between pure tracing and pure reference counting. Each algorithm has its tradeoffs, and, crucially, most optimizations—for example, generational garbage collection—actually just move the scheme toward the opposite pole, acquiring some of the upsides and downsides of the other algorithm. Viewed through this lens, most of what we would consider well-optimized GCs deployed in the wild (Java HotSpot, .NET, modern JS engines, etc.) are really best viewed as hybrids of tracing and reference counting, balancing the upsides and downsides of each approach.


Respectfully disagree that GC is limited to the 1-D continuum noted. This is only the story of store model of memory management.

The cache model with its various flavors of eviction policy -- LRU, LFU, etc. -- represents another dimension in the space of memory management.


> The basic idea is that not only is reference counting a form of garbage collection..

Oh, how this leads to debate. The confusion is whether having automatic reference counting without a cycle detector is considered GC by the masses (what language publicizes a GC and does this?), and if people question GC to be a form of automatic memory management, where other forms may exist.

Briefly looking at the article I think the authors opinion on this matter is pretty clear that an additional algorithm such as a cycle collector is required to be considered GC.

edit: "Reference counting" itself also isn't even necessarily 'automatic'. Just find it odd the way this gets phrased. GC can use reference counting, not the other way around..


> Briefly looking at the article I think the authors opinion on this matter is pretty clear that an additional algorithm such as a cycle collector is required to be considered GC.

That is almost certainly not what David Bacon thinks. I don't know how you could get that idea from reading the introduction. Reference counting is a form of garbage collection, period.

C++, Objective-C, etc. started muddying the waters in the '90s and 2000s and got people thinking that somehow reference counting isn't GC and therefore doesn't have the downsides of automatic memory management. But the literature is clear on this point.


> That is almost certainly not what David Bacon thinks. I don't know how you could get that idea from reading the introduction. Reference counting is a form of garbage collection, period.

I'm not sure what you're reading. I will quote:

"Tracing and reference counting are uniformly viewed as being fundamentally different approaches to garbage collection that possess very distinct performance properties"

Saying reference counting is an approach that garbage collection can use is certainly valid, but is not the same as saying that reference counting is a form of garbage collection, which is nonsensical usage IMO.

Further evidence that the author believes that reference counting alone is insufficient for garbage collection:

"Finally, tracing collects cyclic garbage while reference counting does not. As a result, when reference counting is applied to heaps that may contain cyclic garbage, cycles must either be collected using a backup tracing collector [43] or with a trial deletion algorithm"

"One of the primary disadvantages of reference counting collec- tors is that they do not find cyclic garbage. Therefore, an additional mechanism is required."

Regarding "who was first", although I don't have any supporting proof, I'd be shocked if usage of manually managing reference counting hasn't predated both automatic memory management and garbage collectors, which isn't a technique constrained to, say Objective-C.

Also worthy of mention, some referenced counted garbage collection implementations (e.g. CPython) vend an API for disabling the GC, which certainly has no implications on reference counting.


> Saying reference counting is an approach that garbage collection can use is certainly valid, but is not the same as saying that reference counting is a form of garbage collection, which is nonsensical usage IMO.

You're drawing distinctions in the paper that the author almost certainly did not intend. "An approach to garbage collection" means "a form of garbage collection". If Bacon had meant to write "an approach that garbage collection can use", he would have written that.

David Bacon's opinion is that RC systems should have some mechanism to collect cycles. This is understandable, given that he did more work than anyone else to develop good cycle collectors (and his work in this area is a good read). But nobody in the GC field, including him, would claim that reference counting without cycle detection isn't garbage collection at all.

All garbage collection (automatic memory management) systems use a form of approximation to determine which objects are reachable. Some approximations involve reachability; that is what tracing GCs use. Some approximations involve tracking the number of outstanding references; that is what reference counting GCs use. A pure GC has a more fine-grained approximation than a pure reference counting one does. But both are approximations to the real problem: whether some data will dynamically be accessed by the program again. Reference counting has a less precise approximation, but that doesn't make it not a form of garbage collection.

Required reading:

- http://www.memorymanagement.org/glossary/g.html#term-garbage...

- http://www.memorymanagement.org/glossary/r.html#term-referen...


By saying "reference counting is a form of GC", I'm not sure whether or not you think that reference counting has usage outside of (and could have predated) GC. David did not write "a form of" either, although you think the meaning conveyed is the same. Also one could also counter similarly with saying that if David meant should, he would have said so, but whatever.

I realize memorymanagement.org conflicts with my usage, although I'm not sure I subscribe with the idea that the usage people have been using first historically is sensible.


> "classic"

Is there anything more recent, in the same "unified" direction? I know Rust was experimenting with garbage-collected pointers but it seems like that's not really a focus anymore: https://www.rust-lang.org/en-US/faq.html#is-rust-garbage-col...


As far as I can tell, the only real way to take a next step beyond the imprecise informal idea in this paper would be to develop a formal calculus for describing garbage collectors that is 'polarized' in a way that tracing live objects and discovering dead objects are truly dual.


I'm curious how RAII approaches relate to this?

Would it be considered a form of binary reference counting(where reference counting isn't explicitly used, i.e non shared_ptr code in C++) or it's just completely separate and not part of the dual GC theory?


At it's basic RAII is about re-using the a hierarchy of lifetimes (induced from the hierarchy of stack frames in as strict language) for managing resources other than memory.


Great paper! Especially useful when talking to C++ fanatics who are proud of their language's de-facto reference counting via 'smart pointers'.


That is more common among those that only did C -> C++ transition, without any experience in other languages with automatic memory management, though.

If you check Herb Sutter's talk about "deferred pointers", he goes to great lengths to only speak the forbidden word towards the end, to avoid loosing the audience.

He also presents a few examples where RC can lead to long pauses or even stack overflow, in the presence of deeply nested data structures.


Python uses reference counting with a backup garbage collector just to pick up unreachable sets of object having circular references.


Python is slow & CPython is ever slower because reference counting is part of what makes removing the GIL such a challenge


The hard part of removing GIL are the language semantics themselves. Replacing CPython's reference counting with some kind of tracing GC will still leave you with problem of coming up with low-overhead thread-safe dictionary implementation.


I agree with the first sentence. For the second part, what are you thinking of? For reads, there's very low overhead for using an implementation like Java's ConcurrentHashMap. The overhead for insertions is more significant, but not enormous.


> reference counting is part of what makes removing the GIL such a challenge

Interesting, I would have expected that switching reference counting to atomic operations would be pretty trivial. What about CPython's design complicates this?


Pretty trivial. Also pretty expensive.


Other languages use reference counting too. I don't really see how that applies to a discussion on a paper about GC.

Edit: my bad, they literally mention reference counting in the first sentence. Leaving this here because I should read the link before the comments.


This is exactly why I insist on using the wording "automatic memory management" instead of "garbage collection". When you say AMM, it's clear and intuitive: it means an object's lifetime is not really under programmer's control, but is up to some automatically working algorithm/mechanism. But if you say GC to mean not tracing GC, half of audience is doomed to misunderstand you.


<insert language> fanatics != good coders




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: