Hacker News

So am I reading this correctly, in that if I get allocated a segment at 0x12340 and then a random variable on the stack happens to have the value of 0x12340 for a different reason, then that segment would never be freed? Is this strategy still sound?

Also, while this is a nice explanation, does anyone actually use GCs while writing code in pure C? I never found the idea of calling free() that troublesome. On the other hand, writing a GC in C for another language is obviously a good use case.




The issue isn't that calling free() is troublesome. It's that in a large C program with various libraries and modules and callbacks, it isn't clear who should call free() on a given chunk of memory or when it is safe to do so.


Indeed, the garbage collector is not precise. However, in practice, the situation you describe does not tend to occur that often. Intuitively there is a 1/4 chance on 32-bit machines, but the chances are actually much lower, because the heap is located at a high memory address and most local integer variables tend to have small values.

And yes there are definitely uses for a pure C garbage collector. For example, one can use it to detect memory leaks, if for some reason valgrind isn't around.


Sorry, that's wrong.

The D programming language currently uses a conservative garbage collector. Before version 1.0, false pointers were a serious problem. The most common causes of false pointers were:

- ASCII text;

- floating-point values;

- high-entropy (compressed/encrypted) data, e.g. image files.

Just 1 MB of random-ish data is enough to pin down allocations of 16 KB or bigger. And of course, it cascades through any false pointers in any pinned memory blocks, and so on.

In D 1.0, the garbage collector remembers whether each memory block contains any pointers. So it can now tell that a given memory block either definitely contains no pointers, or may contain some. Even so, memory leaks due to false pointers on 32-bit are not unheard of. For personal use, I've had to write a tool to diagnose these memory leaks: https://github.com/CyberShadow/Diamond

Rainer Schuetze is working on improving the precision of the GC from one heap allocation block to one machine word, but there's still work to be done on this.


So what are some strategies to really solve this problem? Is it basically impossible to implement a GC for a C-like language that will not leak at least a little memory?


It's not possible as long as your language has:

- non-discriminated unions mixing pointers and non-pointers

- linking with C or other languages (as you can't scan their stack frames precisely)

- void*, memcpy, or other ways to move around memory which might have pointers.

A language without all of the above would be of limited utility. However, it's possible to dramatically improve the situation using:

- precise heap scanning - by knowing the type of each object allocated;

- precise stack scanning - by knowing the stack layout of all functions at all points throughout their execution;

- GC hooks for user-defined union types.


You are wrong.

I made a comment to that effect higher up before noticing this.

When you're dealing with real-world data - timings, statistics, sizes, dates - and then add floating-point numbers whose representations look like big integers, you will hit this a lot and leak memory, given how your mark and sweep works.

It is absolutely critical to have a sure way of disambiguating pointers from data.

I've fixed bugs in a memory allocator running in an app deployed on several hundred million PCs. It had a really fancy allocator that put a header with a magic number around each malloc result, in addition to what you do, in an attempt to identify specific kinds of allocations. We'd see several tens of thousands of crashes a day due to magic-number collisions with real-world data.

Your intuition is wrong. You can never assume that most local integer variables have small values. Assign -3.0 to a float and reinterpret it as an int, or read the current epoch time into an int, and see what happens.

A C garbage collector would certainly help write some code more easily, but it has to work, and it has to be "precise". An imprecise GC is worthless in production, but fun as an academic exercise.


There are quite a few collectors in production that are conservative at least for the stack. D, Boehm, JSC, SBCL, etc. It can be a problem, but it often isn't, especially when you have precise knowledge of heap object layouts.


Boehm-Demers-Weiser does see 'real world' usage.


Tens of thousands of memory blocks that can't be freed daily, on an install base of several hundred million PCs, sounds like an acceptable compromise, at least for non-critical applications.


For such a simple proof of concept this is good enough. Indeed, it's not going to break anything, just leak memory. If I recall correctly, more advanced C garbage collectors like Boehm's can describe the layout of a struct, indicating which fields are pointers and which are just data.


If you're constrained on memory (either address space or the limits of your physical + paging memory spaces) this is going to break everything. This is not a common situation for everyone, but it's a common situation for me.


Do you think this is appropriate to cover in a learning exercise that is intended to be simple?


I think it's appropriate to cover in a discussion about the merits - and limitations - of a proof of concept. I believe it's an appropriate counterpoint to "it's not going to break anything" as the discussion opens up to GCs in general, although that quote was likely meant only in the context of the proof of concept.


In most such environments, stop-the-world is just as problematic.



