It is well-known that the amortized cost of naive reference counting is considerably higher than that of a modern tracing garbage collector (regardless of whether you also support cycle collection). It is also well-known that non-naive RC-based approaches generally make it easier to achieve lower pause times (though cycle collection may throw a spanner in the works).
Attempts to bring amortized RC performance up to par with tracing GCs [1] generally require underlying machinery that is just as complicated as that of a tracing GC: you will typically rely on stack scanning to implement deferred reference counting, or on non-trivial compiler optimizations aimed at eliminating or reducing the overhead of local pointer assignments. Even then you may be left with a more expensive write barrier, so you may also have to gun for a hybrid approach.
Advantages of naive RC are that it is easier to add to an existing language without extensive compiler/runtime support (assuming you don't need cycle detection) and that it can more easily coexist with other approaches to memory management; amortized cost is not one of them.
What about how reference counting is done in languages like C++ and Rust? I.e. you explicitly make a value reference-counted, usually for large values (an entire class for example). Small values like ints are copied.
That seems like the best approach to me. Plus you can use RAII which is a far bigger advantage than any minor speed differences.
RAII is also possible in GC languages, just done differently.
If the language supports macros or trailing closures, you can get something quite similar to RAII-style handle classes.
with-database {
// connection is only accessible here
}
// connection is now closed and gone
with-file {
// file is only accessible here
}
// file is now closed
with-mutex {
// mutex guarded resource is now visible here
}
// mutex is now released, regardless of possible errors
with-native-memory {
// able to use an OS memory block
}
// OS got the memory back
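In a GC language like Python, this pattern falls out of context managers; a minimal sketch (the `managed_file` helper is illustrative, not a standard API):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def managed_file(path, mode="r"):
    f = open(path, mode)
    try:
        yield f          # the file is only accessible inside the with-block
    finally:
        f.close()        # runs on exit, even if the block raised

path = os.path.join(tempfile.gettempdir(), "with_demo.txt")
with managed_file(path, "w") as f:
    f.write("hello")
assert f.closed          # file is now closed, regardless of possible errors
```

The cleanup is tied to scope exit rather than to object destruction, which is why it works without deterministic destructors.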
>It is well-known that the amortized cost of naive reference counting is considerably higher than that of a modern tracing garbage collector
True, but I think these comparisons are often done under the assumption that memory utilisation remains fairly low. Tracing GCs react very badly to growing memory utilisation in my experience.
This is of course ultimately an economic question and maybe it should be modelled that way.
In my quest for a memory management scheme for Leaf it's your final point which is primarily forcing my decision: the compatibility with other memory managers. In this article I just wish to provide a counterpoint to those people trying to convince me that tracing GC is a better approach. In the end, it's the need to write libraries and be compatible that prevents me from using a complex GC.
Just keep in mind the trade-offs that you are making.
For starters, you are adding a considerable performance overhead to local pointer assignments; you're replacing a register-register move (one that a modern compiler can often optimize away entirely through register renaming, turning it into a no-op) with a register-register move plus two memory updates plus a branch. Even where you have cache hits and the branch is predicted correctly almost always, you're dealing with cache and BTB pollution; these effects may not fully show up in micro-benchmarks, and may only manifest completely in large programs.
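CPython's naive RC makes this cost visible at the language level: every new local alias is a count update in memory, where an uncounted pointer copy would be a plain register move. A small CPython-specific illustration (`sys.getrefcount` itself holds one temporary reference, hence the baseline):

```python
import sys

obj = object()
base = sys.getrefcount(obj)              # includes getrefcount's own temporary reference
alias = obj                              # logically just a pointer copy...
assert sys.getrefcount(obj) == base + 1  # ...but naive RC turns it into a memory update
del alias                                # and a matching decrement on the way out
assert sys.getrefcount(obj) == base
```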
Second, adding concurrency to the mix can complicate matters a lot (depending on your concurrency model [1]). You may need additional atomicity guarantees for the reference count updates, which, while cheaper than a full-blown lock, are generally still more expensive than the purely sequential case; you preclude compiler optimizations that could eliminate the overhead of pointer updates; and you're still left with the potential for race conditions (e.g. when one thread writes to a global variable or heap location while another thread is retrieving a pointer from the same location, and that pointer is the only remaining reference to an object).
Third, typical approaches to cycle detection and collection (such as trial deletion) reintroduce most of the challenges of tracing GC. While you may think that you can avoid cycles or handle them manually, they often crop up in unexpected places. A common case is when a closure is stored in an object that is also captured in the closure's environment [2].
I don't mean to discourage you from your decision – I don't know what your ultimate goals are and naive RC may very well be the best set of trade-offs you can make – I just want to alert you to possible trade-offs that you are making.
[1] Some popular concurrency models (such as Erlang's, Dart's, or Nim's) avoid shared references, so these issues would not even crop up for such languages, of course.
[2] I have a hypothesis that naive RC as an approach to memory management is particularly common in languages that do not (or historically, did not) support closures, but that's a different story.
What's the difference between naive RC and non-naive?
Would the above languages (Erlang, Dart and Nim) count as naive implementations?
I was under the impression Nim, at least, had a pretty efficient RC implementation that performed favourably compared with the Boehm GC when considering single-threaded performance. Is this accurate?
By naive I refer to RC implementations that simply instrument pointer assignments with the necessary reference counting instructions. Non-naive RC implementations try to minimize the overhead associated with it; deferred reference counting or having the compiler optimize away unnecessary reference count updates are typical approaches.
Erlang and Dart have tracing GCs; I was referring to their concurrency model (which uses thread-local heaps), not their method of garbage collection. Nim uses either deferred reference counting or a simple mark-and-sweep collector, depending on a compile time switch, but also uses thread-local heaps. The main attraction of Nim's deferred RC collector is that it can keep pause times very low; I haven't actually benchmarked its amortized performance, but I expect it should be competitive with or better than the Boehm GC, especially for large heaps.
Thanks for your answer. "Garbage collection" seems to cover a huge range of actual implementation approaches and sometimes it's hard to see the pros and cons of each approach when they're all lumped under the same thing.
In my mind, 'stop the world' pauses are the biggest disadvantage of some GC approaches, so it's nice to hear that deferred RC ameliorates that to some extent.
I submit this is only true for an equivalent number of things to keep track of. In practice this is not the case. Languages with GC go completely overboard with it, using it everywhere, even when not strictly necessary, and Java certainly does.
In C++, if you have a map with items, you use move semantics and you have either 0 or 1 refcounts to keep track of: the one for the map itself. The rest is still "refcounted", but without ever touching any integer, by the normal scoping rules. Same goes for Rust code. That's ignoring the fact that a Java object is, at minimum, 11 bytes larger than a C++ object. Given Java's boxing rules and string encoding, the difference in object sizes becomes bigger with bigger objects. Because out-of-stackframe RTTI is a basic necessity of tracing garbage collectors this is unavoidable, and cannot be fixed in another language. Bigger object sizes also mean more memory needed, more memory bandwidth needed, and so on. And Java's constant safety checks also mean additional runtime overhead.
In Java, the same map will give the GC 3 items to keep track of (minimum) per entry in the map, plus half a dozen for the map itself: one for the object that holds the key and the value, one for the key, and one for the value. That's assuming both key and value are boxed primitives, not actual Java objects. In that case, it'll be more.
Now you might argue that it'll therefore depend on a case-by-case basis. And while that is technically correct, any real program will in fact have a number of maps, and vectors, and so on, and C++ smart pointers will wipe the floor with their Java tracing equivalent (but will be significantly less correct; or, better expressed, it will take a far better programmer to get the C++ code right memory-wise than for Java).
It should also be said that the Java JVM's garbage collector is the result of 21 years of development by very, very good programmers. It is still not able to beat C++'s smart pointers in most cases, mostly due to its VM overhead. If you are not able to beat a department of senior Sun/Oracle programmers you cannot beat their GC's performance without doctoring benchmarks. Therefore the chances of any new GC'ed language beating the performance of C++ smart pointers any time soon (by which I mean decades) seem negligible. In practice it is considered very good performance for any non-C++ language to be about half as fast as C++ when it comes to memory allocation and deallocation.
Sounds like you're conflating GC with Java's limited type/value system.
> That's ignoring the fact that a Java object is, at minimum, 11 bytes larger than a C++ object.
In C++ minimum object size is 0 bytes. No virtual methods and no member variables.
> In Java, the same map will give the GC 3 items to keep track of (minimum) per entry in the map, plus half a dozen for the map itself. One for the object that keeps the key and the value, one of the key and one for the value.
This is because of lack of value types in Java, nothing to do with GC. Think of how horrible C/C++ would be if you had only primitive types (char, int, double, etc.), and pointers to objects or primitive arrays.
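This tax is not unique to Java; any language without value types pays it. A CPython illustration of boxed elements versus a packed value array (the `array` module is the closest thing CPython has to a value-typed container):

```python
import sys
from array import array

boxed = list(range(1000))        # each element is a full heap object with its own header
flat = array("q", range(1000))   # packed 8-byte machine ints, no per-element header

assert sys.getsizeof(1) > 8      # a boxed int costs far more than its 8-byte payload
assert flat.itemsize == 8        # the value-typed array stores just the payload
```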
> That's assuming both key and value are boxed primitives, not actual java objects. In that case, it'll be more.
Boxed primitives are actual Java objects.
> "and C++ smart pointers will wipe the floor with their java tracing equivalent"
No, C++ "smart pointers" (are you talking about std::shared_ptr?) will definitely be slower in amortized runtime costs. High cost of std::shared_ptr is one reason why std::unique_ptr exists.
> Sounds like you're conflating GC with Java's limited type/value system.
Well, yes. It's the JVM that seems to prevent value objects, not Java per se. And since an asynchronous GC needs RTTI, this is more general than just Java: given only a memory address, the collector needs to know the type of the object, which means each object needs at least a type pointer in its header (these days 8 bytes). I also don't know of any VM that really does a better job.
> In C++ minimum object size is 0 bytes.
C++ minimum object size is 1 byte (since 2 objects can't have the same address).
> Boxed primitives are actual Java objects.
I should have been clearer: I meant classes with fields, as apparently a class with a single int field is bigger than a boxed Integer in the JVM.
Years pass but I still prefer languages with reference counting over languages with proper GC.
Main reasons for me:
- Reference counting is deterministic (barring refcnt loops, which can usually be avoided by a careful programmer). Each run of the program will allocate/free memory in _exactly_ the same way.
- No latency spikes. "When GC kicks in, the latency must rise."
- This means it's way easier to understand the performance characteristics of code.
Having that in mind, there are couple of problems with refcnt:
- Cache spill. Increasing/decreasing counter values in random places in memory kills the cache. I remember one time I separated the "string object" data structure in Python away from the immutable (!) string value (CPython usually has the C struct object followed in memory by the value blob). The Python program suddenly went _way_ faster: the immutable (read-only) string values in memory were close to each other, and the mutated refcnts were close to one another. Everyone was happier.
- When dealing with large data (binary blobs), refcnt programming languages have a tendency to copy over needed data. Copying over is in many cases slower.
In general I'd say it's the last point that is critical. The big question is less "refcnt" vs "gc", and more how to deal with binary blobs: the "copy over" pattern (C style) and its newer version, "slices" (golang style), vs the "proper GC" style of trying to track how many pointers point to a blob.
Regarding latency spikes: that's not entirely true. If you drop the last reference to an object, then freeing that object can cause the last reference to other objects to be dropped, etc.
So it's possible that dropping a reference can cause a large amount of work, which if it happens during a critical path can cause a latency spike. Particularly if finalisers are involved. If your object ownership graph is straightforward this isn't a problem, but if you have large objects with shared ownership it's very easy to be surprised here.
(Used to work on an object-oriented operating system which used reference counting everywhere, and I discovered this the hard way.)
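The cascade is easy to reproduce in any refcounted runtime. A CPython sketch (the `Node` class is illustrative), counting destructor runs triggered by dropping a single reference:

```python
destroyed = 0

class Node:
    """A singly linked list node; each node holds the only reference to its successor."""
    __slots__ = ("next",)

    def __init__(self, nxt):
        self.next = nxt

    def __del__(self):
        global destroyed
        destroyed += 1

head = None
for _ in range(10_000):
    head = Node(head)

head = None                  # dropping the one reference to the head...
assert destroyed == 10_000   # ...synchronously frees the entire chain
```

All 10,000 destructors run inside that single assignment, which is exactly the kind of hidden latency spike described above.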
Those are some of the points I'm considering as well when doing the memory management for Leaf. My desire to have predictable destructors is tough to reconcile in a tracing GC.
I'm uncertain of what you mean with "copying the data over" in reference to large data. Surely it's still just the pointer being copied? Or are you speaking of things like `std::vector` in C++ that are value-copied by default?
I tend to agree, but don't overlook the downside of refcnt loops. It's quite easy to introduce them (for instance, when creating closures in Objective-C), hence negating the big selling point of automatic memory management in the first place.
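The closure trap is easy to demonstrate in CPython: an object stores a handler that closes over the object itself, so its refcount can never reach zero and only the tracing cycle collector reclaims it.

```python
import gc
import weakref

class Button:
    def __init__(self):
        # The handler closes over self, creating the cycle
        # self -> on_click -> closure cell -> self.
        self.on_click = lambda: self

b = Button()
probe = weakref.ref(b)
del b                        # refcount never reaches zero: the cycle keeps it alive
assert probe() is not None
gc.collect()                 # only the tracing cycle collector can reclaim it
assert probe() is None
```

In a purely refcounted runtime without a cycle collector (or with it disabled), the object would simply leak.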
Reference counting is a type of garbage collection. It's not complete without collecting cycles, but it can form part of a GC mix.
Assigning pointers in a scanning GC can have a cost; for example, generational GCs need to track writes to old generations that point to newer generation objects.
There's no discussion of thread safety and how that affects reference counting. That's probably the biggest problem; you need to use memory barriers to make reference counting safe with multiple threads, and that can kill performance.
I (the author) actually agree that reference counting is a form of garbage collection. However, in most discussions about memory management the generic term garbage collection tends to refer to scanning collectors. In particular, the people who oppose my views on scanning GCs are usually the ones who use the term garbage collector to refer to those types of collectors.
Most places I see talking about reference counting just say reference counting, sometimes mentioning memory management. Automatic memory management is perhaps the most generic term.
How reference counting actually works is covered in my previous article, which I wrote as a lead-up to this one. It covers the particulars of how memory coherency is the primary cost of reference counting.
The reference counting != GC thing gets repeated all the time on HN, so I'm glad you called this out. For the record, what these people call "GC" is better called tracing garbage collection.
The "rc != gc" ship has already sailed. I realize that it can be annoying to language purists but the way to make peace with it is to separate discussions into 2 types of contexts:
1) formal computer science discussion (such as Lambda The Ultimate) --> "rc == gc"
2) real-world discussions in Apple documentation and various other forums --> "rc != gc"
I'm not prescribing that you must like the situation but merely pointing out that's how it is these days. Repeatedly making corrections in the hopes that #1 overtakes #2 is not going to sway common usage. Even rigid and exacting computer scientists use words and concepts[1] that mathematicians frown on. In the end, the common usage (that often strays from original precision) prevails.
[1] e.g. "integer" as -32k to +32k instead of -inf to +inf, and "=" as "assignment" instead of "equality"
Alternatively, I'll continue to speak up for the established terminology in the GC literature, and continue to recommend that those interested in this subject read sources like Jones' GC Handbook rather than the "official" Apple documentation.
People who are actively hostile against using the established terminology are unlikely to have the expertise necessary to contribute productively to these kinds of discussions anyway.
>read sources like Jones' GC Handbook rather than the "official" Apple documentation.
I want to emphasize again that it's a language communication issue and not a technical issue; therefore, reading Jones' GC Handbook will not help nor solve the problem. One can thoroughly study that book, 100% agree with the terminology that "rc is a subset of gc", and in the end still choose to use the informal "rc != gc" -- because everyone else has been using it that way for a long time.
Also, by mentioning Apple, I may have given the impression that it's specifically "Apple" vs the "correct computer science" proponents, or that I'm saying Apple is some sort of authority or blessing on loose terminology. No, the informal "rc vs gc" nomenclature has been used for over 30 years, which is long before Apple did the GC/ARC flip-flop in 2011. It is also before the Jones GC book was published in 1996. Examples:
In other words, the "rc != gc" ship sailed a long time before Apple's documentation and the cited links show how pervasive its usage was across several programming languages.
Ideally, "RC" would have been "RCGC" Ref Counting Garbage Collector, and "GC" would have been "TGC" Tracing Garbage Collector so all subsequent conversations were "RCGC vs TGC" but that's not how history played out. Studying Jones' GC Handbook does not reverse 30+ years of how people informally use terminology.
I think the "reference counting != GC" meme originates in Apple's marketing, specifically they needed a simple way to market their reference counting (which they call ARC to distinguish it from manual reference counting previously used by ObjC) as different and superior to Android garbage collection.
And also to hide the fact that their Objective-C GC attempt was a failure.
It is true that their ARC semantics is a better solution to Objective-C's memory management semantics, but they needed to do the typical marketing stunt of selling a failure as something positive.
You don't necessarily need memory barriers (implementation detail).
All you need is an atomic fetch-and-add (like x86 XADD opcode). Many architectures provide that as a primitive. On those that don't, you have to compose it out of other atomic primitives.
If you have destructors or finalizers, you do need memory barriers to make sure that any previous modification of the object happens-before the destructor is run.
Also many architectures with builtin fetch-and-add bundle it with an implicit memory barrier (that includes x86).
Barriers are never needed for accesses to a single location, as all practical systems guarantee total ordering there; that's a given. But on a refcounted system, the only place to put the barrier for correct visibility is together with the refcount update.
It is true that on x86 all stores and loads have the correct implicit barriers, but you still get a store-load barrier (including its cycle cost) with lock xadd whether you want it or not.
assign(value)
{
    if (value)
        inc_ref(value)             // retain first: value may equal the current referent
    tmp = value
    atomic_swap(&root.ref, &tmp)   // publish; tmp now holds the previous pointer
    if (tmp && dec_ref(tmp) == 0)
        free(tmp)
}
If you don't use an atomic swap, you race with other threads trying to assign to the same location. You can't decrement the destination before incrementing the source because they could be equal. You can't make the logic conditional on equality because it races.
You have a harder time with this:
a.b.c = value
You need to inc-ref each step in the chain, keeping them in temporaries, to ensure they don't get dec-refed out from underneath you on another thread.
But if you have both atomic primitives available, why would you ever do that?
Keyword is atomic. Fetch-and-add returns you the previous value atomically.
-------
Edit:
This (below) is obviously a buggy reference counting algorithm.
-------
In freeing (decrementing reference count) thread A, if it was 1 before (returned value) and A added -1, the object can be freed.
Because the operation is atomic, only one of two things can happen:
1) Another reference adding thread B increments it before A decreases it. So A doesn't see 1, thus won't free the object.
2) B increments it after A decremented it. So B gets zero and knows the object is invalid. (Edit: Oops, a bug: after B increments refcount, when thread C tries to do same, it would get 1 and conclude the object is valid!)
If B is allowed to allocate the object again (invalid state, refcount == 0), careful synchronization needs to be done to remove any races with A's deallocation.
I think you're hung up on multiple threads reference-counting a single instance. That's not where reference counting as a GC approach gets difficult. It's when you're updating references to values.
It seems to me that any cache-coherent system needs to do that. If multiple writers can be updating a cacheline at the same time [1] you can't talk of coherency at all. Intel is stronger as it guarantees a total ordering of modifications across all cachelines, but any coherent system will guarantee the ordering of modifications of a single cacheline.
[1] visibly at least. HTM complicates things a bit of course.
If the refcount is allocated in the same cacheline with the rest of the object, then false sharing is also a problem. But the violation of the single writer principle when updating refcounts is actually a form of true sharing.
Another issue that usually gets omitted: GC-enabled languages like Modula-3, Oberon(-2), and D, among others, do allow for finer control of the GC.
In Modula-3, I can either fully rely on the GC, do safe global and stack allocations, or if really needed make use of manual memory management in unsafe modules.
So it is not as if all allocations need to be on the GC heap in GC-enabled languages.
I'm not sure about the others, but the D model is probably the worst of both worlds: you can transparently store a GC-allocated reference into the unmanaged heap and risk having dangling pointers after a random collection, or you have to root everything you allocate, increasing the scan time.
Something like Rust has a type system powerful enough to track such things so if you added GC to Rust you could deal with it sanely.
This is true as well, but at least that part is fixable (although I'm not sure it's in the realm of a GSoC project, considering the kind of work that went into stuff like the various JVM GC implementations).
Rc boxes in Rust are not safe against cycles. That's not a memory safety error - i.e. it won't permit use-after-free - but it can be a potential memory leak.
Reference counting is fast with one CPU core. But it gets worse and worse as you add cores. Inter-CPU synchronization is slow and the buses can saturate. With enough cores, at some point that'll be all the system is doing.
As far as I know, reference counting also requires you to have a heap with "alloc" and "free". Memory allocation is a slow operation, because it involves searching for the best matching free span to avoid heap fragmentation.
On the other hand, GC can scale across any number of CPU cores.
That's not how CPU synchronization works. Reference count updates by themselves are purely CPU-local even when atomic. Cross-CPU synchronization happens only if an object is actually shared.
It is true that, if an object is actually shared, reference counting would 'upgrade' a read only sharing to writable which is expensive; there are ways around that though, look for differential reference counting.
Re alloc and free, that's completely orthogonal with refcounting.
Whether they're CPU local depends on cache line state. It's fast in MESI protocol exclusive or modified state. Slow in invalid and shared states.
However real systems have contended objects pretty much always.
I'm vaguely aware about differential reference counting, but don't know any systems that use it. A bit like fast multicore counters, that break one counter into multiple uncontended "sub-counters". Read operation adds all sub-counters together to get the "real" value.
> Re alloc and free, that's completely orthogonal with refcounting.
Technically true, but in real systems those go hand in hand almost always. Yeah, I have written firmware for embedded devices that do refcounting and don't allocate memory. But that's not the usual case.
I guess the reason that existing systems do not use advanced refcount modes is that either they are simply not meant for high performance (e.g. Python, Perl) or they just don't do much refcount traffic (i.e. only long-lived objects are refcounted and new references are not created often).
Re alloc and free, I meant that you do not need to use the system malloc and free, you can be allocating and deallocating from a simple thread local bump allocator plus free list.
At the limit you can have a fully generational GC behind it and use the refcount only to decide when to run destructors, or as an optimization.
Don't describe it as "fast". Use "latency" or "throughput". RC is a useful building block for making real-time latency guarantees, and for many apps that is more important than peak throughput.
You don't have to clean up everything at once - it's possible to put items on a queue to be cleaned up when refcounts hit zero, and you can then control at what rate the queue is emptied.
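A sketch of that idea (all names here are illustrative): `release` pushes dead objects onto a queue instead of reclaiming them eagerly, and the caller drains the queue in bounded steps to cap pause times.

```python
from collections import deque

free_queue = deque()

class Handle:
    """A manually refcounted wrapper; not thread-safe, illustration only."""
    def __init__(self, resource):
        self.resource = resource
        self.refs = 1

    def retain(self):
        self.refs += 1

    def release(self):
        self.refs -= 1
        if self.refs == 0:
            free_queue.append(self)   # defer the actual cleanup off the hot path

def drain(max_items):
    """Reclaim at most max_items dead objects, bounding the pause."""
    done = 0
    while free_queue and done < max_items:
        free_queue.popleft().resource = None  # stand-in for the real cleanup work
        done += 1
    return done

h = Handle("connection")
h.retain()
h.release()                  # still one reference: nothing queued
h.release()                  # count hits zero: queued, not freed yet
assert drain(max_items=8) == 1
```

This trades deterministic destruction timing for bounded reclamation work per step, which is the same trade an incremental tracing collector makes.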
Which makes it already far too complicated, more complex than a real time mark&sweep. Especially if there are loop checks - you cannot just empty the queue gradually, it's all or nothing.
The author unfortunately doesn't understand how GC works:
> GC’s overhead relates to the total memory allocated, not just
> the memory actively used in the code. Each new allocation
> results in an increased time per scanning cycle. Certainly
> there are many optimizations involved in a good GC to limit
> this, but the fundamental relationship is still linear. As a
> program uses more memory the overhead of its GC increases.
A GC's overhead (at least a mark and sweep collector) is proportional
to the number of live objects. This is because all of the dead objects
are unreachable and thus aren't scanned.
Plain mark and sweep has to touch all currently allocated objects; obviously, it either has to mark an object or deallocate it in the sweep phase.
On the other hand two-space copying GC only touches live objects.
I'm not sure how this disputes what I say about a scanning garbage collector. The number of live objects is proportional to the amount of total memory allocated.
I don't consider dead objects, where the memory is no longer used and free to be allocated, to be part of the allocated memory.
That's a confusing choice of terminology. "allocated" reads like the past-tense of "allocate"; it's natural to think "allocated" means "was allocated at some point in the past". Maybe it is correct in some technical sense, but for the casual reader "live" would be a lot clearer.
> The number of live objects is proportional to the amount of total memory allocated.
No, that's not right. The number (or ratio) of live objects depends on application behavior. According to the weak generational hypothesis, most objects die young, so a garbage collector that only touches live objects (e.g. a tracing collector) does no work at all for most objects.
The actual survival rate is complex and depends on application behavior, but typical well-tuned GC systems often see survival rates from young-generation collections around 10%; that means only 10% of objects ever incur any GC overhead. Those 90% of quickly dying objects do, however, take up memory and cache space until they are reclaimed. IMO that is the only remaining situation where manual memory management can outperform GC by wide margins: when a program can carefully reuse recently allocated memory to keep reusing the same cache lines.
In my conclusion I mention why there are no benchmarks, it's nearly impossible to construct a realistic comparison between the two techniques.
When I consider tracing GC my primary arguments against it have nothing to do with performance at all, yet performance often comes up as an argument against my position. This article attempts to address that critique by saying I doubt you could compare the two directly.
> In my conclusion I mention why there are no benchmarks, it's nearly impossible to construct a realistic comparison between the two techniques.
You "mention" this as if it were fact, yes. Apparently you are not aware of realistic comparisons that do exist, such as http://www.cs.utexas.edu/users/mckinley/papers/mmtk-sigmetri...
(Alternatively, the article needs a discussion of why you think this is not a "realistic comparison".)
The evidence is that all high performance garbage collected languages use some form of generational collectors. These implementations aren't made by idiots.
Of course reference counting is slower. Especially if you take loop handling into account. Imagine a degenerate scenario: using up all of your memory to allocate a single linked list, and then dropping it all at once.
Another, much more common scenario: allocating huge numbers of very short living objects. A generational GC handles this with nearly zero overhead, while reference counting plus a heap allocation sucks on an epic scale.
Also, do not forget about the compaction and cache locality implications.
Yes, but Eiffel, D (on your list), Modula-3, Oberon(-2), Component Pascal, C#, Go are all examples of GC languages with finer control over static vs dynamic allocation.
Something that most GC vs RC discussions seem to keep forgetting about.
In Modula-3 I can even make use of manual memory management in unsafe modules, if really required. And control storage layouts even in safe code.
> Imagine a degenerate scenario: using up all of your memory to allocate a single linked list, and then dropping it all at once.
Sorry, I am not sure I follow this example. I assume you create a list O(n) in size where n is the maximum allowed by memory (not sure memory matters).
For RC: You drop the single list, as you drop the head, you decrease reference counts to each member (one-by-one), check reference counts for 0, and free them. I.e. taking O(n) time.
For Tracing GC: You'll eventually sweep, then check whether each member is unreachable as determined by the sweeper, and free each member. I.e. also taking O(n) time.
That's the key here. Eventually. When it is convenient to do so. You have very fine-grained control over when the sweeper may do its dirty job.
It's far more complicated with RC to ensure that it only does a certain small amount of work once in a while.
And with a loop detection, you have to do everything at once.
Also, with mark&sweep you can keep your tags somewhere else altogether and avoid touching (and invalidating) the actual data at all, at no additional cost.
> Reference counting time is proportional to data structure sizes on deletion.
The actual garbage collection in all collectors is proportional to the amount of data that needs to be deleted. There are plenty of refcounting implementations where the data is pushed onto a freelist or cleaned up in a deferred fashion.
At which point you start making RC behave more like tracing collectors, so why not start with them in first place?
Also there are generational tracing compacting collectors with (almost) no pauses, so I doubt such approach wins over them when measuring performance and fragmentation across application usages.
Why do you even care about the GC epoch length? It does not affect the mutator performance at all.
EDIT: actually, it does a bit: the longer the epoch, the better, because at the end of the epoch a short stop-the-world is required to collect the global roots from registers and stacks.
What? Stop-the-world in the realtime collectors is always for a set duration, only to scan the registers and tops of the stacks (if there are any stacks at all, stackless implementations have even a better latency).
Heap fills up, you need a full collection. During that collection your program is running and allocating, but runs out of space, so it grinds to a halt.
Look, there is a reason why Rust is a thing, why realtime collectors are only used in niche cases, why there are people whose sole job is to fix GC issues, why there are people programming in Java/C# very carefully to ensure that a full heap collection never happens, etc. GC isn't a magic bullet.
> During that collection your program is running and allocating, but runs out of space, so it grinds to a halt.
It's a marginal case that should never happen, just like the case of running out of memory altogether. And RC will behave even worse in such a case.
Also, all three parts are running concurrently - marker, sweeper and mutator. Sweeper is constantly supplying the mutator with the new free cells from the current epoch, so you don't need to wait for another epoch to start getting new free cells.
It's not about running out of memory altogether, it's about the heap filling up with dead objects while the GC doesn't yet know that they are dead. Of course things are great when they are great and the GC can keep up with the mutator. Yet failure does happen whether or not you think it should happen, hence all the efforts that I mentioned that are trying to fix the issues.
[1] Such as https://www.cs.purdue.edu/homes/hosking/690M/urc-oopsla-2003...