
Is reference counting slower than GC? - nikbackm
https://mortoray.com/2016/05/24/is-reference-counting-slower-than-gc/
======
rbehrends
It is well-known that the amortized cost of naive reference counting is
considerably higher than that of a modern tracing garbage collector
(regardless of whether you also support cycle collection). It is also well-
known that non-naive RC-based approaches generally make it easier to achieve
lower pause times (though cycle collection may throw a spanner in the works).

Attempts to bring amortized RC performance up to par with tracing GCs [1]
generally require underlying machinery that is just as complicated as that of
a tracing GC: you will typically rely on stack scanning to implement deferred
reference counting, or on non-trivial compiler optimizations aimed at
eliminating or reducing the overhead of local pointer assignments, and you
may still be left with a more expensive write barrier, so you may also have
to gun for a hybrid approach.

Advantages of naive RC are that it is easier to add to an existing language
without extensive compiler/runtime support (assuming you don't need cycle
detection) and that it can more easily coexist with other approaches to memory
management; amortized cost is not one of them.

[1] Such as [https://www.cs.purdue.edu/homes/hosking/690M/urc-oopsla-2003.pdf](https://www.cs.purdue.edu/homes/hosking/690M/urc-oopsla-2003.pdf)

~~~
iofj
I submit this is only true for an equivalent number of things to keep track
of. In practice that is not the case: languages with GC go completely
overboard, using it absolutely everywhere, even when not strictly necessary,
and Java certainly does.

In C++, if you have a map with items, you use move semantics and you have
either 0 or 1 refcounts to keep track of: the one for the map itself. The
rest is still "refcounted", but without ever touching any integer, by the
normal scoping rules. The same goes for Rust code. That's ignoring the fact
that a Java object is, at minimum, 11 bytes larger than a C++ object. Given
Java's boxing rules and string encoding, the difference in object sizes grows
with bigger objects. Because out-of-stackframe RTTI is a basic necessity of
tracing garbage collectors, this is unavoidable and cannot be fixed in
another language. Bigger object sizes also mean more memory needed, more
memory bandwidth needed, and so on. And Java's constant safety checks add
overhead on top of that.
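
A sketch of the C++ side of that (a hypothetical example):

```cpp
#include <map>
#include <string>
#include <utility>

// Sketch: moving a value into a container transfers ownership by
// scope; no reference count is ever touched.
int main() {
    std::map<int, std::string> m;
    std::string value(1 << 20, 'x');  // a large string
    m.emplace(1, std::move(value));   // ownership moves into the map
    // When 'm' goes out of scope everything it owns is destroyed
    // deterministically; the only thing to "track" is the map.
}
```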

In Java, the same map will give the GC 3 items to keep track of (minimum) per
entry in the map, plus half a dozen for the map itself: one for the object
that holds the key and the value, one for the key and one for the value.
That's assuming both key and value are boxed primitives, not actual Java
objects. In that case, it'll be more.

Now you might argue that it therefore depends on the case. And while that is
technically correct, any real program will in fact have a number of maps, and
vectors, and so on, and C++ smart pointers will wipe the floor with their
Java tracing equivalents (but will be significantly less correct; or, put
better, it will take a far better programmer to get the C++ code right,
memory-wise, than it does in Java).

An additional comment should be made that the JVM's garbage collector is the
result of 21 years of development by very, very good programmers. It is still
not able to beat C++'s smart pointers in most cases, mostly due to its VM
overhead. If you are not able to beat a department of senior Sun/Oracle
programmers, you cannot beat their GC's performance without doctoring
benchmarks. Therefore the chances of any new GC'ed language beating the
performance of C++ smart pointers any time soon (by which I mean decades)
seem negligible. In practice it is considered very good performance for any
non-C++ language to be about half as fast as C++ when it comes to memory
allocation and deallocation.

~~~
vardump
Sounds like you're conflating GC with Java's limited type/value system.

> That's ignoring the fact that a Java object is, at minimum, 11 bytes larger
> than a C++ object.

In C++ minimum object size is 0 bytes. No virtual methods and no member
variables.

> In Java, the same map will give the GC 3 items to keep track of (minimum)
> per entry in the map, plus half a dozen for the map itself. One for the
> object that keeps the key and the value, one of the key and one for the
> value.

This is because of lack of value types in Java, nothing to do with GC. Think
of how horrible C/C++ would be if you had only primitive types (char, int,
double, etc.), and pointers to objects or primitive arrays.
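
To illustrate with a sketch (the Java behavior is described in the comments):

```cpp
#include <map>

// Sketch of the value-type difference. In C++ the int key and int
// value live inline inside the map node: one allocation per entry.
// The Java equivalent, HashMap<Integer, Integer>, roughly needs an
// entry object plus two boxed Integers: three GC-tracked objects.
int main() {
    std::map<int, int> m;
    m[42] = 7;   // key and value stored inside the node itself
}
```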

> That's assuming both key and value are boxed primitives, not actual java
> objects. In that case, it'll be more.

 _Boxed_ primitives are actual Java objects.

> "and C++ smart pointers will wipe the floor with their java tracing
> equivalent"

No, C++ "smart pointers" (are you talking about std::shared_ptr?) will
definitely be slower in amortized runtime cost. The high cost of
std::shared_ptr is one reason why std::unique_ptr exists.
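
A sketch of the difference:

```cpp
#include <memory>
#include <utility>

// Sketch: copying a shared_ptr does an atomic increment (and a
// matching decrement later); moving a unique_ptr is a plain
// pointer assignment.
int main() {
    auto sp  = std::make_shared<int>(1);
    auto sp2 = sp;                    // atomic refcount traffic
    auto up  = std::make_unique<int>(1);
    auto up2 = std::move(up);         // no counter at all
}
```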

Disclaimer: C++, not a Java programmer.

~~~
iofj
> Sounds like you're conflating GC with Java's limited type/value system.

Well, yes. It's the JVM that seems to prevent value objects, not Java per se.
Since an async GC needs RTTI, this is more general than just Java: given only
a memory address, the collector has to know what the type of the object is,
which means every object carries at least a pointer for that (these days 8
bytes). I also don't know of any VM that really does a better job.

> In C++ minimum object size is 0 bytes.

C++ minimum object size is 1 byte (since 2 objects can't have the same
address).

> Boxed primitives are actual Java objects.

I should have been clearer: I meant classes with fields, as apparently a
class with a single int field is bigger than a boxed Integer in the JVM.

------
majke
Years pass but I still prefer languages with reference counting over languages
with proper GC.

Main reasons for me:

\- Reference counting is deterministic (barring refcnt loops, which can
usually be avoided by a careful programmer). Each run of the program will
allocate/free memory in _exactly_ the same way.

\- No latency spikes. "When GC kicks in, the latency must rise."

\- This means it's way easier to understand the performance characteristics of
code.

Having that in mind, there are couple of problems with refcnt:

\- Cache spill. Increasing/decreasing counter values at random places in
memory kills the cache. I remember one time I separated the "string object"
data structure in Python from the immutable (!) string value (Python usually
has the C struct object followed in memory by the value blob). The Python
program suddenly went _way_ faster: immutable (read-only) string values were
close to each other in memory, and the mutated refcnts were close to one
another. Everyone was happier. (See the sketch at the end of this comment.)

\- When dealing with large data (binary blobs), refcounted programming
languages have a tendency to copy the needed data. Copying is in many cases
slower.

In general I'd say it's the last point that is critical. The big question is
less "refcnt" vs "gc" than how to deal with binary blobs: the "copy over"
pattern (C style) and its newer variant, "slices" (golang style), versus the
"proper GC" style of trying to track how many pointers point to a blob.
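
Roughly, the layout change looked like this (a sketch; the struct names are
made up, not the actual CPython definitions):

```cpp
#include <cstddef>

// Combined layout: the frequently-mutated refcount shares cache
// lines with the immutable string bytes that follow it.
struct InlineString {
    long refcnt;
    std::size_t len;
    char data[1];             // payload follows the hot header
};

// Separated layout: mutable headers pack together; the read-only
// payloads live elsewhere and stay in clean cache lines.
struct StringHeader {
    long refcnt;
    std::size_t len;
    const char* data;         // points into a read-only region
};
```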

~~~
david-given
Regarding latency spikes: that's not _entirely_ true. If you drop the last
reference to an object, then freeing that object can cause the last reference
to other objects to be dropped, etc.

So it's possible that dropping a reference can cause a large amount of work,
which if it happens during a critical path can cause a latency spike.
Particularly if finalisers are involved. If your object ownership graph is
straightforward this isn't a problem, but if you have large objects with
shared ownership it's very easy to be surprised here.
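
A sketch of the effect (hypothetical C++; any refcounting scheme behaves
similarly):

```cpp
#include <memory>

// Hypothetical sketch: one reset() call frees the whole chain.
struct Node {
    int value;
    std::shared_ptr<Node> next;
};

int main() {
    std::shared_ptr<Node> head;
    for (int i = 0; i < 10000; ++i)
        head = std::make_shared<Node>(Node{i, head});

    // Dropping the last reference runs ~Node -> ~shared_ptr ->
    // ~Node ... for every node in the chain, all at once. With a
    // long enough chain the recursive destruction can even
    // overflow the stack.
    head.reset();
}
```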

(I used to work on an object-oriented operating system which used reference
counting everywhere, and I discovered this the hard way.)

I don't follow your comment about binary blobs?

------
barrkel
This isn't an informed article.

Reference counting is a type of garbage collection. It's not complete without
collecting cycles, but it can form part of a GC mix.

Assigning pointers in a scanning GC can have a cost too; for example,
generational GCs need to track writes that store a pointer to a
newer-generation object into an older generation.
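
For example, a card-marking write barrier looks roughly like this (a sketch;
the names and sizes are invented):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of a card-marking write barrier. Every pointer store also
// marks the "card" covering the written field, so a young-generation
// collection can find old->young pointers by rescanning dirty cards
// instead of scanning the whole old generation.
struct Object;                               // opaque heap object

constexpr std::size_t kCardShift = 9;        // 512-byte cards
constexpr std::size_t kCardCount = 1 << 20;  // covers 512 MiB
static std::uint8_t card_table[kCardCount];

inline void write_ref(Object** field, Object* value) {
    *field = value;                          // the store itself
    auto addr = reinterpret_cast<std::uintptr_t>(field);
    card_table[(addr >> kCardShift) % kCardCount] = 1;  // mark dirty
}
```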

There's no discussion of thread safety and how that affects reference
counting. That's probably the biggest problem; you need to use memory barriers
to make reference counting safe with multiple threads, and that can kill
performance.

~~~
vardump
You don't necessarily need memory barriers (implementation detail).

All you need is an atomic fetch-and-add (like x86 XADD opcode). Many
architectures provide that as a primitive. On those that don't, you have to
compose it out of other atomic primitives.
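
A sketch of what that looks like in C++ (roughly what typical
std::shared_ptr control blocks do):

```cpp
#include <atomic>

// Sketch of fetch-and-add based refcounting; on x86 both operations
// compile down to LOCK XADD.
struct RcObject {
    std::atomic<long> refs{1};
    virtual ~RcObject() = default;
};

inline void retain(RcObject* o) {
    // The increment publishes nothing, so relaxed ordering is fine.
    o->refs.fetch_add(1, std::memory_order_relaxed);
}

inline void release(RcObject* o) {
    // The decrement must order earlier writes before a possible
    // delete, hence acquire-release rather than relaxed.
    if (o->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete o;
}
```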

~~~
xxs
The atomic part has to notify the other caches to invalidate the cache line.
In the best case you pay only local latency.

~~~
gpderetta
FYI, _any_ store has to invalidate remote cachelines; that's taken care of by
the coherence protocol even before the store is actually performed.

Memory barriers (and atomic operations) don't have anything to do with it.

~~~
JoachimSchipper
Whether stores invalidate remote cachelines depends entirely on the
architecture. But yes, Intel has fairly strong coherency.

~~~
gpderetta
It seems to me that any cache-coherent system needs to do that. If multiple
writers can be updating a cacheline at the same time [1], you can't talk of
coherency at all. Intel is stronger in that it guarantees a total ordering of
modifications across all cachelines, but any coherent system will guarantee
the ordering of modifications of a single cacheline.

[1] visibly at least. HTM complicates things a bit of course.

------
vardump
Reference counting is fast with one CPU core. But it gets worse and worse as
you add cores. Inter-CPU synchronization is slow and the buses can saturate.
With enough cores, at some point that'll be all the system is doing.

As far as I know, reference counting also requires you to have a heap with
"alloc" and "free". Memory allocation is a slow operation, because it involves
searching for the best matching free span to avoid heap fragmentation.

On the other hand, GC can scale across any number of CPU cores.

~~~
gpderetta
That's not how CPU synchronization works. Reference count updates by
themselves are purely CPU-local even when atomic. Cross-CPU synchronization
happens only if an object is actually shared.

It is true that, if an object is actually shared, reference counting would
'upgrade' a read-only sharing to writable, which is expensive; there are ways
around that though, look up differential reference counting.

Re alloc and free, that's completely orthogonal to refcounting.

~~~
vardump
You're more exact, I tried to simplify.

Whether they're CPU-local depends on the cache line state. It's fast in the
MESI protocol's exclusive or modified states, slow in the invalid and shared
states.

However, real systems pretty much always have contended objects.

I'm vaguely aware of differential reference counting, but don't know any
systems that use it. It's a bit like fast multicore counters that break one
counter into multiple uncontended "sub-counters": a read operation adds all
the sub-counters together to get the "real" value.
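
A sketch of that striped-counter idea (hypothetical, not from any particular
system):

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Writers hit a per-shard slot (usually uncontended); readers sum
// all the shards.
constexpr int kShards = 64;

struct alignas(64) Shard {        // one cache line per shard,
    std::atomic<long> n{0};       // avoiding false sharing
};

class StripedCounter {
    Shard shards_[kShards];
public:
    void add(long v) {
        // Hash the thread id to pick a shard.
        std::size_t h =
            std::hash<std::thread::id>{}(std::this_thread::get_id());
        shards_[h % kShards].n.fetch_add(v, std::memory_order_relaxed);
    }
    long value() const {          // the read side pays the cost
        long sum = 0;
        for (const Shard& s : shards_)
            sum += s.n.load(std::memory_order_relaxed);
        return sum;
    }
};
```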

> Re alloc and free, that's completely orthogonal with refcounting.

Technically true, but in real systems those go hand in hand almost always.
Yeah, I have written firmware for embedded devices that do refcounting and
don't allocate memory. But that's not the usual case.

~~~
gpderetta
I guess the reason existing systems do not use advanced refcount modes is
simply that either they are not meant for high performance (e.g. Python,
Perl) or they just don't do much refcount traffic (i.e. only long-lived
objects are refcounted and new references are not created often).

Re alloc and free, I meant that you do not need to use the system malloc and
free; you can allocate and deallocate from a simple thread-local bump
allocator plus a free list.
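
For fixed-size refcounted objects that could look something like this (a
sketch; alignment, chunk reclamation and malloc error handling are omitted):

```cpp
#include <cstddef>
#include <cstdlib>

// Thread-local bump allocator with a free list for fixed-size
// blocks; assumes n >= sizeof(FreeNode) and n <= kChunkSize.
struct FreeNode { FreeNode* next; };

struct BumpArena {
    static constexpr std::size_t kChunkSize = 1 << 20;  // 1 MiB
    char* cur = nullptr;
    char* end = nullptr;
    FreeNode* free_list = nullptr;

    void* alloc(std::size_t n) {
        if (free_list) {                     // reuse a freed block
            void* p = free_list;
            free_list = free_list->next;
            return p;
        }
        if (cur == nullptr || n > static_cast<std::size_t>(end - cur)) {
            cur = static_cast<char*>(std::malloc(kChunkSize));
            end = cur + kChunkSize;
        }
        void* p = cur;                       // bump the pointer:
        cur += n;                            // no free-span search
        return p;
    }

    void dealloc(void* p) {                  // push, never search
        FreeNode* node = static_cast<FreeNode*>(p);
        node->next = free_list;
        free_list = node;
    }
};

thread_local BumpArena arena;                // per thread: no locks
```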

At the limit you can have a fully generational GC behind it and use the
refcount only to decide when to run destructors, or as an optimization.

------
nwalfield
The author unfortunately doesn't understand how GC works:

> GC’s overhead relates to the total memory allocated, not just the memory
> actively used in the code. Each new allocation results in an increased time
> per scanning cycle. Certainly there are many optimizations involved in a
> good GC to limit this, but the fundamental relationship is still linear. As
> a program uses more memory the overhead of its GC increases.

A GC's overhead (at least for a mark-and-sweep collector) is proportional to
the number of live objects. This is because all of the dead objects are
unreachable and thus aren't scanned.
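
A toy sketch of that cost model:

```cpp
#include <vector>

// The mark phase does work only for reachable (live) objects; dead
// objects are never traced, only reclaimed.
struct Obj {
    bool marked = false;
    std::vector<Obj*> refs;            // outgoing pointers
};

void mark(Obj* o) {
    if (o == nullptr || o->marked) return;
    o->marked = true;                  // cost paid per *live* object
    for (Obj* r : o->refs) mark(r);
}

void sweep(std::vector<Obj*>& heap) {
    std::vector<Obj*> survivors;
    for (Obj* o : heap) {
        if (o->marked) { o->marked = false; survivors.push_back(o); }
        else delete o;                 // dead objects freed untraced
    }
    heap.swap(survivors);
}
```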

~~~
mortoray
I'm not sure how this disputes what I say about a scanning garbage collector.
The number of live objects is proportional to the amount of total memory
allocated.

I don't consider dead objects, where the memory is no longer used and free to
be allocated, to be part of the allocated memory.

~~~
titzer
> The number of live objects is proportional to the amount of total memory
> allocated.

No, that's not right. The number (or ratio) of live objects depends on
application behavior. According to the weak generational hypothesis, most
objects die young, so a garbage collector that only touches live objects
(e.g. a tracing collector) does no work at all for most objects.

The actual survival rate is complex and depends on application behavior, but
typical well-tuned GC systems often see survival rates from young-generation
collections around 10%: that means that only 10% of objects ever incur any GC
overhead. Those 90% of quickly dying objects do, however, take up memory and
cache space until they are reclaimed. IMO that is the only remaining
situation where manual memory management can outperform GC by wide margins:
when a program can carefully reuse recently allocated memory to keep reusing
the same cache lines.

------
oelang
All talk & no benchmarks. If reference counting with cycle detection were
faster, the JVM & CLR would use it.

~~~
mortoray
In my conclusion I mention why there are no benchmarks: it's nearly
impossible to construct a realistic comparison between the two techniques.

When I consider tracing GC, my primary arguments against it have nothing to
do with performance at all, yet performance often comes up as an argument
against my position. This article attempts to address that critique by saying
I doubt you could compare the two directly.

~~~
tom_mellior
> In my conclusion I mention why there are no benchmarks, it's nearly
> impossible to construct a realistic comparison between the two techniques.

You "mention" this as if it were fact, yes. Apparently you are not aware of
realistic comparisons that do exist, such as
[http://www.cs.utexas.edu/users/mckinley/papers/mmtk-
sigmetri...](http://www.cs.utexas.edu/users/mckinley/papers/mmtk-
sigmetrics-2004.pdf) (Alternatively, the article needs a discussion of why you
think this is not a "realistic comparison".)

------
albeva
There are other differences to consider: such as deterministic destruction
with reference counting.

------
vardump
I wish people would realize they're just some of the different tools in the
memory/resource management toolbox.

Neither is the best option in every case.

We should just use whatever fits better for the purpose.

------
sklogic
What?

Of course reference counting is slower, especially if you take loop handling
into account. Imagine a degenerate scenario: using up all of your memory to
allocate a single linked list, and then dropping it all at once.

Another, much more common scenario: allocating huge numbers of very
short-lived objects. A generational GC handles this with nearly zero
overhead, while reference counting plus a heap allocation sucks on an epic
scale.

Also, do not forget about the compaction and cache locality implications.

~~~
qznc
> Another, much more common scenario: allocating huge numbers of very
> short-lived objects

This is true in Java. Other languages (C++, Rust, D, etc.) give the
programmer the means to avoid allocating so much garbage.

~~~
pjmlp
Yes, but Eiffel, D (on your list), Modula-3, Oberon(-2), Component Pascal, C#,
Go are all examples of GC languages with finer control over static vs dynamic
allocation.

Something that most GC vs RC discussions seem to keep forgetting about.

In Modula-3 I can even make use of manual memory management in unsafe modules,
if really required. And control storage layouts even in safe code.

