
Golang’s Real-Time GC in Theory and Practice - signa11
https://blog.pusher.com/golangs-real-time-gc-in-theory-and-practice/
======
bjourne
The weak generational hypothesis states that most objects are short-lived.
Therefore concentrating on collecting the youngest objects yields the most
garbage reclaimed per CPU cycle spent.

In the benchmark, all objects survive the same amount of time and it is the
oldest objects that become garbage first. That is the opposite of the behavior
the weak generational hypothesis presumes, so collectors optimized for that
scenario are at a disadvantage. :)

For example, I ported the benchmark to Python 3 and got a max pause time of
0.05 ms. That's 40 times faster than OCaml but it doesn't mean Python is
faster in general. It just "got lucky" because ref counting is an almost
optimal fit for the benchmark. There is no strategy that fits all when it
comes to memory management. :)
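The access pattern being described can be sketched in Go, roughly along the lines of the article's benchmark (the window size, message count, and message size here are assumptions for illustration, not the article's exact figures):

```go
package main

import (
	"fmt"
	"time"
)

// worstPause maintains a fixed-size window of buffers. Overwriting a slot
// always kills the *oldest* live allocation first -- the inverse of what
// generational collectors are tuned for.
func worstPause(windowSize, msgCount, msgSize int) time.Duration {
	window := make([][]byte, windowSize)
	var worst time.Duration
	for i := 0; i < msgCount; i++ {
		start := time.Now()
		// Replace the oldest buffer with a fresh allocation.
		window[i%windowSize] = make([]byte, msgSize)
		if d := time.Since(start); d > worst {
			worst = d
		}
	}
	return worst
}

func main() {
	fmt.Println("worst-case pause:", worstPause(200000, 1000000, 1024))
}
```

Because every slot holds a live buffer until it is overwritten, the heap stays large and the garbage is always the oldest data, which is why a generational collector gains nothing here while a refcounting scheme frees each buffer at a precise, bounded moment.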

~~~
iainmerrick
Reference counting is great. I used to be skeptical but I'm a complete convert
after using Objective-C.

The old myth about GC was that it was slow. Nowadays that's usually false, but
what _is_ true is that it wastes a lot of memory, or has long pauses, or both.

The old myth about reference counting, on the other hand, is that uncollected
cycles are a problem, but in practice they're usually not that hard to find
and fix.

Reference counting _is_ a bit slower, but it doesn't waste memory and doesn't
pause. So it's great for latency. And latency is very often more important
than overall throughput!

For anything with a UI, you want minimal latency. For anything on the server
side that's realtime, or even indirectly user-facing, you want minimal
latency. The only time you want throughput over latency is batch jobs, which I
reckon crop up less often than you'd expect.

(Edit to add: on memory-constrained devices, you might not want the memory
overhead of GC even for batch jobs.)

~~~
titzer
Reference counting does not scale to multi-threaded programs well. Keeping
reference counts up to date across multiple threads is maddeningly and
frighteningly hard to do efficiently. By the time you correctly solve the
concurrent update problem for essentially every access the program does, you
have a very complicated memory management system that incurs overhead on every
single operation. Compare that with a GC that imposes overhead (usually) only
on writes to objects, not "writes" to the stack or local variables going out
of scope. Reference counting also incurs overhead for reading from objects.
Empirical studies have shown that reads outnumber writes 10x to 1 usually. Why
would you consider it to be just a _bit_ slower?

And of course reference counting pauses. Killing one object (its count
dropping to zero) may cause a chain reaction of objects whose reference counts
drop to zero and which then need to be cleaned up. The difference is that you now
suffer work proportional to the _dead_ data of the program. In comparison, a
tracing GC does work proportional to the _live_ objects.

~~~
staticassertion
> Reference counting does not scale to multi-threaded programs well. Keeping
> reference counts up to date across multiple threads is maddeningly and
> frighteningly hard to do efficiently.

Is this not just a matter of using an atomic reference count?

~~~
iainmerrick
Yes, but atomics are pretty slow.

~~~
staticassertion
I guess, but that doesn't seem that bad.

> maddeningly and frighteningly hard to do efficiently.

It definitely doesn't seem maddening or frightening. I'm not saying a GC isn't
strictly better (because I'm unsure either way), but atomic reference counting
doesn't seem like some awful thing.
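The atomic counter under discussion can be sketched in Go (the `Ref` type here is purely illustrative -- Go is garbage-collected and doesn't actually need this; each `Retain`/`Release` is the atomic read-modify-write whose per-access cost the parent comment is pointing at):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Ref is a hypothetical refcounted box, shown only to illustrate
// thread-safe reference counting with atomics.
type Ref struct {
	count int64
	data  []byte
}

func NewRef(data []byte) *Ref { return &Ref{count: 1, data: data} }

// Retain increments the count; safe from any goroutine.
func (r *Ref) Retain() { atomic.AddInt64(&r.count, 1) }

// Release decrements the count and reports whether this call dropped
// the last reference (the point where a non-GC runtime would free it).
func (r *Ref) Release() bool {
	return atomic.AddInt64(&r.count, -1) == 0
}

func main() {
	r := NewRef(make([]byte, 16))
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		r.Retain() // one reference per goroutine
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.Release() // each goroutine drops its own reference
		}()
	}
	wg.Wait()
	fmt.Println("last reference:", r.Release()) // prints "last reference: true"
}
```

The counter itself is simple; the hard part the thread is debating lives outside this sketch: every pointer copy and destruction in the whole program must be paired with one of these atomic operations, and contended atomic writes to hot objects serialize across cores.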

------
tomcam
Good writeup, with a surprise about OCaml. The authors welcome PRs for other
languages. I look forward to seeing the results, and wonder if they will end
up in a Techempower-style global brouhaha (which the Techempower folks have, I
hasten to say, apparently handled with grace and integrity).

~~~
willsewell
There are already versions in:

* C - [https://gitlab.com/gasche/gc-latency-experiment/merge_reques...](https://gitlab.com/gasche/gc-latency-experiment/merge_requests/5)

* C# - [https://gitlab.com/gasche/gc-latency-experiment/merge_reques...](https://gitlab.com/gasche/gc-latency-experiment/merge_requests/4)

* Nim - [http://forum.nim-lang.org/t/2646](http://forum.nim-lang.org/t/2646)

------
brian-armstrong
If you have long-lived functionality in your Go program that needs buffers,
it may just be best to allocate a pool of buffers and reuse them. Go does
offer a basic primitive for this with sync.Pool.

I once wrote a high throughput UDP listener that maxed out all the cores to
deal with receiving some bursty UDP traffic. I realized that interacting with
the memory allocator was my bottleneck, so I just moved to a pool of UDP
buffers. My ability to deal with load spikes improved substantially. There's
no reason not to use Pools in general if you're building a service with a
predictable and repeatable object lifetime.
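The approach described above can be sketched with sync.Pool like this (the buffer size and handler shape are assumptions, not the commenter's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// A pool of reusable packet buffers. 64 KiB is an assumed size,
// chosen as the maximum UDP payload ballpark.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 65536) },
}

// handlePacket borrows a buffer from the pool instead of allocating a
// fresh one per packet, so a burst of traffic doesn't become a burst
// of garbage for the collector.
func handlePacket(n int) {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf) // return the buffer for reuse
	// ... read up to n bytes into buf[:n] and process them ...
	_ = buf[:n]
}

func main() {
	for i := 0; i < 1000; i++ {
		handlePacket(512)
	}
	fmt.Println("done")
}
```

One caveat worth knowing: sync.Pool may drop pooled objects at any GC cycle, so it smooths allocation pressure rather than guaranteeing reuse; for a hard guarantee you'd use a buffered channel of buffers instead.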

------
wfleming
I'm a little surprised it didn't seem that they considered a GC-less language
like Rust. If your latency requirements are that strict & you're already
considering a total rewrite, I would personally have been looking into a
solution that would avoid GC entirely.

~~~
jakevn
Rust has more semantic overhead and isn't quite as mature as Go. Their
requirements don't necessitate eliminating GC pauses entirely.

~~~
echelon
I disagree with your first point, at least from a personal perspective. I find
Rust more expressive than Go, and slightly more productive.

You're right about maturity though. Hopefully the Rust library ecosystem
continues to make strides.

~~~
wfleming
I agree: I personally find Rust more expressive than Go. I don't think it has
any more "semantic overhead" than any other language with a highly featured
type system.

To tie it back to the article a bit, Rust also seems like a relatively natural
choice for a team considering switching from Haskell if their reason for
switching is performance. The niche Rust fills for me personally is that it
has a very robust type system (putting it in a similar camp as Haskell) with
an explicit focus on fast, performant code (which Haskell does not have).

------
mathw
A great writeup and I feel it's a welcome reminder of some important points:

* your programming language and runtime aren't equally good at everything and never will be

* you have to choose your environment based on your anticipated workload

* you have to understand what your anticipated workload actually is

* you can't just treat the garbage collector as a magic box that does the memory thing for you

Unless, of course, you're writing software which is nowhere near the
performance boundaries of the system and never will be.

Can't say I share the author's surprise about the JVM's performance, but I
never did like Java anyway.

~~~
jamesfisher
We were surprised by Java's poor performance because the HotSpot JVM also uses
a concurrent mark-and-sweep collector, and is famed for the amount of
engineering effort that has gone into it. We suspect our benchmark could be
improved by someone with more high-performance JVM experience.

[Edit] jcipar on Reddit made some promising progress by tuning the JVM GC
parameters for our benchmark:
[https://www.reddit.com/r/programming/comments/5fyhjb/golangs...](https://www.reddit.com/r/programming/comments/5fyhjb/golangs_realtime_gc_in_theory_and_practice_pusher/daof47v/)

~~~
sandGorgon
This was a super insightful comment

_The Java GC does not optimise for low latency exclusively, it tries to
strike a balance between many different factors which your analysis entirely
ignores! In particular it is compacting (Go's GC is not), which is very
useful for long term stability as you can't get deaths due to heap
fragmentation, and it tries to collect large amounts of garbage quickly, which
Go's GC doesn't try to do (it's pure mark/sweep, not generational), and it
tries to be configurable, which Go's GC doesn't care about. For many apps you
do tend to want these things, they aren't pointless._

_This is especially true because for most servers a 100msec pause time is just
fine. 8msec is the kind of thing you need if you're doing a 60fps video game
but for most servers it's overkill and they'd prefer to bank the performance._

_To put this in perspective Google is having problems migrating to G1 because
even though it gives lower and more predictable pause latencies, it slows
their servers down by 10%. A 10% throughput loss is unacceptable at their
scale (translates to a 10% increase in java server costs) and they want G1 to
become even more configurable to let them pick faster execution at the cost of
longer pauses._

~~~
ngrilly
> This is especially true because for most servers a 100msec pause time is
> just fine.

I disagree (a bit). A 100 msec pause is perceptible by a human, and can lead a
user to leave your website or your app, especially if the pauses are compounded
across several serialized remote procedure calls.

