Golang: sub-millisecond GC pause on production 18gb heap (twitter.com)
119 points by eis on Dec 3, 2016 | 30 comments

Would be interesting to have a raw comparison to the JVM, e.g. to the newer G1 collector, which can also handle >10GB heaps with low pauses. I know from experience (with the JVM) that you can get nice pauses on average, but what about the maximum values?

I've spent a lot of the past six years having nightmares about the unpredictability of CMS and, to a lesser extent, G1. So it's been with extreme skepticism that I've received the Go team's claims about how the Go GC performs.

Maximum STW of <1ms? Uhh, ok. The best-of-breed open Java GCs have targets of <100ms on big heaps; a 1ms ceiling sounded insane.

However, diving into it, the problems solved by the Go GC and those solved by the Java GC are very different in at least two major ways:

- Go has substantially less dynamic runtime behavior; there's no JIT, no runtime loading of new code, very limited reflection, &c.

- The Go GC does not do heap compaction

Both of these things mean: a) Objects do not randomly move about the heap and b) code does not randomly redefine itself.

Which is to say: the Go team have made their bed, just like the Java team have, and the Go bedroom looks a heck of a lot more comfy to sleep in. Building a fast GC for Go is a massively simpler problem than building a GC for Java, so the worst-case claims they are making are not completely off the wall.

Now, that doesn't mean they should be taken at face value. Notably, the Go GC still has to decide when to start a cycle, and still needs to stop the application if it runs out of heap due to starting a cycle too late. I would suspect that you could construct a test case with a very large heap and a randomly and extremely varying allocation rate that could cause the Go runtime to explode just fine.

Likewise, the time to complete a GC cycle is (at best) linear in the heap size: as the heap grows, the time to reclaim memory does as well, meaning you need to keep larger margins in the heap for floating garbage waiting to be reclaimed.

You've got some great points, but there's one more dimension to why GC in Java hurts performance as much as it does: Java, by the design of its standard libraries and by the conventions that many (most?) Java developers use, just tends to allocate memory for every single operation.

I was writing the Java part of a cross-platform game engine once, and I was trying to eliminate all dynamic allocations during the game (front-loading all allocations and then never touching the heap). Did some profiling and determined I'd missed something.

The allocation turned out to be in a call to get the current system timer (in milliseconds, if I recall). I was calling a Java function that returned a Java long int, probably straight from an OS call that returned a 64-bit int, and somewhere in that call it was allocating an Object.

Ripped out the Java library call, replaced it with a JNI call to a two line C function that returned the result of the corresponding OS call. No more dynamic allocations.

It's just part of the philosophy of Java to allocate things everywhere. So it's not only hard to make a GC fast for architectural reasons; the GC also tends to be worked harder by the way Java code tends to be written.

Excellent points. One of the emotional appeals of Go, to me, is the community behavior and mindset. Go code very often "feels" like a low-level language, written by someone who was mindful of the underlying machine.

You're also hitting another nerve: The horrific world of allocation tracking in Java. While your case sounds like a legitimate thing, so many times I've been in that situation and been tricked by my tools to optimize the wrong path.

Most popular allocation tracking tools, like YourKit, turn off escape analysis. Obviously, when you do that, stupid things like boxed primitives will show up as your main sources of heap allocation, even though HotSpot would stick those suckers on the stack in a heartbeat.

Eg: Here's a program running without YourKit allocation tracking: https://twitter.com/jakewins/status/707759792547233792

And here's the same program with it on: https://twitter.com/jakewins/status/707754868497231872

Still makes me furious: what in the world am I supposed to do with an allocation tracking tool that creates millions of false-positive allocations, drowning out the thing I care about? Gah.

Which could actually have been fixed as far back as version 1.0, had the language designers added support for value types, instead of trying to retrofit them in Java 10.

I like the language, but it took a bit of a left turn by not adopting the features of Oberon(-2), Modula-3, Component Pascal or Eiffel with regards to value types and native AOT compilation.

Azul's C4 collector can handle 2TB heaps with "pauseless" (<50 microsecond) operation:


I know, and I really like those achievements! Also, there are other JVM GCs usable for RT applications. But I would prefer a comparison between two open source systems, i.e. Go and OpenJDK 8/9 :)

At what cost on happy path code? Azul has some other problems with the JIT that make happy path code slower in exchange for better GCs.

> Would be interesting to have a raw comparison to the JVM e.g. to the new GC1

And with Shenandoah [1] too!

Also, everyone talks about heap sizes, which is really just the size the problem grows to if you let it accumulate. I would be much more interested in having these latencies measured in the context of sustained allocation throughput, with varying object lifetimes. How much can the system take while staying afloat? Here is the most performant GC: never allow anything to be allocated.

Go's programming model leans heavily towards message passing. I guess this is some kind of tradeoff. I really hope it still makes a lot of new apps possible (multiplayer games with 40k concurrently interacting users, anyone?).

[1] http://openjdk.java.net/jeps/189

Game studios have been using GC enabled languages for their servers for quite a while now.

For example, Wooga has quite a few games using Erlang, and Halo used C#/Project Orleans, just to cite two examples.

Oh, definitely. And Whatsapp is able to maintain 2.8M connections, > 200k msgs/sec on one machine with Erlang [1].

I meant not in an embarrassingly parallel workload. The 40k would be interacting with each other in realtime, and not in groups maxed at 10-20. Each of the 40k being impacted by the other 40k's actions, with an all-reduce step between them at each tick. Each of them getting a different overview of the other 40k.

Imagine a FPS where users receive updates on others' position 60 times per second if they are close, 1 time per second if they are far away, and none if the player is not looking at them. This could enable vastly larger games.

The question at hand would be: Go's and Erlang's GCs enable great latencies, but can their programming models still support this workload easily? What degree of shared mutable state do we need?

[1] http://highscalability.com/blog/2014/2/26/the-whatsapp-archi...

Here's one comparison with G1:


where G1 still had some reasonably large pauses, FWIW. Which is surprising: isn't its billing that it doesn't do that? But FWIW.

GC time is proportional to the number of objects tracked by the GC, not the size of the heap, although the two do tend to be related. Without knowing how many objects are being tracked, this is a useless metric.

Hehe I thought that as well when I read the headline.

"We have three root objects of 6GB apiece!"... ummm :P

I'm sure they must be talking about something other than that, but it did make me giggle a bit.

What the tweet mentioned is GC pause time, not total GC time.

Just started learning Go. How is that possible? Does this mean the GC tries to do its thing without a pause? Or is it not doing its thing? What is happening here?

Thanks for the link. Found the proposal a really good read.

Direct link : https://github.com/golang/proposal/blob/master/design/17503-...

Well, congratulations on the bragging rights?! That is orders of magnitude too long for actual real-time or low latency use. It sits squarely in a no-man's-land of applications where the worst case is still too much for games or sound, but all other uses didn't care very much about GC pauses in the first place.

That's great... but

1. An 18 gigabyte heap is orders of magnitude larger than a game's; presumably performance is better for smaller apps

2. I write web application servers and I very much care about GC latency

For gaming, assuming a target of 60fps, a 1ms GC pause still leaves about 15.7ms of the 16.7ms frame budget to render the next frame. That's 94% of the available time.

Or even better..

I've written 3D/game engines in languages with GC-based runtimes that run at 1500fps with no GC pauses.

How to do it? Easy... don't create garbage! So yes, it is doable. A PITA, but doable.

Certainly helps if you have a good generational GC.

That said, if you can force a compaction once a frame and it's pretty stable at a millisecond, I guess that's more than fine for a lot of stuff.

Exactly this. In Java, it's definitely possible to almost completely avoid garbage, and the GC is tunable enough that you could consistently have the GC run once per frame, which then makes it just a normal part of the frame timing if you have a general idea of what is being created. Object pooling is magical, yo. I imagine in Golang this would only get easier, as the pauses are super short and I feel there's more control over memory allocation (whether this is true or not is an open question).

Yes, definitely a lot of object pooling! :+1:

Use PODs to store data

...and use ints instead of references, indexing into tables of the objects. These are opaque to the GC, so it doesn't have to follow any pointers.


Oops, sorry: Plain Old Data structures.

Specifically, I mean data structures that contain only basic types and no reference types or pointers.


Another important trick is to use structures of arrays rather than arrays of structures. Well, it kind of depends on the use case, but you can get huge wins in memory access if you only need to scan one field.

Same tradeoff as Row vs Column Databases.

That isn't really how it works.

The catastrophic problem with GC for games isn't the specific amount of time it takes up, it's that it's a periodic task; it perhaps happens once every few seconds, or even less often. Which means that if you're going to have GC in your game, you're always going to be discarding some percentage of the available CPU time so that you can survive the occasional frame where the GC runs (or else you'll have frame rate hiccups every time it runs).

Nobody in the AAA industry wants to say "Okay, let's only use 90% of the hardware's capabilities, so that in those few frames in which the GC is going to run, it won't cause a frame rate hitch"; we instead engage in extraordinary (some might say 'heroic', others might say 'foolhardy') amounts of engineering effort to remove those periodic expenses, so we can move ever closer to consistent 100% usage of the available hardware capabilities.

Sometimes that means "we're going to build this in a more difficult language which doesn't support GC", sometimes that means "we're going to figure out how to build this in this GC-enabled language without doing anything that would accidentally invoke the GC", and sometimes that means "we're going to figure out how to provide our own GC implementation which operates progressively, only spending a budgeted amount of CPU time per frame".

Personally, I skew toward that first solution. But I'm old-fashioned that way. :)

(caveat: I haven't actually worked on a 'AAA' game in the current console generation, and the majority of my industry experience was during the PS2 era. Stuff might have changed in the last few years, though I imagine that the basic principles still hold.)

I'm just curious, but what are realistic goals for games and sound? I've never programmed professionally in those areas.

Games tend to target 60fps based on current displays, generally matching the vertical sync rate (some monitors are 144Hz, for example).

60fps is roughly 16ms per frame, for reference.

Audio is different; it's often about round-trip latency (e.g. from the time the sound is made to when it is heard back) and has a few additional issues to consider, e.g. phasing issues where one signal cancels out or amplifies another.

This is controlled by varying the buffer sizes, but issues can still arise due to processing latency. If the output buffer falls behind you'll hear pops as the speakers cut out.

A lot of DSP code is written to be lock-free, with all variables allocated at initialization so that no memory allocations actually occur during updates. There are a couple of great YouTube talks on the subject (look at the JUCE and CppCon channels).

Not everyone is doing AAA games, and thankfully so, because in 2016 many of those are just pretty graphics with no real content.

Also I remember the days when any serious game developer would use Assembly, C and Pascal were only used by indie developers.

Times change.
