Maximum STW of <1ms? Uhh, ok. The best-of-breed open Java GCs have targets of <100ms on big heaps, so a 1ms ceiling sounded insane.
However, diving into it, the problems solved by the Go GC and those solved by the Java GC are very different in at least two major ways:
- Go has substantially less dynamic runtime behavior; there's no JIT, no runtime loading of new code, very limited reflection, &c.
- The Go GC does not do heap compaction
Both of these things mean: a) Objects do not randomly move about the heap and b) code does not randomly redefine itself.
Which, the Go team have made their bed, just like the Java team have, and the Go bedroom looks a heck of a lot more comfy to sleep in. Building a fast GC for Go is a massively simpler problem than building a GC for Java, hence the claims they are making of worst-case scenarios are not completely off the wall.
Now, that doesn't mean they should be taken at face value. Notably, the Go GC still has to decide when to start a cycle, and still needs to stop the application if it runs out of heap because it started a cycle too late. I suspect you could construct a test case with a very large heap and a wildly varying allocation rate that would cause the Go runtime to explode just fine.
Likewise, the time to complete a GC cycle is (at best) linear in the heap size: as the heap grows, the time to reclaim memory grows as well, meaning you need to keep larger margins in the heap for floating garbage waiting to be reclaimed.
I was writing the Java part of a cross-platform game engine once, and I was trying to eliminate all dynamic allocations during the game (front-loading all allocations and then never touching the heap). Did some profiling and determined I'd missed something.
The allocation turned out to be in a call to get the current system timer (in milliseconds, if I recall). I was calling a Java function that returned a Java long int, probably straight from an OS call that returned a 64-bit int, and somewhere in that call it was allocating an Object.
Ripped out the Java library call and replaced it with a JNI call to a two-line C function that returned the result of the corresponding OS call. No more dynamic allocations.
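The anecdote doesn't say which library call was to blame, so as a hedged illustration here is a plausible modern analogue: `new Date().getTime()` allocates a wrapper object on every call, while `System.currentTimeMillis()` returns a primitive `long` with no heap traffic.

```java
import java.util.Date;

public class TimerAlloc {
    // Allocating path: creates a java.util.Date object just to read the clock.
    static long nowViaDate() {
        return new Date().getTime();
    }

    // Allocation-free path: returns a primitive long straight from the OS clock.
    static long nowDirect() {
        return System.currentTimeMillis();
    }

    public static void main(String[] args) {
        long a = nowViaDate();
        long b = nowDirect();
        // Both report the same wall clock, to within a few milliseconds.
        System.out.println(Math.abs(a - b) < 1000);
    }
}
```

This is only a sketch of the shape of the problem; the original offender may well have been some other wrapper in the time/date APIs of the era.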
It's just part of the philosophy of Java to allocate things everywhere. So it's not only hard to make a GC fast for architectural reasons; the GC also tends to be worked harder by the way Java code tends to be written.
You're also hitting another nerve: the horrific world of allocation tracking in Java. While your case sounds like a legitimate thing, many times I've been in that situation and been tricked by my tools into optimizing the wrong path.
Most popular allocation tracking tools, like YourKit, turn off escape analysis. Obviously, when you do that, stupid things like boxed primitives show up as your main sources of heap allocation, even though Hotspot would stick those suckers on the stack in a heartbeat.
Eg: Here's a program running without YourKit allocation tracking: https://twitter.com/jakewins/status/707759792547233792
And here's the same program with it on: https://twitter.com/jakewins/status/707754868497231872
Still makes me furious - what in the world am I supposed to do with an allocation tracking tool that creates millions of false positive allocations, drowning out the thing I care about? Gah.
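For concreteness, a minimal sketch of the kind of boxed-primitive noise being described: the boxed accumulator below conceptually creates a `java.lang.Long` on every iteration (HotSpot's escape analysis can often eliminate these, but profilers that disable it will report every one), while the primitive version never touches the heap.

```java
public class BoxingNoise {
    // Boxed accumulation: each iteration unboxes, adds, and re-boxes,
    // conceptually allocating a new java.lang.Long every time. With an
    // allocation profiler that disables escape analysis, all of these
    // show up as heap allocations.
    static long sumBoxed(int n) {
        Long total = 0L;
        for (int i = 0; i < n; i++) {
            total = total + i;
        }
        return total;
    }

    // Primitive accumulation: no heap traffic regardless of JIT settings.
    static long sumPrimitive(int n) {
        long total = 0L;
        for (int i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumBoxed(1000) == sumPrimitive(1000));
    }
}
```

Same result either way; the only difference is how much garbage the profiler claims you generated.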
I like the language, but it took a bit of a left turn by not adopting the features of Oberon(-2), Modula-3, Component Pascal, or Eiffel with regard to value types and native AOT compilation.
And with Shenandoah too!
Also, everyone talks about heap sizes, but heap size is really only the size of the problem if you let it accumulate. I would be much more interested in having these latencies measured in the context of sustained allocation throughput, with varying object lifetimes. How much can the system take while staying afloat? By that measure, here is the most performant GC: never allow anything to be allocated.
Go programming leans heavily towards message passing. I guess this is some kind of tradeoff. I really hope it still makes a lot of new apps possible (multiplayer games with 40k concurrently interacting users, anyone?).
Wooga has quite a few games using Erlang, and Halo used C#/Project Orleans, to cite two examples.
I meant not an embarrassingly parallel workload. The 40k would be interacting with each other in realtime, not in groups capped at 10-20: each of the 40k impacted by the other 40k's actions, with an all-reduce step between them at each tick, and each of them getting a different overview of the other 40k.
Imagine a FPS where users receive updates on others' position 60 times per second if they are close, 1 time per second if they are far away, and none if the player is not looking at them. This could enable vastly larger games.
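That tiering could be sketched as a simple rate function; the 50-unit distance threshold and the specific rates below are invented for illustration, not taken from any real engine:

```java
public class InterestManagement {
    // Hypothetical tiering: how many position updates per second to send
    // for another player, based on distance and whether they are on screen.
    static int updatesPerSecond(double distance, boolean visible) {
        if (!visible) return 0;          // player isn't looking at them: none
        if (distance < 50.0) return 60;  // close: full tick rate
        return 1;                        // far away: a trickle
    }

    public static void main(String[] args) {
        System.out.println(updatesPerSecond(10.0, true));   // close and visible
        System.out.println(updatesPerSecond(500.0, true));  // far but visible
        System.out.println(updatesPerSecond(10.0, false));  // close but unseen
    }
}
```

The interesting engineering is in computing `visible` and `distance` for 40k pairs cheaply; the rate table itself is the easy part.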
The question at hand would be: Go's and Erlang's GCs enable great latencies, but can their programming models still support this workload easily? What degree of shared mutable state do we need?
FWIW, G1 still had some reasonably large pauses there. Which is surprising: isn't its billing that it avoids exactly that?
"We have three root objects of 6GB a piece!"... ummm :P
I'm sure they must be talking about something other than that, but it did make me giggle a bit.
Direct link:
1. An 18 gigabyte heap puts it orders of magnitude larger than a game; presumably performance is better for smaller apps
2. I write web application servers and I very much care about GC latency
I've written 3D/game engines in languages with GC-based runtimes that run at 1500fps with no GC pauses.
How to do it? Easy... don't create garbage! So yes, it is doable. PITA but doable
Certainly helps if you have a good generational GC.
That said, if you can force a compaction once a frame and it's pretty stable at a millisecond, I guess that's more than fine for a lot of stuff.
Use PODs to store data
...and use ints instead of references. Those index into tables of the objects, and will be opaque to the GC, so it doesn't have to follow any pointers.
Specifically I mean data structures that contain only basic types and no reference types or pointers.
Another important trick is to use structures of arrays rather than arrays of structures. That depends on the use case, but you can get huge wins in memory access if you only need to scan one field.
Same tradeoff as Row vs Column Databases.
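Putting the int-handle and structure-of-arrays tricks together, a minimal sketch (the particle-system framing, names, and sizes are invented for illustration):

```java
public class ParticlePool {
    // Structure-of-arrays particle store: only primitive arrays, so the
    // GC sees three flat float arrays instead of thousands of small
    // objects, and there are no references for it to trace.
    static final int MAX = 10_000;
    static final float[] posX = new float[MAX];
    static final float[] posY = new float[MAX];
    static final float[] vel  = new float[MAX];
    static int count = 0;

    // "Allocation" is just handing out an int handle into the tables;
    // no object is created, so the GC never runs on our account.
    static int spawn(float x, float y, float v) {
        int handle = count++;
        posX[handle] = x;
        posY[handle] = y;
        vel[handle]  = v;
        return handle;
    }

    // Scanning a single field touches one contiguous array; this is the
    // memory-access win over an array of Particle objects.
    static void stepX(float dt) {
        for (int i = 0; i < count; i++) {
            posX[i] += vel[i] * dt;
        }
    }

    public static void main(String[] args) {
        int p = spawn(0f, 0f, 2f);
        stepX(0.5f);
        System.out.println(posX[p]);
    }
}
```

Exactly the row-store vs. column-store tradeoff mentioned above: group by field when you scan by field.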
The catastrophic problem with GC for games isn't the specific amount of time it takes up; it's that it's a periodic task that perhaps happens once every few seconds, or even less often. Which means that if you're going to have GC in your game, then you're always going to be discarding some percentage of the available CPU time so that you can survive the occasional frame where the GC runs (or else you're getting frame-rate hiccups every time it runs).
Nobody in the AAA industry wants to say "Okay, let's only use 90% of the hardware's capabilities, so that those few frames in which the GC is going to run, it won't cause a frame rate hitch"; we instead engage in extraordinary (some might say 'heroic', others might say 'foolhardy') amounts of engineering effort to remove those periodic expenses, so we can move ever-closer to consistent 100% usage of the available hardware capabilities.
Sometimes that means "we're going to build this in a more difficult language which doesn't support GC", sometimes that means "we're going to figure out how to build this in this GC-enabled language without doing anything that would accidentally invoke the GC", and sometimes that means "we're going to figure out how to provide our own GC implementation which operates progressively, only spending a budgeted amount of CPU time per frame".
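The third option can be sketched roughly like this: a hypothetical deferred-cleanup queue drained under an explicit per-frame time budget (an illustration of the idea, not any engine's actual implementation):

```java
import java.util.ArrayDeque;

public class BudgetedCleanup {
    // Dead objects aren't freed where they die; their cleanup actions
    // are queued and paid for incrementally, a slice per frame.
    static final ArrayDeque<Runnable> deadObjects = new ArrayDeque<>();

    static void release(Runnable cleanup) {
        deadObjects.add(cleanup);
    }

    // Called once per frame: run cleanups only while budget remains.
    static int drain(long budgetNanos) {
        long start = System.nanoTime();
        int freed = 0;
        while (!deadObjects.isEmpty()
                && System.nanoTime() - start < budgetNanos) {
            deadObjects.poll().run();
            freed++;
        }
        return freed; // leftovers wait for the next frame
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) release(() -> {});
        int freed = drain(1_000_000); // 1ms budget this frame
        // Everything freed is accounted for; nothing is lost, only deferred.
        System.out.println(freed <= 100 && freed + deadObjects.size() == 100);
    }
}
```

The appeal is that the cost becomes a constant line item in the frame budget instead of an occasional spike; the price is that reclamation can lag arbitrarily far behind under load.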
Personally, I skew toward that first solution. But I'm old-fashioned that way. :)
(caveat: I haven't actually worked on a 'AAA' game in the current console generation, and the majority of my industry experience was during the PS2 era. Stuff might have changed in the last few years, though I imagine that the basic principles still hold.)
60fps is roughly 16ms per frame, for reference.
Audio is different, it's often about round trip latency (eg. from the time the sound is made, to when it is heard back) and has a few additional issues to consider.
Eg. there can be phasing issues where one signal cancels out or amplifies another.
This is controlled by varying the buffer sizes, but issues can still arise due to processing latency.
If the output buffer falls behind you'll hear pops as the speakers cut out.
A lot of DSP code is written to be lock-free, with all variables allocated at initialization so no memory allocations actually occur during updates. There are a couple of great YouTube talks on the subject (look at the JUCE and CppCon channels).
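A minimal sketch of that preallocate-at-init pattern (a toy delay-line mixer invented for illustration; not taken from any of those talks):

```java
public class DspBlock {
    // All state is allocated once in the constructor; process() then
    // works in place on a caller-owned buffer, so the audio thread
    // never allocates and never gives the GC a reason to run.
    private final float[] delayLine;
    private int writeIndex = 0;

    DspBlock(int delaySamples) {
        this.delayLine = new float[delaySamples]; // the only allocation
    }

    // Runs on the audio thread: no new, no boxing, no locks.
    void process(float[] buffer) {
        for (int i = 0; i < buffer.length; i++) {
            float delayed = delayLine[writeIndex];
            delayLine[writeIndex] = buffer[i];
            buffer[i] = 0.5f * (buffer[i] + delayed); // mix dry + delayed
            writeIndex = (writeIndex + 1) % delayLine.length;
        }
    }

    public static void main(String[] args) {
        DspBlock block = new DspBlock(4);
        float[] buf = new float[]{1f, 0f, 0f, 0f};
        block.process(buf);
        // First sample mixes the dry input with an initially silent delay line.
        System.out.println(buf[0]);
    }
}
```

The same discipline (own your buffers, touch nothing shared, allocate nothing) is what keeps the output buffer from falling behind and popping.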
Also, I remember the days when any serious game developer would use assembly; C and Pascal were only used by indie developers.