

Building a lower-latency GC - yminsky
https://blogs.janestreet.com/building-a-lower-latency-gc/

======
jordwalke
Sounds like amazing work is happening here.

For real-time user applications such as games, newsfeeds, and mobile apps
that have visual/audio feedback, I'm certain that the most important
feature is the ability for applications themselves to control the process
of collecting and the time constraints to operate within. A collector can
be _less_ efficient yet be perceived as more responsive, as long as the
wasteful or less efficient work is done at times when it does not matter -
and that tradeoff would be gladly welcomed.

So much of UI application time is idle. Even during fluid animations, the
majority of the frame time goes wasted. But then something like Objective-C's
ARC frees large trees of memory, or perhaps Java's GC collects, and causes
your frame time to exceed 16ms. For UI applications, there are periods of time
(during animations) where you must hit 16ms deadlines, and the difference
between 1ms and 15ms is nothing, but the difference between 16ms and 17ms is
_everything_. UI application animations are like train schedules. If you miss
one train by only a microsecond, you still have to wait the entire time until
the next train arrives. You don't get points for _almost_ making the train.
Furthermore, _only_ the application can know the train schedule. It isn't
something the language runtime can possibly know, so to do this right, the
language must anticipate this and accept input from the application
frameworks.

Then there are other times when we are not performing an animation and we know
that we could block a thread for as much as 50ms, without having any perceived
delay experienced by the user. The latency constraints for _starting_ a
continuous interaction are larger than the constraints for _continuing_ a
continuous interaction. So in this case, our application still knows the train
schedule, it's just a different train schedule that allows for more time to
kill. If applications could tell the GC about this, it might decide that it's
a good time to perform a major collection.
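
OCaml's stdlib Gc module already exposes hooks of roughly this shape:
Gc.major_slice does a bounded slice of major-heap work, and Gc.full_major
runs a complete cycle. A minimal sketch of a frame loop that feeds the GC
its "train schedule", assuming a hypothetical render function and a
made-up 5 ms slack threshold:

    (* Spend leftover frame time on bounded GC work, and save full
       major collections for long idle windows. Requires the unix
       library for [Unix.gettimeofday]. *)
    let frame_budget = 1.0 /. 60.0  (* ~16.7 ms per frame at 60 Hz *)

    let frame_loop render =
      while true do
        let start = Unix.gettimeofday () in
        render ();
        let elapsed = Unix.gettimeofday () -. start in
        (* Frame finished early: do a bounded slice of major-heap
           work now; the argument 0 lets the runtime pick the size. *)
        if frame_budget -. elapsed > 0.005 then
          ignore (Gc.major_slice 0)
      done

    (* Between interactions, where even a ~50 ms pause goes unnoticed,
       the application can ask for a full major collection outright. *)
    let on_idle () = Gc.full_major ()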

I've found that many of the things people consider to be performance
problems in UI applications aren't problems of efficiency; they're
problems of scheduling.

~~~
chongli
_the difference between 1ms and 15ms is nothing, but the difference between
16ms and 17ms is everything. UI application animations are like train
schedules. If you miss one train by only a microsecond, you still have to wait
the entire time until the next train arrives._

This is why we need a better display protocol. Why are we still treating our
displays like synchronous, scanline CRTs? The display should have a simple
buffer into which we can do random reads/writes at any time and the affected
pixels will be updated immediately. This would solve all of these problems
(and eliminate video tearing)!

~~~
throwawaymylife
Yes, this would be much better than the status quo, and hopefully
technologies like G-Sync will bring something like it to us soon. But this
approach will still result in microstutter from frames being displayed at
somewhat random times, which may or may not be perceptible (I've never
used such a display). Obviously a bit of microstutter is preferable to
tearing or to missing frames entirely under heavy load, but really, a
synchronous refresh rate would be ideal, if only our process schedulers
were designed to give the foreground UI process strict priority, more
reliable vblank timing, and some degree of safety from preemption.

------
kazinator
> False promotions

> ... When you do hit the end, you do a minor collection, walking the minor
> heap to figure out what's still live, and promoting that set of objects to
> the major heap.

I somewhat deal with that issue here:

[http://www.kylheku.com/cgit/txr/tree/gc.c](http://www.kylheku.com/cgit/txr/tree/gc.c)

When make_obj exhausts the available FRESHOBJ_VEC_SIZE in the freshobj
array (nursery), it first tries to shake things out by doing a partial GC.
During its sweep phase, the partial GC rebuilds the freshobj array with
just those gen 0 objects that are reachable: the sweep marches through the
freshobj array while a second index tracks where live objects are copied
back into the array. At the end, that second index becomes the new
allocation pointer for the array.

Only if the array is still full after this do we then have to do a full gc,
and that's when any "false promotion" will happen.
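
The copy-back sweep described above amounts to an in-place compaction over
the nursery array. A minimal sketch in OCaml, with a hypothetical is_live
predicate standing in for the real reachability check:

    (* Walk the nursery with one index; copy each live object down to
       a second index; that second index becomes the new allocation
       pointer. [is_live] is a stand-in for the reachability check. *)
    let sweep_nursery (nursery : 'a array) (alloc_idx : int)
                      (is_live : 'a -> bool) : int =
      let live = ref 0 in
      for i = 0 to alloc_idx - 1 do
        if is_live nursery.(i) then begin
          nursery.(!live) <- nursery.(i);
          incr live
        end
      done;
      !live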

One problem with this approach is that this freshobj nursery becomes a
miniature memory arena in itself; as it gets close to full, these partial
garbage collections get more frequent. They are not _that_ cheap.

Hmm, it occurs to me I can address that with a one-liner:

    diff --git a/gc.c b/gc.c
    index 6832905..7a0ee1c 100644
    --- a/gc.c
    +++ b/gc.c
    @@ -173,7 +173,7 @@ val make_obj(void)
            malloc_delta >= opt_gc_delta)
        {
          gc();
    -     if (freshobj_idx >= FRESHOBJ_VEC_SIZE)
    +     if (freshobj_idx >= FRESHOBJ_VEC_SIZE / 2)
            full_gc = 1;
          prev_malloc_bytes = malloc_bytes;
        }

I.e. after doing an ephemeral GC, if at least half of the nursery space
isn't free, then set the flag for a full gc. (That will happen right away
if the nursery still has no room to satisfy the current request, or else
the next time gc is triggered.)

This has been a useful HN submission since it got me thinking about something,
leading to some findings in a different project. :)

------
bjourne
> typical advice in Java-land is to have a young generation in the 5-10 GiB
> range, whereas our minor heaps are measured in megabytes.

Where does that advice come from? From what I've seen, typical suggested
values for the young generation in the JVM and CLR are 10-20 MB. A young
generation measured in gigabytes would defeat the idea of dividing the
heap into generations in the first place.
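
For comparison, OCaml sizes its minor heap in words through the stdlib Gc
module; the default is 256k words, i.e. 2 MB on 64-bit. A minimal sketch -
the size below is illustrative, not a recommendation:

    (* Resize the minor heap via the stdlib Gc module. The default is
       256k words (2 MB on 64-bit); 8M words (64 MB on 64-bit) here is
       purely illustrative. *)
    let () =
      let ctrl = Gc.get () in
      Gc.set { ctrl with Gc.minor_heap_size = 8 * 1024 * 1024 }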

Also, unless you have profiled the GC and know exactly where in the
process most of the time is spent, you are fumbling in the dark.

~~~
yminsky
Here's an example of someone from LinkedIn describing a "low latency"
setup with a 6GiB heap.

[https://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications](https://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications)

I've heard similar things from folks at Twitter, IIRC. But I do find the whole
thing kind of mysterious, I have to admit. I'd love to learn that I was wrong.

~~~
bjourne
I don't think that person knows what he is talking about. Either you pick
high throughput or low latency; you don't get both. They got the
young-generation collection pause down to 60 ms, which is completely crap.
:) The GC for Factor, which I've been hacking on, has a young-generation
pause of 2-3 ms (depending on object topology, generation sizes and so
on). But the young generation is only 2 MB, so the number of GC pauses is
significantly higher.

~~~
chipsy
Although it's probably not at all what that author meant, there is a way
in which throughput and latency have a kind of fractally-layered
relationship. For example, if you have a distributed computation that
requires all parts to be finished before any result is returned, your
throughput depends on the latency of each part being consistently low. If
one part gets stuck, it becomes a huge bottleneck, and speeding up the
rest won't matter. And at the lowest levels of optimization a similar
concern arises: CPU throughput is ultimately about avoiding unnecessary
latency, such as cache misses.

For modern systems, latency seems to be increasingly important, as the fast
paths have all grown various asynchronous aspects. For some systems 60ms pause
might be fine, for others 2ms might be way too much. It's definitely a "what
scale are you thinking about" kind of thing.

~~~
bjourne
What the author (and I) meant by "low-latency" is a GC where the work is
fairly evenly distributed among the mutator's allocation requests. The GC
with the highest throughput just allocates memory and never frees it - but
then you sacrifice other properties, such as memory usage, which becomes
extremely high.

------
amelius
It would be extremely useful if GC solutions were built in a more
language-agnostic way. This would allow language designers to pick a GC
that suits their needs, instead of reinventing the wheel every time.

~~~
benjiweber
Perhaps we could have some sort of virtual machine, which could allow
different languages that target it to share garbage collection solutions? ¬_¬

~~~
amelius
I was thinking of a solution that is a little more orthogonal (modular).
The JVM, for example, is overdesigned for this purpose. One problem it has
is that it doesn't support proper tail-call optimization, which affects
most functional languages targeting it. The JVM's GC, however, would be
perfectly suitable for these languages.

------
edwintorok
Since you brought up Java, here is some information I could find about
real-time garbage collection:

* JamaicaVM's RTGC seems to be the closest to OCaml's GC, including the 'GC does more work when you allocate more' behaviour, although there the benefit is that threads that don't allocate don't have to run the GC, which doesn't apply to OCaml (I don't know about Multicore OCaml): [https://www.aicas.com/cms/sites/default/files/Garbage.pdf](https://www.aicas.com/cms/sites/default/files/Garbage.pdf) [http://www.aicas.com/papers/jtres07_siebert.pdf](http://www.aicas.com/papers/jtres07_siebert.pdf) It seems they provide a tool (static analysis?) to determine the worst-case allocation time for an application.

* FijiVM's approach: [http://janvitek.github.io/pubs/pldi10b.pdf](http://janvitek.github.io/pubs/pldi10b.pdf)

* on eliminating pauses during GC and compaction, although with quite invasive changes to the read barrier: [http://www.azulsystems.com/sites/default/files//images/wp_pgc_zing_v5.pdf](http://www.azulsystems.com/sites/default/files//images/wp_pgc_zing_v5.pdf) [http://www.azulsystems.com/sites/default/files/images/c4_paper_acm.pdf](http://www.azulsystems.com/sites/default/files/images/c4_paper_acm.pdf)

Once you have a GC with predictable behaviour, you also need a standard
library with predictable behaviour (which you have with Core_kernel,
right?). Javolution has an interesting approach here: they annotate the
library's public API with time constraints, and the compiler can give some
warnings:
[http://javolution.org/apidocs/javolution/lang/Realtime.html](http://javolution.org/apidocs/javolution/lang/Realtime.html)
[http://javolution.org/apidocs/javolution/lang/Realtime.Limit.html](http://javolution.org/apidocs/javolution/lang/Realtime.Limit.html)
[http://javolution.org/](http://javolution.org/) I don't know how this
works in practice; it'll probably just warn if you try to use a
LINEAR-time function when you claim your function should be CONSTANT time,
or if you put things inside loops. Not sure if it'd be worth trying out
this approach for Core_kernel, but with OCaml annotations and ppx it might
be possible to automatically infer and check the following (a hypothetical
sketch follows the list):

* annotate stack depth of function (none, tail-recursive, linear-recursive, tree-recursive)

* annotate worst-case time complexity: guaranteed-const (no loops, or only constant-length ones; no recursion; no allocation); const (like guaranteed-const, but with allocations); log (manual annotation); linear (recursion over a decreasing list, or a loop); quadratic (one nested loop); unknown

* I/O blocking behaviour (although you do this by hiding the blocking functions in Async.Std already)
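
A hypothetical sketch of what those annotations might look like as OCaml
attributes. Neither [@@stack] nor [@@complexity] exists; a ppx rewriter
would have to define them, infer the actual properties, and warn on
mismatches, roughly as Javolution's @Realtime does for Java:

    (* Hypothetical attributes: [@@stack] and [@@complexity] are made
       up for this sketch; attribute payloads are parsed but not
       typechecked, so this compiles even without a ppx to check it. *)
    let rec length = function
      | [] -> 0
      | _ :: tl -> 1 + length tl
      [@@stack Linear_recursive] [@@complexity Linear]

    let hd_exn = function
      | [] -> failwith "hd_exn"
      | x :: _ -> x
      [@@stack None] [@@complexity Guaranteed_constant]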

------
AceJohnny2
(for OCaml)

~~~
chc
What language would you naturally expect Jane Street to be creating a garbage
collector for?

~~~
AceJohnny2
While I totally knew what I was clicking through, I don't expect everyone to
know what Jane Street is famous for.

Know your audience.

------
jim_greco
Just don't GC :)

~~~
nulltype
I hear redis is the new heapness.

