
GC Tuning Confessions of a Performance Engineer - luu
http://www.slideshare.net/MonicaBeckwith/gc-confessions
======
banachtarski
Having worked with all sorts of GCs in the past, I basically stopped using
them altogether some years ago in favor of manual memory management. I have to
say, it's been liberating and so much easier to write performant code since I
know what memory I need instead of relying on the computer to guess (and
giving the computer hints as to how to guess). I have no doubt that GC
research and systems are making good forward progress, but I can't imagine
myself using them again.

~~~
pron
I love GCs. They give you better memory throughput (in exchange for more
footprint and higher latency, although latency can be made rather low), and
they let you build and use very scalable concurrent data structures. On large
machines with lots of cores and lots of RAM, they let you work with large, in-
memory data sets very efficiently.

~~~
banachtarski
You will never beat a tuned GC-free system with a GC-based one, because of all
the instructions necessary to traverse references and such. Having lots of
cores and lots of RAM just means you can eat the cost more easily. It just
means you can afford to let the VM do the work instead of the programmer.

Also, no matter how much RAM you have, cache sizes are more or less the same,
and cache line misses hurt.

~~~
pron
> You will never beat a tuned GC-free system with a GC-based one, because of
> all the instructions necessary to traverse references and such.

That's not at all how it works. The generational hypothesis means that most
objects die young. Allocating them is a simple, uncontended pointer bump in
the thread-local allocation buffer (as fast as stack allocation), and freeing
them is free, as they are never traversed. They are much, much (much) faster
than malloc/free. They do, however, increase the frequency of young-gen
collections, so they have an indirect cost.
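To make the pointer-bump point concrete, here is a toy bump allocator (the class and its layout are invented for illustration, not HotSpot's actual TLAB code): each allocation is one bounds check and one addition, with no free list and no locking.

```java
// Toy bump-pointer allocator illustrating why TLAB allocation is cheap:
// one bounds check, one addition, no free list, no synchronization.
public class BumpAllocator {
    private final byte[] buffer;
    private int top = 0; // next free offset; only ever moves forward

    public BumpAllocator(int capacity) {
        this.buffer = new byte[capacity];
    }

    // Returns the offset of the new block, or -1 if the buffer is exhausted
    // (in a real VM this would trigger getting a fresh TLAB, or a GC).
    public int allocate(int size) {
        if (top + size > buffer.length) return -1;
        int result = top;
        top += size;       // the "pointer bump"
        return result;
    }

    public static void main(String[] args) {
        BumpAllocator tlab = new BumpAllocator(64);
        System.out.println(tlab.allocate(16)); // 0
        System.out.println(tlab.allocate(16)); // 16
        System.out.println(tlab.allocate(64)); // -1: buffer full, retire it
    }
}
```

Note that "freeing is free" in this model: dead objects in the buffer are simply never looked at again; the whole region is reused after the survivors are evacuated.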

When it comes to throughput (i.e. total time the application spends doing
memory management), modern GCs handily beat malloc/free. What you can do,
however, with manual memory management is all sorts of arena allocations, but
then you have to be careful when sharing pointers (Rust helps with that). Then
there's the question of concurrent data structures, which are very, very hard
to do well without a GC.

> cache line misses hurt.

What does that have to do with GCs? If anything, copying GCs bring related
objects together, so the prefetcher can help. What affects cache misses (in
Java) is the lack of arrays-of-structs, which is scheduled to be fixed, with
the addition of value types, in Java 10.

~~~
vitalyd
I really dislike the comparisons of TLAB to stack allocation. Stack, by its
nature, is going to be hot in cache. TLAB, once filled up, will be retired and
possibly assigned to a different thread. But even if it's not assigned, it's
constantly moving forwards, and not revisiting the same space. You'd need
prefetch to be perfect, and then on top of that, you'd need to make sure that
by the time you go to allocate again, the prefetched cachelines have not been
evicted.

Languages/platforms with a GC should still use, support, and encourage stack
allocation for temporary memory -- this _is_ your TLAB!

There is a cost to traversing references; card marking and generational
collectors just reduce the number of references you need to visit, but that
doesn't mean reference chasing doesn't require extra instructions. Finally,
don't forget that card marking requires write barriers, which means extra
instructions (and possible cache misses) on each reference store (modulo
trivial ones, such as new allocations, where the JIT knows it's not required).
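For illustration, a card-marking barrier can be sketched roughly like this (a simplified, made-up model, not real VM code; HotSpot does use 512-byte cards, but everything else here is invented): every reference store dirties the card covering the written address, so a young collection only scans old-gen regions whose cards are dirty.

```java
// Simplified sketch of a card table with a write barrier: the heap is divided
// into 512-byte "cards", and each reference store marks its card dirty so a
// young GC need only scan dirty old-gen cards for old-to-young pointers.
public class CardTable {
    static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards
    final byte[] cards;

    CardTable(int heapBytes) {
        cards = new byte[(heapBytes >> CARD_SHIFT) + 1];
    }

    // The barrier executed alongside each reference store (obj.field = ref):
    // a shift, an index, and a byte store -- small, but paid on every store.
    void writeBarrier(long fieldAddress) {
        cards[(int) (fieldAddress >> CARD_SHIFT)] = 1; // dirty the card
    }

    boolean isDirty(long address) {
        return cards[(int) (address >> CARD_SHIFT)] == 1;
    }
}
```

A store at any address dirties the whole 512-byte card around it, which is why a young collection can skip the vast majority of the old generation.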

~~~
pron
While everything you say is 100% true, it is also a second-order effect, with
a much lower magnitude than the primary GC performance behavior for short-
lived objects.

Obviously stack allocation is preferable to TLAB allocation (and there's no
reason to allocate objects with stack scope on the heap), if only for the fact
that it never triggers a collection. Nevertheless, Java allocation/collection
of short-lived objects is much closer in cost to stack allocation than to
malloc/free.

~~~
vitalyd
I'm not sure memory locality effects can be considered secondary, unless their
effects are completely dwarfed by something else the app is doing (e.g.
there's no point in discussing this topic for i/o bound workloads).

I don't think malloc/free should enter this conversation because languages
that use malloc/free do so very infrequently, and for the cases where dynamic
memory needs to be allocated frequently, they use specialized memory managers
within the application. This is also subject to which allocator is used and
what the application's allocation pattern is. There are suboptimal GC
mechanics in some cases as well, such as the CMS tenured space using
free-chunk lists with no compaction: any young GC that requires promotion can
take longer, because the GC needs to find an appropriately sized free block,
and if there's fragmentation this may take a while. The point is that
malloc/free vs. GC isn't quite as clear-cut on its own, never mind that
malloc/free aren't called that often. Generally speaking, though, if you can
give GC ample headroom in terms of RAM, it'll have better throughput than
incessant malloc/free use (which, I argue, is rare in properly written
applications).

~~~
pron
> I'm not sure memory locality effects can be considered secondary

What is secondary isn't the TLAB/stack performance ratio, but that ratio vs
malloc/TLAB.

Also, I'm not sure why you think locality matters much here in the case of
stack reuse. Within each frame, the stores always come first, and those go in
the store buffer (and the reads are from the store buffer, too), so those are
pretty benign cache misses.

> which, I argue, is rare in properly written applications

Sure, it is rare in "well written applications", but how costly is it to write
a well-written application in a large team, and how much extra performance can
you get? Remember, we're not talking about a DSP, a microcontroller or a net
router when we're discussing GCs, but big, complex applications. Nobody is
claiming you can't beat a GC given enough work (though it's harder the more
concurrency is involved).

Also, think about what kind of data we're talking about. The interesting data
is database data, and that has both arbitrary lifetime as well as concurrent
read/write. And if you don't use malloc/free, at best you need to write your
own manual memory allocator which is just as complex, and at worst you
basically need to write your own GC.

~~~
vitalyd
Ok, we keep talking about malloc -- which malloc impl are you specifically
referring to? There are many allocators out there these days, so let's be a
bit more concrete. If not a specific name, at least the class of allocator. Most
of the common ones you'll find support thread-local allocation buffers, for
starters.

> Sure, it is rare in "well written applications", but how costly is it to
> write a well-written application in a large team, and how much extra
> performance can you get?

> Also, think about what kind of data we're talking about. The interesting data
> is database data, and that has both arbitrary lifetime as well as concurrent
> read/write. And if you don't use malloc/free, at best you need to write your
> own manual memory allocator which is just as complex, and at worst you
> basically need to write your own GC.

If we're going to talk about databases, then "well-written" better be one of
the top concerns, and team size should be irrelevant to that. And the more
mechanically sympathetic a product you're building is (and DBs are right up
there in pretty much all aspects: cpu, i/o, net, mem, etc), the more you need
to have control over those resources.

Have you, for example, looked at how postgresql manages memory? sqlite? redis?
memcached? nginx? varnish? And, as I mentioned in the other reply, most of the
big data java solutions end up rolling their own off-heap mem management infra
using the same techniques as you'd use without GC.

~~~
pron
> Most of the common ones you'll find support thread-local allocation buffers,
> for starters.

And what about concurrent deallocation?

> Have you, for example, looked at how postgresql manages memory? sqlite?
> redis? memcached?

Not too well (basically lots and lots of locking, much of it is very coarse-
grained). Our spatial in-memory Java database (SpaceBase) offers an order-of-
magnitude better performance in concurrent usage (and much better scalability
with core number). We do over 200K transactions (with lots of contention) per
second on a single 4-core laptop without breaking a sweat (concurrently with
the application itself), and over a million on a large server (with some
careful tuning).

But even for less super-concurrent databases, C++ databases don't outperform
Java ones. In this benchmark, the Java databases (H2 and HSQLDB) almost always
outperform MySQL and Postgres:
[http://www.h2database.com/html/performance.html](http://www.h2database.com/html/performance.html)
(and I don't even know how the Java solutions handle concurrency, whether they
do locking, optimistic locking or a clever combination, like SpaceBase).

In both cases, the amount of effort put into the Java solutions is orders of
magnitude less than the C/C++ solutions.

> nginx? varnish?

Those are (virtually) read-only use cases. That is very easy to do
concurrently no matter how you manage concurrency. The trick is concurrent
writes, not reads.

~~~
anarazel
> But even for less super-concurrent databases, C++ databases don't outperform
> Java ones. In this benchmark, the Java databases (H2 and HSQLDB) almost
> always outperform MySQL and Postgres:
> [http://www.h2database.com/html/performance.html](http://www.h2database.com/html/performance.html)
> (and I don't even know how the Java solutions handle concurrency, whether
> they do locking, optimistic locking or a clever combination, like
> SpaceBase).

It's hardly surprising that a vendor's own benchmark shows it as winning
against the competition.

~~~
pron
Let's suppose for a second that the numbers are biased in the Java databases'
favor. I doubt that other benchmarks would show such dramatically different
results that would have those databases losing by much. So maybe they're not
ahead as those numbers show (though I have little reason to doubt them), but a
little behind. The difference would still not justify the claim that C++
provides superior performance for databases.

Also, it's an open source database with no commercial support by the authors,
so I wouldn't really call them "vendors".

------
jeffcoat
The slides are interesting enough that I went looking to see if the actual
talk that goes with them was online anywhere; found it at

[http://chariotsolutions.com/screencast/philly-ete-2015-7-monica-beckwith-gc-tuning-confessions-of-a-performance-engineer/](http://chariotsolutions.com/screencast/philly-ete-2015-7-monica-beckwith-gc-tuning-confessions-of-a-performance-engineer/)

------
phamilton
For some reason, GC is one of those areas that I have zero desire to learn
sufficiently to tune. That's probably why OP has a successful career as a
consultant.

When making tradeoffs in development, a simpler GC with predictable behavior
is worth losing a lot of raw performance to me. Thus I find myself drawn to
platforms like erlang/elixir or Haskell where GC is isolated.

I suppose the rational approach is to build on HotSpot and, when it becomes a
problem, hire a consultant for instant performance gains (assuming low-hanging
fruit is in abundance).

~~~
pron
In 80% of the cases, simple GC ergonomics (i.e. let HotSpot figure out the
tuning) are more than ok, and in 95% of the cases, trivial tuning is enough.

The thing with Erlang is that the GC doesn't work on any shared memory data
structure (like ETS), so you pretty much have to delegate any shared data to
an out-of-process database, even in simple cases that are easily addressed
with ConcurrentHashMap, ConcurrentSkipListMap/Set, or CopyOnWriteArrayList/Set
on the JVM.
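As a minimal sketch of what that looks like on the JVM (class name and numbers are arbitrary), several threads can update one ConcurrentHashMap in-process, with no explicit locking and no external database:

```java
import java.util.concurrent.ConcurrentHashMap;

// Several threads updating one shared map concurrently: no explicit locking,
// no out-of-process store -- the data structure handles the contention.
public class SharedCounter {
    public static long run() {
        ConcurrentHashMap<String, Long> hits = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 10_000; n++) {
                    hits.merge("page", 1L, Long::sum); // atomic read-modify-write
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return hits.getOrDefault("page", 0L);
    }

    public static void main(String[] args) {
        System.out.println(run()); // 40000: no updates lost
    }
}
```

This kind of in-process shared mutable state is exactly what's awkward when the GC doesn't cover shared-memory structures.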

~~~
phamilton
I generally use an Agent in Elixir for anything shared and there's very little
cognitive overhead (though perf does take a hit). Agent.cast/2 is helpful for
low latency modifications (at the cost of losing back pressure).

As stated, it's a tradeoff.

~~~
pron
But, if I'm not mistaken, an agent isn't concurrent. You make your shared data
a bottleneck, which is exactly what concurrent data structures are meant to
prevent.

------
nickpsecurity
Interesting article. What I see is another failure of industry to solve the
root of the problem: different GC strategies work for different applications
and should be selectable. Might even allow pluggable GCs. This is the path
taken in JX Operating System and certain academic works on improving Java
runtime. JX's first tier of memory management just gives a certain amount of
resources a component/process/app can use. The second tier is its GC, which
can be different per application. So, you could partition your application
into components with associated GCs focused on throughput, low latency,
or productivity. Then the problem shifts to making an optimal scheduler for
the app and GC runtimes. There's already good schedulers, strategies for
building them, and even tools for automated planning of scheduling strategies.
Any of these might be integrated to optimize that aspect.

Overall, using pluggable garbage collectors, feeding a planner some
constraints, and compiling the result sounds much easier than this person's
day job. Embedded Java already does some of this while some systems did it in
hardware to avoid most issues entirely. Enterprise Java should adopt such a
method. Meanwhile, reading of these nightmares, I'll continue avoiding such
GC-based tools wherever possible in my work.

JX Operating System for reference
[http://www4.cs.fau.de/Projects/JX/publications/jx-usenix.pdf](http://www4.cs.fau.de/Projects/JX/publications/jx-usenix.pdf)

------
jokoon
I wonder, isn't working on the stack much faster than working on the heap
anyway?

So when you query data on the heap, shouldn't you just query larger objects,
put them on the stack, and work from there, instead of using the heap so
often?

With that in mind, wouldn't that render garbage collection almost irrelevant
if your code is well designed, by not working on the heap too much?

It's true that more RAM makes the GC more relevant, but if it's an excuse for
negligent software design, maybe GCs are not such a good idea. It's good to
have features that make the programmer's job easier, but if they only save the
time of skilled programmers who know a little about how it works underneath,
is it such a good idea?

You can hardly argue that simple tools and predictable behavior in machines
are not the safest way to get fair results.

A programming language is already a big shortcut for working faster, and I
doubt you should try to work even faster if it means creating new drawbacks.

~~~
evanpw
The stack and the heap are both just regions of memory, so there's nothing
inherently faster about using the stack. The difference is the memory
management strategy: Usually, stack variables are all allocated at the
beginning of a function, and all deallocated at the end. If all of the memory
your program needs is tied to a particular scope like that, then it's
certainly faster than any other memory-management strategy. If you need to
share data between scopes, build complicated data structures of indeterminate
compile-time size, or allocate large blocks of memory (the stack is usually
limited in size), then you have to do something different (like use the heap).

That said, I think the Rust approach to memory management is really
interesting: try to tie all allocated memory to some scope, and keep track of
when ownership is borrowed by or moved to a different scope.

~~~
vitalyd
The inherent benefit to the stack is the memory region stays hot in cache due
to natural use of the stack. That's your fast reusable buffer for temporaries.

In addition, of course, cleanup/reclaim of the stack space is pointer bump, so
you get pointer bump allocation and deallocation, effectively.

~~~
evanpw
True, but you could do the same thing in any region of memory. The only thing
that using "the stack" buys you is that there are a few special instructions
to allocate / deallocate one machine word at a time (if you store the "top-of-
stack" pointer in a specific register), and you get a bit more locality by
virtue of return addresses being stored next to your locals and temporaries.
(For x86 processors at least; maybe other architectures treat the stack in a
more special way, I don't know.)

~~~
vitalyd
> True, but you could do the same thing in any region of memory. The only
> thing that using "the stack" buys you is that there are a few special
> instructions to allocate / deallocate one machine word at a time (if you
> store the "top-of-stack" pointer in a specific register)

This is pretty much what a TLAB is in Hotspot JVM, except of course the
pointer never moves backwards (once the TLAB is filled up, it retires, and
then can get assigned to a different thread to start allocating from the
beginning). Each allocation into the TLAB moves allocations further into the
region -- there's never any reuse until the TLAB starts afresh.

The stack locality comes from the rest of the stack execution mechanics
keeping this region very warm, and the majority of the read/write action to
the stack is localized (apart from large stack allocations that may
temporarily expand the region's use).

~~~
TheLoneWolfling
Except that a TLAB is worse for cache than a stack.

Look at what happens when you call a bunch of small methods that each allocate
a couple things. With a stack, everything stays in nice proximity
automatically. With a TLAB, you quickly fill the TLAB, get another one, get
another one, and so on.

Effectively: the stack does any garbage collection that can be statically
determined for free, keeping data in cache that actually matters, whereas with
TLABs you need to wait for it to pass through GC before any of it is reused.

~~~
vitalyd
Right, that's why it bothers me when TLAB and stack are compared.

------
jimrandomh
Performance is nice, but correctness is more important. Complex programs
written without any GC tend to have use-after-free and double-free
vulnerabilities. Programs written using automatic reference counting only
(without a way to catch cycles) tend to have memory leaks. If GC is killing
performance, then that is a failure of the specific GC or programming language
design. Don't blame the applications programmers for wanting GC semantics;
that's a completely reasonable thing to want and something that language
designers are capable of providing.

In the specific case of Java, I think the real problem is not having value
objects (like C#'s struct) or generics with non-reference types. Serious GC
problems happen not because GC is too hard, but because in Java an
ArrayList<Integer> of a million items makes a million tiny objects. This problem
is entirely Java-specific.
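The point about a million tiny objects can be illustrated with a sketch (class and method names are invented for the example): the boxed list holds a million references to separately allocated Integer objects, while the primitive array is one contiguous block.

```java
import java.util.ArrayList;
import java.util.List;

// Same logical data, very different heap layout and GC load:
// ArrayList<Integer> = an Object[] of references to scattered Integer objects;
// int[] = one flat, contiguous allocation.
public class BoxingDemo {
    public static long boxedSum(int n) {
        List<Integer> boxed = new ArrayList<>(n); // backing Object[] of references
        for (int i = 0; i < n; i++) boxed.add(i); // each add may allocate an Integer
        long sum = 0;
        for (int v : boxed) sum += v;             // each read chases a pointer, then unboxes
        return sum;
    }

    public static long primitiveSum(int n) {
        int[] prim = new int[n];                  // one contiguous block, no per-element objects
        for (int i = 0; i < n; i++) prim[i] = i;
        long sum = 0;
        for (int v : prim) sum += v;              // sequential, prefetcher-friendly reads
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(boxedSum(1_000_000) == primitiveSum(1_000_000)); // true
    }
}
```

(Small Integer values are cached by the JVM, but for a million distinct values most elements are separate heap objects that the GC must trace.)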

~~~
astral303
Specifically about ArrayList<Integer> -- there are a bunch of projects
(Goldman Sachs Collections, Trove, FastUtil, Koloboke) that solve the
array-of-primitives problem.

I would say that's a Java-standard-library-specific problem.

~~~
TheLoneWolfling
And then you have to do a copy just to pass it around to any other external
libraries. And you lose the advantages of generics.

You end up with 9 copies of everything, which are 99% the same except for a
couple of find-and-replaces. If not more. (For instance, if you have a method that
takes two generic arrays, you need 81 copies! Even if they are the same type
you still need 36 (!) copies.)

Not a good solution.

~~~
astral303
How do you end up with 9 copies of everything? I'm not sure I follow that.

If you are passing in massive ArrayList<Integer> lists to external libraries,
you need to re-evaluate what you're doing. What will that external library do
with this?

~~~
TheLoneWolfling
Look at java.util.Arrays for a concrete example of the problem:

    
    
    static int binarySearch(byte[] a, byte key)
    static int binarySearch(char[] a, char key)
    static int binarySearch(double[] a, double key)
    static int binarySearch(float[] a, float key)
    static int binarySearch(int[] a, int key)
    static int binarySearch(long[] a, long key)
    static int binarySearch(Object[] a, Object key)
    static int binarySearch(short[] a, short key)
    

This sort of works. It's a lot of code duplication, but it's all in the
library.

Except that a relatively common thread goes like this:

You start by having a generic array that gets passed to said binary search. It
works, but it's too memory-hungry when it gets called with a primitive. So,
then what do you do?

Well, you go "huh, I could specialize". And then you start specializing, and
realize that every function that calls binarySearch with a generic type _also_
needs 8 implementations (would be 9, but no sane person would binary search a
boolean[] ).

Basically, it's complexity that you cannot even punt off to an external
library. If you have code that needs to be callable with generics, without
inefficiencies for primitive types, you'll end up with ~9x code duplication
at a minimum.

------
TheLoneWolfling
I don't see why people put so much work into GC before even bothering to do
any work to avoid objects becoming garbage in the first place beyond simple
escape analysis.

------
decafbad
presentation video:
[https://www.youtube.com/watch?v=PnPzeavMjmE](https://www.youtube.com/watch?v=PnPzeavMjmE)

