
Performance of modern Java on data-heavy workloads - cangencer
https://jet-start.sh/blog/2020/06/09/jdk-gc-benchmarks-part1
======
willvarfar
A very clear and interesting post.

I've been trying to fit big-enough long-running stuff into JVMs for a few
years, and have found that minimizing the amount of garbage is paramount. It's
a bit like games programming or C programming.

Recent JVM features like 8-bit (compact) strings and no longer having a size
limit on the interned pools etc. have been really helpful.

But, for my workloads, the big wastes are still things like java.time.Instant
and the overhead of temporary strings (which, these days, copy the underlying
data; my code worked better when split strings were just views).

There are collection libraries with much more memory-efficient (and faster)
maps and things, and also efficient (and fast) JSON parsing etc. I have
evaluated, benchmarked, and adopted a few of these.

Now, when I examine heap dumps and try to work out where else I can save
bytes to keep GC at bay, I mostly see fragments of Instant and String, which
are heavily used in my code.

If only there were a library that did date manipulation and arithmetic with
longs instead of Instant :(
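(For what it's worth, the wished-for approach can be sketched with plain longs over epoch milliseconds, touching java.time only at the boundaries. This is a hypothetical utility, not an existing library - all names here are made up:)

```java
import java.time.Instant;

// Hypothetical sketch: allocation-free date arithmetic over epoch
// milliseconds, converting to Instant only when a value leaves the hot path.
public final class EpochMillis {
    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    private EpochMillis() {}

    // Add whole days without allocating an Instant per step.
    public static long plusDays(long epochMillis, long days) {
        return epochMillis + days * MILLIS_PER_DAY;
    }

    // Truncate to midnight UTC, again as a pure long computation.
    // floorDiv handles pre-1970 (negative) timestamps correctly.
    public static long truncateToDayUtc(long epochMillis) {
        return Math.floorDiv(epochMillis, MILLIS_PER_DAY) * MILLIS_PER_DAY;
    }

    // Box into an Instant only at the edges, e.g. for formatting.
    public static Instant toInstant(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }
}
```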

~~~
rb808
You're absolutely right. It's one reason I struggle with the modern fashion
for immutable classes and FP: they are always making copies of everything,
which seems crazy.

~~~
mumblemumble
Ideally, a good compiler that understands FP will, behind the scenes, detect
when it's safe to mutate the old data rather than creating a copy. That's a
big part of why Haskell manages to be neck-and-neck with C despite being
functionally pure.

Where it gets tricky is in an environment like the JVM where programming in
that style was not anticipated, and introducing any optimizations along these
lines for the benefit of the proverbial Scala fans needs to be balanced
against the obligation not to adversely impact idiomatic Java code.

That said, even without that, it's not necessarily crazy. It's just a value
call: Do you believe that more functional code is easier to maintain, and
perhaps value that above raw performance? I'm old enough to remember similar
debates about how object-oriented C++ code should be, and to have at least
encountered Usenet posts from similar debates about how structured C code
should be. I don't bring this up by way of trying to weasel in some
"historical inevitability" argument - these are legitimate debates, and there
are still problem domains where coding guidelines may discourage, or even
prohibit, certain structured programming practices. For very good reasons.

~~~
chrisseaton
> That's a big part of why Haskell manages to be neck-and-neck with C despite
> being functionally pure.

Come on... that's not what you typically get with Haskell, is it?

~~~
nicoburns
Rust gives you FP style without loads of copies, and C-like performance.
Immutable data structures aren't idiomatic, but its ownership model gives you
most of the same benefits.

~~~
zozbot234
The nice thing about Rust is that it gives you the main benefits of FP even
when you're not programming in "FP style". The Rust "sharing xor mutability"
default model provides underlying semantics and ease of analysis that are
quite comparable to what you get with a pure-functional language. Of course,
interior mutability via Cell<>, RefCell<> etc. undermines this, but those are
only used when necessary.

------
blinkingled
I wonder how things would have stacked up with OpenJ9 - the AdoptOpenJDK
project makes OpenJ9 builds available for Java 8/11/13/14 - so it should be
trivial to include it in the benchmarks.

We have been experimenting with it in light of the Oracle licensing
situation, and it does provide an interesting set of options - AOT, various
GCs (metronome, gencon, balanced) - along with many other differentiators from
OpenJDK, like JITServer, which offloads JIT compilation to remote nodes.

[https://www.eclipse.org/openj9/docs/gc/](https://www.eclipse.org/openj9/docs/gc/)

It doesn't get as much coverage as it should - it's production hardened (IBM
has used it and still uses it for all their products) and it's fully open
source.

~~~
pron
> in light of the Oracle licensing situation

You mean the licensing situation where Oracle completed open-sourcing the
entire JDK and made Java free of field-of-use restrictions for the first time
in its history?

If you're talking about the JDK builds you download from Oracle, then there
are two (each linking to the other): one paid, for support customers, and one
100% free and open-source: [http://jdk.java.net/](http://jdk.java.net/)

~~~
fgonzag
I think he means the part where you have to pay to use JDKs older than 6
months, which means basically everyone has to pay.

~~~
Spinfusor
Not true. You can use the OpenJDK for free until the end of time. If you want
ongoing updates beyond six months, there are a bunch of free distributions:
Azul Zulu Community (7/8/11/13/14), AdoptOpenJDK (8/11/14), etc.

~~~
fgonzag
The whole subject is about Oracle's JDKs here, and that's a very recent
development too.

~~~
pron
Oracle distributes two JDKs: one for support customers and one 100% free,
forever: [http://jdk.java.net/](http://jdk.java.net/)

------
molodec
The specific workload matters a lot. I had a good experience with the
Shenandoah collector on an application that generates very few intermediate
objects, but once an object is created it stays in the heap for a while (a
custom-made key/value store for a very specific use case). Shenandoah was the
best in terms of throughput and memory utilization. Most collectors are
generational, so surviving objects have to be moved from Eden to Survivor to
Old. Shenandoah is not generational, and I suspect it has less work to do for
objects that survive compared to other collectors. When most objects live long
enough, generational collectors hinder performance.

~~~
bestboy
Yep, workload matters. Generational garbage collectors are fundamentally at
odds with caching/pooling of objects. They are based on the assumption that
objects die young. Typically that is not the case for internal caches, though.
Caches usually consist of long-living/tenured objects.

~~~
NovaX
It is a stretch to claim caching is fundamentally at odds with GC. It is more
correct to say that LRU breaks the generational hypothesis, because it
prioritizes new entries, which take a long time to be evicted. However, many
workloads are frequency-biased, and those one-hit wonders degrade the hit
rate. That is why you'll see more aggressive eviction in a modern policy, so
you'll have better GC behavior and higher hit rates using something like
Java's Caffeine library.

------
ww520
G1 looks very good. Glad it's become the default - one less thing to tune for
a deployment.

------
cangencer
Follow-up post: [https://jet-start.sh/blog/2020/06/23/jdk-gc-benchmarks-rematch](https://jet-start.sh/blog/2020/06/23/jdk-gc-benchmarks-rematch)

------
xvilka
Converting Java code to Kotlin, then compiling it with Kotlin Native[1], is
more promising from a performance point of view. Native code is always faster
(assuming the compiler is good enough).

[1] [https://kotlinlang.org/docs/reference/native-overview.html](https://kotlinlang.org/docs/reference/native-overview.html)

~~~
haxen
An ahead-of-time compiler doesn't have the advantage of the call profile at
polymorphic call sites. The JIT compiler has many more inlining opportunities,
and in some cases this results in better performance.

Also, there are cases where manual memory management, which usually boils down
to reference counting, has significant overhead where a GC-managed runtime has
none at all. These cases involve repeatedly building up and then discarding
large data structures. GC algorithms simply don't see the dead objects,
whereas refcount-based management must explicitly free each object's memory.
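(The pattern being described looks like this - a rough sketch, with made-up names. A tracing collector only walks the live set at collection time, so the millions of dead nodes from earlier iterations are never visited; a refcounting scheme would have to decrement and free each one individually:)

```java
import java.util.ArrayList;
import java.util.List;

// Build-up-and-discard churn: each iteration creates a large temporary
// structure that becomes garbage as soon as the loop advances. Under a
// tracing GC the dead batches cost nothing at collection time.
public class BuildAndDiscard {
    static long churn(int iterations, int sizePerIteration) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            List<long[]> batch = new ArrayList<>(sizePerIteration);
            for (int j = 0; j < sizePerIteration; j++) {
                batch.add(new long[]{i, j});
            }
            checksum += batch.get(sizePerIteration - 1)[1];
            // `batch` and all its nodes are unreachable after this point
        }
        return checksum;
    }
}
```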

~~~
kllrnohj
> The JIT compiler has much more inlining opportunities

That's largely only true for devirtualization, which tends to not be as much
of an issue in AOT compiled languages due to having features that just make
reliance on virtual calls less prevalent (think C++ templates as an example in
the extreme).

The only other case where JITs can inline more than AOTs is across shared
library boundaries, which can be useful but if it is useful in a particular
place it's also typically easy to "fix" by just making that function
statically linked (or implemented in the header, even) instead.

Otherwise, the time constraints of JITs near-universally mean they cannot
optimize as well as AOTs, even though they do have more runtime information
available - unless you do a multi-tiered JIT approach like WebKit does
([https://webkit.org/blog/3362/introducing-the-webkit-ftl-jit/](https://webkit.org/blog/3362/introducing-the-webkit-ftl-jit/)),
with the last tier being the one that finally allows a full "AOT quality"
optimization pass, because you can finally justify the time spent on the
optimizer. But then you also have ridiculous warmup latencies.

> Also, there are cases where manual memory management, which usually boils
> down to reference counting, has great overheads where a GC-managed runtime
> has no overhead at all. They involve repeatedly building up and then
> discarding large data structures. GC algorithms simply don't see the dead
> objects, whereas refcount-based management must explicitly free the memory
> of each object.

There's a lot more to this than such a simple claim. GC'd languages also
almost always need to pay a zeroing cost along with freeing memory, which
makes the actual free a lot slower, and GC'd languages get slower as the live
object count grows, while manually memory-managed languages are ~constant.
There are also more strategies in play for manually managed languages than
just ref counting - such as single ownership (std::unique_ptr, Rust's Box<>,
etc.).

If you are doing something that involves repeatedly building up and then
discarding a data structure, though, that's where manual memory management
would run circles around a GC'd one. A simple arena allocator is a superb
match for that and can't be beaten on performance: bump-pointer allocation
speed, zero GC pause, zero collection latency, etc. This is what games do for
per-frame allocations, for example - essentially a single-frame GC without a
collection pass being needed. Not a lot of things actually do build up and
then discard a structure repeatedly, so you don't get to use this trick very
often, but when you can, it's stupid fast.
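(A bump-pointer arena like the per-frame allocators described above can be sketched even in Java, using a ByteBuffer since the JVM has no built-in arenas - this is an illustrative toy, not production code. Allocation is a pointer bump; "collection" is resetting one index, with no per-object free and no tracing:)

```java
import java.nio.ByteBuffer;

// Toy bump-pointer arena over a direct ByteBuffer. alloc() just advances
// the buffer position; reset() discards everything at once, like the end
// of a game frame.
public final class FrameArena {
    private final ByteBuffer buffer;

    public FrameArena(int capacityBytes) {
        this.buffer = ByteBuffer.allocateDirect(capacityBytes);
    }

    // Allocate `size` bytes; returns the offset of the new block.
    // Throws if the arena is exhausted (position would exceed the limit).
    public int alloc(int size) {
        int offset = buffer.position();
        buffer.position(offset + size);
        return offset;
    }

    public void putLong(int offset, long value) { buffer.putLong(offset, value); }
    public long getLong(int offset)             { return buffer.getLong(offset); }

    // End of frame: one bump-pointer reset frees every allocation at once.
    public void reset() {
        buffer.clear();
    }
}
```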

