
Java JIT vs. Java AOT vs. Go for Small, Short-Lived Processes - robfig
http://macias.info/entry/201912201300_graal_aot.md
======
robfig
It's hard to square these articles with the reality I see on the ground: our
baseline memory usage for common types of Java service is 1 GB, vs 50 MB for
Go. We do have a few mammoth servers at the top end in both languages though
(e.g. 75 GB heaps)

The deploy JARs have 100+ MB of class files, so perhaps it's a function of all
the dependencies that you "need" for an enterprise Java program and not
something more fundamental.

These blog posts also present AOT as if it's just another option you can
toggle, while my impression is that it's incompatible with common Java
libraries and requires swapping / maintaining a parallel compiler in the build
toolchain, configuring separate rules to build it AOT, etc. I don't have
actual experience with it though so I could be missing something.

~~~
shantly
It’s been the case ever since I can remember that “Java’s basically as
efficient as native, it’s super fast,” but all the actual Java software I
encounter is a slowish, bloated memory hog. I don’t know why, but that’s how
it is.

~~~
smabie
The style of programming for the dominant JVM languages (Java, Scala, Kotlin)
involves an over-reliance on churning through a lot of short-term garbage. I
think this is partially due to how annoying the platform is to actually use.
Incredibly complex and chock-full of programmer pitfalls (like type erasure).
In fact, the majority of devs have no clue how the JVM works, treating it
as just “magic.”

I would imagine that Go’s GC is worse than the tunable JVM ones, but Go isn’t
powerful enough that one would ever be tempted to program in an
abstraction-heavy style. While I would argue it takes a lot more work to
program in Go than, say, Scala, I think the constraints imposed lead to
better software. I for one
have never seen a JVM desktop or business app that worked super well. The JVM
manages to be a highly optimized efficient platform in which almost
exclusively slow, laggy, and memory hungry apps are produced. With that being
said, it’s a highly productive platform (if you’re using something besides
Java), especially for backend business apps.

~~~
mdasen
GC is complicated because what works well for one thing might not work well
for other things. GC can be both about clever things and about hard trade-
offs. That said, I'll try to talk about GC without having a religious war
erupt. Keep in mind, something here might be wrong.

Go's GC is tuned for low latency. People hate pause times. They hate it even
more than they hate slowness. Go makes a trade-off for short pause times, but
in doing so they do sacrifice throughput. That's great for a web server. With
a web server, you care a lot about pauses offering a bad experience. A 500ms
pause is going to give a customer a bad experience. That's less good for batch
processing. Let's say you're running a backfill job that you expect to take
10-15 hours. You don't really care if it pauses for 1 minute to GC every 30
minutes. No one will even know. However, you will know whether it took 10
hours or 15 hours.

Go's GC is meant to be simple for users. I think the only option is "how much
memory should I be targeting to use?" That makes it dead simple for users. I
think the Go authors are right that too many tuning knobs can be a bad thing.
People can spend months doing nonsense work. Worse, with knobs it's hard to
verify that anything really changed. If you're running a web server, was
the traffic really the same? What about other things happening on the machine?
Wouldn't you rather be writing software instead?

Go's GC is a non-copying GC. One thing this means is that Go's (heap) memory
allocation ends up being very different because Go's memory is going to become
fragmented. So Go needs to keep a map of free memory and allocations are a bit
more expensive. Java (with GCs like G1) can just bump allocate which is
insanely cheap. This is because Java is allocating everything contiguously so
it just needs to move a pointer. How does that work once something becomes
freed? Java's G1 (the default in the latest LTS Java) will copy everything
that's still alive to a new portion of memory and the old portion is then just
empty. You kind of see this in Go's culture. Web frameworks obsess about not
making heap allocations. Libraries often have you pass in pointers to be
filled in rather than returning something.

Go misses out on the generational hypothesis. The generational hypothesis is
one of the more durable observations we have about programming - that most
things that are allocated die really quickly. C# and Java both use
generational collectors by default and they've done way better than what came
before. C# and Java don't have as-low pause times as Go, but part of that is
that they're targeting other things like throughput or heap overhead more.

Go doesn't need GC as much. Go can allocate more on the stack than Java can
and, well, Go programmers are sometimes a bit obsessed with stack allocations
even when it makes for more complicated code. Having structs means creating
something where you can just have contiguous memory rather than allocating
separate things for the fields in your object. Go's authors have observed that
a lot of their objects that die young are stack allocated and so while the
generational hypothesis holds, it's a bit different. Go has put a good amount
of effort into escape analysis to get more stuff stack allocated.

Java has two new algorithms ZGC and Shenandoah which are available in the
latest Java. They're pretty impressive and usually get down to sub-millisecond
pause times and even 99th percentile pauses of 1-2ms.
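For reference, on the JDK 13 used elsewhere in this thread both collectors
still had to be unlocked as experimental options:

```shell
# ZGC (experimental until JDK 15):
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -jar app.jar
# Shenandoah, in OpenJDK builds that include it:
java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -jar app.jar
```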

Go's new GC was constrained by the fact that "Go also desperately needed short
term success in 2015" (Rick Hudson from Google's Go team) and the fact that
they wanted their foreign function interface to be simple - if you don't move
objects in memory, you don't have to worry about dealing with the indirection
you'd need between C/C++ expecting them to be in one place and Go moving them
around. Google's going to have a lot of code they want to use without
replacing it with Go code and so C/C++ interop is going to be huge in terms of
their internal goals (and in terms of what the team targeted regardless of
whether it's useful to you). And I think once they had shown off sub-
millisecond pause times, they were really hesitant to do something that might
introduce things like 5ms pause times. I think they might have also said that
Google had an obsession with the long-tail at the time. Especially at Google,
there's going to be a very long tail and if that's what people are all talking
about and caring about, you end up wanting to target that.

Go has tried other algorithms. They had a request-oriented-collector and that
worked well, but it slowed down certain applications that Go programmers care
about - namely, the compiler. They tried a non-copying generational GC, but
didn't have a lot of success there.

Ultimately, Go wanted fast success and to solve the #1 complaint people had:
extreme pause times (pause times that would make JVM developers feel sorry for
Go programmers). Going with a copying GC might have offered better
performance, but would have meant a lot more work. And Go gets away with some
things because more stuff gets stack allocated and Go programmers try to avoid
heap allocations which would be more expensive given Go's choices (programming
for the GC algorithm).

I don't think that JVM languages have a programming style that lends itself
to an over-reliance on churning through short-term garbage. Well,
Clojure probably since I think it does go for the functional/immutable
allocate-a-lot style. Maybe Kotlin and Scala if you're creating lots of
immutable stuff just to re-allocate/copy when you want to change one field.
That doesn't really apply to most Java programs. And I have covered the way
that Go potentially leads to more stack allocations. However, I don't think
most people know how their Go programs work any more than their JVM programs
and this really just seems to be a "I want to dislike Java" kind of thing
rather than something about memory.

Java programs tend to start slow because of the JVM and JIT compilation. Java
has been focused on throughput more than Go has (at the expense of
responsiveness). That is changing with the two latest Java GC algorithms (and
even G1 which is really good). Java is also working on modularizing itself so
that you won't bring along as much of what you don't need (Jigsaw) and AOT
compilation. Saying that Java is slow just isn't really true, but it might
feel true - things like startup times and pause times can inform our opinions
a lot. There's absolutely no question that Java is a lot faster than Python,
but Python can feel faster for simple programs that aren't doing a lot (or are
just small amounts of Python doing most of the heavy work in C).

I mean, are you including Android in "all Java apps"?

Java, C#, and Go are all really wonderful languages/platforms (including
things hosted on them like Kotlin). They're all around the same performance,
but they do have some differences. I think Go should re-visit their GC
decisions in the future, especially as ZGC and Shenandoah take shape, but
their GC works pretty well. But there are certainly trade-offs being made (and
it isn't around language features that make the platform productive for
programmers). I think GC is very interesting, but ultimately Java, C#, and Go
all have very good GC that offers a good experience.

~~~
ptx
Java is slower than Python for simple short-running programs. When Python is
finished Java is still struggling through start-up and class-loading.

~~~
twic
Is this actually true? Have you measured this, or seen measurements?

~~~
ptx
Here are measurements for an extreme case of a short-running simple program:

    
    
      $ cat Hello.java 
      class Hello { public static void main(String[] args) { System.out.println("Hello from Java!"); } }
    
      $ cat hello.py 
      print("Hello from Python!")
    
      $ time /usr/lib/jvm/java-13-openjdk/bin/java -Xshare:on -XX:+TieredCompilation -XX:TieredStopAtLevel=1 Hello
      Hello from Java!
    
      real 0m0.102s
      user 0m0.095s
      sys 0m0.025s
    
      $ time python3 -S hello.py 
      Hello from Python!
    
      real 0m0.034s
      user 0m0.020s
      sys 0m0.013s
    

It's a bit faster if you create a custom modular JRE with jlink:

    
    
      $ /usr/lib/jvm/java-13-openjdk/bin/jlink --add-modules java.base --output /tmp/jlinked-java13-jre
      $ /tmp/jlinked-java13-jre/bin/java -Xshare:dump
      $ time /tmp/jlinked-java13-jre/bin/java -Xshare:on -XX:+TieredCompilation -XX:TieredStopAtLevel=1 Hello
      Hello from Java!
    
      real 0m0.087s
      user 0m0.050s
      sys 0m0.035s

------
correct_horse
Go has no proper solution to garbage collector ballast, a hack which
businesses you've heard of are using in the wild. See
[https://blog.twitch.tv/en/2019/04/10/go-memory-ballast-how-i-learnt-to-stop-worrying-and-love-the-heap-26c2462549a2/](https://blog.twitch.tv/en/2019/04/10/go-memory-ballast-how-i-learnt-to-stop-worrying-and-love-the-heap-26c2462549a2/)
and
[https://github.com/golang/go/issues/23044](https://github.com/golang/go/issues/23044).
A golang team member's reasons for not adding a minimum heap size include that
it would require more testing each release, and that they might want to have a
max heap size hint instead.

I posted a comment similar to this one recently, but it seems more relevant
here.

~~~
Groxx
I've noticed similar things while profiling my tests - making a large static
allocation or bumping GOGC=1000 can cause them to run more than 2x faster (my
favorite took a 12 second test suite and dropped it to about 2.5 seconds). So
much time is spent assisting the GC to keep memory at like an 8-12mb range, as
if that small of a heap was somehow the most important thing Go could be doing
with the CPU.

------
ape4
For a short lived Java process you can use the no-op garbage collector (ie
don't collect).
[http://openjdk.java.net/jeps/318](http://openjdk.java.net/jeps/318)
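The JEP 318 collector (Epsilon, experimental since JDK 11) is enabled with:

```shell
# Allocate only, never collect; the process exits before memory runs out.
java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC MyShortLivedJob
```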

~~~
MaxBarraclough
That would save the trouble of initializing a garbage collector that isn't
going to be used. Is that a significant saving?

~~~
stygiansonic
_“Last-drop throughput improvements. Even for non-allocating workloads, the
choice of GC means choosing the set of GC barriers that the workload has to
use, even if no GC cycle actually happens. All OpenJDK GCs are generational
(with the notable exceptions of non-mainline Shenandoah and ZGC), and they
emit at least one reference write barrier. Avoiding this barrier can bring the
last bit of throughput improvement. There are locality caveats to this, see
below.”_

From: [https://openjdk.java.net/jeps/318](https://openjdk.java.net/jeps/318)

~~~
MaxBarraclough
I'd missed that, thanks.

------
harikb
When I was looking for my first car in early 90s, I knew nothing about cars or
brands. One thing I noticed was that all the TV ads for most of the cars would
say "more room than a Camry". I knew what I needed to buy.

If Go doesn't survive another decade, I would still be happy about what it
triggered.

------
zestyping
100 ms is nowhere near what I'd call "negligible" for a process that might
only live a second or two!

python -c "print('hello')" starts up and shuts down an entire Python
interpreter in less than 50 ms on my machine, whereas the equivalent Java
program never takes less than 120 ms. That seems pretty sad for a language
that's had at least an order of magnitude more resources thrown at it.

~~~
pron
> whereas the equivalent Java program never takes less than 120 ms

Then you must be using an old version of Java.

~~~
Groxx
Given that hardware varies, and the article was showing 80-90ms floors: 120ms
floor seems entirely reasonable to me.

~~~
pron
The article wasn't using a current Java, either. Current numbers are under
40ms for Hello, World:
[https://cl4es.github.io/2019/11/20/OpenJDK-Startup-Update.html](https://cl4es.github.io/2019/11/20/OpenJDK-Startup-Update.html)

~~~
ptx
That's certainly good news and the renewed focus on start-up performance is
very welcome. I wonder what the numbers look like for small Kotlin programs.

(Although I see now that the comparison disabled CDS on Java 8, so the actual
improvement is perhaps not as large as it seemed.)

------
mikece
I'm curious how Dart2Native would fare in this test, if it would be more or
less the same as Go or if it would be more efficient.

~~~
erokar
I'm curious about this too. Dart has an interesting runtime since it supports
both quick JIT compilation and AOT to machine code.

~~~
mikece
With Flutter it's fully AOT compiled to binary ARM code and is a compelling
choice for IoT applications as well.

------
karmakaze
I've run small services built with Java, Go and Crystal to achieve good
startup and throughput performance while minimizing memory usage. My
experience with Crystal is limited but has been positive thus far.

The sweet spot for me is using Java with OpenJ9 which has very fast startup
time while sacrificing only a bit of top-end throughput.

I would only choose AOT if packaging/deployment were issues with the JIT
approach. In the case of Go, AOT is practically free but I prefer not being
limited to array/slice, map, and channel generics.

------
CountHackulus
I'd kill for a comparison with the IBM J9 AOT flag. It essentially just caches
JITted code for startup. Also, if startup time is super important, then you
can try -Xquickstart.
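A sketch of those options, assuming an OpenJ9 build of the JDK:

```shell
# OpenJ9's shared classes cache also holds AOT-compiled code across runs:
java -Xshareclasses:name=mycache -jar app.jar
# -Xquickstart trades peak throughput for faster startup:
java -Xquickstart -jar app.jar
```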

------
skywhopper
Interesting numbers. I have been out of the Java world for a few years and I’m
unfamiliar with GraalVM but I’m curious how compatible it is with Oracle Java
or OpenJDK.

Of course for small scale stuff like a QuickSort implementation, JVM starts
are fast-ish, but for a nontrivial service, library load times during boot can
balloon quickly depending on your build discipline.

~~~
evacchi
Incredibly compatible, albeit with some limitations. The compiler is quite
aggressive, so e.g. you can't do reflection on "any" class, you just have to
tell the compiler what you are going to use at runtime (so called "closed-
world assumption"). Also most initialization code is forced to run at
"compile-time" to keep boot time low. Pretty incredible piece of code.

~~~
mcguire
What's the license on it?

I'm looking at a project that makes " _The main benefits of doing so is to
enable polyglot applications (e.g., use Java, R, or Python libraries)_ " sound
attractive.

------
sreque
This article reiterates that the belief that JVM start-up time is slow is
somewhat a myth. When I measured it years ago, I found JVM start-up time to
be roughly equivalent to node.js.

What makes start-up time bad in any interpreted language, including Java,
Python, and JavaScript, is code-loading time. This time is O(n), where n is
the total size of your app, including transitive dependencies. It takes time
to load, parse, and validate non-native code. This time far dwarfs any VM
start-up time.

As an experiment, write a hello world node.js app and time its execution.
Then add an import statement on the AWS SDK. Don't actually use it. Just
import it! When I last measured it, this caused start-up time to go from 30 ms
to something like 300 ms. The extra time mostly comes from loading code.

For a native app, the binary itself just gets mmapped into memory. Shared
library loading is more expensive, but not much, and way less than loading
source code or byte code.

The tl;dr is: if you want a fast-starting non-native app, you have to shrink
the transitive dependency closure your app loads to do its job. This is easy
for toy benchmark apps but can be harder for real apps. It also goes against
the philosophy of most devs to rely on third-party libraries for everything.
For instance, if you care about start-up time, it may be worth re-implementing
that function you'd normally get from Guava or Apache Commons. You can
alternatively use a tool like ProGuard to shrink your dependency closure.

------
xvilka
It would be nice to have some automated tools to convert from Java to Go or
Rust. Something like c2rust [1], but for Java. There exists[2] some kind of
automation, but it's too basic to be practical.

[1] [https://GitHub.com/immunant/c2rust](https://GitHub.com/immunant/c2rust)

[2] [https://github.com/aschoerk/converter-page](https://github.com/aschoerk/converter-page)

------
gok
"3 Bad Tools for a Job, which is least bad?"

~~~
ncmncm
This.

With a big enough hammer, every screw is a nail.

------
AtlasBarfed
Why can't the JVM preload on startup the majority of the core runtime as a
shared library architecture in a sort of super-permanent generation, and then
the JVMs piggyback on that?

I remember solaris boxes had little of the startup cost and someone told me
they preloaded the java runtime.

~~~
thu2111
It does. That's called AppCDS and is a relatively recent feature; it's also
not on by default, but they're working on making it be used automatically.
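On recent JDKs (13+) the workflow looks roughly like this:

```shell
# Record the classes the app actually loads into an archive at exit...
java -XX:ArchiveClassesAtExit=app.jsa -jar app.jar
# ...then memory-map the pre-parsed archive on subsequent runs:
java -XX:SharedArchiveFile=app.jsa -jar app.jar
```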

------
rijoja
Wouldn't the JITed code be optimized for the specific CPU, whereas the AOT
version perhaps cannot make use of certain instructions? Also, when making a
comparison of the different models one must keep in mind that the software
that processes the code has different characteristics.

~~~
mdasen
JIT'd code isn't just about specific CPUs, but about optimizing hot paths. For
example, you don't want to inline a commonly used function all over the place,
but if there's one area that calls it 10,000 times per second while the others
call it once a minute at most, you can re-write the program at runtime having
observed that hot code path.

To get an intuition for JITs: there are pieces of your code that you know will
always be run in a certain way (or mostly run in that way), but it isn't
provable at compile time that it will always be that way. JITs can notice that
pattern and optimize it (or even provide optimizations for common, but not
exclusive paths).

~~~
ncmncm
While they could, in principle, do this, does anybody actually do it in
production?

~~~
jacques_chester
Yes, if you have Java code that runs for more than a few minutes in
production.

------
e12e
Interesting write-up - would be nice to see not just the quicksort code, but
the harness/scripts used for benchmarking. As far as I can tell it's not
included?

------
haolez
I wonder if AOT would speed up Groovy as well. My intuition is that it doesn't
matter, since the runtime will get bundled and execution time will be the
same.

~~~
imtringued
That's exactly what happens. When you compile Groovy with GraalVM it will
generate a fallback image that still requires a full JVM.

------
michaelcampbell
Since the article is about "short lived processes", I'm having a hard time
caring about the memory use as a first order issue. <shrug>

~~~
AnimalMuppet
Depends on how many of them you have running at any one instant.

~~~
firethief
If you have short-lived tasks popping off at that rate, the JVM looks some
orders of magnitude better if you don't spin up a fresh process for every
task.

------
mister_hn
The question is: if you compile Java to native code, what's the purpose of
using Java instead of Go or Rust or C++?

~~~
chrisseaton
Existing code.

Existing expertise.

Existing tooling.

~~~
xvilka
You can convert it to Kotlin and just use LLVM-based native[1] compilation,
which is even more efficient. It's completely automated, and gives you a
better and more modern language without much hassle.

[1] [https://kotlinlang.org/docs/reference/native-overview.html](https://kotlinlang.org/docs/reference/native-overview.html)

~~~
vips7L
Except it barely has a stdlib without the JVM. You can't even read a file in
Kotlin/Native without having to do a POSIX syscall.

------
ddtaylor
For anyone looking to avoid attacks because of no SSL:
[https://archive.is/Rzdko](https://archive.is/Rzdko)

