
Build OpenJDK for a Nice Speedup - mring33621
http://august.nagro.us/optimized-openjdk.html
======
pron
Precisely what compilers do for a given flag varies by compiler and by
version. There could potentially be issues with certain optimizations
(unfortunately, the OpenJDK VM does have undefined behavior, although we're
attempting to gradually reduce it; the chance of a real problem due to
undefined behavior certainly increases with the optimization level, as does
the probability of a compiler bug), and there have been such issues in the
past.

So while building OpenJDK with particular optimizations for your particular
hardware might be worthwhile in some cases, it's not for the faint of heart,
and it should be done with care and extensive testing. If you intend to deploy
such a custom-built OpenJDK in production, you should strongly consider
getting the JCK [1] (the Java TCK) and testing your build for conformance (for
all the VM configurations you'll use: GC choice, compiler choice etc.).

A safer and easier way to get "free" performance speedups is to use the most
recent JDK (currently 13).

[1]: http://openjdk.java.net/groups/conformance/JckAccess/

~~~
saagarjha
> unfortunately, the OpenJDK VM does have undefined behavior, although we're
> attempting to gradually reduce it

Is it a goal to remove undefined behavior completely? Because I seem to recall
certain architectural decisions being completely reliant on undefined
behavior, such as signal handlers as implicit null checks…
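Roughly, the trick looks like this (a minimal POSIX-only sketch, not
HotSpot's actual code; both reading through a null pointer and longjmp-ing
out of the handler are undefined behavior in standard C++):

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>

    static sigjmp_buf recovery_point;

    static void segv_handler(int) {
        // Jump out of the faulting load; nullness is detected only
        // when the fault actually fires.
        siglongjmp(recovery_point, 1);
    }

    int read_field(const int* obj) {
        if (sigsetjmp(recovery_point, 1) == 0)
            return *obj;  // fast path: no explicit null check
        return -1;        // slow path: "throw NullPointerException"
    }

    int main() {
        signal(SIGSEGV, segv_handler);
        printf("%d\n", read_field(nullptr));  // recovers, prints -1
    }

(Compiled with g++ -O0 this prints -1 on Linux; at higher optimization levels
gcc may turn the known-null load into a trap instruction, which is exactly the
fragility being discussed.)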

~~~
chrisseaton
You can write these little architectural parts in assembly rather than C, to
avoid relying on undefined behaviour.

~~~
saagarjha
_Does_ the JVM use assembly for this? I’d expect writing all of the VM code
that handles Java objects in assembly for each platform would be a bit
difficult, and I’m not sure I saw enough assembly files in the project for
this…

~~~
chrisseaton
No I mean it is an option available to avoid undefined behaviour, not that
this is what the JVM does.

But moving more of the implementation of the JVM (or all of it!) to Java is
the real solution to all this in my opinion.

~~~
saagarjha
> No I mean it is an option available to avoid undefined behaviour, not that
> this is what the JVM does.

Ah.

> But moving more of the implementation of the JVM (or all of it!) to Java is
> the real solution to all this in my opinion.

And what, ship a Graal binary to cut the bootstrapping/startup step?

~~~
chrisseaton
Yes, rewrite all of the JVM into Java and ship an AOT-compiled binary.

------
eadler
The intro to the article suggests compiling for your architecture and then
does several other things.

The article benchmarks -Ofast, which is poorly named. It's really
-Obroken-by-design. It'll be "faster" but completely break applications.

It also suggests using -fomit-frame-pointer, which destroys debuggability.

-march and -mtune are the parts that the article title and intro actually
suggest. While possible, I see no evidence that this matters. As I understand
it, the arch that Java is compiled with is not the same as the one that gets
used for JIT compiling.

~~~
earenndil
> [-Ofast will] be "faster" but completely break applications

Except for the jvm, apparently. And everything else I've tried it with.

> It also suggests using omit frame pointer which destroys debugability.

Which is completely useless except for jvm developers.

> As I understand it the arch that Java is compiled with is not the same as
> the one that gets used for JIT compiling.

The performance of the compiler itself matters, not just the performance of
the generated code, because, since it's a JIT, compiler code continues to run.

~~~
rjsw
The Hotspot JIT reads the CPU configuration at runtime to choose which
optimizations are best or which instruction extensions are available.

~~~
pron
Right, but the OpenJDK VM (HotSpot) uses three JITs -- C1, C2 and Graal -- and
two of them are written in C++, so C++ compiler flags could affect the
performance of the JIT compilers, although not of the code they generate.
Because the performance of the emitted code is far more important than the
performance of the compiler, I doubt that will make a difference, but there
are other important parts of the OpenJDK VM that are written in C++ and whose
performance might be affected, most notably the GCs.

~~~
rjsw
I would like gcc to compile OpenJDK correctly for AArch64 before I start
messing with any performance settings.

~~~
pron
I agree that messing with compiler flags is not the most effective use of time
and risk as far as improving the JDK's performance goes.

------
oso2k
Interesting that `-Ofast` & `-O3` were chosen to bench and not `-Os`. In
some real-world cases, `-Os` can beat `-O3` because the smaller code puts
less pressure on the caches. This has interesting effects when the workload
is highly parallel/threaded rather than sequential. It also avoids the side
effects of trying to generate fast, possibly unrolled loop code.

~~~
nwallin
It's _very_ dependent on the code.

I'm currently writing a path tracer as a side project, and -Ofast is
sickeningly faster than -O2. Like 4-5 times faster. -O3 duplicates all the
loops into two versions: one with AVX instructions chunking 8 iterations at a
time, and a second scalar version that does the final 1-7 iterations. -Ofast
is a little bit faster than -O3 because it generates the approximate rsqrtps
and rcpps instructions.
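For a sense of what that buys (a minimal sketch, not my path tracer's code):
under gcc's -Ofast, a normalization loop like this one may be vectorized with
the approximate rsqrtps plus a refinement step, where -O3 must use the exact
but slower sqrt and divide:

    #include <cmath>

    // 1/sqrt is the hot operation in normalization. -ffast-math (part
    // of -Ofast) lets gcc use the approximate reciprocal square root
    // instruction, trading the last bits of precision for speed.
    void normalize(float* x, float* y, float* z, int n) {
        for (int i = 0; i < n; ++i) {
            float inv = 1.0f / std::sqrt(x[i]*x[i] + y[i]*y[i] + z[i]*z[i]);
            x[i] *= inv;
            y[i] *= inv;
            z[i] *= inv;
        }
    }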

However this is a special case. Most code isn't heavy on crunching massive
quantities of floats, and much of the code that does isn't written in a way
that gives the compiler the freedom to autovectorize your loops. And a
surprisingly large amount of code is still compiled with MSVC which won't
vectorize at all.

In gcc, -O2 is generally fastest for general purpose code, both faster than
-Os and -O3. As an example of why -O2 is faster than -Os, -O2 will optimize
signed integer division by a constant power of two into a handful of bitwise
instructions, which are larger than a single (but much, much slower) division
instruction. (signed integer division can't be replaced with a single bitshift
because negative integers work differently. unsigned integer division can be
replaced by a single bitshift.) People think this is a universal optimization,
but it's a size tradeoff so -Os specifically eschews it.
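In source form, the tradeoff looks roughly like this (a hand sketch of the
transformation, not gcc's literal output):

    // gcc -O2 compiles div8 into a branch-free shift/add sequence much
    // like div8_by_hand; gcc -Os keeps a single, smaller, slower idiv.
    int div8(int x) {
        return x / 8;
    }

    // C++ division truncates toward zero, but an arithmetic right
    // shift rounds toward negative infinity, so negative inputs need
    // a bias of 7 added first. (x >> 31 relies on gcc's
    // implementation-defined arithmetic shift of signed ints.)
    int div8_by_hand(int x) {
        return (x + ((x >> 31) & 7)) >> 3;
    }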

~~~
oso2k
You might be misremembering but strength reduction is part of ‘-O1’ [0].

However, my point is that for a general-purpose language system like OpenJDK,
which favors multithreaded or multiprocess server workloads, the hardware
threads on each core share the I-cache/L1 cache. Cache lines are still 64
bytes. Code for these systems is rarely expected to run in isolation. I tend
to want to be a good neighbor and reduce code size when I can.

[0] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

~~~
nwallin
I'm actually super way too drunk to give you a full rebuttal, so here's a
godbolt compilation thing that proves that I'm right and you're wrong:

https://cpp.godbolt.org/z/-zeMWT

(stop reading now if that's enough)

The -Os compilation is some bookkeeping and an idiv instruction, the -O2
compilation is bookkeeping, two shifts, and an and instruction. I don't know
what they do because I'm drunk sorry not sorry but the one that does the idiv
instruction is both slower and smaller and that's on purpose. (you can tell
the idiv one is smaller by clicking the 11010 button to display offsets. the
starting instructions of both functions have the same offset, but the final
instruction of the -Os compilation is _substantially_ smaller than the -O2
compilation.)

The description of -Os which you linked is telling. It says it's -O2 without
certain optimizations, and then it also says:

> It also enables -finline-functions, causes the compiler to tune for code
> size rather than execution speed, and performs further optimizations
> designed to reduce code size.

Somewhere buried in that tuning for code size rather than execution speed and
further optimizations designed to reduce code size is an optimization that
will replace bitwise magic with division statements. -Os does what it says on
the tin. It makes your code small. It makes your code slow. It does so on
purpose.

People think that -Os is a superset of -O1 and a subset of -O2. It's neither. It
is neither a superset of -O1 nor -O2, nor is it a subset of -O1 or -O2. There
are speed optimizations that -Os adds to -O1 and there are size optimizations
that -Os adds to both -O1 and -O2.

The point, I think, of -Os, is for embedded. If you have a size n PROM, and
your code compiles to size n+1 with -O2, and if you apply -Os and it compiles
to size n, that's a feature. -Os, in my opinion, ought to be uncompromising
towards that goal. For better or for worse.

I'll be sober in the morning and can engage with you better then. Sorry.

~~~
hak8or
Just wanted to say, you were surely drunk when writing that, but it was still
very fun to read.

------
NullPrefix
>funroll loops

Reads like it came right from the Gentoo ricing guide.

------
bcaa7f3a8bbc
-Ofast enables unsafe floating-point optimizations that can give a performance
boost to applications which don't need strict IEEE compliance. For example, FP
operations normally can't be reassociated because that changes rounding, but
-Ofast enables -fassociative-math and allows such optimizations. If your
program just computes a bunch of numbers and doesn't rely on less common
features such as NaN or signed zero, this typically changes only the last few,
mostly irrelevant, significant digits of the result, while greatly speeding up
the program.
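A minimal sketch of the kind of loop this affects (my example, not from the
article): with plain -O2 the additions below must happen in source order,
forming one long dependency chain; -fassociative-math lets gcc split them
into vectorized partial sums, at the price of a differently rounded result:

    #include <cstddef>

    // Strict IEEE semantics force ((s + a[0]) + a[1]) + ... in order.
    // With -fassociative-math (part of -Ofast), the compiler may keep
    // several partial sums in SIMD lanes and combine them at the end.
    float sum(const float* a, std::size_t n) {
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }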

But the general rule-of-thumb is that -Ofast should not be used, unless you
know what the program is doing and how the optimization affects it.

A more meaningful comparison is -O2 vs. -O3 vs. -O3 -march=native
-mtune=broadwell. Or run the OpenJDK test suite with -Ofast and see whether
there are failed tests.

------
this_user
Whether or not the build may already be optimised depends on your source for
the JDK. Looks like the Arch build of the OpenJDK is already using -O3.

~~~
mumblemumble
It's not about what level of optimization is used, so much as what particular
CPU generation the target is being optimized toward.

The public binary distributions have to limit themselves to what x86-64 looked
like when it first came out in 2003, which means they can't take advantage of
any of the new instructions introduced in the past 16 years.

------
ambrop7
> -march=native and -mtune=broadwell tell the compiler to optimize for your
> architecture. One would think given the compiler documentation that march
> implies mtune, but this is apparently not the case.

That sounds to me like a bug which should be reported.

~~~
saagarjha
There's a thread discussing that topic:
https://lemire.me/blog/2018/07/25/it-is-more-complicated-than-i-thought-mtune-march-in-gcc/#comment-321471

------
exabrial
Seems like something GraalVM could do at runtime!

