
Quick Tips for Fast Code on the JVM - wskinner
https://gist.github.com/djspiewak/464c11307cabc80171c90397d4ec34ef
======
barrkel
I'd recommend a process rather than a handful of implementation-specific rules
of thumb that are, IMO, at risk of being applied too generally. I'd hate for
someone to read this article and then go ahead and avoid 'new' wherever
possible because they think that's how you make code fast, or to avoid boxing
and polymorphic call sites, etc.

My process for writing fast code on the JVM:

1) Measure. Set up a benchmark so you know whether you're on the happy path or
not, or whether you've fallen off a performance cliff, for whatever reason.
Make this part of your testing suite with some kind of notification for
regressions.

2) Start small, ideally with a do-nothing loop over the input. This gives you
a baseline; you can't go faster than a do-nothing loop (presuming you can't
skip part of your input, which is an algorithm problem, not specific to JVM
optimization).

3) Incrementally build up your algorithm and pay attention to when it falls
off the performance cliff, using your benchmark from (1). If and when you fall
off the performance cliff, that's when you start bringing in tricks like
avoiding new, ensuring call sites are monomorphic / bimorphic, avoiding
boxing, reducing pointer indirections and other cache friendly code, etc.

Another trick to consider is playing around with inlining, but not in the way
you might think: try pushing infrequently executed code (conditionals) one
level deeper in the call stack (i.e. making the body of an if-block into a
method and calling it). The idea here is to encourage inlining of the method
the calling. Inlining is where the JVM gets to specialize your code to the
specific situation at hand, but the JVM is reluctant to inline big methods
because it has a time budget. So you need to help it.
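A minimal sketch of that trick, with illustrative names (not from the
comment): the rarely taken branch body moves into its own method, so the hot
caller stays small enough to fit the JIT's inlining budget.

```java
// Sketch of cold-path extraction. Names are illustrative.
class ColdPathDemo {
    // Hot method stays small: more likely to fit under the JIT's
    // bytecode-size threshold for inlining into its callers.
    static int process(int value) {
        if (value < 0) {
            return handleRareError(value);  // cold path pushed one level down
        }
        return value * 2;  // hot path
    }

    // The rarely executed body lives in its own method; its bytecode size
    // no longer counts against process() when the JIT considers inlining.
    static int handleRareError(int value) {
        // imagine logging, building an error message, etc.
        return -value;
    }

    public static void main(String[] args) {
        System.out.println(process(21));   // hot path
        System.out.println(process(-5));   // cold path
    }
}
```

Whether this pays off depends on the JIT's heuristics in your situation, so
measure it with the benchmark from (1).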

~~~
kasey_junk
Be careful with 2) because of inlining. Can’t count the number of times I
wrote benchmarks that got completely inlined away.

~~~
vvanders
Also for(Foo x : y) will allocate iterators on some JVM implementations which
will nail you on the GC.

~~~
_old_dude_
It's not just some JVM implementations: to be able to remove the Iterator
creation, the VM requires the call to y.iterator() to be either monomorphic
or bimorphic.

Hence the advice "Keep Things Monomorphic or Bimorphic" !
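A sketch of what that looks like at a call site (the JIT's decisions aren't
directly observable in a snippet like this; it only shows the shape of the
call site):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Illustration of call-site morphism at list.iterator().
class MorphismDemo {
    // If sum() only ever sees ArrayLists, the iterator() call site below is
    // monomorphic and HotSpot can inline it (and often eliminate the
    // Iterator allocation via escape analysis).
    static int sum(List<Integer> list) {
        int total = 0;
        for (int x : list) {  // desugars to list.iterator()/hasNext()/next()
            total += x;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> a = new ArrayList<>(List.of(1, 2, 3));
        List<Integer> b = new LinkedList<>(List.of(4, 5));
        // Two receiver types at the call site: bimorphic, still cheap.
        // A third type would make it megamorphic.
        System.out.println(sum(a) + sum(b));
    }
}
```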

~~~
jnordwick
Even if it inlines the iterator creation and calls, then hoists and removes
the allocation, you're likely to see suboptimal instructions.

On hot paths, I often find it better to just avoid iterators in the first
place than trying to clean up the mess later.
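For example (a sketch, assuming random-access lists like ArrayList where
get(i) is cheap), the indexed form sidesteps the Iterator entirely:

```java
import java.util.List;

class NoIteratorDemo {
    // for-each form: allocates an Iterator (unless escape analysis removes
    // it) and calls hasNext()/next() per element.
    static int sumForEach(List<Integer> xs) {
        int total = 0;
        for (int x : xs) total += x;
        return total;
    }

    // Indexed form: no Iterator object at all. On an ArrayList, get(i) is
    // a bounds check plus an array load. (On a LinkedList this would be
    // O(n) per call, so this only helps for random-access lists.)
    static int sumIndexed(List<Integer> xs) {
        int total = 0;
        for (int i = 0, n = xs.size(); i < n; i++) {
            total += xs.get(i);
        }
        return total;
    }
}
```

Note the boxing of the Integer elements still applies either way; a plain
int[] avoids both problems on a truly hot path.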

~~~
_old_dude_
yes, the Iterator protocol of Java is sub-optimal:

- hasNext() has to be called twice, so you hope the VM will kick in common
subexpression elimination, which I have found works well only if you do not
do any side effects in hasNext().

- most Iterators are fail-fast, so the code has to check that the modCount
of the collection has not changed.

but from my own experience, you usually do not have to optimize that far;
removing a 'synchronized' block is far more common.

------
tenpoundhammer
One point I’d like to add, which a lot of people disagree with, is that we
should all focus on writing simple and easy to understand code. Engineers
shouldn’t worry too much about optimizing code unless there is a good reason
to do so.

I’d say about 1% of my code has ever required me to return and optimize it.
(I usually write APIs that are utilized by end-user apps.) Most of the time
my code is more than fast enough, and user concerns are more around bugs
related to requirements.

In the few cases where I have had to optimize, it's usually been a super
obvious problem. For instance, there was a Scala API that looped over a list
of lists and called Scala's built-in find function. That function is linear
and incredibly slow. Just converting the first list into a hashmap took a
call that took 30 seconds down to under 3 seconds: still slow, but orders of
magnitude faster. It’s usually been silly stuff like that.
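That kind of fix, sketched in Java rather than Scala (names are illustrative,
not the actual API): replace a repeated linear scan with a hash lookup.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: linear membership test vs. hash-based membership test.
class LookupDemo {
    // O(n * m): a linear scan of `valid` for every element of `items`.
    static long countValidLinear(List<String> items, List<String> valid) {
        long count = 0;
        for (String item : items) {
            if (valid.contains(item)) count++;  // linear search each time
        }
        return count;
    }

    // O(n + m): build the set once; each lookup is then O(1) on average.
    static long countValidHashed(List<String> items, List<String> valid) {
        Set<String> validSet = new HashSet<>(valid);
        long count = 0;
        for (String item : items) {
            if (validSet.contains(item)) count++;
        }
        return count;
    }
}
```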

All of this of course depends on the domain you are working in; some
applications require you to squeeze out every millisecond, in which case most
of this won’t apply to you.

~~~
userbinator
_we should all focus on writing simple and easy to understand code_

That _is_ an optimisation. I've found that the simplest solution is usually
the most efficient.

~~~
virgilp
This is a good-sounding statement, so it "feels true". But, if you stop to
really think about it... It's probably false in most circumstances [1]

[1] If by "efficient" you mean time-efficient. If you pick your own efficiency
metric, then yes, sure, I guess so...

------
kodablah
If you're getting to this level of optimization, you could just consider going
native. Making a shared lib in Rust that uses JNI and does everything you
could do in Java is really really easy.

~~~
twic
Careful, though. It costs about a millisecond to do a return trip across the
JNI boundary, so don't do that anywhere near an inner loop. Accessing Java
objects from native code is also a pain, with potential performance gotchas
(pinning, mostly).

An approach I've seen work is to have dedicated Java and native threads,
communicating via shared memory. We have some funky in-house code for this,
but I think Aeron [1] should be a good open source option.

[1] [https://github.com/real-logic/aeron](https://github.com/real-logic/aeron)

~~~
wging
> Accessing Java objects from native code is also a pain, with potential
> performance gotchas (pinning, mostly).

What do you mean by pinning? If you mean pinning threads to cores, how does
that relate to JNI?

~~~
bennofs
Probably pinning the object in memory, so that you can keep a stable reference
to it without the danger of the GC moving it away under your pointer.

------
jcdavis
The example of bad allocations

    
    
      def first[A](left: A, right: A): A = left
    
      first(12, 42)   // => 12
    

actually allocates nothing, because both numbers are in the Integer box cache
range (-128 to 127 inclusive by default, but configurable to be greater). The
general argument is good though, so maybe first(128, 129) is a better example
:)
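The cache behavior is easy to observe directly (assuming default JVM
settings, i.e. no -XX:AutoBoxCacheMax tuning):

```java
// Integer.valueOf (which autoboxing uses) caches -128..127 by default.
class BoxCacheDemo {
    public static void main(String[] args) {
        // Within the default cache range: same Integer instance both times.
        Integer a = Integer.valueOf(12);
        Integer b = Integer.valueOf(12);
        System.out.println(a == b);      // true under default settings

        // Outside the range: valueOf allocates a fresh Integer each time.
        Integer c = Integer.valueOf(128);
        Integer d = Integer.valueOf(128);
        System.out.println(c == d);      // false: distinct allocations
    }
}
```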

~~~
ychen306
Seems odd that the compiler can't inline this pattern.

~~~
peoplewindow
It probably can. Graal certainly could.

------
time4tea
Hmmmm. Fast code for my particular application, in my area of specialization.
But not actually very helpful for the 99.9% of deployed applications, which
don't really suffer from the type of problem here and where far higher-level
optimisation would have greater benefit...

E.g. prefer final fields; don't share between threads unless you need to;
there is little point micro-optimising a poor algorithm; don't do the same
work repeatedly...

Having said that, in a specialised context, very interesting....

~~~
jnordwick
Can't have a performance oriented article here (or on any dev site) without a
chorus on how irrelevant/useless/harmful the information is. Or half the
comments being about profiling.

Also, final doesn't help HotSpot nearly as much as the pointers the article
gives; it doesn't help much, if at all.

------
aglionby
Usual disclaimer of profile your code to identify areas for improvement and
try to quantify changes as far as possible -- it's no good doing things just
because you _think_ it'll make the program faster.

Specifically on the Java side of things, while not quite as low-level as what
the post is talking about, I've seen situations where proper use of logging
libraries has given perf increases. It's as simple as log.info("Foo: {}",
"bar") instead of log.info(String.format("Foo: %s", "bar")), reducing the
number of string operations dependent on your chosen logging level. Simple,
but I've seen it happen.

Shoutout to Brendan Gregg's blog [0], which has some interesting stuff on
performance, some of which is Java-specific.

[0] [http://www.brendangregg.com/](http://www.brendangregg.com/)

~~~
_old_dude_
yes, String.valueOf() is dog slow.

We recently moved all logging stuff to use a constant lambda as the first
parameter, log.info(s -> "Foo: " + s, "bar"), so if the logger is not enabled
the constants are discarded by the VM, and otherwise the string concatenation
is faster than any string interpolation.
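A minimal sketch of that pattern (the Log class here is hypothetical, not the
poster's actual library; java.util.logging offers a similar idea with
Logger.info(Supplier&lt;String&gt;)):

```java
import java.util.function.Function;

// Sketch: lambda-first logging. The lambda (s -> "Foo: " + s) captures
// nothing, so the compiler emits it as a constant: the call site itself
// allocates nothing, and the message is only built when enabled.
class LazyLog {
    private final boolean enabled;
    LazyLog(boolean enabled) { this.enabled = enabled; }

    <T> String info(Function<T, String> msg, T arg) {
        if (!enabled) return null;   // message never constructed
        String line = msg.apply(arg);
        System.out.println(line);
        return line;
    }
}
```

Usage: `log.info(s -> "Foo: " + s, "bar")`. When the level is off, neither
the concatenation nor the lambda allocation happens.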

~~~
specialist
Huh. I'll have to try that. Thanks.

My solution was to flip the problem, making a separate logger instance for
each log level. So something like

    
    
      Logger warn = Logger.factory(Level.WARN); 
      warn.log( "message" );
    

If warn is disabled, factory returns a NullObject (do nothing) logger.

(My actual implementation had one level of indirection, so log levels could
be updated at runtime.)

Then I got nutty optimizing printf, doing JIT class generation per template...
But that's another story.

TL;DR: Logging in Java drives me nuts, so I made my own.

~~~
_old_dude_
Did you take a look at java.lang.invoke.StringConcatFactory? It does more or
less what you want, using method handles instead of bytecode generation.

~~~
specialist
Thanks for the tip. Will check it out. I'll be happy if it moots my hackery.

------
a_c
> It occurred to me that, really, I had more or less picked up all of it by
> word of mouth and experience, and there just aren't any good reference
> sources on the topic. So… here's my word of mouth.

Can someone provide some reference on how JVM optimisation works, or maybe JVM
generally?

~~~
hyperpape
The Java Performance book by Oaks has some sections on internals. Cliff Click
has a talk called “A JVM Does What?” that has some interesting material (very
compressed though).

There’s also some old Oracle or Sun white papers on Hotspot, but I’ve never
actually read them.

One nitpick to keep in mind: the major JVMs do many things in similar ways but
have important differences. In fact, there is a spec that’s much less
prescriptive, and JITs are entirely an implementation detail. Often when you
read “the JVM does X” it means that the author is talking about Hotspot.

------
quotemstr
At a certain point, it's simpler, more robust, and more predictable to write
some JNI and put the hot path in C++ or Rust or something.

------
nayuki
About the monomorphic and bimorphic virtual methods, Aleksey Shipilev has some
great articles on the topic, including low-level benchmarks:
[https://shipilev.net/blog/2015/black-magic-method-dispatch/](https://shipilev.net/blog/2015/black-magic-method-dispatch/)

------
cdman
The link from the submission seems to 404. This link seems to work:
[https://gist.github.com/djspiewak/464c11307cabc80171c90397d4...](https://gist.github.com/djspiewak/464c11307cabc80171c90397d4ec34ef)

------
gravypod
I'd actually like to add "Don't touch synchronized functions" to the list. If
need be, try to use ThreadLocals instead.
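A sketch of that substitution (hypothetical names): instead of synchronizing
on one shared scratch object, give each thread its own copy via ThreadLocal.

```java
// Sketch: replacing a synchronized shared buffer with a per-thread one.
class ScratchBuffers {
    // Each thread lazily gets its own StringBuilder; no lock needed.
    private static final ThreadLocal<StringBuilder> BUF =
            ThreadLocal.withInitial(() -> new StringBuilder(64));

    static String render(String name, int value) {
        StringBuilder sb = BUF.get();
        sb.setLength(0);                 // reuse without reallocating
        sb.append(name).append('=').append(value);
        return sb.toString();
    }
}
```

(In thread pools, remember to remove() the value when done, or the buffer
lives as long as the worker thread.)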

~~~
imhoguy
Foremost, one should look at what java.util.concurrent [0] has to offer
before reinventing concurrency patterns from scratch.

[0]
[https://docs.oracle.com/javase/9/docs/api/java/util/concurre...](https://docs.oracle.com/javase/9/docs/api/java/util/concurrent/package-summary.html)
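For example (a sketch), a synchronized counter is often better expressed with
java.util.concurrent's LongAdder:

```java
import java.util.concurrent.atomic.LongAdder;

// Sketch: a contended counter without a synchronized block. LongAdder
// stripes updates across internal cells, so writers rarely contend.
class HitCounter {
    private final LongAdder hits = new LongAdder();

    void record() { hits.increment(); }   // lock-free update
    long total()  { return hits.sum(); }  // sums the stripes on read
}
```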

