
Java and SIMD - mmastrac
http://prestodb.rocks/code/simd/
======
faragon
Auto-vectorization is hard. Very hard. Even in C/C++, the compiler (e.g.
GCC or MSVC) is unable to vectorize loops unless you help it _a lot_, and
in most cases you end up writing SIMD "intrinsics" (e.g. [1]) in order to
get optimal results. In my experience, although auto-vectorization is
better than it was 10 years ago, it is still very far from optimizing code
properly without lots of tuning (e.g. try building any graphics processing
library and look at the vectorization warnings, i.e. the reasons why
vectorization was not possible).
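
The kind of loop under discussion can be sketched in Java as well, where HotSpot's superword pass faces the same limits. This is an illustrative sketch, not code from the thread: a straight-line array kernel with no branches or cross-iteration dependencies is the shape auto-vectorizers recognize, and adding a condition or an indirect access is typically enough to make vectorization fail silently.

```java
// A classic vectorizable kernel: y[i] = a * x[i] + y[i].
// Straight-line, branch-free, unit-stride — the pattern that
// auto-vectorizers (GCC, MSVC, HotSpot's superword pass) can match.
public class Saxpy {
    static void saxpy(float a, float[] x, float[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] = a * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f, 4f};
        float[] y = {1f, 1f, 1f, 1f};
        saxpy(2f, x, y);
        System.out.println(java.util.Arrays.toString(y)); // [3.0, 5.0, 7.0, 9.0]
    }
}
```

Whether the JIT actually emits SIMD instructions for this depends on the JVM version and flags; the point is that only this narrow pattern is a candidate at all.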

Ten years ago, in the SSE2/Altivec days, I thought it would only be a matter
of time before much smarter compilers made graphics/pixel processing code
much faster, but no. The JIT case is no better, because it is similar: even
with runtime information available, the auto-vectorization phase is
equivalent. I would love to see smarter compilers that actually understand
the code, many steps beyond today's hardwired pattern-matching optimizations.

[1]
[https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)

~~~
YSFEJ4SWJUVU6
A few years back I decided to entertain myself by testing how smart today's
smart compilers really are when it comes to auto-vectorization.

I had this small and simple C application I'd written years earlier that tried
to find inputs whose corresponding MD5 hashes started with certain bytes. It
was a good base because it was obviously vectorizable.

At first, enabling the vectorizer didn't result in any changes to the
binaries. I then correctly guessed that (potentially) calling the printf
function inside a hot loop might confuse it. After slight refactoring I got
the compiler to output SSE instructions, which resulted in a nice 2.5×
testing speedup over the original (incidentally, even without
auto-vectorization the refactored code resulted in faster binaries, which is
not all that surprising).

Anyway, I also rewrote the application to use intrinsics. I hadn't used them
before myself, but it didn't take much time at all to familiarize myself
with them and write the code, and it was indeed quite a bit faster than what
the compiler was capable of: the resulting binary was 14× faster than the
original, or over 5× faster than what the compiler could achieve without
explicit hints from intrinsics.

Edit: added back a few words I had accidentally removed when rearranging
sentences, causing a confusing incomplete sentence. Corrected comparing
figures like for like.

~~~
evincarofautumn
I’ve had similar experiences. If I want vectorised code, I just write it
myself using intrinsics or assembly. It’s fine if the compiler can
autovectorise something I didn’t feel like doing by hand, but I’m not going to
rely on heuristic voodoo to get the machine code I want for a hot loop. I
wouldn’t mind a slightly nicer wrapper API for the intrinsics, though,
something like glsl-sse2[1].

And that’s more or less what I’m planning to do in a programming language I’m
working on, actually—if you use a SIMD-compatible array type, the compiler
will try to keep it in a vector register, and some operations will be faster
(e.g., “+” on two Float32^4 values will compile to an addps) but it’s up to
the programmer to use the instructions they actually want, or _tell_ the
compiler with a macro “please vectorise this loop or warn me about why you
can’t”.

[1] [https://github.com/LiraNuna/glsl-sse2](https://github.com/LiraNuna/glsl-sse2)
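
The wrapper-type idea described above can be sketched in Java (the thread's language, though the comment targets a different one). This is a hypothetical illustration, not a real API: a small `Float32x4` value whose componentwise operations a sufficiently smart compiler could, in principle, lower to a single `addps`; here they are just four scalar adds, and only the API shape matters.

```java
// Hypothetical SIMD-friendly value type: four floats, componentwise ops.
// A compiler aware of this type could keep it in one XMM register and
// compile add() to a single addps; this reference version is scalar.
public final class Float32x4 {
    final float a, b, c, d;

    Float32x4(float a, float b, float c, float d) {
        this.a = a; this.b = b; this.c = c; this.d = d;
    }

    // Componentwise add — the operation that would map to addps
    Float32x4 add(Float32x4 o) {
        return new Float32x4(a + o.a, b + o.b, c + o.c, d + o.d);
    }

    @Override
    public String toString() {
        return "(" + a + ", " + b + ", " + c + ", " + d + ")";
    }

    public static void main(String[] args) {
        Float32x4 u = new Float32x4(1, 2, 3, 4);
        Float32x4 v = new Float32x4(10, 20, 30, 40);
        System.out.println(u.add(v)); // (11.0, 22.0, 33.0, 44.0)
    }
}
```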

~~~
t0rakka
[https://github.com/t0rakka/mango](https://github.com/t0rakka/mango)

------
sargun
Be very careful about this in a shared environment. AVX512 slows down the CPU
cores because of thermal and voltage throttling; the instructions "take more
work". Intel CPUs take 1 MILLIsecond to return to normal speed.

If you're doing any kind of rapid context switching, or running multiple
workloads, then depending on how your scheduler is set up, the OTHER
workloads will show up as using a larger percentage of CPU time per work
item. It's non-intuitive and difficult to debug.

~~~
stu2010
Now that major cloud vendors are selling VMs with guaranteed AVX512 support,
how are they going to deal with the "noisy neighbor" problem?

~~~
sargun
For one, this is done on a core-by-core level. I believe Intel has the
capability to monitor for this behaviour and do some dynamic throttling in
certain models, which cloud vendors may have access to. I think they
introduced some of this in Broadwell-E, where you can set the AVX offset per
core, and the on-die PCU will throttle that core below the AVX base frequency,
allowing the rest of the cores to remain at speed. These are typically
controlled by BIOS, or MSRs.

We're currently trying to figure out how to deal with this. Some ideas come
from Google's CPI2 paper: trying to dynamically schedule workloads for
diversity if we think they interfere. Other thoughts have been simpler, like
core pinning (knapsacking for latency, or throughput).

This is hard.

Disclaimer: These views are my own, not my employer's or their vendors'.

------
ldargin
I asked James Gosling about SIMD (MMX and SSE) at the JavaOne conference back
in 2000. He said the answer is method calls, and that the compiler can use
whatever instructions it likes. (I submitted the question online, and he
answered it on stage; I later saw the recording. I didn't attend, nor meet
him in person. He seemed a bit annoyed that the question was too simple.)

------
aardvark179
There's room for two approaches in Java, really. The JIT can be smart and use
SIMD instructions where it can see they are applicable, but there's also room
for a small DSL-like API that allows library authors and other very
experienced users to express a suitable algorithm and have it easily
translated into the SIMD instructions available at runtime. Anybody interested
in the latter should take a look at the work being done under Project Panama.
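
The "small DSL-like API" idea can be sketched as follows. All names here are invented for illustration and this is not the actual Project Panama API: the caller expresses the algorithm as a lane-wise operation, and the library is then free to translate it to whatever SIMD width the runtime supports.

```java
// Sketch of a DSL-like vector API. A real implementation would recognize
// known operations (add, mul, ...) and emit SIMD instructions of the
// runtime's preferred width; this reference version applies the operation
// lane by lane in plain Java.
public class VectorDsl {
    interface FloatBinaryOp { float apply(float a, float b); }

    // Nothing in this signature fixes a vector width — the library could
    // process 4, 8, or 16 lanes at a time behind the scenes.
    static float[] map(float[] x, float[] y, FloatBinaryOp op) {
        float[] out = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = op.apply(x[i], y[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] r = map(new float[]{1, 2, 3}, new float[]{4, 5, 6}, (a, b) -> a + b);
        System.out.println(java.util.Arrays.toString(r)); // [5.0, 7.0, 9.0]
    }
}
```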

~~~
_old_dude_
This presentation was posted recently on the general OpenJDK mailing list:

    http://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorization_in_HotSpot_JVM.pdf

------
gigatexal
This has got to be one of the best blog posts I've read in a while. It's
clean, concise, has a clear use-case and is well benchmarked. Kudos to the
author.

~~~
pnowojski
Thanks, I'm glad that you liked it

------
georgewfraser
My understanding is that once the Truffle/Graal language implementation
framework lands with Java 9, you will be able to substitute method calls with
platform-specific instructions. See page 62 of
[http://lafo.ssw.uni-linz.ac.at/papers/2015_CGO_Graal.pdf](http://lafo.ssw.uni-linz.ac.at/papers/2015_CGO_Graal.pdf)

------
memracom
This is not a Java question. If a C compiler can do it, then a Java Virtual
Machine can do it, provided that the C code of the JVM makes it so. There are
many implementations of the JVM, so you really should be asking which JVMs do
this. Oracle is not the only game in town, not to mention at least two open-
source JVMs where you could add whatever capabilities you need.

On the other hand, if you are asking whether or not some magic compiler
optimization will take your crappy code and make it run fast on SIMD, not only
is that the wrong question but you have already lost the race.

The winners of the race asked the question, "How can we add a capability to
our Java application to run computations fast using SIMD?" and they found a
number of ways to do this without relying on magic. It might be a bit of work
to code because you have to do it with intent, like the old-timers who placed
code and data carefully on their drum-memory computers to make the code run
much faster. You can code with intent in any language on any platform, but
because your intent is stronger than the aesthetic perfection of the platform,
things can look a little grungy to an outsider. Comment your code and document
it well.

And ask yourself whether offloading the computation to a GPU might not be
cheaper and even faster than SIMD.

~~~
pnowojski
Often it is not that simple to change the JVM, especially if your application
is finely tuned for a specific GC. In such a case, is it worth spending
hundreds of work hours on the migration (and on testing it afterwards)?
Another aspect is technical support, or other legal/contractual bindings.

Besides, I wasn't asking for "magic compiler optimizations". I would prefer to
use intrinsics directly. Is there a way to do that in Java?

~~~
pjmlp
Kind of.

[https://www.slideshare.net/RednaxelaFX/green-teajug-hotspotintrinsics02232013](https://www.slideshare.net/RednaxelaFX/green-teajug-hotspotintrinsics02232013)

Here is also a presentation about them

[https://www.youtube.com/watch?v=7J0RELNadks](https://www.youtube.com/watch?v=7J0RELNadks)

Here is a list with some of them.

[https://gist.github.com/apangin/7a9b7062a4bd0cd41fcc](https://gist.github.com/apangin/7a9b7062a4bd0cd41fcc)

The Panama JVM has more related to SIMD.

Of course all of this is JVM specific and each one has its own set.
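
A concrete example of the HotSpot intrinsics these links describe: certain JDK library methods look like plain Java calls, but HotSpot replaces them with single machine instructions on CPUs that have them. `Long.bitCount` and `Integer.numberOfTrailingZeros` are well-known cases (mapped to POPCNT and TZCNT/BSF where available); the mapping itself is JVM-specific, as noted above.

```java
// These are ordinary JDK methods, callable from any Java code. On HotSpot
// they are recognized as intrinsics and compiled down to single
// instructions (e.g. POPCNT) instead of a Java loop — no JNI, no special
// syntax, but also no guarantee on other JVMs.
public class IntrinsicDemo {
    static int popcount(long v) {
        return Long.bitCount(v); // intrinsified to POPCNT where available
    }

    public static void main(String[] args) {
        System.out.println(popcount(0xFFL));                 // 8
        System.out.println(Integer.numberOfTrailingZeros(64)); // 6
    }
}
```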

------
drej
I've been wondering what the best way to implement SIMD in a JIT-less
language is. While I'm a big proponent of static compilation (coding mostly
in Go), this runtime inflexibility is a bit painful. Sure, you could compile
multiple versions, check the CPU, and dispatch to the correct methods at
runtime... but that's not quite right.

Another issue is how you code it up - both in terms of delivering multiple
versions depending on CPU support and in terms of shielding the developer from
low-level assembly. I'm all for high level APIs (think `sum(a vec, b vec)`)
that get compiled down to whatever is supported, but I haven't seen many good
examples of this.
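
The multi-version dispatch being described can be sketched like this (shown in Java for consistency with the thread, though the comment is about Go; the pattern is the same). The CPU-feature probe is a stand-in — real code would read CPUID or a platform API — and the "wide" kernel here just delegates to the scalar one.

```java
// Select one implementation of sum(a, b) at startup based on a CPU probe,
// so the hot path pays no per-call feature check.
public class SumDispatch {
    interface SumKernel { void sum(float[] a, float[] b, float[] out); }

    // Hypothetical probe — a real build would check CPUID / AVX support.
    static boolean cpuSupportsWideSimd() { return false; }

    // Resolved once; every caller goes through this reference afterwards.
    static final SumKernel KERNEL =
        cpuSupportsWideSimd() ? SumDispatch::sumWide : SumDispatch::sumScalar;

    static void sumScalar(float[] a, float[] b, float[] out) {
        for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
    }

    // Stand-in for the hand-written SIMD version of the same kernel.
    static void sumWide(float[] a, float[] b, float[] out) {
        sumScalar(a, b, out);
    }

    public static void main(String[] args) {
        float[] a = {1, 2}, b = {3, 4}, out = new float[2];
        KERNEL.sum(a, b, out);
        System.out.println(out[0] + " " + out[1]); // 4.0 6.0
    }
}
```

The high-level `sum(a vec, b vec)` API the comment asks for is exactly the `SumKernel` interface here: callers never see which version they got.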

------
alkonaut
Anyone knows how RyuJIT compares to Java8 and Java9 in a comparison like this?

~~~
jackmott
RyuJIT does zero auto-vectorizing, but .NET has a nifty SIMD library that
lets you do it by hand. Big plus: you can write once and run on SSE or AVX or
AVX512. Big downside: only a tiny fraction of SIMD instructions are
supported, though they are working on improving that.

------
marmaduke
Why not use some OpenCL in some form either raw or Aparapi?

~~~
stusmall
Crossing the JNI barrier is expensive performance-wise. Unless it's a
longer-running, CPU-heavy calculation, it's usually best to stay in pure
Java. There are some cases where going out to another high-performance
framework will speed things up, but the bar is a bit higher because of the
JNI overhead.

