
TornadoVM: Running Java on GPUs and FPGAs - pjmlp
https://www.infoq.com/news/2020/03/TornadoVM-QCon-London/
======
Raphael_Amiard
> Due to limitations in the underlying programming model, TornadoVM doesn’t
> support objects (except for trivial cases), recursion, dynamic memory
> allocation or exceptions.

So basically Java syntax for some kind of restricted C/CUDA dialect. How can
you even say you're running Java if you don't have objects or dynamic
allocation? Every time the promise of a general-purpose programming language
running on GPUs is made, this is what actually gets delivered: marketing
fluff, not compilers actually getting smarter in any fashion.

And when you think about how a GPU works, it completely makes sense. A high
level language for a GPU will _not_ look like Java
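For concreteness, the quoted restrictions (no objects, recursion, dynamic allocation, or exceptions) leave you with kernels shaped roughly like the following plain-Java sketch. This is illustrative only, not TornadoVM's actual API; the class and method names are made up:

```java
import java.util.Arrays;

// Illustrative sketch of the restricted "kernel" shape such models allow:
// a static method over primitive arrays, flat loops, no object allocation,
// no recursion, no exceptions.
public class SaxpyKernel {

    // y[i] = a * x[i] + y[i] -- the classic SAXPY kernel
    public static void saxpy(float a, float[] x, float[] y) {
        for (int i = 0; i < y.length; i++) {
            y[i] = a * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f};
        float[] y = {10f, 20f, 30f};
        saxpy(2f, x, y);
        System.out.println(Arrays.toString(y)); // [12.0, 24.0, 36.0]
    }
}
```

Anything richer, e.g. a `List<Float>` or a method that throws, falls outside what the underlying OpenCL-style model can express.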

~~~
gauravphoenix
>And when you think about how a GPU works, it completely makes sense. A high
level language for a GPU will not look like Java

Can someone explain this further? I am genuinely interested in learning what
makes Java etc not suitable for GPUs.

~~~
imtringued
CPUs are like taxis and GPUs are like buses. You have 64 people sitting in
your bus all taking the same path. As soon as two people need to arrive at two
different destinations (branching) this whole system breaks down. Eventually
the bus must drive to all destinations. Then there is the time wasted stopping
to load passengers. To efficiently utilize the bus you must load 64 people at
the same time. If you were to let people hail the bus like a taxi it would
constantly have to stop just to load a single passenger while the other 63
people in the bus have to wait for that single person (random memory access).

A pointer can point to an object of any arbitrary subtype in any memory
location. Suddenly you have both branching and random memory access. If your
program is branching heavily then you may end up with 1 piece of data being
computed by a core that can handle 64 pieces of data at once and random memory
access kills you because GPUs don't have huge caches like CPUs. The way GPUs
deal with high latency is that they simply batch a large amount of work and
switch to a different "thread" during memory access. If all your threads are
busy loading from memory then you won't see any speedups.

Finally, GPUs do not actually have 4096 full cores. The RX Vega 64 only has 64
compute units (each 64-wide SIMD) and those only run at around 1.3GHz. A Ryzen
with 8 cores running at 4GHz with out-of-order execution can trivially
outperform a GPU that is only using 1.6% (1/64) of its theoretical
performance.
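A minimal plain-Java sketch of the divergence point above: when lanes of a SIMD unit branch differently, the hardware effectively executes both paths under a mask, so the predicated (compute-both-then-select) form below is roughly the cost you pay either way. The functions here are made up for illustration:

```java
// Sketch of branch divergence: on SIMD hardware, lanes that disagree on a
// branch force the unit to execute BOTH sides, masking off inactive lanes.
public class Divergence {

    // Branchy per-element code: two different paths per element.
    public static int branchy(int v) {
        if ((v & 1) == 0) {
            return v / 2;      // path A
        }
        return 3 * v + 1;      // path B
    }

    // Predicated form: both paths are computed for every element and one
    // result is selected -- roughly what a GPU does under divergence.
    public static int predicated(int v) {
        int evenResult = v / 2;
        int oddResult = 3 * v + 1;
        return (v & 1) == 0 ? evenResult : oddResult; // select
    }

    public static void main(String[] args) {
        for (int v = 0; v < 1000; v++) {
            if (branchy(v) != predicated(v)) {
                throw new AssertionError("mismatch at " + v);
            }
        }
        System.out.println("both forms agree");
    }
}
```

Both forms give the same answer; the point is that on a 64-wide unit, a single odd element among 63 even ones still costs the time of both paths.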

~~~
ggm
Analogies break down or... is this actually within the analogy? Some buses
permit small variances from the set route to deal with specific issues which
can be handled en route.

The simplest is "drop me off anywhere on the route, but only pick up on demand
at scheduled stops".

The more complex is "take one diversion every now and then".

Some small buses actually self-optimize to a locally efficient route: post
buses in rural locations.

It is surely possible a GPU can segment into small disjoint sets of "like"
work and so continue to offer parallelism, but at reduced intensity? Or accept
point inefficiencies in the computation, toward the overall goal of "all
alike".

~~~
dsun180
Yes. GPUs also do not have tires. Who cares?

------
monocasa
Just because you can doesn't mean you should. Even OpenCL for FPGAs was too
much of an impedance mismatch to be useful for the vast majority of cases.

And getting Java to even run on these is going to remove so many features that
there's no real reason to pick it anyway. It's sorta like JavaCard, which
removes keywords like "new", "long" and "throw". At that point it's a stretch
to even call it Java.

~~~
pjmlp
Being stuck with C, while CUDA allows for higher level programming models, is
one of the reasons why most researchers just flocked to CUDA.

When Khronos realised they should have had their own PTX (SPIR) and higher
level programming models, the race was already lost.

~~~
h91wka
No. Language was not a problem at all. For people who are accustomed to
writing code for modern CPUs, switching languages is no big deal: they all
perform as well as the compiler and language runtime are designed to. That's
because CPUs are forgiving to the programmer, and they employ a lot of
techniques to run _arbitrary_ code well. GPUs were super "dumb" at the time,
but they had a truckload of ALUs, and people wanted to maximize usage of those
ALUs at all costs, because running a power-hungry cluster on a task for one
month versus two months actually makes a lot of difference. So you choose the
tool that gives _the most control over tricky hardware_, and that is going to
be a vendor-provided tool. If CUDA had been based on Forth or Unlambda instead
of C++, people would've used it anyway.

OpenCL advertised a "write once, run everywhere" approach, which was
completely bonkers when it came to HPC. Optimizations that one applies on
different hardware make code look completely different. It has its niche, but
not in HPC. I am no longer in this industry, so maybe things have changed. But
at the time, some teams picked up OpenCL and quickly dropped it, as it didn't
give enough bang for the buck, and the tooling didn't compare with the rather
polished CUDA stuff.

~~~
llukas
Initially OpenCL and CUDA were competitors. You _had_ the choice between C
(OpenCL) and limited support for C++ (CUDA). C++, especially in an HPC
context, is way easier (templates!), as a properly organized C++ codebase
doesn't have the problem that:

"Optimizations that one applies on different hardware make code look
completely different."

Unless you've got a C program, where this statement is true.

~~~
h91wka
> as properly organized C++ codebase doesn't have problem

I'm still having a hard time believing that kernels optimized for a SIMD
architecture will be useful on a CPU, and vice versa. And OpenCL people
advertised this, if memory serves me well.

~~~
llukas
If you've got a SIMD version, it means you've already had to solve most of the
problems that you'd face when going for an AVX implementation.

------
bgorman
"Due to limitations in the underlying programming model, TornadoVM doesn’t
support objects (except for trivial cases), recursion, dynamic memory
allocation or exceptions."

Given that objects are not supported, I am having a hard time seeing any valid
use case for this over C, C++, Rust, etc.

~~~
dukoid
Simplified integration with some host environment? But now I am wondering what
it would take to run "regular" code on a GPU -- even if it's super inefficient
-- just to avoid the need of a "host" environment / CPU?

~~~
h91wka
> just to avoid the need of a "host" environment / CPU?

Why would you need to avoid that?

~~~
o-__-o
Malware?

------
Traster
It's worth noting that this isn't really a solution for FPGA. It just takes
Java, generates OpenCL and then pushes it through the existing Intel OpenCL
compiler for FPGA. The problem is that to get even close to Ok performance on
FPGA with the current OpenCL compiler you need to spend months hand optimizing
your code (and often the compiler too), but with this tool you can't even
touch the OpenCL code.

Also, just as a separate point, can someone point me to a design that achieves
5TFlops on Stratix 10, let alone the claimed 10TFlops? Because my
understanding is that to get that performance you would need to run the fabric
at 1GHz and use 100% of the DSPs, which is frankly hilariously impossible.

~~~
jfumero
Totally agree. OpenCL code is portable, but performance is not. That's why
TornadoVM specializes the OpenCL code depending on the target device. For
FPGAs we do a lot more optimizations compared to GPUs, such as tuning the
thread-scheduling, better loop unrolling and loop flattening, use of local
memory, etc. All of these optimizations are automatically performed in the
compiler-IR (GraalIR) before generating the actual OpenCL C code.

With those compiler specializations, we aim to close the performance gap
between hand-tuned code and generated code.
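As a rough plain-Java illustration of one of the transformations mentioned above, loop flattening collapses a nested iteration space into a single loop, which on an FPGA lets the tools build one deep pipeline instead of repeatedly draining and refilling the inner loop. This is a sketch of the idea only, not TornadoVM's actual IR pass:

```java
public class LoopFlattening {

    // Nested loops over a rows x cols iteration space (row-major array).
    public static long sumNested(int[] m, int rows, int cols) {
        long s = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                s += m[i * cols + j];
            }
        }
        return s;
    }

    // The same computation with the two loops flattened into one.
    // If the original indices are needed: i = k / cols, j = k % cols.
    public static long sumFlat(int[] m, int rows, int cols) {
        long s = 0;
        for (int k = 0; k < rows * cols; k++) {
            s += m[k];
        }
        return s;
    }

    public static void main(String[] args) {
        int[] m = {1, 2, 3, 4, 5, 6};
        System.out.println(sumNested(m, 2, 3)); // 21
        System.out.println(sumFlat(m, 2, 3));   // 21
    }
}
```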

------
dukoid
Where exactly is this in the range from running an actual JVM on a GPU to
scheduling some tasks using a restricted Java subset limited to specialized
libraries on a GPU? The title seems to suggest the former but I'd be quite
surprised by anything other than the latter?

~~~
bilbo0s
It's closer to the latter, but obviously it isn't using specialized hardware
libraries. It's a compiler. It just compiles Java down to VHDL for instance.
The Java doesn't run on the FPGA at all.

Obviously he's still hit an extremely useful sweet spot right there. I don't
know many non-masochists who would choose to use VHDL over Java, that's for
sure. Who out there would really rather work out the intricacies of VHDL
instead of just compiling a Java class and calling it a day?

Ditto for the GPU. You could learn Vulkan, but if what you're doing is just
GPGPU type stuff, why?

Just throw it in a Java class and call it done.

~~~
krapht
As someone who occasionally has to program fpgas, the syntax of vhdl is not
why performance fpga programming is difficult.

~~~
h91wka
Same goes for GPGPU

~~~
imtringued
The restrictions in language features make the limitations of GPUs visible to
high level language users. The reason why GPGPU is hard is that most tasks
simply don't satisfy these constraints. You don't need to be a genius to run a
multiplication of two arrays on a GPU.

~~~
h91wka
> You don't need to be a genius to run a multiplication of two arrays on a
> GPU.

You don't need to be a genius, but you need to know a lot about low-level
stuff. "Simple" matrix-vector multiplication is a task where the quirks of the
hardware already make the quirks of the language fade in comparison. You need
to find out how to split your task into blocks to minimize access to global
memory, you need to manage your shared memory, etc., etc. A _somewhat_
performant matrix-vector multiplication algorithm looks nowhere close to the
textbook definition because of this. So sticking Java or whatever popular
language on the problem is not going to make it much more accessible, as you
still need very specific knowledge to avoid wasting electricity with an
algorithm that is 5-10 times slower than it should be, because all it does is
wait for global memory.
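The blocking idea described above can be sketched in plain Java. This is illustrative only: on a real GPU the tile of `x` would be staged into shared/local memory and the row loop distributed across threads:

```java
public class MatVec {

    // Textbook matrix-vector multiply (row-major a): every row streams the
    // whole of x from (slow) global memory again.
    public static float[] naive(float[] a, float[] x, int rows, int cols) {
        float[] y = new float[rows];
        for (int i = 0; i < rows; i++) {
            float acc = 0f;
            for (int j = 0; j < cols; j++) {
                acc += a[i * cols + j] * x[j];
            }
            y[i] = acc;
        }
        return y;
    }

    // Blocked version: x is processed in tiles so that each tile can stay
    // in fast memory while many rows reuse it.
    public static float[] blocked(float[] a, float[] x, int rows, int cols, int tile) {
        float[] y = new float[rows];
        for (int j0 = 0; j0 < cols; j0 += tile) {
            int jEnd = Math.min(j0 + tile, cols);
            // On a GPU, x[j0..jEnd) would be copied into shared memory here.
            for (int i = 0; i < rows; i++) {
                float acc = 0f;
                for (int j = j0; j < jEnd; j++) {
                    acc += a[i * cols + j] * x[j];
                }
                y[i] += acc;
            }
        }
        return y;
    }
}
```

Both methods compute the same result; only the memory access pattern differs, which is exactly the kind of restructuring that makes tuned GPU kernels look nothing like the textbook loop.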

------
cpswan
I wrote the InfoQ piece. Some further thoughts on my blog -
[http://blog.thestateofme.com/2020/03/10/further-thoughts-
on-...](http://blog.thestateofme.com/2020/03/10/further-thoughts-on-
tornadovm/)

------
daniel_iversen
I used to love the idea back in the day of BEA's "LiquidVM" product
([https://docs.oracle.com/cd/E13217_01/wloc/docs10/lvm/overvie...](https://docs.oracle.com/cd/E13217_01/wloc/docs10/lvm/overview.html))
- not sure what advances there's been made since that (apart from better
hardware and newer versions of Java etc)?

------
leggomylibro
It's amazing to think that Java started out as a language for small embedded
devices. I guess it's coming full circle :)

Cool project, even if there are a lot of features missing atm. It's always
nice to see alternatives to raw VHDL or Verilog for FPGA programming.

~~~
rolltiide
All it ever needed was two additional levels of abstraction 20 years later to
ever fulfill that promise, so obvious and purpose built!

------
rbanffy
I wonder if it'd be feasible to make a Java VM that runs on nothing but a GPU,
without any assistance of the CPU except for bootstrapping the system.

~~~
kotselidis
It is not possible to run the entire VM due to the lack of OS support for some
required features. However, even if you could run all of it, it would not
yield the best performance, due to control flow divergence.

We did some preliminary work on executing some parts of the interpreter on a
GPU this year:
[https://github.com/jjfumero/jjfumero.github.io/blob/master/f...](https://github.com/jjfumero/jjfumero.github.io/blob/master/files/JuanFumero-
MoreVMs2020-Preprint.pdf)

~~~
rbanffy
The only way something like this would work would be the OS treating the
heterogeneous compute resources as "first-class citizens" and part of the
normal OS scheduling and resource management. I don't think any production OS
today does that.

And just after I write it, an article on AMD's work on ROCm pops up.

Exciting times.

------
unixhero
What is it with the Java ecosystem and the term/name "tornado"? A lot of
projects have been named Tornado. Any interesting reasons for this?

------
h8hawk
Why not use a DSL like [http://taichi.graphics](http://taichi.graphics)
rather than create an entire VM?

~~~
chrisseaton
It’s not an entire VM. Where did you get that idea from? It’s _less_ than a
DSL - it’s Java with a modified compiler library.

~~~
ailideex
Just a shot in the dark here... but maybe the person you are responding to
got the idea that this is a VM from the fact that TornadoVM literally has VM
in its name...

~~~
chrisseaton
It's clarified in the _first sentence_ of the article.

~~~
monkpit
Then why create that confusion with the name?

~~~
jfumero
The VM name came because TornadoVM implements its own set of bytecodes for
handling heterogeneous execution. These bytecodes are used for handling JIT
compilation, device exploration, data management and live task-migration for
heterogeneous devices (multi-core CPUs, GPUs, and FPGAs). We sometimes refer
to a VM inside a VM (nested VM). The main VM is the Java Virtual Machine, and
TornadoVM sits on top of that.

You can find more information here:
[https://dl.acm.org/doi/10.1145/3313808.3313819](https://dl.acm.org/doi/10.1145/3313808.3313819)

------
fancyfredbot
How would this compare to Aparapi? I think Aparapi worked by converting Java
to OpenCL, which could then run on GPUs and FPGAs.

~~~
mikepapadim
Indeed, Aparapi worked by converting Java to OpenCL. However, it exposed to
the user many aspects of GPU programming, such as thread indexing (e.g.,
global ids) and memory allocation for specific optimizations (e.g., local
memory). In the case of TornadoVM, these aspects are handled by the compiler
and the runtime.

~~~
alcidesfonseca
If I am not mistaken, aparapi included some templates where those low-level
aspects were hidden.

The runtime can be added via a jar file, but lambda-based operations must be
converted to OpenCL/Cuda/PTX/LLVM or other low-level GPGPU language.

Aparapi did the latter using runtime bytecode instrumentation. TornadoVM also
does the same thing, as a JDK compiler plugin. AeminiumGPU did the same using
a transpiler [0] before the actual Java compilation step.

[0]
[https://github.com/AEminium/AeminiumGPUCompiler](https://github.com/AEminium/AeminiumGPUCompiler)

~~~
jfumero
Aparapi is a direct translation from Java bytecode to OpenCL. To do so,
Aparapi provides a compiler and a runtime system to automatically handle data
and execute the generated OpenCL Kernel.

TornadoVM compiles from Java bytecode to OpenCL as well. But additionally, it
optimizes and specializes the code by interleaving Graal compiler
optimizations (such as partial escape analysis, canonicalization, loop
unrolling, constant propagation, etc.) with GPU/CPU/FPGA-specific
optimizations (e.g., parallel loop exploration, automatic use of local memory,
exploration of parallel skeletons such as reductions). TornadoVM generates
different OpenCL code depending on the target device, which means that the
code generated for GPUs is different from that for FPGAs and multi-cores. This
is because OpenCL code is portable across devices, but performance is not.
TornadoVM addresses this challenge by applying compiler specialization
depending on the device.

Additionally, TornadoVM performs live task migration between devices, which
means that TornadoVM decides where to execute the code to increase performance
(if possible). In other words, TornadoVM switches devices if it knows the new
device offers better performance. As far as we know, this is not available in
Aparapi (in which device selection is static). With task migration,
TornadoVM's approach is to only switch devices if it detects that the
application can be executed faster than the CPU execution using the code
compiled by C2 or the Graal JIT; otherwise it will stay on the CPU. So
TornadoVM can be seen as a complement to C2 and Graal. This is because there
is no single piece of hardware that best executes all workloads efficiently.
GPUs are very good at exploiting SIMD applications, and FPGAs are very good at
exploiting pipelined applications. If your applications follow those models,
TornadoVM will likely select heterogeneous hardware. Otherwise, it will stay
on the CPU using the default compilers (C2 or Graal).

Some references:

* Compiler specializations: [https://dl.acm.org/doi/10.1145/3237009.3237016](https://dl.acm.org/doi/10.1145/3237009.3237016)

* Parallel skeletons: [https://dl.acm.org/doi/10.1145/3281287.3281292](https://dl.acm.org/doi/10.1145/3281287.3281292)

* Live task-migration: [https://dl.acm.org/doi/10.1145/3313808.3313819](https://dl.acm.org/doi/10.1145/3313808.3313819)

------
DesiLurker
But Why?

