
JVM implementation challenges: Why the future is hard but worth it [pdf] - pron
http://cr.openjdk.java.net/~jrose/pres/201502-JVMChallenges.pdf
======
vardump
Cache, cache and cache. Three of the most important things Java gets wrong
today. Mostly because Java doesn't have true object arrays at all: only
primitive types can be stored inline as array elements, so you get an array
of references to objects instead of an array of objects.
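
A minimal sketch of the contrast (Point is just an illustrative class):

    class Sketch {
        static class Point { double x, y; }

        // An "array of objects" is really an array of references: each
        // Point lives somewhere else on the heap, so iteration chases a
        // pointer (and potentially a new cache line) per element.
        Point[] pts = new Point[512];

        // A primitive array is one contiguous block (here 4KB), so
        // iteration streams through consecutive cache lines.
        double[] xs = new double[512];
    }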

There are only 512 L1D cache lines in a CPU core (a typical 32KB L1D with 64B
lines). After touching just 512 memory areas covered by different cache
lines -- or fewer, given associativity -- L1D entries will start to drop. In
Java this happens very easily, like when just iterating a single, not very
large, data structure.

Everything else (except maybe the SIMD/GPU items for compute purposes) is not
anywhere near as important as fixing Java's current problems with CPU caches.

~~~
kristianp
This reminds me of something I'm curious about: why have L1 caches been the
same size for quite a few generations of Intel Core processors?

~~~
jjoonathan
This reminds me of something I'm curious about: why aren't large SRAM arrays
displacing DRAM? I know 6T cells are larger than 1T+1C, but even if SRAM were 6x as
expensive (which ought to be a severe upper bound) that would put it at $50/GB
which sounds like it could easily be worthwhile for certain workloads -- not
unlike SSDs a few years ago. The latency advantage would be gigantic,
especially if you could crowd the chips near the CPU.

Is SRAM comparatively power hungry or something?

~~~
wmf
RLDRAM exists, so SRAM would have to beat that. In your 6x upper bound, maybe
you're not taking into account (dis)economies of scale that could cause SRAM
to be much more expensive.

~~~
jjoonathan
What diseconomies of scale do you mean? If you mean that demand is low right
now and so there is no economy of scale yet, that's hardly evidence against a
gradual market takeover. That's what investors and early adopters are for,
see: SSDs. If you really do mean diseconomy of scale, where unit price will
rise with the number of units produced, what would produce such an effect in
the SRAM market? Chip fab is almost always in the opposite regime; what's so
unique about SRAM?

Also, I looked up RLDRAM [1] and it looks like their "low" latencies are still
10ns, which is good for DRAM but abysmal in comparison to SRAM.

[1] [http://www.micron.com/products/dram/rldram-memory/1_15Gb#/](http://www.micron.com/products/dram/rldram-memory/1_15Gb#/)

------
x0x0
There's some exciting stuff in the presentation. In particular: next-gen
threading (fibers/warps, fj), plus attention to java's profligate use of
memory b/w because of its reliance on pointers-for-all-the-things. The idea
of the jvm helping out languages besides java is also exciting.

That said, it's a little scary to contemplate the power that microsoft,
steward of the clr, and oracle, steward of the jvm, have over our industry.
It's staggering to see the amount of engineering that's gone into bullet-
proofing the jvm. What happens when microsoft or oracle decline to continue
paying for hundreds of very expensive senior compiler/vm/language engineers?

~~~
tormeh
I've long wondered what's in it for Oracle and now Microsoft. Microsoft stands
to profit from applications written for Windows that are hard to port, but
they seem to be abandoning that approach. Oracle seems to have no real upside
to maintaining Java. No one is paying for C# and Java directly, so I don't get
it.

~~~
tsotha
I always believed Sun shot itself in the head with Java. They had this idea
that getting people invested in a cross-platform ecosystem meant new software
could be run on Sun hardware.

But what really happened is people who had been locked into Sun with legacy
software wrote applications in Java and then switched from Sun to cheaper
hardware running Linux or Windows.

~~~
SixSigma
They did have Microsoft removing their helmet and doing the reloading.

Sun called it: Mankind vs Microsoft [1]

It drained their energy and stole their focus.

It was the apex of Embrace, Extend, Extinguish.

[1]
[http://www.theregister.co.uk/2004/04/03/why_sun_threw/](http://www.theregister.co.uk/2004/04/03/why_sun_threw/)

Interesting, from that 2004 piece:

> Microsoft's biggest global competitors are exactly as they were on Thursday:
> Nokia and Sony.

How different things looked then!

------
pjmlp
Great to see AOT compilation coming to the reference JVM instead of relying on
third parties.

Also the desire to go meta-circular and further reduce the amount of C++
code.

Most likely based on the Graal/SubstrateVM work.

Finally, JNI being replaced by something developer-friendly.

------
haddr
Lightweight threads (fibers) look especially interesting to me. I wonder if
there is some library that already tries to get around threads with something
more lightweight...

~~~
rdtsc
An interesting one is

[http://docs.paralleluniverse.co/quasar/](http://docs.paralleluniverse.co/quasar/)

They claim to have "true lightweight" threads. (I am sure pron, if he is
around, can jump in and expand on it.)

It takes a lot from Erlang, even pattern matching.

~~~
pron
Well, the pattern matching is only in the Clojure API, but yeah, these are
true fibers based on continuations implemented with bytecode instrumentation,
scheduled by a work stealing scheduler (or any other scheduler of your
choice). It lets you write simple blocking code with all its familiarity and
advantages (exceptions, thread-local data) but enjoy the performance of async
code.

The downside is that libraries have to be integrated in order to block
gracefully when called on fibers, but we already have a long and growing list
of integration modules for popular libraries.
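
For the curious, a minimal sketch of what that can look like with Quasar
(assuming quasar-core on the classpath with its -javaagent instrumentation
enabled; details may vary by version):

    import co.paralleluniverse.fibers.Fiber;
    import co.paralleluniverse.strands.Strand;
    import co.paralleluniverse.strands.SuspendableRunnable;

    public class FiberDemo {
        public static void main(String[] args) throws Exception {
            // Reads like ordinary blocking code, but Strand.sleep suspends
            // only this fiber; the worker thread keeps running other fibers.
            Fiber<Void> f = new Fiber<>((SuspendableRunnable) () -> {
                Strand.sleep(100);
                System.out.println("hello from a fiber");
            });
            f.start();
            f.join();
        }
    }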

------
oldmanjay
Oh please give me continuations in the VM and I will take back every angry
thought I've ever directed at Oracle.

------
tomp
On slide 16, which talks about cache, there is this:

> _Rule #1: Cache lines should contain 50% of each bit (1/0)_
> _– E.g., if cache lines are 75% zeroes, your D$ size is effectively halved_

Can anyone explain this?

~~~
x0x0
A cache line is the unit of read from ram, and cache lines are at a premium.

At least for data, a common win is to compress the data in ram and decompress
once it is in-core; this is often essentially free as many machine learning
algorithms are bandwidth starved but have plenty of compute available. Suppose
you are eg storing small integer counts in ints; if it's a java int you are
using 4B to store 1B, while if it's a java.lang.Integer it costs 16 bytes
_plus_ most likely an 8B pointer.
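
Roughly, as a sketch (assuming a typical 64-bit HotSpot layout):

    int[]     counts = new int[512]; // 4B per element, contiguous: ~2KB
    Integer[] boxed = new Integer[512]; // 512 references (4B each with
                                        // CompressedOops), each pointing at
                                        // a ~16B Integer object elsewhere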

Another way to consider this: if you are using 8B pointers, you waste a lot
of each one as constant zeros -- 1TB is 2^40, so even on a 1TB machine the
top 3B/24b are wasted. Particularly with (all? at least that I know of) jvms
that have 8B alignment, the bottom 3 bits are always zero as well. This is
how the CompressedOops hack works to access 32G of ram w/ 32b pointers.
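
A sketch of the arithmetic (illustrative names, not HotSpot's actual code):

    // With 8B alignment the low 3 bits of every address are zero, so a
    // 32-bit compressed oop shifted left by 3 covers 2^35 = 32GB of heap.
    static long decode(int compressedOop, long heapBase) {
        return heapBase + ((compressedOop & 0xFFFFFFFFL) << 3);
    }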

~~~
bluecalm
>>At least for data, a common win is to compress the data in ram and
decompress once it is in-core

I have trouble imagining how this could be done. Would you mind elaborating
and/or sharing some examples?

~~~
x0x0
The insight is that on a modern processor memory bandwidth is scarce but it
can issue several floating point ops per clock. So multiplying and adding
stuff in L1 or already in registers is cheap. Streaming algorithms that walk
large chunks of ram will be repeatedly memory starved, so anything you can do
to effectively increase memory b/w is valuable.

Say you have 8-bit integers: store them packed in ram, then unpack upon
reading. So instead of storing an array of int[], you have an internal array
of long[] and you read it with a function, as sketched below. Each memory
read will suck in 8 values at once.
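
A minimal sketch of that packing (hypothetical helper, read side only):

    long[] _packed; // eight 8-bit values per long; length = (n + 7) / 8

    byte get8(int idx) {
        long word = _packed[idx >>> 3]; // which long holds element idx
        int shift = (idx & 7) << 3;     // bit offset of the byte within it
        return (byte) (word >>> shift); // one 8B read serves 8 elements
    }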

The same technique works for floats or ints with a small-ish range and limited
precision; you can store a scale and offset, then pack on write / unpack on
read. It's common to be able to quadruple your effective memory bandwidth,
and the read operation -- i.e.

    
    
    // instead of:
    double[] _data;
    // accessed as
    _data[idx];

    // instead you do
    getd(idx);

    // using the below
    byte[] _mem;             // packed data: 1B per element instead of 8B
    float _bias, _scale;     // offset and scale recorded on ingestion
    double getd(int idx) {
       long res = _mem[idx]; // one byte, sign-extended
       return (res + _bias) * _scale;
    }
    

executes entirely from registers, and is essentially free. The price of all
this is you have to process your data on ingestion, but if you run iterative
algorithms -- like convex optimizers -- that repeatedly walk your entire
dataset, this is often a big win. You can often lose some of the low precision
bits on the float or double, but those probably don't matter much anyway.

Like anything else, you'll have to measure.

------
nusbit
Is there a video?

