
Why Aren’t More Users More Happy with Our VMs? Part 1 - dochtman
http://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_our_vms_part_1.html
======
nkurz
_why did CPython running the Mandelbrot benchmark have a running time of
134.419s and an incredibly wide confidence interval of ± 32.347s? I had no
idea, so, in the best tradition of VM measurements, I buried my head in the
sand and hoped that no-one would say anything. They didn’t._

A beautiful summary of the field!

 _In other words, this VM is highly optimised for this benchmark — but no-one
ever ran it long enough before to notice that performance gets much worse over
time._

Perhaps, although more likely, they noticed; and then in accordance with
tradition, promptly buried their heads in the sands.

---

Since this thread is likely to be read by people who care about unexplainable
performance glitches, I'll highlight an otherwise inexplicable ~20% run-to-run
performance drop that I recently saw an explanation for:
[https://www.realworldtech.com/forum/?threadid=179700&curpost...](https://www.realworldtech.com/forum/?threadid=179700&curpostid=179700)

There has been a lot of talk about whether AVX512 is of real world benefit.
One big issue is that it's so power hungry that the entire core compensates by
slowing down whenever 512-bit instructions are executed. What's not well known
is that with an AVX512 capable processor running the common Ubuntu 16.04, you
may well encounter this slowdown even if your program never uses AVX512.

More specifically, there is a problem with the way that some versions of glibc
are saving and restoring the 512-bit ZMM registers. Even at program start-up,
the upper 256-bits can be in a "dirty" state, which can cause the core to slow
down to the "AVX512-light" speed on any SSE, AVX, AVX2, or FP operation.
Patches for glibc were backported as far back as 2.25, but Ubuntu hasn't
included these fixes in their updates.

The bottom line is that if you are working on an AVX512 capable system, you
should probably make sure that you understand this bug before publishing any
benchmark results.
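
If you want to check whether your own process starts life in that dirty
state, the XINUSE bitmap is readable from user space. Below is a minimal
diagnostic sketch in C (my own illustration, not code from the linked post);
it assumes x86-64, GCC or Clang inline assembly, and a CPU that supports
XGETBV with ECX=1:

    #include <stdint.h>
    #include <stdio.h>
    
    /* Read the XINUSE bitmap: XGETBV with ECX=1 reports which XSAVE
       state components are currently in use ("dirty"). */
    static uint64_t xinuse(void) {
        uint32_t eax, edx;
        __asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(1));
        return ((uint64_t)edx << 32) | eax;
    }
    
    int main(void) {
        /* Bit 6 = upper halves of ZMM0-15, bit 7 = ZMM16-31, per
           Intel's XSAVE state-component numbering. */
        if (xinuse() & 0xC0)
            puts("upper ZMM state is dirty at startup");
        else
            puts("upper ZMM state is clean at startup");
        return 0;
    }

If this reports a dirty state before your code has touched any vector
registers, the glibc behaviour above is a likely culprit, and your SSE/AVX
benchmarks may be running at the reduced frequency.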

~~~
dullgiulio
Just for completeness, this problem with AVX bites even when using containers
instead of VMs.

~~~
jholman
Just for completeness, the "VM"s in this article have nothing to do with
containers. But of course we all read the article before commenting and we all
know that, right?

Without being an expert in virtualization or containerization, it seems pretty
obvious to me that if a certain instruction causes thermal throttling, then
that'll cause thermal throttling whether the instruction is issued from a host
OS or a guest OS or for any other reason. Indeed, the more layers of
abstraction involved, the likelier you are to hit the problem.

------
cm2187
One thing I am struggling to understand is why we need a JIT in the first
place. What would be wrong with statically compiling Java or C#? Portability
is nice but the reality is that almost no one uses it. Typically you have to
create different packages for different platforms anyway just because of the
installer. Most backend applications are designed to work on a single platform
(who is running a mixed windows+linux/x86+ARM backend for the same binaries?).
For front end applications, most platforms have different UIs, and all new
platforms (android, iOS) have incompatible APIs anyway.

Is there a fundamental benefit I am missing?

~~~
varjag
The theoretical fundamental benefit is that you can optimize for the code
paths that actually occur at runtime, with a particular set of data or
inputs. There is no way to know that at compile time, hence JIT compilation
is often touted as a way to make things "faster than C" given a sufficiently
smart JIT compiler.

But of course no JIT compiler is ever sufficiently smart, and any execution
path benefit gained is lost to the poor cache discipline of VMs. And the
whole JIT/hotspot concept was really an after-the-fact attempt to get away
from the abysmal performance of VM-based languages.

~~~
hyperman1
There is another very practical benefit: You can use all features available on
your PC.

E.g. when you compile an application C-style, you set the target architecture
to the lowest CPU you want to support. If this is, say, SSE, then no SSE2
instructions will be used by the program, and the sometimes huge performance
gains from them are lost.

A VM like HotSpot can generate code for the specific machine it is running
on. If you have SSE2, it will generate SSE2 instructions. If you don't, it
won't.

Recently, I see more applications choosing code paths based on CPUID, but
this only works for CPUs existing at compile time. If your app was compiled
in 2016, it won't use CPU features from 2018. If your Java app from 2016 runs
on a JRE from 2018, it will use them.
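
For illustration, this is roughly what CPUID-based dispatch looks like in C:
a minimal sketch using GCC/Clang's __builtin_cpu_supports (the kernel names
are invented). Note that the set of paths is frozen when this file is
compiled, which is exactly the limitation described above:

    #include <stdio.h>
    
    /* Hypothetical variants of the same kernel; a real program would
       implement these with intrinsics or per-target compilation. */
    static void kernel_sse2(void)   { puts("SSE2 path"); }
    static void kernel_scalar(void) { puts("scalar path"); }
    
    int main(void) {
        __builtin_cpu_init();                /* populate CPU feature info */
        if (__builtin_cpu_supports("sse2"))  /* runtime CPUID check */
            kernel_sse2();
        else
            kernel_scalar();
        return 0;
    }

A JIT sidesteps the frozen dispatch table entirely: it emits code for
whatever the running CPU reports, including extensions that did not exist
when the program was written.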

~~~
varjag
I would really like to see a Java program beating a C program in this use
case. Say, outperforming it (or even matching it to within the same order of
magnitude) on DCT/IDCT, even by using SSE2 vs SSE/MMX. I can only say I am
immensely sceptical.

Another thing is that these kinds of gains from instruction-level
efficiencies are exceedingly rare, and your typical VM program never runs any
code where SIMD extensions would make a difference.

~~~
hyperman1
This article is not quite what you asked for, but it is an example of a VM
beating native:

[https://nullprogram.com/blog/2018/05/27/](https://nullprogram.com/blog/2018/05/27/)

Java uses the same technique.

Basically, a JIT allows you to skip the overhead of the GOT.

Devirtualization is another case.
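
To make the devirtualization point concrete, here is a hedged C sketch (names
invented for illustration) of the difference: an AOT-compiled virtual call
goes through a table of function pointers, while a JIT that has only ever
seen one receiver type at a call site can emit the direct, inlinable call:

    #include <stdio.h>
    
    /* A virtual call modelled in C: the caller loads a function pointer
       from a vtable-like struct and calls through it. */
    struct shape_ops { double (*area)(double); };
    
    static double circle_area(double r) { return 3.141592653589793 * r * r; }
    
    static struct shape_ops circle = { circle_area };
    
    int main(void) {
        /* What generic compiled code must do: an indirect call. */
        double a = circle.area(2.0);
    
        /* What a JIT can emit after observing only circles here: a
           direct call, which can then be inlined entirely. */
        double b = circle_area(2.0);
    
        printf("%f %f\n", a, b);
        return 0;
    }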

I have personally seen Java beating an existing C++ program for the reason I
describe, but I can't publish the results here. You have to program carefully
or you lose all the speed benefits to pointer chasing. But it is possible.

~~~
wolfgke
> Basically, a JIT allows you to skip the overhead of the GOT.

This problem is GNU/Linux-specific. Windows DLLs are not PIC
(position-independent code), so they do not suffer from it and do not need a
GOT.

See [https://www.symantec.com/connect/articles/dynamic-linking-linux-and-windows-part-one](https://www.symantec.com/connect/articles/dynamic-linking-linux-and-windows-part-one)
for details.

The method that Windows uses is much faster in the average case, but slower
in the "bad" case, so GNU/Linux and Windows make different performance
tradeoffs.

~~~
hyperman1
Yes and no: there is another indirection mechanism at work on Windows. All
the resolved import addresses from the import table are placed at the
beginning of the DLL's data segment, so a call into another DLL first has to
fetch the target address from there, then do an indirect call. In x86
assembler:

    
    
      some_import_address dd some_import ; will be filled in by the DLL loader
          [... lots of unrelated code and data ...]
          call [some_import_address]
    

A JIT can read the resolved address and emit the call directly:

    
    
          call some_import
    

That's one memory load fewer.

The benefit of this indirection is that dynamic linking does not modify the
code segment, so fewer private, modified pages are mmapped, and thus there is
more sharing and less memory usage.

------
theamk
Am I reading that right? There is a whole article about how long JIT
compilers take to settle, but the actual performance changes by just a few
percent at most? (For example, the first graph has 0.571 for a cold JIT and
0.568 for a warm JIT, a 0.5% difference.)

I think for me, the main takeaway from the article is that startup time for
JITs is irrelevant, since it does not affect performance much.

~~~
ltratt
In general, yes, you're reading the timings right, but there are a couple of
factors to consider.

First, the JIT compiler has often done most of its work during the first in-
process iteration, so the plots are not generally following a "one in-process
iteration is slow, then the JIT kicks in from in-process iteration 2 onwards"
model. Put another way, the JIT compiler is often making even the first in-
process iteration run pretty fast.

Second, I would suggest that the often small timing differences are more
worrying than they may first appear. In a semi-mature compiler, optimisations
are frequently in the range of a 0.5-1% improvement. So if your measurements
are only accurate to (say) 2%, most of your attempted optimisations will be
misclassified (bad optimisations will sometimes be measured as good; good
optimisations will sometimes be measured as bad). Furthermore, when a VM
recompiles things, it generally performs several optimisations at once. Thus,
if the overall performance gets worse (even if only slightly), it may suggest
that several optimisations performed badly at the same time.
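
As a back-of-the-envelope illustration of that misclassification risk, here
is a small C simulation (the 0.5% effect and 2% noise are the illustrative
figures from above, modelled as Gaussian noise, not data from the paper):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    #define PI 3.14159265358979323846
    
    /* Normal sample via the Box-Muller transform. */
    static double normal(double mean, double sd) {
        double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        return mean + sd * sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
    }
    
    int main(void) {             /* build with: cc sim.c -lm */
        srand(42);               /* deterministic for the demo */
        int wrong = 0, trials = 100000;
        for (int i = 0; i < trials; i++) {
            double before = normal(1.000, 0.02); /* old VM, noisy run  */
            double after  = normal(0.995, 0.02); /* 0.5% faster, noisy */
            if (after >= before) /* real improvement looks like a loss */
                wrong++;
        }
        printf("good optimisation misread as bad: %.1f%% of pairs\n",
               100.0 * wrong / trials);
        return 0;
    }

With noise several times larger than the effect, a single before/after pair
comes out wrong roughly 40% of the time in this setup: close to a coin flip,
which is why small per-optimisation effects are so easy to misclassify.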

~~~
Boxxed
But doesn't that mean the article's point is kind of moot? That when people
blame slow languages / VMs, they actually _are_ slow and it has nothing to do
with warm-up time?

~~~
codeflo
Both can be true: benchmarking JITs is really hard, but even so (despite
decades of claims to the contrary), managed VMs actually are a bit slower
than unsafe native code.

~~~
wbl
Nothing prevents native code from being safe.

------
kristianp
Interesting stuff. Learning that it can take 190 iterations for warm-up to
finish may be a revelation to VM developers.

The observation that some runs have slowdowns might not be significant if
it's just one run out of 30 that does it, though. The result in the last
diagram, TruffleRuby on spectralnorm showing 25 slowdowns and 5 runs with no
steady state, is interesting.

Full paper at: [http://soft-dev.org/pubs/html/barrett_bolz-
tereick_killick_m...](http://soft-dev.org/pubs/html/barrett_bolz-
tereick_killick_mount_tratt__virtual_machine_warmup_blows_hot_and_cold_v6/)

------
eslaught
It would be interesting to see how much wall clock time these applications run
for. One answer to the question of how many iterations to run for is to
compare it to the expected running time of real applications. If the expected
running time is minutes, and the VM hasn't warmed up after ~1 minute, then
it's pretty irrelevant whether it would "eventually" warm up or not. This
would be one way to sidestep the halting-problem-esque question they've
gotten themselves into with respect to how long to run the benchmarks for
(which I'm pretty sure is intractable).

------
daurnimator
Is anyone able to link to the krun software used?

What do people use to make their computers stable for benchmarking?

~~~
ltratt
The best place to start with Krun is its GitHub page:
[https://github.com/softdevteam/krun/](https://github.com/softdevteam/krun/)

------
eugene_pirogov
How do I make LaTeX tables like that?!

~~~
another-cuppa
Use booktabs for a start. There are various other packages for making tables
nice, but that one gets you a long way. Also study well-typeset tables to
learn why they look nice.

~~~
LambdaComplex
> study well-typeset tables to learn why they look nice.

Can you link to any resources that discuss this?

~~~
another-cuppa
I don't know of any. What I meant was just to find good-looking tables
yourself and study them with an eye for typesetting. The manuals for LaTeX
packages, including booktabs, are often very good too.

