
Why Aren’t More Users More Happy With Our VMs? Part 1 (2018) - agronaut
http://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_our_vms_part_1.html
======
pizlonator
It would be better if VM comparisons included JSC rather than, or in addition
to, V8. JSC tends to outperform V8 so if you find a pathology in V8 it’s just
not so surprising. It would be more interesting if you found a pathology in
JSC.

I think that the use of small benchmarks obscures what’s going on. The VM is
trying to win in the average. It’s like a professional gambler. Observing that
the VM did something dumb for a program is like observing that a professional
gambler lost a bet. That’s not interesting. In a game of chance, even a really
great strategy will have its outliers.

I think that to understand the quality of a VM you have to throw millions of
lines of code at it and see if the optimizing JIT can consistently produce
speedups, or at least produce speedups more often than not by some
aggregate metric. As someone who studies the behavior of JSC on million-line
code bases, I can tell you that a pretty good outcome is if only a small
number of functions experience an “upside down” effect from optimization and
end up running slower over time.

Finally, the whole search for a methodology to pinpoint warmup is broken. It’s
pure brain damage. VMs need to be fast even for small programs that don’t have
a chance to warm up. Startup time is absolutely important. So it’s a
methodological antipattern to even try to find the warmup point.

The questions worth asking are:

\- for some program, how long does it take to run that program. Start to
finish. No ignoring warmup.

\- how long it takes to run some very long program, or the average running
time of a small program over many iterations

\- some percentile of behavior, like the 99th, to capture the janky
behavior.

Ideally you measure all of those things and include both short running and
long running programs.

This tells you how good a VM is.
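As a concrete sketch of those metrics (in JavaScript, since most of this
thread is about JS VMs; `workload` and `measure` are hypothetical names, not
anything from JSC or V8):

```javascript
// Sketch of a harness reporting the aggregate metrics described above:
// total wall-clock time (warmup included), mean iteration time, and a high
// percentile to capture janky outliers. `workload` is a stand-in for the
// program under test.
function workload() {
  let sum = 0;
  for (let i = 0; i < 100000; i++) sum += Math.sqrt(i);
  return sum;
}

function measure(fn, iterations) {
  const times = [];
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    const t0 = performance.now();
    fn();
    times.push(performance.now() - t0);
  }
  const total = performance.now() - start; // start to finish, no warmup excluded
  times.sort((a, b) => a - b);
  return {
    total,
    mean: times.reduce((a, b) => a + b, 0) / times.length,
    p99: times[Math.floor(times.length * 0.99)], // tail latency ("jank")
  };
}

console.log(measure(workload, 1000));
```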

If you’re doing math or methodology to identify the warmup point then you’re
effectively biasing your experiment to forgive VMs for bad behavior so long as
that bad behavior happens early. Nothing could be sillier. Users care about
the perf of their VMs at startup not just in steady state.

Anyway, that’s the way I like to do optimizations in JSC.

~~~
Jasper_
> \- for some program, how long does it take to run that program. Start to
> finish. No ignoring warmup.

This methodology likely comes from Java, which has long-running server
applications. "How long does it run" is often "until someone hits ^C". Here,
startup cost can be slow as long as the peak performance is fine. It's
accepted that the first minute or two of the server are slow, but that's small
compared to the month or so that the server will be running for.

> This tells you how good a VM is.

I think papers like this approach it from the wrong angle. I don't care about
the VM's theoretical peak performance. I care about being able to measure and
track performance in a reliable way. Put simply, I'm fine with bad codegen as
long as I can consistently measure it. Feel free to improve it, but a VM that
_sometimes_ gives me good codegen, unreliably, is much more frustrating than
consistently bad codegen. But this seems to be the way the VMs are going, with
things like probabilistic profiling.

If I refactor my code and replace for(let i = 0; i < L.length; i++) with
for(const i of L), what's the cost? Will performance go up or down? We don't
have tools or metrics to handle that right now. How can I ensure my codegen
is good and won't regress?
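For instance, the only tool available today is an ad-hoc microbenchmark along
these lines (a sketch; the absolute numbers are meaningless and the winner can
flip between engine releases, which is exactly the problem):

```javascript
// Crude timing of the two loop forms from the comment above, runnable in
// node.js. This shows *that* they differ, not *why*, and says nothing
// about whether the result will hold on the next engine release.
const L = Array.from({ length: 1000000 }, (_, i) => i);

function indexed() {
  let sum = 0;
  for (let i = 0; i < L.length; i++) sum += L[i];
  return sum;
}

function forOf() {
  let sum = 0;
  for (const i of L) sum += i;
  return sum;
}

for (const [name, fn] of [["indexed", indexed], ["for...of", forOf]]) {
  fn(); // give the JIT a chance to see the function before timing
  const t0 = performance.now();
  for (let i = 0; i < 50; i++) fn();
  console.log(name, (performance.now() - t0).toFixed(1), "ms");
}
```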

I work on a particularly demanding website in my free time
([https://noclip.website/#smg/AstroGalaxy](https://noclip.website/#smg/AstroGalaxy),
which unfortunately won't run in WebKit due to missing WebGL 2), and
performance varies drastically from Chrome release to release, and I do
extensive testing with node.js to make sure that I'm getting good codegen.

~~~
pizlonator
I know that the warmup skipping comes from Java. It was a mistake there.
Saying that it’s because Java is for servers is a lame excuse and may be
getting it backwards - maybe Java only succeeded on servers because all the
tuning ignored warmup.

I hear ya that having tools would be great - but the best speedups do come
about from probabilistic methods so it would be weird to rely on whatever a
profiler told you.

------
alfalfasprout
The reality is... it's been decades and while JVM languages can be pretty
fast, I have yet to see many non-contrived examples where the VM-based
language consistently outperforms competently written but not heavily
optimized C++. Even then, extensive tuning is done to the VM. Heck, with the
advent of Go you now have another great higher level language that
consistently outperforms Java/Scala, has top notch garbage collection, and
doesn't make you deal with a bloated VM.

JIT is very nice in theory. It's great in certain applications (e.g. in very
tightly scoped domains like accelerating linear algebra). Its proponents
always talk about how it allows for optimizations that would be too costly or
difficult when doing AOT compilation. But the operational complexity of getting
it to actually perform at that level on a production language VM (e.g.
Oracle's JVM) is often its undoing.

~~~
continuational
According to The Computer Language Benchmarks Game, Go performance is in the
same ballpark as Java, and sometimes several times slower:

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/go.html)

~~~
norswap
Honest question: isn't the benchmarks game unrepresentative because it
encourages submission of heavily optimized non-idiomatic programs?

Edit: looked at the source, which is easily accessible. The Java examples are
reasonable (if slightly performance-minded, but nothing shocking).

------
ddevault
It occurs to me that, in practical terms, the "steady state" of performance
with the increasingly large blobs of JavaScript we find littered throughout
the web tends to impact the user more than other JITs. As the user navigates
from page to page, the VM is reset, a fresh set of minified blobs is
downloaded and JITed, and a core or two is pinned to do so. That translates
pretty directly to a hotter phone, less battery life, and a frustrated user.
Sure, when your JVM startup is 0.1% of the runtime of your program, it's not
as big of a deal, but when it's more like 20% of the time a user spends using
the program (their web browser), it's a lot worse, and has slim potential for
improvement.

~~~
ehnto
It has been really frustrating watching the frontend web evolve. Take Jira for
example: the platform is written in Java, yet all my time is spent waiting for
all the widgets to jiggle their way into existence. No matter how fast your
server is, if you have to do 10-20 network requests to hydrate your frontend
architecture, the network time alone is going to ruin your perceived
performance.
Your server could deliver sub 5ms response times, with 50ms Time to First Byte
on the initial request, and it would still feel like wobbly molasses. The JVM
is the least of their worries.

------
smabie
Haven't we learned that VMs aren't worth it? The amount of engineering
resources it takes to design a half-way decent one is just absolutely
staggering. We all know that theoretically they can be faster than native
code, but how many people have actually experienced that? The costs are too
high: the warm-up costs, the engineering costs, the complexity costs. If every
single language just generated LLVM IR and compiled it, we'd be in a lot
better place and we'd probably have better, or at least more predictable,
performance.

Like more resources have probably been put into the JVM than any other VM or
compiler on earth and what has it given us, exactly? Performance is still
worse than C and with homogenized operating systems, as well as the move
towards the web, the portability guarantees don't feel very important. If I'm
missing something, please tell me!

~~~
ncmncm
The same can also be said about GC. Every year somebody is promoting a new
Incremental Realtime Generational garbage collector that is _almost_
acceptable for serious work, if you ignore enough real-world overheads. People
writing code that has to manage resources besides memory have found that,
given the right core language facilities, managing memory too is no bother,
but managing other resources while fighting GC is intolerable. Of course you
need destructors.

Meanwhile, C performance has not been a worthy goal in decades. Getting "as
fast as C", if you could get there, would still leave you firmly in second or
third place. The computers are not getting much faster anymore, but the
problems are getting much bigger and the networks much faster, so performance
matters more each year than the last.

~~~
lokedhs
That's a No True Scotsman fallacy. I guess all those millions upon uncountable
millions of lines of Java, Erlang, C#, etc. code in production are not
serious? Because they're running in a VM?

~~~
hedora
What percentage of those lines run with an incremental realtime GC?

The only such Java GC I know of is part of the Azul Zing VM.

~~~
WorldMaker
I believe that's why the poster referenced the "No True Scotsman" fallacy. It
entirely depends on your definitions and you can keep endlessly narrowing your
definitions. All of them GC, right? All of them are doing some form or another
of incremental, right? (Some in all generations, others only in some
generations, based on balancing and tuning needs.) Do all of them have
realtime characteristics? Yes and no. None of them are allowed to stop the
world, certainly, but does that mean that all of them or any of them can be
called "realtime"? It's a semantic hedge forest.

------
overgard
Ugh, at risk of being the grumpy old programmer: JIT code is just never going
to touch actual handwritten native code in performance. It's been like 50
years. If we care about performance we need languages designed for it.

~~~
hansvm
I mostly agree, but compilers intentionally don't explore every optimization
path to save compilation time, and a JIT is able to trace the program for real
data and find places that could benefit from additional attention.

I don't think JITs are the only way to address that trade-off in compilation
time vs runtime performance metrics, and whether their benefits are worth
their other costs is an interesting, program-dependent question, but they have
strictly more information available than a naive compiler and can potentially
use that to make better decisions.

~~~
jjnoakes
Do any languages explore both? Compile to native, then run with a JIT which
profiles and optimizes further based on runtime behavior?

~~~
jdsully
It's called profile-guided optimization, although they don't run with an
actual JIT. The big 3 C compilers all have it.

~~~
The_rationalist
It is clearly less powerful: it analyses one sample and extrapolates from it.

Every user would need their own PGO profile, regularly refreshed, to be
competitive with the advantages of JIT profiling.

~~~
btrask
This is a completely legitimate comment (which was 'dead' when I saw it). PGO
really is less powerful than JIT.

A compiled language with a "micro-JIT" (e.g. for v-tables) seems like an
interesting idea to me.

------
pso
It seems that the authors are not measuring what they think they are, or have
explained it poorly. Most transitions from interpreter to JIT show speedups of
10x to 100x, e.g. LuaJIT or V8. How is it possible that the variation of V8
(as an example), according to their numbers, shows improvements of only a few
percent, when it should be orders of magnitude faster after the transition? My
conclusion: they are measuring variations after warmup.

All of the warmup, and the transitions from interpreter, to JIT, to optimized
JIT, happen inside the first few micro- or milliseconds of EVERY one of their
thousands of process iterations. Their measurements are ALL of the system
variation of the VM after warmup has taken place. The VM is optimizing within
the first 1-1000 inner loops occurring at the start of EACH process iteration.
For most working programmers, a variation of a few percent on a running system
AFTER warm-up in "steady-state peak performance", and before any I/O takes
place (because language benchmarks avoid I/O), would not be an issue. If it is
an issue, then the article perhaps demonstrates that a compiled language would
offer less variation.

The benchmarks listed range from a shortest of around 0.4s for
fannkuch/HotSpot/Linux, up to 1.8s for n-body/PyPy/Linux. This 'long-running'
benchmark code (of 0.4 to 1.8s), by definition, has to include multiple inner
loops/hot code, which is quickly optimized; otherwise the benchmark code would
have to be millions of lines long in order to have a sufficient runtime
length. Tests need to run for at least tenths of a second for cross-language
comparisons, since JITted languages take some iterations to warm up.

~~~
hedora
Their first iteration is an entire run of the underlying benchmark. Subsequent
iterations are reusing the same VM. They run each plot multiple times, and
reboot between plots.

They’re trying to show that “warmed up steady state” isn’t something that
reliably exists.

~~~
pso
Yes, I know. But the tone of the whole article is as if they've found deep
flaws across many VMs. They call something 'warmup' which I think has little
or nothing to do with the JIT, but is unaccounted-for variation in the whole
running system.

The final graph shows a binary trees program in C with a 6% variation between
"in-process executions" and no steady state; it seems logical that most VMs
will show the same or worse variation.

The "warmed-up steady state" does exist, but not if they define it so
narrowly. All of their iterations and timings are running at 30x to 100x
interpreted speed; the only 'cold' interpreted code is in the few microseconds
of the first loops of an execution.

------
Spivak
Stupid question for people who know more about these things. Why can’t we
fight the warmup time by running an already warmed up snapshot of the program?
Or say dumping some data structure when it hits steady state to give hints to
the JIT the next time it runs?

~~~
pizlonator
That would be great but it’s hard because the trick, at least in JS VMs, is to
have the JIT specialize based on the heap.

So you’d need a heap snapshot or some way to link the generated code to a
different heap.

Maybe not impossible, just hard enough that it’s not widespread.

~~~
hamburglar
It does seem like you could do JIT with a persistent cache which stores the
JIT output along with a key that's a hash of all the relevant system
parameters like CPU model and VM parameters like heap size. This would mean
that the typical case of re-running a program in the same environment would be
pre-warmed.

~~~
pizlonator
It’s much harder than that because the JIT is speculating on what lots of
objects in the heap are doing, including watchpointing them to constant fold
properties. It’s not clear what the key should be in that case.

Still not impossible but I want to be clear on what exactly makes this hard.
CPU model for example is not what makes it hard.

~~~
saagarjha
> including watchpointing them

I assume this isn't literally using hardware watchpoints?

~~~
pizlonator
Yeah

------
darksaints
I love IntelliJ IDEA, but it's pretty damn slow for a lot of things (though
never quite as bad as eclipse). I have always wondered if the JVM is
optimizing for tasks that happen at startup, at the expense of performance
during use. It would be a shame if the slow performance simply came down to
the JVM profiling at startup and determining that the core function of the
entire program was to index the code in your workspace.

~~~
qqssccfftt
In my experience, IntelliJ is as fast as VS Code. It'll never be as fast as a
terminal editor because it has actual features as opposed to just being a text
editor.

Outside of startup (which has gotten so much better in the last years I don't
even see the splash screen anymore) and initial indexing upon project creation
or library downloading, it's perfectly fast enough.

~~~
tracker1
In my own experience IntelliJ is slower than VS Code to use... It will depend
on what you're doing, as VS Code has an interesting plugin model that doesn't
hang up the UI/editor itself...

Not all the plugins respond in time while typing, but I can at least keep
typing and it continues to function with basic editing (general worst case).

------
CalChris
There are a couple of talks associated with this post and paper.

 _Virtual Machine Warmup Blows Hot and Cold_

[https://www.youtube.com/watch?v=LgCHAU8ZB00](https://www.youtube.com/watch?v=LgCHAU8ZB00)

 _Why Aren't More Users More Happy With Our VMs?_

[https://www.youtube.com/watch?v=cmrzOkEM9fc](https://www.youtube.com/watch?v=cmrzOkEM9fc)

------
lokedhs
People are complaining about VM performance, while at the same time I'm
guessing a fair number of the same people happily write production code in
Python. It's hard to find a slower language than Python.

People are not using VM-based languages because they want to beat the numeric
performance of hand-optimised Fortran. They use them because of the memory
safety, compatibility, and performance that is on par with or even better than
other languages (at least when it comes to the JVM) for the tasks that they
want to perform, which for most programmers is not going to be numeric
simulation.

~~~
jonathanstrange
I have nothing against lean VMs/JIT compilation like LuaJit or the way Racket
compiles on the fly, but I do avoid Java VMs for various reasons:

\- Dependency hell and deployment problems: It's hard to make correct
assumptions about which VM version is available on which platform. Pre-
installed versions interfere with side-installed versions, and there is a ton
of software that requires older Java VMs to work properly. It's a huge mess.

\- Potential for losing future OS support: Apple, Microsoft, and others may at
any time decide to block the Java VM or no longer support it on their
platform. That means you have to bundle your software with a Java VM (e.g.
CrashPlan has done this), making installation and deployment even more
difficult.

\- A thousand past problems on Linux: Various versions of OpenJDK and Oracle's
Java in combination with user software written in Java have caused massive
problems on my Linux machines during the past 15 years, from causing extreme
slowdowns to freezing the desktop until you hard reset.

Nothing else has given me as much trouble on Linux as Java, not even
proprietary graphics card drivers and kernel extensions. Whatever the Java VM
does, if it can freeze your whole system just because you run desktop software
like JabRef, then there is something wrong with it.

~~~
qqssccfftt
t. person who has not used java since Java 1.5

~~~
jonathanstrange
My post wasn't intended to be taken as a statement about Java, the programming
language; it's about VMs/Java implementations. Unfortunately I have software
that has to run on the Java VM.

------
jasonhansel
I suspect that part of the issue is that when _devs_ are writing and testing
their code, they rarely keep the VM running for long enough to reach peak
performance. So the performance still feels slow to developers, and
performance issues still block the develop/reload/test cycle.

~~~
AceJohnny2
This is what overnight "regression" & "aging" tests (where you run the test
suite over and over, to try and capture those rarely-seen corner cases) are
supposed to catch.

I'd be surprised if the VM developers didn't run these.

More likely, the environment the VMs are run in has changed in the decades
since their development: amount of RAM, cache size, latency of one subsystem
over another....

------
im3w1l
The title doesn't do the surprising conclusion justice.

"When we set out to look at how long VMs take to warm up, we didn’t expect to
discover that they often don’t warm up. But, alas, the evidence that they
frequently don’t warm up is hard to argue with."

By not warming up they refer to instances when early performance is higher
than later performance or when the performance doesn't settle.

~~~
afiori
> By not warming up they refer to instances when early performance is higher
> than later performance or when the performance doesn't settle.

which are both cases where you would be disappointed were you to use the VM as
a server.

I think their point is that VMs are more unpredictable than many realize and
also intrinsically unpredictable in some cases.

------
quotemstr
It's psychological. Most users don't really care how fast something runs once
the JIT warms up: instead, they care about the latency from starting the
program to the time they can use it, and they use this latency as a
psychological measure of the performance of the program as a whole. Stupid?
Yes. But that's what actually happens.

The JVM has always had a slow startup path. Much slower-overall systems like
Python don't. That's why people don't complain about Python performance and do
complain about Java performance.

------
daxterspeed
These tests could be on a website of their own. I'd love to see how the
results change with time and how competing implementations compare in terms of
behavior (not necessarily performance), V8 especially has changed a lot but it
wouldn't surprise me if it still ran into issues.

