
Register-based VMs have a higher performance than stack-based VMs - frjalex
https://arxiv.org/abs/1611.00467
======
lacampbell
I would love to see the code used for some of these instructions. Their
description of "swap" for the stack VM jumped out at me:

> pop 1 st and 2nd value out of local stack and swap them

You don't need to pop _at all_ for swap. The stack isn't actually changing
size and can be modified in place. You should also only need one pop for
'add', etc. A lot of people don't seem to realise this.

Also, this article may be of interest:

[https://blogs.msdn.microsoft.com/ericlippert/2011/11/28/why-...](https://blogs.msdn.microsoft.com/ericlippert/2011/11/28/why-have-a-stack/)

TL;DR - a stack machine is conceptually simpler than a register machine, and
it doesn't matter if it's slower since you are JIT'ing it anyway.

~~~
kjksf
It does matter, even for JITting.

Pike and Winterbottom
([http://www.vitanuova.com/inferno/papers/hotchips.html](http://www.vitanuova.com/inferno/papers/hotchips.html))
used a register-based VM for Plan9/Inferno because it's much faster and easier
to write a high-quality JIT from a register-based VM to a register-based CPU
(and all of them are register-based) than from a stack-based bytecode VM to a
register-based CPU.

This is probably the same reason Android chose a register-based VM design for
its Java-based platform.

JITting is not free - an increase in JIT complexity is paid for in lower
performance of the code being JITted. Java's HotSpot does wonders, but it
takes a lot of latency and high memory use to get really high-quality results.

It makes sense to do as much work as possible where you have time and CPU
cycles to spare (during compilation to VM instruction set) and not on the
device when every millisecond and every byte matters.

Lower complexity of stack-based VM is a moot point - when you're jitting,
you're doing the hard stuff anyway.

~~~
lacampbell
Well, we are both just appealing to experts here (unless you are in fact one.
I am certainly not - just an enthusiast).

What did you think of the last part of Eric Lippert's post?

 _The code is much smaller and much easier to understand. A stack machine is a
very simple way to describe a complex computation; by being able to write code
for such a simple machine, it lowers the cost of making a compiler. And not
only is it easier to write compilers and jitters that target simple stack
machines, it is easier to write other code analysis tools as well. The IL
verifier, for example, can quickly determine when there is a code path through
a method that, say, misaligns the stack, or passes the wrong types of
arguments to a method._

~~~
vidarh
The stack machine may be conceptually simple, but here's the thing:

Either you have a way to address an offset into the stack, or you don't. If
you do, the stack is conceptually equivalent to a register file where you can
optionally swap out the entire content in one go (by changing the stack
pointer).

If you don't, then registers will in most designs offer you a strict superset
of operations: I've yet to see any register-based machines where you can't
load things into registers, do the operation and push them back on the stack.
In fact, on most machines that are not purist RISC designs, you will tend to
be able to do many/most operations with one of the operands in memory.

That superset comes at a complexity cost in some areas, sure, but being able
to address more different data items without having to manipulate the stack
also turns out to be _very_ handy. If it wasn't, we'd just stick to mostly
manipulating the stack even in register-based designs.

EDIT: In fact, you'll often see "naive" first code generators for register-
based machines heavily rely on stack manipulation for expression evaluation
because it massively reduces the need for register allocation complexity.

~~~
int_19h
An even shorter way to say this is that stack-based VMs are more high-level.
With a register-based VM, you shift some important questions (like register
allocation, and handling temporaries) to the compilers. With a stack-based VM,
the compiler can be simpler, but the JIT will need to be correspondingly more
complicated.

Which one to choose depends on how it is all used. For .NET, multi-language
targeting was seen as an important design goal, and so making the bytecode
easier to target was prioritized, with all the complexity relegated to JIT.

Long-term, I think it was the right solution in general, because most platforms
seem to be moving from JIT to AOT; and with AOT, having the complexity in the
bytecode-to-native compiler still has the upside of DRY, and slower compile
times stop being a downside.

~~~
AntonErtl
However, if you leave register allocation to the source->VM compiler and don't
do it in the VM->machine compiler (JIT), you cannot perform machine-specific
register allocation (8 registers on IA-32, 16 on AMD64 and ARM, 32 on
AArch64). So register-based VMs typically work with large register sets, and
JITs allocate them to the smaller register sets of real machines.

I have not read the present paper yet, but earlier papers I have read were
about reducing interpretation overhead by performing fewer VM instructions (if
VM instruction dispatch is expensive; dynamic superinstructions make it cheap,
but not everyone implements them).

~~~
frjalex
Note that the JIT mechanism mentioned in the paper basically means compiling
program code to the bytecode executed in the VM and directly executing that
(very much like PHP), not necessarily compiling directly to machine code.

------
white-flame
Many stack systems boast about the large number of operations they can perform
per second. However, a stack "operation" is not equivalent to a register
machine operation.

Something like "reg1 = reg2 + 125" is a single instruction in almost every
register ABI, while at the _very, very least_ a stack machine would need
two instructions: "push 125, add". If reg1 isn't immediately accessible as the
stack head, and reg2 isn't consumed as the head, then you also need
load/store/swap/roll/etc. instructions wrapping the addition.

That's up to 6 (or even more) stack instructions to perform what a register
machine can do in 1.

Also, when talking about interpreters, there is a not-insignificant overhead
of fetching and dispatching the next instruction. That overhead is continually
compounded when you need multiple instructions to perform what you could do in
a single operation otherwise. The simpler the instructions are, the higher
percentage of time is "wasted" performing instruction dispatch instead of
carrying out core instruction processing. This can be parallelized in hardware
instruction dispatch, but is pretty firmly stuck being serial in software
interpreters.

Stack machines have very concise source code representations, and their
individual instructions are simple and fast. But they can be very bloated in
terms of the number of native instructions executed to carry out an overall
program's work.

~~~
vilda
You seem to be confusing theoretical stack-based machines, with their clean
design, with their practical implementations. There's no reason why a
stack-based machine cannot have one instruction to add a constant, for example
"addc 125" to add 125 to the top of the stack.

Similarly, there's no reason why register-based VMs wouldn't have push/pop
instructions.

------
Someone
The main result is hardly surprising.
[https://www.usenix.org/legacy/events/vee05/full_papers/p153-...](https://www.usenix.org/legacy/events/vee05/full_papers/p153-yunhe.pdf)
(2005) measured the speed difference at 26.5-32.3%. Given the subject, that's
not significantly different from the 20.39% reported here (and why do they
even report this with two decimals? I'm sure that last digit will change if
they do so much as sneeze at their code)

~~~
throwaway4891a
Yup. Stack-based is great for high-level language VM interface ease-of-use (ie
JVM languages) but generally sucks at the hardware level because most CPUs are
register-based: the mapping doesn't scale as well because of the extra
overhead of optimal scheduling. A good register-based VM would have a large
number of registers/(pseudo)variables and support sequence points to allow
actual instructions to be more optimally-scheduled. Atomics are also nice in
certain edge-cases, but being mostly immutable / lock-free / branch-free is
typically a better strategy to avoid contention and pipeline stalls.

~~~
bogomipz
>" Stack-based is great for high-level language VM interface ease-of-use (ie
JVM languages) but generally sucks at the hardware level because most CPUs are
register-based: the mapping doesn't scale as well because of the extra
overhead of optimal scheduling"

Is this the graph coloring register allocation optimization problem?

------
dfox
I see a few flaws in their approach:

The stack-based virtual machine represents instructions as two-field structs
that are essentially predecoded (and larger in memory), while the register VM
uses what boils down to an array of words, which are then presumably decoded
at run time (the paper even contains some kind of argument for why the code is
represented as an array of single-field structs, but completely neglects to
mention what would happen if the structs contained some kind of pre-decoded
instructions).

The function call mechanism of their stack-based VM seems to be essentially
modeled after how Forth works, which is not how typical stack-based VMs used
in implementations of languages that do not directly expose the VM semantics
work. A typical stack-based VM (e.g. Smalltalk, JVM, CPython...) allocates a
new argument stack for each function invocation (often using alloca() or
such, i.e. on the native control stack) and does not directly use the VM
argument stack to pass parameters and results between functions.

~~~
bogomipz
Forth is a register-based VM, right?

Does it use the VM stack as well, instead of the native stack?

~~~
dreamcompiler
Some Forths compile to native machine instructions and no VM is involved.
Sometimes they compile a list of threaded jump addresses with an "interpreter"
that's basically just two instructions: JSR [nextaddress++]; LOOP. Sometimes
they just compile literal sequences of JSR instructions and there's no
interpreter at all.

~~~
bogomipz
Wow this is really interesting. I didn't realize there were different
implementations.

Do you have any resources regarding these? I would love to learn more about
this.

~~~
tbirdz
Moving Forth is pretty good:
[http://www.bradrodriguez.com/papers/moving1.htm](http://www.bradrodriguez.com/papers/moving1.htm)

There are 8 parts to the series, you can look at all of them and other Forth
writings by the author on his website:
[http://www.bradrodriguez.com/papers/](http://www.bradrodriguez.com/papers/)

~~~
Senji
Seeing as you know something about Forth: is it used anywhere in production,
outside of a purely academic setting?

~~~
abecedarius
Yes, but its heyday was the 70s through mid-80s. The most recent use I'm
familiar with was the OpenBoot firmware in OLPC laptops, though I've been out
of it so long I wouldn't know what's current.

~~~
pktgen
In addition to OLPC, Open Firmware was used on PowerPC Macs.

------
jamesu
Having implemented both a stack-based and a register-based VM, I'd agree with
the sentiment in the title. The register-based implementation (based partly
off Lua 5 bytecode) was notably faster, which I attributed to improved use of
CPU cache combined with each instruction doing more work per memory fetch.

However, I'd also point out the register-based VM was far more difficult to
debug, plus the compiler was much more long-winded to account for register
allocation. Plus, by the time you add on function dispatch overhead and real-
world usage patterns, the gap where the register machine is technically better
(i.e. beyond just doing fibonacci) narrows somewhat.

------
userbinator
Not surprising. Register-based machines in general have higher performance
than stack-based machines, because the register set can be addressed randomly
like RAM whereas in a 'strict' stack machine this isn't possible.

[https://en.wikipedia.org/wiki/Stack_machine#Performance_disa...](https://en.wikipedia.org/wiki/Stack_machine#Performance_disadvantages_of_stack_machines)

Of course the advantages such as easy code generation and high code density
make stack machines a good intermediate compilation target, which then gets
compiled to a register machine for actual execution.

~~~
bogomipz
>"Of course the advantages such as easy code generation and high code density
make stack machines a good intermediate compilation target, which then gets
compiled to a register machine for actual execution"

Are you referring to a specific language here, or are you saying code for
register machines is always compiled to a stack machine first?

~~~
userbinator
In general they do. The JVM and .NET CLR being the most prominent examples.

------
transfire
I should imagine implementing a register-based VM on a register-based CPU is
going to have some advantages over a stack-based VM implemented on a register-
based CPU. I wonder how well a register-based VM would fare on a stack-based
CPU.

~~~
shincert
That's interesting. Are there any actual stack-based CPUs out there today?

~~~
NonEUCitizen
[http://www.excamera.com/sphinx/fpga-j1.html](http://www.excamera.com/sphinx/fpga-j1.html)

~~~
frjalex
That's fascinating. Would love to see what performance such a thing can
actually achieve.

------
Lerc
I remember reading a paper a while ago with similar conclusions.

It concluded that register-based was faster than stack-based, but stack-based
had better code density. The reason seemed to be the branch prediction cost
per instruction: fewer but more powerful instructions reduced the number of
times the instruction decode operation occurred.

------
sitkack
My feeling is that has more to do with memory traffic and the cache hierarchy
than the difference between register and stack VMs. Stack VMs are going to
generate more read-modify-write cycles than a register VM. Of course
experiments are in order (or out of order, ha!).

~~~
questerzen
I imagine you are right. But there should be plenty of scope for optimising
this in the VM itself. Most stack operations are immediately preceded by a
push, so hold the top of the stack in a register and the last push in another,
and you could avoid a high proportion of memory accesses in the most common
cases. Does anyone know if JVM implementations do this?

------
e3b0c
[http://fpgacpu.ca/stack/Second-Generation_Stack_Computer_Arc...](http://fpgacpu.ca/stack/Second-Generation_Stack_Computer_Architecture.pdf)

The article has detailed discussions about the topic.

------
erikb
This reminds me a little of the "NoSQL is faster than SQL" argument a few
years back, or "Firefox is faster than Chrome". Let's just add a "now" and
we're fine. Most programs we have now are so complex that none of them is
really optimized to the max. Therefore the question is basically who spends
more of their resources on optimization.

~~~
iopq
But some problems are very tough to solve in some languages. Like in Java,
it's very hard to avoid allocating everything on the heap. In C++/Rust it's
trivial so you can get that performance boost without crude hacks.

------
CalChris
To quote another post:

> generally sucks at the hardware level because most CPUs are register-based

Whoa, hold on there folks. This is not a _CPU_ register VM. It's a 'register-
based VM' and it isn't even that.

    
    
      iconst 1    vs    set t3, t1
      iconst 2          set t4, t2
      iadd              add t3, t4, t5
    

t1 through t5 are allocated in the stack frame. They are NOT magically somehow
mapped to x86_64 registers R8-12.

Actually, calling this a register-based VM is just wrong. It's lazy thinking
but really it's just wrong. The value of t1 will be in memory.

Yes, it's standard terminology which easily lets people misunderstand what's
happening under the hood, as in the above quoted case. The article doesn't
even say _virtual registers_ which would have helped. It could. It should. It
doesn't.

BTW, a good interpreter will cache the top of stack in a CPU register.

~~~
hyperpape
I believe this is a standard bit of terminology: it means that the virtual
machine is implemented with _virtual_ registers. Google "lua registers", for
instance.

