Hacker News new | past | comments | ask | show | jobs | submit login
Register-based VMs have a higher performance than stack-based VMs (arxiv.org)
119 points by frjalex on Dec 11, 2016 | hide | past | web | favorite | 70 comments

I would love to see the code used for some of these instructions. Their description of "swap" for the stack VM jumped out at me:

> pop 1 st and 2nd value out of local stack and swap them

You don't need to pop at all for swap. The stack isn't actually changing size and can be modified in place. You also should only have one pop for 'add', etc. A lot of people don't seem to realise this.

Also, this article may be of interest:


TL;DR - a stack machine is conceptually simpler than a register machine, and it doesn't matter if it's slower since you are JIT'ing it anyway.

It does matter even for jitting.

Pike and Winterbottom (http://www.vitanuova.com/inferno/papers/hotchips.html) used register-based VM for Plan9/Inferno because it's much faster and easier to write a high-quality jit from register-based VM to register-based CPU (and all of them are register based) than from byte-code VM to register-based CPU.

This is probably the same reason Android chose register-based VM design for their Java-based platform.

Jitting is not free - an increase in jit complexity is paid by lower performance of the code being jitted. Java's HotSpot does wonders but it takes much latency and high memory use to get the really high-quality results.

It makes sense to do as much work as possible where you have time and CPU cycles to spare (during compilation to VM instruction set) and not on the device when every millisecond and every byte matters.

Lower complexity of stack-based VM is a moot point - when you're jitting, you're doing the hard stuff anyway.

I don't follow the reasoning. Tracking the stack depth to convert into the equivalent of Pike and Winterbottom's VM registers takes at most one instruction at JIT time per stack-VM instruction (to increment or decrement the depth). If you want distinguish different types at the same stack offset, that might be one more instruction. Pike and Winterbottom seem to be saying their VM has an infinite number of registers, so the frontend is simple, and then the JIT backend has a choice of how hard to work at register allocation. But the same goes for a stack machine, with a very small overhead at JIT time, nothing like the expense of HotSpot.

The argument that does make sense to me is, if you want a somewhat faster interpreter at some expense in code size, then a register VM wins. (Perhaps by not as much as commonly thought, because of some optimizations like stack caching and superinstructions: http://savannah.gnu.org/projects/vmgen/ But it does seem to win still.) This has the disadvantage of having to having to implement spilling or some other such complication in the frontend when the number of virtual registers needed exceeds 256 or whatever.

Well we are both just appealing to experts here (unless you are infact one. I am certainly not - just an enthusiast).

What did you think of the last part of Erik Lipperts post?

The code is much smaller and much easier to understand. A stack machine is a very simple way to describe a complex computation; by being able to write code for such a simple machine, it lowers the cost of making a compiler. And not only is it easier to write compilers and jitters that target simple stack machines, it is easier to write other code analysis tools as well. The IL verifier, for example, can quickly determine when there is a code path through a method that, say, misaligns the stack, or passes the wrong types of arguments to a method.

The stack machine may be conceptually simple, but here's the thing:

Either you have a way to address an offset into the stack, or you don't. If you do, the stack is conceptually equivalent to a register file where you can optionally swap out the entire content in one go (by changing the stack pointer).

If you don't, then registers will in most designs offer you a strict superset of operations: I've yet to see any register-based machines where you can't load things into registers, do the operation and push them back on the stack. In fact, on most machines that are not purist RISC designs, you will tend to be able to do many/most operations with one of the operands in memory.

That superset comes at a complexity cost in some areas, sure, but being able to address more different data items without having to manipulate the stack also turns out to be very handy. If it wasn't, we'd just stick to mostly manipulating the stack even in register-based designs.

EDIT: In fact, you'll often see "naive" first code generators for register-based machines heavily rely on stack manipulation for expression evaluation because it massively reduces the need for register allocation complexity.

Also, it's worth noting that computation stack and execution stack need not be the same thing (and, indeed, often aren't).

In CLR, for example, local variables are referenced by index on the execution stack, but you cannot dynamically push new variables onto it - you have to predeclare the locals at the beginning of your execution frame, and the only way to push a new execution frame is by calling another function. On the other hand, opcodes like ADD operate on the computation stack, which is an abstraction completely separate from, and not affected by, the execution stack (every execution frame has its own independent computation stack; there's no way to peek across the boundary).

So in reality it's a bit more complicated - those locals are also kinda like named (or indexed) registers (at least until you take their address). Then the stack adds more temp registers on top of that. In this model, you could always take any register-based code, and translate it to stack-based with a very blunt transform, where something like "ADD x, y, z" becomes "LOAD x; LOAD y; ADD; STORE z". But in addition to that, you also have the flexibility of doing a bunch of operations directly on the temp stack, without intervening stores and loads, that are unavoidable in the register representation. So I would argue that it's the stack representation that is a superset of register one, not the other way around.

An even shorter way to say this is that stack-based VMs are more high-level. With a register-based VM, you shift some important questions (like register allocation, and handling temporaries) to the compilers. With a stack-based VM, the compiler can be simpler, but the JIT will need to be correspondingly more complicated.

Which one to choose depends on how it is all used. For .NET, multi-language targeting was seen as an important design goal, and so making the bytecode easier to target was prioritized, with all the complexity relegated to JIT.

Long-term, I think it was a right solution in general, because most platforms seem to be moving from JIT to AOT; and with AOT, having the complexity in the bytecode-to-native compiler still has the upside of DRY, and slower compile times stop being a downside.

However, if you leave register allocation to the source->VM compiler and don't do it in the VM->machine compiler (JIT), you cannot perform machine-specific register allocation (8 registers on IA-32, 16 on AMD64 and ARM, 32 on Aarch64). So register-based VMs typically work with large register sets, and JITs allocate them to the smaller register sets of real machines.

I have not read the present paper yet, but earlier papers I have read were about reducing interpretation overhead by performing fewer VM instructions (if VM instruction dispatch is expensive; dynamic superinstructions make it cheap, but not everyone implements them).

Note that the JIT mechanism mentioned in the paper meant basically JIT from a program code to the bytecode executed in the VM, and directly executing it (very much like PHP), not necessarily directly to machine assembly.

Interesting.... the paper referenced you if you're the correct Anton Ertl :)

Interesting expressions usually use local variables. JVM, CLR etc. despite being nominally stack-oriented, use local variable slots by index rather than shuffling the stack for anything much more complicated than eager expression evaluation.

This gets the best of both worlds in practice.

The last page of the publication has a GitHub link. Here are the relevant lines (I think):


Yeah, really naive. Ditto with their iadd commands, which pop twice (you need only pop once). I see this a lot - presumably because most people just aren't familiar with stack machines, as register machines are completely dominant in hardware.

See https://news.ycombinator.com/item?id=13155259 for an example of what I mean.

> TL;DR - a stack machine is conceptually simpler than a register machine, and it doesn't matter if it's slower since you are JIT'ing it anyway

It totally matters. See [1] for previous work on this topic. You need to perform expensive data flow optimisations to generate information the stack format doesn't need to encode, but that a register format does need. Better to do this analysis ahead of time in the compiler than at runtime.

[1] https://www.usenix.org/events/vee05/full_papers/p153-yunhe.p...

IIRC, strictly speaking stack machines are not allowed to modify values in place on the stack. Any decent JIT or interpreter is likely to recognize that it can optimize certain operations by performing them in place, but the bytecode/IL/whatever will always be "pop two values and swap them".

> IIRC, strictly speaking stack machines are not allowed to modify values in place on the stack.

I am curious as to where you heard this. Modifying the stack in place seems like an implementation detail. I mean even if swap wasn't a primitive (which again seems really far fetched), you are still going to have to modify part of the stack in place eventually. It's just in the naiive way you decrement the stack pointer twice then increment it twice, while in the way I outline you don't bother since the length of the stack is the same after the operation is completed.

Probably easier to write less and pseudo code more - I am assuming a downward growing stack here

    # First approach
    temp_a = pop(stack)
    temp_b = pop(stack)
    push(stack, temp_a)
    push(stack, temp_b)

    # second approach
    temp_a = stack[0]
    stack[0] = stack[1]
    stack[1] = temp_a

A stack is an abstract type having two operations: push(element) and pop(). The implementation can cheat till it's transparent to the user. To quote CLRS,

The INSERT operation on a stack is often called PUSH, and the DELETE operation, which does not take an element argument, is often called POP. These names are allusions to physical stacks, such as the spring-loaded stacks of plates used in cafeterias. The order in which plates are popped from the stack is the reverse of the order in which they were pushed onto the stack, since only the top plate is accessible.

"A stack" is an abstract type, but "The Stack" isn't.

"The Stack" of a virtual machine definitely is.

> I am curious as to where you heard this. Modifying the stack in place seems like an implementation detail.

You are correct, but as the other reply mentioned, when we're talking about abstract specifications a stack typically doesn't allow reads and writes to arbitrary indices, therefore a stack machine can't perform an in-place swap. The implementation will almost certainly optimize the operation as you're describing, but I'd still expect to see a swap operation defined as "pop pop push push". Likewise I would usually expect to see an add described as popping two arguments and pushing one result, even thought the implementation probably pops one argument, then reads the second and overwrites it with the result.

In short, it's the abstract description of the thing vs the actual implementation.

I don't see why the abstract specification should mention pushing or popping at all. The abstract specification for swap is simply "swaps the first and second elements of the stack". How that's achieved doesn't seem important, and it seems odd to me to define it in terms of primitive operations it's not actually defined in terms of - unless it's some ultra minimalist Forth thing where swap isn't a primitive.

I can only partially agree with this - having a stack exposed to modification would virtually make it not a stack.

> You don't need to pop at all for swap.

Your skepticism is justified: https://github.com/frjalex/Conceptum/blob/master/src/main.c#...

(note that this doesn't implement a swap by any means)

Many stack systems boast about the large number of operations they can perform per second. However, a stack "operation" is not equivalent to a register machine operation.

Something like "reg1 = reg2 + 125" is a single instruction in almost every register ABI, while at they very, very least, a stack machine would need two instructions: "push 125, add". If reg1 isn't immediately accessible as the stack head, and reg2 isn't consumed as the head, then you also need load/store/swap/roll/etc stuff wrapping the addition.

That's up to 6 (or even more) stack instructions to perform what a register machine can do in 1.

Also, when talking about interpreters, there is a not-insignificant overhead of fetching and dispatching the next instruction. That overhead is continually compounded when you need multiple instructions to perform what you could do in a single operation otherwise. The simpler the instructions are, the higher percentage of time is "wasted" performing instruction dispatch instead of carrying out core instruction processing. This can be parallelized in hardware instruction dispatch, but is pretty firmly stuck being serial in software interpreters.

Stack machines have very concise source code representations, and their individual instructions are simple and fast. But they can be very bloated in terms of the number of native instructions executed to carry out an overall program's work.

You seem to be confused with theoretical stack-based machines with their clear implementation, and their practical implementations. There's no reason why stack-based machine cannot have one instruction to add a constant, for example "addc 125" to add 125 to the top of the stack.

Similarly, there's no reason why register-based VMs wouldn't have push/pop instructions.

The main result is hardly surprising. https://www.usenix.org/legacy/events/vee05/full_papers/p153-... (2005) measured the speed difference at 26.5-32.3%. Given the subject, that's not significantly different from the 20.39% reported here (and why do they even report this with two decimals? I'm sure that last digit will change if they do so much as sneeze at their code)

Yup. Stack-based is great for high-level language VM interface ease-of-use (ie JVM languages) but generally sucks at the hardware level because most CPUs are register-based: the mapping doesn't scale as well because of the extra overhead of optimal scheduling. A good register-based VM would have a large number of registers/(pseudo)variables and support sequence points to allow actual instructions to be more optimally-scheduled. Atomics are also nice in certain edge-cases, but being mostly immutable / lock-free / branch-free is typically a better strategy to avoid contention and pipeline stalls.

>" Stack-based is great for high-level language VM interface ease-of-use (ie JVM languages) but generally sucks at the hardware level because most CPUs are register-based: the mapping doesn't scale as well because of the extra overhead of optimal scheduling"

Is this the graph coloring register allocation optimization problem?

I see few flaws in their approach:

The stack based virtual machine represents instructions as two field structs that are essentially predecoded (and larger in memory) while the register VM uses what boils down to array of words which are then presumably decoded at run time (the paper even contains some kind of argument why the code is represented as array of single-field structs, but completely neglects to mention what would happen if the structs contained some kind of pre-decoded instructions).

The function call mechanism of their stack based VM seems to be essentially modeled after how forth works, which is not how typical stack based VM used in implementation of languages that do not directly expose the VM semantics works. Typical stack based VM (eg. Smalltalk, JVM, CPython...) allocates new argument stack for each function invocation (often using alloca() or such, ie. on native control stack) and does not directly use the VM argument stack to pass parameters and results between functions.

How do most stack based VM represent instructions? When I went through a phase of writing VMs, the approach I used was to represent everything with a single byte. if a "push" byte was encountered, it would scan ahead 8 more instructions and construct an int. So it had the overhead of having to decode and encode actual data, but the instruction stream was very dense.

I imagine in the real world, they use something like

    union instruction_packet {
        int64 value;
        int8[8] instructions;
Which would be less space efficient (when you hit a push instruction you'd need to move to the next packet, even if it was the first element of the array) but gets rid of the decoding/encoding problem.

The JVM uses an 8-bit opcode, of which 202 are defined. Some opcodes have implicit operands (e.g., iload_1, load the first local variable as integer on the stack or iconst_m1, push -1 on the stack), while others take usually an index that's usually 1 byte (or 2 byte for jump offsets)--although there's a wide instruction that makes the next instruction use 2-byte operands instead.

Instruction decoding for the entire JVM instruction set save tableswitch and lookupswitch amounted to about 50 lines of code, not counting the enum listing every opcode.

I more meant - what shape is in the instruction stream? Is it literally just a stream of individual bytes, that have to be decoded when pushing constants etc?

The Code attribute in the JVM spec consists of the number of local variable slots and the maximum stack depth, a byte array, and a collection of attributes to represent things like the stack map or exception handling tables.

Most of the opcodes have zero or one operands, usually a single byte (or two bytes with a prefix) that's an index into the local variable array or the constant pool. There's several ways to load constants: the integer constants -1 through 5, long constants 0 and 1, float and double 0, 1, and 2 all have 0-operand opcodes (e.g., iconst_0); there's an opcode to load a 1-byte immediate, another to load a two-byte immediate, and an opcode to load a constant in the constant pool (which can be a string, a Class<?> reference, an integer, long, double, or float).

It should be noted that the JVM specifies a fixed big-endian format in its bytecode, and as a result, even the two-byte immediates are specified in the manual as two operands of a single byte each.

That's one possible design, but it's common to use a constant pool. Placing constants in the instruction stream is possible but awkward, since there is no guarantee that they are either aligned or of the right endianness (assuming you are trying to write a more or less portable system).

Constants that fit in a byte don't have this problem and can be easily placed in the instruction stream using dedicated byte constant instructions, sometimes including dedicated single-byte instructions for common constants like zero, one, etc.

Forth a register-based VM right?

Does it the VM stack as well instead of the native stack?

Some Forths compile to native machine instructions and no VM is involved. Sometimes they compile a list of threaded jump addresses with an "interpreter" that's basically just two instructions: JSR [nextaddress++]; LOOP. Sometimes they just compile literal sequences of JSR instructions and there's no interpreter at all.

Wow this is really interesting. I didn't realize there were different implementations.

Do you have a resources regarding these? I would love to learn more about this.

The wikipedia entry [0] contains some refs. The out-of-print book "Threaded Interpretive Languages" [1] was the definitive treatment of the topic, and what I mostly used back when I was writing Forth compilers.

[0] https://en.m.wikipedia.org/wiki/Threaded_code

[1] https://www.amazon.com/Threaded-Interpretive-Languages-Desig...

Moving Forth is pretty good: http://www.bradrodriguez.com/papers/moving1.htm

There are 8 parts to the series, you can look at all of them and other Forth writings by the author on his website: http://www.bradrodriguez.com/papers/

Thanks for the links!

For anyone else interested I found this book which is about hardware-based stack machine sand Forth.

To quote the book:

"While some of the material is obviously dated and new work has been done in this area, I believe that the book remains the principal reference work on Forth-style stack computers. "


It is also available on Amazon for a dollar.

The author's page:


Seeing as you know something about Forth. Is it used somewhere in production outside of purely academic setting?

FreeBSD bootloader uses Forth:


Also, while it's not Forth, it could be argued that the RPL language as used on HP scientific calculators, is a kind of higher-level Forth produced by cross-breeding with Lisp. Indeed, its name, while officially not an acronym, originated as "Reverse Polish Lisp". Its implementation is also kinda awesome, because it has two layers - User RPL, which is more high-level and completely memory-safe, with error checking everywhere, and is exposed directly to calculator users; and System RPL, which basically just compiles the source directly to a bunch of call opcodes (so if you mismatch the stack at any point, it just crashes), and which is used to implement the calculator OS and stock apps, as well as User RPL.


Yes, but its heyday was the 70s through mid-80s. The most recent use I'm familiar with was the OpenBoot firmware in OLPC laptops, though I've been out of it so long I wouldn't know what's current.

In addition to OLPC, Open Firmware was used on PowerPC Macs.

Anyone that worked with Sun Sparc hardware probably remembers the PROM monitor(SUN's BIOS basically) which was written in Forth and had a Forth interpreter. There's a couple of questions on this FAQ about Forth:


Not as much nowadays as back when embedded computers were severely memory-limited. In those days, you could use Forth and get a semi-high-level, interactive REPL language with a resident compiler in a few K of memory. Now you would usually just use C or Python because memories are huge.

While we're on the subject, Forth is interesting because as a stack-based language, it forces you to structure your programs as function compositions. This was not a popular idea in the 80s, but with the increasing popularity of functional programming, it's relevant again.

One more I forgot: Postscript (the full language of which PDF is a simplification). It's not Forth per se, but it's very Forth-like in much the same way Clojure is Lisp-like.

I know of at least 2 commercial implementations: VFX Forth by mpe[0] and SwiftForth by Forth Inc[1]. Both companies are still in business, selling their tools for hundreds to thousands of dollars, so there is obviously some people using them. I think they have a partial customer listing somewhere on their site. Oh, and there's also gforth of course.

Then again, part of the charm of forth is how easily it is to bootstrap your own forth on your own system. There could be countless homegrown forths out there, in addition to these commercial offerings.

I haven't used forth for anything in production, myself. I just think it's a really interesting language. It's a great "what if" question. What if everything built on C had been built on Forth instead? What if all the work in improving the C language, tooling, operating systems, libraries, hardware etc had been spent improving Forth and stacks instead? It's fun to ponder.

If you want to read more about forth, check out Thinking Forth[2]. It's really got some great insights about programming and software design in general, not just applicable to forth programming in particular.

Starting Forth[3] is also good more of a beginner's introduction to forth. And Moore's own Problem Oriented Language is a good book talking about what his intentions were when he created forth[4].

For more reading: This page has a lot of good forth links too [5] Chuck Moore's color forth blog[6] This site has lots of interviews with Chuck Moore on it and other forth opinion pieces[7] Forth Interest Group site[8]

There's also a new forth standard that's being worked on[9]. Seems pretty active, too.

[0] http://www.mpeforth.com/software/pc-systems/vfx-forth-for-wi...

[1] https://www.forth.com/swiftforth/

[2] http://thinking-forth.sourceforge.net/

[3] https://www.forth.com/starting-forth/0-starting-forth/

[4] http://www.forth.org/POL.pdf

[5] http://www.complang.tuwien.ac.at/projects/forth.html

[6] https://web.archive.org/web/20160305013837/http://www.colorf...

[7] http://www.ultratechnology.com/

[8] http://www.forth.org/

[9] https://forth-standard.org/

Forth's machine model has two stacks, one for return addresses and one for data, plus random-access memory. In some small Forth machines, these three memories were physically separate, and all three had an access on each machine cycle.

AFAIK Forth is a stack-based language, so most VMs are stack-based.

I might be wrong.

In any case I'm very interested to learn about what register-based Forth VMs are out there!

Having implemented both a stack and a register-based VM I'd agree with the sentiment in the title. The register-based implementation (based partly off LUA 5 bytecode) was notably faster which I attributed to improved use of CPU cache combined with the instruction doing more work in per memory-fetch.

However I'd also point out the register-based VMs was far more difficult to debug, plus the compiler was much more long-winded to account for register allocation. Plus by the time you add on function dispatch overhead and real-world usage patterns, the gap where the register machine is technically better (i.e. besides just doing fibonacci) narrows somewhat.

Not surprising. Register-based machines in general have higher performance than stack-based machines, because the register set can be addressed randomly like RAM whereas in a 'strict' stack machine this isn't possible.


Of course the advantages such as easy code generation and high code density make stack machines a good intermediate compilation target, which then gets compiled to a register machine for actual execution.

>"Of course the advantages such as easy code generation and high code density make stack machines a good intermediate compilation target, which then gets compiled to a register machine for actual execution"

Are you referring to a specific language here or are you saying all register machine always compile to stack machines first?

In general they do. The JVM and .NET CLR being the most prominent examples.

I should imagine implementing a register-based VM on a register-based CPU is going to have some advantages over a stack-based VM implemented on a register-based CPU. I wonder how well a register-based VM would fair on a stack-based CPU.

That's interesting. Are there any actual stack-based CPUs out there today?

That's fascinating. Would love to see what performance such thing can actually perform.

>"I should imagine implementing a register-based VM on a register-based CPU is going to have some advantages over a stack-based VM implemented on a register-based CPU."

Can you elaborate why is that? What would those advantages be in practical terms?

I remember reading a paper a while ago with similar conclusions.

It concluded that register was faster than stack based but stack based had better code density. The reason seemed to be due to the branch prediction cost per instruction. Fewer but more more powerful instructions reduced the number of times the instruction decode operation occurred.

My feeling is that has more to do with memory traffic and the cache hierarchy than the difference between register and stack VMs. Stack VMs are going to generate more read-modify-write cycles than a register VM. Of course experiments are in order (or out of order, ha!).

I imagine you are right. But there should be plenty of scope for optimising this in the VM itself. Most stack operations are immediately preceded by a push, so hold the top of the stack in a register and the last push in another and you could avoid a high proportion of memory calls in the most common cases. Does anyone know if JVM implementations do this?

From the Wikipedia page on the JVM: "[code verification]...allowing the JIT compiler to transform stack accesses into fixed register accesses. Because of this, that the JVM is a stack architecture does not imply a speed penalty for emulation on register-based architectures when using a JIT compiler."


The article has detailed discussions about the topic.

This reminds me a little of the "nosql is faster than sql" argument a few years back, or "Firefox is faster than Chrome". Let's just add a "now" and we're fine. Most programs we have now are so complex that none of them is really optimized to the max. Therefore the question is basically who spends more of his ressources on optimization.

But some problems are very tough to solve in some languages. Like in Java, it's very hard to avoid allocating everything on the heap. In C++/Rust it's trivial so you can get that performance boost without crude hacks.

To quote another post:

> generally sucks at the hardware level because most CPUs are register-based

Whoa, hold on there folks. This is not a CPU register VM. It's a 'register-based VM' and it isn't even that.

  iconst 1    vs    set t3, t1
  iconst 2          set t4, t2
  iadd              add t3, t4, t5
t1 through t5 are allocated in the stack frame. They are NOT magically somehow mapped to x86_64 registers R8-12.

Actually, calling this a register-based VM is just wrong. It's lazy thinking but really it's just wrong. The value of t1 will be in memory.

Yes, it's standard terminology which easily lets people misunderstand what's happening under the hood, as in the above quoted case. The article doesn't even say virtual registers which would have helped. It could. It should. It doesn't.

BTW, a good interpreter will cache the top of stack in a CPU register.

I believe this is a standard bit of terminology: it means that the virtual machine is implemented with virtual registers. Google "lua registers", for instance.

There's a specific issue in implementing / caching the top of the stack with registers. (I'm assuming that you're talking about stuffs related to C's `register` keyword here) CPU registers don't have memory addresses so it would be basically impossible to achieve that, as in the stack machine implementation of this paper, void pointers are stored in stacks

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact