
> Pretty much every non-native language is being compiled first to a stack language that looks a lot like a shitty version of Forth as the IR.

It is not clear what you mean by that. I assume you mean that the CLR and the Java runtime use a stack-based instruction set and are common compilation targets, which is true.

But there are lots of other virtual machines whose runtimes are not stack-based. For example, the Lua 5.0 runtime is register-based. Another example of a register-based virtual machine is Parrot.
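
To make the contrast concrete, here is a minimal sketch in Python (opcode names are invented; LOADK loosely echoes Lua's) of the same addition run on a tiny stack machine and a tiny register machine:

  # Toy stack machine: operands are implicit (always the stack top).
  def run_stack(code, consts):
      stack = []
      for op, arg in code:
          if op == "PUSH":
              stack.append(consts[arg])
          elif op == "ADD":
              b, a = stack.pop(), stack.pop()
              stack.append(a + b)
      return stack[-1]

  # Toy register machine: every instruction names its operands.
  def run_reg(code, consts, nregs):
      regs = [0] * nregs
      for op, dst, a, b in code:
          if op == "LOADK":              # load a constant into a register
              regs[dst] = consts[a]
          elif op == "ADD":
              regs[dst] = regs[a] + regs[b]
      return regs[0]

  consts = [2, 3]
  print(run_stack([("PUSH", 0), ("PUSH", 1), ("ADD", None)], consts))  # 5
  print(run_reg([("LOADK", 1, 0, 0), ("LOADK", 2, 1, 0),
                 ("ADD", 0, 1, 2)], consts, 3))                        # 5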



To be fair, just about all of the most widely used bytecode VMs are stack-based. The JVM, CLR, CPython and YARV are all stack-based. You're basically describing the exceptions.


> To be fair, just about all of the most widely used bytecode VMs are stack-based. The JVM, CLR, CPython and YARV are all stack-based. You're basically describing the exceptions.

Be very cautious with this kind of statement. Dalvik (used on Android), LLVM (Low Level Virtual Machine) and Erlang's virtual machine are also register-based (I just forgot about them when writing the above post). So I would rather claim it is roughly 50-50. I could write something about the advantages and disadvantages of both approaches (stack-based vs register-based), but this would become off-topic.


> I could write something about the advantages and disadvantages of both approaches (stack-based vs register-based), but this would become off-topic.

But still very interesting. And this subthread has been discussing Forth, so I argue that at least an overview discussion isn't out of order.

As I understand it, stack-based software and hardware architectures have been shown to perform poorly compared to traditional register-based approaches. (I specifically read somewhere that even stack hardware has been shown to be suboptimal, but I unfortunately don't remember where, and the source didn't say whether the stack architecture in question was pipelined or otherwise enhanced, or just a dumb basic 70s/80s design.) In any case, modern (pipelined) register hardware doesn't particularly like the L1/L2/etc. cache thrashing that stack dancing creates.

Obviously with registers you have a bunch of switches that have theoretically O(0) access time regardless of execution machine state. At least that's how I mentally model it, superficially; I presume 0 becomes 0.nnnnnnnn for fluctuating values of nnnn depending on pipeline fullness and other implementation details. (And I use floating point here just as an aesthetic representation of hard-to-quantify deviations from theoretical perfect efficiency.)

I feel like there's gotta be some kind of cool middle ground between stack-like architectures and register architectures that nobody's found yet, but I do wonder if I'm just barking up an impossible tree. Or is there any research being done in these areas?

The main problem I see with stack architectures is that they're effectively impossible to optimize (i.e. build a pipeline) for. If all you're dealing with is the top $however_many_things_the_current_word_pops_and_pushes items on the stack (which, to clarify, the hardware can't even know, because that information is locked away in the implementation of the currently-executing word), well... you're in an impossible situation. For want of a better way to express it, the attention span of the implementation is inefficiently small.
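
A toy way to see the serialization (my own sketch, not a real pipeline model): measure the longest chain of instructions that must run strictly one after another.

  # Depth of each value = 1 + depth of its deepest input; the critical
  # path is the deepest value, i.e. the minimum number of serial steps.
  def critical_path(instrs):
      depth, longest = {}, 0
      for dst, srcs in instrs:
          t = 1 + max((depth.get(s, 0) for s in srcs), default=0)
          depth[dst] = t
          longest = max(longest, t)
      return longest

  # (a + b) * (c + d) in register style: the two adds are independent.
  reg = [("r1", ["a", "b"]), ("r2", ["c", "d"]), ("r0", ["r1", "r2"])]

  # Same computation in stack style: push a, push b, add, push c,
  # push d, add, mul -- every op reads/writes the stack top, so each
  # instruction depends on the one before it.
  stk = [("s1", []), ("s2", ["s1"]), ("s3", ["s2"]), ("s4", ["s3"]),
         ("s5", ["s4"]), ("s6", ["s5"]), ("s7", ["s6"])]

  print(critical_path(reg))  # 2 -- the adds can overlap
  print(critical_path(stk))  # 7 -- fully serialized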

Anyway, this is some of how I see it. What are your thoughts?


If you'd like to learn about a CPU arch that is neither register nor stack based, watch the videos for the Mill and in particular the belt.


Ooh, alright then. Thanks.


> > I could write something about the advantages and disadvantages of both approaches (stack-based vs register-based), but this would become off-topic.

> But still very interesting. And this subthread has been discussing Forth, so I argue that at least an overview discussion isn't out of order.

I think this whole topic is more complicated than the points you gave.

I'll start with an example: the JVM takes the stack-based instructions and JIT-compiles them (in principle AOT compilation is also possible) into (register-based) native instructions of the CPU. For this, of course, lots of transformations have to be done. So executing the stack-based VM instructions naively, one after another, suits the CPU badly - but this does not matter, thanks to modern JIT compilers, which transform the code completely.
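
For the curious, a sketch of (what I understand to be) the basic trick: walk the stack code with a symbolic stack that holds virtual-register names instead of values, and emit three-address code as you go. Names and opcodes here are invented:

  def stack_to_registers(code):
      """Abstractly interpret stack bytecode, emitting 3-address code."""
      sym, out, fresh = [], [], 0
      for op, arg in code:
          if op == "PUSH":
              out.append(("loadk", f"v{fresh}", arg))
              sym.append(f"v{fresh}")
              fresh += 1
          elif op in ("ADD", "MUL"):
              b, a = sym.pop(), sym.pop()
              out.append((op.lower(), f"v{fresh}", a, b))
              sym.append(f"v{fresh}")
              fresh += 1
      return out

  # (2 + 3) * 4 as stack bytecode:
  prog = [("PUSH", 2), ("PUSH", 3), ("ADD", None),
          ("PUSH", 4), ("MUL", None)]
  for ins in stack_to_registers(prog):
      print(ins)
  # ('loadk', 'v0', 2)  ('loadk', 'v1', 3)  ('add', 'v2', 'v0', 'v1')
  # ('loadk', 'v3', 4)  ('mul', 'v4', 'v2', 'v3')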

One clear advantage of stack-based VM instruction sets is that there are far fewer "parameters to decide about". If you go register-based:

- How many registers? 8? 16? 32? 256 (i.e. a large number that can nevertheless be reached by real, though artificial programs)? "Near infinite" (say 2^31 or 2^32)?

- What register sizes? ARM has only 32-bit (and in AArch64, 64-bit) integer registers. x86 has 8-, 16-, 32- and 64-bit registers.

- Should one allow interpreting the lower bits of a large register as a smaller one? What should happen if we load a constant into such a subregister: will the upper bits be left unchanged (on a real CPU this can lead to a pipeline stall), zero-extended, or sign-extended?

- What about floating-point registers: should one be able to interpret them as integers (encouraged on SSE/AVX (x86), but discouraged on ARM)?

- If we consider SIMD registers to be useful: What size?

- Do we want 2-operand or 3-operand instructions? 2-operand instructions have the advantage that the graph coloring problem used for allocating CPU registers can be solved in polynomial time, since the graph is chordal. This is also (before AVX) the preferred instruction format on x86. 3-operand instructions have the disadvantage that the graph coloring problem is NP-hard, so in practice heuristics are often used. 3-operand instructions are common on RISC and RISC-inspired processors (e.g. the ARM A32 and A64 instruction sets; note however that I think T32 uses 2-operand instructions).

As I pointed out, this really large design space forces you to make lots of inconvenient decisions. I think this was a problem for the Parrot VM, which introduced lots and lots of different instructions. So if you want to keep the VM portable across lots of architectures, a stack-based approach is more convenient (I don't claim "better"). This was, I believe, one reason why Java bytecode was designed stack-based.

On the other hand, if you do it the right way, register-based code tends to be more compact and is simpler to transform into machine code. These are surely central reasons why a register-based implementation was chosen for Dalvik (Android) and the Lua VM.

Moreover, to run stack-based code fast you typically have to apply a lot more transformations to it - which one would love to avoid, in particular on embedded/small systems. So in some sense one can argue that a register-based VM instruction set is the more low-level approach to designing VMs: many more decisions to make (the best choices tend to depend on the primary CPUs you want to target), but fewer code transformations to do at runtime.


> I think this whole topic is more complicated than the points you gave.

And now I've read your reply I agree. Thanks very much for the additional considerations.

I had no idea the JVM applies analysis to the stack VM state to turn it into register-based code. I realize it's a JIT, but I never really thought through the implications.

Regarding register vs stack - don't stack-based systems also have to decide about stack-item size? I'm not sure how this works in practice, but surely size considerations get factored in at some point.

Regarding the 2/3-operand instruction problem: this is very interesting, but I must admit I need to do some reading about graph theory. I do very vaguely understand it, but for example https://en.wikipedia.org/wiki/Graph_theory doesn't mention the word "chord" anywhere.

This indeed is a complex problem, and thanks very much for illustrating the difficulty.


> Regarding the 2/3-operand instruction problem: this is very interesting, but I must admit I need to do some reading about graph theory. I do very vaguely understand it, but for example https://en.wikipedia.org/wiki/Graph_theory doesn't mention the word "chord" anywhere.

Concerning chordal graphs:

> https://en.wikipedia.org/w/index.php?title=Chordal_graph&old...

Concerning using graph coloring for register allocation:

> https://en.wikipedia.org/w/index.php?title=Register_allocati...

I remark that, in particular on architectures that have lots of different register sizes and the ability to reinterpret parts of registers as subregisters (such as x86), graph coloring is only a rough approximation of register allocation; there one has to develop more complicated models.
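
For the simple uniform-register case, a minimal greedy-coloring sketch (assuming the interference graph has already been built; real allocators are far more involved):

  # Greedy coloring of an interference graph: variables that are live
  # at the same time interfere and must get different registers.
  def color(interference, k):
      regs = {}
      # Handle high-degree (most-constrained) variables first.
      for var in sorted(interference, key=lambda v: -len(interference[v])):
          taken = {regs[n] for n in interference[var] if n in regs}
          free = [r for r in range(k) if r not in taken]
          if not free:
              raise RuntimeError(f"{var} must be spilled to memory")
          regs[var] = free[0]
      return regs

  # a interferes with b and c; b and c don't overlap, so 2 regs suffice.
  g = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
  print(color(g, 2))   # {'a': 0, 'b': 1, 'c': 1}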


Thanks! :) I'll admit this'll take me a while to get my head around, but it's a very interesting subject.


I don’t see how a stack-based architecture avoids the need to think about register sizes. You just get equivalent questions, like - what sizes of integers do you have primitive ops for? do you have separate pop/swap/drop/etc. for every possible size, and if not, what’s the standard size of a stack item?


> I don’t see how a stack-based architecture avoids the need to think about register sizes. You just get equivalent questions, like - what sizes of integers do you have primitive ops for? do you have separate pop/swap/drop/etc. for every possible size, and if not, what’s the standard size of a stack item?

This is correct. But these are still fewer decisions, e.g.:

- no decision about the number of registers,

- no reinterpretation of parts of registers,

- SIMD fits the stack-based model rather badly,

- stack-based VM instruction sets are typically of the kind "take two values from the top of the stack, do something with them and push the result back" (very much inspired by Forth - though I don't know much Forth); see for example the Java bytecode instructions (https://en.wikipedia.org/wiki/Java_bytecode_instruction_list...) or the CIL instructions (https://en.wikipedia.org/wiki/List_of_CIL_instructions and http://download.microsoft.com/download/7/3/3/733ad403-90b2-4...), so no worrying about 2-operand vs 3-operand instructions.


I said non-native, sort of thinking of LLVM. And to be fair, Dalvik bytecode is compiled from JVM bytecode, so a stack-based language was used as an IR in that language stack.


I believe that many compilers have an intermediate step that looks like a stack based language.


> I believe that many compilers have an intermediate step that looks like a stack based language.

I admit that in the '80s many compilers used something reminiscent of a stack-based language for the code generation phase. The reason is that it is rather easy to write a code generator for an accumulator machine (do it as an exercise if you want - or see the sketch below).
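
In case anyone wants the exercise spelled out, here is a minimal sketch of such a code generator (Python, with invented mnemonics: LOAD #n loads a constant into the accumulator, PUSH saves the accumulator on a stack, ADDS/MULS combine the accumulator with the popped value):

  # Expression tree -> accumulator-machine code. Invariant: after
  # gen(e) runs, the value of e is in the accumulator.
  def gen(e, out):
      if isinstance(e, int):                 # leaf: a constant
          out.append(f"LOAD #{e}")
      else:                                  # node: (op, left, right)
          op, l, r = e
          gen(r, out)
          out.append("PUSH")                 # save the right operand
          gen(l, out)                        # left operand into acc
          out.append({"+": "ADDS", "*": "MULS"}[op])  # acc op= pop()
      return out

  # (1 + 2) * 3
  print("\n".join(gen(("*", ("+", 1, 2), 3), [])))
  # LOAD #3 / PUSH / LOAD #2 / PUSH / LOAD #1 / ADDS / MULS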

But this is not what typical modern compilers look like. Just to give one piece of evidence up front: modern processors have many more general-purpose registers (x86-64 has 16 and AArch64 has 32, though some have reserved purposes), so such code would waste those possibilities.

A typical modern compiler looks like this (I am aware that your preferred compiler might have additional or somewhat different stages, but the following is typical):

Frontend (very language-dependent):

  - Tokenize input
  - Parse input (or report syntactical errors) to generate parse tree
  - Generate Abstract Syntax Tree (AST)
  - Do semantic analysis/type checking
  - Transform the program into static single assignment (SSA) form
Middleend (not much dependent on language or machine):

  - Do optimizations on SSA code (e.g. liveness analysis/dead code elimination, constant propagation, peephole optimizations, perhaps replacing whole algorithms)
Backend (very machine-dependent):

  - Do machine-dependent optimizations
  - Do register allocation
  - Generate machine-dependent code
As one can see, a stack-based intermediate step does not appear here (and is to my knowledge uncommon) - instead, transformations on code in SSA form are common.
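
To illustrate what such SSA passes buy you, a toy constant-propagation pass over straight-line SSA code (my own notation, not any particular compiler's IR). Because each SSA name is assigned exactly once, a name either is a known constant or it isn't:

  def const_prop(ssa):
      known, out = {}, []
      for dst, op, args in ssa:
          vals = [known.get(a, a) for a in args]
          if op == "const":
              known[dst] = vals[0]
          elif all(isinstance(v, int) for v in vals):     # fold it
              known[dst] = vals[0] + vals[1] if op == "add" else vals[0] * vals[1]
          else:                                           # keep it
              out.append((dst, op, vals))
      return out, known

  ssa = [("x1", "const", [2]), ("y1", "const", [3]),
         ("z1", "add", ["x1", "y1"]),       # folds to 5
         ("w1", "mul", ["z1", "n0"])]       # n0 is a runtime input
  print(const_prop(ssa))
  # ([('w1', 'mul', [5, 'n0'])], {'x1': 2, 'y1': 3, 'z1': 5})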


Compiler IRs are often based on reflecting expression DAGs rather than manipulating a finite register set or an infinite stack.
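
For illustration (my own toy, not any particular compiler): with hash-consing, building the same subexpression twice yields the same DAG node, so common subexpressions are shared by construction:

  # Hash-consed expression DAG: node identity is determined by
  # (operator, operand ids), so duplicates collapse automatically.
  nodes = {}

  def mk(op, *args):
      key = (op,) + args
      if key not in nodes:
          nodes[key] = len(nodes)        # assign a fresh node id
      return nodes[key]

  a, b = mk("var", "a"), mk("var", "b")
  t1 = mk("add", a, b)
  t2 = mk("add", a, b)                   # same node, not a copy
  prod = mk("mul", t1, t2)               # (a + b) * (a + b)
  print(t1 == t2, len(nodes))            # True 4  (a, b, add, mul)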


Manipulating an infinite stack sounds like a really big world of pain.

I realize you're only ever poking around near the top, but if the system has the chance to grow infinitely, well, it will.



