One thing that struck me as a little weird is this:
> But remember, we aren't decoding anything since our instructions are just given to us raw. So that means we only have to worry about fetch, and evaluate!
The 'decode' step is being done by the switch statement; it's not being skipped. I think the author may be confusing turning assembly into machine code (i.e. what an assembler does) with decoding an opcode.
I built a (very simple) VM in Go live on stage at a local mini-conference last year. I really enjoyed it. There's a video of it here: https://www.youtube.com/watch?v=GjGRhIl0xWs
Take 2 hours to grok this and you can skip compiler books and move straight to compiler papers.
In my experience with similar code you can get GCC and LLVM to generate a proper jump table if you're careful to keep your code as regular as possible, so the optimizer can understand what you're doing.
Case in point: https://github.com/simias/psx-rs/blob/master/src/cpu/mod.rs#...
Here LLVM manages to inline all the functions and generate a proper jump table. IIRC I tried getting the function address in the "match" and calling it afterwards but then the compiler wouldn't optimize it properly anymore.
It probably doesn't, because otherwise Intel wouldn't have submitted this patch that adds computed goto support to CPython: http://permalink.gmane.org/gmane.comp.python.devel/153401
Another way of achieving the exact same result would be to use an array of function pointers.
I think the more interesting dilemma is when you have a more complex VM possibly operating on an abstract syntax tree instead of just a bytecode stream, and the ways to optimize code (in C) while maintaining a proper stack.
After figuring out how to make a switch statement work, there are a ton of more pressing organizational concerns like how to make function trampolining work without any overhead.
Another way to make it faster would be to have it interpret *two* opcodes at once, but hey, that wouldn't do anything except save two cycles and explode the code size to the square of what it was.
> a more complex VM possibly operating on an abstract syntax tree
Don't do it.
Oberon did it with relative success. Relative to other VMs at the time.
But for immediate execution, a flat stream of instructions (or, better, threaded code) is much more efficient than any kind of tree. Yes, I'm aware of things like SISC, but I've yet to see proof that their approach is more efficient than a flat bytecode, even when JVM overhead is taken into consideration.
Having a jump table in itself isn't special (the compiler does exactly that when compiling the switch() statement). What's special is that a dedicated indirect jump instruction at the end of each opcode helps the CPU make a separate prediction for which opcode follows which, something that isn't possible with a switch statement where the single jump instruction is shared by all opcodes. I believe that's where most of the speedup comes from.
So I don't think you can conclude that the compiler "probably doesn't" generate the jump table, just that you might do a better job yourself if you know what you're doing.
No, the compiler cannot rewrite your bytecode into threaded code.
> Otherwise you're just making your code harder to maintain and less portable for zero gain.
No. Threaded code is almost always faster, even if run only once.
> a proper jump table
I'm not talking about jump tables. I'm talking about replacing the bytecode indices with jump addresses once (yes, with a precalculated table) and then doing an indirect jump for each translated opcode from the input stream.
An example: https://github.com/ocaml/ocaml/blob/trunk/byterun/interp.c
(see the Next macro definition)
I'm not sure I understand the code you've linked but "goto *(void *)(jumptbl_base + *pc++)" looks very much like my big "match" statement, except that I match only a subset of the 32-bit instruction.
I don't see why the compiler wouldn't be able to rewrite a big switch/match statement as a jump table without having to spell it out directly.
> I'm not talking about jump tables. I'm talking about replacing the bytecode indices with jump addresses once (yes, with a precalculated table) and then doing an indirect jump for each translated opcode from the input stream.
Now you've lost me completely.
1. Build a jump table
2. Read a bytecode
3. Replace all the instruction codes in the bytecode stream with values from the jump table (i.e., build a threaded code from your bytecode).
4. In a loop, take the next value at PC and jump to it - it's already an address. No table lookups (just once, when you load your bytecode). So it's much faster than any switch can be.
EDIT: probably you did not notice that for most of the targets jumptbl_base = 0, and an indirection is removed.
The only downside I can see is that you'd significantly increase the size of the code. On a 64-bit architecture you trade each byte for a 64-bit address, effectively multiplying the footprint in the data cache by eight. A 256-entry LUT, on the other hand, will fit snugly in cache and the lookup shouldn't be very costly.
Also if I understood you correctly what you're proposing doesn't have much to do with the "computed gotos" extension.
> The only downside I can see is that you'd significantly increase the size of the code
This is why jumptbl_base is added on 64-bit platforms. The stored jump targets are still 32-bit, just with a base offset added at dispatch (a tiny overhead).
> doesn't have much to do with the "computed gotos" extension
You cannot implement indirect threaded code without a computed goto. You can implement a direct threaded code, of course (generating jump or call instructions directly), but this is a totally different thing.
Other things that can be added easily are Labels, Loop0 (loop while a variable is greater than zero) and subX (pop stack and instruction pointer and go to location X).
Rather than having an enum of opcodes I have an array of pointers to functions, one for each opcode I implement. It is a logical/clean way of implementing things, but I think it is not necessarily better.
(In my case I lose due to inline strings, and variable length opcodes which make complications.)
But on top of that, there is another issue with this representation: Your interpreter will very likely be dog slow, because you have to make a function call on each iteration, and it cannot be inlined, because it is dynamic.
And I suspect there are implications for the I-Cache that should probably be considered as well.
EDIT: By "considered", I mean measured, and then considered.