"[A] large set of instructions" conflicts with the first and last of that. So perhaps start with a hot instruction or two, and try just 2 baked variants? Hmm, a random weird thought: if the stack top was in a wide SIMD, one might use SIMD masks to select operands without baking or branching? Another random: perhaps instructions could have internal branching to choose registers, and the not-a-jit compiler could optimize for temporally-locally consistent "monomorphic" register choice to keep the branch predictor mostly happy? It might juggle multiple copies of hot instructions with different predictor state?
Can anyone suggest sources of VM-ish optimization advice for cutting-edge CPUs? For instance, the years-ago FORTH advice was, IIRC, that keeping 1 top-of-stack slot in a register is worth it, 2 is borderline, and 3 isn't... CPUs have changed a lot since then. Especially with register renaming?
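(To make the 1-slot version concrete, a minimal sketch assuming a simple switch-dispatch stack VM, all names mine: the compiler can usually keep the local `tos` in a machine register, so a binary op avoids one stack store and one load per instruction.)

    #include <stdint.h>

    enum op { OP_PUSH, OP_ADD, OP_HALT };

    int64_t run(const int64_t *code) {
        int64_t stack[256];
        int64_t *sp = stack;  /* slots below the cached top */
        int64_t tos = 0;      /* cached top-of-stack; dummy until the first PUSH */
        for (;;) {
            switch ((enum op)*code++) {
            case OP_PUSH: *sp++ = tos; tos = *code++; break;
            case OP_ADD:  tos += *--sp; break;  /* result stays in the register */
            case OP_HALT: return tos;
            }
        }
    }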
A 1-operand instruction set would have N registers and 1 accumulator, with instructions of the form "add register R to accumulator", "load from register R to accumulator", "store from accumulator to register R", etc.
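A sketch of that shape, with made-up opcode names. Note that each instruction names at most one register, so fully baking the register into the opcode would cost only N variants per logical instruction rather than N^2 or N^3:

    #include <stdint.h>

    enum op { LOAD, STORE, ADD, HALT };  /* each followed by a register index */

    int64_t run(const uint8_t *code, int64_t *reg) {
        int64_t acc = 0;
        for (;;) {
            switch (*code++) {
            case LOAD:  acc = reg[*code++];  break;  /* acc = rR       */
            case STORE: reg[*code++] = acc;  break;  /* rR  = acc      */
            case ADD:   acc += reg[*code++]; break;  /* acc = acc + rR */
            case HALT:  return acc;
            }
        }
    }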
Maybe there's a balance here: a sort of mix where you allocate more registers to the source than to the destination. Maybe have registers r0 and r1 be the possible output registers, while r0 through r7 are possible input registers? That way you have 16 rather than 64 "physical" instructions per "logical" instruction.
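Something like this X-macro sketch (handler names and vm struct are hypothetical) would generate the 16 add variants:

    struct vm { long reg[8]; };

    /* dst restricted to {r0, r1}, src ranging over r0..r7:
     * 2 * 8 = 16 baked handlers per logical 2-operand instruction. */
    #define FOR_EACH_SRC(X, d) \
        X(d,0) X(d,1) X(d,2) X(d,3) X(d,4) X(d,5) X(d,6) X(d,7)

    #define DEF_ADD(d, s) \
        void add_r##d##_r##s(struct vm *vm) { vm->reg[d] += vm->reg[s]; }

    FOR_EACH_SRC(DEF_ADD, 0)  /* add_r0_r0 .. add_r0_r7 */
    FOR_EACH_SRC(DEF_ADD, 1)  /* add_r1_r0 .. add_r1_r7 */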
The only way to figure out whether these ideas are worthwhile is to give them a try and measure. You're balancing instruction-set expressiveness against branch predictability and instruction-cache pressure, and I have no idea how that trade-off works out.
Let me see if I can reproduce the basic breakdown:
1. 8 registers for a machine word type, 8 registers for a floating point type.
2. Arithmetic opcodes for these types should use three operands, because program bytecode will be much more compact than with 2-operand opcodes, since the latter require programs to add extra move/load/store instructions to preserve register contents. Better to have more instructions that can be shared between VM instances than to bloat the program bytecode, which can't be shared. Each 3-operand instruction type thus requires 8^3 = 512 baked permutations (one possible flat encoding is sketched below).
3. Basic opcodes: unsigned word add, sub, mul, div, signed word add, sub, mul, div, floating point add, sub, mul, div, lshift, rshift, and, or, xor, load word, store word, load float, store float, load byte, store byte. So that's 512 permutations * 23 instructions = 11,776 machine instructions.
4. Then you have some 2-operand opcodes: beq, bne, bge, blt (branch instructions), so that's 8^2 = 64 permutations * 4 instructions = 256 machine instructions.
5. Each baked handler is ~8 bytes of machine code on 64-bit machines, so that's (11,776 + 256) * 8 ≈ 96 kB of raw machine code implementing the opcode handlers.
Much better than I remember, actually. I think at the time I was looking at more registers and more primitive types (byte-sized, signed and unsigned integers, etc.), which leads to more permutations, but those aren't strictly necessary. Since the permutation count scales with the cube of the register count, the register count needs to be kept small.
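For what it's worth, here's one way the flat opcode space could be indexed (my own layout, not necessarily what was meant above), with each logical 3-operand instruction owning a contiguous block of 512 baked opcodes:

    #include <stdint.h>

    enum logical_op { UADD, USUB, UMUL /* ... 23 in total */ };

    /* opcode = op * 8^3 + dst * 8^2 + a * 8 + b */
    static inline uint16_t bake(enum logical_op op, int dst, int a, int b) {
        return (uint16_t)(op * 512 + dst * 64 + a * 8 + b);
    }

    /* The interpreter then dispatches through one flat table:
     *   handler_table[bake(UADD, 3, 1, 2)](vm);   // r3 = r1 + r2, no decode */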
I liked TFA's turn to metaprogramming, but I've found that using a "real" language (instead of C's preprocessor) is usually worth the tooling cost. Metaprogramming is ideally not only more maintainable than raw code, but also clearer.
The program was all about composing simple, understandable DSLs to build a computing infrastructure, including compilers, graphics subsystems and more, with the larger goal of making computing understandable again from top to bottom.
I'm getting gateway timeouts on the actual papers right now though.
Edit: papers are loading now.
I'm sure there are nicer examples out there.
Better, instead, to write a vm_d.py direct emitter and skip the vmc?
The advantage is that CPUs have a hard limit on how many taken jumps per cycle they can execute, as taking a jump requires resteering the fetch unit even before the jump is decoded. For example, Intel CPUs can typically take only one jump every other cycle.
It is possible that the Apple CPU in question has a tighter limit than the AMD one.
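(For anyone unfamiliar with where all those indirect jumps come from: the classic trick is GCC/Clang computed goto, where every handler ends in its own dispatch jump and so gets its own predictor history. A minimal sketch, opcode names mine:)

    #include <stdint.h>

    int64_t run(const int64_t *ip) {
        /* one label per opcode; indices must match the bytecode encoding */
        static void *labels[] = { &&op_const, &&op_add, &&op_halt };
        int64_t stack[64], *sp = stack;
        #define DISPATCH() goto *labels[*ip++]
        DISPATCH();
    op_const: *sp++ = *ip++;          DISPATCH();  /* its own indirect jump */
    op_add:   sp[-2] += sp[-1]; --sp; DISPATCH();  /* ...and another one    */
    op_halt:  return sp[-1];
        #undef DISPATCH
    }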
If I could ask: what made you go for a list of unions for the functions instead of just a plain array of function pointers (or a typedef'd function type)? I can't see any benefit that would be gained from using the union, but I assume I'm just missing something.
Regarding the list of unions thing: The instruction stream needs to contain both the operations (the function pointer) and the operands. In a plain array of function pointers, how would a function get at its operands? For example, for `CONSTANT 15`, how do you encode the 15 if all you have is an array of function pointers? I suppose you could cast back and forth between a function pointer and an intptr_t, but I prefer having the explicit union. (EDIT: I see you figured it out already while I was writing this)
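For anyone following along, the shape being described is roughly this (type and field names mine):

    #include <stdint.h>

    struct vm;
    typedef void (*handler_fn)(struct vm *);

    /* Each cell in the instruction stream is either an operation or an
     * operand for the operation that precedes it. */
    union cell {
        handler_fn fn;     /* the operation */
        int64_t    value;  /* an immediate, e.g. the 15 in CONSTANT 15 */
    };

    struct vm { const union cell *ip; int64_t stack[64]; int64_t *sp; };

    /* Convention: ip points just past the opcode cell when a handler runs,
     * so CONSTANT reads its immediate from the next cell. */
    void op_constant(struct vm *vm) {
        *vm->sp++ = vm->ip++->value;
    }

    /* Encoding CONSTANT 15:  { {.fn = op_constant}, {.value = 15} } */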
They don't; they still read a linear sequence of instructions.
JIT is the way out performance-wise, while the boundaries of the VM are what provide the portability and safety guarantees.
If you're implementing an interpreted programming language, and you don't want to invest all the resources to make a JIT, you may want to use some of these tricks to make execution faster for much less cost. Even if you're implementing a JIT, you may want a general fallback, and these approaches become relevant there too.
JITs can also be surprisingly accessible, as in ...
 How to write a simple JIT compiler https://news.ycombinator.com/item?id=29566249
 ... (others - I searched only briefly)
(I realize you may not have intended your comment to be swipey or patronizing, but it reads that way to me, and presumably to other readers too, since they downvoted it.)