Hacker News new | past | comments | ask | show | jobs | submit login
Faster virtual machines: Speeding up programming language execution (mort.coffee)
84 points by mort96 on Jan 17, 2023 | hide | past | favorite | 27 comments

Can I just point out the interactive virtual machine visualizer[1]? I thought it turned out really neat and it was a joy to make.

[1] https://mort.coffee/home/fast-interpreters/#interactive-vm

Are you the same mort from Libera?

I am! Hi anthk!

Since I can't JIT on iOS, I was wondering about other ways to increase performance (beyond what the article mentions, I think). One issue is that the virtual registers don't map to CPU registers, but rather locations in memory. So each VM instruction results in loads and stores. I was thinking rather to generate a large set of instructions which have the registers baked in. So for example, if I have 8 VM registers, and a 2 operand operation then I would generate 64 VM instructions for that operation. Then those registers get mapped to CPU registers. This of course has the downside of more code paths in the VM. Is that something worth exploring?

May all your code be found in the I-cache, your branch predictions be correct, and your speculative executions succeed. And may your performance be blessed with bottlenecks, and your optimization effort devoted there. :)

"[A] large set of instructions" conflicts with the first and last of that. So perhaps start with a hot instruction or two, and try just 2 baked variants? Hmm, a random weird thought: if the stack top was in a wide SIMD, one might use SIMD masks to select operands without baking or branching? Another random: perhaps instructions could have internal branching to choose registers, and the not-a-jit compiler could optimize for temporally-locally consistent "monomorphic" register choice to keep the branch predictor mostly happy? It might juggle multiple copies of hot instructions with different predictor state?

Can anyone suggest sources of vm-ish optimization advice for cutting-edge CPUs? For instance, the years-ago FORTH advice, IIRC, that putting 1 top-of-stack slot in a register is worth it, 2 is borderline, and 3 isn't... CPUs have changed a lot since then. Especially with register renaming?

Hmm. Yes, that sounds like it's worth exploring, though N*N "physical" instructions for every "logical" instruction is a lot. I would maybe investigate a 1-operand register-based instruction set? That way, you would only have N "physical" instructions per "logical" instruction, which feels like it would scale better. The cost of course would come in the form of limited expressiveness.

A 1-operation instruction set would have N registers and 1 accumulator, with instructions of the form "add register R to accumulator", "load from register R to accumulator", "store from accumulator to register R", etc.

Maybe there's a balance here, a sort of mix where you allocate more registers to the source than the destination? Maybe have registers r0 and r1 be possible output registers, while r0 through r7 are possible input registers? That way you have 16 rather than 64 "physical" instructions per "logical" instruction.

The only way to figure out if these ideas are worthwhile is to give it a try and measure. You're balancing instruction set expressiveness against branch predictability and instruction cache pressure and I have no idea how that works out.

I did the math on this design a number of years ago. For 8 registers you're looking at a VM that requires a few MB just for the bare instructions, given all the permutations required.

Well I was just throwing 8 out there as an example. How about other numbers of VM registers, or single argument operations?

8 is a good number actually, register machine performance improves with more registers. The problem is that VM code bloat also increases with register count.

Let me see if I can reproduce the basic breakdown:

1. 8 registers for a machine word type, 8 registers for a floating point type.

2. Arithmetic opcodes for these types should use 3-operands, because program bytecode will be much more compact than with 2-operand opcodes, since the latter requires programs to add more move/load/store instructions to preserve register contents. Better to have more instructions that can be shared between VM instances than bloating the program bytecode which can't be shared. Each 3-operand instruction thus requires 8^3=512 permutations per instruction type.

3. Basic opcodes: unsigned word add, sub, mul, div, signed word add, sub, mul, div, floating point add, sub, mul, div, lshift, rshift, and, or, xor, load word, store word, load float, store float, load byte, store byte. So that's 512 permutations * 23 instructions = 11,776 machine instructions.

4. Then you have some 2-operand opcodes: beq, bne, bge, blt (branching instructions), so that's 8^2 permutations * 4 instructions = 256 machine instructions.

5. Machine instructions are ~8 bytes on 64-bit machines, so that's (11,776 + 256) * 8 = ~96kB for the raw machine code implementing the opcode instructions.

Much better than I remember actually. I think at the time I was looking at more registers and more primitive types, like supporting byte, signed and unsigned integers, etc. so that leads to more permutations, but those aren't strictly necessary. Since the permutations scale to the third power with the number of registers, the register count needs to be kept small.

I've missed the front-page window to ask, but... Can anyone suggest a compiler making nice use of metaprogramming? So many seem more eyeball abstraction over repetition than Tufte minimal ink. Tnx.

I liked TFA's turn to metaprogramming, but have found using a "real" language (instead of C's PP) is usually worth the tooling cost. Metaprogramming is ideally not only more maintainable than raw code, but also clearer.

Nice post.

The papers from Alan Kay's VPRI/STEPS are worth a read:


The program was all about composing simple, understandable DSLs to build a computing infrastructure, including compilers, graphics subsystems and more, with the larger goal of making computing understandable again from top to bottom.

I'm getting gateway timeouts on the actual papers right now though.

Edit: papers are loading now.

I have seen a custom DSL used to do code generation once, but... it's not very nice. Here's the VM definition file for the Emerald virtual machine: https://github.com/emerald/old-emerald/blob/master/vm/vm.d

I'm sure there are nicer examples out there.

Hmm. So that vm.d is a source file for a nearby compiler[1] ... written in C. With a .l and .y, and implementations of assorted collection datatypes, and ... basically printf'ing code. Sigh.

Better instead say a vm_d.py direct emitter, and skip the vmc?

[1] https://github.com/emerald/old-emerald/tree/master/vmc

I'm 100% in agreement. I don't think the idea of one source file with a DSL for sections is a terrible idea, but it could be done as a 10-line python script rather than a giant lex+yacc C project.

I haven't profiled the code, but I don't think better predictability factors significantly in the computed goto approach. Because of the global history table probably both sample implementations are likely perfectly predicted.

The advantage is that CPUs have an hard limit on how many taken jumps per cycle they can execute, as it requires resteering the fetch unit even before the jump is decoded. For example typically intel cpus can only take one jump every other cycle.

It is possible that the apple cpu in question has a tighter limit than the amd one.

I'm actually slightly disappointed as I had just come up with the tail call optimization version as a replacement for the computed gotos since they aren't supported in MSVC (I was considering making it my own first blog post).

If I could ask what made you go for a list of unions for the functions instead of just a plain array of function pointers (or a typedef'd function type)? I can't see any benefits that would be gained from using the union, but I assume I'm just missing something.

You should absolutely write your own blog post about it! I'm sure your approach is somewhat different from mine, and different approaches to the same problem is always interesting. In a week or two everyone will have forgotten my post anyway and yours will seem fresh and interesting.

Regarding the list of unions thing: The instruction stream needs to contain both the operations (the function pointer) and the operands. In a plain array of function pointers, how would a function get at its operands? For example, for `CONSTANT 15`, how do you encode the 15 if all you have is an array of function pointers? I suppose you could cast back and forth between a function pointer and an intptr_t, but I prefer having the explicit union. (EDIT: I see you figured it out already while I was writing this)

I'll definitely still consider it, I think I got lost in the fact that I originally made mine similar to the original version of the dispatch table. Then I just had a byte array of memory where I wrote instructions and values directly into the memory and then executed it. Stack also lived in the memory block growing downwards like a normal stack. And yeah as you said the constant function I have reads in the value from the program_counter and increments it manually. but since all values and opcodes were only bytes it didn't feel off just reading it in the same way.

Oh wait I see I'm mistaken, the function pointer is being used as the opcode and you're running across the instruction list with the &instrs[2], that makes more sense at least. I can't personally do that for my VM however since I'm using an existing instruction set. I'll have to give this a full read when I get off work.

MSVC can sometimes be convinced to emit tail jumps from a switch() if the cases are densely packed and there is an __assume(0) on the default. Unfortunately, it also tends to encode an offset table instead of a jump table on x64, which I have not found a way to suppress.

> A LISP machine reads instructions in the form of LISP code and modifies its state accordingly.

They don't; they still read a linear sequence of instructions.

Trying to optimize parsing binary interpretation is a lost cause. The problem with pure virtual machines is they're basically meta-macrocode are mappable back to the ISA macrocode without the overhead.

JIT is the way out performance-wise while the boundaries of the VM are what provide portability and safety guarantees.

Yeah, as I mentioned in the article, tracing JITs are the way to get the best performance out of dynamic languages. But JITs are extremely complicated, hard and expensive to make, prone to security issues, require a separate implementation for every single CPU architecture you want to support, etc.

If you're implementing an interpreted programming language, and you don't want to invest all the resources to make a JIT, you may want to use some of these tricks to make execution faster for much less cost. Even if you're implementing a JIT, you may want a general fallback, and these approaches become relevant there too.

> But JITs are extremely complicated, hard and expensive to make,

JITs can also be surprisingly accessible, as in [1][2]...

[1] How to write a simple JIT compiler https://news.ycombinator.com/item?id=29566249 [2] https://eli.thegreenplace.net/2017/adventures-in-jit-compila... [3] ... (others - I searched only briefly)

In some environments (iOS for example) JITting isn't allowed. (Or more precisely, marking pages as executable isn't allowed)

The standard VM writers learning process. Now he just has to find out the advantages to optimize the data structures (smaller size), and maybe switch to a register VM, with 32bit ops. Smaller data will be much faster than jump-table optims.

Can you please keep swipes/putdowns out of your HN posts? This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html. Your comment would be better without the first bit ("The standard VM writers learning process") and the patronizing "Now he just". There are better, friendlier ways to convey information about register VMs.

(I realize you may not have intended your comment to be swipey or patronizing, but it reads that way to me, and presumably to other readers too, since they downvoted it.)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact