Hacker News new | past | comments | ask | show | jobs | submit login
Write Your Own Virtual Machine (justinmeiners.github.io)
546 points by vedosity on Dec 14, 2018 | hide | past | favorite | 77 comments

One of my favorite techniques for implementing VMs is the "computed goto":


Consider this example for dispatching instructions from the article:

    while (running)
    {
        uint16_t op = mem_read(reg[R_PC]++) >> 12;
        switch (op)
        {
            case OP_ADD:
                {ADD, 6}
                break;
            case OP_AND:
                {AND, 7}
                break;
            case OP_NOT:
                {NOT, 7}
                break;
            case OP_BR:
                {BR, 7}
                break;
        }
    }
That code has a lot of branching. The switch statement has to jump to the corresponding case, the break statement branches to the bottom, and then there is a third branch to get back to the top of the while loop. Three branches just to hit one instruction.

Now imagine we had written the above as:

    static void* dispatch_table[] = { &&OP_ADD, &&OP_AND, &&OP_NOT, &&OP_BR,
                                      ... };
    #define DISPATCH() goto *dispatch_table[memory[reg[R_PC]++] >> 12]


    DISPATCH();

    OP_ADD:
        {ADD, 6}
        DISPATCH();
    OP_AND:
        {AND, 7}
        DISPATCH();
    OP_NOT:
        {NOT, 7}
        DISPATCH();
    OP_BR:
        {BR, 7}
        DISPATCH();
Now there is only one branch per instruction. The handler for each instruction directly jumps to the next location via the goto. There is no need to be in an explicit loop because the interpreter runs until it hits a halting instruction.

Many VMs now use this technique, including the canonical Ruby and Python interpreters.

For the longest time, I swore by computed goto as well. But it has its share of problems: it's not very portable, it forces the code into a rigid, non-extendable format, and it's not as efficient as commonly assumed. I'm far from the first person to notice [0], so don't bother shooting the messenger.

My latest project [1] simply calls into a struct via a function pointer for each instruction, with a twist: each instruction returns a pointer to the next operation, and longjmp is used to stop the program, which means I can get away without lookups and without a loop condition. As an added bonus the code is much nicer to deal with and more extendable, since the set of operations isn't hardcoded any more.

Comparing language performance is tricky business, but it mostly runs faster than Python3 from my tests so far. My point is that computed goto is not the end of the story.

I'll just add that there are many different ways to write a VM; the one posted here is very low level, the one linked below reasonably high level. Using physical CPUs as inspiration is all good, but there's nothing gained from the pretense in itself unless you're compiling to native code at some point.

[0] http://www.emulators.com/docs/nx25_nostradamus.htm

[1] https://gitlab.com/sifoo/snigl

> so don't bother shooting the messenger

OK, let's shoot the original source: It only found that a certain unnamed version of GCC ten years ago performed tail merging, which does indeed defeat the purpose of the optimization. Without the author bothering to turn off tail merging in this case (as Python does), this doesn't mean much.

> Comparing language performance is tricky business, but it mostly runs faster than Python3 from my tests so far.

Presumably you not only have a different interpretation approach, but also a different object model, approach to boxing numbers, and garbage collector. "Tricky business" is a bit of an understatement ;-)

Better him than me. I wouldn't be so quick to dismiss relevant coding experience, it doesn't get old. Much of his reasoning about locality still holds, to an even greater degree today.

I've implemented more or less the same interpreter using computed goto and various variations on dispatch loops; this is the fastest solution I've managed to come up with, and the nicest code to work with. Still too many parameters, but my point is that I have plenty of experience pointing in that direction, which is better than nothing. It's not a mystery to me. Using computed goto will involve some kind of lookup to get the relevant labels; I'm using pointers to actual instructions as jump targets. And since I'm longjmp'ing out, I don't need a loop condition; the loop is reduced to a single regular goto.

Like I said, better than nothing. There is no such thing as perfection in this world. I've spent quite some time [0] micro benchmarking different features to get a fair comparison.

[0] https://gitlab.com/sifoo/snigl/tree/master/bench

Oh, I wasn't dismissing experience in general, nor was I questioning your experiences. I would love to see your numbers on the different dispatch methods you tried.

I was only questioning that one blog post you linked that lazily said (paraphrasing) "computed-goto dispatch will be undone by the compiler, so don't bother". Others have posted numbers in this thread (https://news.ycombinator.com/item?id=18679477) showing that this information is at best outdated.

No harm done; I don't do pride any more, it kept getting in the way of truth. But there's a clear tendency, especially on HN if you ask me, to blindly trust research by whatever authority and beat people over the head with so-called best practices, while completely disregarding personal experience.

Sometimes the reason no one finds a better solution for a long time is that no one believes it's possible. Let's say it's 50/50, believing it's possible is still the superior alternative.

I have no idea what Python uses for regular dispatch; the linked thread just says that their computed goto solution is faster, which comes as no surprise given that was the reason they switched.

I posted the link since I learnt a lot from it, and since it echoes my own experience.

The way I ended up structuring my opcodes was similar to your way, but a little different still. I ended up using function pointers, but an array of them. Each opcode is defined in an enum and is assigned to a function pointer at its corresponding array index. When the virtual CPU reads an opcode, it just calls the function pointed to by the pointer at that index.

I found doing it this way makes it easy to change or add opcodes without needing to go back through a giant switch statement to find and rearrange everything. As a bonus, the assembler just uses the same enum and basically works through it the opposite way the CPU does: translating keywords to the matching opcode index in the array, then writing the index to the correct memory address in the binary. This also means any time I update an opcode or add one to the virtual machine, all I have to do is add it to the enum, and both the assembler and virtual machine will be updated without me having to do anything else.

Nice to hear! That's what I try to tell people who ask for the right type theory book or the right dispatch method or whatever before even having a look. It's the same old problem solving; there are always multiple solutions, each with its own set of compromises. Claiming the one true way of doing anything related to software is delusional. We're barely scratching the surface.

Edit: While we're here; may I ask how you break out of the dispatch loop? Do you test an end-condition on each iteration? If that's the case, the longjmp trick might be worth trying. The condition will mostly be false, so it might make sense to pay more when it happens instead. You may find the essence of it in sgl_run at the bottom of sgl.c [0].

[0] https://gitlab.com/sifoo/snigl/blob/master/src/snigl/sgl.c#L...

> Edit: While we're here; may I ask how you break out of the dispatch loop? Do you test an end-condition on each iteration? ...

For what it is worth my JITter uses mprotect from other thread to change page permissions to generate an exception to break out of JITted code. Same would work for dispatcher as well.

The advantage is that it's completely free performance-wise while the code is running. The need for a signal handler is less awesome.

> It returns the pointer to the next operation and uses longjmp to stop the program, which means I can get away without lookups and without a loop condition.

That threaded approach would be even faster using computed gotos. The loop conditional is a huge performance killer; removing it claws back cycles for the function-pointer approach, but it would benefit the computed goto approach at least as much.

Personally I find that function pointers make for difficult-to-read code. Dispatching opcodes with a compact switch statement (or the equivalent computed goto construction) makes it easier for me to see the meat of the engine and how it works. If the logic for an opcode is complex it can easily be pushed into a static function (which may or may not be inlined, but at that juncture readability is the primary concern). But for simple operations I prefer it inline, textually, so it's easier to see all the moving parts at once.

> That code has a lot of branching. The switch statement has to jump to the corresponding case, the break statement branches to the bottom, and then there is a third branch to get back to the top of the while loop. Three branches just to hit one instruction.

That's a bit unfair. Not all branches are equal. Only the instruction fetch branch is going to be often mispredicted. Predicted branches, like that while loop, aren't that expensive. Mispredicted branches cost 10-20x more.

Of course, fewer branches and less code in general is better.

One big issue with writing interpreters in C/C++ is that the compiler's register allocation usually can't follow the data flow, so it keeps unnecessarily loading and storing the same common variables from/to memory.

Interpreters need to also be careful not to exceed 32 kB L1 code cache limits.

All this means that to write a truly efficient interpreter, you'll need to do it in assembler.

The step after that is to write a simple JIT that does away with data dependent (= VM instruction) branches altogether.

Then you'll notice you don't need to update some VM registers every time, but can coalesce for example program counter updates to certain points.

Eventually you'll find you have a full fledged JIT compiler doing instruction scheduling and register allocation, etc.

Been down that rabbit hole, except for the last step. That's where it becomes a true challenge.

The LuaJIT (http://luajit.org/) project followed all the way through, and studying it is a great resource for anyone interested in the topic. Kudos to Mike Pall.

I’m writing a C port of the LuaJIT VM at the moment. I’m hoping that will help people to understand the overall design and make it easier to interpret the asm code (amongst other things). Link: https://github.com/raptorjit/raptorjit/pull/199

> Predicted branches, like that while loop, aren't that expensive.

The dispatch loop conditional used to be quite expensive. Since Haswell it has become much less expensive, such that it wasn't even worth threading dispatch. But now with Spectre mitigations that cost may rise again.

Also, it depends on how tight the VM loop is. The prediction buffers are only so big, so if your ops are doing a lot of work you can end up back with pre-Haswell performance.

I like to go one step further and use a first pass to convert bytecodes to a list of labels. Use a pointer in this list as the program counter. Then the dispatch becomes as simple as:

    goto *pc++;
Here's an example of the technique for a Brainfuck VM implemented in Ethereum VM assembly (a recent competition entry of mine):


This technique also makes it easy to do some peephole optimizations and combine multiple bytecodes into a single label. The above example also does a lot of that.

Brainfuck is actually a great language to build a VM for. You have a simple interpreter up in no time, but you can go really far in optimizing it. LC3 looks comparatively messy (more state, more instructions, etc.), but it is more representative of real CPUs.

I presume you're aware, but for those reading along at home, this technique is called a "threaded" or "direct threaded" interpreter. Last I read, the BEAM virtual machine (used by Erlang and Elixir) translates arrays of opcodes to arrays of jump addresses at code load time.

For a time, outputting direct threaded code was a popular implementation technique for native code compilers.

I suspect that for many languages, what you really want is a compact SSA distribution format, and at program/library installation time, to compile each extended basic block to a size-optimized block of straight-line native code. Functions would compile to stubs that pass arrays of basic block addresses to a common dispatch loop. In effect, you get a direct-threaded interpreter, where each "opcode" is a piece of straight-line native code from the application. Each basic block would need to set a known register to the address of the next basic block. In other words, the code would be compiled to native continuation-passing style.

This provides the basis for a very low-overhead tracing JIT for native code. When a profiling timer triggers, the signal handler can walk the stack and perform some hot code detection heuristics. If a hot section is detected, the handler can perform on-stack replacement of the dispatch loop return address with a tracing version of the same. Once a hot loop is detected, the SSA form for each of the involved basic blocks can be stitched together and passed to an optimizing compiler. This would give you the fast start-up of size-optimized native code along with the long-term profile-optimized, cross-library-inlined-and-optimized code of a high performance JIT. The main downside would be the on-disk storage size of keeping both compressed SSA and size-optimized native code.

Alternatively, I could imagine processors with built-in support for native code tracing and optimization. If a trace register were non-zero, then every conditional or indirect branch would store the effective branch target at the location pointed to by the trace register, and increment the trace register by sizeof(size_t). If the trace register were equal to the trace limit register, or a performance timer had expired, then the CPU would trap to a userspace handler indicated by another register. Though, I think you'd need the threaded interpreter version of the idea to become popular before CPU manufacturers started to consider the hardware-supported version.

I googled and found a test[1]. Running this test with a few modifications (like removing unused variables and using PRIu64 instead of %llu) and disregarding the last two prints, I get:

switch = 5589926 goto = 5752079 goto_opt = 5618938

with -O0, and

switch = 2234105 goto = 2013742 goto_opt = 2016200

with -O2.


Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516

[1] https://gist.github.com/mmozeiko/7cc858985b57df30eebbf8c6e75...

If you're looking for a bigger and more "real-world" test, the Python interpreter can be built with or without computed gotos for dispatch. It suffices to arrange to define USE_COMPUTED_GOTOS to 0 in Python/ceval.c. You can just change the file by hand, but I vaguely remember there also being a configure flag for setting USE_COMPUTED_GOTOS or HAVE_COMPUTED_GOTOS.

Then run your favorite Python benchmark. For reasonable results, it must be one that spends almost all of its time in Python code, which excludes calls to native code libraries, but also Python programs that mostly do list manipulation like the popular fannkuch benchmark. (The list operations are implemented in C, and fannkuch, although looking like a Python program, spends most of its time in this C library.)

The only difference between the computed goto and the switch is that the computed goto leaves out the initial range check. But both are doing simple and costly indirect jumps.

Much better is to keep the ops small (16-bit in this case, which is the best I've seen; lua/luajit/chez have 32-bit, worse languages 3-5 words) to fit many more ops in the icache, and to pass around the pointer to the next op, best if relative. Easy with a small jit. This pointer will be prefetched, which is not possible with those indirect jumps, whether from computed goto or switch.

Wow, you were quick.

Mentioning compiler name + version and exact CPU model would be nice. The difference is small enough that those details can tip the scales.

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516

It'll be even faster without indirecting through dispatch_table at runtime. Exporting your label addresses without introducing a conditional initialization block is tricky but doable. For C++ you can (IIRC) smuggle dispatch_table out of the routine using a constructor on a statically-scoped object. For C code I've successfully used GCC's nested function and __attribute__((constructor, used)) extensions, though that was with 4.7 (may not work with current GCC).

I mostly use computed gotos for implementing poor-man's coroutines in C, where the addresses only need to be visible within the routine. For VMs I use the obvious method (i.e. dispatch_table) to keep the code sane, but it does incur a non-negligible performance cost, which may matter in some contexts.

Do you know if an optimizing compiler will ever make this transformation?

Every single C compiler worth mentioning will turn a switch statement into a jump table. The difference between a jump table and computed goto is only that one is `jmp [ecx+eax*8]` and the other is `mov edx, [ecx+eax*8]; jmp edx`. The latter is faster because of weird branch prediction reasons, and because it skips the bounds check the switch performs.

> Every single C compiler worth mentioning will turn a switch statement into a jump table

Only if the case values are small integers with a compact range. But if your opcodes are large and/or sparse (more common in non-VM scenarios) compilers don't optimize switch statements very well.

There's a ton of research on switch statement optimization, but most of it doesn't translate to real-world scenarios very well. Compilers will give up fairly quickly on trying to generate a jump table because the work necessary to map the case value to an index at runtime can easily cost more than it's worth in many situations.

Here's some real-world code which you can easily benchmark and compare: https://github.com/wahern/hexdump/blob/master/hexdump.c#L752

Just flip the VM_FASTER macro. Using `hexdump -C < 64MB.random.file > /dev/null`, on my circa 2012 Mac Mini the switch statement is 80% slower than the computed goto and on my circa 2018 Macbook Pro it's about 15% slower. (See my note elsethread about the post-Haswell branch prediction.) And this is despite the fact that the opcode range is 0-32 without holes!

Also, with computed gotos you can remove the jump table altogether. You can make your opcodes actual addresses (just like native code) if you can make the label addresses visible to the code generator. (See my post elsethread for how to make them visible.)

P.S. While I implemented hexdump.c using a VM mostly for fun, the end result not only outperforms GNU od and BSD hexdump, but IMHO the code is much easier to read, too.

> Every single C compiler worth mentioning will turn a switch statement into a jump table.

Yes, but that is not the optimization discussed here. The optimization discussed here is looking ahead to the next bytecode instruction's opcode, using it as a key into a different jump table, and jumping to that target. No C compiler I'm aware of does that.

C compilers could try to match common patterns in interpreter implementation (a switch within a loop, keyed on certain bits of a piece of data read mostly linearly from an array) and heroically generate a dispatch table, but that would still fail for branches where you don't want to call DISPATCH because execution does not necessarily proceed to the next instruction in the array.

HN formatting ate some * from the parent post:

  jmp [ecx+eax * 8]

  and the other is

  mov edx, [ecx+eax * 8]; 
  jmp edx
> The latter is faster because of weird branch prediction reasons...

Things like that are microarchitecture dependent. Might be true on particular CPUs.

Of course, it's possible the separate MOV could be executed much earlier in the out-of-order pipeline, while the JMP's effective address calculation might not. So the JMP address (edx) would already be resolved by the time the JMP is actually executing.

Do you know if the Go compiler does this too?

Edit: Nevermind, it doesn't: https://github.com/golang/go/issues/5496

When you code a `switch` you're telling the compiler what you want. So you get decades of compiler optimization working for you.

Oh psh, optimizing compilers go way beyond this. In fact, even hand-implemented VMs generally go further. The technique above is called tail dispatch and is really just the beginning.

As far as I am concerned, I stopped worrying and use direct threaded code (DTC): a VM instruction is just a pointer to the C function implementing it. It's reasonably portable and one can expect stable performance across platforms. Because, when you look at the benchmarks, you find out that you are sometimes talking about single-digit percent differences between techniques, and variations in cache size, branch prediction, even the unaligned memory fetch tax on odd targets; all these details can make for a different king-of-the-hill on a different processor. Compilers can also deal with one technique better than another.

You can gain far more by working on your instruction set. Making it easily extendable is better, making it easy to transpose interpreted code into native code is better, because you can easily benchmark and optimize as you have a more and more clear idea of what will be idiomatic code. For the DTC technique, a bonus is that there's normally no difference between instructions and library bindings.

> direct threaded code (DTC): a VM instruction is just a pointer to the C function implementing it.

Nitpick: This sounds more like subroutine threading than direct threading. https://en.wikipedia.org/wiki/Threaded_code

In subroutine threaded code, each function implementing an instruction returns before the next one is called. In direct threading, each code snippet implementing an instruction loads the next address and jumps to it directly without going through a return and another call.

It sounds like it, but it is not. It gets confusing fast when one talks about these things in even slightly accurate ways.

In DTC, to call a function of the VM code (i.e. another piece of DTC) you typically have a "call" virtual instruction, followed by the virtual address to call. You also have a virtual "return" instruction that pops the virtual instruction pointer from the virtual call stack.

Subroutine threaded is really just a very primitive form of AoT compilation.

None of this explains why you think that your code in which "a VM instruction is just a pointer to the C function implementing it" is direct threaded. If you call a C function for each VM instruction, and that function returns before the next C function is called, then your threading is not direct.

Indeed. I think the second optimization on the list is super-instructions: frequent combinations of 2 (or even 3) instructions are transformed into 1 equivalent instruction, lowering the average cycle count.

Where can I learn about techniques like that? Any recommendation would be very appreciated!

Look up work by Stefan Brunthaler, like this overview paper ("Virtual-Machine Abstraction and Optimization Techniques"): https://www.sba-research.org/wp-content/uploads/publications...

Make sure to check out the references. I especially like "Ertl, M. A. and D. Gregg, The structure and performance of efficient interpreters, Journal of Instruction-Level Parallelism 5 (2003)"

There goes the weekend...

Do you know of any recent efforts to combine ML and VM optimizations ?

(I read about the VS compiler and profile-based optimizations)

This one is good: http://www.tara.tcd.ie/bitstream/handle/2262/64047/Optimizin...

Though you might not think of hill climbing as "machine learning".

Those techniques are a relatively niche thing; people don't write bytecode interpreters daily. A couple of months ago I wrote a few articles on the topic. Unfortunately, they are in Russian. I'll probably translate 'em at some point, but for now...

The repo and performance measurements are on GitHub [1]. It includes a couple of dispatch techniques, register caching, etc.

[1]: https://github.com/vkazanov/bytecode-interpreters-post

The article below contains an overview and more pointers.


> the break statement branches to the bottom, and then there is third branch to get back to the top of the while loop

That suggests your C compiler isn't doing jump threading: replacing the jump to an unconditional jump instruction with a direct jump to that instruction's destination.

That's a very basic optimization I might expect even from a C compiler written in 1980.

> That suggests your C compiler isn't doing jump threading

And even if he has such a compiler (which is doubtful), he doesn't have to put the "break" in the switch! He can write continue instead, e.g. this:

        switch ( i ) {
        case 3:
            i = 7; continue;
is valid C, and specifies explicitly that the next loop pass is expected.

Thanks! I was not familiar with this technique. What do you think about the C++ table of function pointers towards the end? I imagine that would be slower with the additional level of indirection, but I haven't profiled or examined the assembly.

Is that {THING, VALUE} pseudocode or is it valid C somehow?

The featured article is written in a "literate programming" style. Those {ADD, 6} markers refer to snippets of code defined elsewhere to be inserted in place of the marker.

and what happens on an undefined instruction?

Typically an undefined instruction "can't happen" because the interpreter assumes that it gets its code from a trusted frontend that validated all the invariants assumed by the interpreter. The most important being that, for any instruction that might use DISPATCH, there actually is a next instruction at the following address. This means that every block of code must end in an unconditional branch and that unconditional branches must not dispatch using DISPATCH (which would not make sense anyway).

Do you know where I can find these implementations in Ruby or Python?

It's "of", rather than "in"... this is probably the Python one mentioned:


I once implemented a VM in Ada and just used a large switch. I extensively benchmarked it and it was blazingly fast at -O3.

However, it was only fast when I used packages very sparingly. In contrast to the usual advice given in the Ada community, splitting up the implementation into several packages slowed down the main loop tremendously. I suspect this wouldn't happen with whole-program optimization in C, but believe that the version of gcc I was using didn't support that for Ada. Also, my Green Threads were slower than a single thread, no matter which tricks I tried.

It's an abandoned project now, since the accompanying assembler was hacked together in Racket and at some point I simply lost track of what was going on where :O

FYI: I believe you have to use the flag -gnatn or -gnatN to inline packages with GNAT.

Optimizing your project: https://www.pegasoft.ca/resources/boblap/7.html

GNAT: Alphabetical list of all switches: https://gcc.gnu.org/onlinedocs/gcc-8.2.0/gnat_ugn/Alphabetic...

Object Pascal (Delphi) also optimizes case statements, depending upon the nature of the statement:


I tested this recently (Delphi XE6), and it definitely is as-described: you get straight jump instructions with enough case statement branches, and it is very fast.

Barry Kelly worked on the Delphi compiler, and I believe he comments here on Hacker News occasionally.

> you get straight jump instructions with enough case statement branches, and it is very fast

That's what pretty much any half-decent compiler does nowadays. It'd be much more surprising if it didn't.

The interesting part to me was the variations in the emitted instructions, based upon the source. IOW, it would be easy to mistakenly think that you weren't going to get jumps if you only used a few case branches to test things out.

I love these kinds of tutorials, and really any tutorial that takes stuff that was previously "magic" and leaves me feeling stupid for not having understood it sooner (I honestly mean that in a positive way).

This will be a fun weekend project for me... I have had an idea for a lambda-calculus-based VM that I've wanted to build for a few months now, and I think this will be a good start toward understanding it.

I like this tutorial but it should be mentioned that before implementing a virtual machine one should understand that there are many computing models and many alternative mechanisms within each of them. Implementing VM for sequential program execution is relatively easy. What is more (conceptually) difficult is concurrency, asynchronous processes etc.

> What is more (conceptually) difficult is concurrency, asynchronous processes etc.

Know any good resources for these?

My own baby, Snigl [0], does always-on cooperative concurrency and asynchronous IO.

Just ask if you need help finding your way around.


Thank you. I’ll check it out over the holidays.

Author here. Definitely agree. The intent was to write the simplest possible VM that could run real programs. VMs are obviously a deep field of research and there are plenty more projects to learn from.

That's one of the main reasons I implemented my VM in Go.

It's fairly rudimentary (a map from instructions to function pointers), but I can quickly duplicate a subtree of code and execute it within a different goroutine, wrap that section of code in an instruction that sets the output to a channel, and have another goroutine select over the channels from the different goroutines.

At each node within the tree of code, I store the variables in a map instead of using a stack-based system; if the data is not in the current node, I look down the tree until I find it.

I agree it's a great exercise. Years ago I added a very simple stack-based VM to my raytracer to allow procedural textures & normal maps. The scene script would e.g. contain a material description like this:

  material {
      diffuse rgb = [0, 0, 1] * (1-(0.5+0.5*noise(x*0.000002, y*0.000002, z*0.000002, 0.66, 2))) + [1, 1, 1] * (0.5+0.5*noise(x*0.000002, y*0.000002, z*0.000002, 0.66, 2));
i.e. an expression with 3-vectors and scalars and a few basic functions (noise, sin/cos, etc.). This can be easily "compiled" (during script parsing) for execution on the VM. Then the overhead during actual raytracing was quite small.

I don't suppose you have code or examples lying around? I'd love to see what you were trying to achieve, the results you had, and the code.

If you're comfortable sharing, that is.

I still have the code (old and ugly, so not on GitHub; but it still builds and works) and examples. Let me know where I should send it.

My email is in my profile.

Is it publicly visible? At the moment in your profile I can only see "about", submissions, comments, favorites.

Guess not. Huh. I added it to my 'about' block.

Could I trouble you to share the code with me as well? Email in profile about box. Thanks!


Does anyone know how I can convert an AST into a stream of bytecodes? Are there any good example language implementations to learn from?

A little tangential, but check out "Prolog as Description and Implementation Language in Computer Science Teaching" by Henning Christiansen.


> ... Definitional interpreters, compilers, and other models of computation are defined in a systematic way as Prolog programs, and as a result, formal descriptions become running prototypes that can be tested and modified ... These programs can be extended in straightforward ways into tools such as analyzers, tracers and debuggers.

Also "Logic Programming and Compiler Writing" By David Warren (and the work that followed on.)

If you want to play with the algorithms at a high level [1], there are two Python AST -> bytecode implementations I've seen, and one that I'm using.

(1) The 'compiler' module from Python 2. I'm using this to build my shell project Oil, so it works.

See Building Oil with the OPy Bytecode Compiler: http://www.oilshell.org/blog/2018/03/04.html . The heavily refactored source is in the opy/compiler2 directory of the repo on Github.

(This module was removed in Python 3 due to its redundancy with the C implementation. But there's no conceptual difference between Python 2 and 3. Some small details have changed.)

(2) tailbiter, which I mentioned here: http://www.oilshell.org/blog/2018/03/27.html

They are written in very different styles. Tailbiter reads like Lisp code, which may or may not be what you want.

The generated bytecode is different in the sense that the values it operates on aren't just the ints and floats that hardware deals with, but dynamically typed objects. That doesn't change the overall algorithm for generating bytecode from an AST, though.

[1] which I recommend over doing it in C or C++, because ASTs are somewhat annoying in those languages

Walk the tree post-order and emit an instruction for each node.
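That one-liner really is the whole algorithm for expressions. A minimal sketch in Python (node shapes and opcode names are made up for illustration): children are compiled first, then the operator is emitted, which is exactly the order a stack machine wants.

```python
# Post-order walk: emit code for the children, then one instruction
# for the node itself.
def compile_expr(node, code):
    if isinstance(node, (int, float)):        # leaf: constant
        code.append(("PUSH", node))
    else:                                      # interior: (op, left, right)
        op, left, right = node
        compile_expr(left, code)               # children first (post-order)
        compile_expr(right, code)
        code.append(({"+": "ADD", "*": "MUL"}[op], None))
    return code

# Matching stack-machine evaluator.
def run(code):
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op == "ADD" else a * b)
    return stack[0]

# (1 + 2) * 4  ->  PUSH 1, PUSH 2, ADD, PUSH 4, MUL
ast = ("*", ("+", 1, 2), 4)
bytecode = compile_expr(ast, [])
print(run(bytecode))  # 12
```

Statements with control flow need jumps and labels on top of this, but the post-order skeleton stays the same.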

https://compilerbook.com/ is a good read ..

I cranked out a compiler, assembler and VM for TXR Lisp earlier this year.

The code is very clean, and fairly suitable for study purposes. It is not commented, though.




The assembler accepts programs in a kind of Lisp syntax. It allows for symbolic labels and performs all the backpatching. The binary instruction format is written out with the help of the buf data type and some ffi-related functions for reading and writing binary types to and from a buffer.

Because the assembly code is a Lisp data structure, the compiler can emit the instruction templates using backquote templates. Pieces of code can be concatenated with append, since they are just lists, and so on.

The compiler comp-if method compiles (if ...) forms. The most general case with all three arguments (if expr then else) uses this code template:

     ^(,*te-frag.code
       (if ,te-frag.oreg ,lelse)
       ,*th-frag.code
       ,*(maybe-mov oreg th-frag.oreg)
       (jmp ,lskip)
       ,lelse
       ,*el-frag.code
       ,*(maybe-mov oreg el-frag.oreg)
       ,lskip)
The first element, ,*te-frag.code, splices in the code part of the fragment obtained from compiling the test expression; that code has to run first. Then we emit the (if ...) instruction, which tests te-frag.oreg: the output register where te-frag.code has stored the value. If the test is successful, execution continues with the instructions below; otherwise a branch takes place to the else part, identified by the label held in lelse (the ,lelse backquote syntax inserts that label into the template).

So after the (if ...) instruction we splice in th-frag.code, the "then" fragment. After that we jump past the else part to whatever follows via (jmp ,lskip): a jump to the skip label. Before this, we potentially insert a mov instruction that may be needed: we are expected to produce the result value of the (if ...) expression in an output register held in oreg, but th-frag.code puts the result into th-frag.oreg, its own output register. Those two may be the same. If they are not the same register, then a move is needed: the maybe-mov function produces the code when the registers are different, or else an empty list.

In action:

  This is the TXR Lisp interactive listener of TXR 203.
  Quit with :quit or Ctrl-D on empty line. Ctrl-X ? for cheatsheet.
  1> (compile-toplevel '(if x y z))
  ** warning: (expr-1:1) unbound variable x
  ** warning: (expr-1:1) unbound variable y
  ** warning: (expr-1:1) unbound variable z
  #<sys:vm-desc: 9739370>
  2> (disassemble *1)
      0: x
      1: y
      2: z
      0: A0020000 getlx t002 0
      1: 48000005 if t002 5
      2: 00000002
      3: A0020001 getlx t002 1
      4: 44000006 jmp 6
      5: A0020002 getlx t002 2
      6: 10000002 end t002
  instruction count:
  #<sys:vm-desc: 9739370>
Here, getlx t002 0 means look up the lexical variable named by the symbol at position [0] in the syms table, in other words x. Put the value into register t002. Then if t002 5 means, if t002 is true (non-nil), then continue, else branch to the instruction at offset 5.

end t002 means that the VM is done executing and its result value is in t002.
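The execution model described above can be sketched as a toy register VM in Python (instruction names mirror the disassembly, but the encoding and dispatch are simplified inventions, not TXR's actual implementation):

```python
# Toy register VM: getlx loads a variable into a register, "if"
# branches to a target when the register holds a false value,
# jmp is unconditional, end returns the register's value.
def run(code, env):
    regs = {}
    pc = 0
    while True:
        ins = code[pc]
        if ins[0] == "getlx":                 # getlx reg sym-index
            _, reg, sym = ins
            regs[reg] = env[sym]
            pc += 1
        elif ins[0] == "if":                  # continue if truthy, else branch
            _, reg, target = ins
            pc = pc + 1 if regs[reg] else target
        elif ins[0] == "jmp":
            pc = ins[1]
        elif ins[0] == "end":                 # result is in the register
            return regs[ins[1]]

# Compiled form of (if x y z) with syms table [x, y, z]:
code = [("getlx", "t002", 0),
        ("if", "t002", 4),
        ("getlx", "t002", 1),
        ("jmp", 5),
        ("getlx", "t002", 2),
        ("end", "t002")]
print(run(code, [True, "y-value", "z-value"]))  # y-value
```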

We can intercept the call to the assembler asm method:

  3> (trace (meth sys:assembler sys:asm))
  4> (compile-toplevel '(if x y z))
  ** warning: (expr-4:1) unbound variable x
  ** warning: (expr-4:1) unbound variable y
  ** warning: (expr-4:1) unbound variable z
  ((meth sys:assembler
     sys:asm) (#S(sys:assembler buf #b''
                                bstr #<buf-stream b767e90c> sys:max-treg 0 sys:labdef #H(())
                                sys:labref #H(()))
              ((sys:getlx (t 2) 0) (if (t 2) #:l0019) (sys:getlx (t 2) 1) (sys:jmp #:l0020)
               #:l0019 (sys:getlx (t 2) 2) #:l0020 (end (t 2))))
#<sys:vm-desc: 972a210>

Here the code looks like:

  ((sys:getlx (t 2) 0)
   (if (t 2) #:l0019)
   (sys:getlx (t 2) 1)
   (sys:jmp #:l0020)
   (sys:getlx (t 2) 2)
   (end (t 2)))
There is a (t 2) syntax for the t002 register, labels are uninterned symbols like #:l0019.

The assembler just works with that. It has an object for each opcode which knows how to encode it into the instruction buffer, plus logic for backpatching labels. When a forward jump is assembled, the label is added to a list of places needing backpatches, along with a function to do the patching; later, when the label is defined, the matching places are patched with the now-known offset.

The assembler itself contains very little state: buf is the buffer holding the output code (initially empty, so it shows up as #b''). There is a bstr, which is a stream over that buffer so we can do file-like I/O on it. labref and labdef are hash tables for the label backpatching, and max-treg keeps track of the maximum T register number seen. The VM description emitted by the assembler records this, so when the VM is run, it allocates only as many T registers on the stack as the code requires.
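The backpatching scheme can be sketched in Python (instruction names borrowed from the disassembly above; everything else here is invented for illustration): forward references record where a patch is needed, and defining the label fills in the real offset.

```python
# asm is a list of items: a string is a label definition, a tuple
# (op, target) is an instruction. Branch targets may reference
# labels that are not defined yet (forward references).
def assemble(asm):
    code, labdef, labref = [], {}, {}
    for item in asm:
        if isinstance(item, str):              # label definition
            labdef[item] = len(code)
            for pos in labref.pop(item, []):   # patch earlier forward refs
                op, _ = code[pos]
                code[pos] = (op, labdef[item])
        else:
            op, target = item
            if target in labdef:               # backward ref: offset known
                code.append((op, labdef[target]))
            elif op in ("jmp", "if"):          # forward ref: patch later
                labref.setdefault(target, []).append(len(code))
                code.append((op, None))
            else:                              # non-branch: emit as-is
                code.append(item)
    return code

# The (if x y z) shape from the trace above, with symbolic labels:
prog = [("if", "else"), ("getlx", 1), ("jmp", "skip"),
        "else", ("getlx", 2), "skip", ("end", 0)]
print(assemble(prog))
# [('if', 3), ('getlx', 1), ('jmp', 4), ('getlx', 2), ('end', 0)]
```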

Ok, here is mine: https://github.com/wenerme/bbvm

Writing a VM is very fun and addictive. Implementing one in different languages especially teaches you a lot!

Great site

This was a great piece, but I found it funny that there's a bug in the very first line of code mentioned:

    /* 65536 locations */
    uint16_t memory[UINT16_MAX];
Spot the leak.
