And if you’re more interested in the bytecode/VM than the tree-walking part: I wrote and published a direct sequel to the mentioned book, called Writing A Compiler In Go, which — just like the tutorial we’re looking at here — shows how to turn the tree-walking interpreter built in the first book into a bytecode compiler and stack-based virtual machine.
The obvious implementation of a register file for an interpreter is using an array of registers, something like:
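A minimal sketch of that idea in Python (the register count and the `exec_add` encoding are illustrative, not from any particular VM):

```python
# A register file as a flat array; register numbers are just indices
# encoded as operands in each instruction.
NUM_REGS = 256
registers = [0] * NUM_REGS

# Example: an ADD instruction encoded as (dst, src1, src2).
def exec_add(dst, src1, src2):
    registers[dst] = registers[src1] + registers[src2]

registers[1] = 40
registers[2] = 2
exec_add(0, 1, 2)
# registers[0] is now 42
```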
So the advantages of register machines do not come from this theoretical closeness to the machine. They can still be better, though, because you eliminate a lot of bytecode for shuffling data around. I think the canonical source for the possible speedups is "Virtual Machine Showdown: Stack vs. Registers": http://www.usenix.org/events/vee05/full_papers/p153-yunhe.pd...
This paper also gives a very nice copy propagation algorithm that eliminates much of the need for a real register allocator, if you start from stack-based input code.
The flow analysis that modern JITs use to extract an SSA form from stack-based bytecode is just as easy to perform on register-based bytecode. This SSA form is then optimized, instructions selected, and CPU registers allocated.
From one perspective, you have a continuum from (essentially) infinite register machines with SSA (LLVM bitcode, SafeTSA) or memory machines (such as Inferno's Dis VM) to 256-register VMs (LuaJIT, Lua, others) to 16-register VMs (Dalvik, maybe others) to stack machines (JVM, p-code, FCode, etc.). From another perspective, stack machine bytecode is a compact representation of code for a register machine with 2 registers. Register machines aren't any less flexible than stack machines.
Relatively light on theory, which is a conscious choice by the author. The whole approach starts more from the applied angle than the theoretical, with the latter serving the former. It does discuss CFGs, for example.
(BTW, based on other comments I've seen by you before I doubt it will contain a lot of challenging content for you. But perhaps the writing style itself is interesting)
Since this is the sort of post I like, I have taken to doing technical posts on a topic myself. The thing I have found is just how time-consuming it is. Maybe I'm just slow at this.
SICP, for example, starts with this and then goes into its improvements, whilst better Lisps or Lua start with tighter opcode designs and do proper closures and lambda lifting. Python cannot even do proper lexical blocks/functions, the most important part of an interpreter/compiler design.
It's true that Lisp is a simpler language and thus better for learning this with if you know it.
It means that you can't really add things like stack traces, coroutines, exceptions, single stepping, etc and understand how they work under the hood.
sys.setrecursionlimit(2 ** 32 - 1)
On the other hand CPython in fact does extra work because it also creates heap allocated frame objects (which essentially mirror the C stack) for debugging purposes.
Your experiment shows that the C stack overflows faster than the heap. It does not show that no heap space is consumed at all.
No? Pretty much the entire point of the recursion limit is to protect against segfaults.
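A small demo of that point (the limit value of 100 is arbitrary): with a low recursion limit, CPython raises a catchable `RecursionError` well before the C stack is exhausted.

```python
import sys

# Lower the recursion limit so deep recursion raises a catchable
# RecursionError instead of exhausting the C stack and segfaulting.
sys.setrecursionlimit(100)

def recurse(n):
    return recurse(n + 1)

caught = False
try:
    recurse(0)
except RecursionError:
    caught = True

print("caught RecursionError:", caught)
```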
In comparing a bytecode interpreter with an AST tree-walker, and assuming parsing is not counted (i.e. we already have the AST), what makes the bytecode interpreter faster?
Walking a tree doesn't seem like that much of a performance penalty... perhaps the "execution model" includes something else?
For example, for an assignment you'll have an AST like (assign (foo 5)). This will have to look up "assign", run it, look up "foo" in the frame, and assign 5 to it. All of that operates on pointers to different parts of the heap and kills your CPU cache too.
With bytecode, a lot of it will be already reduced to (for example) a machine with stack, so the assignment will be closer to: read the op (jump address in an array, so already in cache), dispatch via a small lookup table, put 5 in slot 2 (names already rewritten to stack indices) in the current frame.
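A hypothetical sketch of what that assignment might compile to, assuming a stack machine where names have already been resolved to frame slot indices (the opcode names and slot number are made up for illustration):

```python
# Hypothetical bytecode for `foo = 5`, with `foo` resolved to
# local slot 2 at compile time.
LOAD_CONST, STORE_SLOT = 0, 1

code = [
    (LOAD_CONST, 5),   # push the constant 5
    (STORE_SLOT, 2),   # pop and store into local slot 2
]

stack = []
frame = [None] * 4     # local variable slots for the current frame

for op, arg in code:
    if op == LOAD_CONST:
        stack.append(arg)
    elif op == STORE_SLOT:
        frame[arg] = stack.pop()
# frame[2] is now 5
```

The dispatch is a couple of array reads and a branch, all on memory that stays hot in cache, rather than a chain of heap pointer dereferences.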
For looking up operators/keywords etc., preprocessing that step doesn't feel like cheating to me (though strictly, the result is something more than an AST).
With bytecode, some things look easier and the performance boost is nice, but you need to commit fully to becoming an "abstract machine". That's cool if the host language doesn't match the control flow you want, but it's extra work otherwise.
With a naive AST interpreter you create an object for each language construct, which has references to other AST objects in a big tree. Execution involves calling eval() on the root and recursively allowing sub-expressions to bubble up their evaluated results. This approach is nice because execution and code look similar, but it is also bloated, since you create an object for each construct which is thrown away after calling eval(), and it is susceptible to stack overflow.
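A minimal sketch of that object-per-construct style (the node classes are illustrative):

```python
# A naive tree-walking interpreter: one object per construct,
# each with an eval() that recursively evaluates its children.
class Num:
    def __init__(self, value):
        self.value = value
    def eval(self, env):
        return self.value

class Var:
    def __init__(self, name):
        self.name = name
    def eval(self, env):
        return env[self.name]

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, env):
        return self.left.eval(env) + self.right.eval(env)

# (x + 1) + 2  with x = 39
tree = Add(Add(Var("x"), Num(1)), Num(2))
result = tree.eval({"x": 39})
print(result)  # 42
```

Every eval() call here is a host-language function call chasing heap pointers, which is exactly the overhead a bytecode compiler removes.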
With a bytecode interpreter you simplify the execution environment to a list of instruction codes, an env, and a self-maintained call stack. Each op is a simple manipulation of the stack or env, with little wasted effort.
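The same expression flattened into that shape, as a hedged sketch (the opcodes and encoding are invented for illustration):

```python
# The same idea as a dense instruction list, an explicit operand
# stack, and a single dispatch loop.
PUSH, LOAD, ADD = range(3)

def run(code, env):
    stack = []
    for op, arg in code:
        if op == PUSH:
            stack.append(arg)       # push a constant
        elif op == LOAD:
            stack.append(env[arg])  # push a variable's value
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack.pop()

# (x + 1) + 2
code = [(LOAD, "x"), (PUSH, 1), (ADD, None), (PUSH, 2), (ADD, None)]
result = run(code, {"x": 39})
print(result)  # 42
```

No per-node objects survive past compilation; execution is one loop over a flat list.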
Overall, compiling to bytecode is equivalent to memoizing the AST to remove recursion from evaluation. It is much more likely to be cache-friendly as well.
It really is. At the level of a language interpreter, cycles count and things like cache misses become important. Bytecode keeps the instructions densely-packed in memory and the dispatching code fairly densely packed in memory too.
With a tree-walker, you are dereferencing pointers, skipping around in memory, and blowing the cache frequently. If you're using the Visitor or Interpreter pattern, you have the overhead of looking up the vtable, finding the method pointer, and invoking it. There's the overhead of the function call itself too -- saving and restoring registers, etc. It all adds up quickly.
It's a bit slower to get rolling than in Python, especially function value representation and dynamic dispatch: unlike in Python, I couldn't just make a callable object on the fly. Handling function/procedure calls in a stack VM is definitely a fun exercise, and there seem to be many ways to skin the cat: a shared vs. dedicated data stack across functions, adding registers (an unlimited number if required), etc.
Oh, wow, you've really rocketed past the book. That's awesome!
Just executing off the parser is also a strategy. That's what some people do, since you can get even more impressive speed wins by simply making small/dense programs.
Awesome to see your post on HN! Hope all is well :)