
Why I wrote a book about interpreters - deanmen
http://thorstenball.com/blog/2016/11/30/why-i-wrote-a-book-about-interpreters/
======
haberman
The way I would sum up interpreters and compilers, trying to put it as
concisely as possible:

Step 1. Write a program that reads the input program into a data structure
that exactly represents the language. You just wrote a parser.

Step 2. Write a program that operates on the data structure from (1) to
execute the program. You just wrote an interpreter.

Step 3 (Optional): Write a program that converts the data structure from (1)
into a different data structure that you can interpret more efficiently. You
just wrote an optimizer.

Step 4 (Optional): Keep iterating on (3) until the output of your optimizer is
machine code. You just wrote a compiler.

Not saying the above obviates the need for a book like this one (at all), I
just had never thought of it in quite this way and wanted to write it down. :)

~~~
Arnavion
Nit: The output does not have to be machine code to be considered a compiler.

~~~
taneq
What, then, is the difference between a compiler and an interpreter? Just that
the compiler saves its output rather than executing it?

~~~
Patient0
I would define it as: when a compiler operates on its own source code, the
resultant compiled compiler executes as fast as the original. When an
interpreter evaluates its own source code, the resultant interpreted
interpreter is an order of magnitude slower. This definition allows for things
like "just in time" compilation, and rules out silly tricks like bundling an
interpreter and some source code into an executable image.

~~~
SixSigma
Until you get to the Forth family

~~~
bogomipz
Why does that change with going to Forth?

~~~
SixSigma
The speed differential. Your basically making a list of jumps into machine
code and then jumping in a loop. The interpreter takes your input and turns it
into a jump list which it pumps into the loop. You can reuse a jump list by
defining it as a new word.

------
zellyn
I highly recommend the Coursera compilers class: you don't have to work though
it in synch with a class; just sign up and work at your own pace. If you have
to watch the same video two or three times, you can. None of this stuff
actually turns out to be that complicated: it just takes a little while to get
the lingo and concepts.

I implemented the entire parser, typechecker, and compiler in Go here:
[https://github.com/zellyn/gocool](https://github.com/zellyn/gocool)

I then went back and hand-wrote a recursive descent parser, just for fun: Cool
is such a simple language that it really wasn't difficult.

~~~
riffraff
FWIW, I think the coursera class is now gone, but the course should have
matched this
[http://web.stanford.edu/class/cs143/](http://web.stanford.edu/class/cs143/)

~~~
zellyn
That's a pity: the lecture videos were worthwhile, and I got all the unit
tests by dissecting the auto-checking scripts that the class supplied :-/

~~~
riffraff
it's possible they are available in the internet archive backup, but I don't
have idea how to inspect those without downloading tens of gigabytes of stuff.

[https://archive.org/details/archiveteam_coursera](https://archive.org/details/archiveteam_coursera)

~~~
sn9
It seems to be available from Stanford's Lagunita site:
[https://lagunita.stanford.edu/courses/Engineering/Compilers/...](https://lagunita.stanford.edu/courses/Engineering/Compilers/Fall2014/about)

~~~
zellyn
Nice! Same lecturer: Alex Aiken

------
nickpsecurity
[Ignoring the dupes since they got no attention. Good reason to resubmit if
material is worthwhile.]

The author seems to be doing a good thing. Like he said, most of the write-ups
on this subject are either ultra-heavy with theory or basically nothing with
code examples. Doing interpreters piece-by-piece like in SICP or The Little
Schemer series in an accessible language gradually giving them the code and
theory they need is a good idea.

It could also help in my verifiable builds scheme where people show no
subversion exist by building from ASM to small language (or interpreter) then
to bigger one then whole compiler. I was debating p-code, Oberon, Scheme,
MiniML... Biggest problem is that the best stuff, eg Scheme's or ML's, is
least likely for imperative programmers to try to understand. Something like
this could help if I use an imperative base.

~~~
CarolineW
At the time I pointed out that this is a dupe the previous submission had been
less than an hour earlier, scarcely giving it enough time to get any
attention.

But screw that, let's just submit things every 10 minutes until they get
upvotes.

~~~
nickpsecurity
Oh I overlooked that. Guessed that the others were older. My bad on that.

------
gekkonier
Except the dupe story, does anyone gave the book a closer look and can say if
it's suitable for beginning programmers? By that I mean if it's suitable to
study and implement an interpreter in other language without a degree in
computer sience?

Programming is one of my hobbies and my dream is to do an interpreter on my
own, but most books are so heavy to understand if you know what I mean.

Thank you very much for your opinion.

~~~
misternugget
Disclaimer: I'm the author of the book and thus pretty biased.

As a matter of fact, I wrote this book specifically for people like you and
me: no CS degree, highly interested in interpreters and compilers, but
intimidated by the existing literature. (If you studied compilers in college,
this book will probably teach you nothing new.) There's a sample on the
landingpage that should give you an impression of the difficulty - my guess
is, that if you already know how to program and know the basics of Go, you'll
get along just fine.

~~~
senko
On a total tangent, I'd love to get that metacircular interpreter poster, any
chance you'd like to sell a print-quality copy? :)

------
ChicagoDave
I like these kinds of books a lot. I'm not at all interested in Go, so I'll
probably convert the code to C# for my own amusement...but I dig it. I've
mucked with lexers for text editing purposes and have always wanted to build
my own compiler. It's somewhere down on the bucket list.

------
CarolineW
Dupe:
[https://news.ycombinator.com/item?id=13079250](https://news.ycombinator.com/item?id=13079250)

And again:
[https://news.ycombinator.com/item?id=13071939](https://news.ycombinator.com/item?id=13071939)

 _Edit: To those who downvoted me, let me just point out that this was
submitted less than an hour after the previous submission, hence my pointing
at that (barely) earlier item._

~~~
alayne
Please don't post previous submissions without comments, they are worthless.

~~~
CarolineW
So let me ask a serious question here. Suppose someone submits a story. The
following day it has few points, and no comments. If someone then submits the
same story, fair enough.

Now suppose the story is submitted, has few points, no comments, and then 30
seconds later someone submit the same story. Is it fair to point at the one
that's only 30 seconds old? I would say yes, you might disagree, I would be
interested to know what you think.

If it's reasonable to point at a submission that's only 30 seconds older, then
we have a question:

How old should a previous submission be to declare that it's fair enough to
give it a second chance?

I would say that an hour isn't long enough, so if something is submitted a
second time within an hour then it's perfectly reasonable to point at the
previous submission.

If you disagree, fine, say so. If you think it's reasonable that something is
submitted within seconds of a previous submission, say so.

I'd be interested to know what you think is reasonable. It's plausible that we
have different opinions about what is reasonable in this context. If so, I'd
like to know what you think.

~~~
alayne
If you have issues with the site functionality, you should contact
hn@ycombinator.com.

~~~
CarolineW
So it's OK for you to tell me what to do, and what not to do:

    
    
      > Please don't post previous
      > submissions without comments,
      > they are worthless.
    

... and yet you decline to answer my question, or to tell me your opinion on
these matters.

That's a pity. I'd've liked to have known your thoughts as to whether it's
reasonable to have multiple submissions within minutes without indicating the
duplication.

------
tzs
A fun and fairly simple project, with a surprisingly high ratio of usefullness
to effort, is to write an interpreter for a concatenative language.
Concatenative languages, like FORTH, can do a lot with very limited resources,
making them good candidates for embedded systems.

If you want to play around with making your own concatenative language, it is
actually surprisingly simple. Here is an overview of a step-by-step approach
that can take you from a simple calculator to a full language with some
optimization that would actually be quite reasonable to use in an embedded
system.

So let's start with the calculator. We are going to have a data stack, and all
operations will operate on the stack. We make a "dictionary" whose entries are
"words" (basically names of functions). For each word in the dictionary, the
dictionary contains a pointer to the function implementing that word.

We'll need six functions for the calculator: add, sub, mul, div, clr, and
print. The words for these will be "+", "-", "x", "/", "clr", and "print". So
our dictionary looks like this in C:

    
    
        struct DictEntry {
            char * word;
            int (*func)(void);
        } dict[6] = {
            {"+", add}, {"-", sub}, {"x", mul},
            {"/", div}, {"clr", clr}, {"print", print}
        };
    

We need a main loop, which will be something like this (pseudocode):

    
    
        while true
            token = NextToken()
            if token is in dictionary
                call function from that dict entry
            else if token is a number
                push that number onto the data stack
    

Write NextToken, making it read from your terminal and parse into whitespace
separated strings, implement add, sub, mul, div, clr, and print, with print
printing the top item on the data stack on your terminal, and you've got
yourself an RPN calculator. Type "2 3 + 4 5 + x print" and you'll get 45.

OK, that's fine, but we want something we can program. To get to that, we
first extend the dictionary a bit. We add a flag to each entry allowing us to
mark the entry as either a C code entry or an interpreted entry, and we add a
pointer to an array of integers, and we add a count telling the length of that
array of integers. When an entry is marked as C code, it means that the
function implementing it is written in C, and the "func" field in the
dictionary points to the implementing function. When an entry is marked as
interpreted, it means that the pointer to an array of integers points to a
list of dictionary offsets, and the function is implemented by invoking the
functions of the referenced dictionary entries, in order.

A dictionary entry now looks something like this:

    
    
        struct DictEntry {
            char * word;
            bool c_flag;
            void (*func)(void);
            int * def;
            int deflen;
        }
    

(continued in reply)

~~~
tzs
Now the main loop has to look something like this:

    
    
        while true
            token = NextToken()
            if token is in dictionary
                if dict entry is marked as C code
                    call function from that dict entry
                else
                    InterpWords(def, deflen)
            else if token is a number
                push that number onto the data stack
    

and InterpWords looks something like this

    
    
        void InterpWords(int * def, int deflen)
            while (deflen-- > 0)
                InterpOneWord(&dict[*def++])
    
        void InterpOneWord(dict * dp)
            if (dp->c_flag)
                (*dp->func)()
            else
                InterpWords(dp->def, dp->deflen)
    

With this, we can add new built-in functions defined in terms of already
defined functions. Before we do any of that, though, we should add some more
stack manipulation functions. Add at least "dup", "drop", and "swap", which,
respectively, push a copy of the current top of stack onto the stack, pop an
entry from the stack, and exchange the top two items on the stack.

For reference, after adding those, the dictionary offsets of the 9 words we've
defined are:

    
    
        0 +
        1 -
        2 x
        3 /
        4 clr
        5 print
        6 dup
        7 drop
        8 swap
    

and they are all marked as C code. Now let us add a built-in function "sq",
which squares the top item on the stack. To do that, add a new dictionary
entry initialized like this:

    
    
        {"sq", false, 0, sq_def, 2}
    

where sq_def looks like this:

    
    
        int sq_def[] = {6, 2}
    

That's marked as being interpreted rather the C code, so when the main loop
finds it it will call InterpWords, passing it sqdef and 2, and InterpWords
will end up invoking dup and x, with the result being the top of the stack is
squared.

OK, that's all fine and dandy, but we want to be able to define functions in
our language from our language. So far, we can define functions in our
language, like sq, but we have to do it by editing a C data structure and
recompiling.

(continued in reply)

~~~
tzs
To do this, we'll need some changes to both InterpWords and/or InterpWord, and
to the main loop. The basic idea is that we add a "compiling" mode. When we
are in compiling mode the main loop does not execute the tokens it gets.
Instead it just makes a list of the dictionary offsets of the words it would
have executed. So in compiling mode, if you type "dup x", the main loop
doesn't actually call dup and x, it just makes the list {6, 2}. That is used
to make the function we are compiling. We'll have to find some way to tell the
interpreter what we want to call the new function. The interpreter can than
make a new dictionary entry for it, referencing that list of dictionary
offsets.

The above only covers words in compiling mode...what do we do when the main
loop sees a number in compiling mode? That's where we need to extend
InterpWords. We need to change the list of dictionary offsets it receives to
be able to also specify numbers. Maybe something like this (taking advantage
of the fact that dictionary offsets are always non-negative):

    
    
        void InterpWords(int * def, int deflen)
            while (deflen-- > 0)
                if (*def >= 0)
                    InterpOneWord(&dict[*def++])
                else if (*def == -1)
                    push def[1] onto data stack
                    def += 2; deflen--;
                else error
    

A -1 in the word offset stream now means that the next item in the stream is a
number to push onto the stack rather than a word offset. So now we can support
functions like "dup x 10 +" that squares the top of the stack and adds 10,
which would compile to {6, 2, -1, 10, 0}

We still aren't there, though, because we don't yet have a way to enter or
leave compiling mode. That's easy. We'll add a new built-in function, ":".
We'll write this in C. What ":" does is read the next token, see if it is a
legal word name and not already in use. If it is (legal) and isn't (in use),
it creates a new dictionary entry for it, allocates space for the definition,
and flips on the global compiling flag.

Great! Now we can start defining new words from in our language. Just one
problem...how do we terminate compilation? The main loop is compiling
everything, so how do we get it to execute the command to stop compiling and
finish making the dictionary entry?

For that, we need yet another flag in the dictionary: the immediate flag.

    
    
        struct DictEntry {
            char * word;
            bool c_flag;
            void (*func)(void);
            int * def;
            int deflen;
            bool immediate;
        }
    

Change the main loop so that when it sees a word with the immediate flag set,
it executes it even if it is in compiling mode. With that, we can add a word
";", with the immediate flag set, that turns off compiling mode and finalizes
the compilation of the word that was being compiled. When we get these changes
in, then we would be able to define "sq" from within the language instead of
making it a built-in like we did earlier: ": sq dup x ;"

Of course, we probably want our language to support some kind of if/else
construct and some kind of looping construct.

I bet you can see how this will be done. Most of the work is done in
InterpWords. Much like we used -1 to indicate that the next entry is a number
for the stack, we'll use -2 to indicate a goto. The entry after -2 will be the
offset to go to. (Should it be an absolute offset or relative? Your choice).
We'll use -3 to indicate a conditional goto. It works like -2, except that it
only goes if the top of the stack is non-zero, and it pops the top of the
stack. If the top of the stack is zero, it pops it, and then skips over the
offset.

Here's how we could use this to implement a do...while kind of loop. First
think of how we want it to look in our programs. It might be something like
this "do <body of loop> <loop condition> while". So we want the compiled code
to look like this:

    
    
        <dict offset list for body of loop>
        <dict offset list for loop condition>
        -3
        offset of body of loop above
    

To make "do" and "while" work, we need to flag their dictionary entries as
immediate words. So when we are compiling and hit "do" the code for "do" is
executed. For now, let's make this C code. What the "do" code does is simply
save the offset of where the next word will be compiled to. What the "while"
code does is put the -3 in the compiled code, and then put the offset that was
saved by the "do" into the compiled code.

You can do similar things for if/else constructs, for loops, and any other
control flow you want: make the low level operations you need in InterpWords,
and then make immediate words for use during compilation that gather and use
the necessary information to write that control flow.

Now let's talk a bit about optimization. One thing you can do is some
inlining. Consider this function ": cube dup sq x ;". When compiling this, we
could notice that "sq" is not C code, and put its definition inline, so we end
up compiling cube as if it were written like this: ": cube dup dup x x ;".

(BTW, remember earlier when I said it was up to you whether you use absolute
or relative offsets in the low level control flow? Relative will make inlining
easier if the inlined function has any branches).

Now if you are going to implement entirely in C or similar, that's about the
end of the simple optimizations (at least as far as my feeble mind can
handle). However, if you are willing to work in assembly language some, you
can get much fancier.

You can make the lowest level built-ins, such as add, mul, dup, and similar
assembly language. (In the following I'm going to continue to say C function,
for consistency with the earlier stuff, but what it now means is a function
that is not defined in our language). We can change the way non-C functions
are defined from being a list of dictionary offsets to a pointer to actual
code. So for ": sq dup x ;" instead of compiling that as a pointer to the list
{6, 2}, we compile it as a pointer to a list of call instructions, followed by
a return. For sq, that would look like this:

    
    
        call dup
        call mul
        ret
        

Cube would look like this (without optimization)

    
    
        call dup
        call sq
        call mul
        ret
        

and with sq inlined at the high level, it would become

    
    
        call dup
        call dup
        call mul
        call mul
        ret
    

For built-ins that are position independent assembly code, we can do another
level of inlining and replace calls to them with their body.

Depending on the addressing modes your processor supports, this could end up
with cube being as simple as four instructions (not counting the return), with
all the calls eliminated!

You can go farther and do things like cache the top few stack items in
registers, and add peephole optimizers that go over the compiled code to
eliminate redundant operations.

The really cool thing is how easy bootstrapping this kind of thing is. You
only need a small number of built-in assembly functions to give you enough to
write the rest of the language in itself, and these systems tend to be small
and perform reasonably well. These kind of languages can be great in embedded
systems, and indeed that is where the granddaddy of this kind of language,
FORTH, got its start.

~~~
jim_lawless
I agree with your approach! I built a non-Turing-complete Forth-like
interpreter in Go a few years ago with words similar to the set you've defined
above:

[https://lawlessguy.wordpress.com/2013/07/20/an-rpn-
interpret...](https://lawlessguy.wordpress.com/2013/07/20/an-rpn-interpreter-
in-go-golang/)

------
nv-vn
Awesome! As someone interested in compilers and interpreters who has not
completed a degree in computer science it was quite hard to get any of the
background. This post is spot on: you'll either read a 1000 page book about it
or you're stuck reading tiny blog posts. Notation also sucks for beginners,
few books ever bother explaining it at all. My favorite/most helpful text I've
found so far is Essentials of Programming Languages, which covers a wide
variety of topics and introduces them as parts of (complete) interpreters. The
code is written in very readable Scheme (so much so that I was able to
understand everything despite not knowing Scheme or Lisp at the time I began
reading it).

