Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: Writing my first interpreter in C for fun (github.com)
109 points by jhedwards on June 21, 2016 | hide | past | web | favorite | 52 comments



Sometimes i wish for compiler tutorials that progress feature by feature, starting with simple features, rather than starting with complicated things like parsing expressions with multiple op codes (op codes being implemented as function calls, so it gets even more complicated)...

For example, a sequence like this:

1. variable assignment

2. commands of one parameter

3. if / branch statements

4. much later: nested expressions, function calls, multiple stacks, environments

basically, starting with an abstraction of 3 parameter opcodes, and then slowly building features up the abstraction layers, as well as slowly building down the optimization layers, step by step.


You are looking for this, "let's build a compiler" http://www.stack.nl/~marcov/compiler.pdf


not exactly. more like an expanded version of chapter 5 of sicp https://mitpress.mit.edu/sicp/full-text/book/book-Z-H-31.htm...

I will put it on github if i ever get around to writing one myself.

I am thinking something targeting / including a toy version of qemu.



Hey this is great work! What resources would you recommend for someone else who wants to build their own toy language for fun and practice?


It is hard to beat using Prolog for whipping out interpreters. Like Lisp, you get to use symbols in terms, plus you also get to use Prolog's ability to use infix operators to create terms:

    eval(X,X) :- number(X).
    eval(X+Y,Z) :- eval(X,X1), eval(Y,Y1), Z is X1+Y1.
    eval(X*Y,Z) :- eval(X,X1), eval(Y,Y1), Z is X1*Y1.
    
    main :- write("Enter an arithmetic expression followed by a dot \".\" and a newline\n"),
            prompt(_Old,>),
            repl.

    repl :- read(Input),
            eval(Input,Answer),
            print(Answer), nl,
            repl.

You might like:

https://www.metalevel.at/lisprolog/

http://www.swi-prolog.org/

https://www.amazon.com/Art-Prolog-Advanced-Programming-Techn...


Evaluator; Parsers too. Prolog is an amazing idea (even with its limitation). But when you come from the bit array banging side of computer programming it's such an alien idea.


For those interested in parsing with Prolog, check out Definite Clause Grammars and The Craft of Prolog.

http://www.learnprolognow.org/lpnpage.php?pagetype=html&page...

https://www.amazon.com/Craft-Prolog-Logic-Programming/dp/026...


Just a logically way of thinking. Many people don't know it, but Java is verified with prolog.

https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.ht...


Haaaa I remember reading the spec, before I knew prolog .. didn't realize what I was reading. Neat.


You can always try Erlang's pattern matching..


I've also been creating a programming language recently [1]. I'd recommend looking into the pypy tool chain. It makes it pretty easy to hack together a decent interpreter (that JITs). That and rply for the front-end gives you enough tooling to be dangerous. Here's a good tutorial [2].

[1] https://github.com/DanielWaterworth/plastic

[2] https://morepypy.blogspot.co.uk/2011/04/tutorial-writing-int...


Thank you! I've seen a lot of great articles and tutorials on HN about this subject, so you can probably find a lot of good resources doing a search here. To start I can link you to an article I wrote that teaches the basics of tokenizing and parsing: https://angularclass.com/making-a-mini-lisp-introduction-to-...


The classic way would be to start out with LEX/FLEX and YACC.


Forget about lex, flex, bison, yacc and all that. It's way easier and more important to start with a hand written recursive descent parser.


Q: Why do those programs exist? Lex and yacc originally come from Bell Labs. Why wouldn't they use what you recommend instead?

And when I look at the credits for lex and yacc today I see some very capable programmers. Why do these programs exist? I may be biased somewhat as I use them regularly for small tasks, but my question is an honest one.

Also, someone else mentioned swi-prolog. I think gprolog is just as easy and is noticably faster. Curious what HN readers think are the advantages to swi-prolog.


Because many don't know any better and are stuck in 70's parser tools mindset.

Tools like ANTLR[0], SableCC[1], JavaCC[2], MPS[5] and approaches like Attribute Grammars[3], Parsing expression grammars[4] are much more suitable for the modern days.

I don't know gprolog, but SWI has GUI tooling, compilation and I was already using it in 1997, so it is very sound toolchain.

[0] - http://www.antlr.org/

[1] - https://github.com/SableCC/sablecc

[2] - https://javacc.java.net/doc/JJTree.html

[3] - https://wiki.haskell.org/Attribute_grammar

[4] - https://en.wikipedia.org/wiki/Parsing_expression_grammar

[5] - https://www.jetbrains.com/mps


I've tried Antlr and it was a horrible experience compared to Menhir: Being stuck with LL(n) causes a lot of unnecessary refactoring, the syntax for anything out of the ordinary is arcane and IMO not well documented.

Give me the 70's tools any day.


You should of course use a tool that supports the grammar productions of the language being designed.

I just listed a few examples.


Menhir is hardly a 70s tool...


This is true of course. But it is ocamlyacc compatible and uses a separate lexer.

I'd at least say that it is using the lex/yacc approach, which I actually like.


LL(n) can be limiting indeed. But you can easily mix, say, PEG and Pratt. Because where else do you need LR-like parsing but in the binary expressions, the C-style dot syntax and so on.


I think this is kind of a big question, isn't it? You're in effect asking why people would write LL parsers versus LR.

The short answer might be that LL is simpler, and that handwritten recursive descent might be faster (especially if lexing and later stages are directly integrated into the parser) than tooled LALR parsers. LR is more expressive and can directly handle more grammars?


I think that, in the general case, the idea that recursive-descent parsers can only parse LL(k) languages is not quite true. For some theoretical conception of "parser", perhaps, but most real, hand-rolled recursive-descent parsers rarely recognize precisely an LL(k) language, because programmers tend to mix-and-match algorithms (mixing in a little shunting yard or precedence-climbing to handle infix operators, for example) and they tend to connect the parser with other modules in less restrictive ways (add in some two-way communication with the lexer and you can parse yourself a context-sensitive language, even!).

That's not to say that I don't agree with the general argument that there's no sense in shunning parser-generators all the time. I've made good use of yacc (and ocamlyacc, and menhir, and Lua LPEG, etc.) in the past, and the parser for our commercial compiler at work is written with ANTLR to generate an LL() parser in C++. Such metaprogramming tools let us work at a higher level of abstraction, and we get all the usual benefits of working at a higher level.

I think that, in most* cases, there's really no good reason not to use a parser generator (or, alternatively, parser combinators). I don't even suspect that you learn anything in particular that you wouldn't learn by learning how to formalize your grammar and specify semantic actions for a parser generator (unless perhaps you aren't familiar with mutual recursion, that is...). However, there are some good reasons not to use a generator[1], and, as with anything, the trade-offs need to be carefully considered before deciding.

This all reminds me -- I've been meaning to play with Treetop for Ruby lately :)

  --
[1]: For example, it tends to be much easier to provide good error messages with a recursive-descent parser. It's not always obvious how a particular configuration for a table-driven parser corresponds to a particular mistake or code formation. There has been some fruitful research in that area of late, though.

There's also something to be said for not having a dependency on an extra tool, especially since such tools tend to be complicated and difficult to debug or extend. As such, when you encounter a language construct that's not easily definable for the generator, you may have have any easier time resorting to some hacks to get what you want rather than just extending the tool.


> it tends to be much easier to provide good error messages with a recursive-descent parser

It is not any more complicated with a parser generator (if it's a PEG-based one). In fact, it's much easier, you're avoiding a lot of boilerplate this way.

> There's also something to be said for not having a dependency on an extra tool

Do not have a dependency on an extra tool. Use a meta-language, in which a parser generator can be embedded as a first class language feature.


> It is not any more complicated with a parser generator (if it's a PEG-based one). In fact, it's much easier, you're avoiding a lot of boilerplate this way.

True. PEGs are effing great!

> Do not have a dependency on an extra tool. Use a meta-language, in which a parser generator can be embedded as a first class language feature.

This is ideal, but unfortunately not always possible. For example, if you're stuck writing a language processor in C, you don't get the necessary tools for linguistic abstraction without an external dependency. In general, though, I agree with the sentiment.


they are not very noob friendly, "shift reduce conflict found in state 111" etc :)


Given how much of a novice I still am in the subject, might want to take my suggestions with a grain of salt compared to others.

I have always really enjoyed just reading the source code for programming languages. As I learn more and more, I seem to take away a bit more each time.

Personally I've enjoyed reading through the source code for Go, since it is hand written in Go. Being hand written, it can be a little repetition reading through it, but I find it to be pretty easier to read/understand.

https://github.com/golang/go/tree/release-branch.go1.6/src/g...

Also, I have read (at least parts of) the book, "Engineering a Compiler", which being a novice in the subject, some of it goes over my head, but I think it does a better job outlining the topic than any other books I have read.

https://www.amazon.com/Engineering-Compiler-Second-Keith-Coo...

Enjoy :)


I highly recommend Peter Norvig's example Lisp interpreters. He has Python and Java implementations and they are both very accessible, so you can digest them in a few hours' time. "Production" language interpreters tend to have so much complexity that it can be hard even for an expert to find the essence of the interpreter.


I recently created a toy interpreter for a minimalistic C-like language that is easily extensible:

https://github.com/bbu/simple-interpreter

I am also currently developing a "real" language that is based on the same lexing and parsing machinery:

https://github.com/bbu/quaint-lang



This looks great. Thanks!


I like Haskell + Parsec.



Very cool. Knowing little about programming language implementation, I wonder what it would take to make a compiler. For so simple a language, I would think it would be not much harder, no?


Not looked at OPs code, but if you build this stuff how a compiler book tells you to... The front end of a interpreter and a compiler is the same. You build a AST, if you interpret you well interpret that. If you have a compiler, the AST is turned into the target language, in most cases asm. The front end is the scanner/Lexer, the parser and most likely semantic analyser. So to answer you question :), if you make a non optimizing compiler, turning the AST into asm is trivial. The hard part of making compilers is the optimization. Most of the front end you do not need to write actually! There are compiler-compilers for that.

https://en.wikipedia.org/wiki/Abstract_syntax_tree

https://en.wikipedia.org/wiki/Compiler#Front_end

https://en.wikipedia.org/wiki/Lexical_analysis

https://en.wikipedia.org/wiki/Parsing

https://en.wikipedia.org/wiki/Semantic_analysis_%28compilers...


In fact, you can write a naïve interpreter or compiler completely in lex and yacc :)


Depends what you compile to. Writing a register allocator may be a little tricky (if you compile to a VM/Assembly/..), though if you compile to something like C, it's very easy. LLVM is another alternative which is also relatively simple to learn.


Spilling every variable to memory isn't too bad if you just want it to work and aren't worried about performance.


Linear scan register allocation is "good enough" in most cases and it is trivial if you're doing it on top of an SSA.


Compilers are in general far simpler than interpreters, especially if you're trying to build an interpreter with a reasonable performance.

For a simple language compiler is nothing more than a chain "parse -> strip off syntax sugar -> resolve lexical scope -> propagate types -> flatten the code -> emit target code", with each step being independent and trivial, nothing more than a set of tree rewrite rules.


Now to make it self hosting. ;)


Nice! If you're looking for another fun challenge, make your language compile into bytecode and execute that instead of an AST. :)


ASTs are fighting back these days! I work with a system at Oracle for going straight from AST to JIT, with no bytecode, and it's proving to be very fast and very easy.


That is super cool! Are there any public resources available about this?


The key paper: http://lafo.ssw.uni-linz.ac.at/papers/2013_Onward_OneVMToRul...

There's implementations of Ruby, JS, R, Python, etc, using it.

I have a blog about the Ruby effort http://chrisseaton.com/rubytruffle/


That's a nice set of readable code. I made a minor change just to add in floats instead of ints, which wasn't too hard to accomplish thanks to your structure.

(Though my changes are ugly).

If/When you get to implementing functions things will get really useful :)


I just realized reading your comment that I did implement functions but I forgot to document them! I will update the readme immediately.

If you look in the main file, huo.c, you will see something called store_defs. That function takes the ASTs of defined functions and stores them in a key-val type store. If someone invokes a user-defined function I just grab the AST, replace the variables with the values they passed in and execute it. That code is in process_defs and is invoked by execute.


Wonderful news. I managed to gloss over that :)


Nice work.

You may want to take a look at Bison/Flex. While they don't suit the tastes of everyone, it's good to know how to work with them. For example, BASH parser is written using those tools.


Nice work. What would take, to make it into a REPL? I haven't looked closely at the code yet, so maybe it's close to being there.


That's a good idea. I don't think it would be hard, you just get the interpreter to run in a loop and read from the keyboard input. I'll put that on the list for upcoming features :-)




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: