Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: How to write a recursive descent parser (craftinginterpreters.com)
347 points by munificent on March 20, 2017 | hide | past | favorite | 148 comments

I don't understand why so many people glorify hand-written parsers. Maybe because they never used good parser generators (not unlike type systems and Java) ?

Personal opinion: writing parsers is the least interesting part of an interpreter/compiler, and your grammar should be boring if you want your syntax to be easy to understand by humans.

Boring, in this case, mean LL(*), LR(1), or another well known class. Just pick a damn parser generator and get the job done quickly, so you can spend time on the really difficult tasks. The grammar in this article is LR(1) and is trivial to implement in yacc-like generators, locations tracking and error messages included.

Bonus point: since your grammar stays in a well known class, it's much easier for other people to re-implement it. You can't introduce bullshit ambiguous extensions to your grammar (C typedefs, anyone ?). This article gives a good explanation of this: http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why...

Hello, I work on the C# compiler and we use a handwritten recursive-descent parser. Here are a few of the more important reasons for doing so:

* Incremental re-parsing. If a user in the IDE changes the document, we need to reparse the file, but we want to do this while using as little memory as possible. To this end, we re-use AST nodes from previous parses.

* Better error reporting. Parser generators are known for producing terrible errors. While you can hack around this, by using recursive-descent, you can get information from further "up" the tree to make your more relevant to the context in which the error occurred.

* Resilient parsing. This is the big one! If you give our parser a string that is illegal according to the grammar, our parser will still give you a syntax tree! (We'll also spit errors out). But getting a syntax tree regardless of the actual validity of the program being passed in means that the IDE can give autocomplete and report type-checking error messages. As an example, the code "var x = velocity." is invalid C#. However, in order to give autocomplete on "velocity", that code needs to be parsed into an AST, and then typechecked, and then we can extract the members on the type in order to provide a good user experience.

My personal opinion is that everyone should just use s-expressions. Get rid of this whole debate :P

While I agree with you those 3 points are extremely important, it turns out there is at least one parser generator that can do all of it: http://gallium.inria.fr/~fpottier/menhir/

It supports both incremental parsing and an API to inspect and recover incomplete ASTs (which powers Merlin, the IDE-like thing for OCaml). It provides stellar debugging features for ambiguous grammars and a way to have good error messages (which is used in compcert's C parser and facebook's reason).

So, it's not impossible. Most parser generators are not that good, though.

Yeah, so we can spend a long time trying to find a parser generator that can solve all of our problems in the language that we use, which may or may not exist. Or you could just spend the same time to just write the parser. It turns out that writing a parser just isn't that hard, and in the time that you spend wavering on which parser generator to use, you could have most of your parser done.

> and in the time that you spend wavering on which parser generator to use, you could have most of your parser done.

Maybe, but if you want all of the features mentioned above (incremental parsing etc.), then writing a parser becomes much more complicated.

Or you could take the middle ground and use a library meant to make it easier to create hand built parsers. as opposed to a library that generates a parser for you.

* Repo: https://github.com/SAP/chevrotain

* Online Playground: http://sap.github.io/chevrotain/playground/

This handles the repetitive parts such as lookahead functions creation or Parse Tree building, while also providing advanced features such as error recovery or syntactic content assist (auto complete).

* ECMAScript family of languages supported (JavaScript/TypeScript/CoffeeScript).

Not really, it is rather easy to hack features into a recursive descent parser. That is is the main reason almost all production compilers are written that way.

With a parser generator, it quickly becomes a game of designing your language and its semantics around the limitations of the generator you are using.

That is super impressive! I can't find the part on incomplete AST or AST reuse in their reference docs though.

Details about the "incremental" mode are listed in the documentation PDF[0] at section 9.2

Here are the first couple of paragraphs:

"In this API, control is inverted. The parser does not have access to the lexer. Instead, when the parser needs the next token, it stops and returns its current state to the user. The user is then responsible for obtaining this token (typically by invoking the lexer) and resuming the parser from that state. The directory demos/calc-incremental contains a demo that illustrates the use of the incremental API.

This API is “incremental” in the sense that the user has access to a sequence of the intermediate states of the parser. Assuming that semantic values are immutable, a parser state is a persistent data structure: it can be stored and used multiple times, if desired. This enables applications such as “live parsing”, where a buffer is continuously parsed while it is being edited. The parser can be re-started in the middle of the buffer whenever the user edits a character. Because two successive parser states share most of their data in memory, a list of n successive parser states occupies only O(n) space in memory."

There does not appear to be a specific mention of having the partial AST available.

[0] - linked from their front page and available at http://gallium.inria.fr/~fpottier/menhir/manual.pdf


I did a part of the incremental parsing in Menhir and the whole recovery aspect. I can try to explain a bit. My goal was Merlin (https://github.com/ocaml/merlin/), not research so there is no paper covering the work I am about to describe. Also as of today only incrementality and error message generation are part of upstream version of Menhir, but the rest should come soon.

# Incrementality, part I

The notion of incrementality that comes builtin with Menhir is slightly weaker than what you are looking for.

With Menhir, the parser state is reified and control of the parsing is given back to the user. The important point here is the departure from a Bison-like interface. The user of the parser is handled a (pure) abstract object that represents the state of the parsing. In regular parsing, this means we can store a snapshot of the parsing for each token, and resume from the first token that has changed (effectively sharing the prefix). But on the side, we can also run arbitrary analysis on a parser (for error message generation, recovery, syntactic completion, or more incrementality...).

# Incrementality, part II

Sharing prefix was good enough for our use case (parsing is not a bottleneck in the pipeline). But it turns out that a trivial extension to the parser can also solve your case.

Using the token stream and the LR automaton, you can structure the tokens as a tree:

- starts with a root node annotated by the state of the automaton

- store each token consumed as a leaf in that tree

- whenever you push on the automaton's stack, enter a sub-level of the tree (annotated by the new state), whenever you pop, return to the parent node

This gives you the knowledge that "when the parser is in state $x and $y is a prefix of the token stream, it is valid to directly reduce the subtree $z".

In a later parse, whenever you identify a known (state number, prefix) pair, you can short-circuit the parser and directly reuse the subtree of the previous parse.

If you were to write the parser by hand, this is simply memoization done on the parsing function (which is defunctionalized to a state number by the parser generator) and the prefix of token stream that is consumed by a call.

In your handwritten parser, reusing the objects from the previous parsetree amounts to memoizing a single run (and forgetting older parses). Here you are free to choose the strategy: you can memoize every run since the beginning, devise some caching policy, etc (think about a user (un)commenting blocks of code, or switching preprocessor definitions: you can get sharing of all known subtrees, if this is of any use :)).

So with part I and II, you get sharing of subtrees for free. Indeed, absolutely no work from the grammar writer has been required so far: this can all be derived by the parser generator (with the correctness guarantees that you can expect from it, as opposed to handwritten code). A last kind of sharing you might want is sharing the spine of the tree by mutating older objects. It is surely possible but tricky and I haven't investigated that at all.

# Error messages

The error message generation is part of the released Menhir version. It is described in the manual and papers by F. Pottier (like http://dl.acm.org/citation.cfm?doid=2892208.2892224, PDF available on http://gallium.inria.fr/~fpottier/biblio/pottier_abstracts.h...).

I might be biased but contrary to popular opinions I think that LR grammars are well suited to error message generation.

The prefix propery guarantees that the token pointed out by the parser is relevant to the error. The property means that there exist valid parses beginning with the prefix before this token. This is a property that doesn't hold for most backtracking parsers afaik, e.g PEG.

Knowledge of the automaton and grammar at compile time allow a precise work on error messages and separation of concerns: the tooling ensures exhaustivity of error coverage, assists in migration of error messages when the grammar is refactored, or can give an explicit description of the information available around a given error.

This is not completely free however, sometimes the grammar needs to be reengineered to carry the relevant information. But you would have to do that anyway with a handwritten parser and here the parser generator can help you....

If you have such a parser generator of course :). Menhir is the most advanced solution I know for that, and the UX is not very polished (still better than Bison). It is however a very officient workflow once you are used to it.

So LR is not the problem, but existing tools do a rather poor job at solving that kind of (real) problems.

# Recovery

In Merlin's, recovery is split in two parts.

The first is completion of the parsetree: for any prefix of a parse, it is possible to fill holes in the tree (and thus get a complete AST).

This is done by a mix of static analysis on the automaton and user guidance: for major constructions of the AST, the grammar writer provide "dummy" constructors (the node to use for erroneous expressions, incomplete statement, etc). The parser generator then checks that is has enough information to recover from any situation, or point out the cases it cannot handle.

It is not 100% free for the grammar writer, but for a grammar such as OCaml one it took less than an hour of initial work (I remember only a few minutes, but I wasn't measuring properly :)). It is also a very intuitive step as it follows the shape of the AST quite closely.

The second part is resuming the parse after an error (we don't fill the whole AST every time there is an error; before filling holes we try to look at later tokens to resume the parse). There is no big magic for that part yet, but the heuristic of indentation works quite well: constructions that have the same indentation in source code are likely to appear at the same depth in the AST, and this is used to decide when to resume consuming tokens rather than filling the AST. I have a few lead for more disciplined approaches, but the heuristic works so well that it isn't an urgency.

# Conclusion

I would say that if I had to write a parser for a language for which a reasonable LR grammar exists, I will use a (good) parser generator without hesitation. For the grammars of most programming languages, LR is the sweet spot between parsing expressivity and amenability to static analysis.

Nice! By the way, I think the Dragon Book needs an update on incremental parsing :)

Another problem I have with parser generators is that they often have an awful API.

For instance, working in a OO language such as C#, I often want to turn the parse into an AST build out of idiomiatic objects.

Most compiler-compilers use the same "callback hell" interface that was used for the original yacc in the 1970's. Thus you wind up writing error-prone code.

Conceptually, the grammars used in compiler-compilers aren't that much more complex than regexes (you just have named sub-expressions, the possibility of recursion, and possibly some specification of associativity) Yet, regexes are probably used 100x more in application software development.

In many systems programming areas I think often using unergonomic tools is a badge of pride, it proves how smart you are, so I think ergonomics are often not taken seriously by those who develop the tools.

ANTLR4 automatically generates a visitor for your grammar that you can use to translate into your own idiomatic objects: http://jakubdziworski.github.io/java/2016/04/01/antlr_visito...

We use this in Presto for parsing SQL and generating an immutable AST:



I'm curious about your incremental re-parsing strategy.

I wasn't smart enough to figure out how to adapt these strategies to my LL(k) grammars (late 90s). Maybe someone's figured this out.

Efficient and Flexible Incremental Parsing TIM A. WAGNER and SUSAN L. GRAHAM


FWIW, I'm developing a library based on the technique outlined in this paper (and others by Time Wagner). Like the original paper, it uses LR(1) (and GLR), not LL(k).

The library itself is here: https://github.com/tree-sitter/tree-sitter

and here are some existing grammars:

  * https://github.com/tree-sitter/tree-sitter-javascript
  * https://github.com/tree-sitter/tree-sitter-go
  * https://github.com/tree-sitter/tree-sitter-ruby

Out of curiosity, how many people do you have working on the parser?

We don't have specific people that do parsing; if your feature needs changes to the parser, then you make those changes.

One nice thing about using recursive descent, is that your parser "is just a normal program." It's also a potential problem, but every substantive project needs coding standards anyhow.

I guess 12 or more poeple are on the new keyword team, always ready to jump on a new language feature added. After all it happens every 3-5 years.

Adding a new keyword to a parser is a huge work, even tests need to be written!

hint: irony

The best part about going open source was being able to outsource the bikeshedding discussions to the community!

(Just kidding, we still do them internally too)

> My personal opinion is that everyone should just use s-expressions.

Aren't S-expressions particularly problematic for providing autocomplete hints?

I don't see why they would be any better or worse than another grammar choice.

It might be true that s-expressions would make people prefer certain semantic decisions that would be difficult to analyze for autocomplete.

To clarify: do you think there are sets of semantic decisions using S-expressions that would still "get rid of the whole debate", and not make autocomplete harder in practice?

(not that it's important of course, but your remark made me curious, and the reactions to my question even more so)

Autocomplete is most commonly used on structures in order to find out which fields and methods that structure contains.

If you used an s-expression based grammar that still had fields and methods, then you'd get the exact same experience.

    foo.bar. <- autocomplete here

            ^ autocomplete here

Except that in some Lisps where you have multi-methods or generated field accessor methods you'd probably would write it backwards. Maybe like this:

    (. (bar foo) ...)
      ^ cursor is here

I develop Cursive, an IDE for Clojure code. I handle this for Java interop in a couple of ways. Cursive does type inference (or propagation, really) in the editor, so if you have enough type hints it knows what the type of a symbol is. This allows a few tricks (| is the caret):

If you've already typed the form and are just changing the head symbol, you're golden:

  (.myMeth| my-object)
If Cursive can accurately determine the type of my-object, you'll only see the relevant completions, i.e. methods from that type.

If you're using Clojure's threading forms, Cursive understands that:

  (-> my-object .myMeth|)
also works.

Otherwise, if you're typing out a new form you can do:

  (my-object .myMeth|)
and when you auto-complete the forms will be switched around to:

  (.myMethod my-object |).

Maybe a stupid question because I don't know much clojure but I'm curious. How can you be sure that -> is threading macro and my-object is what you think it is at compile time? Can't macros change the entire meaning of an expression?

That's not a stupid question at all, and yes they can. Cursive works using static analysis rather than runtime introspection, so I need to teach it how macro forms work. This uses an extension API internally which will eventually be open so that users can add support for their own forms or forms in libs they use. Cursive uses the information about the macro forms to do symbol resolution, so that's currently a best effort in the presence of forms it doesn't understand yet. It works pretty well though since Clojure doesn't tend to use as many macros as, say, Racket. Doing full resolution means that I can distinguish clojure.core/-> from some other symbol with the same short name, it's not just using the -> text.

One correction - this is all done at edit time, not compile time. At compile time there's no issue since all macros are expanded to the core language.

Thank you all for your comments. Swapping the forms is pretty neat!

Sure. That's a property of the language, not S-Expressions.

It's just like how SQL decided to write all their queries backwards making it impossible to do good autocomplete.

Lisps tend to be because of dynamic typing and macros; s-expressions themselves don't seem to add any particular difficulty.

Common Lisp via slime has quite good autocomplete because it extracts the information from a running system and doesn't have to infer it from the source code.

Are there any statically typed S-expression languages?

Typed Racket is the only one I've come across but I'm sure there are a few more.

Shen is another one:


There are also a ton of hobby languages that are statically typed and use s-exprs so the author doesn't have to spend as much time on syntax.

Clojure does allow type annotations that are used to avoid reflection. https://clojure.org/reference/java_interop#Java%20Interop-Ty...

    (defn len2 [^String x]
      (.length x))

Thanks it's great to hear from someone that makes the tools I use everyday.

> My personal opinion is that everyone should just use s-expressions. Get rid of this whole debate :P

C# 8 maybe?

The "best" part about project Roslyn is that if someone wanted to make an s-expression frontend, they could plug that into the compiler.

Untyped S-Expressions would be taking a step back from a C# perspective.

I don't see how s-expressions are inheritently less "typed" than utf-8 text?

Have a compiler target dialect that expects "(int 5)" and "(int x)" have a type-inference step that target such "typed" s-expressions?

There's a reason nearly every actually used programming language parser is hand rolled.

It's easier for adding errors messages, debugging and diagnostics tools. I know modern parser generators can do this as well. It's just easier in a recursive descent parser.

Learning to use parser generator is a learning experience in itself. Most can be pretty complicated to use, once you get past the basics. Yet nearly every programmer can understand a recursive descent parser. You can debug them using standard breakpoints for example. Yet parser generators can output quite difficult error messages, unless you have had a parser education.

In practical implementations, recursive descent can be optimised to be really fast, and to use a low amount of memory.

>I know modern parser generators can do this as well

Can you please list them here? These without build/runtime dependencies on java and other non-default platforms.

That alone would make the whole thread.

ANTLR has hooks you can use to help the error message.

Though it uses Java. What is a "default platform"?

Yet nearly every programmer can understand a recursive descent parser.

...and I also hypothesise that nearly every programmer will come up with some form of recursive descent parser if asked to parse a recursive language --- without ever having heard of the term "recursive descent" nor parser generators. It's the intuitive, "natural" solution to the problem.

Because the problem is not parsing alone, but doing type checks / ambiguity resolution during parsing and linking up the parse tree with type information etc.

See http://stackoverflow.com/questions/6319086/are-gcc-and-clang...

Also, you people may just want to learn how to build a parser.

Shameless plug if you need a small but flexible expression parser for java: https://github.com/stefanhaustein/expressionparser

My point is precisely that, if you are creating your language from scratch, you shouldn't have to do that. Those are mistake needed because the C grammar is full of cruft (and the C++ one is even worse).

It's much better, for a new language, to separate type checking from parsing completely, and to avoid any ambiguity.

In theory yes, in practice, if you want to make a serious improvement in the ergonomics of programming languages for "ordinary" folks, you have to face up to the fact that natural languages don't make an artificial distinction between different aspects of language.

> Because the problem is not parsing alone, but doing type checks / ambiguity resolution during parsing and linking up the parse tree with type information etc.

Holy shit, people think that's a good idea? Pipelines please!

It's often faster to do as much as you can in a single pass, and language front ends are one of those areas where squeezing every last drop of performance out still really matters.

It's actually faster to separate things, because you will be able to implement them cleanly. Having a good pipeline such as "text -> AST -> typed AST -> IR -> .... -> ASM/VM code" is very valuable.

The costy parts are the middle layer optimization anyway. Typechecking time, for a well written type checker, is not that big.

I think incremental compilation is a far more elegant and efficient way to shorten debug cycles.

Also makes IDE stuff easier too.

> Maybe because they never used good parser generators (not unlike type systems and Java) ?

Is there such a thing as a "good parser generator" where "good" includes "generates at least decent error messages"?

> Personal opinion: writing parsers is the least interesting part of an interpreter/compiler, and your grammar should be boring if you want your syntax to be easy to understand by humans.

Hardly. Humans have way less trouble with ambiguity than computers do, as demonstrated by, well, the entire history of programming languages. If syntactic simplicity and regularity were what humans seeked, the primary languages in use would be Lisps, Forths and Smalltalks, not C derivatives.

What are those "good error messages" in the parsing stage you keep talking about?

I don't remember I've ever used a language that reported anything different from "syntax error at line X" at this stage.

As a random example from Perl, Unmatched ( in regex... is the error when you open a capturing group in a regex and fail to close it.

In a recursive descent parser, this type of error is very easy to generate because working top down you know what you are in the middle of trying to do when you hit the error. But a bottom up parser lacks that context - it just knows where it was when it got stuck.

You can get a parser generator to automagically list you expected (non-)terminals at this point. It may not be helpful though, depending on the location of the error. But it is something at least.

writing parsers is the least interesting part of an interpreter/compiler, and your grammar should be boring if you want your syntax to be easy to understand by humans.

In my (admittedly limited) experience, most grammars for programming languages that are supposed to be easy on humans end up being harder to parse. (See below) One notable exception is, of course, Smalltalk.

One criticism I have of this article, is that it is written in a way that obscures the fact a top-down parser is an LL(k) parser. I'd posit that sticking to an LL(1) grammar is a great way to wind up with a simpler syntax that lacks sticky syntactic sugar that will be hard to optimize and build software tool for later on. (AT least for your personal learning hobby project. But in that case, at least use a separate Lexer!)

You can't introduce bullshit ambiguous extensions to your grammar

Ruby: http://po-ru.com/diary/ruby-parsing-ambiguities/

> One criticism I have of this article, is that it is written in a way that obscures the fact a top-down parser is an LL(k) parser.

Yes, I worry a lot that I conveniently designed Lox to be LL(1) which is why the parsing is so straightforward, but I don't spend a lot of time telling the reader about that fact.

At the very end, there's an aside that hints at that, but that's about it. Unfortunately, the chapter is quite long already, and it's hard to even start talking about what makes one syntax easier to parse than another without having to go pretty deep into lookahead, LL(k), left-factoring, etc.

So, for better or worse, I chose to leave it out. I look at this book as a guided tour of the language space. The reader may not know that they are being carefully herded away from the sketchy parts of town and not realize that things may appear cleaner and safer than they actually are when you start exploring on your own.

At the same time, it does ensure that their first experience in the area is a happy, positive one. My hope is that that will give them enough momentum to explore more, run into some of those nasty spots, and still have the fortitude to overcome them.

> I don't understand why so many people glorify hand-written parsers. [...]

I would say because they make the easy things easier, but the hard things harder.

If one tries to do something more elaborated, like getting a data structure for the AST (most tutorials just build simple "accepters" which I find unacceptable) and passing it to different functions - which is a standard procedure in an interpreter or compiler - things are not so trivial anymore and you need to know your generator really well and hope that your use case is supported.

Another approach are parser generators like you find in functional languages, and I really like them, as you do not have to learn an additional language and they compose incredible well. They can blow-up your memory requirements but this is another story.

I've never used a parser generator that was worth the extra time and hassle incurred by adding code generation and another implementation language into the project. Parsing just isn't that hard, and having your parsing code written in the same language as the rest of your compiler pays off over the long run.

To better understand how generated parsers work. ANTLR LL(k) thru LL(*) is largely a black box for me. I can't say writing my own helped my ANTLR skills. But hopefully I'm doing less trial and error now.

> I don't understand why so many people glorify hand-written parsers.

it is a learning exercise. Once you have written one, you can choose to go for yacc and co, or stick with hand written ones.

I don't understand your complaint, it's faster and easier to implement a recursive descend parser, than to use a yacc-like parser generator. It's not much of a choice, isn't it?

If you know the toolchains, that is backwards. It is faster and easier to use a parser generator, and easier to figure out what it is doing when you read one. Plus it performs better.

But recursive descent gives you more flexibility and better error messages.

Here's a recursive descent trick worth mentioning: instead of decomposing expressions into levels called things like "term" and "factor" by the precedence of the operators involved, you can do it all using a while loop:

    function parseExpressionAtPrecedence(currentPrecedence) {
      expr = parseExpressionAtom()
      while op = parseOperator() && op.precedence < currentPrecedence {
        if op.rightAssociative {
          b = parseExpressionAtPrecedence(op.precedence)
        } else {
          b = parseExpressionAtPrecedence(op.precedence + 1)
        expr = OperatorExpression(op, expr, b)
      return expr
The parseExpressionAtom function handles literals, expressions in parentheses, and so on. The idea is to keep pushing more operators on to the end of an expression until a higher-precedence operator appears and you can't any more. This technique (called precedence climbing) makes parsing these sorts of arithmetic expressions a lot less painful.

Yes, and if you need more expressiveness for the ternary operator and so forth, you will want to structure it as mutually recursive functions, like Pratt parsing.

See "Pratt Parsing and Precedence Climbing Are the Same Algorithm": http://www.oilshell.org/blog/2016/11/01.html

The difference is really about how you structure your code. If you have a single recursive procedure like the above, it looks like "precedence climbing". But if you have a set of mutually recursive procedures, it looks more like Pratt parsing. But they are the same algorithm, or you can say "precedence climbing" is just a special case of Pratt parsing.

Yes! Using explicit functions for each precedence level is a little tedious. It's also really simple and fairly common, though, so I thought it was a good gentle way to ease people into parsing.

In part III of the book, when we write a second interpreter in C, we use a Pratt parser for expressions and it's a lot less boilerplate-heavy.

By going to the other extreme, you can use only precedence climbing (with just a grammar) by using Floyd's algorithm, which will treat all tokens as operators. It uses two precedences functions instead of one, a left and right precedence.


This is very close to the way Babylon (Babel's JS parser) works:


If anyone in SF is interested in hacking babylon to build a superset of JS, I'm happy to get coffee and share what I've learned building LightScript [0]. It's been a lot of fun.

[0] http://lightscript.org

As a long-time C programmer, I am immediately urged to consider the simplification

    b = parseExpressionAtPrecedence(op.precedence + !op.rightAssociative)
...and IMHO those variable names are a bit longer than useful --- expr(prec), atom(), and op() would be my choice, as it shows the structure more clearly, which I think is definitely the most important thing to see here: it's a loop with a recursive call inside it, and an ending condition based on the precedence.

This technique (called precedence climbing) makes parsing these sorts of arithmetic expressions a lot less painful.

This article is a more detailed explanation of this technique: https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing

It also naturally leads to being table-driven, whereby adding/removing operators and adjusting their precedences becomes extremely easy.

Oh my god, a dream come true!

I had always hoped Bob Nystrom would write a book about interpreters/compilers.

Back when I tried to learn how to write a recursive descent parser, the examples I found either ignored correct expression parsing or wrote an additional parse method for each precedence level. Writing a parser by hand seemed just too much work. Along comes this great article about pratt parsers http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-e... and all can be done with two simple functions, a loop and a map. :) Saved my enthusiasm right there.

Another great example is this article about garbage collection: http://journal.stuffwithstuff.com/2013/12/08/babys-first-gar... Instead of making things more complicated than they are, Bob Nystrom simplifies daunting topics and makes the accessible.

Thanks so much for your work! Really looking forward to this one. Will there be code generation too? :)

You're welcome!

Some of the book will revisit topics I've already covered in my blog (but I'll write all new prose to seamlessly integrate it into the rest of the material and the language we're building). In Part III where we build an interpreter in C, it will use a Pratt parser and we'll implement a mark-sweep GC from scratch.

The book doesn't touch on native code generation because (1) that's a really big, messy topic and (2) I don't have much experience with it.

The C interpreter does compile to bytecode, though, so we cover a lot of the basic concepts around lowering the code into a denser more efficient representation, representing the stack and callframes, etc.

One piece of advice I have for people writing a new language.

Consider making your language so you can do one-character lookahead, with something like the 'shunting algorithm' to handle precedence.

I work on a system called gap ( www.gap-system.org ), and when parsing we only ever need to look one character ahead to know what we are parsing. This makes the error messages easy, and amazingly good -- we can say "At this character we expected A,B or C, but we found foo". It also makes the language easy to extend, as long as we fight hard against anyone who wants to introduce ambiguity.

If your language is ambiguous, and you have to try parsing statements multiple times to find out what they are, then your error handling is always going to be incredibly hard, as you won't know where the problem arose.

Of course, if you are parsing a language given to you, you just have to do the best you can.

It's a balancing act between designing syntax that makes sense for the programmer and syntax that makes sense for the parsing technology.

I'd feel a bit silly explaining to someone that some twisted or ugly syntax construct was like that because it made writing the parser a bit easier, when the parser only has to be written once but the syntax has to be used in every program written using the language.

Code that's hard to parse is easy to get wrong in surprising ways. Difficulty in parsing almost always means ambiguous parses that need to be resolved with sufficient lookahead or semantic lookup; if the user makes a mistake, it may be in one of the ambiguous cases, which leads to surprise. In particular, error messages are impossible to keep clearly prescriptive in every case, because thy needs to take account of the inherent ambiguity.

Does anyone have any examples of popular languages that are very easy to parse or ones that have a lot of ambiguity?

I assume the lispy type languages fall into the easily parsed bucket, perhaps tcl as well? And Perl may be a good example of a language that has notable ambiguity?

* Easy:

Python, Java. Python generates a top-down parser from Grammar/Grammar using its own pgen.c system, and the Java spec has a grammr. This doesn't mean there's no ambiguity, but they are easier.

* Pretty easy:

JavaScript, Go. I think the semi-colon insertion rule kind of breaks the lexer/grammar model, but they are still easy to parse.

* Hard:

  - C (http://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar)
  - Ruby (related thread: https://news.ycombinator.com/item?id=13825975)
  - OCaml -- AFAIK, a naive yacc-style grammar will generate tons of shift-reduce conflicts
  - Haskell -- based on hearsay, I think this is hard
* Insane:

C++, Perl, to some extent bash. See the end of my post "Parsing Bash is Undecidable" for links: http://www.oilshell.org/blog/2016/10/20.html

I'm not as familiar with languages like PHP and C#, but they're somewhere between the easy and hard categories.

Smalltalk is famously easy to parse. They always proudly said you can fit the entire grammar on a single page. It's a really compact, neat syntax.

Pascal is also pretty clean and regular. Unlike Smalltalk, it was designed more to make the compiler's job easier, so not only is the grammar pretty regular, but it's organized such that you can parse and compile to native code in a single pass.

C is a pain to parse, it's context-sensitive in some areas. C++ inherits all of that and adds a lot more on top.

My understanding is that Perl can't be parsed without being able to execute Perl code since there are some features that let you extend the grammar in Perl itself?

Perl is mostly hard to parse because it using a dynamic lexer. The parser itself uses a normal static yacc grammar, but the lexer drives the parse dynamically. Eg if a class or method is already defined or not, and what types to expect as arguments. It's crazy, but beautiful. Not so insane, more like a very clever hack.

Forth is probably the simplest to parse, and when you write a forth you usually write your parser in assembly. It consists of "start at the first non-whitespace character of the pad and continue until you reach another whitespace character; that's a token" (though "token" has a different meaning in forth; it's a "word").

Lisp seems like it should be nearly as simple: recursively read in a balanced s-expression and attempt to intern the atoms in that s-expression; error out if any of those steps fails. However, Common Lisp's scanner/parser (called the "reader") allows the user to define arbitrary macros; basically the scanner has access to the full compiler if you want it to, so it can get really complicated really quickly.

And, just to finish that thought, "read macros" aren't some abstract corner case of interest to theoreticians; they are how a whole lot of the language and libraries (e.g. complex numbers, the regex library, etc.) are implemented.

Raw lisp is obviously very easy, not sure about the various schemes. OCaml, the whole SML family are easy to parse (just an LR grammar that is easy to copy). I think Rust is the same. IIRC, Java is LL(k), so very easy.

C is not LR(1) for annoying reasons (typedef). Same for Go, IIRC. Otherwise, it's okayish. Javascript has this inane "optional semicolon" thing, which probably makes it annoying.

C++ is probably the most insane thing to parse. And then, there is bash[1].


Rust is 99% easy to parse, but there's one, rarely used, syntactic construct that's context-sensitive.

Which one is that?

Raw string literals https://doc.rust-lang.org/stable/reference.html#raw-string-l... and raw byte string literals https://doc.rust-lang.org/stable/reference.html#raw-byte-str...

  r"…", r#"…"#, r##"…"##
  br"…", br#"…"#, br##"…"##
you need to count the #s.

JavaScript's not exactly "easy" to parse (though "easy" is subjective), but it is (mostly?) parseable with an LR(1) parser (only requires lookahead of 1 character) -- though it does require reinterpretation.

I used to use gap! I'd love some examples of explicit choices that were made that allow one-character lookahead.

I can tell you some things where were rejected.

One example of non-one-token lookahead is lambda functions, 'x -> x*2'. This means when you read an identifier, before you check if it valid in this scope you have to check if the next thing is '->'. If it is, this is a new identifer, if it's not, it's a variable reference and better be in scope.

Now, let's consider multi-argument lambda functions. One possible notation is: (a,b,c) -> a+b+c. However, (a,b,c) is how GAP denotes permutations, and it was decided that parsing a whole permutation as "maybe this is a permutation, maybe it is the beginning of a lambda expression" would be too annoying, and would mean we couldn't print out some errors until we got to the end, and knew what type of object we had.

Love the confidence of this author:

> Writing a real parser — one with decent error-handling, a coherent internal structure, and the ability to robustly chew through a sophisticated syntax — is considered a rare, impressive skill. In this chapter, you will attain it.

FWIW Bob Nystrom wrote one of my favorite posts of all times, about Pratt parsers[1]. Really looking forward to read his new work!

[1]: http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-e...


When we get to the second interpreter in the book (the one implemented in C), it uses a Pratt parser for expressions.

Having written a recursive descent parser after reading Bob's Pratt Parser post (and the other three notable Pratt Parser blog posts), I can attest to the fact that you can attain it :-)

It's not that I'm confident in myself, it's that I'm confident in the reader. I'm certain you can do it because once you strip away a lot of the formality and historical baggage, it's actually pretty simple. If I can do it, you can.

I'm fired up either way. Thank you!

When you're teaching someone, projecting confidence is not a bad thing to do.

I wrote some time back a tutorial on this: http://lisperator.net/pltut/ — but the implementation language is JS, and the language we implement is quite trivial (no objects, inheritance etc.; but does have lexical scope, first-class functions and continuations).

I think Java is ugly, but putting this knowledge in an accessible book is great. Typography is also very nice.

I learned a ton doing this! Thanks for making it! I also really enjoyed "A Little Javascript Problem"

I used your tutorial for learning to make my first interpreter and learnt a lot from it. Thank you for sharing!

I really enjoyed that tutorial!

Why is it called recursive descent, isn't that redundant?

I normally think of any recursion as some kind of descent into deeper levels. Maybe I'm being biased due to awareness of stackframes being a common way to implement recursion.

Or maybe it's a known phrase made popular by a paper or researcher. like "embarrassingly parallel". There are other ways to say it but comp sci people know by convention what an embarrassing problem is and that it's usually a good thing.

If everyone started saying only "recursive parser" would there really be any confusion?

> Or maybe it's a known phrase made popular by a paper or researcher.

Right. "Recursive descent" is the canonical term for this style of parser. You conceivably could have a non-recursive descent parser if your language grammar didn't have any cycles. (For example, if you only supported arithmetic expressions and didn't have parentheses for grouping.) In practice, languages are almost always recursive.

My hunch is that the inventor of recursive descent (can't remember his name at the moment) stressed "recursive" because recursive procedure calls were considered a fairly advanced topic at the time. Recall that this was back when Fortran didn't even support recursion.

> If everyone started saying only "recursive parser" would there really be any confusion?

Yes, "recursive ascent" is a thing too, though it was invented much later.

> I normally think of any recursion as some kind of descent into deeper levels. Maybe I'm being biased due to awareness of stackframes being a common way to implement recursion.

You can write bottom up algorithms using recursion too. It still technically goes down first, but does the real work on the way up.

For a simple example, this is like the difference between

    function sum(lst) {
        if (length(lst) > 0) {
            return head(lst) + sum(tail(lst));
        return 0;

    function sum(lst, currentSum) {
        if (length(lst) > 0) {
            return sum(tail(lst), currentSum + head(lst));
        return currentSum;
In the first case you don't actually start doing addition until after you get to the end of the list, while the latter is doing addition on the way down.

If anyone would be helped by an example, I worked on this recursive descent-based compiler a few years ago with a focus on clarity and readability:


The actual parsing is here: https://github.com/rzimmerman/kal/blob/master/source/grammar...

It's somewhere between a toy language and something professional, so it could be a helpful reference if you're doing this for the first time.

Beware, the project is no longer maintained and probably doesn't work with modern node.js runtimes.

Another fun way to write a recursive descent parser is to abstract and parametrize the recursive descent algorithm.

Best done in dynamic languages. I wrote an abstract recursive descent parser in JS which accepts an array of terminal (regexp) and non-terminal definitions (array of terminal/non-terminal names + action callback), and returns the function that will parse the text.

The parser "generator" has 125 lines of code and doesn't really generate anything. It's an extremely lightweight solution to quickly produce purpose made languages in the browser without need for any tooling.

Together with `` template strings to write your custom made language code in, it makes for a lot of fun in JS. :D

It would be really interesting to see that code if you ever want to publish it. Thanks!

I guess why not. :)


See the code. AGPL.

I learned about recursive descent parser by reading Anthony Dos Reis book: compiler construction using java, javacc and yacc[0]. I'm a bit lazy {tired} now I'll just refer to my previous comment. It has been my favorite book. Trust me, you'll learn about compiler technology with this wonderful book.


[0] - https://www.amazon.com/Compiler-Construction-Using-Java-Java...

Ever thought about using an interpreting parser, which takes a grammar and parses a string/file according to it. (To parse the grammar you use the interpreting parser itself with a hard coded version of the grammar of the grammar.) Have a look at: https://github.com/FransFaase/IParse

I've found parser expression grammars (PEG) to be a good solution to avoiding the ambiguity stated at the very beginning. This is done by making all choices (|) into ordered choices.

I've recently used PEGs to write a Python parser (parsing all of Python, except for possible bugs) in ~500 lines of Python [2]. Its entirely interpreted. No parser is generated, only trees.

I'll also add that Floyd's operator precedence grammar [3] includes an algorithm which can deduce precedences from a grammar.

[1] https://en.wikipedia.org/wiki/Parsing_expression_grammar

[2] https://github.com/asrp/pymetaterp

[3] https://en.wikipedia.org/wiki/Operator-precedence_grammar

My first project ever as a professional programmer was writing a recursive descent parser - in 1981, in Fortran, of Jovial language code. Of course there were no other choices. Thankfully today you can avoid writing your own, though sometimes it still pays to write it by hand.

how do recursive descent parsers compare to parser combinators? so far i've tended to use a combinator library when one of my small projects needs a parser, but now i'm wondering if i should just get good at doing recursive descent parsers instead.

They are essentially same.

I didn't see mention of first/follow sets. Might be handy to have that in there?

One of the hardest things about writing a book is deciding what to not include. This chapter is already over 7k words, which is longer than I'd like. First/follow sets are useful but I haven't found them that useful to think about in my own parsers, so I left them out.

Fair point.

A good grasp on first and follows sets is essential if you are building a bottom up parser.

Not so much for recursive descent.

Very cool Bob! I'm a big fan of your writing and Jasic was a nice and small project that really helped me understand compilers better. I look forward to my lunch hour when I can go over this chapter.

I'm actually going through a similar process at the moment. I would like to write a few articles about implementation considerations in compilers that text books often omit. For example, in the scanner phase, should identifier tokens contain a copy of the identifier name, should they have a length + pointer into the original stream, should they contain an index into a symbol table, etc.

Haha, I had assignement to built a recursive descent parser for a simple Ada-like language just 3 days ago. It was fun task, but if this guide was posted earlier, I could have used it.

I'm writing as fast as I can. :)

I enjoyed experimenting with this project, back when it was active...


-- A recursive descent parser for Python

-- Grammars are written directly as Python code, using a syntax similar to BNF

-- New matchers can be simple functions

I'm writing a Javascript parser right now so this is actually super useful. Thank you!

I worked on a recursive descent JSON parser. That was a valuable learning experience.

Was it for fun or for business?

Mostly fun. I was initially pissed off out how a JSON encoder/decoder exploded when parsing a large float. After a while, I learned what I wanted to learn about why it works they way it does and moved on to other things.

Interesting. What encoder exploded at the large float? I ask because I would be surprised if an encoder/decoder from one of the major languages couldn't handle that.

It's Poison, a hex package for Elixir. It blows up if you try to parse a number somewhere between "1e308" and "1e309". That is a limitation of the Elixir standard lib. Elixir doesn't try to handle that in a special way, it just raises and exception. I'd love to see something like a BigFloat, or an plug-in hook giving you the option to not explode if you didn't want to.

In Ruby, when you parse "1e309" with the built in JSON gem, you get "Infinity". That drives me bonkers, too :)

If you're serious about writing a parser, why not go for a bottom up parser?

It eliminates a lot of the headaches with top down/recursive descent parsers like left recursion and backtracking.

Been there. Done that. Yuck! I so much prefer bison or yacc.

It's a shame parsers are such a PITA to write. So many problems could be trivially solved if writing a grammar and generating a parser for it were in any way a pleasant process.

I've found parsers simple to make with a parser combinator library, like Bennu[0]. It matches how I thought parsing should work, and as a bonus the parsers made with it automatically support incremental parsing. Just for fun, I implemented a full JSON parser[1] with it.

I assume a hand-written a recursive descent parser would run faster, but it definitely doesn't strike me as flexible or quick to write or modify, and anything extra like incremental support or good error messages is just a lot of one-off manual effort.

[0] http://bennu-js.com/

[1] https://github.com/AgentME/bennu-json

This is the pain of text. The text format has won so handily that we rarely consider the possibility that files could be anything other than a sequence of characters. But if it were a tree instead, then "parsing" would be a simple, linear-time process with good error messages.

Editing would be annoying as hell though.

You would use a different editor: a tree or structured editor. It would be different, but I doubt it would be worse.

So, where are all the professional programmers doing blocks programming, or something like that? All visual programming is very niche. Last I got close to some was 18 years ago when working with LabView.

It's not just different. It's suboptimal. It's just not compelling. You'd need visual everything. Like visual diff, visual git, make it easy to copy visual crap between various applications. Copy visual code to a web page? Copy it to chat to send to a friend? Instead of just writing, you'd need to remember UI shortcuts/icon to insert visual crap into your codebase...

It's bleh. I clearly hate the idea. :D

Yeah, that's why no-one uses Maya to edit 3d scenes. It lacks the finesse of text, like povray ;-)

The difference there is that the thing you are authoring is itself visual. You can't be a skilled Maya user if you're not a good visual thinker/designer because that's the core competency of 3D design itself.

Code is not inherently visual. If you made the editor visual, then it means people who are good at abstract reasoning and solving problems but don't have good aesthetic and spatial sense wouldn't be able to "write good code".

> The difference there is that the thing you are authoring is itself visual.

Perhaps. Or it might be structural, an object to be 3d printed. It would have a physical appearance - but perhaps not be "inherently visual", any more than the algorithms for adjusting a quadcopter for autonomous hovering is "inherently aerial"?

> people who are good at abstract reasoning and solving problems but don't have good aesthetic and spatial sense wouldn't be able to "write good code".

Isn't that the case with textual representation of code too? Without a sense for aesthetics, one cannot create "beautiful code"?

A chessboard/game can be represented in many ways in the form of code/syntax - even mathematical equations - and yet I would argue it is less accessible to learn and play chess going from just the standard textual representation, than from a board?

How would you teach chess moves: by lifting a castle off the board, showing that it can move up-and-down and sideways, like so - or by saying that the piece-instances initialized to position a1,a8,h1 and h8 can move only be changing either of its two coordinates up and down?

> Code is not inherently visual.

Nor is it text. It is a tree (hence the term "abstract syntax tree"). We just save it in an awful serialized format.

Recursive descent is pleasant. It's probably the least brutish and most elegant of methods.[0]

[0] Of course, my opinion.

I've used ANTLR, which is relatively easy to comprehend by being primarily a LL(k) recursive descent parser.

Sadly, it also suffers from the usual problems of LL parser generators, like barfing on left-recursive grammars.

Well, there's always the "Ometa" approach/extension to PEGs - the author works(?) at now y-combinator sponsored research into language design (not sure if he's active on hn though):


Note, you need to select all before trying "do it"/"print it". See eg: the js compiler (top right, second drop down).

I dare you to not find it so easy and fun that you don't try to implement a small DSL ;)

There's the Ohm editor now.


Ohm is the successor of Ometa. The editor shows which portion of the string are matched by each rule.

You can probably use parser generators for this. In this way it's possible to have a function accepting a grammar and an input and returning the parse tree, so all complications are hidden inside.

I've found Parsing Expression Grammars (PEGs) great for that sort of task. Way nicer than kludging something with regexes anyway, and way more powerful.

There's a Lua library called LPeg [1], which introduces a DSL of a kind to Lua that lets you write PEGs with minimal effort. The vis editor [2] uses LPeg files as lexers for syntax highlighting, and they look very clean IMO; for example, this is the C lexer: https://github.com/martanne/vis/blob/master/lua/lexers/ansi_...

[1]: http://www.inf.puc-rio.br/~roberto/lpeg/

[2]: https://github.com/martanne/vis

There's a lot of information on compilers, parsers, etc on the Internet. I'm trying to gather some of the information h=on github:


There are tons of parser generators, but I find the code they generate unreadable and terrible for error reporting.

Try learning Prolog.

Or, if you care about efficiency, Haskell.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact