
Show HN: How to write a recursive descent parser - munificent
http://www.craftinginterpreters.com/parsing-expressions.html
======
Drup
I don't understand why so many people glorify hand-written parsers. Maybe
because they've never used good parser generators (not unlike type systems and
Java)?

Personal opinion: writing parsers is the least interesting part of an
interpreter/compiler, and your grammar _should_ be boring if you want your
syntax to be easy to understand by humans.

Boring, in this case, means LL(*), LR(1), or another well-known class. Just
pick a damn parser generator and get the job done quickly, so you can spend
time on the really difficult tasks. The grammar in this article is LR(1) and
is trivial to implement in yacc-like generators, location tracking and error
messages included.

Bonus point: since your grammar stays in a well known class, it's much easier
for other people to re-implement it. You can't introduce bullshit ambiguous
extensions to your grammar (C typedefs, anyone ?). This article gives a good
explanation of this: [http://blog.reverberate.org/2013/09/ll-and-lr-in-
context-why...](http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-
parsing-tools.html)

~~~
tyoverby
Hello, I work on the C# compiler and we use a handwritten recursive-descent
parser. Here are a few of the more important reasons for doing so:

* Incremental re-parsing. If a user in the IDE changes the document, we need to reparse the file, but we want to do this while using as little memory as possible. To this end, we re-use AST nodes from previous parses.

* Better error reporting. Parser generators are known for producing terrible errors. While you can hack around this, with recursive descent you can get information from further "up" the tree to make your errors more relevant to the context in which they occurred.

* Resilient parsing. This is the big one! If you give our parser a string that is illegal according to the grammar, our parser will still give you a syntax tree! (We'll also spit errors out). But getting a syntax tree regardless of the actual validity of the program being passed in means that the IDE can give autocomplete and report type-checking error messages. As an example, the code "var x = velocity." is invalid C#. However, in order to give autocomplete on "velocity", that code needs to be parsed into an AST, and then typechecked, and then we can extract the members on the type in order to provide a good user experience.
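
The resilient-parsing idea can be sketched in miniature (a toy Python illustration of the general technique, not Roslyn's actual code): when an expected token is missing, record a diagnostic and splice in a placeholder node, so the caller always gets a complete tree back.

```python
# Parse "ident ('.' ident)*" but never fail: a missing piece becomes a
# placeholder node in the tree plus an entry in the error list.
def parse_member_expr(tokens):
    errors = []
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect_ident():
        nonlocal pos
        tok = peek()
        if tok is not None and tok != ".":
            pos += 1
            return tok
        errors.append("expected identifier at token %d" % pos)
        return "<missing>"   # placeholder keeps the tree well-formed

    expr = expect_ident()
    while peek() == ".":
        pos += 1             # consume '.'
        expr = ("member", expr, expect_ident())
    return expr, errors

# "velocity." parses to a member-access node with a missing right-hand
# side -- exactly the shape an IDE needs in order to offer completions.
tree, errors = parse_member_expr(["velocity", "."])
```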

My personal opinion is that everyone should just use s-expressions. Get rid of
this whole debate :P

~~~
Drup
While I agree with you that those 3 points are extremely important, it turns
out there is at least one parser generator that can do all of it:
[http://gallium.inria.fr/~fpottier/menhir/](http://gallium.inria.fr/~fpottier/menhir/)

It supports both incremental parsing and an API to inspect and recover
incomplete ASTs (which powers Merlin, the IDE-like thing for OCaml). It
provides stellar debugging features for ambiguous grammars and a way to have
good error messages (which is used in compcert's C parser and facebook's
reason).

So, it's not impossible. Most parser generators are not that good, though.

~~~
tyoverby
That is super impressive! I can't find the part on incomplete AST or AST reuse
in their reference docs though.

~~~
dsp1234
Details about the "incremental" mode are listed in the documentation PDF[0] at
section 9.2

Here are the first couple of paragraphs:

 _" In this API, control is inverted. The parser does not have access to the
lexer. Instead, when the parser needs the next token, it stops and returns its
current state to the user. The user is then responsible for obtaining this
token (typically by invoking the lexer) and resuming the parser from that
state. The directory demos/calc-incremental contains a demo that illustrates
the use of the incremental API.

This API is “incremental” in the sense that the user has access to a sequence
of the intermediate states of the parser. Assuming that semantic values are
immutable, a parser state is a persistent data structure: it can be stored and
used multiple times, if desired. This enables applications such as “live
parsing”, where a buffer is continuously parsed while it is being edited. The
parser can be re-started in the middle of the buffer whenever the user edits a
character. Because two successive parser states share most of their data in
memory, a list of n successive parser states occupies only O(n) space in
memory."_

There does not appear to be a specific mention of having the partial AST
available.
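
For what it's worth, the inversion of control the manual describes can be illustrated generically (my own Python sketch of the concept, not Menhir's OCaml API): each parser state is an immutable value, so old states stay valid and a parse can be resumed from any earlier point.

```python
# Each state is a persistent value: offering a token returns a *new*
# state and leaves the old one usable, so a buffer can be re-parsed
# from the middle after an edit. Successive states share their data.
class ParserState:
    def __init__(self, stack=()):
        self.stack = stack   # immutable tuple, shared between states

    def offer(self, token):
        return ParserState(self.stack + (token,))

s0 = ParserState()
s1 = s0.offer("1")
s2a = s1.offer("+")   # two different continuations of the same state
s2b = s1.offer("*")
```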

[0] - linked from their front page and available at
[http://gallium.inria.fr/~fpottier/menhir/manual.pdf](http://gallium.inria.fr/~fpottier/menhir/manual.pdf)

------
panic
Here's a recursive descent trick worth mentioning: instead of decomposing
expressions into levels called things like "term" and "factor" by the
precedence of the operators involved, you can do it all using a while loop:

    
    
        function parseExpressionAtPrecedence(currentPrecedence) {
          expr = parseExpressionAtom()
          while op = parseOperator() && op.precedence >= currentPrecedence {
            if op.rightAssociative {
              b = parseExpressionAtPrecedence(op.precedence)
            } else {
              b = parseExpressionAtPrecedence(op.precedence + 1)
            }
            expr = OperatorExpression(op, expr, b)
          }
          return expr
        }
    

The parseExpressionAtom function handles literals, expressions in parentheses,
and so on. The idea is to keep pushing more operators on to the end of an
expression until a lower-precedence operator appears and you can't any more.
This technique (called precedence climbing) makes parsing these sorts of
arithmetic expressions a lot less painful.
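
A runnable version of the same loop (in Python, with a flat token list, single-token atoms, and an illustrative precedence table; higher numbers bind tighter, and the loop continues while an operator's precedence is at least the current minimum):

```python
PREC = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOC = {"^"}

def parse_expr(tokens, pos=0, min_prec=0):
    expr = tokens[pos]   # parseExpressionAtom: just a single-token atom here
    pos += 1
    while (pos < len(tokens) and tokens[pos] in PREC
           and PREC[tokens[pos]] >= min_prec):
        op = tokens[pos]
        pos += 1
        # Right-associative ops allow the same precedence again on the right.
        next_min = PREC[op] if op in RIGHT_ASSOC else PREC[op] + 1
        rhs, pos = parse_expr(tokens, pos, next_min)
        expr = (op, expr, rhs)
    return expr, pos

tree, _ = parse_expr(["1", "+", "2", "*", "3"])
# tree == ("+", "1", ("*", "2", "3"))
```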

~~~
munificent
Yes! Using explicit functions for each precedence level is a little tedious.
It's also really simple and fairly common, though, so I thought it was a good
gentle way to ease people into parsing.

In part III of the book, when we write a second interpreter in C, we use a
Pratt parser for expressions and it's a lot less boilerplate-heavy.

~~~
asrp
By going to the other extreme, you can use _only_ precedence climbing (with
just a grammar) by using Floyd's algorithm, which treats all tokens as
operators. It uses two precedence functions instead of one: a left precedence
and a right precedence.

[https://en.wikipedia.org/wiki/Operator-
precedence_grammar](https://en.wikipedia.org/wiki/Operator-precedence_grammar)

------
w23j
Oh my god, a dream come true!

I had always hoped Bob Nystrom would write a book about
interpreters/compilers.

Back when I tried to learn how to write a recursive descent parser, the
examples I found either ignored correct expression parsing or wrote an
additional parse method for each precedence level. Writing a parser by hand
seemed just too much work. Along comes this great article about Pratt parsers
[http://journal.stuffwithstuff.com/2011/03/19/pratt-
parsers-e...](http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-
expression-parsing-made-easy/) and all can be done with two simple functions,
a loop and a map. :) Saved my enthusiasm right there.

Another great example is this article about garbage collection:
[http://journal.stuffwithstuff.com/2013/12/08/babys-first-
gar...](http://journal.stuffwithstuff.com/2013/12/08/babys-first-garbage-
collector/) Instead of making things more complicated than they are, Bob
Nystrom simplifies daunting topics and makes them accessible.

Thanks so much for your work! Really looking forward to this one. Will there
be code generation too? :)

~~~
munificent
You're welcome!

Some of the book will revisit topics I've already covered in my blog (but I'll
write all new prose to seamlessly integrate it into the rest of the material
and the language we're building). In Part III where we build an interpreter in
C, it will use a Pratt parser and we'll implement a mark-sweep GC from
scratch.

The book doesn't touch on native code generation because (1) that's a really
big, messy topic and (2) I don't have much experience with it.

The C interpreter does compile to _bytecode_, though, so we cover a lot of
the basic concepts around lowering the code into a denser, more efficient
representation, representing the stack and call frames, etc.

------
CJefferson
One piece of advice I have for people writing a new language.

Consider designing your language so that it can be parsed with one character
of lookahead, using something like the shunting-yard algorithm to handle
precedence.

I work on a system called GAP ( www.gap-system.org ), and when parsing we only
ever need to look one character ahead to know what we are parsing. This makes
the error messages easy, and amazingly good -- we can say "At this character
we expected A, B, or C, but we found foo". It also makes the language easy to
extend, as long as we fight hard against anyone who wants to introduce
ambiguity.
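
The shunting-yard algorithm mentioned above can be sketched like this (a minimal Python version for left-associative binary operators, without parentheses or unary operators): it converts infix tokens to postfix using one operator stack, never looking past the current token.

```python
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def shunting_yard(tokens):
    output, ops = [], []
    for tok in tokens:
        if tok in PREC:
            # Pop operators that bind at least as tightly (left-assoc).
            while ops and PREC[ops[-1]] >= PREC[tok]:
                output.append(ops.pop())
            ops.append(tok)
        else:
            output.append(tok)   # operands go straight to the output
    while ops:
        output.append(ops.pop())
    return output

shunting_yard(["1", "+", "2", "*", "3"])   # ["1", "2", "3", "*", "+"]
```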

If your language is ambiguous, and you have to try parsing statements multiple
times to find out what they are, then your error handling is always going to
be incredibly hard, as you won't know where the problem arose.

Of course, if you are parsing a language given to you, you just have to do the
best you can.

~~~
tyingq
Does anyone have any examples of popular languages that are very easy to parse
or ones that have a lot of ambiguity?

I assume the lispy type languages fall into the easily parsed bucket, perhaps
tcl as well? And Perl may be a good example of a language that has notable
ambiguity?

~~~
munificent
Smalltalk is famously easy to parse. They always proudly said you can fit the
entire grammar on a single page. It's a really compact, neat syntax.

Pascal is also pretty clean and regular. Unlike Smalltalk, it was designed
more to make the compiler's job easier, so not only is the grammar pretty
regular, but it's organized such that you can parse and compile to native code
in a single pass.

C is a pain to parse, it's context-sensitive in some areas. C++ inherits all
of that and adds a lot more on top.

My understanding is that Perl can't be parsed without being able to execute
Perl code since there are some features that let you extend the grammar in
Perl itself?

~~~
rurban
Perl is mostly hard to parse because it uses a dynamic lexer. The parser
itself uses a normal static yacc grammar, but the lexer drives the parse
dynamically -- e.g. based on whether a class or method is already defined, and
what types to expect as arguments. It's crazy, but beautiful. Not so much
insane as a very clever hack.

------
tbrock
Love the confidence of this author:

> Writing a real parser — one with decent error-handling, a coherent internal
> structure, and the ability to robustly chew through a sophisticated syntax —
> is considered a rare, impressive skill. In this chapter, you will attain it.

~~~
mstade
FWIW Bob Nystrom wrote one of my favorite posts of all times, about Pratt
parsers[1]. Really looking forward to read his new work!

[1]: [http://journal.stuffwithstuff.com/2011/03/19/pratt-
parsers-e...](http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-
expression-parsing-made-easy/)

~~~
munificent
Thanks!

When we get to the second interpreter in the book (the one implemented in C),
it uses a Pratt parser for expressions.

------
mishoo
Some time back I wrote a tutorial on this:
[http://lisperator.net/pltut/](http://lisperator.net/pltut/) — but the
implementation language is JS, and the language we implement is quite trivial
(no objects, inheritance, etc.; but it does have lexical scope, first-class
functions, and continuations).

I think Java is ugly, but putting this knowledge in an accessible book is
great. Typography is also very nice.

~~~
wry_discontent
I learned a ton doing this! Thanks for making it! I also really enjoyed "A
Little Javascript Problem"

------
WhitneyLand
Why is it called recursive descent, isn't that redundant?

I normally think of any recursion as some kind of descent into deeper levels.
Maybe I'm being biased due to awareness of stackframes being a common way to
implement recursion.

Or maybe it's a known phrase made popular by a paper or researcher, like
"embarrassingly parallel". There are other ways to say it, but comp sci people
know by convention what an embarrassingly parallel problem is and that it's
usually a good thing.

If everyone started saying only "recursive parser" would there really be any
confusion?

~~~
megous
> I normally think of any recursion as some kind of descent into deeper
> levels. Maybe I'm being biased due to awareness of stackframes being a
> common way to implement recursion.

You can write bottom up algorithms using recursion too. It still technically
goes down first, but does the real work on the way up.

~~~
bjterry
For a simple example, this is like the difference between

    
    
        function sum(lst) {
            if (lst.length > 0) {
                return lst[0] + sum(lst.slice(1));
            }
            return 0;
        }
    

vs.

    
    
        function sum(lst, currentSum) {
            if (lst.length > 0) {
                return sum(lst.slice(1), currentSum + lst[0]);
            }
            return currentSum;
        }
    

In the first case you don't actually start doing addition until after you get
to the end of the list, while the latter does the addition on the way down
(the second version is tail-recursive, so no work is left for the way back
up).
------
rzimmerman
If anyone would be helped by an example, I worked on this recursive descent-
based compiler a few years ago with a focus on clarity and readability:

[https://github.com/rzimmerman/kal](https://github.com/rzimmerman/kal)

The actual parsing is here:
[https://github.com/rzimmerman/kal/blob/master/source/grammar...](https://github.com/rzimmerman/kal/blob/master/source/grammar.litkal)

It's somewhere between a toy language and something professional, so it could
be a helpful reference if you're doing this for the first time.

Beware, the project is no longer maintained and probably doesn't work with
modern node.js runtimes.

------
megous
Another fun way to write a recursive descent parser is to abstract and
parametrize the recursive descent algorithm.

Best done in dynamic languages. I wrote an abstract recursive descent parser
in JS which accepts an array of terminal (regexp) and non-terminal definitions
(array of terminal/non-terminal names + action callback), and returns the
function that will parse the text.

The parser "generator" has 125 lines of code and doesn't really generate
anything. It's an extremely lightweight solution to quickly produce purpose
made languages in the browser without need for any tooling.

Together with backtick template strings to write your custom-made language
code in, it makes for a lot of fun in JS. :D
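
The shape of such a parametrized parser might look something like this (my own Python sketch of the idea, not megous's actual code): terminals are regexes, non-terminals are lists of alternative sequences, and the "generator" just closes over the grammar tables.

```python
import re

def make_parser(terminals, rules, start):
    # Returns a parse function; nothing is generated -- the grammar tables
    # are interpreted by one generic recursive descent routine.
    def parse(name, text, pos):
        if name in terminals:
            m = re.match(terminals[name], text[pos:])
            return (m.group(0), pos + m.end()) if m else None
        for alternative in rules[name]:
            children, p = [], pos
            for part in alternative:
                r = parse(part, text, p)
                if r is None:
                    break
                children.append(r[0])
                p = r[1]
            else:
                return (name, children), p   # all parts matched
        return None
    return lambda text: parse(start, text, 0)

# A tiny grammar: expr -> num "+" expr | num
parser = make_parser(
    terminals={"num": r"\d+", "plus": r"\+"},
    rules={"expr": [["num", "plus", "expr"], ["num"]]},
    start="expr",
)
```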

~~~
Stratoscope
It would be really interesting to see that code if you ever want to publish
it. Thanks!

~~~
megous
I guess why not. :)

[https://megous.com/dl/parser/tests/index.html](https://megous.com/dl/parser/tests/index.html)

See the code. AGPL.

------
e19293001
I learned about recursive descent parser by reading Anthony Dos Reis book:
compiler construction using java, javacc and yacc[0]. I'm a bit lazy {tired}
now I'll just refer to my previous comment. It has been my favorite book.
Trust me, you'll learn about compiler technology with this wonderful book.

[https://news.ycombinator.com/item?id=13664714](https://news.ycombinator.com/item?id=13664714)

[0] - [https://www.amazon.com/Compiler-Construction-Using-Java-
Java...](https://www.amazon.com/Compiler-Construction-Using-Java-
JavaCC/dp/0470949597)

------
fjfaase
Ever thought about using an interpreting parser, which takes a grammar and
parses a string/file according to it? (To parse the grammar, you use the
interpreting parser itself with a hard-coded version of the grammar of the
grammar.) Have a look at:
[https://github.com/FransFaase/IParse](https://github.com/FransFaase/IParse)

------
asrp
I've found parsing expression grammars (PEGs) [1] to be a good solution to
avoiding the ambiguity stated at the very beginning. This is done by making
all choices (|) into _ordered choices_.

I've recently used PEGs to write a Python parser (parsing all of Python,
except for possible bugs) in ~500 lines of Python [2]. It's entirely
interpreted. No parser is generated, only trees.

I'll also add that Floyd's operator precedence grammar [3] includes an
algorithm which can deduce precedences from a grammar.

[1]
[https://en.wikipedia.org/wiki/Parsing_expression_grammar](https://en.wikipedia.org/wiki/Parsing_expression_grammar)

[2] [https://github.com/asrp/pymetaterp](https://github.com/asrp/pymetaterp)

[3] [https://en.wikipedia.org/wiki/Operator-
precedence_grammar](https://en.wikipedia.org/wiki/Operator-precedence_grammar)

------
coldcode
My first project ever as a professional programmer was writing a recursive
descent parser - in 1981, in Fortran, of Jovial language code. Of course there
were no other choices. Thankfully today you can avoid writing your own, though
sometimes it still pays to write it by hand.

------
zem
how do recursive descent parsers compare to parser combinators? so far i've
tended to use a combinator library when one of my small projects needs a
parser, but now i'm wondering if i should just get good at doing recursive
descent parsers instead.

~~~
fooker
They are essentially the same.
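
To expand a little: a combinator library packages the hand-written functions of a recursive descent parser as composable values. A minimal sketch (my own illustration, not any particular library): each parser maps (text, pos) to (value, new_pos) or None, and sequencing and choice are just higher-order functions.

```python
def lit(s):
    # Match a literal string at the current position.
    def p(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return p

def seq(*parsers):
    # Run parsers one after another; fail if any fails.
    def p(text, pos):
        values = []
        for q in parsers:
            r = q(text, pos)
            if r is None:
                return None
            v, pos = r
            values.append(v)
        return values, pos
    return p

def alt(*parsers):
    # Ordered choice: the first parser that succeeds wins.
    def p(text, pos):
        for q in parsers:
            r = q(text, pos)
            if r is not None:
                return r
        return None
    return p

ab_or_a = alt(seq(lit("a"), lit("b")), lit("a"))
```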

------
JustSomeNobody
I didn't see mention of first/follow sets. Might be handy to have that in
there?

~~~
munificent
One of the hardest things about writing a book is deciding what to _not_
include. This chapter is already over 7k words, which is longer than I'd like.
First/follow sets are useful but I haven't found them _that_ useful to think
about in my own parsers, so I left them out.

~~~
JustSomeNobody
Fair point.

------
gnuvince
Very cool Bob! I'm a big fan of your writing and Jasic was a nice and small
project that really helped me understand compilers better. I look forward to
my lunch hour when I can go over this chapter.

I'm actually going through a similar process at the moment. I would like to
write a few articles about implementation considerations in compilers that
text books often omit. For example, in the scanner phase, should identifier
tokens contain a copy of the identifier name, should they have a length +
pointer into the original stream, should they contain an index into a symbol
table, etc.

------
wideem
Haha, I had an assignment to build a recursive descent parser for a simple
Ada-like language just 3 days ago. It was a fun task, but if this guide had
been posted earlier, I could have used it.

~~~
munificent
I'm writing as fast as I can. :)

------
rch
I enjoyed experimenting with this project, back when it was active...

[http://www.acooke.org/lepl/](http://www.acooke.org/lepl/)

\-- A recursive descent parser for Python

\-- Grammars are written directly as Python code, using a syntax similar to
BNF

\-- New matchers can be simple functions

------
lhorie
I'm writing a Javascript parser right now so this is actually super useful.
Thank you!

------
amorphid
I worked on a recursive descent JSON parser. That was a valuable learning
experience.

~~~
jdormit
Was it for fun or for business?

~~~
amorphid
Mostly fun. I was initially pissed off about how a JSON encoder/decoder
exploded when parsing a large float. After a while, I learned what I wanted to
learn about why it works the way it does and moved on to other things.

~~~
jdormit
Interesting. What encoder exploded at the large float? I ask because I would
be surprised if an encoder/decoder from one of the major languages couldn't
handle that.

~~~
amorphid
It's Poison, a hex package for Elixir. It blows up if you try to parse a
number somewhere between "1e308" and "1e309". That is a limitation of the
Elixir standard lib. Elixir doesn't try to handle that in a special way, it
just raises an exception. I'd love to see something like a BigFloat, or a
plug-in hook giving you the option to not explode if you didn't want to.

In Ruby, when you parse "1e309" with the built in JSON gem, you get
"Infinity". That drives me bonkers, too :)
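
The boundary they describe lines up with IEEE 754 doubles, whose largest finite value is about 1.8e308. For comparison, Python's float parser rounds the overflow to infinity rather than raising:

```python
# "1e308" fits in a 64-bit float; "1e309" overflows to infinity.
print(float("1e308"))   # 1e+308
print(float("1e309"))   # inf
```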

------
prions
If you're serious about writing a parser, why not go for a bottom up parser?

It eliminates a lot of the headaches of top-down/recursive descent parsers,
like left recursion and backtracking.

------
hasbot
Been there. Done that. Yuck! I so much prefer bison or yacc.

------
grabcocque
It's a shame parsers are such a PITA to write. So many problems could be
trivially solved if writing a grammar and generating a parser for it were in
any way a pleasant process.

~~~
justinpombrio
This is the pain of text. The text format has won so handily that we rarely
consider the possibility that files could be anything other than a sequence of
characters. But if it were a _tree_ instead, then "parsing" would be a simple,
linear-time process with good error messages.

~~~
megous
Editing would be annoying as hell though.

~~~
justinpombrio
You would use a different editor: a tree or structured editor. It would be
_different_, but I doubt it would be _worse_.

~~~
megous
So, where are all the professional programmers doing blocks programming, or
something like that? All visual programming is very niche. Last I got close to
some was 18 years ago when working with LabVIEW.

It's not just different. It's suboptimal. It's just not compelling. You'd need
visual everything. Like visual diff, visual git, make it easy to copy visual
crap between various applications. Copy visual code to a web page? Copy it to
chat to send to a friend? Instead of just writing, you'd need to remember UI
shortcuts/icon to insert visual crap into your codebase...

It's bleh. I clearly hate the idea. :D

