
Which Parsing Approach? - matt_d
https://tratt.net/laurie/blog/entries/which_parsing_approach.html
======
WideCharr
I like the way the various approaches are presented here, though the author
briefly mentions a factor that I feel really short-circuits the entire
discussion: error messages. Most people readily agree that recursive descent
parsers can provide the best possible error messages, which, if you're writing
a compiler in 2020, seems like it trumps all other considerations (especially
considering you can _also_ get the best possible performance via recursive
descent).

The static typing of LR is nice, but trading off user experience for developer
experience seems like a bad deal.

~~~
catpolice
Yeah, I think there are reasons why a lot of big compilers use some recursive
descent variant (a few, like GCC, even used YACC-generated parsers for a long
time before switching back to hand-written recursive descent parsers), and
error message generation is a big one.

IMO there's a kind of funny progression in which parsing approach turns out to
be the most appropriate depending on the scope of your project that circles
back on itself:

\- For pretty simple languages a hand-written recursive descent parser is
obviously easiest

\- Once your grammar is complicated enough that you start hitting precedence
and ambiguity issues, or get sick of rewriting a big chunk of your parser as
the grammar changes, you look into generating your parser from a BNF-like
specification and end up with some variant of LL or LR

\- At some point your language's grammar has mostly stabilized and you're
struggling with providing good error messages or parsing performance, or you've
had to add one too many hacks to get around the limitations of your parser
generator, and recursive descent starts looking good again

For my money, I tend to think that Pratt parsing/precedence climbing can
extend recursive descent in a way that makes a lot of the common complaints
about the complexity of dealing with operator precedence and associativity
seem overstated. The trick is just that as you're building an AST, some
symbols will cause you to reparent nodes that you thought you'd already
placed, according to various rules. See:
[https://www.oilshell.org/blog/2017/03/31.html](https://www.oilshell.org/blog/2017/03/31.html)
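
Roughly, the core loop looks like this (a minimal TypeScript sketch with
made-up token/AST shapes and binding powers, not lifted from any particular
parser):

    // Minimal precedence-climbing sketch. Each binary operator has a binding
    // power; while the next operator binds at least as tightly as the current
    // minimum, the expression parsed so far becomes the *left child* of a new
    // node -- that's the reparenting step.
    
    type Expr =
      | { kind: "num"; value: number }
      | { kind: "binop"; op: string; left: Expr; right: Expr };
    
    const BINDING_POWER: Record<string, number> = { "+": 1, "-": 1, "*": 2, "/": 2 };
    
    function parseExpr(tokens: string[], pos: { i: number }, minBP = 0): Expr {
      // Primary: a number literal (a real parser also handles parens, unary ops, ...).
      let left: Expr = { kind: "num", value: Number(tokens[pos.i++]) };
    
      while (pos.i < tokens.length) {
        const op = tokens[pos.i];
        const bp = BINDING_POWER[op];
        if (bp === undefined || bp < minBP) break;
        pos.i++; // consume the operator
        // Parse the right-hand side with a higher minimum binding power, so
        // operators of equal precedence end up left-associative.
        const right = parseExpr(tokens, pos, bp + 1);
        // Reparent: the node we'd already built becomes the left child.
        left = { kind: "binop", op, left, right };
      }
      return left;
    }
    
    // parseExpr("1 + 2 * 3".split(" "), { i: 0 }) groups as 1 + (2 * 3),
    // because "*" has a higher binding power than "+".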

I wrote a compiler for a vaguely C-like language by hand in JavaScript a while
back; it's intended to show how simple a hand-written parser (and code
generator) can end up:
[https://github.com/j-s-n/WebBS](https://github.com/j-s-n/WebBS)

It's not that hard to statically track type information along the way - the
above example requires a second pass over the AST to nail things into place and
make sure all our operators are operating on the right type of thing, but a
lot of that information is captured during the first parsing pass or even
during lexing.

~~~
derriz
Yes - Pratt embedded in a recursive descent parser is hard to beat. I say this
after writing parsers on and off for years. Like the author, I also went
through a cycle of tools but in my case it was recursive descent, LR
(Flex/Bison), parser combinators (Haskell), LL (Antlr) before returning to
recursive descent.

In the end, recursive descent (+Pratt) beats them all, in my opinion:

\- you can easily provide properly helpful error messages

\- best general performance

\- the parser can be debugged directly using normal debugging tools

\- flexibility - all the tools and features of the host programming language
are available

\- zero dependencies, no build tool integration required.

The only issues the author seems to have with recursive descent are excessive
boilerplate and the complexity of handling precedence and associativity, but:

\- there should be no more boilerplate than there is for any other
programming task - you write functions, methods, classes, etc. to reduce it,
just as in normal development.

\- using Pratt provides the structure to handle all the operator rule
complexity.

------
nybblesio
This is a good essay. However, the author didn't mention Pratt Parsing [1],
which cleanly addresses [2] the shortcomings of recursive descent parsing.
Parser generators look promising until you actually start building anything of
moderate complexity. Then it's a constant cycle of kludges and hair pulling.

[1]
[https://en.wikipedia.org/wiki/Pratt_parser](https://en.wikipedia.org/wiki/Pratt_parser)

[2] Mostly. The Pratt model does introduce potential context around operators
(depending on the grammar). It's easy enough to extend the basic Pratt model
to support this, but it isn't spelled out in many examples or much of the
literature.

~~~
throwaway_pdp09
> Parser generators look promising until you actually start building anything
> of moderate complexity

I've done both, by hand and with parser generators (flex/bison and antlr), and
getting the machine to do the boring work is a total fuckload[0] faster and
more productive.

Edit: and unless you know what you're doing, you _will_ screw up when hand-
writing a parser. I know of a commercial reporting tool that couldn't reliably
parse valid input in its embedded language.

[0] 3.14159 shedloads in metric

~~~
derriz
What do you think is special about recursive descent parsing that makes you
more likely to screw up unless you know what you're doing?

My experience has been the exact opposite - particularly as the language gets
complicated and/or weird, in which case the generated parser becomes horribly
brittle. Adding an innocent-looking new rule to your working Antlr or Yacc
grammar feels like fiddling with a ticking bomb - it might work straight away
or it could explode in your face, leaving you with hours or days of whack-a-
moling grammar ambiguities.

~~~
throwaway_pdp09
I didn't say anything about recursive descent parsing wrt screwing up, I just
said "hand-writing a parser". Nothing about the flavour.

I guess our experiences differ, but I don't know why. I have written a
seriously major SQL parser in Antlr and had no problem. And it was huge and
complex - well, that's T-SQL for you.

It may be that you have been parsing non-LR(1) grammars in Bison, which could
prove a headache, but... well, IDK. Maybe I've been lucky.

~~~
derriz
That's an interesting coincidence - the biggest parser I wrote was also an SQL
parser using Antlr. In fact, SQL was only part of it - it was a programming
language that supported multiple flavours of embedded SQL (DB2 and Oracle). It
worked but I always dreaded when it would have to be changed to support a new
feature in a new release of Oracle (or DB/2).

I don't think it's an LR vs LL thing either. I feel that there is no sense of
locality with parser generators; it's a bit like quantum mechanics - every
rule has a "connection" to every other one. Change one seemingly small part of
the Antlr grammar and some "far away" seemingly unrelated parts of the grammar
can suddenly blow up as being ambiguous.

~~~
throwaway_pdp09
Coincidence indeed - I'm currently modifying my giant SQL grammar right now
and building an AST off it. And struggling a bit with that, but that's mainly
down to me not antlr.

It is strange that we're having such different experiences of it. I don't
recognise your quantum view of it either, as Antlr rules (and Bison's) are
very context-specific: they can only be triggered in the context of larger
rules, and only when given a specific piece of the target language (SQL here).
They get triggered only in those very specific cases. I've never had your
struggles with it. I don't understand.

------
kabdib
I've had good luck with a mix of recursive descent -- for high fidelity error
messages -- and Pratt at the expression level. I suspect that if expressions
got seriously complicated in my applications (e.g., lots of different types
with complicated promotion and mixing rules) I'd write the expression parser
in R-D as well.

I'm happy giving up some performance for a better chance at finding problems.
Who cares if your compiler clocks at a bazillion lines a second if all it can
say is "line 101: syntax error" or mumble something indistinct about a reduce-
reduce conflict.

I'm not writing commercial compilers, just little language processors. And my
users need all the help they can get.

~~~
cuspycode
I too have had great experiences with recursive descent. I don't understand
why so many people don't like it. The techniques for handling left and right
associativity of infix operators have been well known for decades, so I am
surprised that the original article thinks it is still an issue. But I have
met some people who felt uncomfortable with recursive descent parsers
implemented via recursive procedure calls, just because they didn't quite
understand recursive procedure calls in the first place. So maybe that's part
of the reason.

~~~
userbinator
My experience has been that those who don't like RD have a mostly
theoretical/academic background and are similarly unimpressed with "simple" or
"unsophisticated" code in general.

~~~
afiori
The author's case against RD is not about the implementation. At the end of
the article he calls for LR reference grammars (I imagine both for fuzz
testing and for documentation) to ensure that your perfectly usable RD parser
with nice error messages does not miss some edge cases.

------
hardwaregeek
I'd love to see a post on reproducible, incremental parse trees. Roslyn, Rust
Analyzer, et al. use them to provide good language server facilities.
Basically, their parse trees are immutable and can recover the entire source
file, comments and all. This allows for fast parsing that can be used for code
modification and good semantic analysis.
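
The core idea, as a rough TypeScript sketch (made-up shapes, not the actual
Roslyn or rust-analyzer types):

    // Every token keeps its surrounding trivia, so the immutable tree can
    // reproduce the source file exactly, comments and all.
    
    interface Token {
      leadingTrivia: string;  // whitespace/comments before the token
      text: string;           // the token itself, e.g. "let" or "x"
      trailingTrivia: string; // whitespace/comments up to the next token
    }
    
    interface SyntaxNode {
      kind: string;                     // e.g. "LetStatement"
      children: (SyntaxNode | Token)[]; // never mutated after construction
    }
    
    // Recovering the original file is just an in-order walk.
    function toSource(n: SyntaxNode | Token): string {
      if ("text" in n) return n.leadingTrivia + n.text + n.trailingTrivia;
      return n.children.map(toSource).join("");
    }
    
    // Because nodes are immutable, an edit builds a new tree that shares all
    // unchanged subtrees with the old one, which is what makes incremental
    // re-parsing and cheap "modified" copies for refactorings practical.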

Another area that should be mentioned is purposefully messy grammars. A common
addition to languages these days is arrow functions:

    (a, b, c) => { ...

These are really nice for defining functions, but they usually require
unbounded lookahead. For instance, in JavaScript the argument list can also be
parsed as a comma expression inside parentheses; in my language it can be
parsed as a tuple expression. There might be some tricky way of doing this
with an LR parser, but I can't figure it out. Therefore I needed to parse
arrow functions/tuples by basically assuming they're an arrow function; then,
if I'm mistaken, I convert my parameter list into a tuple expression and
continue parsing. For instance:

    (a, b, 10)

When we hit the 10, my parser goes "ah shoot, this is a tuple" and converts
the `(a, b` it has already parsed into expressions in a tuple.
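
Roughly like this (a TypeScript sketch with made-up AST shapes, not my actual
parser, and with the arrow body reduced to a single token just to keep it
short):

    type Expr =
      | { kind: "ident"; name: string }
      | { kind: "num"; value: number }
      | { kind: "tuple"; items: Expr[] }
      | { kind: "arrow"; params: string[]; body: Expr };
    
    function parsePrimary(toks: string[], pos: { i: number }): Expr {
      const t = toks[pos.i++];
      return /^\d+$/.test(t)
        ? { kind: "num", value: Number(t) }
        : { kind: "ident", name: t };
    }
    
    // Called after consuming "(": optimistically collect parameter names, and
    // note the moment we see something that can't be a parameter.
    function parseParensOrArrow(toks: string[], pos: { i: number }): Expr {
      const items: Expr[] = [];
      let couldBeParams = true;
    
      while (toks[pos.i] !== ")") {
        const item = parsePrimary(toks, pos);
        if (item.kind !== "ident") couldBeParams = false; // e.g. the 10 in (a, b, 10)
        items.push(item);
        if (toks[pos.i] === ",") pos.i++;
      }
      pos.i++; // consume ")"
    
      if (couldBeParams && toks[pos.i] === "=>") {
        pos.i++; // consume "=>"
        const params = items.map(e => (e.kind === "ident" ? e.name : ""));
        return { kind: "arrow", params, body: parsePrimary(toks, pos) };
      }
    
      // "ah shoot, this is a tuple": the items we already built simply become
      // tuple elements -- no re-lexing or re-parsing of the prefix needed.
      return { kind: "tuple", items };
    }
    
    // parseParensOrArrow("a , b ) => x".split(" "), { i: 0 })  -> arrow (a, b) => x
    // parseParensOrArrow("a , b , 10 )".split(" "), { i: 0 })  -> tuple (a, b, 10)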

Some may argue that grammars should just stay within LR/LALR/LL, but I do
think adding a little bit of parsing complexity for ease of use is a worthy
tradeoff.

~~~
drakmaniso
I believe the way it's usually done is to always parse a tuple, then parse the
arrow as a binary operator applied to the two surrounding expressions. This
potentially produces a syntactically incorrect AST (e.g. if an arrow follows a
tuple containing numbers), so you need to detect those cases in another pass
and produce the corresponding errors.

~~~
tgv
It can be done in one go: just check the types of the left and right operands
at the end of X -> expr "=>" expr, or give "=>" the highest priority when it
follows a tuple: X -> tuple ("=>" "{" body "}")?.

------
narwally
Lisp Programmers: "You guys are still trying to figure out parsing?"

On a more serious note, the grammar ambiguity mentioned in the article is one
of the things that has drawn me to languages that use s-expressions. Lisps are
just as varied in their semantics as any other set of languages. What sets
them apart is their shared syntactic core of prefix notation in combination
with parentheses to create s-expressions. This eliminates the vast majority
(though not all) of ambiguities concerning precedence and associativity.

~~~
pdonis
While I appreciate the benefits of Lisp, there is still a tradeoff. PG said in
one of his essays that in Lisp you write your program directly in the parse
trees that in other languages get generated when the code is parsed. But that
means that in Lisp you have to express everything in terms of parse trees, and
parse trees can be cumbersome. That's why other languages have syntax; yes, in
a way it's just syntactic sugar for parse trees, but syntactic sugar has uses.

~~~
skybrian
It seems more like you hardly ever write parse trees directly in Lisp. Instead
you write macros that get expanded into them.

It’s not really parsed until you have a data structure that lets you search
for all occurrences of a special form. Using lists for everything ignores
important distinctions between levels.

~~~
pdonis
Macros _are_ parse trees, which define code that operates on other parse trees
to expand them into still other parse trees. The ability to operate directly
on parse trees and transform them into other parse trees is a key advantage of
writing parse trees directly instead of having a layer of syntax in between.

~~~
narwally
Yeah, they're the ultimate 'syntactic sugar'. They take a bit of time to wrap
your brain around, but it's much easier to figure out what a macro is doing
under the hood than it is to figure out the exact semantics of syntactic sugar
in other languages. I try not to use them more than I have to, but for a
certain subset of problems there's nothing better for eliminating repetitive
code and for taking the language semantics out of the foreground so that
you're just left with your application semantics.

------
jules
PEG/parser combinators + Pratt has many advantages over LR:

1\. It's faster

2\. It's quite high level, arguably higher level than LR

3\. Can more easily support good error messages

4\. Can easily escape hatch into the surrounding Turing complete programming
language, so you can parse anything

5\. Does character level parsing, so no lexer required, and easily supports
parsing sublanguages, e.g. regex literals inside another programming language

6\. Can make use of the abstraction facilities of the surrounding programming
language

7\. No shift/reduce errors

8\. The whole parsing stack is WAY simpler than LR

The main advantages of LR are:

1\. Its theory is nicer because CFGs have nicer denotational semantics

2\. It perhaps supports better automatic error recovery

Anything I forgot?

P.S. Whether CFGs are actually a good formalism for programming languages is
debatable. At the character level, most programming languages' grammars are
not context-free, so the context-freeness depends on a non-context-free lexer.
It's fine to use a lexer, of course (as long as you do not need to combine
multiple languages), but this observation makes CFGs lose some of their
theoretical appeal, IMO. Character-level CFG parsers have some non-context-free
extensions to actually parse the things we want to parse.

------
brundolf
> We all know that parsing is an important part of designing and implementing
> programming languages, but it’s the equivalent of Brussels sprouts: good for
> the diet, but a taste that only a select few enjoy.

Ever since reading Crafting Interpreters I've found myself _obsessed_ with
writing parsers. I keep making new ones over and over, looking for little
things to improve on in projects that nobody will ever use. I'm actually a
little disappointed that parsing is considered "the easy part" of compilers; a
"solved problem", compared to things like type analysis and bytecode
optimization.

------
gameswithgo
Strong disagree on his comments about performance, unless you know that your
inputs will never be "big". Modern computers are fast, yes, but modern data
sets are also often massive. Thousands of lines a second is not good; it's
like 1% of what the hardware is capable of. For a personal project, or as part
of a product with constrained input sizes, that may be fine, but don't release
it to the world for unknown input sizes.

~~~
cellularmitosis
"Don't release this to the world" strikes me as a bit hyperbolic, because
parsing is rarely the bottleneck. Considering editors, individual source files
are typically (relatively) small, even in a large project. Considering
compilers, parsing is already the smallest portion of the run-time anyway.

~~~
formerly_proven
Maybe it's my bias, but in a lot of applications I've touched, parsing or
unparsing (and/or the associated inefficient I/O) would often be the single
largest consumer of CPU time.

------
smasher164
One thing I don't often see in discussions of parser technology is at what
stage in the development cycle you employ a certain strategy.

A serious compiler will inevitably use a hand-written parser. Once your
language needs mature IDE support, you will need a parser that not only
produces good error messages, but is also error tolerant, e.g. a local parsing
failure does not result in a global cascade of errors in your editing
environment.
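
One common way to get that tolerance is panic-mode recovery: on a parse error,
record a diagnostic, emit an error node, and skip ahead to a synchronization
token. A toy TypeScript sketch (a made-up one-statement grammar, nothing like
a production IDE parser):

    // Toy grammar: every statement is `let <name> ;`. On an error, emit an
    // error node and resynchronize at the next ";" so one bad statement
    // doesn't poison the rest of the file.
    
    type Stmt = { kind: "let"; name: string } | { kind: "error" };
    
    function parseProgram(toks: string[]): { stmts: Stmt[]; errors: string[] } {
      const stmts: Stmt[] = [];
      const errors: string[] = [];
      let i = 0;
      while (i < toks.length) {
        if (toks[i] === "let" && /^[a-z]+$/.test(toks[i + 1] ?? "") && toks[i + 2] === ";") {
          stmts.push({ kind: "let", name: toks[i + 1] });
          i += 3;
        } else {
          errors.push(`expected "let <name> ;" at token ${i}`);
          while (i < toks.length && toks[i] !== ";") i++; // skip to a sync token
          if (toks[i] === ";") i++;
          stmts.push({ kind: "error" }); // later statements still parse normally
        }
      }
      return { stmts, errors };
    }
    
    // parseProgram("let a ; let 1 2 ; let b ;".split(" "))
    //   -> two good "let" statements, one error node, one diagnostic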

But that's not the stage where grammars and parser-generators shine. LR(k)
grammars accept the complete class of deterministic context-free languages,
allowing you to produce an efficiently parsable language with no ambiguity.
The grammar itself can be easily modified to introduce new syntax, just by
updating some production rules, and it also serves as documentation to
incorporate into your language specification.

Sure, sometimes you have to deal with shift-reduce and reduce-reduce conflicts
(akin to static type errors), but once you get the hang of it, parser
generators become instrumental tools in language prototyping. Moreover,
current advancements in LR parsers address a lot of these drawbacks.

------
iso8859-1
How do parser combinators like megaparsec fit into this?

[https://hackage.haskell.org/package/megaparsec](https://hackage.haskell.org/package/megaparsec)

~~~
eckzow
The first footnote of the article says:

> While that could mean just about anything, nearly everyone who writes a
> half-decent hand-written parser, whether they know it or not, is writing a
> recursive descent parser [1].

> [1] Parsing Expression Grammars (PEG)s and “parser combinators” in some
> functional languages are just recursive descent parsers in disguise.

Which, while I haven't thought about it deeply, intuitively makes some sense:
parser combinators encourage an approach mostly equivalent to little functions
that do something like "parse a $thing, or parse a $different_thing" (etc.),
which translates down to something very much like a recursive descent parser.
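
Something like this toy TypeScript sketch (made-up combinators, not
megaparsec's actual API) makes the equivalence concrete: each parser is just a
function, the combinators compose those functions, and running the composed
parser is literally a set of mutually recursive calls descending through the
grammar:

    // A parser is a function from (input, position) to a result or null.
    type Parser<T> = (input: string, pos: number) => { value: T; pos: number } | null;
    
    const literal = (s: string): Parser<string> => (input, pos) =>
      input.startsWith(s, pos) ? { value: s, pos: pos + s.length } : null;
    
    // "parse a $thing, or parse a $different_thing"
    const either = <T>(a: Parser<T>, b: Parser<T>): Parser<T> => (input, pos) =>
      a(input, pos) ?? b(input, pos);
    
    // "parse a $thing, then a $different_thing"
    const seq = <A, B>(a: Parser<A>, b: Parser<B>): Parser<[A, B]> => (input, pos) => {
      const ra = a(input, pos);
      if (!ra) return null;
      const rb = b(input, ra.pos);
      return rb ? { value: [ra.value, rb.value], pos: rb.pos } : null;
    };
    
    // The composed parser descends through the grammar by plain function calls.
    const bool = either(literal("true"), literal("false"));
    const pair = seq(literal("("), seq(bool, literal(")")));
    // pair("(true)", 0) => { value: ["(", ["true", ")"]], pos: 6 }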

------
ducaale
There is also this interesting Stack Exchange answer, which compares different
types of parsers:
[https://softwareengineering.stackexchange.com/a/338888/37266...](https://softwareengineering.stackexchange.com/a/338888/372666)

------
vlovich123
I feel like the error handling is the trickiest part. I remember fighting YACC
so hard during my compilers course to get it to work well and provide
meaningful error messages. Bison offered some relief at the time through
custom extensions. Maybe the state of the art here is much better now. It was
definitely fun to do as a project, and it was depressing that my course didn’t
cover error reporting in any way, likely because there’s no formal theory
behind how to make it readable to humans (beyond a “wow - yours is the first
to do gold error messages” in the project feedback, it wasn’t mentioned in
class).

------
samatman
This should probably be titled "why I like LR parsers".

There was basically no discussion of PEGs or parser combinators, just a
footnote that he doesn't care for them.

Which is fine! LR parsing is a great approach, and he's committed a significant
amount of his life energy to working with it.

I think PEGs are great, though; I'd recommend them to any developer as the
default tool for munging structured data. Nearly any mess of regular
expressions can be improved with PEGs.

There are unsolved problems in the PEG world, particularly with error
recovery. But not insoluble problems by any means.

I'm not quite ready to show my work on PEGs but I can say I've gotten pretty
far with the approach. It's noteworthy that he starts with recursive descent,
notes that PEGs and combinators (basically the same thing) generalize recursive
descent, and then just kinda walks away from that whole line of inquiry!

But hey, I'm a parsing nerd. I'm glad Laurence Tratt has pushed the state of
the art in LR, just like I'm glad Terence Parr has done the same for LL. I'm
working in a similar space with PEG, and have a long way to go.

------
zinclozenge
I'm experimenting with a toy language and I wanted to introduce generic syntax
with angle brackets. I recall the golang team saying that it could lead to
unbounded lookahead. Is there a parsing approach that mitigates that?

~~~
earthboundkid
Yeah, not using angle brackets. ^_^

If you allow `a, b = f < t1, t2 > ( c )`, it can only be parsed if you know
whether f is generic (if it's not, then a and b are bools), which requires
whole-program parsing because f could be defined above, below, or in another
file.

Basically, if you want to have a sane grammar, don't use angle brackets for
generics or pipes for lambdas. It looks good to humans but really screws up
the computers.

~~~
laddng
Doesn't golang not support `a < b < c` and chained comparisons? I don't see
why they'd make that argument in the specification if it's not already legal
Golang syntax.

specification:
[https://go.googlesource.com/proposal/+/refs/heads/master/des...](https://go.googlesource.com/proposal/+/refs/heads/master/design/go2draft-type-parameters.md#why-not-use-the-syntax-like-c_and-java)

~~~
yencabulator
It does not, so the argument is very much valid.

[https://play.golang.org/p/NQL1rVp-wVC](https://play.golang.org/p/NQL1rVp-wVC)
\-- it's trying to interpret that as (1<2)<3, and the types make no sense for
that.

~~~
dllthomas
Which argument is valid?

~~~
yencabulator
That using angle brackets for generics would unduly complicate Go parsing.

~~~
dllthomas
Then although I agree, I'm still not sure I follow.

> It does not [support chained comparisons], so the argument is very much
> valid.

Was your comment to the effect of "although Go does not support chained
comparisons, it is syntactically valid for a comparison operator to be passed
the result of a comparison so the issue can still arise"?

That's valid, but I would have expected "but" in place of "so".

(sorry if it seems I'm picking nits; I'm just trying to understand and
explaining my attempts)

------
UncleOxidant
How about PEG parsers - or are they just a special case of recursive descent
parsers?

------
doonesbury
He shows us Flex/Yacc and shift/reduce errors, then moves on. I'd greatly
appreciate tool tips that either

\- help better diagnose where/why shift/reduce conflicts arise in the context
of a given grammar

\- argue whether Pratt or another approach entirely avoids shift/reduce errors.

------
formerly_proven
Ctrl+F "pratt", no matches. How is this technique still not widely known?
(Though one of their sources does mention it in passing)

~~~
neel_k
Pratt parsing (aka top-down operator precedence) is a special case of LR
parsing. Since Laurie Tratt (the author of the article) is advocating LR
parsing, he has no need to talk about Pratt parsing, as it is a special case
of the approach he advocates.

The nicest explanation of this fact is in Olivier Danvy and Kevin Millikin's
paper _Refunctionalisation at Work_ \-- as an example of their program
transformation technique, they demonstrate how operator precedence turns into
shift-reduce. It's really cute.

------
amelius
Nice overview, but it misses the important subfield of incremental parsing.
Modern IDEs couldn't exist without it.

------
ianhorn
I'd like to learn more about formal grammars and parsing in depth. Can anyone
recommend a good textbook?

~~~
sparkie
Engineering a Compiler by Cooper & Torczon covers the theory behind regular
expressions, LL parsing, and LR parsing.

~~~
ianhorn
Everyone in this thread seems to be saying Pratt and PEG matter most, which I
don't see in that book. Any recs on that front? (Or do you think they're not
important?)

~~~
sparkie
I recommend becoming familiar with the theory and making the decision for
yourself.

My personal preference is pretty much the same as Tratt's, except I never had
an aversion to LR grammars to begin with. I always found them intuitive. I had
used them before understanding the theory.

~~~
ianhorn
Thanks!

