
Challenging LR Parsing - dilap
https://rust-analyzer.github.io/blog/2020/09/16/challeging-LR-parsing.html
======
cwzwarich
> There’s a lot of literature about error recovery for LR parsers, how come
> academics haven’t figured this out already? I have a bold claim to make:
> error-recovery research in academia is focusing on a problem irrelevant for
> IDEs. Specifically, the research is focused on finding “minimal cost repair
> sequence”

The parsing literature can be a bit difficult to navigate because the field is
effectively dead for research purposes. However, other models of errors have
been considered. Tim Wagner's PhD thesis is the best discussion of the
application of LR parsing techniques in IDEs, and this paper extracted from it
discusses error recovery in particular:

[http://harmonia.cs.berkeley.edu/papers/twagner-er.pdf](http://harmonia.cs.berkeley.edu/papers/twagner-er.pdf)

In the non-interactive context, Richter's work on error recovery by substring
parsing is also very interesting, in that it doesn't cause spurious errors
when attempting to recover from errors, and it only depends on the language
and not the grammar or parsing technique used:

[https://dl.acm.org/doi/10.1145/3916.4019](https://dl.acm.org/doi/10.1145/3916.4019)

For LR grammars (and many other grammars), the necessary substring parsing can
be done in linear time using variations of the GLR algorithm.

~~~
raphlinus
In the open source world, both Tree-sitter and Lezer are based on GLR, and
cite Tim Wagner's thesis. So, while it might be possible to do better even
staying in the GLR world, I think the comparison based on Tree-sitter is fair.

I think these projects show that GLR is viable (there's a good set of grammars
written for Tree-sitter), but I am also very willing to believe that the
parser in rust-analyzer is more finely tuned for the specific needs of a Rust
IDE.

~~~
maxbrunsfeld
While Tree-sitter is heavily influenced by Wagner's thesis, our error recovery
strategy actually uses a novel approach, so it is fair to question whether
Wagner's would have different properties.

For error recovery, Tree-sitter is currently optimized for some goals that are
a bit different from the goals of a single-language IDE like Rust-Analyzer:

    
    
  * Like RA, we want to minimize the impact of syntax errors on basic features like syntax highlighting and symbol name extraction (which are important to GitHub, and in lightweight text editors like Neovim and Atom).
  * Unlike RA, I wanted to avoid requiring grammar authors to think about error recovery at all.
  * We needed to tightly control the performance cost of pathological error recovery cases, like parsing a large Ruby file as Go (either accidentally or adversarially).
    

My understanding is that Wagner's thesis is mostly concerned with "history-
sensitive" error recovery in an IDE: using _earlier_ syntax trees as an
additional input to the error recovery process in order to better understand
the user's intent when editing.

I think this idea makes a lot of sense, but it doesn't address cases like the
ones Matklad has presented here, where the IDE is presented with an already-
invalid file.

~~~
ltratt
> My understanding is that Wagner's thesis is mostly concerned with "history-
> sensitive" error recovery in an IDE: using earlier syntax trees as an
> additional input to the error recovery process in order to better understand
> the user's intent when editing.

That matches my understanding (note that Wagner's error recovery algorithm is
probably best supplemented with Lukas Diekmann's thesis, which corrects and
expands it a bit
[https://diekmann.co.uk/diekmann_phd.pdf](https://diekmann.co.uk/diekmann_phd.pdf)).

Because Wagner's algorithm relies on history and context, it does sometimes do
unexpected (and not always useful) things. However, I don't see any
fundamental reason why it couldn't be paired with a "traditional" error
recovery approach (I have one to sell!) to try and get the best of both
worlds.

~~~
maxbrunsfeld
Yeah, I agree. To really demonstrate the best possible IDE error recovery that
LR/GLR can provide, pairing a robust batch error recovery strategy with
Wagner's (and your and Lukas's) history-sensitive approach is probably the way
to go.

At the time that I was implementing error recovery for Tree-sitter, that felt
beyond my "complexity budget", since batch error recovery was already not
trivial, and the batch approach was good enough for the features I was
building on GitHub/Atom. Someday, I could imagine augmenting it with history-
sensitive behavior. I am thankful that Lukas has helped to "pave that path" so
thoroughly.

------
rstuart4133
Oh ... so that is what a Pratt parser is. Thanks for the nice clear writeup.

But methinks it should be called a Wirth parser. Niklaus Wirth invented the
same thing in the 1970s. He initially used it to parse his first language,
Pascal. He drew them as syntax diagrams:
[https://en.wikipedia.org/wiki/Syntax_diagram](https://en.wikipedia.org/wiki/Syntax_diagram)
The loops that characterise Pratt are evident in those diagrams.
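
The loop-driven structure is easy to show in code. Here's a minimal sketch of a Pratt-style parser (the toy grammar --- single-digit numbers, + - * /, and parentheses --- is assumed purely for illustration; each precedence level is handled by the characteristic loop):

```python
# Minimal Pratt-style expression parser. The while-loop below mirrors
# the loops in Wirth's syntax diagrams: keep consuming operators whose
# binding power is at least the current minimum.
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def next_tok():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    BINDING = {'+': 1, '-': 1, '*': 2, '/': 2}

    def expr(min_bp=0):
        tok = next_tok()
        if tok == '(':
            left = expr(0)
            next_tok()  # consume ')'
        else:
            left = int(tok)
        # The characteristic Pratt loop:
        while peek() in BINDING and BINDING[peek()] >= min_bp:
            op = next_tok()
            # +1 makes operators left-associative
            right = expr(BINDING[op] + 1)
            left = (op, left, right)
        return left

    return expr()
```

For example, parse(list("1+2*3")) yields ('+', 1, ('*', 2, 3)), with multiplication binding tighter.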

You can read his book on Pascal here: [https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/68910/eth-3059-01.pdf](https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/68910/eth-3059-01.pdf)
At the end you will find the syntax diagrams for the entire language. The
compiler was derived from those diagrams in the straightforward way described
in the article.

And everything the article says is true: if you are going to write a parser by
hand, they are by far the easiest starting point. And I don't doubt error
messages and error recovery are much easier too. Parser generators don't give
you the same flexibility to play with a parser's guts.

But you do lose some things. A parser generator checks for ambiguities,
something you can easily miss in a hand written parser. A parser generator
also enforces a clean split between the parser and whatever you are doing with
the syntax tree, whereas hand written parsers lead you toward a ball of mud
where one lump of code has all the tasks intermixed.

And there are LR parsers out there that support the loops and priorities that
raw BNF productions make so hard. Like mine :D
[http://lrparsing.sourceforge.net/doc/html/](http://lrparsing.sourceforge.net/doc/html/)

------
sreekotay
You know --- more and more --- I kinda like the REBOL approach --- forget
precedence.

Rather, just go L to R except where superseded by parens. Explicit and easy to
understand.

(Though I say that not having lived it in production, lol - so take with a
grain of salt)

Over time, readability/grokability tends to trump other factors...
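
A toy evaluator for that scheme (hypothetical, single-digit numbers only, not actual REBOL semantics) is almost trivial, which is kind of the point:

```python
# Strictly left-to-right evaluation, no operator precedence:
# parentheses are the only way to group subexpressions.
def eval_ltr(tokens):
    pos = 0

    def next_tok():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
           '*': lambda a, b: a * b, '/': lambda a, b: a / b}

    def atom():
        tok = next_tok()
        if tok == '(':
            val = expr()
            next_tok()  # consume ')'
            return val
        return int(tok)

    def expr():
        left = atom()
        while pos < len(tokens) and tokens[pos] in OPS:
            op = next_tok()
            left = OPS[op](left, atom())  # fold left, no precedence
        return left

    return expr()
```

Here eval_ltr(list("1+2*3")) gives 9, i.e. (1+2)*3 --- you'd write "1+(2*3)" to get 7.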

~~~
jcranmer
Because parsing "a += b * c" as "(a += b) * c" is a wonderful, intuitive
program.

(This does assume that op= is an expression in your language.)

~~~
sreekotay
Edit: Realized what I was trying to say: Intuitive might be the wrong thing to
chase.

Intuitive to whom? :) Trivial programs are trivial. We might be stuck in the
BCPL-derived syntax that trades a false notion of intuitive for complexity of
implementation and maintenance. (And I do mean _might_ --- it's hard not to
think incrementally from the languages we know) --- but "intuitive" is
EXACTLY the problem: I guess I was arguing we might consider searching for
simple and predictable (even if that comes at the cost of slightly higher
initial learning?)

~~~
sreekotay
e.g. maybe write it as "b * c += a" ? ... lol full on LR - no ambiguity.

------
Nelkins
Somehow F# manages to do a pretty good job (although maybe I've just gotten
used to any quirks re: error recovery and messages). Anybody know if they're
doing anything differently?

[https://github.com/fsprojects/FsLexYacc](https://github.com/fsprojects/FsLexYacc)

[https://github.com/dotnet/fsharp/blob/main/src/fsharp/lex.fsl](https://github.com/dotnet/fsharp/blob/main/src/fsharp/lex.fsl)

------
matthewaveryusa
Well said; there certainly is a chasm between the theory and what practically
works.

As a sidenote, if, like myself, you like to dabble first before jumping into
theory, I would recommend starting with Jison -- it's so much fun and you'll
learn a lot quickly!:

[https://zaa.ch/jison/try/](https://zaa.ch/jison/try/)

[https://nolanlawson.github.io/jison-debugger/](https://nolanlawson.github.io/jison-debugger/)

------
thesz
Chart parsing (CYK and friends) will produce exactly what is needed: the
smallest possible set of contiguous, completely parsed parts of the program.
The syntax tree from the example above will be different, but not by much: the
parts "fn f(" and "struct { ... }" will be adjacent, rather than included in a
partial tree.
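
For anyone unfamiliar, a minimal CYK recognizer looks like this (toy grammar in Chomsky Normal Form, assumed purely for illustration --- the paper's parallel formulation is far more sophisticated):

```python
# Minimal CYK recognizer. The chart records, for every substring, which
# nonterminals derive it, so completely parsed fragments survive in the
# chart even when the whole input fails to parse.
def cyk(word, terminal_rules, binary_rules, start='S'):
    n = len(word)
    # chart[i][j]: nonterminals deriving the substring word[i : i+j+1]
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        for lhs, t in terminal_rules:
            if t == ch:
                chart[i][0].add(lhs)
    for length in range(2, n + 1):           # substring length
        for i in range(n - length + 1):      # start position
            for split in range(1, length):   # split point
                for lhs, (b, c) in binary_rules:
                    if (b in chart[i][split - 1] and
                            c in chart[i + split][length - split - 1]):
                        chart[i][length - 1].add(lhs)
    return start in chart[0][n - 1]
```

With the toy grammar S -> A B, A -> 'a', B -> 'b', the recognizer accepts "ab" and rejects "ba", while the chart still records the individually parsed 'a' and 'b' fragments.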

There's a paper [1] about an efficient _parallel_ implementation.

[1] [https://jyp.github.io/pdf/PP.pdf](https://jyp.github.io/pdf/PP.pdf)

Enjoy. ;)

------
leeoniya
relevant:

parser recovery in codemirror6 (inspired by tree-sitter but adapted for IDEs):

[https://marijnhaverbeke.nl/blog/lezer.html](https://marijnhaverbeke.nl/blog/lezer.html)

a standalone incremental JS parser with error recovery:

[https://github.com/escaya/escaya](https://github.com/escaya/escaya)

------
SloopJon
This post is a next-level look at parsing interactively in an editor or IDE.
Not only do you want it to be fast, but you want it to be resilient enough to
provide useful output given incomplete input.

Most of my poor attempts at parsing have been in the context of an offline
tool. I find error handling hard enough using Yacc or Bison.

I missed the post that this one responds to, so I'm going back and reading
that discussion too:

[https://news.ycombinator.com/item?id=24480504](https://news.ycombinator.com/item?id=24480504)

------
zamalek
He suggests Pratt parsing as a production parser, which, while unbelievably
elegant, is recursive. Correctly recovering from a stack overflow (without
leaking memory or resources) is, at best, a very hard problem worth
researching in academia. Recursion is generally not production code; its
usefulness does not extend much further than confusing university students.

That's the hidden benefit of parsers that use DFAs: they generally don't use
recursion. They will parse something until you run out of memory, not stack
space (which is at least 3 orders of magnitude smaller than the heap).

~~~
ghusbands
It's trivial to replace the stack usage with a data structure. I've written a
number of extended-Pratt parsers that don't use the stack.

Also, a significant number of successful parsers have used recursion. As long
as you successfully handle stack overflow and only hit it on pathological
inputs, it works reasonably well.

~~~
zamalek
> It's trivial to replace the stack usage with a data structure.

Interesting. Do you have any examples of that? Reading the article actually
had me thinking about how that would be achieved.

~~~
flashmozzg
You could trivially (or rather, mechanically) emulate the hardware-bounded
stack with a heap-based stack collection (like std::stack).
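
One classic shape of this is Dijkstra's shunting-yard algorithm: operator and operand stacks live on the heap, so expression parsing needs no call-stack recursion at all. A sketch for a toy expression language (single-digit numbers, assumed for illustration):

```python
# Expression evaluation without call-stack recursion: explicit operator
# and operand stacks (plain lists, i.e. heap-allocated) replace the
# recursion of a recursive-descent or Pratt parser.
def eval_iterative(tokens):
    PREC = {'+': 1, '-': 1, '*': 2, '/': 2}
    ops, vals = [], []  # heap-based stacks instead of the HW stack

    def apply_op():
        op = ops.pop()
        b, a = vals.pop(), vals.pop()
        vals.append({'+': a + b, '-': a - b,
                     '*': a * b, '/': a / b}[op])

    for tok in tokens:
        if tok.isdigit():
            vals.append(int(tok))
        elif tok == '(':
            ops.append(tok)
        elif tok == ')':
            while ops[-1] != '(':
                apply_op()
            ops.pop()  # discard '('
        else:  # operator: reduce equal-or-higher precedence first
            while ops and ops[-1] != '(' and PREC[ops[-1]] >= PREC[tok]:
                apply_op()
            ops.append(tok)
    while ops:
        apply_op()
    return vals.pop()
```

The stacks grow with input nesting depth, but on the heap, so a pathological input exhausts memory gracefully rather than overflowing the hardware stack.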

