
Don't Panic! Better, Fewer, Syntax Errors for LR Parsers - matt_d
https://soft-dev.org/pubs/html/diekmann_tratt__dont_panic/
======
nickmqb
I have a somewhat controversial opinion on this.

This paper, like some others that I've seen, treats parsing as an
academic/algorithmic topic. Given some token stream we'd like to minimize some
error criteria. However, these papers ignore the fact that people don't write
token streams: they write code that is formatted using whitespace. I don't
know any programmer that writes their source code on a single line. On the
contrary, pretty much every programmer formats their source code to some
reasonable (if not complete) degree. In other words: parsing error recovery is
not primarily an algorithmic problem: it's a usability problem.

Newlines and indentation are a source of extra information that we can use to
infer what the programmer meant. Why throw all that away during tokenization?
That's crazy! We can totally use it for error recovery/error message
generation.

I decided to play around with this idea in the design of my programming language
Muon [1]. The language uses redundant significant whitespace for parser error
recovery. This simple approach by itself has turned out to work surprisingly
well! In my admittedly biased opinion, it frequently surpasses the error
recovery quality in mature compilers such as the C# compiler, which has had
tons of effort poured into error recovery.

Of course, a purely whitespace-based recovery scheme is not perfect: there are
rough edges, like having to deal with tabs vs. spaces, and recovering inside a
line is usually not possible. But the fact that such a simple approach has led
to good results makes me think this would be a great area for future research,
one that could perhaps combine the best of both worlds.
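
To make the idea concrete, here is a rough sketch of the principle (not Muon's
actual code, just an illustration): on an error, skip forward to the next line
whose indentation is no deeper than the line where the bad construct began.

    // Sketch: `lines` are (indent, text) pairs from a whitespace-aware lexer.
    // Returns the line index at which to resume parsing; everything skipped
    // becomes a single error node.
    fn resync(lines: &[(usize, &str)], err_line: usize) -> usize {
        let err_indent = lines[err_line].0;
        let mut i = err_line + 1;
        while i < lines.len() {
            let (indent, text) = lines[i];
            // Blank lines carry no indentation information; skip them.
            if !text.trim().is_empty() && indent <= err_indent {
                break; // plausible start of the next sibling construct
            }
            i += 1;
        }
        i
    }

Because programmers almost always dedent at construct boundaries, this usually
lands the parser back in a sane state.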

[1] [https://github.com/nickmqb/muon](https://github.com/nickmqb/muon)

~~~
cwzwarich
An even bigger flaw with most academic research into parser error recovery is
that the vast majority of syntax errors occur from modifying a valid program
to produce an invalid program, but the recovery algorithms are oblivious to
this.

~~~
ComputerGuru
Have a look at this writeup by the author of lezer, if you haven't already:
[https://marijnhaverbeke.nl/blog/lezer.html](https://marijnhaverbeke.nl/blog/lezer.html)

~~~
ltratt
tree-sitter is excellent stuff! It's heavily inspired by Tim Wagner's PhD
thesis (original site seems to be down, but
[https://web.archive.org/web/20150919164029/https://www.cs.be...](https://web.archive.org/web/20150919164029/https://www.cs.berkeley.edu/Research/Projects/harmonia/papers/twagner-thesis.pdf) works). IMHO more people should know about that work, and the
sequence of work from Susan Graham's lab that led up to it. We have also been
heavily inspired by Tim's work, and Lukas's thesis extends and updates a number
of aspects of that seminal work, including, in Chapter 3, error recovery
([https://diekmann.co.uk/diekmann_phd.pdf](https://diekmann.co.uk/diekmann_phd.pdf)).

All that said, it's surprisingly difficult to compare error recovery in an
online parser (i.e. one that's parsing as you type) to a batch parser. In the
worst case (e.g. loading a file with a syntax error in it), online parsers have
exactly the same problems as a batch parser; however, once they've built up
sufficient context they have different, sometimes more powerful, options
available to them (but they also need to be cautious about rewriting the tree
too much as that baffles users).

------
ltratt
Co-author here. If you want to quickly play with this on the command-line with
your favourite Yacc grammar, a simple way is to use nimbleparse
[https://crates.io/crates/nimbleparse](https://crates.io/crates/nimbleparse)
('cargo install nimbleparse' should do the trick; though note that you will
probably need to munge your lexer a bit). You can see this in use in the paper
itself: the examples at the end are direct output from nimbleparse
[https://github.com/softdevteam/error_recovery_paper/tree/mas...](https://github.com/softdevteam/error_recovery_paper/tree/master/examples).
Some example grammars are at
[https://github.com/softdevteam/error_recovery_paper/tree/mas...](https://github.com/softdevteam/error_recovery_paper/tree/master/examples)
if you want to get going quickly.

If instead you want to use the full set of Rust libraries, a good place to
start is
[https://softdevteam.github.io/grmtools/master/book/quickstar...](https://softdevteam.github.io/grmtools/master/book/quickstart.html).
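
For a sense of the shape: from memory (so check the book for the exact current
API), the runtime side of the quickstart looks roughly like this, where calc.l
and calc.y are the quickstart's example lexer and grammar, compiled by a
build.rs step not shown here:

    use lrlex::lrlex_mod;
    use lrpar::lrpar_mod;

    // Bring in the lexer and parser generated from calc.l and calc.y.
    lrlex_mod!("calc.l");
    lrpar_mod!("calc.y");

    fn main() {
        let lexerdef = calc_l::lexerdef();
        let lexer = lexerdef.lexer("2 + + 3");
        // Thanks to error recovery, `res` is often usable even when
        // `errs` is non-empty.
        let (res, errs) = calc_y::parse(&lexer);
        for e in errs {
            println!("{}", e.pp(&lexer, &calc_y::token_epp));
        }
        match res {
            Some(r) => println!("Result: {:?}", r),
            _ => eprintln!("Unable to evaluate expression."),
        }
    }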

------
bollu
Awesome. I love good research into parsing, since parsing is famously "the
solved problem that isn't":
[https://tratt.net/laurie/blog/entries/parsing_the_solved_pro...](https://tratt.net/laurie/blog/entries/parsing_the_solved_problem_that_isnt.html)

I hope to study this paper well (which is by the author of the above post) and
understand what their algorithm is. I feel that good error messages make or
break a language.

Here is the arXiv link to the same paper:
[https://arxiv.org/pdf/1804.07133.pdf](https://arxiv.org/pdf/1804.07133.pdf)

~~~
userbinator
In theory it's "the solved problem that isn't", in practice it is --- there's
a reason the compilers for the most popular programming languages either moved
to or have always been recursive-descent with perhaps a bit of operator-
precedence/precedence-climbing to handle the repetitive cases of binary
operators. GCC, Clang/LLVM, Java, and C# are the ones that immediately come to
mind; and no doubt there are plenty of others.
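
The operator-precedence part in particular is tiny. A toy precedence-climbing
evaluator (token type and numbers invented for illustration) looks something
like:

    #[derive(Clone, Copy)]
    enum Tok { Num(i64), Plus, Minus, Star, Slash }

    // Binding power of a binary operator, or None for non-operators.
    fn prec(t: Tok) -> Option<u8> {
        match t {
            Tok::Plus | Tok::Minus => Some(1),
            Tok::Star | Tok::Slash => Some(2),
            _ => None,
        }
    }

    // Parse an expression whose operators all bind at least `min_prec`.
    fn parse_expr(toks: &[Tok], pos: &mut usize, min_prec: u8) -> i64 {
        let mut lhs = match toks[*pos] {
            Tok::Num(n) => { *pos += 1; n }
            _ => panic!("expected a number"), // a real parser recovers here
        };
        while *pos < toks.len() {
            let op = toks[*pos];
            let p = match prec(op) {
                Some(p) if p >= min_prec => p,
                _ => break,
            };
            *pos += 1;
            // Left-associative: the right operand must bind strictly tighter.
            let rhs = parse_expr(toks, pos, p + 1);
            lhs = match op {
                Tok::Plus => lhs + rhs,
                Tok::Minus => lhs - rhs,
                Tok::Star => lhs * rhs,
                Tok::Slash => lhs / rhs,
                _ => unreachable!(),
            };
        }
        lhs
    }

The rest is plain recursive descent, one function per construct.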

~~~
choeger
There are _way_ more parsers out there in production than the top-ten
programming language implementations. Just think about serialization formats
(json, bson, protobuf, et al.): There are maybe 10 or so widely used. Then
think about message formats (http, xml, yaml, ...): Maybe another 10 or so?
Then think about configuration languages (.ini, .toml, ...), let's call that
5. On top of that we have a bunch of widely used scripting languages (Make,
sh, various lisps), I would say again 5 or so.

So for any new language, say Rust, you want (ideally) parsers for about 30 or
more languages in real-world use cases. And that does _not_ include language
prototyping or consider the fact that recursive-descent parsers are usually
not available as libraries (say for use in an editor).

In conclusion: Having good automatic parser generators readily available is
very relevant. And "good" can obviously include usability, IMO.

------
cornstalks
I've thought about doing something like this when using matching/parsing with
derivatives[1]. When the graph collapses to ∅ (indicating an error), you could
back up one character, then ask the graph "what are the next valid token(s)?"
This would give you a list of possible things to insert (`,` and `=` for the
example given in the article, though naively implementing this would also
suggest stuff like `;`).

But this article takes it a step further and suggests deletions, too. That's
actually really cool. My biggest gripe with people making DSLs is the crappy
error messages you get when you have a syntax error (GCL, I'm looking at you).
Hopefully research like this will improve the lives of those of us that have
to use DSLs.
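
To sketch the "next valid tokens" query from the first paragraph with plain
Brzozowski derivatives over regular expressions (a toy version; the article's
construction extends this to CFGs, which needs fixpoints omitted here):

    #[derive(Clone)]
    enum Re {
        Empty,                  // ∅: matches nothing
        Eps,                    // ε: matches only the empty string
        Chr(char),
        Alt(Box<Re>, Box<Re>),
        Seq(Box<Re>, Box<Re>),
    }

    fn nullable(r: &Re) -> bool {
        match r {
            Re::Empty | Re::Chr(_) => false,
            Re::Eps => true,
            Re::Alt(a, b) => nullable(a) || nullable(b),
            Re::Seq(a, b) => nullable(a) && nullable(b),
        }
    }

    // Brzozowski derivative of `r` with respect to character `c`.
    fn deriv(r: &Re, c: char) -> Re {
        match r {
            Re::Empty | Re::Eps => Re::Empty,
            Re::Chr(x) => if *x == c { Re::Eps } else { Re::Empty },
            Re::Alt(a, b) => Re::Alt(Box::new(deriv(a, c)), Box::new(deriv(b, c))),
            Re::Seq(a, b) => {
                let d = Re::Seq(Box::new(deriv(a, c)), b.clone());
                if nullable(a) { Re::Alt(Box::new(d), Box::new(deriv(b, c))) } else { d }
            }
        }
    }

    // Does `r` denote the empty language (the collapsed-to-∅ state)?
    fn is_empty(r: &Re) -> bool {
        match r {
            Re::Empty => true,
            Re::Eps | Re::Chr(_) => false,
            Re::Alt(a, b) => is_empty(a) && is_empty(b),
            Re::Seq(a, b) => is_empty(a) || is_empty(b),
        }
    }

    // "What could come next?": the characters whose derivative is non-empty.
    fn next_valid(r: &Re, alphabet: &[char]) -> Vec<char> {
        alphabet.iter().copied().filter(|&c| !is_empty(&deriv(r, c))).collect()
    }

Restricting the suggestions to repairs that let the rest of the input parse is
exactly what filters out the naive `;` suggestion.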

[1]: [http://matt.might.net/articles/parsing-with-derivatives/](http://matt.might.net/articles/parsing-with-derivatives/)

------
emmanueloga_
Most grammars define a very specific starting point for languages. This is
fine of course, but what if we want to parse, say, only an expression, only a
function body, etc.?

I feel most parsing APIs tend to expose a single method, something like:

    parser.parse(source)

... which is kind of a limited API for the things I'm asking about above.

With a more flexible API, a structure-aware editor [1] could keep different
areas of the buffer separate, know whether the user is typing a function body,
a variable definition, etc., and only parse that part without breaking the rest
of the parser's progress (is this a good idea? I think so, but I don't think
most editors work like this, so maybe not :-)

--

An area of parser APIs that I've recently seen trouble with is lack of
metadata on the productions. I'd like to be able to do something like:

    production.meta() => { :line 23, :column 10 }

... or something like that. This is very useful for error reporting, for
instance. I was trying to find a parser for RDF Turtle that would do this, but
couldn't find any!
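
Concretely, I just want every node to carry its source span; a hypothetical
shape (all names made up):

    // Hypothetical: every production/node records where it came from.
    #[derive(Debug, Clone, Copy)]
    struct Span {
        line: u32,
        column: u32,
    }

    struct Node {
        kind: String,       // e.g. "function_body", "expression"
        span: Span,         // filled in by the parser at reduce time
        children: Vec<Node>,
    }

    impl Node {
        // The equivalent of production.meta() above.
        fn meta(&self) -> Span {
            self.span
        }
    }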

1:
[https://en.wikipedia.org/wiki/Structure_editor](https://en.wikipedia.org/wiki/Structure_editor)

~~~
choeger
You can always start from a different place even with an LR(k) parser. You
just need to make sure that your starting point is sensible.
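
The folklore trick (not specific to any one tool) is one real start rule per
entry point, selected by a synthetic token that the lexer emits first:

    %token START_EXPR START_STMT

    start
        : START_EXPR expr   /* caller wants just an expression */
        | START_STMT stmt   /* caller wants just a statement */
        ;

The driver tells the lexer which synthetic token to prepend before handing it
the input.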

------
tzs
In the first figure, it gives "int x y;" and says that the new algorithm gives
the complete set of minimal cost repair sequences, which in this case is
"Delete y", "Insert ,", and "Insert =".

I haven't read the article, so maybe this is covered later, but why isn't
"Delete x" included?

~~~
laszlokorte
Just a guess: "int x" could already be consumed, so there is no need to change
anything about it. It's only the next token that cannot be processed.

~~~
choeger
Intriguing, so if we design a language that allows for:

int single x;

and

int x y z;

The error for

int single x y z;

would be to delete "y" and "z", but not to delete "single"?

------
kazinator
Trying to repair bad syntax by editing the token stream is decades obsolete.
It was an important technique when compiling had long turnaround times which
included having to stand in line to submit a deck of punched cards to a clerk
in a job submission window. Naturally, programmers wanted to catch as many
errors as possible in a single round trip.

That said, it could be useful again, for an IDE to get the most amount of
information (for completion and whatnot) even in the face of a bit of bad
syntax.

The technique opens up possibilities for the compiler to get into a bit of a
loop generating a large quantity of error messages, most of which result from
its repair attempts. If any of the recovery strategies insert tokens, there is
the potential for an infinite loop if the code isn't careful.

In the 1970s, _Creative Computing_ magazine held a contest for producing the
longest ream of error messages from a single error in a program, or something
like that. That would have been fueled by error recovery.

------
choeger
That is a very well-written paper and an interesting topic! Kudos to the
authors! I'd really like this to turn into a thesis and further research.

Edit: it also follows the de facto standard for naming CS papers, which
demands a pun followed by a quick explanation. Brilliant!

I also just saw that it might be too late for the thesis part. A pity.

~~~
MaxBarraclough
> the de-facto standard for the naming of cs-papers, which demands a pun
> followed by a quick explanation

I'm reminded of the following wordplay. (I'm ashamed to say I was rather slow
to catch the pun when I first heard it.)

 _The aim of the semantic web is to save the world._

------
estebank
> A more grammar-specific variation of this idea is to skip input until a pre-
> determined synchronisation token (e.g. ‘;’ in Java) is reached [8 , p. 3],
> or to try inserting a single synchronisation token. Such strategies are
> often unsuccessful, leading to a cascade of spurious syntax errors (see
> Figure 1 for an example). Programmers quickly learn that only the location
> of the first error in a file – not the reported repair, nor the location of
> subsequent errors – can be relied upon to be accurate.
    
    
      C.java:2: error: ’;’ expected 
        int x y; 
             ^ 
      C.java:2: error: <identifier> expected 
        int x y; 
               ^
    

This is a common problem, but the solution here for the example shown is for
the parser _not_ to attempt to recover that statement and instead _ignore_
everything until it reaches the "synchronization point" (I call them
"landmarks"), which sounds like what they were arguing for at first. Doing that
would mark the AST with a node along the lines of "some int named x with
errors", and would completely _skip_ everything between the `x` and the `;`.
Because an error has been reported, there will be no output binary, but the
second error will _not_ be emitted.

Beyond the parser, the AST needs to keep around a lot of extra information for
error recovery that would otherwise not be needed. An example from rustc[1]
(what I'm familiar with) is structs and struct variants that have had some
parse error. We recover the parse and keep a node for the variant in the AST
signaling the existence of a parse error, which means that later passes of the
compiler can still type check and borrow check; but if such a pass finds a
_use_ that is either missing or _adding_ fields, we _do not_ emit an error for
it[2][3], under the assumption that the parse error caused the field being used
to not be captured in the AST node for the variant. This can have quite a
dramatic impact on the quantity of errors being emitted.
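
Boiled down, the pattern looks something like this (an illustrative sketch, not
rustc's actual types):

    // Illustrative only: remember that a definition had a parse error.
    struct VariantDef {
        name: String,
        fields: Vec<String>,
        recovered: bool, // true if this variant's definition failed to parse
    }

    fn check_field_use(variant: &VariantDef, field: &str) -> Result<(), String> {
        if variant.fields.iter().any(|f| f == field) {
            return Ok(());
        }
        if variant.recovered {
            // The field may have been lost to the parse error, so stay
            // quiet rather than stack a spurious type error on top of the
            // already-reported syntax error.
            return Ok(());
        }
        Err(format!("no field `{}` on `{}`", field, variant.name))
    }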

[1]: [https://github.com/rust-lang/rust/blob/7e11379f3b4c376fbb9a6...](https://github.com/rust-lang/rust/blob/7e11379f3b4c376fbb9a6c4d44f3286ccc28d149/src/librustc_middle/ty/mod.rs#L1969-L1971)

[2]: [https://github.com/rust-lang/rust/blob/7e11379f3b4c376fbb9a6...](https://github.com/rust-lang/rust/blob/7e11379f3b4c376fbb9a6c4d44f3286ccc28d149/src/librustc_typeck/check/expr.rs#L1292-L1295)

[3]: [https://github.com/rust-lang/rust/blob/7e11379f3b4c376fbb9a6...](https://github.com/rust-lang/rust/blob/7e11379f3b4c376fbb9a6c4d44f3286ccc28d149/src/librustc_typeck/check/pat.rs#L1085-L1094)

~~~
ltratt
I think we are in agreement that skipping over large chunks of input is rarely
a good idea. In Section 7 ("Using error recovery in practice") we show how you
can make fine-grained decisions about what to do when recovery has happened in
our Rust parsing system. The more fine-grained you go, the more code you have
to write, but it allows you to do exactly the sorts of things that rustc does
with structs if you want.
[https://softdevteam.github.io/grmtools/master/book/errorreco...](https://softdevteam.github.io/grmtools/master/book/errorrecovery.html)
is a more approachable version of the same stuff.
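
As a flavour of the fine-grained end (simplified from the book page above, so
treat the details as approximate): with YaccKind::Grmtools, each token reaches
an action as a Result, where Err marks a lexeme conjured up by error recovery,
so an action can decline to build a value out of thin air:

    Factor -> Result<u64, ()>:
        'INT'
        {
            // Err(_) means error recovery inserted this lexeme; there is
            // no real input text behind it, so don't try to evaluate it.
            let tok = $1.map_err(|_| ())?;
            $lexer.span_str(tok.span()).parse::<u64>().map_err(|_| ())
        }
        ;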

