Challenging LR Parsing (rust-analyzer.github.io)
121 points by dilap on Sept 16, 2020 | 42 comments



> There’s a lot of literature about error recovery for LR parsers, how come academics haven’t figured this out already? I have a bold claim to make: error-recovery research in academia is focusing on a problem irrelevant for IDEs. Specifically, the research is focused on finding “minimal cost repair sequence”

The parsing literature can be a bit difficult to navigate because the field is effectively dead for research purposes. However, other models of errors have been considered. Tim Wagner's PhD thesis is the best discussion of the application of LR parsing techniques in IDEs, and this paper extracted from it discusses error recovery in particular:

http://harmonia.cs.berkeley.edu/papers/twagner-er.pdf

In the non-interactive context, Richter's work on error recovery by substring parsing is also very interesting, in that it doesn't cause spurious errors when attempting to recover from errors, and it only depends on the language and not the grammar or parsing technique used:

https://dl.acm.org/doi/10.1145/3916.4019

For LR grammars (and many other grammars), the necessary substring parsing can be done in linear time using variations of the GLR algorithm.


In the open source world, both Tree-sitter and Lezer are based on GLR, and cite Tim Wagner's thesis. So, while it might be possible to do better even staying in the GLR world, I think the comparison based on Tree-sitter is fair.

I think these projects show that GLR is viable (there's a good set of grammars written for Tree-sitter), but I am also very willing to believe that the parser in rust-analyzer is more finely tuned for the specific needs of a Rust IDE.


While Tree-sitter is heavily influenced by Wagner's thesis, our error recovery strategy actually uses a novel approach, so it is fair to question whether Wagner's would have different properties.

For error recovery, Tree-sitter is currently optimized for some goals that are a bit different from the goals of a single-language IDE like Rust-Analyzer:

  * Like RA, we want to minimize the impact of syntax errors on basic features like syntax highlighting and symbol name extraction (which are important to GitHub and to lightweight text editors like Neovim and Atom).
  * Unlike RA, I wanted to avoid requiring grammar authors to think about error recovery at all. 
  * We needed to tightly control the performance cost of pathological error recovery cases, like parsing a large Ruby file as Go (either accidentally or adversarially).
My understanding is that Wagner's thesis is mostly concerned with "history-sensitive" error recovery in an IDE: using earlier syntax trees as an additional input to the error recovery process in order to better understand the user's intent when editing.

I think this idea makes a lot of sense, but it doesn't address cases like the ones Matklad has presented here, where the IDE is handed an already-invalid file.


> My understanding is that Wagner's thesis is mostly concerned with "history-sensitive" error recovery in an IDE: using earlier syntax trees as an additional input to the error recovery process in order to better understand the user's intent when editing.

That matches my understanding (note that Wagner's error recovery algorithm is probably best supplemented with Lukas Diekmann's thesis, which corrects and expands it a bit https://diekmann.co.uk/diekmann_phd.pdf).

Because Wagner's algorithm relies on history and context, it does sometimes do some unexpected (and not always useful) things. However, I don't see any fundamental reason why it couldn't be paired with a "traditional" error recovery approach (I have one to sell!) to try and get the best of both worlds.


Yeah, I agree. To really demonstrate the best possible IDE error recovery that LR/GLR can provide, pairing a robust batch error recovery strategy with Wagner's (and your and Lukas's) history-sensitive approach is probably the way to go.

At the time that I was implementing error recovery for Tree-sitter, that felt beyond my "complexity budget", since batch error recovery was already not trivial, and the batch approach was good enough for the features I was building on GitHub/Atom. Someday, I could imagine augmenting it with history-sensitive behavior. I am thankful that Lukas has helped to "pave that path" so thoroughly.


I was only commenting on his claims about the academic literature on parsing, not whether he should use the techniques verbatim.

The problem with structured approaches to incremental parsing is that you rarely want incremental parsing alone. You’re going to want to do something else with the resulting parse tree (name resolution, type checking, compilation, etc.) and no structured approach to incremental parsing will help you there. Once you’re stuck writing an incremental type checker by hand, you probably have the right framework for also writing an incremental parser.

Rust has additional complications, e.g. the macro system and the fact that the language is defined by a single implementation.


I can confirm that Rust's macro system presents massive difficulties. For example, rustfmt is currently broken for a subdirectory in Druid because of the use of a `cfg_if!` macro. We filed a bug (https://github.com/rust-lang/rustfmt/issues/4266) but it got closed without (as far as I can tell) actually being fixed. Our present approach is to remove the use of the macro, but this is a bit of a bad sign: two features (macros and automatic formatting) that work fine individually but break when they interact.


For error recovery, do any parsers have a notion of which tokens are more likely to be wrong?

For example, in an actively edited C++ file, braces are more likely to be correct than angle brackets, which are overloaded. So it's reasonable to auto-insert a closing > if you see a } with an open <, but not reasonable to insert a } if you see a > with an open {.
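
Concretely, I'm imagining something like a per-token insertion cost that recovery consults when deciding which token to fabricate. A made-up sketch (in Rust; the names and costs are hypothetical, not any existing parser's API):

  // Made-up sketch: bias recovery by giving each candidate token an
  // "insertion cost" and preferring the cheapest repair.
  #[derive(Clone, Copy, PartialEq, Eq, Debug)]
  enum Token {
      RAngle, // '>'
      RBrace, // '}'
  }

  fn insertion_cost(tok: Token) -> u32 {
      match tok {
          Token::RAngle => 1,  // angle brackets are often the real mistake
          Token::RBrace => 10, // braces are usually correct, don't invent them
      }
  }

  // From the tokens the parser would accept at the error point, pick the
  // cheapest one to pretend we saw.
  fn choose_insertion(acceptable: &[Token]) -> Option<Token> {
      acceptable.iter().copied().min_by_key(|&t| insertion_cost(t))
  }

  fn main() {
      // Open '<', then an unexpected '}': recovery prefers inserting '>'.
      let repair = choose_insertion(&[Token::RAngle, Token::RBrace]);
      assert_eq!(repair, Some(Token::RAngle));
  }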


> For error recovery, do any parsers have a notion of which tokens are more likely to be wrong?

lrpar has a %avoid_insert declaration which allows users to say "these token types are better avoided if there are other equivalently good repair sequences". It's simple, but it works well https://softdevteam.github.io/grmtools/master/book/errorreco...


I like this solution, and have thought about adding something similar as an "advanced feature" in Tree-sitter grammars. I'm interested to see how you've used it in grammars.


To the extent that there will be further research on this problem, I think it will be driven by machine learning. We're already seeing this with autocompletion via TabNine, and there's no reason a similar approach couldn't work for the other issues facing draft code.

If you had a large dataset of keystroke-by-keystroke edit streams, you'd be able to match up erroneous code with the eventual fix. Google or Facebook could absolutely capture these datasets for their internal work if they wanted, and I suspect the results could be scary-good. They wouldn't be able to release the models, though, even for their open source work (Chromium and Android), because it would have such a high chance of revealing internal secrets.


I haven't used it, but Microsoft is already offering machine-learning-based completions with IntelliCode [1]. See also the introductory blog post [2].

It also offers sharing refined predictions within teams.

I think we will see some interesting developments in this domain over the next few years.

[1] https://visualstudio.microsoft.com/services/intellicode/

[2] https://devblogs.microsoft.com/visualstudio/ai-assisted-inte...


Oh ... so that is what a Pratt parser is. Thanks for the nice clear writeup.

But methinks it should be called a Wirth parser. Niklaus Wirth invented the same thing in the 1970s. He used it to parse Pascal, and he drew the grammar as syntax diagrams: https://en.wikipedia.org/wiki/Syntax_diagram The loops that characterise Pratt are evident in those diagrams.

You can read his book on Pascal here: https://www.research-collection.ethz.ch/bitstream/handle/20.... At the end you will find the syntax diagrams for the entire language. The compiler was derived from those diagrams in the straightforward way described in the article.
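
For anyone who hasn't seen the technique, the whole of it is basically that loop plus a small binding-power table. A toy sketch (nothing like a production parser, and certainly not rust-analyzer's):

  // Toy Pratt-style expression parser over a token slice.
  #[derive(Debug)]
  enum Expr {
      Num(i64),
      Bin(char, Box<Expr>, Box<Expr>),
  }

  #[derive(Clone, Copy)]
  enum Tok { Num(i64), Op(char) }

  // (left, right) binding power; '*' binds tighter than '+'.
  fn binding_power(op: char) -> Option<(u8, u8)> {
      match op {
          '+' | '-' => Some((1, 2)),
          '*' | '/' => Some((3, 4)),
          _ => None,
      }
  }

  fn parse_expr(toks: &[Tok], pos: &mut usize, min_bp: u8) -> Expr {
      let mut lhs = match toks[*pos] {
          Tok::Num(n) => { *pos += 1; Expr::Num(n) }
          Tok::Op(c) => panic!("expected operand, got operator {:?}", c),
      };
      // The Wirth/Pratt loop: keep consuming operators while they bind
      // tightly enough, recursing for the right-hand side.
      while let Some(&Tok::Op(op)) = toks.get(*pos) {
          let Some((l_bp, r_bp)) = binding_power(op) else { break };
          if l_bp < min_bp { break; }
          *pos += 1;
          let rhs = parse_expr(toks, pos, r_bp);
          lhs = Expr::Bin(op, Box::new(lhs), Box::new(rhs));
      }
      lhs
  }

  fn main() {
      // 1 + 2 * 3 parses as 1 + (2 * 3)
      let toks = [Tok::Num(1), Tok::Op('+'), Tok::Num(2), Tok::Op('*'), Tok::Num(3)];
      let mut pos = 0;
      println!("{:?}", parse_expr(&toks, &mut pos, 0));
  }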

And everything the article says is true: if you are going to write a parser by hand, they are by far the easiest starting point. And I don't doubt error messages and error recovery are much easier too. Parser generators don't give you the same flexibility to play with a parser's guts.

But you do lose some things. A parser generator checks for ambiguities, something you can easily miss in a hand-written parser. A parser generator also enforces a clean split between the parser and whatever you are doing with the syntax tree, whereas hand-written parsers lead you toward a ball of mud where one lump of code has all the tasks intermixed.

And there are LR parsers out there that support the loops and priorities that raw BNF productions make so hard. Like mine :D http://lrparsing.sourceforge.net/doc/html/


You know --- more and more --- I kinda like the REBOL approach --- forget precedence.

Rather, just go L to R except where superseded by parens. Explicit and easy to understand.

(Though I say that not having lived it in production, lol - so take with a grain of salt)

Over time, readability/grokability tends to trump other factors...


Because parsing "a += b * c" as "(a += b) * c" is a wonderful, intuitive program.

(This does assume that op= is an expression in your language.)


Edit: Realized what I was trying to say: intuitive might be the wrong thing to chase.

Intuitive to whom? :) Trivial programs are trivial. We might be stuck in BCPL-derived syntax that trades a false notion of intuitive for complexity of implementation and maintenance. (And I do mean might --- it's hard not to think incrementally from the languages we know) --- but "intuitive" is EXACTLY the problem: I guess I was arguing we might consider searching for simple and predictable (even if that comes at the cost of slightly higher initial learning?)


e.g. maybe write it as "b * c += a"? ... lol full on LR - no ambiguity.


Including precedence rules will take care of that.


That… would be jcranmer's point; they're replying to a top-level comment advocating for not having precedence.


The APL approach, which is mostly just simple RL-associative + grouping parentheses, works surprisingly well in practice - although that probably owes something to the simplicity of the syntax and the way idiomatic APL uses mainly the built-in 1- and 2-argument functions.


That was the thought: is there some non-BCPL-like grammar that makes sense? Like... (not fully formed thought here) JSON is easy but verbose - YAML is a nightmare but cleaner. Maybe we're looking at it wrong....


As a nearby comment observes, this leads to s-expressions.

The connection between APL syntax and s-expressions is surprisingly close. Start with prefix notation and right-associativity, add grouping parens, and you have something that looks like lisp. The Nial language is an outright fusion of the two that's fun and enlightening to play with.

https://en.wikipedia.org/wiki/Nial https://github.com/danlm/QNial7


Smalltalk too. I really liked it.

In practice, C compilers warn you if you try to take advantage of && having higher precedence than ||, so, other than a few very simple rules, wise programmers use extra parens anyway.


I suspect it's not that bad to force people to group things manually (1 + (....)), but is precedence truly the biggest issue? I think even without that, hand-coded parsers are still superior for error reporting and other factors.


I just wrote a parser in TypeScript that parses itself (for fun, and not quite done yet).

Precedence and associativity are a painful aspect of parsing JavaScript… and probably C, Java and many programming languages that look like C. It took me a non-negligible part of the time to think about how to handle this. With the presence of infix, suffix and prefix operators, weird operators like new…() and ? … : …, and a lot of rules and exceptions, it's a lot of fun (for instance, yield is an operator in JavaScript. It takes an optional expression. However, 1 + yield is forbidden…). The grammar of JavaScript expressions is… not simple.

And then you have Automatic Semicolon Insertion, which forces you to backtrack and snapshot/restore the lexer state when consuming whitespace EVERYWHERE (well, not literally everywhere, but still), because things may behave differently if there is a new line. And of course, new lines cannot appear in some constructs (there are a few exceptions to the general behavior of ASI), so you need to take care of this too.

And then you have the comma operator, which is allowed in some places that take an expression (especially in older constructs) and disallowed in other places (especially in newer constructs). So you need to handle this too.

Parsing the rest is easy. It's just that there are a lot of things in the language so it's a bit long to write.


Even with Pratt parsing? Or using another method?


I can only assume another method, as both are trivial using a Pratt parser (though right-associativity still feels a bit weird).


Indeed, I wrote it by hand, the naive way.


The standing rule has always been 5th grade math comprehension.

As with most things, switching from too much to none is generally not an improvement.

Perhaps languages should have 8 levels of precedence instead of 18.


That kind of thinking inevitably leads to S-expressions.


You might be right about that. Maybe we need more bracket flexibility or something. Or something built for TOOLING - nobody just writes freeform anymore?


Or RPN


Maybe - REBOL was interesting in that functions all group right-to-left with respect to precedence and operators all group left-to-right. It's clean(-ish) without being too verbose/messy. It's not perfect - just thought it was an interesting direction to ponder...


Somehow F# manages to do a pretty good job (although maybe I've just gotten used to any quirks re: error recovery and messages). Anybody know if they're doing anything differently?

https://github.com/fsprojects/FsLexYacc

https://github.com/dotnet/fsharp/blob/main/src/fsharp/lex.fs...


Well said; there certainly is a chasm between the theory and what works in practice.

As a sidenote, if like myself you like to dabble first before jumping into theory, I would recommend starting with Jison -- it's so much fun and you'll learn a lot quickly:

https://zaa.ch/jison/try/

https://nolanlawson.github.io/jison-debugger/


Chart parsing (CYK and friends) will produce exactly what is needed: as few contiguous, completely parsed parts of the program as possible. The syntax tree from the example above will be different, but not by much: the parts "fn f(" and "struct { ... }" will be adjacent rather than included in a partial tree.
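
To make that concrete, here is a toy CYK chart for a two-rule grammar: even when the whole input does not parse, every span that does parse is recorded in the chart (just an illustration, unrelated to the paper below):

  // Toy CYK chart for the tiny CNF grammar  S -> A B,  A -> 'a',  B -> 'b'.
  use std::collections::HashSet;

  fn cyk_chart(input: &[char]) -> Vec<Vec<HashSet<&'static str>>> {
      let n = input.len();
      // chart[i][len - 1] = nonterminals that derive input[i .. i + len]
      let mut chart: Vec<Vec<HashSet<&'static str>>> = vec![vec![HashSet::new(); n]; n];
      for (i, &c) in input.iter().enumerate() {
          if c == 'a' { chart[i][0].insert("A"); }
          if c == 'b' { chart[i][0].insert("B"); }
      }
      for len in 2..=n {
          for i in 0..=n - len {
              for split in 1..len {
                  // S -> A B: left piece derives A, right piece derives B
                  let rule_applies = chart[i][split - 1].contains("A")
                      && chart[i + split][len - split - 1].contains("B");
                  if rule_applies { chart[i][len - 1].insert("S"); }
              }
          }
      }
      chart
  }

  fn main() {
      // "aab" has no parse as a whole, but the chart still records that
      // the span 1..3 is a complete S -- a fully parsed contiguous part.
      let input = ['a', 'a', 'b'];
      let chart = cyk_chart(&input);
      for i in 0..input.len() {
          for len in 1..=input.len() - i {
              if !chart[i][len - 1].is_empty() {
                  println!("{}..{} parses as {:?}", i, i + len, chart[i][len - 1]);
              }
          }
      }
  }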

There's a paper [1] showing that an efficient parallel implementation is possible.

[1] https://jyp.github.io/pdf/PP.pdf

Enjoy. ;)


relevant:

parser recovery in codemirror6 (inspired by tree-sitter but adapted for IDEs):

https://marijnhaverbeke.nl/blog/lezer.html

a standalone incremental JS parser with error recovery:

https://github.com/escaya/escaya


This post is a next-level look at parsing interactively in an editor or IDE. Not only do you want it to be fast, but you want it to be resilient enough to provide useful output given incomplete input.

Most of my poor attempts at parsing have been in the context of an offline tool. I find error handling hard enough using Yacc or Bison.

I missed the post that this one responds to, so I'm going back and reading that discussion too:

https://news.ycombinator.com/item?id=24480504


He suggests Pratt parsing as a production parser which, while unbelievably elegant, is recursive. Correctly recovering from a stack overflow (without leaking memory or resources) is, at best, a very hard problem worth researching in academia. Recursion is generally not production code; its usefulness does not extend much further than confusing university students.

That's the hidden benefit of parsers that use DFAs: they generally don't use recursion. They will parse something until you run out of memory, not stack space (which is at least 3 orders of magnitude smaller than the heap).


It's trivial to replace the stack usage with a data structure. I've written a number of extended-Pratt parsers that don't use the stack.

Also, a significant number of successful parsers have used recursion. As long as you successfully handle stack overflow and only hit it on pathological inputs, it works reasonably well.


> It's trivial to replace the stack usage with a data structure.

Interesting. Do you have any examples of that? Reading the article actually had me thinking about how that would be achieved.


You could trivially (or rather, mechanically) emulate the hardware-bounded call stack with a heap-based stack collection (like std::stack).
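
For example, in Rust terms (a generic shunting-yard-style sketch, not the parent's extended-Pratt parsers): the recursion becomes explicit Vecs on the heap, so deeply nested input consumes heap memory rather than call stack.

  #[derive(Debug)]
  enum Expr {
      Num(i64),
      Bin(char, Box<Expr>, Box<Expr>),
  }

  fn prec(op: char) -> u8 {
      match op { '+' | '-' => 1, '*' | '/' => 2, _ => 0 }
  }

  // Pop the two most recent subtrees and combine them under `op`.
  fn reduce(operands: &mut Vec<Expr>, op: char) {
      let rhs = operands.pop().expect("missing right operand");
      let lhs = operands.pop().expect("missing left operand");
      operands.push(Expr::Bin(op, Box::new(lhs), Box::new(rhs)));
  }

  // Parse "num (op num)*" without recursion.
  fn parse(tokens: &[&str]) -> Expr {
      let mut operands: Vec<Expr> = Vec::new();  // heap stack of subtrees
      let mut operators: Vec<char> = Vec::new(); // heap stack of pending ops
      for tok in tokens {
          if let Ok(n) = tok.parse::<i64>() {
              operands.push(Expr::Num(n));
          } else {
              let op = tok.chars().next().expect("empty token");
              // Reduce anything on the stack that binds at least as tightly.
              while operators.last().map_or(false, |&top| prec(top) >= prec(op)) {
                  let top = operators.pop().unwrap();
                  reduce(&mut operands, top);
              }
              operators.push(op);
          }
      }
      while let Some(op) = operators.pop() {
          reduce(&mut operands, op);
      }
      operands.pop().expect("empty expression")
  }

  fn main() {
      // 1 + 2 * 3 - 4  =>  Bin('-', Bin('+', 1, Bin('*', 2, 3)), 4)
      println!("{:?}", parse(&["1", "+", "2", "*", "3", "-", "4"]));
  }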



