* Incremental re-parsing. If a user in the IDE changes the document, we need to reparse the file, but we want to do this while using as little memory as possible. To this end, we re-use AST nodes from previous parses.
* Better error reporting. Parser generators are known for producing terrible errors. While you can hack around this, by using recursive descent you can get information from further "up" the tree to make your errors more relevant to the context in which they occurred.
* Resilient parsing. This is the big one! If you give our parser a string that is illegal according to the grammar, our parser will still give you a syntax tree! (We'll also spit out errors.) Getting a syntax tree regardless of the actual validity of the program being passed in means that the IDE can give autocomplete and report type-checking error messages. As an example, the code "var x = velocity." is invalid C#. However, in order to offer autocomplete after "velocity", that code needs to be parsed into an AST and then type-checked, and then we can extract the members of the type in order to provide a good user experience.
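To make this concrete, here is a toy Python sketch (nothing like Roslyn's real implementation; the grammar and node shapes are invented) of a resilient parser that records a "missing" node instead of failing on "var x = velocity.":

```python
# Minimal sketch of resilient parsing (illustrative only).
# For "var x = velocity." the member after '.' is missing, so we emit a
# placeholder node instead of failing -- the tree is still complete enough
# for an IDE to offer completions on `velocity`.

import re

def tokenize(src):
    return re.findall(r"[A-Za-z_]\w*|[=.;]", src)

def parse_var_decl(tokens):
    # Grammar sketch: 'var' IDENT '=' expr
    assert tokens[0] == "var"
    name = tokens[1]
    assert tokens[2] == "="
    expr, _ = parse_expr(tokens, 3)
    return ("var_decl", name, expr)

def parse_expr(tokens, i):
    node = ("ident", tokens[i])
    i += 1
    while i < len(tokens) and tokens[i] == ".":
        i += 1
        if i < len(tokens) and re.match(r"[A-Za-z_]", tokens[i]):
            node = ("member", node, tokens[i])
            i += 1
        else:
            # Resilience: record a "missing" member instead of raising.
            node = ("member", node, ("missing",))
    return node, i

tree = parse_var_decl(tokenize("var x = velocity."))
print(tree)
```

The type checker then sees `("member", ("ident", "velocity"), ("missing",))` and can offer the members of `velocity`'s type at the missing position.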
My personal opinion is that everyone should just use s-expressions. Get rid of this whole debate :P
It supports both incremental parsing and an API to inspect and recover incomplete ASTs (which powers Merlin, the IDE-like tool for OCaml). It provides stellar debugging features for ambiguous grammars and a way to produce good error messages (used in CompCert's C parser and Facebook's Reason).
So, it's not impossible. Most parser generators are not that good, though.
Maybe, but if you want all of the features mentioned above (incremental parsing etc.), then writing a parser becomes much more complicated.
* Repo: https://github.com/SAP/chevrotain
* Online Playground: http://sap.github.io/chevrotain/playground/
This handles the repetitive parts, such as generating lookahead functions or building the parse tree, while also providing advanced features such as error recovery and syntactic content assist (autocomplete).
With a parser generator, it quickly becomes a game of designing your language and its semantics around the limitations of the generator you are using.
Here are the first couple of paragraphs:
"In this API, control is inverted. The parser does not have access to the lexer. Instead, when the parser needs the next token, it stops and returns its current state to the user. The user is then responsible for obtaining this token (typically by invoking the lexer) and resuming the parser from that state. The directory demos/calc-incremental contains a demo that illustrates the use of the incremental API.
This API is “incremental” in the sense that the user has access to a sequence of the intermediate states of the parser. Assuming that semantic values are immutable, a parser state is a persistent data structure: it can be stored and used multiple times, if desired. This enables applications such as “live parsing”, where a buffer is continuously parsed while it is being edited. The parser can be re-started in the middle of the buffer whenever the user edits a character. Because two successive parser states share most of their data in memory, a list of n successive parser states occupies only O(n) space in memory."
There does not appear to be a specific mention of having the partial AST available.
 - linked from their front page and available at http://gallium.inria.fr/~fpottier/menhir/manual.pdf
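To make the quoted description concrete, here is a toy Python illustration (this is not Menhir's actual API) of inverted control: the parser never calls a lexer; instead a pure step function takes the current state plus one token and returns a new immutable state, so any intermediate state can be saved and resumed later.

```python
# Sketch of an "inverted control" parser API in the spirit of the quoted
# Menhir manual passage (illustration only). The toy grammar is a sum
# expression: INT ('+' INT)*.

def initial_state():
    return ("expect_int", 0)          # (what we expect next, running sum)

def offer(state, token):
    kind, total = state
    if kind == "expect_int":
        return ("expect_plus_or_end", total + int(token))
    if kind == "expect_plus_or_end" and token == "+":
        return ("expect_int", total)
    raise ValueError("unexpected token: %r" % (token,))

def result(state):
    kind, total = state
    assert kind == "expect_plus_or_end"
    return total

# The *user* drives the loop and may snapshot any intermediate state.
states = [initial_state()]
for tok in ["1", "+", "2", "+", "3"]:
    states.append(offer(states[-1], tok))

print(result(states[-1]))              # 6
# Because states are immutable, we can go back to the state after "1 +"
# and feed different tokens -- the basis of incremental re-parsing.
print(result(offer(states[2], "40")))  # 41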
I did a part of the incremental parsing in Menhir and the whole recovery aspect. I can try to explain a bit.
My goal was Merlin (https://github.com/ocaml/merlin/), not research so there is no paper covering the work I am about to describe. Also as of today only incrementality and error message generation are part of upstream version of Menhir, but the rest should come soon.
# Incrementality, part I
The notion of incrementality that comes built in with Menhir is slightly weaker than what you are looking for.
With Menhir, the parser state is reified and control of the parsing is given back to the user. The important point here is the departure from a Bison-like interface. The user of the parser is handed a (pure) abstract object that represents the state of the parse.
In regular parsing, this means we can store a snapshot of the parsing for each token, and resume from the first token that has changed (effectively sharing the prefix).
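A toy sketch of that snapshot-and-resume idea (the parser step is a stand-in; details assumed): keep the state after each token, and on an edit resume from the snapshot taken just before the first changed token.

```python
# Sketch of prefix sharing in incremental parsing (illustration only).

def offer(state, token):
    # Stand-in for a real parser step: the state is an immutable tuple
    # of the tokens consumed so far.
    return state + (token,)

def parse_with_snapshots(tokens, snapshots=None, old_tokens=()):
    # snapshots[i] is the parser state after consuming i tokens.
    k = 0
    if snapshots:
        while (k < len(old_tokens) and k < len(tokens)
               and old_tokens[k] == tokens[k]):
            k += 1
        states = snapshots[:k + 1]
    else:
        states = [()]
    reparsed = 0
    for tok in tokens[k:]:
        states.append(offer(states[-1], tok))
        reparsed += 1
    return states, reparsed

first, _ = parse_with_snapshots(["a", "b", "c", "d"])
# Edit only the last token: just one token is re-fed to the parser.
second, reparsed = parse_with_snapshots(["a", "b", "c", "X"],
                                        snapshots=first,
                                        old_tokens=["a", "b", "c", "d"])
print(reparsed)  # 1
```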
But on the side, we can also run arbitrary analysis on a parser (for error message generation, recovery, syntactic completion, or more incrementality...).
# Incrementality, part II
Sharing the prefix was good enough for our use case (parsing is not a bottleneck in the pipeline). But it turns out that a trivial extension to the parser can also solve your case.
Using the token stream and the LR automaton, you can structure the tokens as a tree:
- starts with a root node annotated by the state of the automaton
- store each token consumed as a leaf in that tree
- whenever you push on the automaton's stack, enter a sub-level of the tree (annotated by the new state), whenever you pop, return to the parent node
This gives you the knowledge that "when the parser is in state $x and $y is a prefix of the token stream, it is valid to directly reduce the subtree $z".
In a later parse, whenever you identify a known (state number, prefix) pair, you can short-circuit the parser and directly reuse the subtree of the previous parse.
If you were to write the parser by hand, this is simply memoization done on the parsing function (which is defunctionalized to a state number by the parser generator) and the prefix of token stream that is consumed by a call.
In your handwritten parser, reusing the objects from the previous parse tree amounts to memoizing a single run (and forgetting older parses).
Here you are free to choose the strategy: you can memoize every run since the beginning, devise some caching policy, etc. (think about a user (un)commenting blocks of code, or switching preprocessor definitions: you can get sharing of all known subtrees, if this is of any use :)).
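A minimal Python sketch of that memoization idea (illustration only: the fixed two-token "items" and single state number are simplifications of what the automaton would actually provide):

```python
# Memoize the parse function on (parser state, consumed token prefix) and
# reuse the subtree from an earlier run when the same pair shows up again.

cache = {}   # (state, token_prefix) -> (subtree, tokens_consumed)
hits = 0

def parse_item(state, tokens, i):
    # A real generator would bound the prefix via the automaton; here we
    # pretend every "item" is exactly two tokens long for simplicity.
    global hits
    prefix = tuple(tokens[i:i + 2])
    key = (state, prefix)
    if key in cache:
        hits += 1
        return cache[key]
    subtree = ("item", state) + prefix   # build the node "from scratch"
    cache[key] = (subtree, 2)
    return cache[key]

def parse(tokens):
    trees, i, state = [], 0, 0
    while i < len(tokens):
        tree, n = parse_item(state, tokens, i)
        trees.append(tree)
        i += n
    return trees

first = parse(["let", "x", "let", "y"])
second = parse(["let", "x", "let", "y", "let", "z"])
print(hits)  # the second run reuses the two known subtrees
```

Note that the reused subtrees are the very same objects as in the previous run, which is exactly the node sharing described above.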
So with part I and II, you get sharing of subtrees for free. Indeed, absolutely no work from the grammar writer has been required so far: this can all be derived by the parser generator (with the correctness guarantees that you can expect from it, as opposed to handwritten code).
A last kind of sharing you might want is sharing the spine of the tree by mutating older objects. It is surely possible but tricky and I haven't investigated that at all.
# Error messages
The error message generation is part of the released Menhir version. It is described in the manual and papers by F. Pottier (like http://dl.acm.org/citation.cfm?doid=2892208.2892224, PDF available on http://gallium.inria.fr/~fpottier/biblio/pottier_abstracts.h...).
I might be biased but contrary to popular opinions I think that LR grammars are well suited to error message generation.
The prefix property guarantees that the token pointed out by the parser is relevant to the error: there exist valid parses beginning with the prefix before this token. This is a property that doesn't hold for most backtracking parsers AFAIK, e.g. PEG.
Knowledge of the automaton and grammar at compile time allows precise work on error messages and a separation of concerns: the tooling ensures exhaustive error coverage, assists in migrating error messages when the grammar is refactored, and can give an explicit description of the information available around a given error.
This is not completely free, however; sometimes the grammar needs to be reengineered to carry the relevant information. But you would have to do that anyway with a handwritten parser, and here the parser generator can help you.
If you have such a parser generator, of course :). Menhir is the most advanced solution I know for that, and the UX is not very polished (still better than Bison). It is however a very efficient workflow once you are used to it.
So LR is not the problem, but existing tools do a rather poor job at solving that kind of (real) problems.
In Merlin, recovery is split in two parts.
The first is completion of the parsetree: for any prefix of a parse, it is possible to fill holes in the tree (and thus get a complete AST).
This is done by a mix of static analysis on the automaton and user guidance: for major constructions of the AST, the grammar writer provides "dummy" constructors (the node to use for erroneous expressions, incomplete statements, etc.).
The parser generator then checks that it has enough information to recover from any situation, or points out the cases it cannot handle.
It is not 100% free for the grammar writer, but for a grammar such as OCaml's it took less than an hour of initial work (I remember only a few minutes, but I wasn't measuring properly :)). It is also a very intuitive step, as it follows the shape of the AST quite closely.
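A toy illustration of the hole-filling idea (the node shapes and dummy names here are invented, not Merlin's): any unfinished node is completed with the grammar writer's dummy constructor for its syntactic category, yielding a complete AST.

```python
# Sketch of hole-filling with user-supplied "dummy" constructors.

DUMMIES = {
    "expr": ("error_expr",),   # dummy node for an erroneous expression
    "stmt": ("error_stmt",),   # dummy node for an incomplete statement
}

def fill(node):
    # Replace any hole with the dummy for its category; recurse otherwise.
    if isinstance(node, tuple) and node[0] == "hole":
        return DUMMIES[node[1]]
    if isinstance(node, tuple):
        return tuple(fill(c) if isinstance(c, tuple) else c for c in node)
    return node

# An 'if' whose body was never finished: the parse left a statement hole.
partial = ("if", ("ident", "cond"), ("hole", "stmt"))
print(fill(partial))  # ('if', ('ident', 'cond'), ('error_stmt',))
```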
The second part is resuming the parse after an error (we don't fill the whole AST every time there is an error; before filling holes we try to look at later tokens to resume the parse).
There is no big magic for that part yet, but the heuristic of indentation works quite well: constructions that have the same indentation in source code are likely to appear at the same depth in the AST, and this is used to decide when to resume consuming tokens rather than filling the AST.
I have a few leads on more disciplined approaches, but the heuristic works so well that it isn't urgent.
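A toy version of that indentation heuristic (my own simplification, not Merlin's code): after an error, resume consuming tokens at the first later line whose indentation is no deeper than the error line, on the assumption that equal indentation means equal depth in the AST.

```python
# Sketch of indentation-guided error recovery (illustration only).

def resume_line(lines, error_line):
    error_indent = len(lines[error_line]) - len(lines[error_line].lstrip())
    for j in range(error_line + 1, len(lines)):
        if not lines[j].strip():
            continue                       # skip blank lines
        indent = len(lines[j]) - len(lines[j].lstrip())
        if indent <= error_indent:
            return j                       # likely a sibling or parent node
    return len(lines)                      # nothing suitable: fill holes

src = [
    "let f x =",
    "  match x with",
    "  | Some y ->",          # error: unfinished branch
    "      ",
    "  | None -> 0",          # same indent as the error line: resume here
]
print(resume_line(src, 2))  # 4
```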
I would say that if I had to write a parser for a language for which a reasonable LR grammar exists, I would use a (good) parser generator without hesitation.
For the grammars of most programming languages, LR is the sweet spot between parsing expressivity and amenability to static analysis.
For instance, working in an OO language such as C#, I often want to turn the parse into an AST built out of idiomatic objects.
Most compiler-compilers use the same "callback hell" interface that was used for the original yacc in the 1970s. Thus you wind up writing error-prone code.
Conceptually, the grammars used in compiler-compilers aren't that much more complex than regexes (you just have named sub-expressions, the possibility of recursion, and possibly some specification of associativity). Yet regexes are probably used 100x more in application software development.
In many systems programming areas, using unergonomic tools is a badge of pride (it proves how smart you are), so I think ergonomics are often not taken seriously by the people who develop the tools.
We use this in Presto for parsing SQL and generating an immutable AST:
I wasn't smart enough to figure out how to adapt these strategies to my LL(k) grammars (late 90s). Maybe someone's figured this out.
Efficient and Flexible Incremental Parsing
TIM A. WAGNER and SUSAN L. GRAHAM
The library itself is here: https://github.com/tree-sitter/tree-sitter
and here are some existing grammars:
Adding a new keyword to a parser is a huge amount of work; even tests need to be written!
(Just kidding, we still do them internally too)
Aren't S-expressions particularly problematic for providing autocomplete hints?
It might be true that s-expressions would make people prefer certain semantic decisions that would be difficult to analyze for autocomplete.
(not that it's important of course, but your remark made me curious, and the reactions to my question even more so)
If you used an s-expression based grammar that still had fields and methods, then you'd get the exact same experience.
foo.bar.
        ^ autocomplete here

(. (bar foo) ...)
             ^ cursor is here
If you've already typed the form and are just changing the head symbol, you're golden:
If you're using Clojure's threading forms, Cursive understands that:
(-> my-object .myMeth|)
Otherwise, if you're typing out a new form you can do:
(.myMethod my-object |).
One correction - this is all done at edit time, not compile time. At compile time there's no issue since all macros are expanded to the core language.
It's just like how SQL decided to write all their queries backwards making it impossible to do good autocomplete.
There are also a ton of hobby languages that are statically typed and use s-exprs so the author doesn't have to spend as much time on syntax.
(defn len2 [^String x] (.length x))
> My personal opinion is that everyone should just use s-expressions. Get rid of this whole debate :P
C# 8 maybe?
Have a compiler target dialect that expects "(int 5)" and "(int x)", and have a type-inference step that targets such "typed" s-expressions?
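A toy sketch of that idea (the forms and type rules here are invented): an inference pass rewrites plain s-expressions into "typed" ones like (int 5), which a later compiler stage could then expect.

```python
# Minimal inference pass over s-expressions represented as nested lists
# (illustration only). (+ 5 x) with x : int becomes
# (int (+ (int 5) (int x))).

def annotate(expr, env):
    if isinstance(expr, int):
        return ["int", expr]
    if isinstance(expr, str):                 # a variable reference
        return [env[expr], expr]
    if expr[0] == "+":                        # (+ a b) : int
        return ["int", ["+", annotate(expr[1], env), annotate(expr[2], env)]]
    raise ValueError("unknown form: %r" % (expr,))

print(annotate(["+", 5, "x"], {"x": "int"}))
```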