Hello, I work on the C# compiler and we use a handwritten recursive-descent parser. Here are a few of the more important reasons for doing so:
* Incremental re-parsing. If a user in the IDE changes the document, we need to reparse the file, but we want to do this while using as little memory as possible. To this end, we re-use AST nodes from previous parses.
* Better error reporting. Parser generators are known for producing terrible errors. While you can hack around this, by using recursive descent you can get information from further "up" the tree to make your errors more relevant to the context in which they occurred.
* Resilient parsing. This is the big one! If you give our parser a string that is illegal according to the grammar, our parser will still give you a syntax tree! (We'll also spit out errors.) But getting a syntax tree regardless of the actual validity of the program being passed in means that the IDE can give autocomplete and report type-checking error messages. As an example, the code "var x = velocity." is invalid C#. However, in order to give autocomplete on "velocity", that code needs to be parsed into an AST and then typechecked, and then we can extract the members on the type in order to provide a good user experience. (A rough sketch of this idea follows below.)
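Not the C# compiler's actual code, but here is a tiny sketch of the "always produce a tree, plus diagnostics" idea in a recursive-descent parser (written in OCaml, with made-up token and AST types):

    (* Toy only: on bad input we record a diagnostic and splice in a placeholder
       node instead of failing, so the IDE always gets a complete tree to
       typecheck and complete against. *)
    type expr =
      | Ident of string
      | Member of expr * string      (* obj.name; the name may be a placeholder *)
      | Missing                      (* placeholder for an absent expression *)

    type token = Id of string | Dot | Eof

    let diagnostics : string list ref = ref []

    let rec parse_postfix lhs = function
      | Dot :: Id name :: rest -> parse_postfix (Member (lhs, name)) rest
      | Dot :: rest ->
          (* "velocity." ends here: report the error, keep a usable node. *)
          diagnostics := "identifier expected after '.'" :: !diagnostics;
          parse_postfix (Member (lhs, "<missing>")) rest
      | rest -> (lhs, rest)

    let parse_expr = function
      | Id name :: rest -> parse_postfix (Ident name) rest
      | rest ->
          diagnostics := "expression expected" :: !diagnostics;
          (Missing, rest)

    (* parse_expr [Id "velocity"; Dot; Eof] yields Member (Ident "velocity", "<missing>")
       plus one diagnostic, rather than aborting the parse. *)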
My personal opinion is that everyone should just use s-expressions. Get rid of this whole debate :P
While I agree with you that those three points are extremely important, it turns out there is at least one parser generator that can do all of them: http://gallium.inria.fr/~fpottier/menhir/
It supports both incremental parsing and an API to inspect and recover incomplete ASTs (which powers Merlin, the IDE-like thing for OCaml). It provides stellar debugging features for ambiguous grammars and a way to have good error messages (which is used in CompCert's C parser and Facebook's Reason).
So, it's not impossible. Most parser generators are not that good, though.
Yeah, so you can spend a long time trying to find a parser generator that solves all of your problems in the language that you use, which may or may not exist. Or you can spend that same time just writing the parser. It turns out that writing a parser isn't that hard, and in the time you spend wavering over which parser generator to use, you could have most of your parser done.
Or you could take the middle ground and use a library meant to make it easier to create hand-built parsers, as opposed to a library that generates a parser for you. This handles the repetitive parts, such as creating lookahead functions or building the parse tree, while also providing advanced features such as error recovery or syntactic content assist (auto-complete).
* ECMAScript family of languages supported (JavaScript/TypeScript/CoffeeScript).
Not really, it is rather easy to hack features into a recursive descent parser.
That is the main reason almost all production compilers are written that way.
With a parser generator, it quickly becomes a game of designing your language and its semantics around the limitations of the generator you are using.
Details about the "incremental" mode are listed in the documentation PDF[0] at section 9.2.
Here are the first couple of paragraphs:
"In this API, control is inverted. The parser does not have access to the lexer. Instead, when the parser needs the next token, it stops and returns its current state to the user. The user is then responsible for obtaining this token (typically by invoking the lexer) and resuming the parser from that state. The directory demos/calc-incremental contains a demo that illustrates the use of the incremental API.
This API is “incremental” in the sense that the user has access to a sequence of the intermediate states of the parser. Assuming that semantic values are immutable, a parser state is a persistent data structure: it can be stored and used multiple times, if desired. This enables applications such as “live parsing”, where a buffer is continuously parsed while it is being edited. The parser can be re-started in the middle of the buffer whenever the user edits a character. Because two successive parser states share most of their data in memory, a list of n successive parser states occupies only O(n) space in memory."
There does not appear to be a specific mention of having the partial AST available.
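For what it's worth, driving that inverted-control loop looks roughly like this (a sketch only: `Parser` stands for a hypothetical Menhir-generated module built with --table and a start symbol `main`, and `Lexer.token` for an ocamllex lexer):

    module I = Parser.MenhirInterpreter

    (* The parser runs until it needs a token, then hands back a checkpoint;
       we fetch the token from the lexer and resume from that checkpoint. *)
    let rec loop lexbuf checkpoint =
      match checkpoint with
      | I.InputNeeded _env ->
          let token = Lexer.token lexbuf in
          let startp = lexbuf.Lexing.lex_start_p
          and endp = lexbuf.Lexing.lex_curr_p in
          loop lexbuf (I.offer checkpoint (token, startp, endp))
      | I.Shifting _ | I.AboutToReduce _ ->
          loop lexbuf (I.resume checkpoint)
      | I.HandlingError _env -> Error "syntax error"
      | I.Accepted ast -> Ok ast
      | I.Rejected -> Error "input rejected"

    let parse lexbuf =
      (* Every intermediate checkpoint is a persistent value that can be stored. *)
      loop lexbuf (Parser.Incremental.main lexbuf.Lexing.lex_curr_p)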
I did a part of the incremental parsing in Menhir and the whole recovery aspect. I can try to explain a bit.
My goal was Merlin (https://github.com/ocaml/merlin/), not research, so there is no paper covering the work I am about to describe. Also, as of today only incrementality and error message generation are part of the upstream version of Menhir, but the rest should come soon.
# Incrementality, part I
The notion of incrementality that comes built in with Menhir is slightly weaker than what you are looking for.
With Menhir, the parser state is reified and control of the parsing is given back to the user. The important point here is the departure from a Bison-like interface. The user of the parser is handed a (pure) abstract object that represents the state of the parsing.
In regular parsing, this means we can store a snapshot of the parsing for each token, and resume from the first token that has changed (effectively sharing the prefix).
But on the side, we can also run arbitrary analysis on a parser (for error message generation, recovery, syntactic completion, or more incrementality...).
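A minimal sketch of that snapshot-per-token idea, reusing the hypothetical `Parser` module from the snippet further up (the `snapshots` array, the token buffer, and `first_changed` are assumptions of the sketch, not part of Menhir itself):

    module I = Parser.MenhirInterpreter

    (* snapshots.(i) is the InputNeeded checkpoint saved just before token i was
       offered during the previous run; tokens is the edited token buffer (ending
       in EOF); first_changed is the index of the first token that differs.
       Everything before first_changed is shared with the previous parse. *)
    let reparse snapshots tokens first_changed =
      let rec loop i checkpoint =
        match checkpoint with
        | I.InputNeeded _ -> loop (i + 1) (I.offer checkpoint tokens.(i))
        | I.Shifting _ | I.AboutToReduce _ -> loop i (I.resume checkpoint)
        | I.HandlingError _ | I.Rejected -> Error "syntax error"
        | I.Accepted ast -> Ok ast
      in
      loop first_changed snapshots.(first_changed)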
# Incrementality, part II
Sharing the prefix was good enough for our use case (parsing is not a bottleneck in the pipeline). But it turns out that a trivial extension to the parser can also solve your case.
Using the token stream and the LR automaton, you can structure the tokens as a tree:
- start with a root node annotated by the state of the automaton
- store each token consumed as a leaf in that tree
- whenever you push on the automaton's stack, enter a sub-level of the tree (annotated by the new state), whenever you pop, return to the parent node
This gives you the knowledge that "when the parser is in state $x and $y is a prefix of the token stream, it is valid to directly reduce the subtree $z".
In a later parse, whenever you identify a known (state number, prefix) pair, you can short-circuit the parser and directly reuse the subtree of the previous parse.
If you were to write the parser by hand, this is simply memoization done on the parsing function (which is defunctionalized to a state number by the parser generator) and the prefix of the token stream that is consumed by a call.
In your handwritten parser, reusing the objects from the previous parsetree amounts to memoizing a single run (and forgetting older parses).
Here you are free to choose the strategy: you can memoize every run since the beginning, devise some caching policy, etc (think about a user (un)commenting blocks of code, or switching preprocessor definitions: you can get sharing of all known subtrees, if this is of any use :)).
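In handwritten-parser terms, a sketch of that memoization (all names illustrative) might look like:

    (* Cache the result of each parsing function at each token position; on a
       later parse, reuse the cached subtree whenever the tokens it consumed all
       lie before the first edited token. *)
    type ast = Leaf of string | Node of string * ast list

    type entry = { subtree : ast; consumed : int }   (* tokens eaten by that call *)

    (* key = (name of the parsing function, start position in the token buffer) *)
    let memo : (string * int, entry) Hashtbl.t = Hashtbl.create 97

    (* dirty_from is the position of the first token changed by the latest edit. *)
    let memoized name pos dirty_from (parse : int -> ast * int) : ast * int =
      match Hashtbl.find_opt memo (name, pos) with
      | Some { subtree; consumed } when pos + consumed <= dirty_from ->
          (subtree, pos + consumed)                   (* region unchanged: reuse *)
      | _ ->
          let subtree, pos' = parse pos in
          Hashtbl.replace memo (name, pos) { subtree; consumed = pos' - pos };
          (subtree, pos')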
So with part I and II, you get sharing of subtrees for free. Indeed, absolutely no work from the grammar writer has been required so far: this can all be derived by the parser generator (with the correctness guarantees that you can expect from it, as opposed to handwritten code).
A last kind of sharing you might want is sharing the spine of the tree by mutating older objects. It is surely possible but tricky and I haven't investigated that at all.
I might be biased, but contrary to popular opinion I think that LR grammars are well suited to error message generation.
The prefix property guarantees that the token pointed out by the parser is relevant to the error. The property means that there exist valid parses beginning with the prefix before this token. This is a property that doesn't hold for most backtracking parsers afaik, e.g. PEG.
Knowledge of the automaton and grammar at compile time allows precise work on error messages and a separation of concerns: the tooling ensures exhaustiveness of error coverage, assists in migrating error messages when the grammar is refactored, and can give an explicit description of the information available around a given error.
This is not completely free however, sometimes the grammar needs to be reengineered to carry the relevant information. But you would have to do that anyway with a handwritten parser and here the parser generator can help you....
If you have such a parser generator, of course :). Menhir is the most advanced solution I know of for that, and the UX is not very polished (still better than Bison). It is however a very efficient workflow once you are used to it.
So LR is not the problem, but existing tools do a rather poor job at solving that kind of (real) problems.
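Concretely, in Menhir's workflow you enumerate the error states (menhir --list-errors), write a message next to each erroneous sentence in a .messages file, and compile that file into a lookup function (menhir --compile-errors). The reporting side is then roughly as follows (same hypothetical `Parser` module as in the earlier sketch; `Parser_messages` is whatever name the compiled messages module gets in your build):

    module I = Parser.MenhirInterpreter

    let report lexbuf checkpoint =
      match checkpoint with
      | I.HandlingError env ->
          (* The LR state number tells us which hand-written message applies. *)
          let state = I.current_state_number env in
          let msg =
            try Parser_messages.message state with Not_found -> "Syntax error.\n"
          in
          Printf.eprintf "Line %d: %s" lexbuf.Lexing.lex_curr_p.Lexing.pos_lnum msg
      | _ -> ()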
# Recovery
In Merlin, recovery is split into two parts.
The first is completion of the parse tree: for any prefix of a parse, it is possible to fill holes in the tree (and thus get a complete AST).
This is done by a mix of static analysis on the automaton and user guidance: for the major constructions of the AST, the grammar writer provides "dummy" constructors (the node to use for erroneous expressions, incomplete statements, etc.).
The parser generator then checks that it has enough information to recover from any situation, or points out the cases it cannot handle.
It is not 100% free for the grammar writer, but for a grammar such as OCaml's it took less than an hour of initial work (I remember only a few minutes, but I wasn't measuring properly :)). It is also a very intuitive step, as it follows the shape of the AST quite closely.
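The "dummy constructor" part looks roughly like the following; the types and names are purely illustrative, not Merlin's actual definitions:

    (* For each major syntactic category, the grammar writer supplies a placeholder
       node that recovery can splice in when a semantic value is missing. *)
    type expr =
      | Ident of string
      | Apply of expr * expr list
      | Error_expr                      (* hole: an expression we could not parse *)

    type structure_item =
      | Let_binding of string * expr
      | Eval of expr
      | Error_item                      (* hole: an item we could not parse *)

    (* A let-binding whose right-hand side never arrived is then completed as: *)
    let recovered_example = Let_binding ("x", Error_expr)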
The second part is resuming the parse after an error (we don't fill the whole AST every time there is an error; before filling holes we try to look at later tokens to resume the parse).
There is no big magic for that part yet, but the heuristic of indentation works quite well: constructions that have the same indentation in source code are likely to appear at the same depth in the AST, and this is used to decide when to resume consuming tokens rather than filling the AST.
I have a few leads for more disciplined approaches, but the heuristic works so well that it isn't urgent.
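A toy rendering of the indentation heuristic (nothing like Merlin's actual code): resume consuming at the first later token whose indentation is no deeper than the construct being recovered, and let hole-filling handle everything skipped:

    (* tokens: the (indentation, token) pairs following the error, in source order.
       Returns the suffix of the stream at which to resume, if any. *)
    let resume_point ~construct_indent tokens =
      let rec scan = function
        | [] -> None
        | ((indent, _tok) :: _) as rest when indent <= construct_indent -> Some rest
        | _ :: rest -> scan rest
      in
      scan tokens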
# Conclusion
I would say that if I had to write a parser for a language for which a reasonable LR grammar exists, I would use a (good) parser generator without hesitation.
For the grammars of most programming languages, LR is the sweet spot between parsing expressivity and amenability to static analysis.
Another problem I have with parser generators is that they often have an awful API.
For instance, working in an OO language such as C#, I often want to turn the parse into an AST built out of idiomatic objects.
Most compiler-compilers use the same "callback hell" interface that was used for the original yacc in the 1970's. Thus you wind up writing error-prone code.
Conceptually, the grammars used in compiler-compilers aren't that much more complex than regexes (you just have named sub-expressions, the possibility of recursion, and possibly some specification of associativity). Yet regexes are probably used 100x more often in application software development.
In many systems programming areas, using unergonomic tools is often a badge of pride (it proves how smart you are), so I think ergonomics are frequently not taken seriously by those who develop the tools.
FWIW, I'm developing a library based on the technique outlined in this paper (and others by Tim Wagner). Like the original paper, it uses LR(1) (and GLR), not LL(k).
One nice thing about using recursive descent is that your parser "is just a normal program." It's also a potential problem, but every substantive project needs coding standards anyhow.
To clarify: do you think there are sets of semantic decisions using S-expressions that would still "get rid of the whole debate", and not make autocomplete harder in practice?
(not that it's important of course, but your remark made me curious, and the reactions to my question even more so)
I develop Cursive, an IDE for Clojure code. I handle this for Java interop in a couple of ways. Cursive does type inference (or propagation, really) in the editor, so if you have enough type hints it knows what the type of a symbol is. This allows a few tricks (| is the caret):
If you've already typed the form and are just changing the head symbol, you're golden:
(.myMeth| my-object)
If Cursive can accurately determine the type of my-object, you'll only see the relevant completions, i.e. methods from that type.
If you're using Clojure's threading forms, Cursive understands that:
(-> my-object .myMeth|)
also works.
Otherwise, if you're typing out a new form you can do:
(my-object .myMeth|)
and when you auto-complete, the forms will be switched around to:
(.myMeth| my-object)
Maybe a stupid question because I don't know much Clojure, but I'm curious. How can you be sure that -> is the threading macro and my-object is what you think it is at compile time? Can't macros change the entire meaning of an expression?
That's not a stupid question at all, and yes they can. Cursive works using static analysis rather than runtime introspection, so I need to teach it how macro forms work. This uses an extension API internally which will eventually be open so that users can add support for their own forms or forms in libs they use. Cursive uses the information about the macro forms to do symbol resolution, so that's currently a best effort in the presence of forms it doesn't understand yet. It works pretty well though since Clojure doesn't tend to use as many macros as, say, Racket. Doing full resolution means that I can distinguish clojure.core/-> from some other symbol with the same short name, it's not just using the -> text.
One correction - this is all done at edit time, not compile time. At compile time there's no issue since all macros are expanded to the core language.
Common Lisp via SLIME has quite good autocomplete because it extracts the information from a running system and doesn't have to infer it from the source code.