Hacker News new | comments | ask | show | jobs | submit login
Constructing human-grade parsers (duriansoftware.com)
70 points by ingve 4 months ago | hide | past | web | favorite | 13 comments

> For a user-facing tool, parse errors should be recoverable and the parser should be able to keep going to collect multiple errors.

I don't necessarily agree with that assertion. I need to fix syntax errors one at a time anyway, and very often the parser recovers incorrectly and a single missing parenthesis gives hundreds of errors. Depending on the tooling, later errors can obscure earlier ones, but later errors are more likely to just be a result of the parser recovering incorrectly.

A lot of academic work has gone into making recoverable parsers, but a lot of that work originated from when compilers were much much slower. Remember there was a time when compiling meant taking your stack of punch cards down to the computer room, and checking on the results the next morning. The largest C++ file in my corner of the project I work on is about a quarter megabyte, and builds in 2 seconds on my aging desktop. Of course most of that time is compiling and not parsing. I intentionally added a parse error on its last line, and the compiler only spent 800ms to find it. Most languages parse faster than C++, and most source files are smaller. Waiting 800ms to get to the next syntax error isn't that bad, and it sure beats a screenful of false syntax errors resulting from incorrect parser recovery.

Tooling matters though. Good tooling means that rather than having to search a scollback buffer for the errors that matter, you editor has just jumped to some highlighted lines. But good tooling also means that as soon as you fix one syntax error, the parse results are updated. So it doesn't matter if the one error you fixed is the only one, or the tip of an iceberg.

Compilers are not the only tools worth considering. Language services are also important, and language services see most use when the code is half-written and malformed.

    func foo() {

    func bar() {
`foo` is still in the process of being written and the user has invoked auto-completion after typing `b`. It's good if the parser can recover and parse the `bar` function so that the language service can suggest `bar` there.

As I sit in my fancy editor typing away code fragments I am often struck by the notion that my productivity would improve if the analyzer remembered something about the last clean parse of the file.

If you parsed the file as a diff against the last known good state you should be able reason a lot more about what’s going on. Especially if the programmer has a habit of writing code incrementally.

Instead, what my editor does is autocomplete code blocks or statements in order to keep the parse tree consistent.

But my editing style causes it to constantly jam me up when I’m trying to change string construction, Boolean logic, function calls or function chaining.

I am reminded of Knuth's famous TeX, which goes into an interactive mode when it encounters an error, asking the user what to do next; options include inserting extra tokens before the unexpected one, trying to recover and continuing to process the input, aborting, or even opening the editor so you can edit it in-place. I don't know of any other compiler which does this.

I don't think you actually disagree with the author. I think they would basically agree with everything you wrote and just add on, "therefore write your parser so it actually does recover correctly." Which is what most of the post boils down to.

I tried using parser generators; it's been quite a while, and the state of the art is almost certainly improved over lex+yacc and bison. But I got sick of diagnostics no more specific than "Error in line 15", and sticking intermediate states into yacc to improve errors broke the grammars. Probably I should have looked more closely at ANTLR.

So I wound up at state-machine driven lexers with recursive descent for parsing, and since the code is hand crafted it's relatively easy to print errors that make sense to users. The parsers might be harder to maintain. But they are easy to understand.

I am not a career compiler jockey, for the most part I've just wanted to get languages of medium complexity into tree form so I can reason about things and generate some code or do some analysis. Mostly I want to get another job done, but emitting quality error messages is a general good.

While I certainly understand the desire to hand-write parsers, I think people often underestimate how useful it can be to have an explicit grammar for your language (even if you hand-write the parser then just fuzz test it against the grammar or something like that). It can help so much with external tooling, and you can often analyse explicit grammars in ways you could never analyse a hand-written parser.

As a concrete example, I recently had to write a parser for i3 config files and trying to figure out what was and wasn't syntactically valid was so horrendous that I ended up just hacking together some half-arsed thing just about works for the one or two files I tested it but that I'm fairly sure is going to break horrendously at some point.

To be honest I wish I got more "error in line 15" type stuff. Half the software I develop just fails silently

For those who may be interested in improving the error messages generated for LR parsers, you might want to look at MERR from the Unicon project.


EDIT: correct spelling mistake


Aside from the technical aspects, it seems like constructing good error messages involves understanding human ideas of the meaning of programming constructs.

A big thing is people think of a program as a set meaningful statement with errors involving additions or subtractions from this. But a naive c parser can't give a "missing semicolon" error. All it can do is give "unexpected token X" or some such, because it goes through the code token by token it reaches a token that make sense - that "unexpected token" may be what comes right after the missing semicolon or it might be even further down.

All this is an indication how different the human understanding of a unit of meaning is from just a language spec.

That doesn't (in theory) prevent a parser from taking a guess what the user meant though, for example by incorporating formatting (that is otherwise irrelevant to the parsing). In C++ terms, a missing a semicolon after a class definition is a problem, but almost always the next token is not on the same line but multiple lines further down if this is what happened. A parser that encounters a class definition followed by one or more newlines and then an unexpected token could very reasonably output a hint that there is probably a missing semicolon here (and possibly continue under the pretense that it was there).

The thing that killed my desire to write my own programming language was the stark difference between how I as a human would deconstruct a piece of code that doesn’t compile versus how the tools we tell people to use to build a programming language “accomplish “ the same goal.

At the time I gave up for good (never say never?) SGLR was the only strategy one might possibly mistake for “sane”, and it didn’t yet exist in a programming language I wanted to work in.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact