
Constructing human-grade parsers - ingve
http://duriansoftware.com/joe/Constructing-human-grade-parsers.html
======
repiret
> For a user-facing tool, parse errors should be recoverable and the parser
> should be able to keep going to collect multiple errors.

I don't necessarily agree with that assertion. I need to fix syntax errors one
at a time anyway, and very often the parser recovers incorrectly and a single
missing parenthesis gives hundreds of errors. Depending on the tooling, later
errors can obscure earlier ones, but later errors are more likely to just be a
result of the parser recovering incorrectly.

A lot of academic work has gone into making recoverable parsers, but a lot of
that work originated from when compilers were much much slower. Remember there
was a time when compiling meant taking your stack of punch cards down to the
computer room, and checking on the results the next morning. The largest C++
file in my corner of the project I work on is about a quarter megabyte, and
builds in 2 seconds on my aging desktop. Of course most of that time is
compiling and not parsing. I intentionally added a parse error on its last
line, and the compiler only spent 800ms to find it. Most languages parse
faster than C++, and most source files are smaller. Waiting 800ms to get to
the next syntax error isn't that bad, and it sure beats a screenful of false
syntax errors resulting from incorrect parser recovery.

Tooling matters though. Good tooling means that rather than having to search a
scrollback buffer for the errors that matter, your editor has just jumped to
some highlighted lines. But good tooling also means that as soon as you fix
one syntax error, the parse results are updated. So it doesn't matter if the
one error you fixed is the only one, or the tip of an iceberg.

~~~
Arnavion
Compilers are not the only tools worth considering. Language services are also
important, and language services see most use when the code is half-written
and malformed.

    func foo() {
        b

    func bar() {
    }

`foo` is still in the process of being written and the user has invoked auto-
completion after typing `b`. It's good if the parser can recover and parse the
`bar` function so that the language service can suggest `bar` there.
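A minimal sketch of that kind of recovery in Python, assuming a toy `func` language like the snippet above; the tokenizer, the panic-to-next-`func` strategy, and all names here are illustrative, not any particular language service's API:

```python
# Panic-mode recovery for a toy `func` language: on a syntax error, skip
# ahead to the next `func` keyword and keep collecting declarations.
# Everything here (tokenizer, grammar, messages) is illustrative.
import re

def tokenize(src):
    return re.findall(r"[A-Za-z_]\w*|[(){}]", src)

def parse_func(tokens, i):
    """Parse `func NAME ( ) { ... }` at index i; return (name, next_i)."""
    name = tokens[i + 1]
    i += 2
    for expected in ("(", ")", "{"):
        if i >= len(tokens) or tokens[i] != expected:
            raise SyntaxError(f"in `{name}`: expected `{expected}`")
        i += 1
    depth = 1
    while i < len(tokens) and depth:
        if tokens[i] == "{":
            depth += 1
        elif tokens[i] == "}":
            depth -= 1
        i += 1
    if depth:
        raise SyntaxError(f"in `{name}`: unclosed `{{`")
    return name, i

def parse_decls(tokens):
    """Collect all declared names, recovering at each `func` keyword."""
    names, errors, i = [], [], 0
    while i < len(tokens):
        if tokens[i] != "func":
            i += 1
            continue
        try:
            name, i = parse_func(tokens, i)
            names.append(name)
        except SyntaxError as e:
            errors.append(str(e))
            i += 1
            while i < len(tokens) and tokens[i] != "func":
                i += 1  # panic: resynchronize at the next `func`
    return names, errors
```

With the snippet above as input, this still finds `bar` as a completion candidate while reporting a single error for the unclosed `foo`.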

~~~
hinkley
As I sit in my fancy editor typing away code fragments I am often struck by
the notion that my productivity would improve if the analyzer remembered
something about the last clean parse of the file.

If you parsed the file as a diff against the last known good state you should
be able to reason a lot more about what’s going on, especially if the programmer
has a habit of writing code incrementally.
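A minimal sketch of that fallback in Python, with a deliberately toy "parser" standing in for a real one; `Analyzer` and `parse_symbols` are hypothetical names:

```python
# Keep the symbols from the last parse that succeeded, and answer from that
# cache while the current buffer is broken. `Analyzer` and `parse_symbols`
# are hypothetical; the "parser" below is a deliberate toy.
class Analyzer:
    def __init__(self):
        self.last_good_symbols = []

    def update(self, source):
        try:
            symbols = parse_symbols(source)   # raises SyntaxError on bad input
            self.last_good_symbols = symbols  # remember the clean state
            return symbols
        except SyntaxError:
            # mid-edit and malformed: fall back to the last clean parse
            return self.last_good_symbols

def parse_symbols(source):
    """Toy parser: collect `def` names, reject unbalanced parentheses."""
    if source.count("(") != source.count(")"):
        raise SyntaxError("unbalanced parentheses")
    return [line.split()[1].split("(")[0]
            for line in source.splitlines()
            if line.lstrip().startswith("def ")]
```

So while the file is mid-edit and broken, the analyzer still answers from the last clean parse instead of going blank.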

~~~
hinkley
Instead, what my editor does is autocomplete code blocks or statements in
order to keep the parse tree consistent.

But my editing style causes it to constantly jam me up when I’m trying to
change string construction, Boolean logic, function calls or function
chaining.

------
kabdib
I tried using parser generators; it's been quite a while, and the state of the
art is almost certainly improved over lex+yacc and bison. But I got sick of
diagnostics no more specific than "Error in line 15", and sticking
intermediate states into yacc to improve errors broke the grammars. Probably I
should have looked more closely at ANTLR.

So I wound up at state-machine driven lexers with recursive descent for
parsing, and since the code is hand crafted it's relatively easy to print
errors that make sense to users. The parsers might be harder to maintain. But
they are easy to understand.
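For illustration, a hand-written recursive-descent parser in Python for a toy expression grammar, showing how naturally this style can name the construct being parsed when something goes wrong; the grammar and message wording are mine, not from any particular tool:

```python
# A toy recursive-descent expression parser whose errors name the construct
# being parsed and the offending token, not just a line number. Grammar:
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
import re

class Parser:
    def __init__(self, text):
        self.tokens = [(m.group(), m.start())
                       for m in re.finditer(r"\d+|[-+*/()]", text)]
        self.i = 0

    def peek(self):
        return self.tokens[self.i][0] if self.i < len(self.tokens) else None

    def error(self, context, expected):
        if self.i < len(self.tokens):
            tok, col = self.tokens[self.i]
            found = f"`{tok}` at column {col + 1}"
        else:
            found = "end of input"
        raise SyntaxError(f"while parsing {context}: "
                          f"expected {expected}, found {found}")

    def parse_expr(self):
        value = self.parse_term()
        while self.peek() in ("+", "-"):
            op = self.tokens[self.i][0]
            self.i += 1
            rhs = self.parse_term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def parse_term(self):
        value = self.parse_factor()
        while self.peek() in ("*", "/"):
            op = self.tokens[self.i][0]
            self.i += 1
            rhs = self.parse_factor()
            value = value * rhs if op == "*" else value // rhs
        return value

    def parse_factor(self):
        tok = self.peek()
        if tok is not None and tok.isdigit():
            self.i += 1
            return int(tok)
        if tok == "(":
            self.i += 1
            value = self.parse_expr()
            if self.peek() != ")":
                self.error("a parenthesized expression", "`)`")
            self.i += 1
            return value
        self.error("a factor", "a number or `(`")
```

`Parser("2*(3+").parse_expr()` then fails with something like "while parsing a factor: expected a number or `(`, found end of input", rather than a bare line number.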

I am not a career compiler jockey, for the most part I've just wanted to get
languages of medium complexity into tree form so I can reason about things and
generate some code or do some analysis. Mostly I want to get another job done,
but emitting quality error messages is a general good.

~~~
Sean1708
While I certainly understand the desire to hand-write parsers, I think people
often underestimate how useful it can be to have an explicit grammar for your
language (even if you hand-write the parser then just fuzz test it against the
grammar or something like that). It can help so much with external tooling,
and you can often analyse explicit grammars in ways you could never analyse a
hand-written parser.
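A sketch of that fuzzing idea in Python, with a toy grammar and a toy hand-written recognizer standing in for the real parser (both are illustrative):

```python
# Keep an explicit grammar, generate random sentences from it, and check that
# the hand-written parser accepts every one. Grammar and "parser under test"
# are toy stand-ins for the real thing.
import random

GRAMMAR = {
    "expr": [["term"], ["term", "+", "expr"]],
    "term": [["NUM"], ["(", "expr", ")"]],
}

def generate(symbol="expr", depth=0):
    """Randomly expand a nonterminal into a list of tokens."""
    if symbol == "NUM":
        return [str(random.randint(0, 9))]
    if symbol not in GRAMMAR:
        return [symbol]  # a literal terminal like "+" or "("
    # past a depth limit, always take the first (shortest) production
    productions = GRAMMAR[symbol] if depth < 8 else GRAMMAR[symbol][:1]
    return [tok for part in random.choice(productions)
            for tok in generate(part, depth + 1)]

def accepts(tokens):
    """Hand-written recognizer for the same language: the parser under test."""
    def term(i):
        if i < len(tokens) and tokens[i].isdigit():
            return i + 1
        if i < len(tokens) and tokens[i] == "(":
            j = expr(i + 1)
            if j is not None and j < len(tokens) and tokens[j] == ")":
                return j + 1
        return None

    def expr(i):
        i = term(i)
        if i is not None and i < len(tokens) and tokens[i] == "+":
            return expr(i + 1)
        return i

    return expr(0) == len(tokens)

# the fuzz loop: any rejected sentence means grammar and parser disagree
for _ in range(200):
    sentence = generate()
    assert accepts(sentence), " ".join(sentence)
```

Any sentence the grammar can generate but the hand-written parser rejects is a bug in one of the two, which is exactly the disagreement you want surfaced early.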

As a concrete example, I recently had to write a parser for i3 config files
and trying to figure out what was and wasn't syntactically valid was so
horrendous that I ended up just hacking together some half-arsed thing that
just about works for the one or two files I tested it on, but that I'm fairly
sure is
going to break horrendously at some point.

------
oldandtired
For those who may be interested in improving the error messages generated for
LR parsers, you might want to look at MERR from the Unicon project.

[http://unicon.org/merr/](http://unicon.org/merr/)

EDIT: correct spelling mistake

------
joe_the_user
Hmm,

Aside from the technical aspects, it seems like constructing good error
messages involves understanding human ideas of the meaning of programming
constructs.

A big thing is that people think of a program as a set of meaningful
statements, with errors being additions to or subtractions from this set. But
a naive C parser can't give a "missing semicolon" error. All it can do is give
"unexpected token X" or some such, because as it goes through the code token
by token it eventually reaches a token that doesn't make sense - and that
"unexpected token" may be what comes right after the missing semicolon, or it
may be even further down.

All this is an indication of how different the human understanding of a unit
of meaning is from just a language spec.

~~~
MauranKilom
That doesn't (in theory) prevent a parser from taking a guess at what the user
meant, though, for example by incorporating formatting (which is otherwise
irrelevant to the parsing). In C++ terms, a missing semicolon after a class
definition is a problem, but when that is what happened, the next token is
almost always not on the same line but several lines further down. A parser
that encounters a class definition followed by one or more newlines and then
an unexpected token could very reasonably output a hint that there is probably
a missing semicolon there (and possibly continue under the pretense that it
was present).
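A sketch of that hint in Python, assuming the parser tracks the line of the `}` that closed the class body; the function name and message wording are made up:

```python
# If the unexpected token after a class body sits on a later line than the
# closing `}`, hint at the missing `;`. Function name and wording are made
# up; a real parser would feed in token positions it already tracks.
def semicolon_hint(close_brace_line, token_line, token):
    if token_line > close_brace_line:
        return (f"error: expected `;` after class definition "
                f"(line {close_brace_line})\n"
                f"note: `{token}` on line {token_line} looks like the start "
                f"of the next declaration; did you forget a `;`?")
    return (f"error: unexpected `{token}` after class definition "
            f"on line {token_line}")
```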

