
GHC Parser Principles - jasim
https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/Parser#Principles
======
danharaj
I have a commit on GHC (a very modest one). The compiler is written in a mix
of somewhat old and new Haskell styles, which makes sense since it's one of
the oldest and largest codebases in the language. There is a constant drive to
modernize it, and some new techniques have come out of GHC architectural
changes [0].

I didn't work with the parser, mostly the constraint solver in the type
checker. Still, I browsed quite a bit of code. It's one of the most pleasant
and well-documented codebases I've worked in, in any language. In particular,
the [note] system is an excellent way to embed long-form documentation that
touches many components without using fancy doc tools: grep suffices.
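
The convention looks roughly like this (the note title and function here are
invented for illustration): the long-form explanation lives in one place as an
ordinary comment, and every piece of code that depends on it carries a
one-line pointer, so grep finds both ends:

```haskell
-- Note [Why the flag is checked twice]   (hypothetical title)
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-- Long-form prose explaining a subtle invariant, free to span
-- many paragraphs and to mention several other modules.

checkFlag :: Bool -> Bool
checkFlag b = b   -- See Note [Why the flag is checked twice]
```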

If you want to see what a large, old, industrial grade Haskell project looks
like I think GHC is a good place to look. And if you ever run into trouble
with GHC then I encourage you not to be afraid to hack on it. The code won't
bite and the maintainers have no shortage of patience for newcomers.

[0] [https://arxiv.org/abs/1610.04799](https://arxiv.org/abs/1610.04799)

~~~
endgame
Can you go into more detail about the [note] system?

~~~
ufo
Simon Marlow and Simon Peyton-Jones wrote a chapter about GHC for the
Architecture of Open Source Applications ebook and they talk about the note
system there. Skip to section 5.6.

[http://www.aosabook.org/en/ghc.html](http://www.aosabook.org/en/ghc.html)

------
harpocrates
I have very mixed feelings about using parser generators instead of hand
written parsers. I've contributed to GHC's parser grammar and spent a lot of
time reading Rust's hand-written parser.

On one hand, a grammar is an invaluable tool. It helps keep you honest about
ambiguities when you start thinking about adding language features. I think
languages should always keep some reference grammar up to date.

That said, I don't think it is a good idea for compilers to use parser
generators. A hand-written parser can have better error recovery and more
helpful error messages. GHC's parser has some pretty bad parse error messages.

Some work was merged into Happy a while ago to support better messages
(enabling you to see the expected tokens along with the bad token), but using
that in GHC itself is blocked by how complex GHC's tokenizer has become.

~~~
kazinator
Grammars don't keep you entirely honest about ambiguities, because parser
generators silently resolve some of them; Yacc, for example, resolves
shift/reduce conflicts in favor of shifting. It's possible to massage a Yacc
grammar into working and then not entirely understand it.

> _A hand-written parser can have better error recovery and more helpful error
> messages._

A hand-written parser lets you put a breakpoint on the function which handles
a given phrase rule, and when the execution stops there, you get a call stack
which more or less corresponds to the top-to-bottom derivation, with arguments
you can examine at every level.
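
For instance, with one function per phrase rule, a breakpoint in the innermost
rule shows the derivation as the call stack. A minimal sketch (grammar and
names invented for illustration), where stopping in `factor` while parsing
`2+3*4` would show `expr -> term -> factor` on the stack:

```haskell
import Data.Char (isDigit)

-- A parser takes input and returns a value plus the remaining input.
type P a = String -> Maybe (a, String)

-- expr ::= term ('+' term)*
expr :: P Int
expr s = do
  (x, s1) <- term s
  go x s1
  where
    go acc ('+':rest) = do
      (y, s2) <- term rest
      go (acc + y) s2
    go acc s' = Just (acc, s')

-- term ::= factor ('*' factor)*
term :: P Int
term s = do
  (x, s1) <- factor s
  go x s1
  where
    go acc ('*':rest) = do
      (y, s2) <- factor rest
      go (acc * y) s2
    go acc s' = Just (acc, s')

-- factor ::= digit+ | '(' expr ')'
factor :: P Int
factor ('(':rest) = do
  (x, s1) <- expr rest
  case s1 of
    (')':s2) -> Just (x, s2)
    _        -> Nothing
factor s = case span isDigit s of
  ("", _)    -> Nothing
  (ds, rest) -> Just (read ds, rest)

main :: IO ()
main = print (expr "2+3*4")
```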

~~~
ufo
The difference is that a bottom-up parser is able to detect the ambiguity
and complain about it at all. And anyone who wants to keep their sanity won't
rely on the default yacc behavior and will take care to get rid of all the
shift/reduce conflicts.
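
The classic case is the dangling else. A Yacc-style grammar sketch (invented,
not taken from any real compiler) that triggers exactly such a conflict:

```yacc
/* Ambiguous: in "if a then if b then s1 else s2", which `if`
   owns the `else`?  yacc reports a shift/reduce conflict here
   and by default resolves it by shifting, i.e. the `else`
   binds to the nearest `if`. */
stmt : IF expr THEN stmt
     | IF expr THEN stmt ELSE stmt
     | other
     ;
```

Getting rid of the conflict means restructuring the grammar (e.g. splitting
matched and unmatched statements) rather than trusting the default.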

------
arturogoosnargh
I'm writing a Haskell compiler in C as a hobby. I hadn't looked at GHC's code
before, but I find the way they deal with optional indentation interesting.
I implemented the same thing in a simpler way, though: a function in my parser
checks for the presence or absence of curly braces and sets some state on the
parser.

My implementation: [https://github.com/angel-manuel/schc/blob/c1fa77b332a765b4767822bc90314bc62aebac33/src/parser.c#L216-L260](https://github.com/angel-manuel/schc/blob/c1fa77b332a765b4767822bc90314bc62aebac33/src/parser.c#L216-L260)
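
For reference, the optional braces in question: Haskell's layout rule makes
these two definitions equivalent, so the parser (or a pre-pass) has to supply
virtual braces and semicolons when the explicit ones are omitted:

```haskell
-- Layout (indentation) version: { ; } are implied by the layout rule.
f :: Int -> String
f x = case x of
  0 -> "zero"
  _ -> "other"

-- Explicit-brace version: layout is switched off inside { }.
g :: Int -> String
g x = case x of { 0 -> "zero"; _ -> "other" }

main :: IO ()
main = putStrLn (f 0 ++ "/" ++ g 1)
```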

~~~
kreetx
Out of curiosity, do you have a mathematics background? I.e., for me the
scariest part of this is type inference and type checking.

------
FullyFunctional
That's neat; the parser working hard to make the language look simple.

GHC's success is a double-edged sword; it's made Haskell effectively
synonymous with "language accepted by a recent GHC". The upside is an amazing
language with a great many advanced features. The downside is an enormous
language in a monoculture. It's nearly the antithesis of C (a small language
that grew little and slowly, but for which there are hundreds of
implementations).

~~~
sevensor
It's funny, you could also say this of Ruby and MRI, which is in so many other
respects the polar opposite of Haskell.

Edit: Spelling.

~~~
FullyFunctional
I don't know either, but FWIW, there _are_ many (ok, several) Haskell
implementations, but they only implement the language defined in the standard,
and nobody has been able or willing to chase all the extensions in GHC.

Do Ruby and MRI have specs detailed enough that you could implement the
entire language from just those? Few languages do, but Scheme and Haskell are
among them.

~~~
lobster_johnson
Ruby is actually an ISO standard. Admittedly, it's frozen at Ruby 1.8 and it
doesn't count for much.

------
sfvisser
Similar to what I'm noticing just now in VSCode, where the IDE warns you that
comments are not allowed in a JSON file. This means the parser happily parses
comments and performs a separate check afterwards. Beats an "unexpected symbol
'/' at 12:4".

Great UX!
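
A minimal sketch of that approach (invented for illustration, not VSCode's
actual implementation): the scanner accepts `//` comments rather than choking
on them, but records their positions so a separate pass can report each one
with a precise location:

```haskell
-- Lenient scanner: accepts // comments in "JSON", but records where
-- they occur so a separate check can flag them afterwards.

-- Returns (text with comments removed, (line, col) of each comment).
stripComments :: String -> (String, [(Int, Int)])
stripComments = go 1 1 False
  where
    go _ _ _ [] = ("", [])
    -- inside a string literal: // is just text, not a comment
    go ln col True  ('"':cs) = cons '"'  (go ln (col + 1) False cs)
    go ln col True  (c:cs)   = cons c    (go ln (col + 1) True  cs)
    go ln _   False ('\n':cs) = cons '\n' (go (ln + 1) 1 False cs)
    go ln col False ('"':cs)  = cons '"'  (go ln (col + 1) True cs)
    go ln col False ('/':'/':cs) =
      let rest      = dropWhile (/= '\n') cs   -- skip to end of line
          (txt, ps) = go ln col False rest
      in (txt, (ln, col) : ps)
    go ln col False (c:cs) = cons c (go ln (col + 1) False cs)
    cons c (txt, ps) = (c : txt, ps)

main :: IO ()
main = do
  let (_, comments) = stripComments "{\n  \"a\": 1 // not allowed\n}"
  mapM_ (\(l, c) -> putStrLn ("comment not allowed at "
                              ++ show l ++ ":" ++ show c)) comments
```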

~~~
evincarofautumn
One thing I’ve learned in my years of working on compilers & programming
language tech, which at first is counterintuitive, is that for the best user
experience in a compiler, often you want your parser to be “total”, and _not_
to reject any inputs.

Instead of bailing out at the first error, you parse invalid portions of the
input and sync up to the next point where you can recover and keep parsing,
annotating the resulting token stream or parse tree with error indicators such
as “invalid number literal token” or “missing bracket”.

This lets you report all errors in a source file at once, so they can all be
highlighted in an IDE for example, regardless of errors earlier in the file.
The more semantic information you have about the _correct_ parts of the
program, the better error messages you can produce about the _incorrect_
parts.

You can extend this to all areas of the compilation pipeline—for example,
proceeding to resolve names and typecheck the definitions in a program that
are syntactically correct, while ignoring the others and leaving them as error
markers.
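
A tiny sketch of the idea (an invented example, not GHC's actual approach):
the tree type gets an explicit error constructor, and instead of failing the
parser emits an error node and syncs to the next delimiter:

```haskell
import Data.Char (isDigit)

-- "Total" parsing: every input yields a tree; unparsable spans
-- become Error nodes carrying the offending text.
data Item = Num Int | Error String
  deriving (Eq, Show)

-- Parse a comma-separated list of integers without ever failing:
-- each bad item is reported, and parsing resumes at the next comma.
parseList :: String -> [Item]
parseList s = map parseItem (splitOn ',' s)
  where
    parseItem t
      | not (null t), all isDigit t = Num (read t)
      | otherwise                   = Error t

    splitOn c xs = case break (== c) xs of
      (pre, _:post) -> pre : splitOn c post
      (pre, [])     -> [pre]

main :: IO ()
main = print (parseList "1,oops,3")
```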

~~~
tom_mellior
> Instead of bailing out at the first error, you parse invalid portions of the
> input and sync up to the next point where you can recover and keep parsing

People like saying that, but I've never seen it work well or provide any
benefits. Because these tools are necessarily imperfect, you will get false
positives, i.e., an earlier error will confuse the system sufficiently to
generate later errors that aren't really errors in the program. So you have to
fix errors in order anyway. So there is no benefit to producing further
messages. You just risk producing several pages of spurious diagnostics for a
simple typo.

Of course this is not _completely_ true, but it's true enough. Or if it isn't,
I'd like to hear about examples of systems that _never_ produce such false
positives, and where for some reason it makes sense (and is actually
beneficial) to change something later in the file first.

Until then, just give me the first error. I won't look at the rest anyway, so
no need for the compiler to be "clever" about it.

~~~
evincarofautumn
It works best in an IDE, or in my case in Emacs where I can jump to each error
in turn. On the command line or in an interactive mode (REPL, notebook, &c.) I
usually just want one error at a time. I’ve had fairly good luck with this in
C# in Visual Studio or Xamarin Studio with Roslyn, although I don’t actually
use .NET much—even though I worked on Mono for a few years, hah.

How well you can do recovery also depends a lot on the grammar of the
language. Ideally a language should be designed from the beginning to support
it. Simple example: if a language allows nested function definitions with the
same notation as top-level definitions, then you can’t reliably insert a
missing closing bracket for a definition, because you can’t assume that the
next one is a new top-level; it could be nested. (Although of course you can
use heuristics like indentation to get around this.)

Anyway, the point of successful recovery is that the diagnostics later in the
file are _not_ spurious. And if the compiler supports recoverable parsing, you
can still have a mode where it just bails out after the first error if that’s
what you prefer.

~~~
tom_mellior
> Anyway, the point of successful recovery is that the diagnostics later in
> the file are not spurious.

True if you only consider superficial syntax things. False in the real world,
where compilers also check whether identifiers are declared, and with what
type. If you have an error (syntax or semantic) and decide to press on, in
general your symbol tables will be in a messed-up state, and you will generate
spurious type or "undeclared identifier" errors. Also, in the real world, the
compilers I have experience with do _not_ continue only when they can be 100%
sure that it makes sense; they invariably continue further because they try to
be "smart".

The IDE point is semi-valid, since an IDE doesn't scroll to the bottommost,
most useless errors like a terminal does. But I'm still not convinced that it
adds _value_: that it enables situations where it's _better_ to fix the second
error before the first.

~~~
evincarofautumn
Yeah, it’s definitely a bad idea to press on beyond the point where you can
reasonably recover, and actually better to err on the side of fewer
diagnostics and _less_ information per message if it means the messages aren’t
misleading.

For example, if a compiler says “Expected semicolon before open bracket”…you
should be able to insert a semicolon there! In my experience, that’s what a
beginner programmer or someone unfamiliar with the language is going to try.
If that’s not possible, the compiler is probably leaking implementation
details of the parser, and should say something else. The most important thing
is that the source location points to the actual code where something is wrong
so the programmer can examine it. The message is secondary—make it actionable
if possible, otherwise just avoid making it actively harmful.

But how well you can do recovery also depends a lot on the language. Most
languages in wide use are _not_ designed to handle this, in both syntax &
semantics.

Again, even if you don’t want to go for full error-recovery, it’s still a good
thing to support in a compiler pipeline, because it makes it easier to produce
good error messages if the compiler accepts a superset of the actual
language—recognising common mistakes in order to offer fixes, for example.
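
For example (a hypothetical sketch, not any particular compiler's check), a
front end can deliberately recognise a common mistake in order to say
something targeted about it instead of a generic parse error:

```haskell
import Data.List (isInfixOf)

-- Accept a superset of the language: a condition containing `=`
-- is not valid, but recognising it lets us suggest the likely fix.
checkCondition :: String -> Either String String
checkCondition cond
  | "==" `isInfixOf` cond = Right cond
  | "="  `isInfixOf` cond =
      Left "found `=` in a condition; did you mean `==`?"
  | otherwise             = Right cond

main :: IO ()
main = print (checkCondition "x = 1")
```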

