
Dreaming of a Parser Generator for Language Design - matt_d
https://blog.adamant-lang.org/2019/dreaming-of-a-parser-generator/
======
seanmcdirmid
Hand-written recursive descent isn't that bad: it isn't that hard to
implement, it gives you lots of control over error recovery, you don't have to
worry about how ambiguity is handled (well, you deal with it on your own), and
you don't have to worry about interfacing with the rest of your
compiler/interpreter pipeline. The tradeoff is that you write some more code
and parser performance is less optimized. Considering everything else that
happens in a modern compiler, parser performance is in the noise anyways.

There are good reasons why most production languages are implemented with hand
written parsers. Generators are a bit of a convenience, but come with too many
drawbacks.

~~~
userbinator
_The tradeoff is that you write some more code and parser performance is less
optimized._

Performance _improved_ when GCC switched from a generated to a handwritten
parser. Additional techniques like using an operator precedence
table/"precedence climbing" as described in the article can reduce the amount
of code, make extending it easier, and you still don't have to learn the
specific language of a parser generator nor attempt to debug what it has
generated.
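
For illustration, here is a toy precedence-climbing sketch in Python; the
precedence table, token handling, and tuple AST are made up for the example and
have nothing to do with GCC's parser:

    # Toy precedence climbing: parse a left-associative binary expression by
    # looping over operators whose precedence is at least `min_prec` and
    # recursing with a higher minimum for the right-hand side.
    PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

    def parse_expression(tokens, min_prec=1):
        lhs = parse_primary(tokens)
        while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
            op = tokens.pop(0)
            # Left-associative: the right operand only grabs strictly tighter operators.
            rhs = parse_expression(tokens, PRECEDENCE[op] + 1)
            lhs = (op, lhs, rhs)
        return lhs

    def parse_primary(tokens):
        tok = tokens.pop(0)
        if tok == "(":
            expr = parse_expression(tokens)
            assert tokens.pop(0) == ")"
            return expr
        return int(tok)

    print(parse_expression(list("1+2*(3+4)")))  # ('+', 1, ('*', 2, ('+', 3, 4)))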

IMHO parser generators and parsing algorithms in general seem to have gotten a
bit too heavy on the academic theory side of things while ignoring practical
considerations, which is why recursive descent + precedence climbing became so
popular. Big production-quality compilers use it, little toy ones use it, I
don't see any drawbacks other than maybe a misguided attempt at academic
purity or dogmatism.

~~~
joe_the_user
Also, a recursive descent parser is created by hand-translating the language
specification into code, and so I think one gets a deeper understanding of what
the language spec "means" than if you just feed a spec into a parser generator
- a grammar's ambiguity can become apparent in the process, for example.

~~~
ltratt
My experience is different. I've converted two or three grammars from
recursive descent to LR recently. In each case, the recursive descent authors
have made surprising mistakes, resolving ambiguities in ways that violate the
intended spec. This isn't surprising: you can't, in general, know when you're
resolving ambiguities in a recursive descent parser, so mistakes are almost
inevitable.

------
andrewla
This is an excellent survey of many of the issues facing modern parsers.

They correctly point out that error handling is by far the biggest one, and
usually ends up being the biggest part of a parser for a language, full of
cruft and strange mixtures of semantics and syntax. Add to this that you
frequently want to try to partially recover -- make a guess as to what was
intended and continue parsing in the hopes of finding more actionable errors
later in the code.

I do disagree with a lot of the other conclusions, especially operator
precedence, which I've come to believe is generally a bad thing --
associativity can be fine (although even there I'm uncertain), but managing
operator precedence is almost always better done by using parentheses.

I like scannerless parsing, but the author makes a lot of good points that
might make me rethink that. The analogy to how people read makes some sense, but I'm not
sure there's as crisp a line between syntax and semantics as that analogy
would imply in natural languages.

Around ambiguity and grammar constraints, I think in the end it's something
you have to live with; a complex language with ambiguities or left-recursion
is hard for both a machine and a human to parse, and should be avoided. This
also relates to error handling and, to some degree, error recovery, which is a
very nice feature in a language grammar that allows you to generate some AST
information (for the purposes of syntax highlighting or LSP features) even on
incomplete or partially truncated code. Here I feel that there's a lot of room
for languages to include redundant markers that carry "no value", like
semicolon line endings -- one frustration often expressed by the
anti-semicolon people is "the compiler can tell me when a semicolon is
missing, let's just have the compiler put them in". I feel that this is
misguided and interferes with the idea of being able to continue parsing code
with errors in it.
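
To make the error-recovery point concrete, here is a minimal sketch of one
common approach, panic-mode recovery that synchronizes on ';'; the toy grammar
and token shapes are invented for the example, not taken from the article:

    # Panic-mode recovery: on a parse error, report it, skip to the next ';',
    # and keep parsing so later statements still produce AST nodes and errors.
    SYNC_TOKENS = {";"}

    def parse_program(tokens):
        statements, errors, i = [], [], 0
        while i < len(tokens):
            try:
                stmt, i = parse_statement(tokens, i)
                statements.append(stmt)
            except SyntaxError as err:
                errors.append(str(err))
                while i < len(tokens) and tokens[i] not in SYNC_TOKENS:
                    i += 1
                i += 1  # consume the synchronizing ';'
        return statements, errors

    def parse_statement(tokens, i):
        # Toy grammar: stmt -> IDENT '=' IDENT ';'
        if i + 3 < len(tokens) and tokens[i + 1] == "=" and tokens[i + 3] == ";":
            return (tokens[i], tokens[i + 2]), i + 4
        raise SyntaxError("malformed statement at token %d" % i)

    print(parse_program(["x", "=", "1", ";", "oops", ";", "y", "=", "2", ";"]))
    # ([('x', '1'), ('y', '2')], ['malformed statement at token 4'])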

~~~
johnday
> I do disagree with a lot of the other conclusions, especially operator
> precedence, which I've come to believe is generally a bad thing --
> associativity can be fine (although even there I'm uncertain), but managing
> operator precedence is almost always better done by using parentheses.

This is a very interesting take. I feel like there are many, many situations
where relying on operator precedence is not only natural, but where avoiding
it would cause the code to become _less_ readable. Look at any language which
uses many operators (my go-to would be Haskell) and consider the readability
of one over the other.

Also, what do you mean by associativity here? Associativity, as I know it,
just means that (a+b+c) = (a+b)+c = a+(b+c). There's clearly no downside to
that, right?

~~~
andrewla
Yes, you have associativity right. The question is what kind of price you pay.
Allowing a user to type ‘a+b+c’ seems reasonable and forcing them to write
‘a+(b+c)’ seems draconian. But what does ‘a/b/c’ mean? Does your grammar have
to have different productions for associative and non-associative binary
operators? For general operator support do you have to have productions for
every operator, or do you add a little bit of state to your parser outside the
grammar to handle the lookback to determine if the same operator is being
used? These are ugly decisions to make for the smallish benefit of being able
to omit a couple of parentheses. And how often does it really occur outside of
heavy numeric computations, which are kind of the minority?
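
To make the 'a/b/c' question concrete, here is the kind of difference the
grammar has to pin down one way or the other (the values are arbitrary):

    # 'a/b/c' under the two possible associativity choices.
    a, b, c = 8.0, 4.0, 2.0
    print((a / b) / c)  # 1.0 -- left-associative, what most languages specify
    print(a / (b / c))  # 4.0 -- right-associative, what a right-to-left rule would give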

Some APLs go overboard here and just say that everything operates right to
left so 2*3+4 is 14. This is super awkward but once you get used to it it’s
kind of a relief to not have to think about operator precedence. Similarly RPN
and Lisp dodge the issue.

In my experience operator precedence causes actual bugs, but associativity
rarely does. So despite all this I’m not firmly against associativity, but I’m
getting there.

~~~
loup-vaillant
> _These are ugly decisions to make for the smallish benefit of being able to
> omit a couple of parentheses._

Careful there. We're not talking about one compiler writer and one user. There
will be potentially _many_ users. That little benefit of being able to omit
parentheses will be multiplied across users (and across uses).

Paying a large cost upfront to enjoy small but lasting benefits could very
well be worth it. (Assuming the benefits are real, of course. That would
be a separate issue.)

~~~
andrewla
Agreed. And the cost is higher than that even; most people who look at code
are not the authors of the code, and presenting people with a snippet that
looks like `a + (b + c)` seems like it would always be a stumbling block --
"why not `(a + b) + c`? Is there a reason, or did they just type it that way
because that's their style?"

------
bd82
I've created a Parsing Toolkit (for JavaScript) which I believe fulfills 7/10
of the requirements described.

Chevrotain -
[https://github.com/SAP/chevrotain](https://github.com/SAP/chevrotain)

The list of features can be found here:
[https://sap.github.io/chevrotain/docs/features/blazing_fast....](https://sap.github.io/chevrotain/docs/features/blazing_fast.html)

What is interesting imho is that Chevrotain is _not_ a Parser Generator _nor_
a Parser Combinator. It is a tool that makes it easier to craft hand-built
recursive descent parsers, rather than a whole new level of abstraction.

------
austincheney
The paper author understates the importance of composability.

> While having a tool that supports combining grammars would be handy, I don’t
> see it as a must-have. In practice, languages are not combined very often.

That was once true, but not any more. Intermixing of grammars within a single
code instance is becoming more normal, and in many cases it is now the norm.
The inability to accept this new norm leaves you holding a bag of multiple
parsers and a large amount of processing overhead with a loss of flexibility
on how to manage the entirety of the code. It is a horrid mess to deal with a
collection of unrelated and nested ASTs to represent a single code instance.

> If the languages being combined are radically different, then for the sake
> of the programmer, there will probably need to be unambiguous delimiters at
> the transitions between the languages.

A faulty assumption that often does not reflect reality.

By focusing on composability first you remove many design barriers that limit
or prevent design decisions down the line.

\---

The fault of many parsers, from a design perspective, is a heavy concern for
goals outside of parsing, frequently either a compile step or the shape of the
AST. A compile operation is completely independent of a parse operation. Once
an AST is created a consuming compiler will execute on it. Keep these concerns
separate. Don't force your parser to act as a proxy for the limitations of
your compiler.

A flexible and well-reasoned parser can return an AST in a variety of
shapes/formats. It doesn't really matter so long as the data structure is well
reasoned and in a way the consuming application understands.

Also, performance is not well understood in many cases of parsing, which is
probably why parsers are often intermingled with the concerns of compilers. A
parse operation takes time. It takes a separate, unrelated amount of time for
a consuming application to consume the AST. The speed with which a parse tree
is generated is unrelated to the speed with which it is read. This is why
composability is more important than the paper gives credit for.

If anybody is interested I am working on addressing some of these concerns at
[https://sparser.io/](https://sparser.io/) which is a personal pet project. It
doesn't cure cancer or solve world hunger yet (I am just one guy with a job),
but the concept does solve some design problems and composability issues
associated with parsing.

~~~
tlb
I'm working on a language that embeds Javascript. Javascript syntax makes it
quite easy to find the end of a complete block by considering just parens,
brackets, braces, strings, and comments. Then I can just pass the string to a
JS interpreter. So it's easy to allow

    
    
      javascript {
        ...any valid javascript code...
      }
    

C++ syntax, on the other hand, makes it very hard to delimit a block. For
instance, it's hard to guess whether a > is an operator or the end of a
template parameter list. And preprocessor macros mean that the entire context
of all the included files before the current point can change what any token
means.

I think a good goal for a syntax is to make it easy to delimit blocks without
a complete parser. It also makes syntax highlighting practical.
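
A rough sketch of that kind of block delimiter in Python (ignoring things a
real JavaScript embedding would also need, such as regex literals and template
strings):

    def find_block_end(src, start):
        # Return the index just past the '}' matching the '{' at src[start],
        # balancing brackets while skipping string literals and comments.
        assert src[start] == "{"
        depth, i = 0, start
        while i < len(src):
            ch = src[i]
            if ch in "\"'":                   # skip string literals
                quote, i = ch, i + 1
                while i < len(src) and src[i] != quote:
                    i += 2 if src[i] == "\\" else 1
            elif src.startswith("//", i):     # skip line comments
                i = src.find("\n", i)
                if i < 0:
                    break
            elif src.startswith("/*", i):     # skip block comments
                end = src.find("*/", i + 2)
                i = len(src) if end < 0 else end + 1
            elif ch in "([{":
                depth += 1
            elif ch in ")]}":
                depth -= 1
                if depth == 0:
                    return i + 1
            i += 1
        raise SyntaxError("unbalanced block")

    src = '{ let s = "}"; /* } */ f({x: 1}) }'
    print(find_block_end(src, 0))  # 34 == len(src): braces inside the string and comment are ignored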

~~~
austincheney
JavaScript related languages introduce the same problem. Is the `<` a
comparison operator, the start of a type generics segment from something like
TypeScript, a markup tag from something like React, or something else?
JavaScript alone does not provide the context for that clarity. Boost
templates in C++ are a similar sort of problem, but there is less variability
around Boost templates, so they would be easier to discern from the
surrounding syntax than the several JavaScript variants.

> I think a good goal for a syntax is to...

I manage a parser supporting many languages, not a language. I don't have the
freedom to dictate what a language should look like. Instead my application
must have the flexibility to tolerate other people's language design
decisions, whatever they may be.

------
fjfaase
I developed a small interpreting parser, to which you can give a grammar on
the input and a file to be parsed. It parses the grammar (with itself) and
parses the input file according to the grammar specification. There is no need
to run a code generator and a compiler before testing a change in the grammar.
It uses a recursive descent back-tracking parser which, because it uses
memoization, works quite fast.

See [http://github.com/FransFaase/IParse](http://github.com/FransFaase/IParse)
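
For anyone unfamiliar with the trick, here is a minimal sketch of the
memoization idea (caching each rule/position result so backtracking never
re-parses the same span); IParse's actual implementation is of course more
involved:

    # Memoized ("packrat"-style) recursive descent: results are cached per
    # (rule, position), so alternatives that share a prefix reuse earlier work.
    from functools import lru_cache

    TEXT = "a" * 20 + ";"

    @lru_cache(maxsize=None)
    def parse_as(pos):
        # as -> 'a' as | <empty> ; returns the end position of the run of 'a's.
        if pos < len(TEXT) and TEXT[pos] == "a":
            return parse_as(pos + 1)
        return pos

    def parse_stmt(pos):
        # stmt -> as '.' | as ';' ; the second alternative reuses the cached run.
        for terminator in (".", ";"):
            end = parse_as(pos)  # cache hit on the second attempt, no re-scan
            if end < len(TEXT) and TEXT[end] == terminator:
                return end + 1
        return None

    print(parse_stmt(0))  # 21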

------
lsh
Aren't s-expressions all one really needs?

~~~
cellularmitosis
After realizing how easy it is to parse sexps, and how complex some language
syntax rules have become [1], I've wondered why we tolerate such complexity.

I’ve realized it’s because we are still stuck in a text-file-oriented way of
writing programs. We are hand-editing the serialization format of our
programs. All of these languages with complex syntax are an attempt to
optimize for that use-case.

If we had better tools which allowed us to leave behind text-file-oriented
programming, there would be no need for all of this complex syntax.

There would be a few other benefits as well. No one cares whether photoshop
.psd files use tabs or spaces, or whether the curly brace should go at end-of-
line or next-line, etc.

[1]
[https://github.com/ruby/ruby/blob/trunk/parse.y](https://github.com/ruby/ruby/blob/trunk/parse.y)
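
To back up the first point, a complete (if crude) s-expression reader fits in
about a dozen lines of Python; atoms are kept as strings and there is no error
handling:

    def read_sexp(text):
        # Tokenize by padding parentheses with spaces, then read recursively.
        tokens = text.replace("(", " ( ").replace(")", " ) ").split()
        def read(i):
            if tokens[i] == "(":
                items, i = [], i + 1
                while tokens[i] != ")":
                    item, i = read(i)
                    items.append(item)
                return items, i + 1
            return tokens[i], i + 1
        sexp, _ = read(0)
        return sexp

    print(read_sexp("(define (square x) (* x x))"))
    # ['define', ['square', 'x'], ['*', 'x', 'x']]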

~~~
munificent
_> After realizing how easy it is to parse sexps_

Why should "easy for the machine" ever be a consideration for software used by
humans? Surely "easy for the human" is the metric that matters. The entire
point of having computers is so that they can do hard things to make our lives
easier.

If rich syntax is easier for us to read and feasible for computers to parse,
that's the right trade-off.

~~~
cellularmitosis
I think the primary drawback of languages with complex syntax is the
additional barrier to entry it creates for developers to write more tools for
the language. Many more people have started a Scheme interpreter than their
own C++ compiler.

If we had better tools, we wouldn’t necessarily be limited to just one user-
facing syntax for a language. The syntax could simply be a preference in the
tools. My point is that as long as we are stuck in text-file-oriented
thinking, no one is even thinking about these possibilities.

~~~
munificent
_> Many more people have started a Scheme interpreter than their own C++
compiler._

And yet many more people are writing C++ code. Toolability definitely matters,
but the empirical observation is that a relatively small number of tools for
complex languages used by a large number of people seems to provide a lot of
benefit.

 _> My point is that as long as we are stuck in text-file-oriented thinking,
no one is even thinking about these possibilities._

Thousands of people are thinking about this. I've been reading blog posts and
manifestos about the death of text-based syntax literally since the 90s.

What I think most of those people overlook is that if you think programmers
struggle to write text, you're _really_ underestimating how much they would
struggle to lay out boxes and lines or do whatever other spatial/visual user
experience you have in mind instead.

------
xitrium
Curious if the author could comment on what Menhir is lacking; my guess is
possibly just the concrete syntax trees, though you could definitely emit an
AST that has as much of that as you need to do the tasks the author mentions.

~~~
WalkerCodeRangr
Author here.

I don't have any direct experience with Menhir because I haven't tried to
write a compiler in OCaml yet. It does seem to be one of the better parser
generators and is frequently recommended. Skimming the docs, it seems to be
lacking in these areas:

  * Concrete syntax trees (as you pointed out)
  * LR(1) will probably still be more restrictive than I want
  * Unicode support (defaults to ocamllex, requires custom code to use ulex)
  * Intransitive Operator Precedence isn't supported (though it does have regular precedence levels and associativity)
  * While it looks to have better error handling than most LR parser generators, I suspect it still isn't where I'd want it to be.

I suspect there are other issues, but I don't want to go any further beyond my
knowledge base than I already have. I've been thinking about writing more on
this topic; if I do, I might address this further.

~~~
xitrium
Thanks! The new .messages error message mechanism is great, though we'll see
if it supports the full brunt of user needs soon enough.

------
scythe
I think the reason this happens is that parser generators are probably never
(ignoring AGI etc) going to produce code as fast and debugger-friendly as a
human being. A really good parser generator might be a good way to "mock up"
the design of a language, but for anything that gets used in production, the
benefits of hand-rolling your parser will probably always outweigh the costs
by a large margin: it's crucial code that affects thousands of projects, so it
really should be perfect.

~~~
cwzwarich
I don't think the performance part of your statement is true. There exist
techniques for generating extremely efficient LR automata (e.g.
[https://link.springer.com/article/10.1007/BF00267043](https://link.springer.com/article/10.1007/BF00267043))
where the equivalent handwritten code would be a goto-ridden mess.

------
jheriko
Yeah, I started working on something to try and address the fact that all of
the tools in this space are steaming hot garbage... after struggling to build
a new language myself with several of them and finding it was possible, but
the resulting toolchain was terrible and bug-prone.

It's a shame, given how fantastic LLVM is for the backend problem, that all of
the lexer/parser tools are terrible.

It's very easy to produce much better interfaces and errors than... well...
everything I have been able to find, including ANTLR, flex, bison, and Haskell
parser combinator libraries... an afternoon of lazy work without especially
strenuous thought is more than enough.

But performance is hard, especially given the appalling quality of the
academic literature and example code for the various algorithms. But for
proving the point it is not especially important...

It's not a great example... it's garbage code, but here it is, able to
generate itself and all the usual hoo-hah... complete with a Scheme shat out
with it in 2 hours instead of the usual mundane 24...

[https://bitbucket.org/craniumsoftware/cp2/src/c6eb20ce4c05fc...](https://bitbucket.org/craniumsoftware/cp2/src/c6eb20ce4c05fce595337a646ff3a0e1a99167e2/?at=semi%2Fwork)

The parser is awful. I wrote my own totally generic algorithm (can't remember
if I solved the left recursion, but it's totally doable) based on conjuring
obvious solutions without thinking very much...

I keep thinking it would be nice to take it further, or even to impart the
wisdom to those continuing to develop the garbage. If only I had the resources
and time...

(Generates shitty VS integrations too.)

...No reason this couldn't be extended to remove the C++ and add functionality
to describe language features and generate LLVM from that.

(Oh, and naturally the lexer and parser come from the same data, because why
make two things where one does a considerably better job and removes a huge
source of bugs?)

------
anth_anm
Just skimmed it, but using a good parser combinator library sounds like a lot
of what the author wants.

