Hacker News new | past | comments | ask | show | jobs | submit login
Dreaming of a Parser Generator for Language Design (adamant-lang.org)
117 points by matt_d 16 days ago | hide | past | web | favorite | 61 comments

Hand-written recursive descent isn't that bad: it isn't that hard to implement, it gives you lots of control over error recovery, you don't have to worry about how ambiguity is handled (well, you deal with it on your own), and you don't have to worry about interfacing with the rest of your compiler/interpreter pipeline. The tradeoff is that you write some more code and parser performance is less optimized. Considering everything else that happens in a modern compiler, parser performance is in the noise anyways.

There are good reasons why most production languages are implemented with hand written parsers. Generators are a bit of a convenience, but come with too many drawbacks.

The tradeoff is that you write some more code and parser performance is less optimized.

Performance improved when GCC switched from a generated to a handwritten parser. Additional techniques like using an operator precedence table/"precedence climbing" as described in the article can reduce the amount of code, make extending it easier, and you still don't have to learn the specific language of a parser generator nor attempt to debug what it's generated.

IMHO parser generators and parsing algorithms in general seem to have gotten a bit to heavy on the academic theory side of things while ignoring practical considerations, which is why recursive descent + precedence climbing became so popular. Big production-quality compilers use it, little toy ones use it, I don't see any drawbacks other than maybe a misguided attempt at academic purity or dogmatism.

Also, a recursive descent parser is created by modifying the language specification and so I think one gets a deeper understanding of what language spec "means" than if you just feed a spec into a parser generator - a grammar's ambiguity can become in the process, for example.

My experience is different. I've converted two or three grammars from recursive descent to LR recently. In each case, the recursive descent authors have made surprising mistakes, resolving ambiguities in ways that violate the intended spec. This isn't surprising: you can't, in general, know when you're resolving ambiguities in a recursive descent parser, so mistakes are almost inevitable.

In absence of heavy compiler optimizations, parsing consumes about 50% of the compilation time for simple C-style languages. Source: my own experiments. Seems to agree with http://doc.cat-v.org/bell_labs/pikestyle (see last section), but note that text was written in 1989.

That's maybe true for simple C-style languages but very likely not near true for modern statically typed languages that freely allow forward declarations and have type inference.

"freely allow forward declarations" - not sure what you mean. In any case forward declarations were only ever a performance hack for single-pass compilers. Nowadays most compilers are multi-pass and have a separate symbol resolution stage that operates on an in-memory representation of the code (e.g. AST). It's not a huge extra cost to search for symbols "in both directions".

Regarding type inference, it depends on the complexity of the inference algorithm. Personally I think the right way for procedural languages is to require explicit function signatures. Only for functional languages where you have many more (smaller) functions, that is a deal breaker.

> Nowadays most compilers are multi-pass and have a separate symbol resolution stage that operates on an in-memory representation of the code (e.g. AST). It's not a huge extra cost to search for symbols "in both directions".

That's true, but treating all top-level declarations as occupying the same unordered namespace means that a change in any module can require a large amount of non-local recompilation, which makes incrementality more difficult.

> Personally I think the right way for procedural languages is to require explicit function signatures.

Even then, you still spend a lot of time doing inference inside the bodies of functions. Once you mix in inferring parameter types for lambdas, inferring generic type arguments, overload resolution, and maybe implicit conversions, there is a lot going on.

Yeah, recursive descent is just the best. I've played with a lot of options, wrote my own parser combinator and packrat libraries and a Pratt parser. By now what I have settled on is recursive descent until I reach an expression, then switch to a Pratt parser. For me, that's the best overall tradeoff that I've found.

In my experience, source problems like parsing and linking are still the major costs in compiler performance. I love this article by Walter Bright on the subject: http://www.drdobbs.com/cpp/c-compilation-speed/228701711

C++ is slow for many reasons, being really hard to parse is just one of them.

Languages with recursive descent parsers are bound to have simpler syntaxes; eg many (though definitely not all) avoid ambiguity like using < and > as both operators and braces.

This is an excellent survey of many of the issues facing modern parsers.

They correctly point out that error handling is by far the biggest one, and usually ends up being the biggest part of a parser for a language, full of crufts and strange mixtures of semantics and syntax. Add to this that you frequently want to try to partially recover -- make a guess as to what was intended and continue parsing in the hopes of finding more actionable errors later in the code.

I do disagree with a lot of the other conclusions, especially operator precedence, which I've come to believe is generally a bad thing -- associativity can be fine (although even there I'm uncertain), but managing operator precedence is almost always better done by using parentheses.

I like scannerless, but the author makes a lot of good points that I might have to rethink. The analogy to how people read makes some sense, but I'm not sure there's as crisp a line between syntax and semantics as that analogy would imply in natural languages.

Around ambiguity and grammar constraints, I think in the end it's something you have to live with; a complex language with ambiguities or left-recursion is hard for both a machine to parse and for a human to parse, and should be avoided. This also relates to error handling, and, to some degree, error recovery, which is a very nice feature in a language grammar that allows you to generate some AST information (for the purposes of syntax highlighting or LSP features) even on incomplete or partially truncated code. Here I feel that there's a lot of room for languages to include assertions that serve "no value", like semicolon line endings -- one frustration often expressed by the anti-semicolon people is "the compiler can tell me when a semicolon is missing, let's just have the compiler put them in". I feel that this is misguided and interferes with the ideas of being able to continue parsing code with errors in it.

> They correctly point out that error handling is by far the biggest one, and usually ends up being the biggest part of a parser for a language, full of crufts and strange mixtures of semantics and syntax. Add to this that you frequently want to try to partially recover -- make a guess as to what was intended and continue parsing in the hopes of finding more actionable errors later in the code.

I agree. Personally, I ignored automated error handling in parsing for years because it seemed to be a lost cause. Eventually, I decided to look in more detail at it, and soon came across a rich vein of previous work that's been largely ignored / forgotten. I suspect that's because the approaches they proposed were too slow to be practical back in the day. After a bit of modernisation, it turns out that these techniques run more than fast enough on modern machines and can even be extended to do a better job than previous approaches attempted (draft paper at https://arxiv.org/abs/1804.07133 ; a more down-to-earth explanation of how to use the accompanying software at https://softdevteam.github.io/grmtools/master/book/errorreco...).

EDIT: fixed URL.

> I do disagree with a lot of the other conclusions, especially operator precedence, which I've come to believe is generally a bad thing -- associativity can be fine (although even there I'm uncertain), but managing operator precedence is almost always better done by using parentheses.

This is a very interesting take. I feel like there are many, many situations where relying on operator precedence is not only natural, but will cause the code to become less readable. Look at any language which uses many operators (my go-to would be Haskell) and consider readability of one over the other.

Also what do you mean with associativity here? Associativity, as I know it, just means that (a+b+c) = (a+b)+c = a+(b+c). There's clearly no downside to that, right?

Yes, you have associativity right. The question is what kind of price you pay. Allowing a user to type ‘a+b+c’ seems reasonable and forcing them to write ‘a+(b+c)’ seems draconian. But what does ‘a/b/c’ mean? Does your grammar have to have different productions for associative and non-associative binary operators? For general operator support do you have to have productions for every operator, or do you add a little bit of state to your parser outside the grammar to handle the lookback to determine if the same operator is being used? These are ugly decisions to make for the smallish benefit of being able to omit a couple of parentheses. And how often does it really occur outside of heavy numeric computations, which are kind of the minority.

Some APLs go overboard here and just say that everything operates right to left so 2*3+4 is 14. This is super awkward but once you get used to it it’s kind of a relief to not have to think about operator precedence. Similarly RPN and Lisp dodge the issue.

In my experience operator precedence causes actual bugs, but associativity rarely does. So despite all this I’m not firmly against associativity, but I’m getting there.

> These are ugly decisions to make for the smallish benefit of being able to omit a couple of parentheses.

Careful there. We're not talking about one compiler writer and one user. There will be potentially many users. That little benefit of using parentheses will be multiplied across users (and across uses).

Paying a large cost upfront to enjoy small but lasting benefits could very well be worth the cost. (Assuming the benefits are real of course. That would be a separate issue.)

Agreed. And the cost is higher than that even; most people who look at code are not the authors of the code, and presenting people with a snippet that looks like `a + (b + c)` seems like it would always be a stumbling block -- "why not `(a + b) + c`? Is there a reason, or did they just type it that way because that's their style?"

Usually operators are defined as being left-associative, right-associative, or non-associative.

So e.g.,

a + b + c is interpreted as (a + b) + c because + is commonly left-associative


a ^ b ^ c is a ^ (b ^ c)

because languages that support ^ for exponentiation treat it as right associative.

For non-associative operators, parentheses are required if you have two operators at the same precedence, so I believe the grandparent is arguing that e.g. a + b - c should require parens to disbiguate in a language where + and - have the same precedence.

I'm not sure if this is what the OP meant but floating point addition is not necessarily associative and this can surprise people when the same code can produce different results where evaluation order is not guaranteed. If you require parentheses to determine evaluation order it might reduce surprises.

I'm the author of the post.

With regard to operator precedence, my position may be closer to yours than you think. In the post I linked to about intransitive operator precedence (https://blog.adamant-lang.org/2019/operator-precedence/) I describe what I want. Basically, I think basic order of operations like `*` before `+` and associativity are important. But in more complicated cases you should be forced to put in parentheses. Exactly where that line is drawn can be debated. What I want is a way of handling operator precedence that is more complicated under the hood, but should be more natural for the programmer as it will respect basic precedence, but not leave you guessing about confusing expressions. i.e. `a/b/c` could be illegal and require parentheses even though `a+b/c` doesn't.

I think our positions are pretty close, I'm just coming around to the idea that any operator precedence at all is of dubious value.

It's nice to write `a + b * c` and have it parsed the "right" way, but I've become fairly comfortable with the idea of getting rid of that. To always require parentheses. I don't think writing `a + (b * c)` is that terrible. `a + b == c` is a little more annoying as `(a + b) == c`.

Where I think the worst cases are is addition/subtraction, since those are so primitive -- subtraction can be thought of as addition of a negation, and in that form it is directly associative with addition. Indexing into arrays with constructs like `x[a + b - 1]` is incredibly common, and here almost any parenthesization makes it look like it has meaning -- if you see code that says `x[(a + b) - 1]`, is that a different intent than `x[a + (b - 1)]`? It feels different to me; `x[(start + offset) - 1]` is an adjustment, but `x[start + (first_comma - 1)]` is finding the location before an offset.

On associativity I think this gets even worse; if you have to parenthesize `a + b + c` then it feels like the language is getting in your way, just by force of habit. Nobody blinks at the idea of deciding between `a b + c +` vs. `a b c + +` in a stack language, so maybe people would just get used to typing `a + (b + c)` and it would just fade into the background.

This is all kind of rambly; my general feeling about computer languages is what I said above -- the harder it is for a machine to parse, the harder it is for a person to read; but some idioms are just so ingrained that it's hard to even recognize that you're making assumptions about associativity; and I venture there's a significant class of programmers who don't know what associativity is in any formal sense but routinely use the property because it's so ingrained in how we are taught to calculate.

I've created a Parsing Toolkit (for JavaScript) which I believe fulfills 7/10 of the requirements described.

Chevrotain - https://github.com/SAP/chevrotain

The list of features can be found here: - https://sap.github.io/chevrotain/docs/features/blazing_fast....

What is interesting imho is that Chevrotain is not a Parser Generator nor a Parser Combinator. It is a tool that makes it easier to craft hand built recursive decent parsers rather than a whole new level of abstraction.

The paper author understates the importance of composability.

> While having a tool that supports combining grammars would be handy, I don’t see it as a must-have. In practice, languages are not combined very often.

That was once true, but not any more. Intermixing of grammars within a single code instance is becoming more normal, and in many cases it is now the norm. The inability to accept this new norm leaves you holding a bag of multiple parsers and a large amount of processing overhead with a loss of flexibility on how to manage the entirety of the code. It is a horrid mess to deal with a collection of unrelated and nested ASTs to represent a single code instance.

> If the languages being combined are radically different, then for the sake of the programmer, there will probably need to be unambiguous delimiters at the transitions between the languages.

A faulty assumption that often does not reflect reality.

By focusing on composability first you remove many design barriers that limit or prevent design decisions down the line.


The fault of many parsers, from a design perspective, are a heavy concern for goals outside of parsing, frequently either a compile step or the shape of the AST. A compile operation is completely independent of a parse operation. Once an AST is created a consuming compiler will execute on it. Keep these concerns separate. Don't force your parser to act as a proxy for the limitations of your compiler.

A flexible and well reasoned parser can return an AST is a variety of shapes/formats. It doesn't really matter so long as the data structure is well reasoned and in a way the consuming application understands.

Also performance is not well understood in many cases of parsing, which is probably why parsers are often intermingled with the concerns of compilers. A parse operation takes time. It also takes different unrelated time for a consuming application to consume the AST. The speed with which a parse tree is generated is unrelated to the speed with which it is read. This is why composability is more important the paper gives credit for.

If anybody is interested I am working on addressing some of these concerns at https://sparser.io/ which is a personal pet project. It doesn't cure cancer or solve world hunger yet (I am just one guy with a job), but the concept does solve some design problems and composability issues associated with parsing.

I'm working on a language that embeds Javascript. Javascript syntax makes it quite easy to find the end of a complete block by considering just parens, brackets, braces, strings, and comments. Then I can just pass the string to a JS interpreter. So it's easy to allow

  javascript {
    ...any valid javascript code...
C++ syntax, on the other hand, makes it very hard to delimit a block. For instance, it's hard to guess whether a > is an operator or the end of a template parameter list. And preprocessor macros mean that the entire context of all the includes files before the current point can change what any token means.

I think a good goal for a syntax is to make it easy to delimit blocks without a complete parser. It also makes syntax highlighting practical.

JavaScript related languages introduce the same problem. Is the `<` a comparison operator, the start of a type generics segment from something like TypeScript, a markup tag from something like React, or something else? JavaScript alone does not provide the context for that clarity. Boost templates in C++ are similar sort of problem, but there is less variability around boost templates, so they would be easier to discern from the surrounding syntax than the several JavaScript variances.

> I think a good goal for a syntax is to...

I manage a parser supporting many languages, not a language. I don't have the freedom to dictate what a language should look like. Instead my application must have the flexibility to tolerate other peoples' language design decisions, what ever they may be.

I've discovered that python-style block syntax makes it trivial to embed arbitrary other languages, eg:

  while lang == nothing:
    lang = something()
    ... whatever )) }] --] </script>
    {{({[(( /* ...
  ) # works fine

There's an easy way to handle composability: pick a quoting syntax that doesn't appear in the target language (or is unlikely to appear in it) and embed the inner language in that. Stuff like {{/}} (common in templating frameworks) or LLVM TableGen's [{/}] are decent examples.

I maintain a parser. I am not designing a language.

I developed a small interpretting parser, which you can give a grammar on the input and a file to be parsed. It parsers the grammar (with itself) and parsers the input file according to the grammar specification. No need to run a code generator and a compiler before testing a change in the grammar. It uses a recursive decent back-tracking parser which because it uses memorization, works quite fast.

See http://github.com/FransFaase/IParse

Aren't s-expressions all one really needs?

After realizing how easy it is to parse sexps, and how complex some language syntax rules have become [1], i’ve wondered why we tolerate such complexity.

I’ve realized it’s because we are still stuck in a text-file-oriented way of writing programs. We are hand-editing the serialization format of our programs. All of these languages with complex syntax is an attempt to optimize around that use-case.

If we had better tools which allowed us to leave behind text-file-oriented programming, there would be no need for all of this complex syntax.

There would be a few other benefits as well. No one cares whether photoshop .psd files use tabs or spaces, or whether the curly brace should go at end-of-line or next-line, etc.

[1] https://github.com/ruby/ruby/blob/trunk/parse.y

> After realizing how easy it is to parse sexps

Why should "easy for the machine" ever be a consideration for software used by humans? Surely "easy for the human" is the metric that matters. The entire point of having computers is so that can do hard things to make our lives easier.

If rich syntax is easier for us to read and feasible for computers to parse, that's the right trade-off.

I think the primary drawback of languages with complex syntax is the additional barrier to entry it creates for developers to write more tools for the language. Many more people have started a Scheme interpreter than their own C++ compiler.

If we had better tools, we wouldn’t necessarily be limited to just one user-facing syntax for a language. The syntax could simply be a preference in the tools. My point is that as long as we are stuck in text-file-oriented thinking, no one is even thinking about these possibilities.

> Many more people have started a Scheme interpreter than their own C++ compiler.

And yet many more people are writing C++ code. Toolability definitely matters, but the empirical observation is that a relatively small number of tools for complex languages used by a large number of people seems to provide a lot of benefit.

> My point is that as long as we are stuck in text-file-oriented thinking, no one is even thinking about these possibilities.

Thousands of people are thinking about this. I've been reading blog posts and manifestos about the death of text-based syntax literally since the 90s.

What I think most of those people overlook is that if you think programmers struggle to write text, you're really underestimating how much they would struggle to layout boxes and lines or do whatever other spatial/visual user experience you have in mind instead.

Natural language is a serialization format, written and spoken. Algebraic notation is also serial. We like it, are good at it; get used to it.

Assuming natural language is the serialization format of thought (quesionable), we could have non-serial program representation when we can interface directly with thought.

\aside Chomsky reckons natural language is not primarily used for communication, but as an internal linkage between concepts and perception/action. If so, why is it serial? Is it serial? Is a grammar needed if it is not serial? Perhaps structure is all that is needed, and that does not require grammar (though fits with it, via AST).

This is not a new insight and there have been a number of attempts to create programming systems where the programmer does not primarily interact with plain text but they have been largely unsuccessful. It would probably be worth exploring why they haven't taken off if one were trying to design another such system.

That's a great suggestion. Which attempts are you familiar with? The ones I'm familiar with are summarized in this talk: https://www.youtube.com/watch?v=K0Tsa3smr1w

Some explorations of similar ideas that I'm aware of:

- Intentional Programming: https://en.m.wikipedia.org/wiki/Intentional_programming

- Light Table: http://lighttable.com/ (inspired by some ideas from Bret Victor: http://worrydream.com/)

- Visual Programming (lots of examples)

- Literate Programming

You can view sophisticated IDEs and refactoring tools as an attempt to improve the interface to plain text representations of code. These have actually had quite a bit of success.

Some approaches that move away from editing code as plain text have had success in particular domains. Visual Programming has shader graphs, Houdini, Unreal Blueprints, Substance Designer and other niches where it is relatively successful. Literate Programming has Jupiter notebooks and Mathematica that are popular in certain domains. Most of the success seems to come in particular niches, it's hard to think of widely adopted examples for "general" programming.

And Excel of course is a very successful non text based programming model within many domains.

No. Any Lisp you can name has more to it than just s-expressions.

His point still stands. The ruby parser has 10x the number of lines of code as the clojure parser.



Because one parses a data format and the other one an actual programming language...

Sorry, I linked to the wrong edn reader https://github.com/clojure/clojure/blob/master/src/jvm/cloju...

That's still only an s-expression reader, not a programming language parser.

Can you help me understand the distinction you are making? My understanding is that this is the reader used by the repl, and at least for clojure, an edn parser is the same thing as a programming language parser, because clojure programs are written in edn.

In most programming languages, a fact like "Foreach loops have a variable binding, a collection to iterate through, and a body" is guaranteed by the parser, and the rest of the language implementation can assume that the foreach loops are well-formed and concentrate on lexical and type checks.

An S-Expression parser will accept (loop for in do), and the rest of the language implementation has to deal with detecting that there is no variable, no collection and no action.

They are written on top, but not in. It's like claiming Ruby programs are written in characters.

S-expressions know nothing and represent nothing about concepts like functions, variables, argument lists, operators, bindings, type declarations, function defintions, variable defintions, class definitions, ...

Even syntactic (3 4 +) is READ successfully, but it is not a valid Clojure form.

Basically a READER is on the level of a TOKENIZER, but not a parser.

can you elaborate? i'm not a lisper, so I don't understand what else would be present. I am guessing you mean things like switch statements?

I thought maybe he meant Lisps whose compiler isn't another Lisp, or maybe macros at a stretch ..

Virtually every lisp uses some non s-expression syntax which must be dealt with at the reader level. For example, using ' for quotation.

'(a b c) is considered an S-expression in Lisp dialects which support the 'X <-> (quote X) shorthand. It's just a shorthand, and it is read-print consistent, like other such shorthands. If '(a b c) isn't a S-exp, then we have backed into uncomfortable corners in which #(a b c) isn't an S-exp.

Every character in a S-expression must be "dealt with at the reader level" (and produced at the printer level).

The term "S-expression" was originally in contrast with "M-expression", but that's a historic usage.

(3 4 +)

(2 + 3)

(- 3 4)

All are valid s-expressions. Only one is a valid Lisp expression.

Or, in ANSI CL terms, only one is a valid form (S-expression suitable for or subject to evaluation).

A form is already 'read' - it's an object.

When we parse an expression, we usually parse a textual representation into an object. If the object is meant / subject to evaluation -> then we call this a form.

s-expressions only give you a data format - just like XMl, JSON, ...

One then maps a programming language to s-expressions.

You could easily map a complex infix/mixfix language to s-expressions.

This is a valid s-expression:

(foo = x ^ 3 * 4 / sin x)

You can imagine variants with more parenthesis, too.

Curious if the author could comment on what Menhir is lacking, my estimation is possibly just the concrete syntax trees, though you could definitely emit an AST that has as much of that as you need to do the tasks the author mentions.

Author here.

I don't have any direct experience with Menhir because I haven't tried to write a compiler in OCaml yet. It does seem to be one of the better parser generators and is frequently recommended. Skimming the docs, its seems to be lacking in these areas:

  * Concrete syntax trees (as you pointed out)
  * LR(1) will probably still be more restrictive than I want
  * Unicode support (defaults to ocamllex, requires custom code to use ulex)
  * Intransitive Operator Precedence isn't supported (though it does have regular precedence levels and associativity)
  * While it looks to have better error handling than most LR parser generators, I suspect it still isn't where I'd want it to be.
I suspect there are other issues, but I don't want to go any further beyond my knowledge base than I already have. I've been thinking about writing on this topic more, if I do, then I might address this more.

Thanks! The new .messages error message mechanism is great, though we'll see if it supports the full brunt of user needs soon enough.

I think the reason this happens is that parser generators are probably never (ignoring AGI etc) going to produce code as fast and debugger-friendly as a human being. A really good parser generator might be a good way to "mock up" the design of a language, but for anything that gets used in production, the benefits of hand-rolling your parser will probably always outweigh the costs by a large margin: it's crucial code that affects thousands of projects, so it really should be perfect.

I don't think the performance part of your statement is true. There exist techniques for generating extremely efficient LR automata (e.g. https://link.springer.com/article/10.1007/BF00267043) where the equivalent handwritten code would be a goto-ridden mess.

yeah, i started working on something to try and address the fact that all of the tools in this space are steaming hot garbage... after struggling to build a new language myself with various and finding it was possible, but the resulting toolchain was terrible and bug-prone.

its a shame given how fantastic llvm is for the backend problem that all of the lexer/parser tools are terrible.

its very easy to produce much better interfaces and errors than ... well... everything i have been able to find including antlr, flex, bison, haskell parser combinator libraries... an afternoon of lazy work without especially strenuous thought is more than enough.

but performance is hard, especially given the appalling quality of the academic literature and example code for the various algorithms. but for proving the point it is not especially important...

its not a great example... its garbage code, but here it is, able to generate itself and all the usual hoo hah... complete with a scheme shat out with it in 2 hours instead of the usual mundane 24...


the parser is awful. i wrote my own totally generic algorithm (can't remember if i solved the left recursion, but its totally do-able) based on conjuring obvious solutions without thinking very much...

i keep thinking it would be nice to take it further, or even to impart the wisdom to those continuing to develop the garbage. if only i had the resources and time...

(generates shitty vs integrations too)

.. no reason this couldn't be extended to remove the c++ and add functionality to describe language features and generate llvm from that.

(oh and naturally the lexer and parser come from the same data, because why make two things where one does a considerably better job and removes a huge source of bugs)

Just skimmed it, but using a good parser combinator library sounds like a lot of what the author wants.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact