Hacker News new | past | comments | ask | show | jobs | submit login
A meta approach to implementing programming languages (2018) (rickardlindberg.me)
134 points by rickardlindberg 27 days ago | hide | past | web | favorite | 40 comments

At this point, I think that it’s a bad idea to use things like yacc or lex for parsing and tokenization. Parser combinators are available in many languages and are much easier to use and much more expressive. Of course if you need insane performance, a hand written recursive descent parser is probably the best option.

The only reason people used parser generators in the past is because the languages they were using (mostly C I suppose) weren’t expressive enough to support higher order functions and higher kinded types.

I'm a Ph. D. student in programming languages. I've written at least 7 language frontends over the years, read many more, and have given a talk on parsing algorithms. My primary programming language is Haskell.

I really don't like parser combinators.

I've found that, with parser generators, bugs in the parser are very rare. When they exist, you can find them just by carefully inspecting your grammar.

With parser combinators, a misplaced "try" can give you a parser that looks correct and usually works, but some odd placement of tokens makes it fail to parse something it should.

I agree that yacc is not very good, but it's hardly state of the art. It's a LALR generator, and hence very restricted.

Instead, you can use:

(1) ANTLR, which has a huge ecosystem, produces very fast generators, and can take in an almost-arbitrary context-free grammar. The ALL(*) algorithm makes it far superior to Yacc.

(2) A GLR parser-generator, which can take in an actually arbitrary grammar, and is usually very fast. The one I have experience with is Semantic Designs DMS. Michael Mehlich of Semantic Designs has produced a C++ parser that can accurately parse all the major C++ dialects (MS, GCC, Borland, etc), all using general machinery. The only other group that's come close is the Edison Group, which has a huge hand-written C++ parser, and has 5 people full-time on it.

What about PEGs? You can get the benefits of writing down a grammar that you can analyze yourself, and you can call methods on the production results, etc.

PEGs have the same downsides the parser combinators have. If you put things in the wrong order you can have programs that work most of the time but do the wrong thing on a corner case or in a parsing ambiguity.

By default I’m skeptical of hand-written parsers, especially if they’re implemented in a language lacking memory safety features. I have seen too many parsers fall victim to (domain specific) fuzzing, and the consequences are serious security vulnerabilities.

Parser combinators are great, but they’re still hand-written in excessively powerful languages. That’s error-prone if the goal is to end up with a trustworthy, portable parser. The most frequently used parser combinator tools I know of are embedded DSLs, which means porting such a parser to another language requires manually converting to another parser combinator’s syntax and semantics and potentially implementing a new DSL which hasn’t yet been battle tested.

Parser generators which are in a position to get battle tested by supporting many parser implementations and integrations seem like the safest and most scalable way to go. I’m not familiar with Semantic Designs’ GLR implementation; thanks for mentioning it. Another worth knowing is tree-sitter (https://tree-sitter.github.io/tree-sitter/). As a GLR-based parser generator it supports ambiguous context-free grammars. At first ambiguity may sound like a downside, but without a computer-verified proof that your grammar doesn’t admit multiple valid parses for any inputs it’s safest to assume ambiguity and proceed accordingly. For example, if at parse-time multiple parses are found you have the option to warn the user or even reject their input with a diff of the ASTs so they can try to patch the input so it’s no longer ambiguous.

Tree-sitter is young, but it has reach. Atom editor packages for languages are encouraged to implement tree-sitter parsers. It’s particularly well-suited for editors because it supports incremental parsing. Parsing the same file from scratch after every edit isn’t necessary.

Recently I implemented a tree-sitter grammar for ABNF (https://github.com/jmitchell/tree-sitter-abnf) and an Atom package (https://atom.io/packages/language-abnf) for it. Now I’m exploring how to generate tree-sitter parsers for grammars specified using ABNF, such as Dhall’s (https://github.com/dhall-lang/dhall-lang/blob/master/standar...). My plan is to automatically generate and fuzz C-friendly parser libraries for many languages to make them more accessible. I also want to help address any tree-sitter bugs discovered along the way that could translate into security vulnerabilities in generated parsers. For now this is a labor of love, and I am not affiliated with the tree-sitter project.

Lexing and parsing are the easiest (by far) parts of writing a compiler. Writing them by hand is very easy, and the code often looks like the grammar. They also tend to be the least buggy parts, I would handwave that maybe 1 in a thousand bugs were in the lexer/parser.

Lexer/parser generators are simply not needed and not particularly helpful. Two areas where they are particularly inferior:

1. Error messages. Since the generator doesn't know what you're doing, error messages tend to be generic and very unhelpful to the user. With a hand-built one, context can be supplied to the user in the language of the language(!). Suggestions for corrections for the most common mistakes can be given.

2. Error recovery. Doing this well relies on guessing at what the most likely AST was intended by the user. But a generator can't know that.

By the time you've expended effort trying to overcome (1) and (2), you'd have been better off just writing a custom one.

Here are D's lexer and parser:



For a complex language like D, they are pretty simple. (The lexer even comes with some builtin unittests!)

Probably the most complex part is the arbitrary lookahead scheme. Of course, this wouldn't be necessary if the grammar didn't need lookahead.

P.S. The initial versions of these were written 20 years ago in C, so the coding style is a bit archaic and could use improvement.

For comparison, here's the C++ lexer for Digital Mars C++:


The parser isn't in its own file, it's intermingled with the semantic processing, a mistake I didn't want to repeat with D :-)

> They also tend to be the least buggy parts, I would handwave that maybe 1 in a thousand bugs were in the lexer/parser.

Ehh, I don't know... https://johanengelen.github.io/ldc/2018/01/14/Fuzzing-with-L...

From the article:

> My expectation was that there would be a bug deep down with some rare cornercase of a cornercase, and that the fuzzer would have to run for hours and hours covering more and more of the lexer to, perhaps, finally find a bug. So I was preparing for a long fuzzing session, wanted to go to bed and kick this one off but… The fuzzer found a first failure within a second.

> Running the fuzzer several times, sometimes it still hadn’t found a bug after 15 seconds, but usually it found the bug within a second; the process is (semi-)random so that’s expected.

It found a bug in the lexer, which is good, because then it was fixed. I never said that handbuilding a lexer would mean it doesn't have any bugs.

The advantage of parser combinators is that you can use them with a lot of different things. Parsing some UDP protocol, no problem, 20 lines of codes. Parsing some bespoke binary format, no problem. I agree that maybe if your problem space is just compilers that maybe you should consider something else, but most parsing is not about code and more about data. In these cases, parser combinators are a no-brainer.

My rule of thumb:

- If you don't need reasonable performance and reasonable error messages, (and your language is expressive enough) use parser combinators.

- If you need at least one of reasonable performance and reasonable error messages, (or your language isn't expressive enough for parser combinators) use a parser generator.

- If the code at hand is the interpreter / compiler for a widely used programming language (like Java or C++), get an expert to produce a hand-written parser.

Last I checked, Elm uses parser combinators and is hailed as having among the best error messages...

How does Elm do it?

Because Elms error messages were actually mind blowing. They fundamentally changed my view of what a language and compiler is capable of. It honestly makes every other language laughable.

I seriously want anyone who ever designs a language to spend even one weekend playing with Elm just to see what's possible, because prior to that I would have never even considered a compiler could be so helpful. At times it was so clear and accurate it seems like it could've just fixed my code for me.

Hand written parsers are actually pretty slow. And higher order parsers are still hand written parsers and usually perform almost exactly the same (except for languages where higher order programming is slow on itself).

My personal experience is very contradictionary to that statement. Handoptimizing a slow PEG parser, but still using PEG gave me 10% performance increase. Handwriting it on low level libraries, but in the same high level language, gave me a 50,000% performance increase on the average file it had to parse.

Task at hand wasn't a programming language though, but more like YAML with only one level of nesting and a lot of quoted strings (which were the bottleneck for PEG in my case).

Parser combinators are fine for languages not designed to be written by people, but the error messages tend to be awful.

What's increasingly clear to me, having designed a few languages, is that a well designed languages is easy to parse. I work on a couple of languages which require one token lookahead. This means when something goes wrong, we know exactly where the problem is, and what token we didn't expect.

Error messages are a much more important reason to use a hand written parser/lexer: Unless you basically write your own parser generator, it'll never be as direct or fluent as a hand written parser

I'm not sure what you are confusing higher order parsers with, but there is absolute no problem with providing a couple of higher order abstractions for error messages. In fact, error messages are much easier to produce that way, than manually.

Are they? I'll be upfront in saying I've never done any actual work with parser combinators (I played with Parsec for about 15 minutes on the train), but I don't see how having direct control is worse. Easier, it may be, but I can only imagine real world situations harder when you get to the nasty stuff in your implementation (e.g. C++ templates basically need to be "run" while being parsed, or lexed even)

I like parser combinators, but I think they obscure the formal definition of the language being parsed. There's something to be said for having a declarative specification of what you're parsing.

This is less important if you're just implementing a parser for an existing language or standard, but if your implementation is going to be the reference implementation, then I think it's important to generate your parser from the formal language you're working with, to give other implementers confidence that your implementation is correct.

I like parser combinators, but I think they obscure the formal definition of the language being parsed. There's something to be said for having a declarative specification of what you're parsing.

I used to feel similarly, but that was because I'd been introduced to parser combinators in languages that lacked the features to replicate what made them great in Haskell. Properly implemented, they can take on a very readable, declarative-like style.

Over the weekend for instance, I threw together a JSON parser in Haskell using Graham Hutton's monadic parser library, and I found the result refreshingly clear and readable: https://gist.github.com/jarcane/025f5475418858bba8276f1dfcf2...

With do notation and ADTs, the code comes out looking more or less pretty close to what it describes, and that reflectivity applies both to the reading and the coding. I simply describe the patterns in looking to match and which parts to pass on, and voila: a parser.

To clarify: Combinators are declarative specifications, and for parsing, they are specifications of the grammar.

I never liked them, even in the late 90's, in spite of having written a tutorial how to use bison with C++, back when the integration wasn't the best.

Category grammars, PCCTS and ANTRL always seemed better approaches to prototype languages.

I was quite happy that our compiler design professor decided to switch to the barely introduced Java, adopting JavaCC and JJTree in the process.

The demonstration of this technique by the VPRI team was very impressive. http://www.vpri.org/pdf/tr2012001_steps.pdf Shame that team seems to have gone quiet (not dead) since then.

Here’s a video that covers the concepts in a more entertaining format. https://youtube.com/watch?v=ubaX1Smg6pY

They turned into Dynamicland


If you are interested by uncommon approach to implementing programming language, toy should have a look at the K framework (http://www.kframework.org/index.php/Main_Page)

The idea is to write the spec (is grammar and semantics) only once and have for free a compiler as well as some formal tools (model checker,...)

C had been formalized with it for instance.

K is really cool, but does really give you a compiler? That's the first I heard of that. Do you have (a link to) more information about how it does compilation?

No sorry, I thought it was a compiler but it's "only" an interpreter.

RLMeta/Meta II/OMeta are a fascinating branch of programming languages. And a direction I could see us moving towards.

Concise methods of abstracting a problem domain.

Nice to see the author's implementation.

The grammar definitions in the article looked very similar to PEG at the first glance. And the article seems to confirm, “The parsing algorithm that RLMeta implements is based on parsing expression grammars (PEG), but is extended to match arbitrary objects, not just characters.”

I didn’t quite get the “match arbitrary objects” bit... Is this basically the same technique as PEG or what am I missing?

As far as I know, PEGs work only on streams of characters. RLMeta works on streams of arbitrary Python objects. The syntax currently has support for matching characters, strings, and lists. Does that clarify it?

Thanks, but it still seems to me that the first (“parser”) part is very much like a PEG.

It’s the “generator” part that matches on objects (AST), generated by the parser, but I’m not sure why that’s useful: - in the toy example the evaluation could happen right in the parser, - and if you decide to have an AST, why use a DSL that looks similar to the grammar definition to walk it?

Yes, you are right that the parser grammars are much like PEGs and it's the code generator grammars that match other objects than characters.

The reason I chose to implement it that way in the article is that I modeled RLMeta after OMeta. If I remember correctly, their reason for doing so was that they thought compilers would be easier to write if all passes (parsing, code generation, ...) could be written using the same language.

I think using the same DSL for code generators works quite well and they become quite readable. Perhaps there is a better DSL suited for code generation, and it would be interesting to experiment with that as well.

You are right that the toy examples could generate code directly in the parser. There is no need for a separate code generator step. But the RLMeta compiler would be much less readable without a separate code generator step I think. The reason I introduced an AST in the toy example was to introduce the concept before I showed it in the RLMeta implementation.

OK, thanks for the clarification!

I think you should definitely take a look to Meta Programming System (MPS) from JetBrains: https://www.jetbrains.com/mps/

could I get something like this with Racket[1]? why would I chose this over Racket?

[1] https://news.ycombinator.com/item?id=19232068

From my understanding this library takes a grammar and generates a run time interpreter for that grammar. Antlr is much more similar as it takes in a grammar, and generates code you can hook into at any part of the grammar parsing.

Racket on the other hand gives you the ability to hook into the new language, convert it to racket syntax and then compile it. As far as I know racket doesn't give you a simple parsing grammar, and you're targeting outputting racket code if you make a language. Though you could always take racket as your AST and target something like C or LLVM if you want it to live outside of racket.

Racket is not necessarily unique in that aspect, it just focuses on making it easier with tooling. Scala, Kotlin, Elixir are examples of some of the most successful languages https://www.youtube.com/watch?v=du6qWa8lWZA

Racket had parser generators available. See for instance ragg, or is fork, the brag parser used in the Beautiful Racket book: https://docs.racket-lang.org/brag/index.html

The technique used in the book is to use brag to describe the grammar and generate a parser/tokenizer to s-expressions, and then you can either interpret it compile them directly, or use the Racket macro system itself to expand them into Racket.

And of course, Racket's macro system itself can do quite much without even a parser generator. The standard library even includes an example ALGOL 60 implementation done as Racket macros, and examples of C and Java can be found on the package server.

Indeed. I don't want to sound flippant as obviously a lot of hard work has gone into this but I would have thought that a lisp and a parser should be able to achieve the same no?

Some even come with Turing Complete type systems and prolog subsystems. http://www.shenlanguage.org/

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact