The only reason people used parser generators in the past is that the languages they were using (mostly C, I suppose) weren’t expressive enough to support higher-order functions and higher-kinded types.
I really don't like parser combinators.
I've found that, with parser generators, bugs in the parser are very rare. When they exist, you can find them just by carefully inspecting your grammar.
With parser combinators, a misplaced "try" can give you a parser that looks correct and usually works, but where some odd arrangement of tokens makes it fail to parse something it should.
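Here's a minimal sketch of that failure mode (my own toy example, using Parsec): both parsers typecheck and look plausible, but the first fails on any input that shares a prefix with an earlier alternative.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Fails on "fun": `string "for"` consumes the 'f' before failing, and
-- Parsec's (<|>) does not backtrack once input has been consumed.
keyword :: Parser String
keyword = string "for" <|> string "fun"

-- Works: `try` rewinds the input when the first branch fails partway through.
keyword' :: Parser String
keyword' = try (string "for") <|> string "fun"

main :: IO ()
main = do
  print (parse keyword  "" "fun")  -- Left: unexpected "u", expecting "for"
  print (parse keyword' "" "fun")  -- Right "fun"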
I agree that yacc is not very good, but it's hardly state of the art. It's a LALR generator, and hence very restricted.
Instead, you can use:
(1) ANTLR, which has a huge ecosystem, produces very fast parsers, and can take in an almost-arbitrary context-free grammar. The ALL(*) algorithm makes it far superior to Yacc.
(2) A GLR parser generator, which can take in a truly arbitrary context-free grammar, and is usually very fast. The one I have experience with is Semantic Designs' DMS. Michael Mehlich of Semantic Designs has produced a C++ parser that can accurately parse all the major C++ dialects (MS, GCC, Borland, etc.), all using general machinery. The only other group that's come close is the Edison Design Group, which has a huge hand-written C++ parser and keeps five people on it full-time.
Parser combinators are great, but they’re still hand-written in excessively powerful languages, which is error-prone if the goal is to end up with a trustworthy, portable parser. The most frequently used parser combinator tools I know of are embedded DSLs, so porting such a parser to another language means manually translating it to another combinator library’s syntax and semantics, and potentially implementing a new DSL that hasn’t yet been battle-tested.
Parser generators, which are in a position to get battle-tested by supporting many parser implementations and integrations, seem like the safest and most scalable way to go. I’m not familiar with Semantic Designs’ GLR implementation; thanks for mentioning it. Another worth knowing is tree-sitter (https://tree-sitter.github.io/tree-sitter/). As a GLR-based parser generator it supports ambiguous context-free grammars. At first ambiguity may sound like a downside, but without a computer-verified proof that your grammar doesn’t admit multiple valid parses for any input, it’s safest to assume ambiguity and proceed accordingly. For example, if multiple parses are found at parse time, you have the option to warn the user, or even to reject their input with a diff of the ASTs so they can patch the input until it’s no longer ambiguous.
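As a toy illustration of that idea (my own sketch, using Wadler-style list-of-successes parsing rather than tree-sitter's actual GLR machinery): have the parser return every complete parse, and treat more than one result as an ambiguity to report or reject.

```haskell
import Control.Applicative (Alternative (..))

newtype P a = P { runP :: String -> [(a, String)] }

instance Functor P where
  fmap f (P p) = P $ \s -> [(f a, r) | (a, r) <- p s]

instance Applicative P where
  pure a = P $ \s -> [(a, s)]
  P pf <*> P pa = P $ \s -> [(f a, r') | (f, r) <- pf s, (a, r') <- pa r]

instance Alternative P where
  empty = P $ const []
  P p <|> P q = P $ \s -> p s ++ q s   -- keep ALL alternatives alive

char :: Char -> P Char
char c = P $ \s -> case s of (x:xs) | x == c -> [(c, xs)]; _ -> []

-- The classic dangling-else ambiguity: stmt -> 'i' stmt | 'i' stmt 'e' stmt | 'x'
data Stmt = If Stmt | IfElse Stmt Stmt | X deriving Show

stmt :: P Stmt
stmt = If <$> (char 'i' *> stmt)
   <|> IfElse <$> (char 'i' *> stmt) <*> (char 'e' *> stmt)
   <|> X <$ char 'x'

-- All complete parses; more than one means this input is ambiguous.
parses :: String -> [Stmt]
parses s = [a | (a, "") <- runP stmt s]

main :: IO ()
main = mapM_ print (parses "iixex")
-- If (IfElse X X)
-- IfElse (If X) X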
Tree-sitter is young, but it has reach. Atom editor packages for languages are encouraged to implement tree-sitter parsers. It’s particularly well-suited for editors because it supports incremental parsing. Parsing the same file from scratch after every edit isn’t necessary.
Recently I implemented a tree-sitter grammar for ABNF (https://github.com/jmitchell/tree-sitter-abnf) and an Atom package (https://atom.io/packages/language-abnf) for it. Now I’m exploring how to generate tree-sitter parsers for grammars specified using ABNF, such as Dhall’s (https://github.com/dhall-lang/dhall-lang/blob/master/standar...). My plan is to automatically generate and fuzz C-friendly parser libraries for many languages to make them more accessible. I also want to help address any tree-sitter bugs discovered along the way that could translate into security vulnerabilities in generated parsers. For now this is a labor of love, and I am not affiliated with the tree-sitter project.
Lexer/parser generators are simply not needed and not particularly helpful. Two areas where they are notably inferior:
1. Error messages. Since the generator doesn't know what you're doing, error messages tend to be generic and very unhelpful to the user. With a hand-built parser, context can be supplied to the user in the language of the language(!), and suggestions for correcting the most common mistakes can be given (see the sketch after this list).
2. Error recovery. Doing this well relies on guessing what AST the user most likely intended, but a generator can't know that.
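To illustrate (1), here's a toy hand-rolled sketch (in Haskell, and nothing like D's actual parser) of the kind of domain-aware diagnostics a generator can't easily produce, because the hand-written parser knows exactly which construct it was in the middle of:

```haskell
-- A toy hand-written parser for "<number> + <number> ;" that reports errors
-- in the language of the language, including a suggested fix.
data Token = TNum Int | TPlus | TSemi | TIdent String deriving (Show, Eq)

parseStmt :: [Token] -> Either String Int
parseStmt (TNum a : TPlus : TNum b : TSemi : _) = Right (a + b)
parseStmt (TNum _ : TPlus : TNum _ : _) =
  Left "missing ';' at the end of the statement; did you forget a semicolon?"
parseStmt (TNum _ : TPlus : t : _) =
  Left ("expected a number after '+', but found " ++ show t)
parseStmt (t : _) =
  Left ("a statement must start with a number, not " ++ show t)
parseStmt [] = Left "empty input; expected a statement"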
By the time you've expended effort trying to overcome (1) and (2), you'd have been better off just writing a custom one.
Here are D's lexer and parser:
For a complex language like D, they are pretty simple. (The lexer even comes with some built-in unittests!)
Probably the most complex part is the arbitrary lookahead scheme. Of course, this wouldn't be necessary if the grammar didn't need lookahead.
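For the curious, here's a rough sketch of one way to get arbitrary lookahead (in Haskell, and not how D actually does it): lex lazily into a token list and peek ahead without consuming anything.

```haskell
-- Lazy lexing means peeking k tokens ahead costs nothing until it's needed,
-- and the parse position only advances when tokens are actually consumed.
lexAll :: String -> [String]
lexAll = words   -- stand-in for a real lexer

peek :: Int -> [tok] -> Maybe tok
peek k ts = case drop k ts of
  (t : _) -> Just t
  []      -> Nothing

-- Hypothetical use: disambiguate two productions by the token 2 ahead.
-- (parseCall/parseVar are illustrative names, not real functions.)
-- case peek 2 toks of
--   Just "(" -> parseCall toks
--   _        -> parseVar  toks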
P.S. The initial versions of these were written 20 years ago in C, so the coding style is a bit archaic and could use improvement.
The parser isn't in its own file; it's intermingled with the semantic processing, a mistake I didn't want to repeat with D :-)
Ehh, I don't know... https://johanengelen.github.io/ldc/2018/01/14/Fuzzing-with-L...
From the article:
> My expectation was that there would be a bug deep down with some rare cornercase of a cornercase, and that the fuzzer would have to run for hours and hours covering more and more of the lexer to, perhaps, finally find a bug. So I was preparing for a long fuzzing session, wanted to go to bed and kick this one off but… The fuzzer found a first failure within a second.
> Running the fuzzer several times, sometimes it still hadn’t found a bug after 15 seconds, but usually it found the bug within a second; the process is (semi-)random so that’s expected.
- If you need neither reasonable performance nor reasonable error messages (and your language is expressive enough), use parser combinators.
- If you need at least one of those (or your language isn't expressive enough for parser combinators), use a parser generator.
- If the code at hand is the interpreter / compiler for a widely used programming language (like Java or C++), get an expert to produce a hand-written parser.
Because Elm's error messages were actually mind-blowing. They fundamentally changed my view of what a language and compiler is capable of. It honestly makes every other language laughable.
I seriously want anyone who ever designs a language to spend even one weekend playing with Elm just to see what's possible, because prior to that I would have never even considered a compiler could be so helpful. At times it was so clear and accurate it seems like it could've just fixed my code for me.
The task at hand wasn't a programming language, though; it was more like YAML with only one level of nesting and a lot of quoted strings (which were the bottleneck for PEG in my case).
What's increasingly clear to me, having designed a few languages, is that a well-designed language is easy to parse. I work on a couple of languages which require one-token lookahead. This means that when something goes wrong, we know exactly where the problem is and what token we didn't expect.
This is less important if you're just implementing a parser for an existing language or standard, but if your implementation is going to be the reference implementation, then I think it's important to generate your parser from the formal language you're working with, to give other implementers confidence that your implementation is correct.
I used to feel similarly, but that was because I'd been introduced to parser combinators in languages that lacked the features to replicate what made them great in Haskell. Properly implemented, they can take on a very readable, declarative-like style.
Over the weekend for instance, I threw together a JSON parser in Haskell using Graham Hutton's monadic parser library, and I found the result refreshingly clear and readable: https://gist.github.com/jarcane/025f5475418858bba8276f1dfcf2...
With do notation and ADTs, the code comes out looking pretty close to what it describes, and that reflectivity helps both when reading and when writing. I simply describe the patterns I'm looking to match and which parts to pass on, and voila: a parser.
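For instance, here's a cut-down sketch of that style (mine, using Parsec rather than Hutton's library; whitespace handling omitted for brevity) where each parser reads almost exactly like the grammar rule it implements:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data JValue = JNull | JBool Bool | JNumber Integer | JArray [JValue]
  deriving (Show)

-- No `try` needed here: none of the alternatives share a prefix.
jValue :: Parser JValue
jValue = jNull <|> jBool <|> jNumber <|> jArray

jNull :: Parser JValue
jNull = JNull <$ string "null"

jBool :: Parser JValue
jBool = (JBool True <$ string "true") <|> (JBool False <$ string "false")

jNumber :: Parser JValue
jNumber = JNumber . read <$> many1 digit

-- Reads like the grammar: '[' value (',' value)* ']'
jArray :: Parser JValue
jArray = do
  _  <- char '['
  vs <- jValue `sepBy` char ','
  _  <- char ']'
  return (JArray vs)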
Category grammars, PCCTS, and ANTLR always seemed like better approaches for prototyping languages.
I was quite happy that our compiler design professor decided to switch to the newly introduced Java, adopting JavaCC and JJTree in the process.
Here’s a video that covers the concepts in a more entertaining format. https://youtube.com/watch?v=ubaX1Smg6pY
The idea is to write the spec (i.e. grammar and semantics) only once and get, for free, both a compiler and some formal tools (model checker, ...).
C has been formalized with it, for instance.
Concise methods of abstracting a problem domain.
Nice to see the author's implementation.
I didn’t quite get the “match arbitrary objects” bit... Is this basically the same technique as PEG or what am I missing?
It’s the “generator” part that matches on objects (the AST generated by the parser), but I’m not sure why that’s useful:
- in the toy example the evaluation could happen right in the parser,
- and if you decide to have an AST, why use a DSL that looks similar to the grammar definition to walk it?
The reason I chose to implement it that way in the article is that I modeled RLMeta after OMeta. If I remember correctly, their reason for doing so was that they thought compilers would be easier to write if all passes (parsing, code generation, ...) could be written using the same language.
I think using the same DSL for code generators works quite well, and they become quite readable. Perhaps there is a DSL better suited for code generation; it would be interesting to experiment with that as well.
You are right that the toy examples could generate code directly in the parser; there is no need for a separate code generator step. But the RLMeta compiler would be much less readable without one, I think. The reason I introduced an AST in the toy example was to introduce the concept before showing it in the RLMeta implementation.
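A rough Haskell approximation of that idea (mine, not RLMeta/OMeta code): when the code generator is also written as pattern matches over the AST, it reads like a second "grammar" with one rule per node shape, much as the parser is a grammar over tokens.

```haskell
-- A toy AST and a "generator" pass in the same rule-per-shape style.
data Expr = Num Int | Add Expr Expr | Mul Expr Expr deriving (Show)

-- Each equation mirrors one production; output is stack-machine code.
gen :: Expr -> [String]
gen (Num n)   = ["push " ++ show n]
gen (Add a b) = gen a ++ gen b ++ ["add"]
gen (Mul a b) = gen a ++ gen b ++ ["mul"]

-- e.g. gen (Add (Num 1) (Mul (Num 2) (Num 3)))
--   == ["push 1","push 2","push 3","mul","add"]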
Racket, on the other hand, gives you the ability to hook into the new language, convert it to Racket syntax, and then compile it. As far as I know, Racket doesn't give you a simple parsing grammar, and you're targeting Racket code as output if you make a language. Though you could always take Racket as your AST and target something like C or LLVM if you want it to live outside of Racket.
Racket is not necessarily unique in that aspect; it just focuses on making it easier with tooling. Scala, Kotlin, and Elixir are examples of some of the most successful languages: https://www.youtube.com/watch?v=du6qWa8lWZA
The technique used in the book is to use brag to describe the grammar and generate a parser/tokenizer producing s-expressions; then you can either interpret or compile them directly, or use the Racket macro system itself to expand them into Racket.
And of course, Racket's macro system itself can do quite much without even a parser generator. The standard library even includes an example ALGOL 60 implementation done as Racket macros, and examples of C and Java can be found on the package server.
Some even come with Turing-complete type systems and Prolog subsystems. http://www.shenlanguage.org/