My end goal is a self-hosted language.
My first attempt was writing a recursive descent parser by hand: https://gist.github.com/cmcarey/eee1571721141c356d4f61b453a6...
This didn't work out well: the parser was badly structured, and I only discovered my grammar was ambiguous after I'd written 600 lines of parser code.
Looking into alternative ways to write a parser, I've seen the usual yacc/bison/ANTLR type tools and relevant discussions here on HN.
Owl looks fantastic and is probably exactly what I'm looking for. It has a great interactive page where you enter the grammar and code and it shows a visual representation of the parse tree: https://ianh.github.io/owl/try/#example
OTOH, the parser generators will do a good job of making sure your basic grammar hangs together. Here is what I have been doing lately:
1) Use ply (David Beazley's Python LALR parser generator) to create an accepter (just parse, no code gen). That will help you chase out all the grammar inconsistencies.
2) For production, write a hand-written recursive descent parser, and use "precedence climbing" for expressions. This gives you a nice table-driven way to handle expressions with a couple of mutually recursive functions, thus eliminating the main PITA of recursive descent: those darn arithmetic expressions.
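To make the precedence-climbing idea concrete, here is a minimal sketch in Python. The token stream format and names are hypothetical (a flat list of numbers and operator strings), not from any of the projects discussed; the point is the single precedence table driving one recursive function.

```python
# Minimal precedence-climbing sketch over a hypothetical token list,
# e.g. [1, '+', 2, '*', 3]. One table, two small functions.
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}  # binding powers

def parse_expr(tokens, min_prec=1):
    """Parse an expression whose operators bind at least min_prec."""
    lhs = parse_atom(tokens)
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left-associative: the right operand may only use tighter operators.
        rhs = parse_expr(tokens, PREC[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs

def parse_atom(tokens):
    return tokens.pop(0)  # assume atoms are already numbers

print(parse_expr([1, '+', 2, '*', 3]))  # ('+', 1, ('*', 2, 3))
```

Adding a new binary operator is then just a new entry in the table, which is the "table-driven" payoff mentioned above.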
Is the problem that these specific tools (yacc/bison) don't report errors well? Or is the issue you're talking about that LR grammars will fail later on than LL grammars in general? Or the greater ambiguity (e.g. more possible lookaheads due to the greater parsing power)?
When parsing top-down, as in recursive descent, you know what higher-level syntactic structure you are in the middle of constructing, and anything successfully parsed so far has probably been slotted into the AST in a meaningful way. You simply have a lot more context to go on when constructing a message, and that context is an intuitive AST walk away.
It slows things down about 2x, which is why it isn't on by default, on the assumption that most parses are successful.
This is a garbage 'computer says no' kind of error message.
It doesn't tell you why those tokens would have been allowed to follow, why those are the only tokens allowed to follow, why the token you used is not one of the allowed ones, or what you more likely intended to write to make it correct.
In a recursive descent parser you have the freedom to use more context and logic to generate your error messages.
Your idea of a good parser error message is being told that the end of file was unexpected, followed by a list of a hundred possible tokens that could start another statement, with 'end' buried somewhere in the list, because that's what's valid there.
My idea of a good parser error message comes from a recursive descent parser that looks for the 'end' token, doesn't see it, notices it has hit end of file instead, and says: missing 'end' or a new statement at end of file.
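As an illustration of the kind of check being described, here is a hypothetical fragment of a recursive-descent parser in Python (all names are made up for the sketch; statement parsing is stubbed out):

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # 'EOF' is a sentinel for end of input in this sketch.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else 'EOF'

    def error(self, msg):
        raise SyntaxError(msg)

    def parse_block(self):
        stmts = []
        while self.peek() not in ('end', 'EOF'):
            stmts.append(self.peek())  # stand-in for parse_statement()
            self.pos += 1
        if self.peek() == 'EOF':
            # We know we are inside a block, so we can name exactly what
            # is missing instead of listing every legal token.
            self.error("missing 'end' or a new statement at end of file")
        self.pos += 1  # consume 'end'
        return stmts

print(Parser(['x', 'y', 'end']).parse_block())  # ['x', 'y']
# Parser(['x', 'y']).parse_block() would raise:
# SyntaxError: missing 'end' or a new statement at end of file
```

The useful part is that the error site knows it is inside `parse_block`, so the message can be phrased in terms of the construct being parsed.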
The standard ways of getting error messages from LR parsers are pretty bad, but there is a long line of work that's tried to make it better. If you'll forgive the immodesty on my part, I've been working on updating and fixing that work. There's a draft paper at https://arxiv.org/abs/1804.07133 and a beta-ish implementation at https://github.com/softdevteam/grmtools/, so you can see what you think!
Basically, it uses the parse-tree disambiguation from the GLR parser to look for the most likely mistake the user made; it's very clever.
FWIW here is a small, self-contained recursive-descent parser I recently wrote in Python (that recognizes certain C data definitions).
I think your structure is a bit odd because the Nodes are "smart". That is, the style is more object-oriented than functional.
If you clearly separate the input data (lexer), the parser code, and the resulting data, it will be better. Pretty much all compilers and interpreters follow this structure, even ones in Java (e.g. in Terence Parr's books). AST nodes should be dumb (no methods except trivial ones for pretty printing).
I think you might be getting confused between clauses of the grammar and nodes in an AST. They are similar but they don't have a one-to-one correspondence. (Generating a heterogeneous AST is generally more useful than generating a homogeneous parse tree.)
In my case I return tuples; in bigger projects I use Zephyr ASDL. (https://news.ycombinator.com/item?id=17852049)
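As a sketch of the "dumb nodes as tuples" style described above (the node shapes and function names are illustrative, not from any particular project):

```python
# AST nodes are plain tuples: tag first, children after.
# All behaviour lives in separate functions, not in node methods.
def make_num(value):
    return ('num', value)

def make_binop(op, left, right):
    return ('binop', op, left, right)

def evaluate(node):
    """Interpreter logic stays outside the nodes."""
    tag = node[0]
    if tag == 'num':
        return node[1]
    if tag == 'binop':
        _, op, left, right = node
        l, r = evaluate(left), evaluate(right)
        return l + r if op == '+' else l * r
    raise ValueError(f'unknown node: {tag}')

ast = make_binop('+', make_num(1), make_binop('*', make_num(2), make_num(3)))
print(evaluate(ast))  # 7
```

A pretty-printer or type-checker would be another standalone function walking the same tuples, which is what keeps the nodes themselves dumb.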
Parser() is a class because it holds the state of the current token, which is necessary for an LL(1) parser. It feels a bit redundant sometimes, but it's a straightforward and useful structure once you get used to it.
Writing an LL(1) recursive descent parser by hand is a good exercise even if you plan to use a parser generator eventually. IME it helps to write out the grammar in comments next to the code.
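Putting those two points together, here is a minimal LL(1) recursive-descent sketch for a made-up nested-list grammar: the grammar lives in comments next to each function, and the Parser class holds the token position.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0  # state of the current token lives here

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f'expected {tok!r}, got {self.peek()!r}')
        self.pos += 1

    # list = '(' items ')'
    def parse_list(self):
        self.eat('(')
        items = self.parse_items()
        self.eat(')')
        return items

    # items = { ATOM | list }
    def parse_items(self):
        items = []
        while self.peek() is not None and self.peek() != ')':
            if self.peek() == '(':          # one token of lookahead decides
                items.append(self.parse_list())
            else:
                items.append(self.peek())
                self.pos += 1
        return items

print(Parser(['(', 'a', '(', 'b', ')', ')']).parse_list())  # ['a', ['b']]
```

One function per rule, one token of lookahead per decision; the grammar-in-comments convention makes drift between grammar and code easy to spot.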
On the other hand, I have found that which parsing technique works best is closely tied to the actual language. It could be that something like Owl works for your language, but that's not obvious and would require some justification IMO.
The point others have made here about error recovery being hard for LR parsers still stands, unfortunately. But that's only an issue if you are doing error recovery; most scripting languages don't.
Ohm also has a nice interactive editor: https://ohmlang.github.io/editor/
See the benchmark I created at: https://sap.github.io/chevrotain/performance/
What parsing algorithm does chevrotain use under the hood? Does it allow left-recursive and right-recursive productions/rules?
Chevrotain is an LL(k) parser library, or more precisely SLL(k): it looks up to k fixed tokens ahead to choose the next alternative.
Because it is just a library to assist in hand-crafting recursive descent parsers, the same limitations apply: left recursion would lead to an infinite loop. However, left recursion is detected during initialization and a descriptive error is thrown instead.
Right recursion is allowed.
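The infinite-loop problem with left recursion is common to all top-down parsers, and the standard fix is to rewrite the rule as a repetition. A small Python sketch (the token format is hypothetical):

```python
# A left-recursive rule like
#     expr : expr '+' term | term
# makes any top-down parser call parse_expr before consuming a token,
# recursing forever. The standard rewrite is a repetition:
#     expr : term ( '+' term )*
def parse_expr(tokens):
    node = parse_term(tokens)
    while tokens and tokens[0] == '+':
        tokens.pop(0)
        # Folding left keeps the tree left-associative, matching what
        # the original left-recursive rule would have produced.
        node = ('+', node, parse_term(tokens))
    return node

def parse_term(tokens):
    return tokens.pop(0)

print(parse_expr(['1', '+', '2', '+', '3']))  # ('+', ('+', '1', '2'), '3')
```

This is why a top-down library can reject left recursion at initialization without any loss of expressiveness for rules like this one.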
It is possible to resolve more complex ambiguities using backtracking (http://sap.github.io/chevrotain/docs/features/backtracking.h...) or other types of custom logic.
However, in general the library does not try to parse every possible grammar in the world; the focus is on performance, features, and ease of development.
Like Ohm, it separates the grammar from the semantic actions.
Instead of using an abstraction of a declarative grammar definition and code generation to provide an alternative to writing a parser by hand, Chevrotain simplifies the process of hand-crafting a parser.
That means no code generation, and you can still directly debug the code you have written (unlike with most parser combinators).
> This is what guarantees the language is visibly pushdown: all recursion is explicitly delineated by special symbols.
> Plain recursion isn't allowed. Only two restricted kinds of recursion are available: guarded recursion and expression recursion.
So the Owl grammar itself qualifies:
JSON certainly qualifies:
I do not think you can do a full C-like language because of function types. But you can do a simpler C-like language, for which they have a good example in the test directory: