Owl: Parser generator for visibly pushdown languages (github.com)
89 points by ccmcarey 6 days ago | 32 comments





I've been looking at parser generators recently in an effort to begin writing my own simple interpreted language.

My end goal is a self hosted language.

My first attempt was writing a recursive descent parser by hand: https://gist.github.com/cmcarey/eee1571721141c356d4f61b453a6...

This didn't work so well: the parser was badly structured, and I only discovered my grammar was ambiguous after I'd written 600 lines of parser code.

Looking into alternative ways to write a parser, I've seen the usual yacc/bison/ANTLR type tools and relevant discussions here on HN.

Owl looks fantastic and is probably exactly what I'm looking for. There's a great interactive page where you enter the grammar and code and it shows a visual representation of the parse tree: https://ianh.github.io/owl/try/#example


The problem with yacc/bison-style LALR parser generators is that you end up with an LALR parser. Which works great on syntactically correct code, but trying to get a reasonable error message out of one is about as much fun as repeatedly poking yourself in the eye with a sharp stick. Also, by the time you add the empty productions that you want to help with code generation, the grammar gets delicate and brittle.

OTOH, the parser generators will do a good job of making sure your basic grammar hangs together. Here is what I have been doing lately:

1) Use ply (a Python LALR parser generator by Dave Beazley) to create an acceptor (just parse, no code gen). That will help you chase out all the grammar inconsistencies.

2) For production, write a hand-written recursive descent parser, and use "precedence climbing" for expressions. This gives you a nice table-driven way to handle expressions with a couple of mutually recursive functions, thus eliminating the main PITA of recursive descent: those darn arithmetic expressions.
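A minimal sketch of precedence climbing, to show the table-driven shape (the token list and operator table here are made up for illustration; a real parser would parse a primary/unary expression where this simply pops a token):

```python
# Hypothetical operator precedence table; higher binds tighter.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_expression(tokens, min_prec=1):
    """Parse a flat token list like ["1", "+", "2", "*", "3"] into a nested
    tuple AST such as ("+", "1", ("*", "2", "3"))."""
    lhs = tokens.pop(0)  # stand-in for parsing a primary/unary expression
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left-associative operators: recurse with precedence one higher,
        # so equal-precedence operators are not consumed by the recursion.
        rhs = parse_expression(tokens, PRECEDENCE[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs
```

Adding a new binary operator is then just a new entry in the table, rather than a new pair of grammar rules and functions.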


Could you explain in more detail what you mean about LALR error messages? I haven't used the parser generators you mention but I've implemented LALR parsing before, the error reporting situation doesn't seem drastically different from recursive descent to me. You can report the location where the parse failed, what was recognized before that point, what tokens would have been allowed to follow.

Is the problem that these specific tools (yacc/bison) don't report errors well? Or is the issue you're talking about that LR grammars will fail later on than LL grammars in general? Or the greater ambiguity (e.g. more possible lookaheads due to the greater parsing power)?


A bottom up parser such as LALR knows what it has built successfully, and knows that the next token makes no sense in the current state. So you have stuff in the parse stack that is headed somewhere but you don't know where, and you know you are at the earliest point you can detect a syntax error. Unless you rummage around in the parse stack for clues, there really isn't much you can do to make a good message, and the rummaging is often ad-hoc.

When parsing top-down as in recursive descent, you know what higher level syntactic structure you are in the middle of constructing, and anything successfully parsed so far is probably slotted into the AST in a meaningful way, so you simply have a lot more context to go on when constructing a message, and that context is an intuitive AST walk away.


I'll also add that you can use two parsers: one that's super fast on correctly written code, and, if that fails, one that makes good error messages easy to produce. sklogic said he used that strategy in his tools for program analysis. The extra code is negligible. I don't know how much extra time or maintenance burden it adds, but I figured it's marginal versus the cost of handling errors in the first place.

My FastParse (https://www.lihaoyi.com/fastparse/) parser combinator library does this; by default it runs without error logging, just giving you an error position, but if something fails you can ask it to re-do the parse keeping track of additional metadata to give you a nice error message with what tokens could have succeeded and a stack trace telling you why it wants those tokens.

Tracing errors slows things down about 2x, which is why it isn't on by default, on the assumption that most parses are successful.
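The retry-with-tracing strategy can be sketched like this (hypothetical names; FastParse's actual API differs, and the "grammar" here is a stand-in check):

```python
class ParseError(Exception):
    def __init__(self, position, message=""):
        super().__init__(message or f"parse error at {position}")
        self.position = position

def fast_parse(text):
    """Cheap pass: no diagnostic bookkeeping, only a failure position."""
    if "!" in text:                      # stand-in for a real syntax check
        raise ParseError(text.index("!"))
    return ("ok", text)

def traced_parse(text):
    """Slow pass: redo the work, tracking metadata for a helpful message."""
    pos = text.index("!")
    return ParseError(pos, f"unexpected '!' at column {pos}; "
                           "expected a letter or digit")

def parse(text):
    try:
        return fast_parse(text)          # common case: pay nothing extra
    except ParseError:
        raise traced_parse(text)         # rare case: rerun for a good message
```

Successful parses never pay the tracing cost; only the failing input is parsed twice.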


> You can report the location where the parse failed, what was recognized before that point, what tokens would have been allowed to follow.

This is a garbage 'computer says no' kind of error message.

It doesn't tell you why those tokens would have been allowed to follow, why those are the only tokens allowed to follow, why the token you used is not one of the allowed ones, or what you more likely intended to write to make it correct.

In a recursive descent parser you have the freedom to use more context and logic to generate your error messages.


What context or data would constitute a helpful error message in your opinion? Do you have an example of a recursive-descent parser that generates particularly good error messages?

A great example is if you forget to put 'end' at the end of a function at the end of a file in languages that have that.

Your idea of a good parser error message is being told that the end of file was unexpected and being given a list of a hundred possible tokens that could start another statement, including 'end' buried somewhere in the list, because that's what's valid there.

My idea of a good parser error message would be in a recursive descent parser looking for the 'end' token, not seeing it, seeing that it's end of file instead and saying 'missing 'end' or a new statement at end of file'.
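A minimal sketch of that kind of contextual message, using a made-up token-list representation of a function definition:

```python
def parse_function(tokens):
    """tokens is a list like ["fn", "name", ..., "end"]; returns the body tokens."""
    if tokens[:1] != ["fn"]:
        raise SyntaxError("expected 'fn'")
    body, i = [], 2                      # skip "fn" and the function name
    while i < len(tokens) and tokens[i] != "end":
        body.append(tokens[i])
        i += 1
    if i == len(tokens):
        # Because we know exactly which construct is still open,
        # the message can name it instead of listing valid tokens.
        raise SyntaxError(f"missing 'end' to close function '{tokens[1]}' "
                          "at end of file")
    return body
```

The point isn't the toy grammar; it's that the call stack of a recursive descent parser directly encodes "we are inside a function body", so the error message can say so.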


See elm and rust for examples of compilers with good error messages.

Or clang.

> The problem with yacc/bison-style LALR parser generators is that you end up with an LALR parser. Which works great on syntactically correct code, but trying to get a reasonable error message out of one is about as much fun as repeatedly poking yourself in the eye with a sharp stick.

The standard ways of getting error messages from LR parsers are pretty bad, but there is a long line of work that's tried to make it better. If you'll forgive the immodesty on my part, I've been working on updating and fixing that work. There's a draft paper at https://arxiv.org/abs/1804.07133 and a beta-ish implementation at https://github.com/softdevteam/grmtools/, so you can see what you think!

[EDIT: fixed formatting]


For a really nice solution to the error message problem, see this recent strangeloop talk: https://www.youtube.com/watch?v=Jes3bD6P0To

Basically it uses the parse tree disambiguation from the GLR parser to look for the most likely mistake the user made - it's very clever.


What kind of language was it, and what was ambiguous about it? Do you control the language or is it an existing language?

FWIW here is a small, self-contained recursive-descent parser I recently wrote in Python (that recognizes certain C data definitions).

https://github.com/oilshell/oil/blob/master/build/cpython_de...

I think your structure is a bit odd because Nodes are "smart". That is, the style is more object oriented than functional.

If you clearly separate out the input data (lexer), the code for the parser, and the resulting data, it will be better. Pretty much all compilers and interpreters follow this structure, even ones in Java (e.g. in Terence Parr's books). AST nodes should be dumb (no methods except trivial ones for pretty printing).

I think you might be getting confused between clauses of the grammar and nodes in an AST. They are similar but they don't have a one-to-one correspondence. (Generating a heterogeneous AST is generally more useful than generating a homogeneous parse tree.)

In my case I return tuples; in bigger projects I use Zephyr ASDL. (https://news.ycombinator.com/item?id=17852049)

Parser() is a class because it holds the state of the current token, which is necessary for an LL(1) parser. It feels a bit redundant sometimes, but it is a straightforward and useful structure once you get used to it.

Writing an LL(1) recursive descent parser by hand is a good exercise even if you plan to use a parser generator eventually. IME it helps to write out the grammar in comments next to the code.
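That structure might look like this (a made-up bracket-list grammar for illustration, with the grammar rules written as comments next to the code, as suggested):

```python
class Parser:
    """LL(1) recursive descent parser; the only state is the current token."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.cur = tokens[0] if tokens else None   # single token of lookahead

    def eat(self, expected):
        """Consume the current token if it matches, else fail."""
        if self.cur != expected:
            raise SyntaxError(f"expected {expected!r}, got {self.cur!r}")
        self.pos += 1
        self.cur = self.tokens[self.pos] if self.pos < len(self.tokens) else None

    # list = '[' names ']'
    def parse_list(self):
        self.eat("[")
        names = self.parse_names()
        self.eat("]")
        return names                               # dumb data out, no methods

    # names = NAME (',' NAME)*
    def parse_names(self):
        names = [self.cur]
        self.eat(self.cur)
        while self.cur == ",":
            self.eat(",")
            names.append(self.cur)
            self.eat(self.cur)
        return names
```

Each method corresponds to one grammar rule, and the returned values are plain data (here, lists), keeping the lexer input, parser code, and output data separate.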

On the other hand, I have found that which parsing technique works best is very closely tied to the actual language. It could be that something like Owl works for your language, but it's not obvious and would require some justification IMO.


If you're happy to work with Python, the easiest LR(1) generator to use, and the one that generates the fastest Python parsers by far, is: http://lrparsing.sourceforge.net/ The documentation is very good and the error messages are (by the standards of the parsing world) clear.

The point the others have made here about error recovery being hard for LR parsers still stands, unfortunately. But that's only an issue if you are doing error recovery. Most scripting languages don't.



Absolutely! Furthermore, https://github.com/harc/ohm is the successor to OMeta.

Thanks, I didn't even know there was a successor. I've been using Parsley for a while and love it. It's kind of like the object-oriented answer to parser combinators I think.

You might also be interested in Perl6's grammars:

https://docs.perl6.org/language/grammars#Creating_grammars


I'm in the same boat. I've found https://github.com/harc/ohm to be a pretty pleasant grammar and parser to use.

Ohm also has a nice interactive editor: https://ohmlang.github.io/editor/


Ohm has several impressive features, such as separation of grammar and semantics and incremental parsing capabilities.

The base performance, however, is very low: two orders of magnitude lower than most other parsing libraries in JavaScript.

See the benchmark I created at: https://sap.github.io/chevrotain/performance/


chevrotain's playground tool - https://sap.github.io/chevrotain/playground/?example=JSON%20... - is sure nice!

What parsing algorithm does chevrotain use under the hood? Does it allow left-recursive and right-recursive productions/rules?


Thanks.

Chevrotain is an LL(k) parser library, or more precisely SLL(k): it looks up to k tokens ahead to choose the next alternative.

Because it is just a library to assist in hand-crafting recursive descent parsers, the same limitations apply: left recursion would lead to an infinite loop. However, left recursion is detected during initialization and a descriptive error is thrown instead.

Right recursion is allowed.

It is possible to resolve more complex ambiguities using backtracking: http://sap.github.io/chevrotain/docs/features/backtracking.h... or other types of custom logic.

However, in general the library does not try to parse every possible grammar in the world; instead the focus is on performance, features, and ease of development.



have a look at https://lark-parser.readthedocs.io/en/latest/

like Ohm it separates the grammar from the semantic actions


If you are looking for an alternative to standard parser generators you may wish to have a peek at Chevrotain. * https://github.com/SAP/chevrotain

Instead of using an abstraction of a declarative grammar definition and code generation to provide an alternative to writing a parser by hand, Chevrotain simplifies the process of hand crafting a parser.

Meaning no code generation and you can still directly debug the code you have written (unlike most parser combinators).


Just write down the grammar by hand first, then write the recursive descent parser strictly based on the grammar.

The interactive part is genius.

Is there an example of such a language?

It appears that the most important restriction is on recursion:

> This is what guarantees the language is visibly pushdown: all recursion is explicitly delineated by special symbols.

> Plain recursion isn't allowed. Only two restricted kinds of recursion are available: guarded recursion and expression recursion.

So the Owl grammar itself qualifies:

https://github.com/ianh/owl/blob/master/grammar.owl

JSON certainly qualifies:

https://github.com/ianh/owl/blob/master/test/json.owl

I think the full JavaScript language would have qualified in the past, when anonymous functions could only be the explicitly delimited "function() {}"... but now with ES6 we have fat arrow functions (for example "(x,y) => x+y") so I think modern JavaScript can no longer be parsed with Owl. That's because you can write something like "f = x => y => z => x+y+z" which has no guard nor can it be parsed with expression recursion.

I do not think you can do a C-like language because of function types. But you can do a simpler C-like language, for which they have a good example in the test directory:

https://github.com/ianh/owl/blob/master/test/something.owl


Unfortunate naming. This is not to be confused with OWL or OWL2:

https://en.wikipedia.org/wiki/Web_Ontology_Language


Yes. I'm trying to find resources on this lib, which I guess aren't many since it's new, and all the first-page results for my query are about the Ontology language instead, which is frustrating. So I agree that the naming was a bit unfortunate.


