
Show HN: Owl, a new kind of parser generator - panic
https://github.com/ianh/owl
======
ulrikrasmussen
This looks really great! For casual parsing and data extraction in particular,
having a parser generator which does not restrict its users to a particular
non-compositional subset of its input grammar (e.g. LL, LR, etc.) is a must,
since it does not require the user to understand the particular parsing
technique in order to use it. Visibly pushdown languages seem powerful enough
to be able to express many practical data formats.

Shameless plug: I saw that you have based the parse tree construction on Dubé
and Feeley's two-pass algorithm which in its second phase builds a parse tree
from a "journal" generated by a DFA in the first recognition phase. This works
well, but as you point out, it uses memory linear in the input string for the
journal.

While the parse tree must of course still be built and kept in memory, the
size of the auxiliary journal can actually be limited to what is semantically
required by the ambiguity of the input grammar - in many cases, the bound is
O(1). We have written a couple of papers about such a streaming parsing
algorithm which you might find interesting for inspiration:

[1] Optimally Streaming Greedy Regular Expression Parsing.
[https://utr.dk/pubs/files/grathwohl2014-0-paper.pdf](https://utr.dk/pubs/files/grathwohl2014-0-paper.pdf)

[2] Kleenex: Compiling Nondeterministic Transducers to Deterministic Streaming
Transducers.
[https://utr.dk/pubs/files/grathwohl2016-0-paper.pdf](https://utr.dk/pubs/files/grathwohl2016-0-paper.pdf)

We never got around to extending the technique to more powerful formalisms
such as visibly pushdown languages, although I believe that should be
possible.

~~~
panic
Thanks! This is something I want to fix eventually. These papers will be very
helpful.

------
Vanit
"Understandable — like regular expressions", out of context, but I chuckled.

~~~
sbjs
I get why that's funny, but I actually thought it made a good point: regular
expressions in the wild can become unruly and hard to read unless you break
them down, but the underlying concepts are simple and natural, and easy to
work with if you're writing your own small regexps. I don't know much about
traditional parsers, but if this tool simplifies parsers the same way regular
expressions simplified text-based state machines, it sounds like a major step
forward!

~~~
hyperpallium
And Owl naturally breaks down grammars, similar to a CFG.

Regular expressions can also be broken down - by their recursive definition -
but curiously they never seem to be, in practice. e.g.

    
    
        E = ab+bc
        F = x+y
        EF = (ab+bc)(x+y)
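
The same compositional style can be mirrored in code; here's a small
illustrative sketch in Python (which writes alternation as `|` rather than the
`+` used above):

```python
import re

# Compose larger regexes from named fragments, mirroring the
# recursive definition: E = ab+bc, F = x+y, EF = E F.
E = r"(?:ab|bc)"
F = r"(?:x|y)"
EF = re.compile(E + F)

print(bool(EF.fullmatch("abx")))  # True: 'ab' then 'x'
print(bool(EF.fullmatch("bcy")))  # True: 'bc' then 'y'
print(bool(EF.fullmatch("abz")))  # False: 'z' is neither 'x' nor 'y'
```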

------
blarg1
"Owl uses precomputed DFAs and action tables, which can blow up in size as
grammars get more complex. In the future, it would be nice to build the DFAs
incrementally."

Don't know if you've read this, but it describes how you might build them
incrementally:

[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

Also, I wonder if, instead of using DFAs, it would be easier to generate the
parser using assembly-like instructions, for example:

[https://swtch.com/~rsc/regexp/regexp2.html](https://swtch.com/~rsc/regexp/regexp2.html)

------
chucky_z
Is Owl similar to ANTLR? Could you explain the differences between the two
tools, and what makes each unique if they're similar?

I'm curious because I've quickly run across two of these types of projects
now, but admittedly know very little about them.

~~~
panic
Yeah, they both solve the same problem -- turning text into a structured
"parse tree".

At the core of each parsing tool is a particular parsing algorithm. The
details of this algorithm determine what kind of text the parser can parse,
how efficiently it can parse it, and what kind of error messages you get if
something goes wrong.

There's a huge variety of these algorithms, but nearly all of them are
variations on two basic types: "LL" and "LR" parsers. For example, ANTLR uses
an "LL(*)" parsing algorithm. This is a great blog post about the difference
between the two basic types: [http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demystified.html](http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demystified.html).

Owl is unique because its algorithm isn't a variation of either of these. It
works more like a regular expression parsing algorithm. This makes it more
limited than a tool like ANTLR, but it can solve some problems that have been
difficult with the existing algorithms.

For example, most programming languages allow expressions like "1 - 2" to be
used as statements. If you don't use a semicolon or anything to separate the
statements, there's an ambiguity -- is "1 - 2" a single expression, or is it
the two expressions "1" and "-2" in sequence?

ANTLR (at least with the default settings I tried) doesn't tell you about this
ambiguity at all. It just picks one of the two options ("1 - 2" in this case).
This may or may not have been what you were expecting! The problems with
ambiguity checking are quite deep (this follow-up to the earlier blog post I
linked goes into more detail: [http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html](http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html)).
Owl's parsing algorithm avoids these problems; it can show you the ambiguity
in all cases.
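
To make the two readings concrete, here's a tiny hypothetical Python sketch
(not Owl's algorithm) that enumerates both ways "1 - 2" can be split into
statements under a toy grammar where a statement is a number, a subtraction,
or a negated number:

```python
# Enumerate every way to split a token list into statements, where a
# statement is "num", "num - num", or "- num" (a negated number).
# Purely illustrative; this is not how Owl detects ambiguity.
def parses(tokens):
    if not tokens:
        return [[]]
    lengths = []
    if tokens[0] != "-":
        lengths.append(1)                          # num
        if len(tokens) >= 3 and tokens[1] == "-":
            lengths.append(3)                      # num - num
    elif len(tokens) >= 2:
        lengths.append(2)                          # - num

    out = []
    for n in lengths:
        stmt = " ".join(tokens[:n])
        for rest in parses(tokens[n:]):
            out.append([stmt] + rest)
    return out

# Prints both readings of the token sequence "1 - 2".
for reading in parses(["1", "-", "2"]):
    print(reading)
```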

~~~
adito
I often stumble upon discussions mentioning the terms LL and LR and other
parsing-related concepts[0]. Without a proper CS background, it's quite hard
to follow along. That first link[1] and the Wikipedia page[2] mentioned there
are really great. Many thanks for posting those; they really shed some light
on those terms.

[0]:
[https://jeffreykegler.github.io/personal/timeline_v3](https://jeffreykegler.github.io/personal/timeline_v3)

[1]: [http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demystified.html](http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demystified.html)

[2]:
[http://en.wikipedia.org/wiki/Tree_traversal](http://en.wikipedia.org/wiki/Tree_traversal)

------
erezsh
Looks interesting. Can you provide some intuition on which languages can be
parsed by this algorithm, and which can't?

Btw, why did you choose the name Owl?

~~~
panic
_> Can you provide some intuition on which languages can be parsed by this
algorithm, and which can't?_

The restriction is that any recursion in the grammar has to be inside explicit
begin and end tokens. For example, you can't write a rule like

    
    
        stmt = 'if' expr 'then' stmt* | expr
    

because `stmt` refers to itself directly, but

    
    
        stmt = [ 'if' expr 'then' stmt* 'end' ] | expr
    

is OK -- the [ and ] symbols indicate explicit recursion using 'if' and 'end'
as the begin and end tokens.[1]

 _> Btw, why did you choose the name Owl?_

I wanted a bird name, and Owl was short and easy to type!

. . .

[1] Though in this case, the _language_ is still visibly pushdown, since you
can expand it manually to:

    
    
        stmt = (('if' expr 'then')+ expr*)+ | expr
    

This is because the grammar only has right recursion. Middle recursion (like
you get if you add an 'else' clause) can't be expanded like this, so explicit
recursion is necessary to parse languages which use it.

To automate this expansion process (and re-association into a sensible parse
tree), Owl also has syntax for operators with precedence. Here's an example:
[https://ianh.github.io/owl/try/#expr](https://ianh.github.io/owl/try/#expr)

~~~
hyperpallium
This seems similar to grammars described by DTTs (predecessor to XML Schema),
in that it's regular expressions, that must be deterministic, that can
recurse, but only when isolated by a tag. That is, recursion always results in
nested elements, never a sequence of iterated elements:

    
    
        yes:  S -> <a> S </a> | e
        no:   S -> <a></a> S  | e
    

I've seen these called "tree grammars". Perhaps it's identical to "visibly
pushdown grammars"? I skimmed the wiki link (from your readme.md), but it
requires further study. I think it would be great if your readme.md included a
brief explanation, along the lines of your comment here.

~~~
panic
I think they're the same thing, though IIRC tree grammars are usually defined
to operate on actual trees rather than strings of text.

~~~
hyperpallium
I played around with tree grammars. I think there's a formal-language reason
for them being incapable of parsing expressions properly. I tried anyway to
find a subset or workaround that would work well enough, but failed. (IIRC) At
best, I needed parentheses around one of the operators, e.g. always around
'+', like (a+b)*c or (a+b*c).

I'm interested that you have this too, but it seems to have the same problems
I found:
[https://news.ycombinator.com/item?id=17650770](https://news.ycombinator.com/item?id=17650770)

It would be very exciting if you have found a way around this!

~~~
panic
Yeah, this is a fundamental problem. Luckily, arithmetic expressions are
always left- or right-recursive, never middle-recursive. That means they're
recognizable by a regular grammar, even if the parse tree doesn't match what
you'd expect:

    
    
        expr = number (('*' | '+') number)*
    

You can replace number by (number | [ '(' expr ')' ]) to add support for
parentheses as well.

My solution was to add special syntax which makes rules like this
automatically:

    
    
        expr = number : num
         .operators infix left '*' : times
         .operators infix left '+' : plus
    

While the parse tree is built, the operators and operands are rearranged into
the expected parse tree using [https://en.wikipedia.org/wiki/Shunting-yard_algorithm](https://en.wikipedia.org/wiki/Shunting-yard_algorithm).
Here's an interactive example with some more operators:
[https://ianh.github.io/owl/try/#expr](https://ianh.github.io/owl/try/#expr)
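
For intuition, the core of the shunting-yard step can be sketched in a few
lines (a hypothetical Python sketch, not Owl's actual C implementation):

```python
# Minimal shunting-yard sketch: converts an infix token list to
# postfix (RPN) using operator precedence. Illustrative only.
PREC = {"+": 1, "*": 2}  # higher binds tighter; both left-associative

def to_postfix(tokens):
    out, ops = [], []
    for tok in tokens:
        if tok in PREC:
            # Left associativity: pop operators of >= precedence first.
            while ops and PREC[ops[-1]] >= PREC[tok]:
                out.append(ops.pop())
            ops.append(tok)
        else:
            out.append(tok)  # operand
    while ops:
        out.append(ops.pop())
    return out

print(to_postfix(["1", "+", "2", "*", "3"]))  # ['1', '2', '3', '*', '+']
```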

~~~
hyperpallium
Thanks, the example I tried before works now; I must have misinterpreted the
output.

Nice: a language can be recognized even if the parse tree is different.

I vaguely recall the shunting-yard algorithm... am I right in characterising
your solution as a second level of parsing? It's not "pure", but from a
practical point of view, it's no worse than back-references in PCRE. The
practical issue is just one of UI usability.

------
nmca
Hey- does anyone know how this compares to Marpa? Also, more examples of
errors would be great :)

~~~
taejo
Marpa is a general parser: it can parse any context-free grammar. Owl can only
parse the visibly-pushdown grammars, which are a much smaller class, even
smaller than the LR(k) grammars that traditional parser-generators can parse.

Besides guaranteed O(n) parsing time, it's not clear to me what the advantages
of Owl are; Marpa runs in O(n) time on any grammar that Owl accepts, and on
many more (and I think it can, or could easily be modified to, warn if it
can't guarantee O(n) parsing time for a particular grammar).

------
wodenokoto
If you need to parse a file that is more complex than what regex can handle,
what does one /generally/ do?

I'm doing some analysis in R, and the data is generated by a program that more
or less just spits things out.

Its values are comma-separated, but if there are 2 values on one line, they
belong to columns 1 and 10. If the first character on a line is a comma, it is
not a separator but a value. Commas inside quotations or escaped with
backslashes are also values.

The list of special cases and oddities is pretty long. Right now I have a
large switch statement that applies different search and replace on all sorts
of conditions before feeding the CSV to read_csv.

Is this a case for Owl? Can I use it within R or Python?

~~~
jacoblambda
From what I can tell, your best bet would be to build the generated parser
(Owl generates C code) and then write a small interface in C. From there I
would just use R's or Python's foreign function interface to call your C
wrapper.

Alternatively, for Python you could use a parser combinator library like
parsec. Depending on how complicated your file can get, parsec could very
likely be the better choice, as it has first-class support in Python and seems
powerful enough to handle your use case.

There do seem to be a number of parser combinator libraries in R, but none of
them seem as well established as parsec for Python (which is based on the much
better known parsec for Haskell).

A quick look around suggests that Python will be your best bet for parsing, as
it has some decent tooling. Here is an article on some of the different parser
generators and parser combinators in the space.

[https://tomassetti.me/parsing-in-python/](https://tomassetti.me/parsing-in-python/)
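
If you end up hand-rolling it instead, the quoted-comma and escaped-comma
rules from the question can be handled by a small scanner. A rough Python
sketch covering just those two rules (the other special cases, like a leading
comma being a value, would need extra handling):

```python
# Split one line of the quirky format: commas inside double quotes or
# escaped with a backslash are values, not separators. Illustrative
# sketch only; the real file may have more special cases.
def split_line(line):
    fields, cur = [], []
    in_quotes = False
    i = 0
    while i < len(line):
        c = line[i]
        if c == "\\" and i + 1 < len(line):
            cur.append(line[i + 1])  # escaped character is literal
            i += 2
            continue
        if c == '"':
            in_quotes = not in_quotes  # quotes delimit, aren't kept
        elif c == "," and not in_quotes:
            fields.append("".join(cur))  # separator: close the field
            cur = []
        else:
            cur.append(c)
        i += 1
    fields.append("".join(cur))
    return fields

print(split_line('a,"b,c",d\\,e'))  # ['a', 'b,c', 'd,e']
```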

------
Taniwha
I don't see the trick I really like to use, which is to resolve some conflicts
(mostly shift/reduce ones) on the fly at run time. It lets you do things like
changing operator precedence dynamically if your language requires it.

------
zaptheimpaler
Is it possible to parse left-recursive binary expressions with this?

Like:

    
    
        expr = expr + expr
        expr = expr.method(args)
    

I'm trying to write a parser using a parser combinator library, and left-
recursive things are turning out to be really tricky.

~~~
panic
Yeah, there's explicit support for expressions (left recursive and right
recursive) using '.operators', which creates a group of operators at the same
precedence level:

    
    
        expr =
           identifier : var
         .operators postfix
           '.' identifier '(' args ')' : method-call
         .operators infix left
           '+' : plus

------
bsurmanski
I already made a language called OWL ;)

[https://github.com/bsurmanski/wlc](https://github.com/bsurmanski/wlc)

------
grapehut
Can you parse the owl grammar with owl?

~~~
kitd
[https://github.com/ianh/owl/blob/master/grammar.owl](https://github.com/ianh/owl/blob/master/grammar.owl)

