
Glush: A robust parser compiler built using non-deterministic automatons - evenw
https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler
======
chrisaycock
_> However, in practice it turns out that LALR(1) isn’t always the answer. For
instance, both GCC, Rust and Go have chosen to use handwritten parsers that
are not based on a declarative grammar file. I find this disappointing: We
have decades of experience with specifying languages in a declarative format,
and apparently it’s still easier to manually write parsers. Obviously there’s
something lacking with the algorithms we have today._

Many production compilers have hand-written parsers because of error
reporting, not because of issues with parser-generator algorithms.

Consider the example of a missing parenthesis. I can specify a grammar with
the following:

    
    
      EXPR =
           ...
           | '(' EXPR ')'
    

The generated parser will not gracefully report a mismatched number of
parentheses! Instead, it will report that the user's input is not formatted as
expected, which makes it hard for the user to understand what went wrong.

It is possible to modify the grammar to look for this case specifically:

    
    
      EXPR =
           ...
           | '(' EXPR ')'
           | '(' EXPR      // report this explicitly
    

But now I've just polluted my once-pristine grammar file. And adding all of
the necessary error checks throughout the grammar will really make things
complicated.

So, many production compilers end up with custom parsers that report
unexpected input in a sane way. It is not that the algorithms are lacking;
there is simply no mechanism for a generator to know how a mismatched
parenthesis should be reported.
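
To make the point concrete, here is a hedged sketch (mine, not from any of the projects discussed): a tiny hand-written recursive-descent parser for `expr = digit | '(' expr ')'` that reports a missing ')' in the user's own terms, which a plain generated parser has no hook for.

```python
# A hand-written parser can name the exact problem ("missing ')' for the
# '(' at offset N") instead of a generic "unexpected input" message.

class ParseError(Exception):
    pass

def parse_expr(src, i=0):
    # expr = digit | '(' expr ')'
    if i >= len(src):
        raise ParseError(f"unexpected end of input at offset {i}")
    c = src[i]
    if c.isdigit():
        return int(c), i + 1
    if c == '(':
        value, j = parse_expr(src, i + 1)
        if j >= len(src) or src[j] != ')':
            # the hand-written parser knows exactly which '(' is unclosed
            raise ParseError(f"missing ')' for '(' opened at offset {i}")
        return value, j + 1
    raise ParseError(f"unexpected {c!r} at offset {i}")
```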

~~~
seanmcdirmid
Error recovery and reporting are for sure the biggest challenge of a
production parser. If you work out of band, you can play a lot of tricks in
your parser to make it much more effective in handling errors. For example,
braces can be matched in many languages without parsing any other constructs
in the language, meaning these errors can all be reported and recovered from
independent of the rest of the language. Then again, the overloading of < and
> as both operators and brackets in many languages defeats that a bit :)
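
As an illustration of the out-of-band idea (a sketch of my own, not tied to any particular parser): brackets can be matched with a plain stack, independent of the rest of the grammar, so mismatches can be reported with exact offsets before or alongside real parsing.

```python
# Out-of-band bracket matching: no knowledge of the language's other
# constructs is needed to report and recover from these errors.

PAIRS = {')': '(', ']': '[', '}': '{'}

def check_braces(src):
    stack = []  # (opener, offset) for each unmatched opening bracket
    for i, c in enumerate(src):
        if c in '([{':
            stack.append((c, i))
        elif c in PAIRS:
            if not stack or stack[-1][0] != PAIRS[c]:
                return f"unexpected '{c}' at offset {i}"
            stack.pop()
    if stack:
        c, i = stack[-1]
        return f"unclosed '{c}' at offset {i}"
    return None  # balanced
```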

Hand coding a parser is usually worth it for a moderately popular production
language. Parser generators really shine for smaller language efforts that
can’t afford the overhead and don’t really care about the niceties of decent
error reporting.

~~~
joe_the_user
I don't think it's just error handling. A nice, human-readable grammar has a
significant chance of being ambiguous. You can remove the ambiguity with a
variety of transforms, such as putting the grammar into Greibach Normal Form,
and this translates pretty directly to a hand-written recursive descent
parser (which you can also generate). But the thing is that here the code
essentially serves as a lower level of abstraction, teasing out what you
really mean. Which is to say there's nothing wrong with the code being the
final spec; that way you don't have to keep the code in sync with the spec.

~~~
seanmcdirmid
For a mainstream programming language, complicated ambiguities are hard on
users anyway; by writing your parser by hand you have an extra incentive to
keep the grammar simple (e.g. Scala). A lot of the power a parser generator
gives you shouldn't be used.

~~~
joe_the_user
Remember, complicated ambiguities are different from complicated structures.

The Ruby programming language is a poster child for potential ambiguities
that seem simple: allowing a conditional both before and after a statement,
for example. Essentially, all the programming languages that are "human like"
wind up like this, with the sort of ambiguities that people are comfortable
with in natural language. This has a cost in terms of exact expression, but a
lot of users think of it as being "easy on them".

Conversely, languages without grammatical ambiguities often seem irritatingly
verbose to use: new users dislike Lisp's parenthesis proliferation, and I
personally find Scala irritatingly verbose.

Whether the complexity of compiler-writing approaches acts as a limiting
factor on language complexity probably depends on what someone is familiar
with. Some people can spit out an annotated YACC grammar pretty easily,
whereas my head swims looking at the stuff. I can produce a recursive descent
parser from a grammar pretty easily, however.

------
abecedarius
Sounds promising! And a great exposition of the background and development.

About the Glushkov construction for regular expressions, here's some Python
code that may _perhaps_ help to explain it, since the Wikipedia page is hard
to follow (at least for me) and the Mastodon thread linked in the post
admitted to some confusion about the "Play on Regular Expressions" paper
(which I haven't read):
[https://github.com/darius/sketchbook/blob/master/regex/nfa_p...](https://github.com/darius/sketchbook/blob/master/regex/nfa_position.py)

Sorry it's not pedagogically great either; I was rederiving the method for
myself rather than implementing from textbooks. What I called 'follows' in the
comment should correspond to 'pair set' in the post; the other fields have the
same names.
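
For readers who also find the Wikipedia page hard going, here is an even smaller sketch of the same idea (my own rederivation, assuming a hypothetical four-node regex AST; the `pairs` relation should correspond to the post's "pair set"):

```python
# Glushkov construction: one NFA state per literal position; 'first'/'last'
# are the positions a match may start/end at, and 'pairs' (x, y) records
# that position y may directly follow position x.

def glushkov(ast):
    pos = []  # pos[i] is the literal character at position i

    def analyze(node):
        kind = node[0]
        if kind == 'lit':
            i = len(pos)
            pos.append(node[1])
            return False, {i}, {i}, set()  # nullable, first, last, pairs
        if kind == 'cat':
            n1, f1, l1, p1 = analyze(node[1])
            n2, f2, l2, p2 = analyze(node[2])
            pairs = p1 | p2 | {(x, y) for x in l1 for y in f2}
            return (n1 and n2,
                    f1 | (f2 if n1 else set()),
                    l2 | (l1 if n2 else set()),
                    pairs)
        if kind == 'alt':
            n1, f1, l1, p1 = analyze(node[1])
            n2, f2, l2, p2 = analyze(node[2])
            return n1 or n2, f1 | f2, l1 | l2, p1 | p2
        if kind == 'star':
            n, f, l, p = analyze(node[1])
            return True, f, l, p | {(x, y) for x in l for y in f}
        raise ValueError(kind)

    nullable, first, last, pairs = analyze(ast)

    def match(s):
        # simulate the position NFA, one character at a time
        if not s:
            return nullable
        states = {i for i in first if pos[i] == s[0]}
        for c in s[1:]:
            states = {y for (x, y) in pairs if x in states and pos[y] == c}
        return bool(states & last)

    return match
```

For example, `(a|b)*a` becomes `('cat', ('star', ('alt', ('lit', 'a'), ('lit', 'b'))), ('lit', 'a'))`.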

Incidentally, I think the post overstates the problems with PEGs a bit (you
shouldn't need left recursion anywhere a hand-written recursive-descent
parser would also work; PEGs were meant to streamline and formalize recursive
descent, and this failure to model the practice can be fixed without big
changes). But CFGs are great too, and I'm looking forward to studying Glush.
I wish I'd thought of exploring the generalization to context-free grammars.
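
To illustrate the left-recursion point: the left-recursive rule `expr = expr '+' num | num` sends a naive recursive-descent (or PEG) parser into infinite recursion, but the standard rewrite `expr = num ('+' num)*` parses the same strings and still builds a left-associative tree. A minimal sketch (the token-list format here is hypothetical):

```python
# expr = num ('+' num)* -- iteration replaces left recursion, and folding
# the accumulator to the left preserves left associativity.

def parse_sum(tokens):
    i = 0
    value = tokens[i]  # first number
    i += 1
    while i + 1 < len(tokens) and tokens[i] == '+':
        value = ('add', value, tokens[i + 1])  # fold to the left
        i += 2
    return value
```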

------
erezsh
That sounds really cool, and I hope it will find success. The field of parsing
truly needs some refreshment, especially in usability.

I do have a few questions:

1\. You claim the algorithm used guarantees O(n^3) for all CFGs, but is
different from Earley and CYK. Are there more details for this algorithm,
perhaps with a rigorous mathematical proof for this claim?

2\. How is Glush's performance in relation to Earley, in terms of constant
time and memory requirements? (even for low complexity, a high constant can be
problematic)

3\. How do you deal with ambiguity? The blog mentions priority, but is there a
way to get something similar to a SPPF?

~~~
judofyr
Author of the article here. Thanks for your comments!

> 1\. You claim the algorithm used guarantees O(n^3) for all CFGs, but is
> different from Earley and CYK. Are there more details for this algorithm,
> perhaps with a rigorous mathematical proof for this claim?

Not quite yet. There is _a lot_ more work needed, and writing this article
was one step in the right direction. I was hoping to get even closer to a
rigorous proof before publishing.

> 2\. How is Glush's performance in relation to Earley, in terms of constant
> time and memory requirements? (even for low complexity, a high constant can
> be problematic)

This is yet to be determined. A few comments:

\- I'm not 100% concerned about performance because as mentioned in the
article, I don't plan to use this in a place where performance is critical.

\- I'm not too worried about the constant factors since I'm not using
complicated data structures (it's all hash maps), and most intermediate data
structures are small and probably best implemented linearly (e.g. as a plain
array) for practical grammars with few ambiguities.

\- It's based on the NFA-style "advance one state at a time" approach, which
is actually _not_ good for performance. It means that parsing a keyword such
as "function" passes through 8 different states instead of the parser looking
ahead and jumping straight to the end. I think taking advantage of lookahead
will probably help a lot.

> 3\. How do you deal with ambiguity? The blog mentions priority, but is there
> a way to get something similar to a SPPF?

A single output from the parser is conceptually a flat list of _marks_ (think
RPN in reverse, i.e. prefix notation):

    
    
        // this input:
        1 + 2 * 3 + 4
        // with this precedence:
        (1 + (2 * 3)) + 4
        // could generate this flat list of marks:
        add, add, 1, mul, 2, 3, 4
    

The total output from the parser is a "tree of marks".
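
Since the marks come out in prefix order, a tree can be rebuilt from a single flat list in one pass, given each operator's arity. A hypothetical sketch (the `ARITY` table and `build` helper are mine, not Glush's API):

```python
# Rebuild a tree from a prefix-ordered mark list, assuming fixed arities.

ARITY = {'add': 2, 'mul': 2}

def build(marks):
    it = iter(marks)
    def node():
        m = next(it)
        if m in ARITY:
            # operator mark: recursively read its operands in order
            return (m, *[node() for _ in range(ARITY[m])])
        return m  # leaf mark (a literal)
    return node()
```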

------
anhldbk
Thanks for sharing. I'm currently looking for an elegant language for
querying JSON structures. In addition to JsonPath and JMESPath, Glush & GROQ
are definitely worth a try.

~~~
sshine
Have you considered jq?

[https://stedolan.github.io/jq/](https://stedolan.github.io/jq/)

[https://github.com/stedolan/jq](https://github.com/stedolan/jq)

~~~
naniwaduni
Worth noting jq is not quite like the others GP listed, in that it seems to
be designed as a command-line utility first, with only incidental concessions
to allowing itself to be linked as a library. Its C API is poorly documented
and
essentially defined by the single implementation. The language itself is
immensely powerful and not really sandboxable.

It's a nice language to work in but definitely not one I'd want to expose as a
user interface.

~~~
sshine
I'm actually working on a separate implementation with the intent of making jq
work as a library.

By "not really sandboxable", what powerful features of jq do you mean?

I'm not sure what's meant by "expose as a user interface". It's a pretty
abstract DSL, so I would certainly only expose it to programmers.

~~~
naniwaduni
> By "not really sandboxable", what powerful features of jq do you mean?

On close inspection I can't find things as bad as I imagined, but there are
still some unexpected barely-documented features like reading files from
anywhere on disk[1] as long as they have the right extension, with much wider
implications than one might expect at a glance. (Note that while the module
path is relative and checked for path traversal, _you can just set the search
path_. Undocumented is that the text needn't even parse as JSON because you
can just set the raw flag.) In general, many language features are implemented
by exposing a more powerful primitive than necessary to the language and
documenting a less powerful wrapper but not the primitive itself; the security
implications go more or less unexamined.

There seems to be no particular provision for executing any less than enough
to generate the next output, so if you want to set limits on execution you'll
have to do that at the process level. This is something you can probably do
better in a separate implementation, though.

> I'm not sure what's meant by "expose as a user interface". It's a pretty
> abstract DSL, so I would certainly only expose it to programmers.

Only trusted users should have access to jq. I would assume that anyone who
has access to executing jq code can execute arbitrary code in the
interpreter's context. I'm less sure that it's _true_ right now than I was
when I made the post above, but it still seems pretty likely.

[1]:
[https://stedolan.github.io/jq/manual/#importRelativePathStri...](https://stedolan.github.io/jq/manual/#importRelativePathStringas$NAME\[%3Cmetadata%3E\];)

------
sshine
Skipping to "Glush, the algorithm" can be helpful if you already know the
history of parsing.

TL;DR: Glush grammars consist of "rules" that are regexes + recursion of rule
calls. Compiles to NFAs.

I'm not sure I see the point in declaring precedence levels inside rules --
maybe it's just a preference thing, but I like having operator precedences in
a separate section of the grammar. Yacc does this. Megaparsec for Haskell does
this.

This reminds me of two things:

1\. How syntax highlighting is implemented in Vim and GEdit (which is to say
GtkSourceView). See for example how "contexts" can be nested here:
[https://github.com/rubencaro/gedit_hacks/blob/master/.local/...](https://github.com/rubencaro/gedit_hacks/blob/master/.local/share/gtksourceview-3.0/language-
specs/ruby.lang#L469)

2\. Kleenex: [https://kleenexlang.org](https://kleenexlang.org) \-
[https://github.com/diku-kmc/kleenexlang](https://github.com/diku-
kmc/kleenexlang) \- a silly example:
[https://github.com/athas/EggsML/blob/master/concieggs/compil...](https://github.com/athas/EggsML/blob/master/concieggs/compiled/makeDanish.kex)

I am tempted to classify this as "regex extended with recursion" rather than
"an alternative to CFGs", since I cannot decipher from the article the exact
expressive power.

~~~
tom_mellior
> Compiles to NFAs.

No. It cannot, since NFAs cannot parse languages that are not regular. Rather,
the algorithm _executes_ like a pushdown automaton. I don't think it's correct
to call this compilation.

> I cannot decipher from the article the exact expressive power.

To me it seems to suggest being able to parse (exactly) all context-free
languages.

------
tom_mellior
Nice article, this looks like a useful parser generator. Though, as others
have mentioned, error reporting is a very important point that should be
considered even at this early stage.

Here's something that surprised me though:

> a parser toolkit that lets you create efficient parsers in multiple
> languages (currently JavaScript, Go, and Ruby)

> [some other approach] it looked very interesting, but it’s not completely
> smooth. First of all, all of the implementations have been in highly
> expressive, garbage collected languages (Haskell, Scala, Racket).

This isn't the author's only reason to dismiss that alternative approach, but
it's still a strange point to spend almost an entire paragraph on. If all the
targets the author currently envisions are GC'd languages, it's premature
pessimization to worry about how other, hypothetical, targets might be
accommodated.

~~~
judofyr
(Author here!)

> This isn't the author's only reason to dismiss that alternative approach,
> but it's still a strange point to spend almost an entire paragraph on. If
> all the targets the author currently envisions are GC'd languages, it's
> premature pessimization to worry about how other, hypothetical, targets
> might be accommodated.

Aah. I should probably have clarified that. At Sanity.io today we only need
JavaScript and Go, and those are the most immediate languages to support (and
what I can actually spend time on during work). However, as an overall parser
algorithm I consider it crucial to be able to easily target C, as this would
unlock a whole new world of possibilities.

Does that make it clearer?

> but it's still a strange point to spend almost an entire paragraph on

Yeah, this article isn't exactly short on details. The purpose of this post
was a mixture between (1) arguing for the value of declarative-based grammars,
(2) showing all of the fascinating algorithms that have been discovered, and
(3) showing how Glush works.

I've had a great time playing with Parsing with Derivatives, so even though I
didn't end up using it directly, I felt like sharing cool ideas with new
people.

~~~
tom_mellior
Thanks for your response. You made your point more explicit, but I still
think that being afraid of GC due to planning to support C at some point
could be premature pessimization. Your mileage obviously varies!

Also, yes, thank you for mentioning Parsing with Derivatives and all the other
stuff, I really liked the overview you gave of the field.

------
IshKebab
No mention of Tree-sitter ([http://tree-sitter.github.io/tree-
sitter/](http://tree-sitter.github.io/tree-sitter/)), which this looks very
similar to.

------
vinodkd
Fascinating journey through various parsing techniques, aside from describing
Glush itself. Bookmarked!

------
7532yahoogmail
Intrigued .. keep posting. I tried a bison/flex parser and was stuck in
shift/reduce hell.

~~~
naniwaduni
In my experience, shift/reduce conflicts typically reflect real ambiguities,
at least in your initial draft of a grammar. Reporting them at that stage is a
good thing; you should be addressing them instead of glossing over them.

There are legitimate reasons to need more power than LALR(1) provides, but
careless thinking is more common.

------
lopsidedBrain
I would have _really_ preferred a table of contents, either at the top or in
a sidebar. It would have made it easier to skip over all the background
material that essentially rehashes an undergrad compiler course.

~~~
naniwaduni
There is one, which appears as a sidebar once you're below the fold. Bizarre
placement imo, but it's there.

~~~
lopsidedBrain
Huh, you're right. I guess I shouldn't be reading this on mobile.

