
Ohm: Parsing Made Easy - pdubroy
https://nextjournal.com/dubroy/ohm-parsing-made-easy
======
pdubroy
Hi HN, I'm a researcher at HARC
([https://harc.ycr.org/](https://harc.ycr.org/)) and one of the authors of
Ohm. We've used it to power several of our programming language
investigations, such as Seymour (which was on HN yesterday:
[https://news.ycombinator.com/item?id=15471954](https://news.ycombinator.com/item?id=15471954))
and Chorus ([http://www.chorus-home.org/](http://www.chorus-home.org/)).

If you're interested, here's the grammar for the language used in the Seymour
demo:
[https://github.com/harc/seymour/blob/6f55361ad3410f42f67f183...](https://github.com/harc/seymour/blob/6f55361ad3410f42f67f183da7b7549418884e50/lang/grammar.js)

Happy to answer any questions that you have!
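If you haven't seen Ohm before, here's a quick sketch of what using it looks
like (a toy grammar, just for illustration):

    const ohm = require('ohm-js');

    const g = ohm.grammar(`
      Arithmetic {
        Exp = Exp "+" num  -- plus
            | num
        num = digit+
      }
    `);

    // The semantics live outside the grammar, as an "operation" on parse nodes.
    const s = g.createSemantics().addOperation('eval', {
      Exp_plus(left, _op, right) { return left.eval() + right.eval(); },
      num(digits) { return parseInt(digits.sourceString, 10); },
    });

    console.log(s(g.match('1+2+3')).eval());  // => 6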

~~~
jwiley
Nice project, thanks for sharing. One interesting application that comes to
mind is creating a "safe" subset of JavaScript that could be run in an end
user's browser without requiring a sandbox. One definition of safe might be:
not allowing access to the DOM or global variables.

Is this a reasonable use case? Is Ohm's execution environment appropriate for
it?

~~~
BoppreH
JavaScript is too dynamic to have a safe subset. For example, using only the
six characters []()!+ you can write arbitrary code. The main culprits are weak
typing, permissive attribute access, and a large runtime environment with lots
of surface area. This is unlikely to be fixable by changing the language
grammar.

See [http://www.jsfuck.com/](http://www.jsfuck.com/)

    
    
        JSFuck is an esoteric and educational programming style
        based on the atomic parts of JavaScript. It uses only six
        different characters to write and execute code.
        It does not depend on a browser, so you can even run it on
        Node.js.
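The trick is JavaScript's coercion rules. A few of the basic building blocks,
each line using only those six characters (paste them into any JS console):

    ![]             // false: an array is truthy, so negating it gives false
    !![]            // true
    +[]             // 0: [] coerces to "" and then to the number 0
    +!![]           // 1: true coerces to 1
    !![]+!![]       // 2, and so on for the other digits
    ![]+[]          // "false": boolean + array coerces both to strings
    (![]+[])[+[]]   // "f": index 0 of the string "false"
    (![]+[])[+!![]] // "a": index 1 of "false"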

~~~
pdubroy
Depending on your definition of "safe", it is indeed possible. See the paper
"Preventing Capability Leaks in Secure JavaScript Subsets" for a good
analysis: [http://www.adambarth.com/papers/2010/finifter-weinberger-bar...](http://www.adambarth.com/papers/2010/finifter-weinberger-barth.pdf)

------
bd82
Ohm is very impressive.

Specifically:

    
    
      1. The separation of Grammar and Semantics.
      2. Handling left recursion in a top-down (PEG) parser.
      3. Incremental parsing.
    

I think that the one feature missing to make it applicable for more than rapid
prototyping and teaching purposes is _performance_.

In this benchmark I've authored, which uses a simple JSON grammar, Ohm is
about _two orders of magnitude_ slower than most other parsing libraries in
JavaScript:
[http://sap.github.io/chevrotain/performance/](http://sap.github.io/chevrotain/performance/)

So I am sure there is a great deal of room for optimizations.

~~~
pdubroy
Thanks!

Yes, we are aware that Ohm's batch parsing performance is not great. In
practice, it has been fast enough for our uses -- especially since we
implemented incremental parsing. With incremental parsing, Ohm's ES5 parser
can be as fast as hand-optimized parsers like Acorn.
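(For the curious, incremental matching is exposed through the Matcher
interface; roughly, with a toy grammar:)

    const ohm = require('ohm-js');
    const g = ohm.grammar('G { start = any* }');

    const m = g.matcher();
    m.setInput('hello world');
    m.match();  // the initial, batch parse

    // Subsequent edits reuse memoized results from the previous parse:
    m.replaceInputRange(6, 11, 'there');  // "hello world" -> "hello there"
    m.match();  // incremental re-parse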

But you're right, there is definitely room for improvement. So far, we have
been much more concerned with making Ohm easy to learn and pleasant to use. I
would certainly be happy to have contributors who are interested in improving
our batch performance.

~~~
bd82
Incremental parsing is indeed amazing for IDE scenarios. But you still have to
parse the entire file at least once.

For example, it takes 15 seconds to parse lodash.js with Ohm (on my machine)
using the sample EcmaScript grammar. But what happens if my IDE has 200 files
and 400 KLOC of code?

From my personal experience, if you want high performance you have to treat it
as an ongoing feature. This could mean:

    
    
      * Inspect each new version for performance regressions.
    
      * Reinspect previous feature implementations for possible 
        performance optimizations.
    
      * Keep track of underlying performance characteristics of your 
        runtime, for example V8 hidden class changes and other 
        de-optimization causes. These characteristics may (and do!) 
        change over time with newer releases of V8...
    

It would be interesting to try to optimize Ohm.js; I even contributed some
optimizations to Nearley.js in the past. But I'm afraid I just don't know when
I will get around to trying this, with too many projects and ideas competing
for my time. :(

------
iamleppert
I've really tried to get on with parser generators, but I've found they are
hard to use, hard to debug, and the languages/DSLs are clunky and weird.
Except for cleanroom academic implementations, or for language designers who
can afford the time and resources to learn and get good at a parser generator,
I've found it's better to simply use regular expressions to do matching and a
functional language that can build up a data structure recursively.
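To sketch what I mean, here's a regex tokenizer plus a recursive builder for
nested lists like "(a, (b, c))" (illustrative, not production code):

    // Tokenizer: a regex handles the flat, lexical layer.
    function tokenize(src) {
      const tokens = [];
      const re = /\s*([(),]|[A-Za-z_]\w*)/y;  // sticky: no skipping garbage
      let m;
      while ((m = re.exec(src)) !== null) tokens.push(m[1]);
      return tokens;  // (a real tokenizer would report unexpected characters)
    }

    // Recursive builder: plain functions handle the nested layer.
    function parseList(tokens, pos = { i: 0 }) {
      if (tokens[pos.i] !== '(') throw new Error("expected '('");
      pos.i++;
      const node = [];
      while (pos.i < tokens.length && tokens[pos.i] !== ')') {
        if (tokens[pos.i] === '(') node.push(parseList(tokens, pos));
        else node.push(tokens[pos.i++]);
        if (tokens[pos.i] === ',') pos.i++;  // skip separators
      }
      if (tokens[pos.i++] !== ')') throw new Error("expected ')'");
      return node;
    }

    console.log(parseList(tokenize('(a, (b, c))')));  // => ['a', ['b', 'c']]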

Another problem is that a lot of them resort to clunky code generation from
the grammar file, and when something goes wrong you're not debugging the
grammar per se; you are stepping through a bunch of machine-generated code
that you didn't write yourself. So your debug process looks like: change the
grammar, regenerate the parser, try parsing again, loop. It replaces the
entire file too, so it's not like you can isolate areas of the code and work
on them like you would regular code. And generating the parser is often slow.

Also, when runtime parsing errors do happen, incorrect line/column numbers are
often reported, and getting good, descriptive parser errors is a project in
and of itself after you have your grammar written and working.

~~~
bd82
Even the creator of ANTLR (Terence Parr) said:

"In my experience, almost no one uses parser generators to build commercial
compilers."

[https://github.com/antlr/antlr4/blob/master/doc/faq/general....](https://github.com/antlr/antlr4/blob/master/doc/faq/general...).
(no anchors for direct link).

~~~
nerdponx
Does Pandoc count? IIRC it implements CommonMark in a PEG.

~~~
bd82
I don't understand the question. I am not familiar with Haskell, but from what
I understand Pandoc is a group of hand-crafted parsers (readers) for markup
formats.

How does this relate to a discussion on the relevance of parser generators /
libraries?

[https://github.com/jgm/pandoc/tree/master/src/Text/Pandoc](https://github.com/jgm/pandoc/tree/master/src/Text/Pandoc)

~~~
beojan
If you check the Parser.hs file there, Pandoc uses Parsec.

------
kvlr
Hey, I’m one of the founders of Nextjournal, the coding, writing and
publishing platform this article was written in.

This probably isn’t obvious: you can get a copy of the article and play with
it if you click Remix and sign in/up.

There’s some more context about what we’re trying to build and why in our
launch post:
[https://medium.com/nextjournal/launch-nextjournal-public-bet...](https://medium.com/nextjournal/launch-nextjournal-public-beta-for-open-research-a55d15bfa95f)

------
chrislloyd
I’ve used Ohm for a few small parsers. What’s great about the editor is that
you can take it and share it with somebody else and they’re given insights
into _how_ the parser works.

------
dman
Any thoughts on how to handle parsing for the IDE use case, where the document
being edited might have errors in it? I would usually expect an area around
the cursor that receives edits and hence may contain errors. I would also
expect a header and footer surrounding the edited area that are structurally
sound, since they're unchanged from a previously valid version of the file.

~~~
pdubroy
You might be interested in Ohm's incremental parsing support. I'll be
presenting a paper on this at SPLASH next week:
[https://ohmlang.github.io/pubs/sle2017/incremental-packrat-p...](https://ohmlang.github.io/pubs/sle2017/incremental-packrat-parsing.pdf).

In theory, what you're suggesting should be possible to implement with our
incremental packrat parsing algorithm. But we haven't tried it yet, so I can't
say for sure.

~~~
dman
Thanks!

------
simplify
I've had great experience using PEG.js, another PEG-based parser generator.
How does Ohm compare?

~~~
pdubroy
It's somewhat similar, but the main difference is that Ohm has a strict
separation between syntax and semantics. We think this has several benefits,
which we describe a bit here:
[https://github.com/harc/ohm/blob/master/doc/philosophy.md](https://github.com/harc/ohm/blob/master/doc/philosophy.md)

Another difference is that Ohm grammars can contain left recursion -- both
direct and indirect. IMHO this is a pretty big deal, but I know that some
people don't agree, and think that avoiding left recursion is not a big
problem.
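To make both points concrete, here's a toy sketch: one grammar with direct
left recursion, and two independent operations defined over it:

    const ohm = require('ohm-js');

    const g = ohm.grammar(`
      Sub {
        Exp = Exp "-" num  -- sub
            | num
        num = digit+
      }
    `);

    const s = g.createSemantics();
    s.addOperation('eval', {
      Exp_sub(l, _, r) { return l.eval() - r.eval(); },
      num(d) { return parseInt(d.sourceString, 10); },
    });
    s.addOperation('sexpr', {  // a second, independent interpretation
      Exp_sub(l, _, r) { return ['-', l.sexpr(), r.sexpr()]; },
      num(d) { return d.sourceString; },
    });

    const m = g.match('1-2-3');
    console.log(s(m).eval());   // => -4: left-associative, i.e. (1-2)-3
    console.log(s(m).sexpr());  // => ['-', ['-', '1', '2'], '3']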

~~~
yazaddaruvala
Any thoughts on
[https://github.com/nikomatsakis/lalrpop](https://github.com/nikomatsakis/lalrpop),
and its first priority being "Nice error messages"?

Sidenote: I created this issue a while ago, but never got a response:
[https://github.com/nikomatsakis/lalrpop/issues/180](https://github.com/nikomatsakis/lalrpop/issues/180)

~~~
vidarh
'Nice error messages' is an important goal - it is one of the few things that
tend to stop people from using parser generators for 'production' compilers.
Parsing itself is the easy bit, and while things like left recursion can make
some things a bit easier, the workarounds are so well understood that this is
not what stops people.

But error handling does.

------
richard_shelton
I really liked what was done in the STEPS project. I learned a lot from their
reports. For example, this paper by Ian Piumarta is absolutely beautiful [1].
I also spent a lot of time learning the OMeta [3] system by Alessandro Warth.

And, honestly, I see nothing really new in Ohm. Basically, it's just some
tweaking of the same tech. Moreover, Ohm was made for the isolated task of
parsing; for me, that's a step back. My point is that parsing alone is not a
very interesting thing: for making DSLs you need other tools too. In Ian
Piumarta's paper we had a minimalistic program transformation system [2].
Remember the original META II [4]? It was a compiler-compiler (a
metacompiler), not just a parser generator. I'm really curious to know why the
authors decided this time to limit themselves to parsing only.

[1]
[http://www.vpri.org/pdf/tr2010003_PEG.pdf](http://www.vpri.org/pdf/tr2010003_PEG.pdf)

[2]
[https://en.wikipedia.org/wiki/List_of_program_transformation...](https://en.wikipedia.org/wiki/List_of_program_transformation_systems)

[3]
[http://www.vpri.org/pdf/tr2008003_experimenting.pdf](http://www.vpri.org/pdf/tr2008003_experimenting.pdf)

[4] [http://www.ibm-1401.info/Meta-II-schorre.pdf](http://www.ibm-1401.info/Meta-II-schorre.pdf)

~~~
pdubroy
Ohm is not really "just" a parser generator -- but that's the easiest way to
describe it.

The big idea in Ohm is its modular semantic actions. You can read more about
the design -- and why we think it's interesting -- in our DLS paper:
[https://ohmlang.github.io/pubs/dls2016/modular-semantic-acti...](https://ohmlang.github.io/pubs/dls2016/modular-semantic-actions.pdf)

~~~
richard_shelton
Thank you for the answer!

I understand that the separation of grammar and semantics has its benefits.
You can use the same grammar description with different semantic rule sets,
etc. It is, indeed, a clean and interesting approach.

But, as I understand it, Ohm still has no support for context-sensitive
grammars, which in many cases is more important than proper left recursion
handling.

And OMeta had another nice feature: meta-rules (higher-order rules), which are
absent in Ohm, if I understand correctly.

Ohm tries to be very user-friendly, but at the price of dropping
functionality. So in this sense Ohm is not a modern replacement for OMeta
(which had AST-transforming features -- very important for making compilers!).

I'm not trying to be negative, and I really wish your team big success!

~~~
pdubroy
> But, as I understand it, Ohm still has no support for context-sensitive
> grammars, which in many cases is more important than proper left recursion
> handling.

Right, we don't support context-sensitive grammars yet. But we'd like to --
we're just trying to figure out how to do it in a way that fits Ohm's design
principles. I'm optimistic that we'll be able to do that.

------
derriz
Sorry to be negative, and this comment probably doesn't belong in a discussion
about a specific parsing toolkit, but I've become unconvinced that parser
generators are useful. My experience is limited to Yacc/Lex back in the old
days (I quickly jumped to Bison/Flex), more recently ANTLR, and a couple of
functional parser combinator libraries. In nearly all cases it was to deal
with "real world" (i.e. not toy) programming languages.

The last time I needed a parser (in Java), I started studying the ANTLR docs
(it's changed quite a bit since I last used it), but became disillusioned
quickly with the amount of reading and studying I would have to do to get
something working.

So I quickly wrote a "hand crafted" tokenizer and recursive descent parser. I
found this so satisfying that it made me wonder why I had bothered learning
relatively complex tools in the past particularly since I had been exposed to
recursive descent parsing as an undergrad.

Advantages that pop into my head:

- The code was clean, readable and very concise. For debugging, the
stacktraces were helpful and I could use my regular debugger/IDE to step
through the parsing process. The method names in my Parser class mostly
matched the names of corresponding grammar rules.

- You can code around the theoretical limitations of recursive descent
parsing in a very intuitive manner (e.g. "if (tokens.peekAhead(1).getType() ==
Token.LEFT_BRACE) { parseX(); } else { parseY(); }"). In theory it might seem
that this would lead to a mess, but it actually allows very flexible and
natural abstractions.

- You have complete control over the building of the AST - the parseX(...)
methods can take arguments or the calling parse method can manipulate the
returned AST - doing stuff like flattening (normalising) node trees or re-
ordering child nodes, etc. The shape of the AST can be independent of the
structure of the grammar rules.

- It's easy to provide helpful error messages and even error recovery without
fighting with the toolkit. Better still, you can start with a fairly lazy
generic error handler and later, in a natural style, add special cases to make
the messages more and more helpful for specific common user mistakes. I
sneakily logged all parse failures by users to constantly improve error
reporting. After a while the parser seemed almost like an AI when reporting
errors.

- For parsing expressions, there is a relatively well-known way to deal with
operators with different arities and associativity rules (by adding a numeric
"context binding strength" parameter to your parseExpr() method) - a quick
Google search provided the template (see the sketch after this list).

- The entire parser was self-contained in a small number of reasonably
compact classes: a Lexer/Tokenizer class, a Parser class and a SymbolTable
class (and of course a TokenType enum and an ASTNode class). Other developers
could grok the code because it was compact and self-contained, without having
to learn a parsing toolkit.

- You feel in control; i.e. you can add features to the language and the
parser incrementally without fearing that sinking feeling you get when you
think you're 99% of the way there only to realize that the tool you're using
makes the last 1% impossible forcing you to rethink/rewrite already "banked"
functionality.

- Zero dependencies and trivial to integrate into the build and test process.

edit: paragraphs
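Here's a minimal sketch of that "binding strength" approach (precedence
climbing), in JavaScript here rather than Java, with pre-split tokens:

    // Precedence climbing: each operator has a binding power; parseExpr
    // keeps consuming operators that bind at least as tightly as minBp.
    const BP = { '+': 10, '-': 10, '*': 20, '/': 20 };

    function parseExpr(tokens, pos = { i: 0 }, minBp = 0) {
      let left = Number(tokens[pos.i++]);       // atoms are numbers here
      while (pos.i < tokens.length) {
        const op = tokens[pos.i];
        const bp = BP[op];
        if (bp === undefined || bp < minBp) break;
        pos.i++;                                // consume the operator
        const right = parseExpr(tokens, pos, bp + 1);  // +1 => left-assoc
        left = [op, left, right];               // build an AST node
      }
      return left;
    }

    console.log(JSON.stringify(parseExpr('1 + 2 * 3 - 4'.split(' '))));
    // => ["-",["+",1,["*",2,3]],4]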

~~~
rwmj
I was surprised, when I revisited the GCC code after 20 years, to find that
GCC switched from using flex/bison to a hand-written recursive descent parser
(even for C++, which is reputed to be "impossible" to parse).

Here's the C parser: [https://github.com/gcc-mirror/gcc/blob/master/gcc/c/c-parser...](https://github.com/gcc-mirror/gcc/blob/master/gcc/c/c-parser.c)

and the C++ parser: [https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/...](https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/cp/parser.c)

------
CalChris
_In many parser generators (e.g. Yacc and ANTLR), a grammar author can specify
the language semantics by including semantic actions inside the grammar. A
semantic action is a snippet of code — typically written in a different
language — that produces a desired value or effect each time a particular rule
is matched._

Actually, _the need for that_ went away with ANTLR4. The grammar is now all
grammar (and lexer rules), and the semantic actions are listeners or walkers
written separately, calling or overriding methods and classes generated from
the grammar.

Much cleaner that way.

~~~
bd82
It did not exactly go away.

It is still possible to embed semantics inside an ANTLR4 grammar.

For example, see the ANTLR4 EcmaScript grammar sample:
[https://github.com/antlr/grammars-v4/tree/master/ecmascript](https://github.com/antlr/grammars-v4/tree/master/ecmascript),
which uses embedded code to solve the RegExp vs. division operator ambiguity.
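For concreteness, the ambiguity is that a "/" can begin a regular expression
literal or be the division operator, depending on what came before it:

    // A lexer cannot classify "/" without context:
    r = a / b / g;   // two divisions
    r = /b/g;        // one regex literal with the "g" flag
    // Deciding requires knowing whether the previous token can end an
    // expression, which is why grammars end up embedding a bit of code here.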

Another scenario where embedding code could be preferred is optimizing for
maximum performance, as abstractions normally come with a performance
overhead.

I do agree that the default approach should be to separate the semantics
unless there is a very good reason not to...

~~~
CalChris
I stand corrected. I rewrote my grammar with ANTLR4 and gutted all the
embedded semantics. That was a _good_ day.

I'm going to hit reply now and then I'm going to take out my personal
neuralyzer and forget that I ever found out that you can still embed.

------
feelin_googley
IMHO, nothing makes parsing as easy as SNOBOL/SPITBOL. It is almost as old as
Lisp, and older than C.

The question I have as a mere mortal user, who is not very interested in
theory and the debates thereon, is: _what has the fastest performance?_

If the proponents of post-SNOBOL PEG/packrat parsing were to publish a
"parsing challenge" and let us replicate/create benchmarks of different
parsers, including some written in SNOBOL, I would find that very useful in
determining whether these other parsers are worth a more serious look.

------
kasbah
I have been using Nearley.js [1] and have had a lot of fun with it. I actually
quite liked being able to mix the JS post-processing into the grammar
definition in Nearley, but I could be convinced of the advantages of keeping
them separate (checking out your paper on DSLs now).

How would you compare Ohm to Nearley? Can Ohm handle ambiguous grammars?

[1]: [http://nearley.js.org](http://nearley.js.org)
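For readers who haven't seen Nearley: the mixing looks like this in a .ne
grammar file, where the {% ... %} blocks are ordinary JavaScript
postprocessors attached to each rule (toy grammar):

    # Nearley (.ne file): the grammar and its processing are interleaved.
    sum  -> sum "+" prod  {% ([l, , r]) => l + r %}
          | prod          {% id %}
    prod -> [0-9]:+       {% ([ds]) => parseInt(ds.join(''), 10) %}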

------
jsierles
Pretty cool for sharing, as it can run in the browser.

Click the 'Remix' button and you can play around with and run the article's
contents.

Is there a way to play with this using Node.js as well?

------
disconnected
"Further reading" links at the bottom just link back to the same page.

~~~
pdubroy
Thanks. Fixed!

------
tomp
> The Ohm language is based on parsing expression grammars (PEGs), which are a
> formal way of describing syntax, similar to regular expressions and context-
> free grammars

Uh-oh. I've voiced my concerns about PEGs (and LL parsers) before, but IMO any
grammar "interpreter" that doesn't point out the ambiguities in a grammar, and
instead relies on some vague, and ultimately arbitrary, notion of "precedence"
(e.g. that rules declared first in the grammar file have priority), isn't a
good foundation for a serious language (though it's fine for throwaway parsers
and language experiments).

~~~
shalabhc
Is this a limitation of the PEG syntax itself? IOW, is it possible to identify
ambiguities in grammars defined as PEGs?

~~~
tomp
Possible? Maybe.

But the problem is more that (according to Wikipedia) the _choice_ operator
in PEGs (i.e. _e1 / e2_ ) is in fact _defined_ as _ordered_ choice, i.e. it
prefers the first alternative.

They try to sell this as a "solution" to ambiguous grammars, as an advantage,
but they're just ... _wrong_. It's as if Java, when resolving method
overloading, arbitrarily preferred the method that's declared first in the
source file, instead of refusing to compile ambiguous code, as it does now.

~~~
chrisseaton
> They try to sell this as a "solution" to ambiguous grammars

But it is a solution... the grammar is no longer ambiguous if you define
choice as giving priority to one side or the other. There's no need for scare
quotes! It is a solution that removes ambiguity. There is no longer any
ambiguity, and there's nothing 'vague' at all about a rule as simple and clear
as this.

> but they're just ... wrong

You'll have to give a more convincing argument than that if you want to
persuade people!

> It's as if Java, when resolving method overloading, arbitrarily preferred
> the method that's declared first in the source file, instead of refusing to
> compile ambiguous code, as it does now

But if you had this rule, then the code isn't ambiguous any more, is it? It
has a well-defined way to decide which method to use, which everyone can
understand and implement.

~~~
haberman
> But it is a solution... the grammar is no longer ambiguous if you define
> choice as giving priority to one side or the other.

Sure it's no longer ambiguous to the computer. But the important question is:
is it ambiguous to a human?

Take the "dangling else" problem. What does this mean in C?

    
    
       if (a)
       if (b) f();
       else g();
    

If you defined your grammar with a PEG, the answer is: whichever alternative
you put first (if-with-else or if-without-else). But that answer doesn't help
someone actually trying to _use_ your language unless they go and read your
PEG. What user wants to do that?
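To make this concrete, here's a sketch of the two orderings written as an
Ohm-style PEG (rule names are illustrative):

    const ohm = require('ohm-js');
    const g = ohm.grammar(`
      Dangling {
        Stmt = "if" "(" ident ")" Stmt "else" Stmt  -- ifElse
             | "if" "(" ident ")" Stmt              -- if
             | ident "(" ")" ";"                    -- call
        ident = letter+
      }
    `);
    // With "ifElse" listed first, the else binds to the inner "if", as in C:
    console.log(g.match('if (a) if (b) f(); else g();').succeeded());  // true
    // Swap the first two alternatives, and (because PEG choice is ordered
    // and committed) the same input fails to parse at all. Either way, the
    // tool has nothing to warn about.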

Worse, it keeps language designers from being aware when they accidentally put
gotchas like this into their languages. The PEG tools can't warn you, because
to a PEG tool there is no problem. As a real-world example of this, it was not
discovered that ALGOL 60 had a "dangling else" ambiguity until the language
had already been published in a technical report. A CFG-based tool could have
warned the designers about the ambiguity, but with PEG-based tools you are
designing blind.

~~~
victorNicollet
The ambiguity of a grammar is rather unrelated to how surprising it can be to
a human. Consider something like this TypeScript:

    
    
        var a = { label: f() };
        () => { label : f() };
    

These constructs look similar, but one is an object literal and the other is a
block with a useless label. All of this can be implemented as an unambiguous
context-free grammar.

Relying on grammar ambiguity detection to find constructs surprising to humans
is not very effective, if only because of the difference between how a human
understands the grammar (pattern-based) and how EBNF expresses it (prefix-
based).

~~~
haberman
This is a red herring. Context-free grammar tools don't solve the problem of
keeping a language from ever being confusing. They do, however, solve the
problem of keeping literal ambiguity out of your language.

Ambiguity is strictly worse than confusion. Ambiguity means you have to
communicate more information to your users: when two parses are both
syntactically valid, which one does the language actually choose?

------
wybiral
Those popup chats on articles like this gross me out...

I'm just trying to read something, stop phishing for my email address.

~~~
kvlr
Sorry about that. We didn't mean to show it to visitors and a bug prevented us
from quickly disabling it. We've now removed it completely.

------
ohm
Nice

