Hacker News new | past | comments | ask | show | jobs | submit login
Ohm: Parsing Made Easy (nextjournal.com)
248 points by pdubroy on Oct 17, 2017 | hide | past | favorite | 97 comments

Hi HN, I'm a researcher at HARC (https://harc.ycr.org/) and one of the authors of Ohm. We've used it to power several of our programming language investigations, such as Seymour (which was on HN yesterday: https://news.ycombinator.com/item?id=15471954) and Chorus (http://www.chorus-home.org/).

If you're interested, here's the grammar for the language used in the Seymour demo: https://github.com/harc/seymour/blob/6f55361ad3410f42f67f183...

Happy to answer any questions that you have!

Many PEG-based parser generators do not support left recursion — requiring grammar authors to use repetition or right recursion instead. But left recursion is the most straightforward way to express left associative operators, which is why left recursion is supported by Ohm.

Does Ohm have any limitation with regard to left recursion? The last time I checked there was a paper by Warth et al. [1] extending PEG Packrat Parsers to handle left recursion but later a paper by Tratt [2] pointed out a flaw in that algorithm and only managed to fix the problem for a limited subset of grammars.

[1] http://www.vpri.org/pdf/tr2007002_packrat.pdf

[2] http://tratt.net/laurie/research/pubs/papers/tratt__direct_l...

Ohm implements the same algorithm described in [1] -- Alex Warth (the author of that paper) created Ohm.

We're understand the complaint in [2], but we strongly disagree with Laurie's claim that these are "incorrect" parses. He proposes a different way of handling left recursion, which is just that -- different.

You can find a more thorough discussion here: https://github.com/harc/ohm/issues/55

I believe left recursion is a frill, because you don't really miss it when you write a recursive-descent parser. You just write the actions a little bit differently for a left-associative operator than a right-associative one, with a loop instead of recursion. Since PEGs are supposed to be the theory abstracting recursive-descent parsers, I would want semantic actions to be added to the theory in a way that can handle both kinds of operators, instead of complicating the parsing algorithm with left recursion. Here's an example of how that can work: https://github.com/darius/parson/blob/master/eg_calc.py -- the '-' operator associates left, while the '^' op associates to the right.

(Like Ohm, my system separated the semantic actions from the grammar: .bind(operator) on line 24. It's older than Ohm, actually, but unfinished and undocumented -- just something I've been using in my own projects.)

Of course, another reason to support left recursion is to let you use a grammar someone else has already written with left-recursion baked in, without having to mess with it.


As the grammar and the semantic actions are separate, is there any support for languages in which the actual parsing rules depend on earlier "semantics" of the language? This kind of thing is needed IIRC to decide whether the C expression

Is a function call (e.g. following "int hello(int x)") or a conversion, (e.g. following "typedef float hello"); Many languages have these constructs that make a parse ambiguous unless you can look at previous semantics.

Also, if I remember correctly, Ometa had a problem with the proof and implementation of left recursive PEG grammars - was this fixed in Ohm? (Or am I misremembering?)

@pdubroy , wow! This is some really exciting work. I had just looked at JISON but this walk through article really takes the cake at making Ohm look like the better approach. Thank you.

Just want to say, this article is probably in the Top 5 coolest things I've seen on HackerNews in the last 7ish years. Can't wait to see/read more about Ohm!

Thank you! Be sure to check out the Ohm Editor (https://ohmlang.github.io/editor/), and if you have any trouble getting started with Ohm, you can ask for help on the mailing list: https://groups.google.com/a/ycr.org/forum/#!forum/ohm.

Nice project, thanks for sharing. One interesting application that comes to mind is creating a "safe" subset of Javascript, that could be run in an end-users browser without requiring a sandbox. One definition of safe might be: not allowing access to the DOM or global variables.

Is this a reasonable use case? Is Ohm's executing environment appropriate for this usecase?

Javascript is too dynamic to have a safe subset. For example, using only the six characters ()+ []! you can write arbitrary code. The main culprits are the weak typing, permissive attribute access, and large runtime environment with lots of surface area. This is unlikely to be fixable by changing the language grammar.

See http://www.jsfuck.com/

    JSFuck is an esoteric and educational programming style
    based on the atomic parts of JavaScript. It uses only six
    different characters to write and execute code.
    It does not depend on a browser, so you can even run it on

Depending on your definition of "safe", it is indeed possible. See the paper "Preventing Capability Leaks in Secure JavaScript Subsets" for a good analysis: http://www.adambarth.com/papers/2010/finifter-weinberger-bar...

If you simply hand your subset off to the standard runtime yes, but you needn't do that. Once you've parsed it, you could certainly close those doors with a few simple rewrite rules.

Yes, if I understand correctly, you're describing something like ADsafe (http://www.adsafe.org/). I'm pretty sure you could implement that in Ohm, by extending our ES5 grammar (https://github.com/harc/ohm/blob/master/examples/ecmascript/...).

However, Ohm doesn't really have an "executing environment". It's up to Ohm users to define how their language is interpreted or compiled, by writing semantic actions for the grammar rules. See the second half of the article for an example.

asm.js/WebAssembly fit that bill.

Can the generated parser be saved somehow externally so that I don't have to ship the grammar with a project?

Are you guys working on (or have you considered) WASM text format as a target language?

Not yet, but I would love to see someone do that.

Ohm is very impressive.


  1. The separation of Grammar and Semantics.
  2. Handling left recursion in a top down (peg) parser.
  3. Incremental parsing.
I think that the one feature missing to make it applicable for more than rapid prototyping and teaching purposes is performance.

In this benchmark I've authored: http://sap.github.io/chevrotain/performance/ Which uses the simple JSON grammar it is about two orders of magnitudes slower than most other parsing libraries in JavaScript.

So I am sure there is a great deal of room for optimizations.


Yes, we are aware that Ohm's batch parsing performance is not great. In practice, it has been fast enough for our uses -- especially since we implemented incremental parsing. With incremental parsing, Ohm's ES5 parser can be as fast as hand-optimized parsers like Acorn.

But you're right, there is definitely room for improvement. So far, we have been much more concerned with making Ohm easy to learn and pleasant to use. I would certainly be happy to have contributors who are interested in improving our batch performance.

Incremental parsing is indeed amazing for IDE scenarios. But you still have to parse the entire file at least once.

For example it takes 15 seconds to parse lodash.js with Ohm (on my machine) using the sample EcmaScript grammar. But what happens if my IDE has 200 files and 400KLOC of code?

From my personal experience, if you want high performance you have to treat it as an ongoing feature, this could mean:

  * Inspect each new version for performance regressions.

  * Reinspect previous feature implementations for possible 
    performance optimizations.

  * keep track of underlying performance characteristics of your 
    runtime, for example V8 hidden class changes and other 
    de-optimization causes. These characteristics may (and do!) 
    change over time with newer releases of V8...
It would be interesting to try and optimize Ohm.js I even contributed some optimizations to Nearley.js in the past. But I'm afraid I just don't know when I will get around to trying this with too many projects and ideas competing for my time. :(

I've really tried to get on with parser generators, but I've found they are hard to use, hard to debug and the languages/DSLs are clunky and weird. Except for cleanroom academic implementations, or for language designers who can afford the time and resources to learn and get good at a parser generator, I've found its better to simply use regular expressions to do matching and a functional language that can build up a data structure recursively.

Another problem is a lot of them resort to clunky code generation from the grammar file, and when something goes wrong you're not debugging the grammar per se, you are stepping through a bunch of machine generated code that you didn't write yourself. So your debug process looks like make change to grammar, regenerate parser, try parsing again, loop. etc. It replaces the entire file too, so its not like you can isolate areas of the code and work on them like you would regular code. And the time to generate the parser is often times slow.

Also when runtime parsing errors do happen, often the incorrect line/column numbers are reported, and getting good descriptive parser errors is a project in and of itself after you have your grammar written and working.

I agree with most of your points. Re: debugging, we tried to address this with the visualizer that we built into the Ohm Editor: https://ohmlang.github.io/editor/

If you enable "Explain parse" in the bottom right, you'll see that it shows every single step that the parser takes in interpreting your grammar. It's not perfect, but we think it goes a long way to addressing the debugging issues.

Even the creator of Antlr (Terence Parr) Said:

"In my experience, almost no one uses parser generators to build commercial compilers."

https://github.com/antlr/antlr4/blob/master/doc/faq/general..... (no anchors for direct link).

The unwritten corollary of course is that "almost nobody writes commercial compilers." :) Almost all of us do, however, write parsers for data, config files, languages etc... all the time. I'd personally used ANTLR of course for all my parsing needs beyond the trivial.

Does Pandoc count? IIRC it implements CommonMark in a PEG.

I don't understand the question. I am not familiar with Haskell but from what I understand Pandoc is a group of hand crafted parsers (readers) for markup formats.

How does this relate to a discussion on the relevance of parser generators / libraries?


If you check the Parser.hs file there, Pandoc uses Parsec.

Yacc/bison seems pretty common for stuff implemented in C.

I find them pretty useful, the DSLs are generally just slightly modified version of BNF which is pretty straightforward to pick up. Bison + flex is my tool of choice for extensive CLI interfaces in C projects. Regex is useful (hence flex) but for higher level constructs without reinventing the wheel I think it's hard to do much better than a parser generator.


They still do have there place though. I, for instance, was able to import the MySQL grammar for Antlr4 from the github Antlr4 grammar repository and got it working. I use it to mangle table names of all mysql requests during tests to be able to easily isolate tests within a single database that would otherwise stomp on each other.

But I don't think I'd use one for any custom parsing, like my own compiler or any grammar I get to define.

Hey, I’m one of the founders of Nextjournal, the coding, writing and publishing platform this article was written in.

This probably isn’t obvious: you can get a copy of the article and play with it if you click remix and sign in/up.

There’s some more context about what we’re trying to build and why in our launch post https://medium.com/nextjournal/launch-nextjournal-public-bet...

I’ve used Ohm for a few small parsers. What’s great about the editor is that you can take it and share it with somebody else and they’re given insights into _how_ the parser works.

Any thoughts on how to handle parsing for the IDE use case where a document is being edited that might have errors in it. I would usually expect an area around the cursor that is an area that receives edita and hence contains errors. I would also expect a header and footer surrounding the edited area that would be okay structurally since its unchanged from a previously sound definition of the file.

One option would be to use a Parser that supports fault tolerance and error recovery.

In hand crafted parsers this error recovery may be added manually (but resulting in a-lot of work...) For example: search for the word "recovery" in the TypeScript parser which is used to provide Language Services for the VSCode IDE. https://github.com/Microsoft/TypeScript/blob/master/src/comp...

Automatic Error Recovery using heuristics can also be implemented by parsing Libraries. In the world of JavaScript I'm familiar with two libraries that support Error Recovery:

Antlr4 - https://github.com/antlr/antlr4

Chevrotain - https://github.com/SAP/chevrotain (disclaimer - I'm the author of Chevrotain)

You might be interested in Ohm's incremental parsing support. I'll be presenting a paper on this at SPLASH next week: https://ohmlang.github.io/pubs/sle2017/incremental-packrat-p....

In theory, what you're suggesting should be possible to implement with our incremental packrat parsing algorithm. But we haven't tried it yet, so I can't say for sure.


I've had great experience using PEG.js, another PEG-based parser generator. How does Ohm compare?

It's somewhat similar, but the main difference is that Ohm has a strict separation between syntax and semantics. We think this has several benefits, which we describe a bit here: https://github.com/harc/ohm/blob/master/doc/philosophy.md

Another difference is that Ohm grammars can contain left recursion -- both direct and indirect. IMHO this is a pretty big deal, but I know that some people don't agree, and think that avoiding left recursion is not a big problem.

Any thoughts on https://github.com/nikomatsakis/lalrpop, and it's first priority being "Nice error messages"?

Sidenote: I created this issue a while ago, but never got a response: https://github.com/nikomatsakis/lalrpop/issues/180

'Nice error messages' is an important goal - it is one of a few things that tends to stop people from using parser generators for 'production' compilers. Parsing itself is the easy bit, and while things like left recursion can make some things a bit easier, the workarounds are so well understood it is not what stops people.

But error handling does.

I really liked what was done in STEPS project. I learned a lot from their repors. For example, this Ian Piumarta's paper is absolutely beautiful [1]. I also spent a lot of time learning oMeta [3] system by Alessandro Warth.

And, honestly, now I see nothing really new in Ohm. Basically, it's just some tweaking of the same tech. Moreover, Ohm was made for isolated parsing task. For me it's a step back. My point is that the parsing alone is not very interesting thing, for making DSLs you need to have other tools too. In the Ian Piumarta's paper we had a minimalistic program transformation system [2]. Remember original META II [4]? It was a compiler-compiler (metacompiler), not just a parser generator. I'm really curious to know why the authors decided this time to limit themselves by only parsing.

[1] http://www.vpri.org/pdf/tr2010003_PEG.pdf

[2] https://en.wikipedia.org/wiki/List_of_program_transformation...

[3] http://www.vpri.org/pdf/tr2008003_experimenting.pdf

[4] http://www.ibm-1401.info/Meta-II-schorre.pdf

Ohm is not really "just" a parser generator -- but that's the easiest way to describe it.

The big idea in Ohm is its modular semantic actions. You can read more about the design -- and why we think it's interesting -- in our DLS paper: https://ohmlang.github.io/pubs/dls2016/modular-semantic-acti...

Thank you for the answer!

I understand that separation of the grammar and semantics has its benefits. You can use the same grammar description with different semantic rules sets etc. It's, indeed, a clean and interesting approach.

But, as I understand, Ohm still has no support for context-sensitive grammars, which is more important to have in many cases, than proper left recursion handling.

And oMeta had another nice feature: meta-rules (higher-order rules) which is absent in Ohm, if I understand correctly.

Ohm tries to be very user-friendly, but at the price of droping the functionality. So in this case Ohm is not a modern replacement for oMeta (which had the AST transforming features -- very important for making compilers!).

I'm not trying to be negative and I really wish a big success to your team!

> But, as I understand, Ohm still has no support for context-sensitive grammars, which is more important to have in many cases, than proper left recursion handling.

Right, we don't support context-sensitive grammars yet. But we'd like to -- we're just trying to figure out how to do it in a way that fits Ohm's design principles. I'm optimistic that we'll be able to do that.

It struck me as mostly the point of Ohm. Bring the ideas behind the various OMetas into an environment we can actually use it.

Sorry to be negative and this comment probably doesn't belong in a discussion about a specific parsing toolkit but I've become unconvinced that parser generators are useful. My experience is limited to Yacc/lex back in the old days (quickly jumped to Bison/flex), more recently Antlr and a couple of functional parser combinator libraries. In nearly all case it was to deal with "real world" (i.e. not toy) programming languages.

The last time I needed a parser (in Java), I started studying the Antlr docs (it's changed quite a bit since I used it last) but became disillusioned quickly with the amount of reading and studying I would have to do to get something working.

So I quickly wrote a "hand crafted" tokenizer and recursive descent parser. I found this so satisfying that it made me wonder why I had bothered learning relatively complex tools in the past particularly since I had been exposed to recursive descent parsing as an undergrad.

Advantages that pop into my head:

- The code was clean, readable and very concise. For debugging, the stacktraces were helpful and I could use my regular debugger/IDE to step through the parsing process. The method names in my Parser class mostly matched the names of corresponding grammar rules.

- You can code around the theoretical limitations of recursive descent parsing in a very intuitive manner (e.g. "if (tokens.peekAhead(1).getType() == Token.LEFT_BRACE) { parseX(); } else { parseY(); }"). In theory it might seem this would lead to a mess but it actually allows very flexible and natural abstractions.

- You have complete control over the building of the AST - the parseX(...) methods can take arguments or the calling parse method can manipulate the returned AST - doing stuff like flattening (normalising) node trees or re-ordering child nodes, etc. The shape of the AST can be independent of the structure of the grammar rules.

- It's easy to provide helpful error messages and even error recovery without fighting with the toolkit. Better still, you can start with a fairly lazy generic error handler and later, in a natural style, add special cases to make the messages more and more helpful for specific common user mistakes. I sneakily logged all parse failures by users to constantly improve error reporting. After a while the parser seemed almost like an AI when reporting errors.

- For parsing expressions, there is a relatively well-known way to deal with operators with different arities and associativity rules (by adding a numeric "context binding strength" parameter to your parseExpr() method) - a quick google provided the template.

- The entire parser was self contained in a small number of reasonably compact classes: a Lexer/Tokenizer class, a Parser class and a SymbolTable class (and of course a TokenType enum and an ASTNode class). Other developers could grok the code because it was compact and self contained without having to learn a parsing toolkit.

- You feel in control; i.e. you can add features to the language and the parser incrementally without fearing that sinking feeling you get when you think you're 99% of the way there only to realize that the tool you're using makes the last 1% impossible forcing you to rethink/rewrite already "banked" functionality.

- Zero dependencies and trivial to integrate into the build and test process.

edit: paragraphs

I was surprised when I revisited the GCC code after 20 years to find that GCC switched from using flex/bison to using a hand-written recursive descent parser (even for C++ which is reputed to be "impossible" to parse).

Here's the C parser: https://github.com/gcc-mirror/gcc/blob/master/gcc/c/c-parser...

and the C++ parser: https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/...

For many of the same reasons H2 (h2database.com) use a recursive descent parser to parse SQL

In many parser generators (e.g. Yacc and ANTLR), a grammar author can specify the language semantics by including semantic actions inside the grammar. A semantic action is a snippet of code — typically written in a different language —that produces a desired value or effect each time a particular rule is matched.

Actually, the need for that went away with ANTLR4. The grammar is now all grammar (and lexer) and the semantic actions are listeners or walkers written separately calling or overriding methods and classes generated from the grammar.

Much cleaner that way.

It did not exactly went away.

It is still possible to embed semantics inside an Antlr4 grammar.

For example see the Antlr4 EcmaScript grammar sample: https://github.com/antlr/grammars-v4/tree/master/ecmascript which uses embedded code to solve the RegExp vs division operator ambiguity.

Another scenario when embedding code could be preferred is optimizing for maximum performance as abstractions normally come at a performance overhead.

I do agree that the default approach should be to separate the semantics unless there is a very good reason why not to...

I stand corrected. I rewrote my grammar with ANTLR4 and gutted all the embedded semantics. That was a good day.

I'm going to hit reply now and then I'm going to take out my personal neuralyzer and forget that I ever found out that you can still embed.

IMHO, nothing makes parsing as easy as snobol/spitbol. It is almost as old as lisp, and older than C.

The question I have as a mere mortal user, who is not interested very much in theory and debates thereon, is what has the fastest performance?

If the proponents of post-snobol PEG/packrat were to publish a "parsing challenge" and let us replicate/create benchmarks of different parsers, including some written in snobol, I would find that very useful in determining whether these other parsers are worth a more serious look.

I have been using Nearley.js [1] and have had a lot of fun using it. I actually quite liked being able to mix in the JS post-processing with the grammar definition in Nearley but could be convinced of the advantages of keeping the separate (checking out your paper on DSLs now).

How would you compare it to Nearley? Can Ohm handle ambiguous grammars?

[1]: http://nearley.js.org

Pretty cool for sharing as it can run in the browser.

So click on the 'Remix' button and you can play around with and run the article's contents.

Is there a way to play with this using Node.js as well?

"Further reading" links at the bottom just link back to the same page.

Thanks. Fixed!

> The Ohm language is based on parsing expression grammars (PEGs), which are a formal way of describing syntax, similar to regular expressions and context-free grammars

Uh-oh. I've voiced my concerns about PEGs (and LL parsers) before, but IMO any grammar "interpreter" that doesn't point out the ambiguities in grammar and instead relies on some vague, and ultimately arbitrary, notion of "precedence" (e.g. that rules declared first in the grammar file have priority), isn't a good foundation for a serious language (good for throwaway parsers and language experiments, though).

I'm not sure how this can be a critism - you inherently need to decide between left and right recursion - and you can, and it is explicit in the ohm grammar:

> 4.1. Left Recursion

> Another interesting thing to note about the new definition of "Exp" above is that it is recursive — i.e., its body contains an application of "Exp" itself. More specifically, it is left recursive, meaning that the recursive application is the first expression in the branch.

> Many PEG-based parser generators do not support left recursion — requiring grammar authors to use repetition or right recursion instead. But left recursion is the most straightforward way to express left associative operators, which is why left recursion is supported by Ohm.

Ed: -- Or are you making some other, more subtle point? If so, maybe you can give an example?

Ed2: @yxhuvud's reply links to an article with some examples like:

  L1 = ("a"|"aa")"a"

  L2 = ("aa"|"a")"a"

A good example of the kind of ambiguitiy I mean is the "dangling else" problem.

Sure, for this example, the solution is easy - just declare whichever you want first! But in more complex grammars, conflicting rules can be declared very far away, and it can be highly non-obvious which should be prioritised, or that there even is a conflict! LR parser generators point out these conflicts (the infamous shift/reduce and reduce/reduce conflicts), which I find very useful.

> you inherently need to decide between left and right recursion

I assume that you mean that you have to do that to get a parser that can parse in linear time? Because it is certainly not true in the general case.

But even if you mean that, it is wrong, as there are optimizations to the Earley parser that allows it to be linear on both grammars that contain both left- and right-recursive rules at the same time.

(It can still be quadratic or worse on internal recursion, and on many ambiguous grammars).

I meant that if you want your language to be unambiguous in what tree it produces, you have to choose (somehow not be ambiguous). You can be explicit in your grammar and semantics of your grammar, or you could throw away some set of "wrong" trees - or you can be implicit - allow your grammar (under certain semantic assumptions be ambiguous) - and assume left /right.

But if the grammar and semantics allow (force) you to be explicit - I'm not sure I see it as a problem. I certainly see how allowing ambiguity with a warning/error would also be good.

I'm on a cellphone now, and can't see if I can make useful/surprising grammar for the a/aa|a/aaa languages in ohm.

Maybe it's just too minimal an example for me (why not: a|aaa or aa+ - or whatever one is trying to express? Is it, say about the language of pairs or triplets of "a"s?).

Well, most parsers have made a global choice of left and right. If the choice is on rule or symbol level, then that would allow a lot more flexibility - probably to the point that it would be powerful enough to parse most things without tedious rewriting of the grammar.

Is this a limitation of the PEG syntax itself? IOW, is it possible to identify ambiguities in grammars defined as PEGs?

Ambiguity means different things to CF and PE grammars. For CF it is obvious: two derivations are possible for a given string (or prefix, depending on your flavor).

But for a PE grammar, it's harder to define. It's not "an ambiguity in the equivalent CF grammar" because what does equivalent mean ? Same derivations ? Then it's not ambiguous, because PEG only allows single derivations. Same rules ? It's obvious that you can't treat a PE grammar as if it were CF and still expect it to work.

I believe the actual problem isn't ambiguity, but rather brittleness. What transformations can be applied to a grammar while keeping the same meaning ? In other words, how easily can it be refactored ?

Consider this small grammar:

    DEF    -> ('int' | 'bool') ID ('=' EXPR)?
    VARDEF -> 'var' ID '=' EXPR
In a context-free grammar, you have mathematical proof that you can convert the above to this cleaner version without altering the meaning:

    DECL -> ('int' | 'bool') ID
    DEF  -> ('int' | 'bool' | 'var') ID '=' EXPR
In a parsing expression grammar, this conversion might change the meaning, depending on what EXPR can recognize.

Personally, I'd rather deal with shift/reduce conflicts than try to determine, by hand, whether a minor refactoring of my grammar will cause some obscure edge case to be parsed differently.

Possible? Maybe.

But the problem is more that, (according to Wikipedia), the choice operator in PEGs (i.e. e1 | e2) is in fact defined as ordered choice, i.e. it prefers the first alternative.

They try to sell this as a "solution" to ambiguous grammars, as an advantage, but they're just ... wrong. It's as if Java, when resolving method overloading, arbitrarily prefered the method that's declared first in the source file, instead of refusing to compile ambiguous code, as it does now.

> They try to sell this as a "solution" to ambiguous grammars

But it is a solution... the grammar is no longer ambiguous if you define choice as giving priority to one side or the other. There's no need for scare quotes! It is a solution that removes ambiguity. There is no longer any ambiguity, and there's nothing 'vague' at all about a rule as simple and clear as this.

> but they're just ... wrong

You'll have to give a more convincing argument than that if you want to persuade people!

> It's as if Java, when resolving method overloading, arbitrarily prefered the method that's declared first in the source file, instead of refusing to compile ambiguous code, as it does now

But if you had this rule, then the code isn't ambiguous any more is it? It has a well-defined way to decide which method to use, which everyone can understand and implement.

> But it is a solution... the grammar is no longer ambiguous if you define choice as giving priority to one side or the other.

Sure it's no longer ambiguous to the computer. But the important question is: is it ambiguous to a human?

Take the "dangling else" problem. What does this mean in C?

   if (a)
   if (b) f();
   else g();
If you defined your grammar with a PEG, the answer is: whichever alternative you put first (if-with-else or if-without-else). But that answer doesn't help someone actually trying to use your language unless they go and read your PEG. What user wants to do that?

Worse, it keeps language designers from being aware when they accidentally put gotchas like this into their languages. The PEG tools can't warn you, because to a PEG tool there is no problem. As a real-world example of this, it was not discovered that ALGOL 60 had a "dangling else" ambiguity until the language had already been published in a technical report. A CFG-based tool could have warned the designers about the ambiguity, but with PEG-based tools you are designing blind.

Ambiguity of a grammar is rather unrelated to how surprising it can be to a human. Something like TypeScript:

    var a = { label: f() };
    () => { label : f() };
These constructs look similar, but one is an object literal and the other is a block with a useless label. All of this can be implemented as an unambiguous context-free grammar.

Relying on grammar ambiguity detection to find constructs surprising to humans is not very effective, if only because of the difference between how a human understands the grammar (pattern-based) and how EBNF expresses it (prefix-based).

This is a red herring. Context-free grammar tools don't solve the problem of keeping a language from ever being confusing. However they do solve the problem of allowing literal ambiguity into your language.

Ambiguity is strictly worse than confusion. Ambiguity means you have to communicate more information to your users: when two parses are both syntactically valid, which one does the language actually choose?

> Relying on grammar ambiguity detection to find constructs surprising to humans is not very effective

Well yeah, obviously, that's like relying on unittests and type system to keep your program bug free... I mean, neither can guarantee your program is going to work, but either failing means that it almost certainly won't work (well).

Your example seems to support my argument, not yours!

Dangling else's are solved by formally expressing a fixed priority in the grammar.

> unless they go and read your PEG

Well yes they need to read the documentation for the language... how else were they managing to write a program in it? How is it any difference from having to read the documentation to understand the precedence of different operators?

Dangling else is solved by changing the definition of the language. Newer languages don't have the dangling else problem, because we learned the hard way in the 60s how to avoid it.

PEG-based tools invite more mistakes like this, because they can prevent the discovering of ambiguities until it is too late to fix them.

Yes, operator precedence is another example of ambiguity: we live with it because infix math is useful enough that we live with the fact that you have to learn the precedence rules. But with PEG-based tools, you can accidentally introduce ambiguities that might be easy to avoid if you just knew about them at the design stage.

I would never design a language with PEG-based tools, because I would have no idea where I might be introducing pointless ambiguity.

> prevent the discovering of ambiguities

But there are no ambiguities to discover if you use a PEG! A language grammar defined using PEG cannot be ambiguous!

It's very strange that you keep returning to this. "Dangling else" is a problem regardless of how you write down your grammar or implement your parser.

It's like sweeping a mound of dirt under the rug and then saying "by definition, there is no dirt on the rug!"

I guess I just can't understand where you are coming from then.

Someone said that PEGs don't solve ambiguous grammars. I said that's wrong - they do - they are no longer ambiguous. Now people are arguing about having to understand the grammar and how languages should be designed and things like that? Seems irrelevant to me.

I thought there was one precise technical question - do PEGs make solve the problem of ambiguous grammars. Yes they do - they are no longer ambiguous. If you define your language using a PEG you will never have any ambiguity in your grammar. Seems solved to me!

> do PEGs make solve the problem of ambiguous grammars. Yes they do

That's a problem noone ever had.

By this definition, even grammars generated by yacc and Bison (LR parser generators) aren't ambiguous - even if there are conflicts a working parser will be produced. They even resolve conflicts in a "sensible" way - shift/reduce conflicts are resolve in favour of shift (i.e. longest match - like regexes) whereas reduce/reduce conflicts are resolved in favour of the earlier rule (like PEGs). Hell, I even left one shift/reduce conflict unresolved in a parser I wrote a few days ago (amply documented, of course), because I know precisely where the ambiguity, no, confusion stems from, and I decided that rewriting the grammar to resolve the conflict explicitly would be too much work and would make the grammar much less readable.

So I really see no point in PEGs (except possibly ease of use). LR parser generators apply more-or-less the same conflict resolution, except that they will proactively warn you about where your grammar is, or could be, confusing.

> I thought there was one precise technical question - do PEGs make solve the problem of ambiguous grammars. Yes they do - they are no longer ambiguous.

Everyone understands that. We all agree that this is formally true: PEG grammars have no ambiguity. No one is disputing that.

The question is whether the convention of ordered choice is useful. Does it solve conceptual ambiguities?

Take tomp's example: imagine that we changed Java to resolve overloaded functions differently. Right now Java will throw an error if two overloaded functions both match a call site. Imagine that Java was changed to successfully compile the file, and resolved the ambiguity by arbitrarily picking the first matching function.

Now there is no ambiguity! Just like PEGs! But is this behavior useful? I would rather know about the ambiguity, and have the compile fail.

It is a "solution" because it becomes unobvious what language it will parse, not because it doesn't resolve the ambiguity.

I can't see why that would be the case. When I read the grammar I can see the choice operators and I know they have priority. What is unobvious about that?

I'm aware of ordered choice in PEGs. I'm not entirely convinced that's a real problem in practice (or am missing some compelling example). E.g. regular expressions can be ambiguous too, but are still used very extensively and people tend to not complain about them. Is longest match - a common resolution to multiple matches in regexps - also problematic?

I'm also wondering if this specific complaint wrt ambiguity can be easily addressed by just implementing an analyzer that identifies ambiguities in a PEG grammar by evaluating `|` as alternative choice.

Regarding your regexp question, it depends on what implementation of regexp you use. If you go by the formal definition and the state machine parsers optimized for that, then multiple matches is not problematic for finding the longest match. It can be if using some common extensions though, that change the running time to exponential. See https://blog.codinghorror.com/regex-performance/ for example.

I'm not talking about performance though. I'm talking about the fact that when given an ambiguous regexp, most engines match it happily and return the longest match, instead of erroring out saying 'ambiguous input provided'. This is not considered problematic in practice (or I don't hear criticisms about this at least).

Contrast with the the criticism of PEGs - that they don't report 'ambiguous input' but just go ahead and match using ordered choice.

Ah, that is what you mean.

Then the answer is that they are used for different things. PEGs are typically used to parse a string, while regexps are typically used to recognize a pattern.

Where regexps are used for parsing, then ambiguity is definitely a problem that very often will have to be resolved by rewriting the regexp to be less accepting.

In practice, most Regexes fit on a line, and while not trivial to read, they don't involve a deep tree-like structure. Grammars, on the other hand, do, and ambiguity can be very sublte.

Is ordered a choice a necessary property for PEG parsing? Maybe PEG parser generators should have an "unordered mode".

In practice, they should distinguish between / (I meant this alternative to be ordered, because I care about something) versus | (no ordering required, because I know only one choice will ever be picked), and come up with a way to determine that a | generates an ambiguity.

Yes. https://jeffreykegler.github.io/Ocean-of-Awareness-blog/indi... have some discussion on it, and references for further reading (and a whole lot of making his own parser look powerful in comparison. It certainly is. I just wish the code of his parser was closer to the C I've seen before).

While PEGs necessarily impose a notion of "precedence" due to their construction, LL parsers don't necessarily impose any such notion. Any precedence that an LL parser might have is a consequence of its implementation. It's also possible to check an LL(1) grammar for ambiguities by computing an LL(1) parsing table and checking that any given (state, input) pair has at most one entry in the table. The process for other sorts of LL parsers is similar.

I understand your concern, but can you tell me a bit more about how you think "serious" languages should be developed?

That is a very good question because afaik many(most?) commercial (meaning "serious"...) programing languages are developed using hand crafted recursive decent parsers which usually mean that the first alternative takes precedence and that there are no grammar validations to detect ambiguities at all.

I do agree that grammar validations and ambiguity detection is very important, but even if the semantics of a PEG grammar define that the first matching alternative should be taken, should it be possible to detect some cases such as unreachable alternatives as long as the grammar is structure is known in advance?

It's not hard to point out ambiguities, it's hard to fix them. That's why PEGs are nice, they don't make humans work as much, since fixing ambiguities is just a matter of precedence.

But I don't think the world of context-free grammars even matters that much.

Pointing out ambiguities in a context-free grammar is an undecidable problem (http://www.cis.upenn.edu/~jean/gbooks/PCPh04.pdf).

No, that's not what your link proves and not what we are talking about.

Theorem 6.8.2 in the PDF: It is undecidable whether a context-free grammar is ambiguous.

PEGs are unambiguous by construction, so I assumed we were talking about grammars which would be ambiguous as CFGs (replacing every ordered choice with a nondeterministic choice). What kind of ambiguity were you thinking about?

Any ambiguity that manifests itself as unreachable alternatives in PEGs can be detected for example. Doesn't matter if it's undecidable whether the whole grammar is ambiguous.

Those popup chats on articles like this gross me out...

I'm just trying to read something, stop phishing for my email address.

Sorry about that. We didn't mean to show it to visitors and a bug prevented us from quickly disabling it. We've now removed it completely.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact