Hacker News new | past | comments | ask | show | jobs | submit login

Can someone summarize what PEG is? I read through the Wikipedia page but can't make heads or tails of it. What is this used for?



PEG is a different way of specifying grammars and an different process for parsing them. If you're used to Bison/Yacc or LALR parsers modeled on Yacc (like Racc), the big things you notice with any PEG parser are:

* You automatically get "EBNF"-style operators like "zero-or-more" or "one-or-more" instead of having to specify recursive nonterminals and epsilons.

* You typically don't need a separate lexer; PEG parser productions resemble regular expressions.

* PEGs use prioritized choice to deal with ambiguities such as dangling-else. PEG grammars are unambiguous.

PEG parsers are much easier to build and work with than Yacc-style parsers. The learning curve on PEG is also way, way shorter than for shift-reduce parsers. You should probably be using PEG parsers whenever possible now.


I've long been a PEG skeptic. All three of these benefits have issues:

(1) LL parser generators have had zero-or-more and one-or-more operators for a long time (see ANTLR for example [1], as well as the venerable Parse::RecDescent [2]). They're very easy to implement in a recursive-descent parser.

(2) In practice, lexer-free grammars are very difficult to maintain in the presence of things like comments and whitespace rules. You sacrifice the ability to reason about layers of the grammar in isolation by binding the two layers together. Try to write a PEG grammar for, say, Java and you'll see what I mean; you end up wishing you had a sort of preprocessor that could get rid of comments and such in advance, which is precisely what a lexer is.

(3) PEGs are unambiguous in the sense that the system specifies how to resolve ambiguities (by taking the first option). But backtracking LL grammars and LR grammars provide this too. LL parser generators typically accept the first rule that matches: what could be simpler? LR parser generators are a little more clever and resolve shift/reduce conflicts in favor of the shift, which is a bit more magical but basically means that productions act "greedy". The key is that grammars always specify how to resolve ambiguities in some way, and this is really no different from what PEGs provide.

[1]: http://www.antlr.org/wiki/display/ANTLR3/ANTLR+Cheat+Sheet [2]: http://search.cpan.org/~dconway/Parse-RecDescent-1.965001/li...


I don't care enough about this stuff to have strong opinions on (1) and (3). Until this year, I wrote my parsers (I end up doing 2-3 a year) in Lex/Yacc.

But (2) is a red herring. Nothing about PEG tools force you to do all your tokenizing in the grammar. If you want to strip /++/ comments out, you can use exactly the same code you'd have used in your handrolled recursive descent parser to do that, and then parse the real stuff in PEG.

But that aside, we use a PEG grammar for Javascript, which has (I assume) all the annoying comment syntax you're referring to, and it doesn't seem like a big deal.


> we use a PEG grammar for Javascript [...]

Anything open-source, or that you're in the mood to share?



For a counterpoint:

* you can get "EBNF"-style operators using traditional LL parsers like ANTLR.

* Likewise, top-down parsers can have integrated lexing also.

* The downside of prioritized choice is that you don't get any insight into what inherent ambiguities your grammar has. The dangling-else ambiguity in C (for example) is a real ambiguity that you have to tell your users about. Prioritized choice hides ambiguities and gives you no hint when you develop your grammars where the ambiguities are.

* PEG-based parsers are less efficient than LL parsers. LL parsers are O(n) in time and O(d) in space (where n is the length of the input and d is the depth of the nesting). To parse a PEG in O(n) time the space complexity becomes O(n) (this is a packrat parser). To parse with O(d) space, time complexity becomes O(n^d) (as with lpeg). And the constant factor for LL parsing is much lower than with packrat parsing.

I think the future is LL. When the tooling gets good enough with LL, I think that the primary motivation for PEGs (that they are easy to use) will evaporate.


I knew someone would call me on this and chose my words carefully --- "LALR parsers modeled on Yacc" --- but am glad for the response. ANTLR is a pain in the ass to integrate; you can get a totally respectable PEG parser in 2k lines of C. What would you use for LL parsing?


I'm working on making my ideal tool: http://www.reverberate.org/gazelle/

It's been a long time coming, but I believe very strongly in it.

"pain in the ass to integrate" -- I can sympathize. I'm working tirelessly to make this as easy to integrate as regexes are in Perl, Python, Ruby, etc.


The name of the pdf linked to on the language.js page is "Packrat Parsers Can Handle Practical Grammars in Mostly Constant Space" (www.ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf).

They achieve this by adding a Prolog-style Cut operator to the PEG language.


PEG parsers get rid of the dangling else problem with recursive descent parsers where it's unclear in the following statement whether

   if e then s1 else if e then s2 else s3
   
the else s3 should be with the first if or the second if - usually you have to code special rules into your parser. It also allows for a worst-case linear time parse, and there are some language rules you can express with a PEG parser that you can't with a recursive descent parser.


Wouldn't you just have an end symbol for if statements, "end" or "}" or something? That's what I did in my little recursive parser.

Maybe what I should ask is, how does a PEG parser resolve it?


In a PEG you operate under the expectation that all rules will be tried exhaustively, so if one fails you just move on to the next, and if they all fail you drop back a level of recursion.

Thus you can have rules like:

"if <if-expr>" where if-expr can contain "<statements> else" or "<statements> if" or "then <statements>", where statements is a set of predefined keywords, another set of PEG rules, or a regular expression matcher.

And then you can just expect that they'll all be tried at some point.

The main issue with this, of course, is the one you raised, with endings and creating rules for delimiting new blocks. Although it's possible to define a good set of rules, it's still hard to distinguish between a continuation of a statement and the start of a new one, and in most cases you won't get a useful syntax error message using a naive PEG system like the one in PEG.js, because it'll just drop all the way to the top level and then fail at the first character without telling you how far/deep it got before the syntax broke.

So in practice having end symbols is still useful to make the grammar operate sanely. PEG can also be used to execute a relatively flat pass that mostly produces tokens and simple expressions, while other methods are applied to the resulting tree to get the complete structure. This is an approach which I find a little less mind-bending than trying to cram everything into PEG rules, since you can use varying methods of parsing as the situation dictates.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: