Hacker News new | past | comments | ask | show | jobs | submit login

PEGs are one class of cleanly composable grammars. OMeta uses that attribute of theirs to great advantage. And PEGs are, formally speaking, parsable in worst-case linear time, although whether that linear time is fast enough to be practical is still unknown, and it also potentially uses large linear space. I hope it's not vulgar to post yet another link to my minimal PEG parser generator in one page of code: https://github.com/kragen/peg-bootstrap/blob/master/peg.md

In particular, the example here of embedding SQL in C would be pretty trivial to do as a PEG.

PEGs can parse some languages that are not context-free — the example from Bryan Ford's thesis is aₓbₓcₓ, where xᵢ means "i repetitions of x". I think there are also context-free languages they can't parse, but I'm not sure.

(Operator precedence parsing is particularly ugly in PEGs, since by default, they don't support left recursion. The OMeta folks figured out how to cleanly support left recursion in PEGs, but you lose the linear-time guarantee.)

One very appealing attribute of PEGs is that you can conveniently extend PEGs to support parameterized productions, which could be crucial in building up a library of general-purpose parsing productions that could reasonably be used to shorten the time to define new languages.

k4st claims in http://news.ycombinator.com/item?id=2330672 that Pratt top-down operator-precedence parsers are also cleanly composable. I didn't know that, and it seems like a surprising claim, but I don't really understand Pratt parsers yet. Crock has written an excellent article on Pratt parsers as his Beautiful Code chapter, and his JSLint is written with one. I think this is the same text as his chapter: http://javascript.crockford.com/tdop/tdop.html

Earley parsers can parse any context-free sentence in worst-case O(N³) time. Context-free grammars can, of course, be easily composed. Quoting Earley's abstract, "It has a time bound proportional to n³ (where n is the length of the string being parsed) in general; it has an n² bound for unambiguous grammars; and it runs in linear time on a large class of grammars, which seems to include most practical context-free programming language grammars."

John Aycock has written an Earley parser generator called SPARK which is included with Python, and is reportedly pretty efficient. He's advocated using it specifically because of the composability concern. http://pages.cpsc.ucalgary.ca/~aycock/spark/

Laurie Tratt just posted this other thing about parsing that's currently on the front page of HN; it's really excellent, and duplicates much of what I've written above: http://tratt.net/laurie/tech_articles/articles/parsing_the_s...




A PEG parser generator will produce a recursive descent parser, which should be exactly what the author was looking for. That pesky operator precedence thing is solved (as it is in most BNF-like grammars) with rules nested in precedence order. Rats! is a PEG parser generator for Java. http://www.cs.nyu.edu/~rgrimm/xtc/rats.html If the author would like more links, he should google "PEG parser generator".


asking as a total noob, how does a PEG grammar compare to ANTLR's grammar? seems very similar


ANTLR is a parser generator for CFGs and allows for flexibility in regards to ambiguity and context-dependent grammar by means of predicates. But the generated parsers are LL(k) (with enhancements, LL(star) they call it) and this brings certain problems with it, like the rules cannot be left-recursive.

As Terrence Parr said "LL recognizers restrict the class of acceptable grammars somewhat".

PEG rules cannot be ambiguous, as rules are tried out in order until one matches. Also recognizing tokens is part of processing rules, rather than having a separate lexer (which ANTLR does - lexer rules are separate from parser rules, and the code generated reflects that) - which means it is easier to deal with certain kinds of ambiguities.

This can work to your disadvantaged however. For example languages where whitespace is significant, like in Python where blocks are delimited by indentation level, the lexer needs to be hacked by hand (i.e. you have to cheat somewhere, because that's context-dependent and a bitch to deal with).

In general, ANTLR generated parsers are more efficient and allow for certain hacks to be implemented easily, but PEG grammars are easier to write and allow for certain context-dependent rules that are very hard to implement in ANTLR. And PEG grammars need backtracking as an implementation detail and although memoization reduces the impact of that, the result will have less performance.

Basically for quick prototypes I would choose PEG grammars and for industrial strength something like ANTLR.

Btw, here's a cool Java/Scala library for creating PEG parsers, and it's much lighter and easier to setup than ANTLR: https://github.com/sirthias/parboiled/wiki

And the syntax is basically Java/Scala: https://github.com/sirthias/parboiled/wiki/Simple-Java-Examp...

AND the Scala way of defining grammars is so cool I can't even describe it - basically you can create new grammars by inheriting existing grammars - say you have a general SQL syntax that you want to specialize for MySQL / PostgreSQL. Works like a charm ;)


PEGs backtrack to handle multiple alternatives, and can bound their parsing time to linear by memoizing; ANTLR generates LL(*) parsers (despite its name!) which use a token-lookahead mechanism to bound their parsing time to linear (and, incidentally, run much faster).




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: