* You automatically get "EBNF"-style operators like "zero-or-more" or "one-or-more" instead of having to specify recursive nonterminals and epsilons.
* You typically don't need a separate lexer; PEG parser productions resemble regular expressions.
* PEGs use prioritized choice to deal with ambiguities such as dangling-else. PEG grammars are unambiguous.
PEG parsers are much easier to build and work with than Yacc-style parsers. The learning curve on PEG is also way, way shorter than for shift-reduce parsers. You should probably be using PEG parsers whenever possible now.
(1) LL parser generators have had zero-or-more and one-or-more operators for a long time (see ANTLR for example , as well as the venerable Parse::RecDescent ). They're very easy to implement in a recursive-descent parser.
(2) In practice, lexer-free grammars are very difficult to maintain in the presence of things like comments and whitespace rules. You sacrifice the ability to reason about layers of the grammar in isolation by binding the two layers together. Try to write a PEG grammar for, say, Java and you'll see what I mean; you end up wishing you had a sort of preprocessor that could get rid of comments and such in advance, which is precisely what a lexer is.
(3) PEGs are unambiguous in the sense that the system specifies how to resolve ambiguities (by taking the first option). But backtracking LL grammars and LR grammars provide this too. LL parser generators typically accept the first rule that matches: what could be simpler? LR parser generators are a little more clever and resolve shift/reduce conflicts in favor of the shift, which is a bit more magical but basically means that productions act "greedy". The key is that grammars always specify how to resolve ambiguities in some way, and this is really no different from what PEGs provide.
But (2) is a red herring. Nothing about PEG tools force you to do all your tokenizing in the grammar. If you want to strip /++/ comments out, you can use exactly the same code you'd have used in your handrolled recursive descent parser to do that, and then parse the real stuff in PEG.
Anything open-source, or that you're in the mood to share?
* you can get "EBNF"-style operators using traditional LL parsers like ANTLR.
* Likewise, top-down parsers can have integrated lexing also.
* The downside of prioritized choice is that you don't get any insight into what inherent ambiguities your grammar has. The dangling-else ambiguity in C (for example) is a real ambiguity that you have to tell your users about. Prioritized choice hides ambiguities and gives you no hint when you develop your grammars where the ambiguities are.
* PEG-based parsers are less efficient than LL parsers. LL parsers are O(n) in time and O(d) in space (where n is the length of the input and d is the depth of the nesting). To parse a PEG in O(n) time the space complexity becomes O(n) (this is a packrat parser). To parse with O(d) space, time complexity becomes O(n^d) (as with lpeg). And the constant factor for LL parsing is much lower than with packrat parsing.
I think the future is LL. When the tooling gets good enough with LL, I think that the primary motivation for PEGs (that they are easy to use) will evaporate.
It's been a long time coming, but I believe very strongly in it.
"pain in the ass to integrate" -- I can sympathize. I'm working tirelessly to make this as easy to integrate as regexes are in Perl, Python, Ruby, etc.
They achieve this by adding a Prolog-style Cut operator to the PEG language.
if e then s1 else if e then s2 else s3
Maybe what I should ask is, how does a PEG parser resolve it?
Thus you can have rules like:
"if <if-expr>" where if-expr can contain "<statements> else" or "<statements> if" or "then <statements>", where statements is a set of predefined keywords, another set of PEG rules, or a regular expression matcher.
And then you can just expect that they'll all be tried at some point.
The main issue with this, of course, is the one you raised, with endings and creating rules for delimiting new blocks. Although it's possible to define a good set of rules, it's still hard to distinguish between a continuation of a statement and the start of a new one, and in most cases you won't get a useful syntax error message using a naive PEG system like the one in PEG.js, because it'll just drop all the way to the top level and then fail at the first character without telling you how far/deep it got before the syntax broke.
So in practice having end symbols is still useful to make the grammar operate sanely. PEG can also be used to execute a relatively flat pass that mostly produces tokens and simple expressions, while other methods are applied to the resulting tree to get the complete structure. This is an approach which I find a little less mind-bending than trying to cram everything into PEG rules, since you can use varying methods of parsing as the situation dictates.