My approach has been to introduce a type of cut operator, such that if you backtrack past a cut operator then its information is stored; then, if the parser can't proceed, it can examine the stored errors and hopefully produce sensible error messages.
For example, if I have a simple rule for an expression
exp = atom (op exp)*
exp = atom (op <cut> exp)*
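A minimal sketch in Python of how that `exp = atom (op <cut> exp)*` rule could behave (all names here are illustrative, not from any real library): once an `op` has been consumed, the cut means a failure to parse the trailing `exp` is recorded rather than silently backtracked over.

```python
# Sketch of a cut operator in a tiny recursive-descent/PEG-style parser.
# Parser, cut_errors, etc. are hypothetical names for illustration.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.cut_errors = []  # failures recorded past a cut

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def atom(self):
        tok = self.peek()
        if tok is not None and tok.isdigit():
            self.pos += 1
            return int(tok)
        return None

    def exp(self):
        # exp = atom (op <cut> exp)*
        left = self.atom()
        if left is None:
            return None
        while self.peek() in ("+", "-"):
            op = self.tokens[self.pos]
            self.pos += 1
            # <cut>: past this point a failure is recorded, not just backtracked
            right = self.exp()
            if right is None:
                self.cut_errors.append(
                    f"expected expression after '{op}' at token {self.pos}")
                return None
            left = (op, left, right)
        return left

p = Parser(["1", "+", "2", "+"])
result = p.exp()
# The parse fails, but p.cut_errors now explains why: the trailing '+'
# has no right-hand operand.
```

A well-formed input such as `["1", "+", "2"]` parses to `("+", 1, 2)` with no recorded errors.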
With this grammar:
expr: expr '+' term | term
But with this grammar:
expr: expr '+' <cut> term | term
expr: expr '+' API_call_here? term | term
expr : atom (op <cut> expr)*
expr : atom (op (expr / cut))*
cut : "XXX"?
expr : atom (op expr?)*
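One way to read the `expr : atom (op expr?)*` variant above: because the trailing operand is optional, the parse can no longer fail at that point, and a later pass over the tree reports the hole instead. A small sketch under that reading (function names are mine, not from the source):

```python
# Emulating <cut> with an optional tail: expr : atom (op expr?)*
# A missing operand after `op` still yields a tree; a separate pass
# reports the hole. All names here are illustrative.

def parse_exp(tokens, pos=0):
    """Parse `atom (op expr?)*`; atoms are digit strings, ops are + and -."""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        return None, pos
    node, pos = int(tokens[pos]), pos + 1
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op, pos = tokens[pos], pos + 1
        right, pos = parse_exp(tokens, pos)  # may be None: the `?` part
        node = (op, node, right)
    return node, pos

def report_holes(node, out):
    if isinstance(node, tuple):
        op, left, right = node
        if right is None:
            out.append(f"missing operand after '{op}'")
        report_holes(left, out)
        report_holes(right, out)

tree, _ = parse_exp(["1", "+", "2", "+"])
errors = []
report_holes(tree, errors)
# `tree` is complete apart from a None operand, and `errors` names it.
```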
For instance, you are responsible for declaring any possible branches with choice operators.
The approach becomes quite similar to manual lookahead in a recursive descent parser, but more declarative.
I presume that this makes the runtime characteristics of the parsing a lot more predictable and bounded, and I'm surprised I hadn't thought of the solution before. Seems like a great idea, and it worked well for my use case.
Packrat caches intermediate results so that you don't have to factor your rules out to avoid trying to parse the same thing repeatedly. PEGs can already do as much lookahead as you care for, and Packrat doesn't optimise that, except between multiple choices. So it allows you to write more natural grammars.
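The effect can be shown in a few lines (a toy, not a real packrat implementation): two alternatives of a `stmt` rule share an `exp` prefix, and memoising on position means it is only parsed once.

```python
from functools import lru_cache

# Packrat in miniature: memoising (rule, position) means the two
# alternatives of `stmt = exp "=" exp / exp` parse the shared `exp`
# prefix only once. The grammar and tokens are a toy for illustration.
TOKENS = ["1", "+", "2"]
CALLS = {"exp": 0}  # counts real (uncached) invocations

@lru_cache(maxsize=None)
def exp(pos):
    """Parse `exp = atom ('+' atom)*`; returns (tree, newpos) or None."""
    CALLS["exp"] += 1
    if pos >= len(TOKENS) or not TOKENS[pos].isdigit():
        return None
    node, pos = int(TOKENS[pos]), pos + 1
    while pos + 1 < len(TOKENS) and TOKENS[pos] == "+" and TOKENS[pos + 1].isdigit():
        node, pos = ("+", node, int(TOKENS[pos + 1])), pos + 2
    return node, pos

def stmt(pos=0):
    # First choice: exp "=" exp
    r = exp(pos)
    if r is not None:
        tree, p = r
        if p < len(TOKENS) and TOKENS[p] == "=":
            r2 = exp(p + 1)
            if r2 is not None:
                return ("=", tree, r2[0])
    # Second choice: bare exp — without memoisation this would re-parse
    # the same span; with it, exp(pos) is answered from the cache.
    r = exp(pos)
    return r[0] if r is not None else None

result = stmt()
# Both alternatives needed exp(0), but it only actually ran once.
```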
So how does a PEG parser solve these annoyances? By using an infinite lookahead buffer! The typical implementation of a PEG parser uses something called “packrat parsing”, which not only loads the entire program in memory before parsing it, but also allows the parser to backtrack arbitrarily. While the term PEG primarily refers to the grammar notation, the parsers generated from PEG grammars are typically recursive-descent parsers with unlimited backtracking.
I read that as implying it generates a typical recursive-descent parser, with the classical degenerate branching cases that TDOP partially solves.
Arbitrary backtracking is a property of PEGs though - you don't need Packrat for that. The original 1970s Birman and Ullman paper that this all goes back to is 'Parsing algorithms with [arbitrary] backtrack'. Packrat made it linear.
These differences probably aren't worth worrying about too much though.
The way I've seen it presented, PEG is primarily a style of grammar (basically, prioritized choice instead of normal alternation) and Packrat is a parsing algorithm.
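The difference between prioritized choice and normal alternation fits in a few lines (a toy matcher of my own, just to make the point concrete):

```python
# Prioritised choice: try alternatives in order and commit to the first
# that matches. A CFG's alternation is symmetric; a PEG's `/` is not.
def choice(alternatives, text, pos):
    """Return the new position after the first literal that matches, else None."""
    for lit in alternatives:
        if text.startswith(lit, pos):
            return pos + len(lit)
    return None

# `"a" / "ab"` applied to "ab": the first alternative wins and consumes
# only "a" — the second alternative is never considered, even though it
# would consume more. A CFG alternation `a | ab` permits either parse.
print(choice(["a", "ab"], "ab", 0))   # 1, not 2
print(choice(["ab", "a"], "ab", 0))   # 2
```

This ordering sensitivity is exactly why PEGs can't be ambiguous: at every choice point there is at most one winner.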
A PEG is imperative. It's already an 'algorithm' and already describes how to parse it. That's different to traditional declarative grammars that people are used to, which need an algorithm to go with the grammar.
Or alternatively you could claim no CFG needs an algorithm either, since you could just use recursive descent on everything...
You don't need an extra algorithm to parse. You'd just like one to make sure it runs reasonably fast.
You can't use recursive descent on everything - how do you build a recursive descent parser for an ambiguous grammar? You'd have to introduce some extra rules. That's the parsing algorithm, distinct from the grammar.
I think you could probably also use GLR (or similar) to parse a language described by a PEG; you'd just need to prune the extra branches. If I'm not mistaken, that should even give you a better worst-case complexity than Packrat. So I really don't see how you can say PEG is an algorithm: you have plenty of options for which algorithm to use; it's just that people prefer a particular one.
Here's a PEG for tweets:
It would not have been practical to write in the absence of the huge existing test suite.
The operator overloads are a lot to reckon with, beyond just the concepts, but it seems like there's something there worthy of study. Hope to give it another try.
Be interested in any introductory material, if anyone has some good links.
This is probably the best reason I've seen why you should learn parsing.
They've become somewhat unfashionable over the last decade - not really sure why.
If anything I feel like PEGs are getting more popular lately, likely because the memory usage is no longer terrible relative to the amount of memory even on "memory constrained" systems.
LR and LL have the reverse problem. It's a challenge to get them to recognise a complex target language, but once you get the thing to work at all, it's probably right. Also, LL and LR are faster (very close to O(size of input)) and use less memory (almost none). PEGs get their speed by remembering (memoising) how they reduced what they've seen so far.
There are lots of trade-offs, in other words. PEGs win anywhere you would use one of the current crop of NFA-based regular expression recognisers (e.g., PCRE) but need something better.
Much, much larger memory consumption than LR/LALR or even GLR. They're called packrat parsers for a reason.
Although the time complexity is asymptotically the same, LR-family parsers have such a tiny memory footprint that they usually run entirely in cache. Packrat tables never fit for nontrivial grammars+inputs. In practice this makes a big difference in performance.
Also, PEG makes having a context-sensitive lexer not merely trivial but no effort at all, because lexing and parsing are all the same process. If you want two different token types for an INT-looking thing at two different places in your grammar for, "reasons", you are free to do so. Try that in a discrete lex/parse-style grammar.
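A sketch of what that looks like in practice (a scannerless toy of my own devising, not from any library): two rules consume the same digit run but classify it differently, which a standalone lexer couldn't do without feedback from the parser.

```python
# Scannerless style: each grammar rule does its own "lexing", so the
# same INT-looking text gets a different token type depending on which
# rule consumed it. PORT_NUM/OCTET and the rules are hypothetical.

def digits(text, pos):
    """Consume a run of digits; return (lexeme, newpos) or None."""
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    return (text[pos:end], end) if end > pos else None

def port_decl(text, pos=0):
    # port_decl <- "port " digits   -- here the digits are a PORT_NUM
    if not text.startswith("port ", pos):
        return None
    d = digits(text, pos + 5)
    return (("PORT_NUM", int(d[0])), d[1]) if d else None

def octet(text, pos=0):
    # octet <- digits               -- the same digits become an OCTET
    d = digits(text, pos)
    return (("OCTET", int(d[0])), d[1]) if d else None

print(port_decl("port 8080"))  # (('PORT_NUM', 8080), 9)
print(octet("8080"))           # (('OCTET', 8080), 4)
```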
(You might not think you'd need a full blown parser library for something with a grammar as simple as Lisp but I found that there are enough grammar rules and the lookahead is deep enough that using a PEG parser was a win.)
Too bad they are so slow.
> Unlike CFGs, PEGs cannot be ambiguous; if a string parses, it has exactly one valid parse tree. It is conjectured that there exist context-free languages that cannot be recognized by a PEG, but this is not yet proven.
Does this mean that PEG grammars are a form of regular grammar?
It will come back to you.