Hacker News new | past | comments | ask | show | jobs | submit login
PEG Parsers (medium.com)
151 points by X-Istence 34 days ago | hide | past | web | favorite | 49 comments

One thing that I like about PEG (among many) is that it's fairly easy to create useful error messages from them.

My approach has been to introduce a type of cut operator, such that if you backtrack past a cut operator then it's information is stored, then if the parsee can't proceed it can examine the stored errors and produce hopefully sensible error messages.

For example, if I have a simple rule for an expression

  exp = atom (op exp)*
And I insert a cut between the op and the atom...

  exp = atom (op <cut> exp)*
Then that cut (which can be appropriately annotated with human readable error information) will be returned if there is a binary exp with no rhs

Any suggestions for discriminating between a partial parse and an error parse? For example a REPL in which on newline if the parse can never be correct versus a parse that could be made correct with more input.

This has always struck me as significant effort invested in making a bad user experience only slight less bad: the REPL overloads the meaning of "enter" to mean both "insert a new line" and "execute command", and then attempts to guess what the user meant by tweaking the grammar of the language. Why not distinguish between "insert a new line" and "execute command" as two distinct inputs, and eliminate all ambiguity ?

I haven't used it for that, but I have used it to absorb errors and continue. The error just gets placed into the AST and then a later phase can report on (potentially) many errors, rather than just stopping at the first. YMMV

I used this exact same technique for the most recent PEG-based parser I wrote too. Works really well, and integrates well with Prolog (which I was using).

Wait... what is a "cut"? A minimal sample that show how it help with the error messages?

Cut means "if you matched this far, don't consider other options even if the current option fails to parse".

With this grammar:

    expr: expr '+' term | term
And this input:

    5 +
It will first consider the first alternative (expr '+' term) but will fail to match that branch because it's missing the "term" after the "+". That's fine because there's another alternative to try (term). That alternative doesn't match either, and you'll get the error message from the last alternative that it tried (probably something like "unexpected +").

But with this grammar:

    expr: expr '+' <cut> term | term
Then with the same input ("5 +") it will consider the first alternative, fail to match, and then fail immediately with the error message from that first alternative (something like "missing term after +").

yeah, I get the purposed point of a "cut" but how I put one? Exist a basic sample? (ie: Using a peg parser how define what is a "cut"???)

    expr: expr '+' API_call_here? term | term

Hmm, if you're using a peg parser that doesn't directly support this you can emulate it by choosing between the rhs of the cut and a production that will always match


  expr : atom (op <cut> expr)*
Would be replaced with

  expr : atom (op (expr / cut))*
  cut  : "XXX"?
or perhaps simpler

  expr : atom (op expr?)*
Either way when you process the expr rule there will be an empty match where the rhs should be, and an appropriate error can be generated

Drothlis comment describes it well, but to perhaps explain where the term comes from, Prolog has a related concept and calls it the cut operator

I recently used the Rust "Pest" PEG parser, and it took an interesting approach to lookahead. It delegates the problem to the grammar writer.

For instance, you are responsible for declaring any possible branches with choice operators.

The approach becomes quite similar to manual lookahead in a recursive descent parser, but more declarative.

I presume that this makes the runtime characteristics of the parsing a lot more predictable and bounded, and I'm surprised I hadn't thought about the solution before. Seems like a great idea, and worked well for my usecase.


That's just a PEG isn't it - that's what this article is about. It's not a new idea in Pest.

My understanding is Pest doesn't use packrat, and thus doesn't have the unbounded lookahead mentioned in the article. I figure that unbounded lookahead comes with a lot of downsides, and my opinion is that Pest solves lookahead in an interesting way. It felt relevant given the focus on lookahead in the article.

Packrat isn't really a parsing algorithm - the PEG is the parsing algorithm. Packrat is instead an optimisation that can be transparently applied to a PEG. It's a common cause of confusion. And to make it more complicated some PEGs aren't really tractable without Packrat, so then that's sort of not an optimisation any more.

Ok, thank you for clarifying. If you don't mind me asking, is the manual approach to lookahead common to most PEG parsers? And if so, why would packrat be associated with infinite lookahead?

The manual approach to lookahead is standard for PEG. You mean things like the negative predicate when you talk about lookahead, right? That's in Ford's 2002 thesis. And the operator precedence? I was doing that in my masters thesis in 2006.

Packrat caches intermediate results so that you don't have to factor your rules out to avoid trying to parse the same thing repeatedly. PEGs can already do as much lookahead as you care fore and Packrat doesn't optimise that, except between multiple choices. So that allows you to write more natural grammars.

I might have misread what Guido is writing in the article then;

" So how does a PEG parser solve these annoyances? By using an infinite lookahead buffer! The typical implementation of a PEG parser uses something called “packrat parsing”, which not only loads the entire program in memory before parsing it, but also allows the parser to backtrack arbitrarily. While the term PEG primarily refers to the grammar notation, the parsers generated from PEG grammars are typically recursive-descent parsers with unlimited backtracking, "

I read that as implying generating a typical recursive descent, with the classical degenerate branching cases that TDOP partially solves.

I suppose you could call the packrat cache an 'infinite lookahead buffer'.

Arbitrary backtracking is a property of PEGs though - you don't need Packrat for that. The original 1970s Ullman paper that this all goes back to is 'parsing algorithms with [arbitrary] backtrack'. Packrat made it linear.

These differences probably aren't worth worrying about too much though.

The manual approach to lookahead is primarily what makes PEG, PEG and resolves the ambiguity in a CFG.

> Packrat isn't really a parsing algorithm - the PEG is the parsing algorithm.

The way I've seen it presented, PEG is primarily a style of grammar (basically, prioritized choice instead of normal alternation) and Packrat is a parsing algorithm.


I'm listed in that bibliography.

A PEG is imperative. It's already an 'algorithm' and already describes how to parse it. That's different to traditional declarative grammars that people are used to, which need an algorithm to go with the grammar.

In your own thesis you point out that the naive recursive-descent algorithm has exponential runtime and that's why we use Packrat parsing instead? Meaning you have a choice of at least 2 algorithms that you need to pick between when parsing with your PEG, not just one?

Or alternatively you could claim no CFG needs an algorithm either, since you could just use recursive descent on everything...

'Risks exponential runtime', is what it literally says. Ford also said 'risk of exponential parse time'. PEG without Packrat can work for many grammars just fine. I was also talking in the context of PEG under composition, where the problem is at its worst.

You don't need an extra algorithm to parse. You'd just like one to make sure it runs reasonably fast.

You can't use recursive descent on everything - how do you build a reclusive descent parser for an ambiguous grammar? You'd have to introduce some extra rules. That's the parsing algorithm, distinct from the grammar.

Ambiguity shouldn't be relevant? If it's ambiguous then you can just keep going after the first match. The only thing you should need is to eliminate left recursion, which is a grammar transformation you can do before you run your recursive descent... I think left recursion elimination + recursive descent should cover recognition for all CFGs?

I think you can probably also use GLR (or whatever) to parse a language described by a PEG too. All you should need to do is to just prune the extra branches. If I'm not mistaken it should get you a better worst-case complexity than Packrat too. So I really don't see how you can say PEG is an algorithm... you have plenty of options for what algorithm to use, it's just that people prefer a particular one.

Yeah, that's why they're tricky. They will often produce /something/, but it might not be what the author expects.

Here's a PEG for tweets: https://github.com/sayrer/twitter-text/blob/master/parser/sr...

It would not have been practical to write in the absence of the huge existing test suite.

I've made a few half-hearted attempts to use lpeg:


The operator overloads are a lot to reckon with, beyond just the concepts, but it seems like there's something there worthwhile of study and learning. Hope to give it another try.

Be interested in any introductory material, if anyone has some good links.

> Remove all the Javascript from an html page.

This is probably the best reason I've seen why you should learn parsing.

They're... beautiful.

I keep feeling like I'm following in the steps of giants with lpeg. I loved my introduction to it when I was big on Lua. Some of those operators take functions with such strange signatures - until you try to do something clever later, and your realize you're being passed exactly the arguments you need to make it work.

Bryan Ford has the most comprehensive information about PEG and how they're parsed http://bford.info/packrat/.

They've become somewhat unfashionable over the last decade - not really sure why.

They're one of Perl 6's killer features (much like how regexes were one of pre-6 Perl's killer features); first-class support for PEGs in the language itself is a godsend for parse-heavy tasks like deserialization and DSLs.

If anything I feel like PEGs are getting more popular lately, likely because the memory usage is no longer terrible relative to the amount of memory even on "memory constrained" systems.

It's very easy to write a PEG that will recognise the a target language, but it will probably recognise a few more as well. They are hard to reason about.

LR and LL have the reverse problem. It's a challenge to get it to recognise a complex target language, but once you get the thing to work at all it's probably right. Also LL & LR are faster (very close to O(size of input)) and use less memory (almost none). PEG's have get their speed by remembering how they reduced what they've seen so far.

There are lots of trade-offs in other words. PEGS win anywhere you would use once of the current crop of NFA based regular expressions recognisers (eg, PCRE), need something better.

They're alive and well in lots of DSLs implemented with LPeg. MoonScript (which I use almost daily) parses using LPeg and the general notation inspired some dynamic type validators (e.g. tableshape).

> They've become somewhat unfashionable over the last decade not really sure why.

Much, much larger memory consumption than LR/LALR or even GLR. They're called packrat parsers for a reason.

Although the time complexity is asymptotically the same, LR-family parsers have such a tiny memory footprint that they usually run entirely in cache. Packrat tables never fit for nontrivial grammars+inputs. In practice this makes a big difference in performance.

Packrat parsing is not an intrinsic aspect of PEG, nor does it increase memory usage. Further, the memory issue is more academic than real. If you are having memory issues, you will add intentional pruning to your parser to purge anything that comes before a point behind which you can not backtrack. Such points tend to exist in most computer languages. Yes, this is not a strict part of PEG but being academically pure when constructing a parser of any complexity leads to a short trip to madness. Perl 6's grammar construct is not pure to the PEG definition but it is a recursive descent grammar engine with all the bells and whistle you could want including functional components such as argument pattern matching which makes generic and reusable grammar expression a cinch.

Also, PEG makes having a context sensitive lexer not simply trivial but no effort at all because lexing and parsing are all the same process. It you want to have two different token-types for an INT-looking-thing at two different places in your grammar for, "reasons", you are free to do so. Try that in a discrete lex/parse style grammar.

If anybody is thinking about writing a parser in Ruby I can recommend the Parslet (https://github.com/kschiess/parslet) PEG parser library. I used it to construct a parser for a Lisp-like grammar and I found it robust and easy to use.

(You might not think you'd need a full blown parser library for something with a grammar as simple as Lisp but I found that there are enough grammar rules and the lookahead is deep enough that using a PEG parser was a win.)

Didn't notice that Guido van Rossum was the author until I got to the end. As someone who has done work with the Python grammar, and watched it grow into the complex mess that the Python 3.8 grammar is, I'm glad to hear that Guido is exploring PEG parsers as an alternative to LL(1).

I'm surprised you missed the fact after reading this in the first paragraph - `...I now think it’s an interesting alternative to the home-grown parser generator that I developed 30 years ago when I started working on Python.`

For anyone interested in following this work, these Discourse threads are interesting:

- https://discuss.python.org/t/preparing-for-new-python-parsin...

- https://discuss.python.org/t/switch-pythons-parsing-tech-to-...

A few years ago, I played with Pegged. Pegged is a PEG parser fro DLang. Beyond about why PEG is good/bad, Pegged shows how powerful is the metaprogramming of DLang. It simply generates the parser from the PEG grammar at compile time, and could execute and parse stuff at runtime and compile time ! https://tour.dlang.org/tour/en/dub/pegged

For flexibility and ease of understanding, I still prefer nondeterministic monadic parser combinators. Not sure why, but its the only way I can write a decent parser.

Too bad they are so slow.

From Wikipedia...

> Unlike CFGs, PEGs cannot be ambiguous; if a string parses, it has exactly one valid parse tree. It is conjectured that there exist context-free languages that cannot be recognized by a PEG, but this is not yet proven.

Does this mean that PEG grammars are a form of regular grammar?

No. One can come up with a PEG for a language consisting only of strings with balanced parentheses. That language cannot be described by a regular grammar.

I don't like PEGs: the choice operator makes it easy to accidentally create ambiguous grammars. The explicit conflict messages in conventional LR-ish parser generators tell you up front when your language is wonky.

with an infinite lookahead it seems that the main advantage of LL over LR is gone, and the one might as well go all in for the LR for its more expressive power.

Non-paywalled version: http://archive.is/E5QtQ


It will come back to you.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact