Use context-free grammars instead of parser combinators and PEG (safinaskar.writeas.com)
100 points by mmastrac 13 days ago | 82 comments





Parsing is an application domain where I'd expect more hard logic and less soft feelings than in this article.

"I feel that CFGs more naturally represent how we think" but

"All this doesn't apply to situations where you trying to parse some language, whose specification explicitly says that the language is based on PEG formalism"

A terrible context free grammar (with left and right recursion, not a small feat) is presented.

"I blindly converted grammar above to Haskell code"

"I accidentally discovered my grammar was ambiguous!"

"All this gave me psychological trauma (figuratively speaking). I will never use Parsec (and PEG parsers) anymore."

Denial that an ambiguous grammar is a defect, denial that the ambiguity has to be resolved, refusal to use parser combinators to resolve the ambiguity in the correct way.

It's also strange to see someone react to a modest, non-shameful failure (guess what, parsing is difficult and language design is even more difficult) that should be a nice learning opportunity (elegant context-free grammars are usually insufficient in practical usage) with anger and radicalization far beyond an acceptable level of blaming the tools.

The other post in the blog, however, follows a similar general pattern: never use a popular technique (de Bruijn indices rather than parser combinators), because they advocate another technique (expressing lambda calculus terms with randomly generated globally unique identifiers for all bound variables rather than pure context free grammars) because that's what they like.


So, I agree with your take as a reaction to this article specifically.

However, I also think that there is a legitimate point here. Which is that parser combinators, even in languages like Haskell, are shockingly tricky to use correctly. In particular, they violate the usual Haskell principle that "if it compiles, then it will probably run correctly". Having written multiple parsers in multiple languages with multiple parsing frameworks (including direct recursive descent), using Parsec in Haskell was the only time I had to write a test suite for the parser specifically. And that's because there were so many footguns that I was repeatedly making mistakes despite the fact that this was a port of a parser I'd already written (and therefore, ought to know backward and forward). (For the record, this was not my first Haskell project either, which is why I was so shocked it went so poorly.)

I'm sure it's possible to make mistakes in CFGs too, but practically speaking I can't remember running into that situation myself. Whereas with Parsec I ran into it repeatedly.
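
To give a concrete flavour of the kind of footgun I mean (this is a reconstructed sketch using parsec, not the actual code from that project): (<|>) does not backtrack once the left alternative has consumed input, so the following compiles happily but silently rejects valid input:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Rejects plain "else": the first branch consumes 'e','l','s','e', then fails
    -- on the missing "if", and (<|>) never tries the second branch once input
    -- has been consumed.
    keyword :: Parser String
    keyword = string "elseif" <|> string "else"

    -- The fix is an explicit try, which is easy to forget and invisible to the
    -- type checker:
    keyword' :: Parser String
    keyword' = try (string "elseif") <|> string "else"

Nothing in the types tells you that keyword and keyword' accept different languages; only a test does.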


PEG parsing is likely to squash some grammar ambiguities in unexpected ways, making a test suite a very good idea; but we shouldn't forget that this kind of failure can only happen starting from an ambiguous grammar specification, so the work of taming PEG precedence or combinator behaviour is not an additional burden of the technology but an additional tool to correct the grammar to become less ambiguous, an alternative to rewriting it in unnatural, verbose and complex ways.

> but we shouldn't forget that this kind of failure can only happen starting from an ambiguous grammar specification

I don't think that's true. The following CFG is unambiguous:

  S ::= A S | A
  A ::= 'a' | 'a' 'b'
Yet if we translate this 1-to-1 to a PEG (without changing the order of the alternatives), it's not going to match any input with 'b's in it, so it doesn't match the same language.
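
To make that concrete, here is a throwaway sketch (plain Haskell, no parsing library, names are mine) of the 1-to-1 PEG reading, where the ordered choice commits to the first alternative that succeeds:

    lit :: Char -> String -> Maybe String
    lit c (x:xs) | x == c = Just xs
    lit _ _               = Nothing

    -- A <- 'a' / 'a' 'b'
    a :: String -> Maybe String
    a input = case lit 'a' input of
        Just rest -> Just rest                  -- the bare 'a' wins whenever an 'a'
        Nothing   -> lit 'a' input >>= lit 'b'  -- is present, so this branch is dead

    -- S <- A S / A
    s :: String -> Maybe String
    s input = case a input >>= s of
        Just rest -> Just rest
        Nothing   -> a input

    -- s "ab" == Just "b": the 'b' is never consumed, so "ab", "aab", ... are no
    -- longer full matches, unlike with the CFG.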

This grammar is difficult to adapt to ordered choice, having 4 possible variants that mostly define different languages from each other and from the "same" grammar with traditional parsing.

It's a formally very different problem from spontaneously and arbitrarily disambiguating an ambiguous grammar, but in practice it causes the same kind of suffering and it's likely to be a more common issue.


I have written a simple compiler in Haskell using parser combinators for parsing. And I couldn't disagree more with you. It was super straightforward writing the parser and it worked immediately. Why do you say they are tricky?

Unfortunately this was years ago so I'm not sure I can dig that far into specifics.

Overall, combinators (almost by definition) have the property that you can plug anything into anything else. Haskell encourages a "point-free" style where you don't even name the parameters (this is sort of the point of parser combinator libraries) and provides a bunch of fancy operators to glue them together in various ways.

I think the mistakes I was making were mainly in just gluing the combinators together in the wrong way. It's not that the grammar was ambiguous, I just messed it up.

You could say some of the same things about CFGs, and I guess that's true at some level. But in my opinion the syntax is more obvious and you don't usually spend time squinting at the syntax to remember how $ works or what order things execute in. And like I said, with parser combinators, there is no feedback (because everything compiles, more or less), so the only way to tell if it's working is to run it, which is a very un-Haskell-like experience.
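
As a contrived example of what I mean (a sketch with parsec, not my original code), these two definitions have exactly the same type, so swapping one for the other compiles cleanly, yet they consume input in different orders:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    pair1, pair2 :: Parser (Char, Char)
    pair1 = (,)      <$> letter <*> digit   -- a letter, then a digit
    pair2 = flip (,) <$> digit  <*> letter  -- a digit, then a letter

The grammar changed, the types didn't, and nothing complains until you run it.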


What language did you compile? Some languages can be much more manageable than others.

Did you adapt an existing CFG, or another parser combinator implementation, or did you start from scratch? The first case is more treacherous.


Started from scratch. It was a basic C-like syntax with sane typing syntax.

> I feel that CFGs more naturally represent how we think

Author could have replaced the “we” with “I” and made it into a mildly interesting diary entry.


> Parsing is an application domain where I'd expect more hard logic and less soft feelings than in this article.

Ah, an S-expr enjoyer I see


it's important to read your comment in the context that we have established in https://news.ycombinator.com/item?id=40582885 that you wrote it before finding out what pegs were

i'm curious if you change your mind once you understand peg's limited backtracking semantics


Well said.

> And if you agree with me that grammar above in intuitive sense is ambiguous, i. e. if you agree that A :: B ==> C allows two parse trees, then your brain thinks in terms of CFG, too! CFGs are simply more natural.

Count my brain as being on team PEG then, so that's also a data point against CFGs being more natural. In fact, I think this example proves that the author of TFA does not think in CFGs, otherwise they wouldn't have accidentally written an ambiguous grammar!

In my experience, it's an extraordinarily tiny fraction of people who actually want parse forests instead of parse trees in the languages they are constructing (TFA specifically excludes pre-existing languages). A PEG will be unambiguous. People want unambiguous. There's a chance it disambiguates against the intent of the author, but a CFG won't disambiguate at all, which will nearly always be against the intent of the author.


The question here is whether it is better to silently guess at the author's intent or to loudly say "your intent isn't clear; try again".

I strongly believe that the second is the better approach for nearly all use cases. When I'm writing a grammar, I'm focused on the happy case and not on how the rules interact with each other. My intent was to write a grammar with only a single result, indeed, but if that's not what I've done, I'd want to know!


I disagree on both counts.

I think the fundamental question here is "how easily can you construct (write or generate) a parser that matches your intent?". And to answer that question, I think you have to look at the entire process.

What good does it do you to hear "your intent isn't clear; try again" if you don't understand where the ambiguity is? Presumably, if you knew where the ambiguity was, you wouldn't have written it -- since you know it can't create a grammar. So really what that message is telling you is, "you don't understand your own grammar" or "you don't understand how to write rules for your PG". And in my experience with several different PGs, the failure messages returned in those cases are... less than helpful. So you end up spending hours banging your head against your desk trying to figure out where the ambiguity is, even with a grammar you know back-to-front.

Furthermore, even if you use a CFG, you still haven't gathered evidence that the resulting parse matches your intent. For that you need test cases. And as soon as you write a test case, you get immediate feedback that your intent isn't clear, regardless of PEG, CFG, or anything else: you see that your resulting parse doesn't match with the expected one.

Given both of those things, it seems to me that issue at hand really does boil down to "which feels more natural, PEG, CFG, etc", which is always going to be a personal preference.


> Presumably, if you knew where the ambiguity was, you wouldn't have written it

Well, no because people make mistakes.

> And in my experience with several different PGs, the failure messages returned in those cases are... less than helpful. So you end up spending hours banging your head against your desk trying to figure out where the ambiguity is, even with a grammar you know back-to-front.

Yeah that's my experience too but is "something is wrong" better than nothing? Maybe.

I really like Nom & Chumsky though, and haven't accidentally made ambiguous grammars (as far as I know), so it feels like a small risk to base your entire parser decision on.


Great! Someone whose brain is on team PEG. You are just the person I need. Can you explain in plain English what strings match the following PEG grammar?

    Str = "a" Str "a" / "a"

Interesting. I've last had to deal with formal grammars in the previous century, and your post had me go scratch the surface of what PEGs are.

If I understand correctly, your example is not a well-formed grammar, because of the left recursion.

So what you're saying is that PEGs align badly with our expectations, tempting us to create non-well-formed grammars, and making it hard to recognize them as such, because our brains want to parse them differently. Right?


it doesn't have left recursion, but grammars in general can have left recursion; it doesn't make them ill-formed. warth even hacked in left recursion as an extension to pegs

My grammar has no left recursion, there is a terminal string before the recursive nonterminal.

Any odd number 2n+1 of "a" repetitions, longest match. The inflating first rule is preferred about 2n times until it fails at input end, then the parser backtracks until the middle "a" is matched by the second rule and the others by n expansions of the first rule.

I'm sorry that is not correct at all. It does not match the string "aaaaa" for example.

Why not? Preferring the first rule, after seeing a you expect Str a; after seeing another a you expect Str a a; after seeing the third a you expect Str a a a (and you can already realize it cannot match) so after some backtracking you match the third a with the second rule instead, expecting a a, which are the fourth and fifth a in the input.

Do you have a different parsing algorithm in mind?


> Do you have a different parsing algorithm in mind?

Yes, since I specified a PEG grammar I assumed a PEG parsing algorithm.

You can try the grammar yourself in a PEG parser, e.g. https://shamansir.github.io/pegjs-fn/.


This tool seems to dislike recursion; it manages to parse aaaaa with

   Str =   "a"  Str2 "a" /"a"
   Str2 =  "a"  Str3 "a" /"a"
   Str3 =  "a"
or also

   Str =   "a"  Str2 "a" 
   Str2 =  "a"  Str3 "a" 
   Str3 =  "a"
I'd call it a bug.

Note that

   Str =  "aa"*"a"
parses any odd number of "a" repetitions correctly.

> I'd call it a bug.

It's not a bug in the tool, it's inherent to PEG. All PEG parsers will behave the same on this grammar.

My point is that PEG is highly unintuitive in how it works to most human brains, which is why I specifically asked for a plain explanation why this happens to someone who claims their brains align with how PEG works.


The ordered choice operator should try all choices, not stop earlier.

There's probably some cache confusion, for instance conflating not matching the first Str rule at the third a with not matching a Str at all (neither rule) at the third a.

The original example (here with parentheses, just in case)

   Str= ("a" Str "a") / "a"
matches strings of 2N-1 "a" repetitions if N is a power of 2 (i.e. length 1,3,7,15,31,63...) which is an obvious symptom of improperly constraining rule expansion. So I insist, you are using an incorrect parsing tool.

Interestingly,

   Str= ("a" Str "a") / "b"
matches strings of N "a", one "b" in the middle and N "a" for any N, without degenerate behaviour.

> The ordered choice operator should try all choices, not stop earlier.

PEGs made this choice to ensure linear-time parsing. Another example of what I would consider a failure is the PEG grammar:

    Keyword = "else" / "elseif"
This will only match "else", it will never match "elseif" as the ordered choice will refuse to turn the successful match of "else" into a failure to try the "elseif".

> So I insist, you are using an incorrect parsing tool.

I agree that PEG is an incorrect formalization for parsing, as it has highly unintuitive and surprising behaviors like this one.

However, I have to insist that this is part of Parsing Expression Grammars, and not an issue with some particular tool implementing PEGs. It's part of the formalization, and if you wish to avoid this you need a different formalization, like Context-Free Grammars.

My personal favorite is LR(1) CFG grammars: you are guaranteed to get unambiguous and linear-time parsing grammars, or else the grammar will fail to compile.

What your intuition probably is is that PEGs are Ordered Context-Free Grammars, e.g. such as in https://arxiv.org/pdf/2309.08717, where first a full ambiguous parse forest is constructed and then using ordered rules a single canonical parse is selected. That is a lot more sensible, but also raises the time complexity of parsing from O(n) to O(n^4) for strings of length n. However, that's not what PEGs are.


> My personal favorite is LR(1) CFG grammars: you are guaranteed to get unambiguous and linear-time parsing grammars, or else the grammar will fail to compile.

We can both agree that LR(1) is great; it parses a large and useful subset of CFLs; had the minimal LR(1) parser been discovered before YACC was written, I probably would never have moved to PEGs; hopefully we can all agree that LALR parsers are terrible?

As a side note, I don't consider LR(1) parsers to be CFGs since they can only parse a subset of CFLs (actually only a subset of DCFLs if I remember my rusty CS correctly).

[edit]

Also TFA argues for Earley parsers, which seems to me to be a poor choice for parsing programming languages.


> Hopefully we can all agree that LALR parsers are terrible?

Yes, I think most of the bad reputation LR(1) gets is from stuff that is actually LALR(1) with its nonsense reduce-reduce conflicts.

Another great thing is that LR(1) parser generators are being written now that, instead of just spitting out an 'oops, there was a conflict on these two things in this state', can actually give you back two conflicting prefixes, giving you much more insight into what causes the defect in your grammar and how to fix it.


> Keyword = "else" / "elseif"

Are there any languages in which this rule makes sense? I can't imagine wanting to write

  ifFOOelseifBARelseBAZ
And ending your token rules with WS|EOF removes any issues, so

  if FOO elseif BAR else BAZ
works just fine.

I worked this out by hand and I'd recommend you do the same. The paper defining the behavior of PEGs is here: https://bford.info/pub/lang/peg.pdf . Section 3.3 is the relevant one, defining all the parsing rules.

As for the simple explanation of what's going wrong:

1. We try to parse `aaaaa` as if it opened with the pattern `a STR a`. [And, with foresight, let's observe that this is true of the way we want the parse to go.]

2. We match the `a` at the beginning of the input and try to parse `aaaa` as if it opened with the pattern `STR a`.

3. This repeats; we make the same guess that we're dealing with a recursive STR, we match an `a`, and then we try to parse `aaa` as if it opened with `STR a`. I didn't mention this before, but the first step of doing this is to match the first element of the concatenation, `STR`, against the input.

4. Here's where things go wrong. We still guess that we're dealing with a recursive STR, because that rule has priority over the terminal rule `STR = a`. With foresight, let's observe that in the parse we want, this guess will fail, because we need the third STR to be a literal `a`. In that world, we'd then match the two following `a`s and our parse would succeed.

5. However, when we try to match our recursive rule `a STR a` against our input `aaa`, this succeeds, because that's a correct match. We have an outer recursive STR and an inner terminal STR, yielding `aaa`.

6. Since our first guess that the suffix `aaa` opens with a recursive STR succeeded, we will never experiment with what would happen if it opened with a terminal STR, the way we wish it would.

7. That was our only chance; the whole parse will now fail to match in predictable ways.

-----

> The ordered choice operator should try all choices, not stop earlier.

You can't defend PEGs by saying you wish they were defined differently. That's not a defense!

Look at the paper:

> Alternation (case 1): If (e₁, xy) => (n₁, x) then (e₁ / e₂, xy) => (n₁ + 1, x). Alternative e₁ is first tested, and if it succeeds, the expression e₁ / e₂ succeeds without testing e₂.

It's hard to be clearer than that.


that's not how pegs work, and if you add that kind of unrestricted backtracking to pegs, you get a grammar class that nobody knows how to parse in linear time

pegs, by contrast, like ll(1) and lr grammars, have a linear-time parsing algorithm available (though its constant factor is significantly higher). that's one of the reasons people use them

you should read an introduction to pegs; you might like them once you get past 'i insist peg parsers are incorrect parsing tools' ;)


No, orlp is right, as far as I understand. It's not a bug in the tool, but inherent to PEGs in general. Wikipedia has a section on this problem: https://en.wikipedia.org/wiki/Parsing_expression_grammar#The...

I’ve never touched a PEG parser in my life and parsing isn’t super my thing in general, but: a recursive descent parser could parse this just fine (not left recursive), correct? But it would have to backtrack after trying rule 1 five times, and switch to rule 2 when it backtracks to the third layer of recursion?

PEGs can’t do this kind of backtracking, is that the reason?


yes, exactly correct

it's common to write recursive-descent parsers without this kind of backtracking support, though, because expressing that kind of backtracking in a natural way requires coroutines


Wow, this is surprising to me. Why doesn’t this work?

See the nearby comment by adwn: aggressive "packrat" caching fixates on long matches starting at intermediate positions, without considering any shortest matches at the same position.

Correct parsing would require discarding these long matches because they cause higher-level rule expansions to fail, but that would require smarter algorithms, with an increase in runtime due to complete backtracking and/or an increase in memory usage to distinguish and remember different matches for the same rule at the same location.

EDIT: two nearby comments by adwn.


this is not specific to the packrat parsing algorithm; any algorithm for parsing pegs has the same behavior as packrat. your understanding of peg semantics is incorrect

(I'll use position 1...5 to refer to the characters in "aaaaa", and first, second, third level to refer to the initial and then the recursive calls to the Str rule)

1. When the PEG parser enters the third level of Str, its first choice will successfully match on positions 3...5, and return to the second level of Str.

2. The first choice on the second level will fail, however, because it expects an "a" at position 6, which doesn't exist. Therefore, the first choice is discarded.

3. The second choice on the second level will succeed by matching "a" on position 2, successfully returning to the first level.

4. The first level finishes matching the first choice by consuming the "a" on position 3, successfully returning the parse tree Str = ("a" ("a") "a")

5. The remaining "aa" on positions 4...5 weren't consumed.
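
If you don't want to take the hand-trace on faith, here is a throwaway matcher (plain Haskell, nothing beyond the Prelude, names are mine) that implements exactly this committed ordered choice for the one grammar in question:

    -- Str <- "a" Str "a" / "a": the first alternative that succeeds wins and is
    -- never revisited. Returns the unconsumed suffix on success.
    str :: String -> Maybe String
    str input = case lit 'a' input >>= str >>= lit 'a' of   -- "a" Str "a"
        Just rest -> Just rest                               -- commit
        Nothing   -> lit 'a' input                           -- only then try "a"

    lit :: Char -> String -> Maybe String
    lit c (x:xs) | x == c = Just xs
    lit _ _               = Nothing

    -- A full match means nothing is left over.
    matches :: Int -> Bool
    matches n = str (replicate n 'a') == Just ""

    -- str "aaaaa" == Just "aa", exactly as in the trace above.
    -- filter matches [1..40] == [1,3,7,15,31]

No packrat cache is involved; the behaviour comes purely from the committed choice.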


Beautiful example! `2^n -1` times a?

I love PEG parsers though


Yes, that is the correct answer, although you're missing the 'explanation' part for why this happens :)

His problem isn't primarily the choice of formalism, but that his grammar is wildly overcomplicated and makes it really hard to see what language it specifies irrespective of formalism.

This is a common problem.

In this case the amount of recursion makes it hard to even guess at what it is trying to express.

If my grammar looked like that, I'd throw it away and start over, because the problem to me appears to be not quite having a clear idea of what the grammar is expressing.

EDIT: Trying to clean up his grammar into something readable

    t0 = t1 | a
    t1 = t3 ("==>" t1)?
    t3 = (id | d)+ c? | b

    a  = "!!"  id c "." t0
    b  = "%" id c "." t3
    c  = "::" t0
    d = "(" t0 ")"
t999 in the original was left-recursive, but in this case that's a matter that can be left to the parse tree construction if he insists on keeping it that way. For your own sanity don't mix left and right-recursion in a parser, and decide how you structure the parse-tree in terms of left/right binding separately, and only express that in the grammar if your tool doesn't let you express it when building the tree, or if it actually changes the language recognised.

One thing to note is that my first instinct when writing parsers is to look for opportunities to factor out rules that start with a terminal, because it's usually a good way of avoiding or reducing the need for lookahead. Hence the extra rules, which to me at least make it a lot clearer. The part of the grammar in which to look for places where your formalism causes ambiguities is now much smaller.

This grammar still gives me hives, but without knowing his constraints and intent, it's tricky to see if it can be cleaned up further.


> If my grammar looked like that, I'd throw it away and start over, because the problem to me appears to be not quite having a clear idea of what the grammar is expressing.

That's fine if you are working in a purely academic domain, or you're just having fun writing something on your free time.

In production settings, you need to keep extending whatever language you have. The alternative is having to start to maintain two separate parsers and a migration tool, and be stuck with that for years.

The point is that being able to throw away code is a luxury that virtually no one is able to afford.


No, I'm not talking about an academic setting.

This is not about the language but about his expression of it in this specific instance of the grammar.

As the article makes clear, he didn't understand the language, and my claim is that is to a large extent down to how it is expressed.

Not throwing away this specific grammar instance is something you can't afford.


sometimes having an lalr parser generator tell you where the ambiguities are in your grammar can be a useful step to understanding the language you're trying to describe. you're probably right that once you do understand it, you should throw away the grammar given as an example and write a new one ;)

Sure, especially for anyone not used to writing parsers, using a parser generator is a good way of validating assertions about it. But for that parser, just starting by cleaning it up would be a big improvement irrespective of formalism. Even if you get an unambiguous parser, if you can't read the grammar you'll end up with nasty surprises anyway...

agreed!

> In production settings, you need to keep extending whatever language you have.

Then the hard work of leaving behind simplistic, ambiguous, imprecise, inefficient grammar specifications has already been done (mostly) and you are ready to add small new features without serious rewrites.


Can't you prove that your new grammar describes the same language as your old one, or mechanically convert both to some normal form and check that they're identical or something like that?

What do you mean? Stuff gets refactored all the time.

I think it all depends on the language, but I found it (usually) easier to just implement recursive descent parsers manually. This may include a combinator approach too, but using my own combinator and domain-specific functions.

In a nutshell such a parser is a PEG in the sense that the choice is ordered, but with little to no test expressions. In other words, I try to stick to lookahead-1 most of the time. Left recursion can quite often be resolved by lifting the branches to their siblings rather than re-parsing the same things many times using the cache (packrat). In fact both approaches, packrat and node lifting, are essentially the same, but node lifting is usually easier to maintain (again, it all depends).

Writing the parser manually, or in other words a set of relatively simple mutually recursive functions, in my opinion, is not the hardest problem in practical engineering. At least it's not so hard when you have a mental model of the language's syntax upfront.

If the language fits a specific class of formal grammars very well, perhaps it would not be so hard to implement it using a parser generator tool. But it is likely there will be edge cases that are hard to express properly using the formal grammar, and fighting with the tool's specifics will waste a lot of your time rather than benefit you.
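
To illustrate the lifting I mean, here is a toy sketch (in Haskell here, with invented names; the shape is the same in any language): the left-recursive rule expr ::= expr '+' term | term becomes a loop over the siblings that folds to the left:

    data Expr = Num Int | Add Expr Expr deriving Show

    -- Each rule is a function from the remaining input to a node plus the rest
    -- of the input; lookahead-1, since we only ever inspect the next character.
    type P a = String -> Maybe (a, String)

    -- expr ::= term ('+' term)*   -- left recursion lifted into a loop
    expr :: P Expr
    expr s = do
        (t, rest) <- term s
        pure (loop t rest)
      where
        loop acc ('+':rest) = case term rest of
            Just (t, rest') -> loop (Add acc t) rest'   -- fold to the left
            Nothing         -> (acc, '+':rest)
        loop acc rest = (acc, rest)

    term :: P Expr
    term (c:rest) | c >= '0' && c <= '9' = Just (Num (fromEnum c - fromEnum '0'), rest)
    term _ = Nothing

    -- expr "1+2+3" == Just (Add (Add (Num 1) (Num 2)) (Num 3), "")

Adding a new operator is just another arm in that loop; no re-parsing and no cache needed for a case like this.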


> I think it all depends on the language, but I found it (usually) easier to implement the recursive-descending parsers just manually.

I think that packrat parsers are even easier to implement by hand, and even troubleshoot.

It's all procedural code, and you just call the next routine to figure out if you have a token match or not.

More importantly, PEGs promise to be easier to maintain in smaller custom parsers. To support an enum, all you need to do is add a function call, and that's it. One line of code. Can't beat that.


> I think it all depends on the language, but I found it (usually) easier to implement the recursive-descending parsers just manually. This may include combination approach too, but using my own combinatorial and domain-specific functions.

Working in what languages did you make this observation?

In something like C I would probably agree with you. But in eg Haskell or OCaml, I would probably prefer parser combinators.


In other functional languages I tend to reach for combinators, but in OCaml I either use Menhir or regret not using it.

I work with Rust most of the time. As of the pure functional languages, yes, perhaps using parser combinators is better.

The reality however is that many applications find a recursive descent parser easier to handle once the grammar has been fixed (and in spite of its many caveats!), so PEG is a natural formalism for them.

For example I once tried to write a full parser for Lua in LALRPOP and gave up after a week because it was too slow to compile and too hard to avoid conflicts; I ended up writing a recursive descent parser in a single day. I already knew that LALRPOP doesn't resolve shift-reduce conflicts and also how to transform the grammar to avoid them, but that wasn't enough for me.

I think there is a niche for CFG-like formalisms with bounded PEG-like constructs, so that a grammar can be quickly adjusted without giving up all the benefits of CFG. I have already been in situations, multiple times, where I really wanted such a formalism, including the aforementioned one.


> Python spec says Python is PEG, so you should use PEG parser for parsing Python.

Python switched to PEG in (IIRC) 3.9; the language standard, parser and grammar were previously CFG, specifically LL(1).

That switch implies that maybe the author is only correct until your language is widely (if not wildly) successful?

https://peps.python.org/pep-0617/


I think the main issue here is that you in general just don't "get a CFG" parser which magically works correctly with nice ergonomics and great performance.

You get an LL(1), LR(1), LALR(1), etc. parser, and the more of the CFG space they can parse, the more drawbacks there often are.

And making a programming language fit into more limited parsers like LL(1) is a total pain.

So you need something which is more powerful, but still not too complicated to have a fast, efficient implementation for Python and Python tooling.

At which point it can be simpler to just go with PEG, especially if you have the people and brainpower to make sure you don't slip up by resolving ambiguity in an arbitrary, "accidental" way.


The core.py episode on the transition to the parser mentioned limitations of the LL(1) parser wrt new syntax. [1][2]

There were some positive side effects to the move and it enabled better exceptions, with more detailed information.

[1] https://podcasts.apple.com/us/podcast/episode-8-the-new-pars...

[2] https://podcasts.apple.com/us/podcast/episode-7-the-old-pars...


I both agree and disagree.

I would avoid PEGs; they are a huge footgun.

But a lot of things you need to parse are file formats designed to be reasonably straightforward to parse, and in many such situations using parser combinators is all of: the fastest and easiest to write, the easiest to debug, the easiest to maintain, the smallest code size (potentially by a lot), the fastest build times, and often even the fastest to execute.

So a very clear, straightforward win for parser combinators.

Though the moment you parse (programming) languages and similar instead of data formats, things often change, leading to a massive increase in complexity and in ways you can subtly get it wrong (e.g. not explicitly handling operator precedence). And here proper grammar-based parsing frameworks can shine, sometimes.

But even there they don't always shine, as they are huge, complicated pieces of code which often require running some complicated code gen to have any chance of acceptable perf/code size, and which are often hard to debug if you run into a bug. Furthermore, parsing arbitrary CFGs is complicated and potentially slow, so most frameworks prefer you to use a specific subset like LR(1) grammars. And understanding what exactly that means for your grammar, and whether that works for your use case, is neither obvious nor intuitive and can waste a lot of your time. Sure, you probably should have learned all about it (if you have a CS bachelor's or similar degree), but when was the last time you used that knowledge?

So in the end, even for such cases, using a PEG or parser generator while being _very_ _very_ careful about grammar design is sometimes still the better choice (even though it shouldn't be; it's a better choice born of the limitations of existing tooling).

As a side note, a good hint as to whether using a CFG or a parser combinator is the better choice is often to ask yourself whether a tokenization pass would make parsing much simpler. For things where using a CFG is a terrible idea (e.g. parsing msgpack), the answer is often that it doesn't make sense at all. For things where tokenization makes sense but is literally most of the parsing (e.g. JSON), it's often just easier/better to not use a CFG. But the moment it seems to make a lot of sense (e.g. a programming language), it also tends to be the best choice to use a CFG (or similar) grammar-based parser.


This also applies to LLM constrained generation. Instead of outputting something that promised to be a JSON (but maybe it's not, or maybe it's an incomplete JSON, or maybe it's a JSON but doesn't conform to your JSON schema), just use CFGs (e.g., GBNF) to guide the model to generate the exact JSON you want. It's guaranteed, and you don't have to send lengthy JSON schemas to the LLM.

With the caveat that you significantly change the distribution of the output. Picking on JSON with current-gen LLMs, consider

(1) Retry till the output passes the grammar

(2) Only generate tokens passing the grammar

Assuming (2) actually runs (not guaranteed for all grammars and distributions theoretically, basically never a problem in practice), one of many common failure modes is forcing the generation of grammar-valid tokens after an error has been made.

E.g., one sort of "truncation" error happens when the LLM is producing an object with metadata around a transcription (e.g., an author, timestamp, and blob of text). A common incorrect result looks like

{"author": "John Doe", "timestamp": "20:23:58", "transcript": "My name is John Doe, and ...

Those ellipses weren't just cutting off my description of the invalid data. The model will produce ellipses after the string carries on for too long. If you greedily produce tokens adhering to a grammar, your only options at that point are continuing the string and ending it, and both options are very wrong.

Contrast that with option (1). Empirically, if the model is willing to actually produce those tokens rather than cut off its answer, it's much more likely to be correct (i.e., the presence of that ellipses bug will tend to cut off the entire response, not just truncating the string and still generating a valid object). If you just retry till the model succeeds then you have a substantially higher accuracy in that task.

That particular example may or may not apply to your favorite problem and favorite model, but it highlights the problem that constraining to a grammar forces a large skew in your result distribution. A large part of the power of an LLM is its autoregressiveness, and other sampling algorithms only tend to perform well when there's some outside knowledge relevant to the real-world problem being solved (e.g., if the LLM has >50% accuracy on a task with a small, finite number of outputs, then you might want to drop the temperature to 0 and just sample repeatedly, selecting the most likely option), but messing with the result distribution without a rationale for why you're still likely to get good results is risky.


Let's say the grammar forces a JSON like this:

    {"transaction_number": <number>}
Is this also problematic? My understanding is that the model is first forced to generate "transaction_number", which then hints to the model that it must generate the number of the transaction, not something else. I understand that maybe the model didn't want to start writing "transaction_number" without enforcing the grammar, but after being forced to, it has to only continue the generation.

In general, all bets are off. The model has a certain auto-regressive property (a factoring of conditional probabilities), and every other sampling should be examined with extreme suspicion. At a bare minimum, a lot of real-world problems in your desired domain should be spot-checked.

For more constrained examples like that, you're probably a lot safer than with completely unconstrained strings.

Even for very simple constraints like that, it's not terribly hard to imagine exactly the same ellipses problem (especially if your chosen JSON grammar isn't very careful with unbounded vs 53-bit integers). You really want a non-negative integer, you're using JSON and don't want to roll your own grammar, so you use an off-the-shelf JSON LLM CFG toolkit of some flavor, and you allow _numbers_ in that field. I haven't personally seen ellipses bugs after only a few dozen characters in current-gen LLMs, but to have something concrete to talk about, let's say your model "tries" to write a transaction_number of 3141592635.... It wants to truncate the actual result with an ellipsis. On the first period, the only allowable results include more digits (the grammar prevents the actual ellipsis from happening). You successfully adhere to the grammar and have garbage in that field, where a simple retry loop would likely have completed correctly.

Assuming the grammar actually correctly restricts to just a non-empty contiguous sequence of digits in that spot, that's _harder_ to force a bug into, but it's not impossible. For starters, the whole thing is probabilistic. Suppose the model uses "0" as the first digit. The only remaining valid (integer) JSON is finishing the field. Multiple leading zeros aren't allowed. If for _any_ reason the model is more likely to produce invalid JSON after incorrectly using a "0" for the first digit, the grammar-constrained implementation will have more errors than a basic retry loop.

You can probably do something clever with changing the sampling temperature based on which part of the grammar you're in to side-step some of those issues. You might also find that for a particular problem my theoretical complaints don't seem to apply in practice. Maybe, for business reasons, past a certain accuracy rate you value being able to write simpler code (i.e., assume valid JSON) even if you dip down from 99% to 98% accuracy on a task. Nothing I've outlined says "don't guide an LLM with grammars." Do be careful with it though and consciously consider the task at hand and your proposed solution. It's not a free lunch.


> when I wrote that grammar, I meant it is CFG. Simply because my brain thinks in terms of CFG.

> I blindly converted grammar above to Haskell code.

> But then I accidentally discovered my grammar was ambiguous!

You wrote an ambiguous CFG, implemented its parser, and then discovered that it was ambiguous.

Would the CFG have become non-ambiguous if you took your advice and avoided parser combinators?

This makes me really want to avoid the CFG step altogether.


The advice to "use CFGs" is incomplete. What you really want is to use one of the deterministic subsets of CFGs, like LR(k), and importantly, with a parser-generator which will reject any non-LR productions in the grammar before even generating the parser. This way, you can get a guarantee that there are no ambiguities in the language you've defined ahead-of-time.

You might consider this to be analogous to the difference between static and dynamic typing. A statically typed language will reject an erroneously typed program before attempting to run it, but the dynamically typed language may accept it and attempt to run it, only to fail at runtime if there's a type mismatch.

Parser combinators are more like the dynamic language. You find out at runtime (when parsing) whether the text you have parsed has ambiguities, because more than one parse result is returned if that is the case. Normal operation is that the combinator returns a single result, in which case there are no ambiguities in the input, but the parser combinator doesn't provide any guarantees about absence of ambiguity in the language itself - it only proves that there's no ambiguity for the given input. And there's no practical way of testing for all inputs that the defined language is unambiguous, because there are infinite possible inputs.

LL/LR parser generators are like a type system for grammar. They'll take your input grammar, and check that all rules adhere to the constraints of LL/LR, and iff that's the case, then they'll produce a valid parser for a provably unambiguous language.

PEG are also provably unambiguous, but for a given input, the "correct parse" is not necessarily the "desired parse". Essentially, when you're using a PEG, the language is defined by its grammar. When you use an LL/LR parser-generator, you're attempting to use some constraints to formalize a grammar for your language, which you define yourself, and the tool will correct you if you've made a mistake and introduced some ambiguity by accident, whereas a PEG will silently mask that issue by just picking the first option. Rather than prompting you to fix the language to match the intuition you had about its syntax, the "accidental ambiguity" becomes a non-ambiguity and part of the language definition. Also, when extending an existing language with new syntax, PEGs can have unexpected results which may break existing code written in the language, or interpret it differently to what was intended by whoever wrote it.

However, when you're parsing a language that somebody else has already defined, it doesn't really matter which method you use to write the parser, as an incorrect parse is a bug in your code, and not the language itself.

For greenfield languages, I'm largely with the article's author that LL/LR are the correct constructions to use - they assist you in designing a language that is free of ambiguity from the start because incorrect assumptions about your syntax are flagged by the parser-generator. A greenfield language designed with a PEG becomes a language that is defined by the PEG, and any potential mistakes in your intuition become a set-in-stone part of the language.

As for the choice between LL and LR, it's pretty much a no-brainer to use LR (or LALR) to give yourself greater flexibility. LL had its place on resource-constrained systems of the past, but on today's systems, LR parsing uses a trivial amount of resources and is only going to be a fraction of the compiler or interpreter's memory and CPU time. If you can fit your language into an LL grammar, you certainly should, as it will be blazingly fast to parse, but I don't think you should constrain yourself just because of this.

Recommendation: Use Menhir. It supports writing production rules which are parametrized by other production rules. This can "feel" a bit like parser combinators to write, but it's all ultimately lowered to LR. The parsers it produces are fast, and you can use the tool for incremental parsing.


> The advice to "use CFGs" is incomplete. What you really want is to use one of the deterministic subsets of CFGs, like LR(k), and importantly, with a parser-generator which will reject any non-LR productions in the grammar before even generating the parser.

Which is also why there is no "just use CFG". CFG is an abstraction from language generation. BNF as a "language" was invented for language generation. Its suitability for parsing was mostly an accident, and we've been dealing with the pragmatic repercussions in the field of parsing ever since. To make parsing work we have bounded subsets of the abstraction. To make the illusion of the abstraction work we've bent BNF to nearly its breaking point, but the illusion fails as soon as you realize the BNF for LL and LR (and LALR and so on and so forth) is subtly incompatible. There's no universal BNF, there never was, it's a broken illusion.

PEG is just as useful a bounded subset of the abstraction as LL/LR. The break was that PEG decided to stop pretending BNF was a useful lingua franca between these subsets, and then parser combinators contributed further to the demise of "universal BNF", which was never universal.


There's no requirement for parser combinators to allow ambiguity. It's perfectly fine, and possible to write parser combinators that will only parse an unambiguous language, and only ever return a single unambiguous parse or throw an error.

You're right it's often easy to use them in ways that introduce ambiguity, and some explicitly support ambiguity, though.


Yes, but I've never actually seen this done in practice. For starters, it kind of requires a multi-stage compilation process - because you must verify and prove non-ambiguity before or during compilation of the application using the parser. In practice people just use their parser-generators in their application's code without any prior stage verifying it. To do this without a multi-stage approach would require somehow embedding the semantics of LL/LR into the host language's type system. I suppose alternatively you could verify the parser on application startup and terminate with an error if it's not valid LL/LR.

PEGs are a better fit for parser combinators because the host language typically has a pattern matching construct which checks patterns from top to bottom, and this matches the semantics of PEGs ordered alternations.


I don't get why you assume this.

If I write a parser combinator based parser for a grammar that is not ambiguous, I will write it to fail if it comes across an error, just as I would if I were to e.g. hand-write a recursive descent parser. In fact, most of the places where I use parser-combinators, I use them to build parsers structured exactly the way I would structure a recursive descent parser, just with reusable combinators instead of manually written functions, and I will - in the same way as I would wherever possible in a recursive descent parser - aim to bail at the first symbol that causes an error, or optionally jump to an error recovery function.

I've never in my life written a parser combinator based parser that required a multi-stage compilation process. Verifying that the parser recognises the language it is meant to, is part of the test suite whichever parsing method you use.

> I suppose alternatively you could verify the parser on application startup and terminate with an error if it's not valid LL/LR.

Why would I run my test-suite on application startup? The grammar is not magically changing between runs.


> If I write a parser combinator based parser for a grammar that is not ambiguous

How do you even know it is not ambiguous to begin with?

I already said it doesn't matter what tool you use to parse a language designed by someone else. The language you design yourself is not automatically free of ambiguity, and intuition only gets you so far. If you stick to only using productions which are valid LL/LR, then why not have that enforced by the tool, so it can warn you when you go astray?


> How do you even know it is not ambiguous to begin with?

By knowing what makes a grammar ambiguous, and not doing it, and having a test suite that I need anyway.

> If you stick to only using productions which are valid LL/LR, then why not have that enforced by the tool, so it can warn you when you go astray?

A large part of the reason why there are so many parser generators is that we to date do not have a parser generator design that generates parsers that most people like to use for major projects. I've written about 5 generations of parser generators myself over the last 30 years, and I hate every one of them, yet every one was written in response to not finding a parser generator I consider tolerable to work with.

Most of the time it's easier to work with a hand-rolled parser. Parser combinators makes that easier, and more reusable.


IMO, people get way too caught up in the formal specification of grammars. Writing an unambiguous parser by hand is not that difficult. It can be done without a lexer using recursive descent and a switch statement at each byte in the input stream where a decision needs to be made.

> Parser and lexer for reversible parsing with guaranteed reversibility. Seems to be totally unique across all programming languages. Of course, CFG-based.

There's also the parser combinator approach of parser-printers ;)


To the author: How "you" think is not how "we" think.

The tendency to assume that your preferences or tastes are globally shared is really puzzling to me. That has never been true and (most likely) never will be true.


PEGs are more expressive than CFGs, iirc. So use the right tool for the job. I have come to quite like PEGs and they become quite natural, once one understands things like no left recursion. I also often prefer them over complex regexes.

It's difficult to conclude meaningfully that either one is more expressive than the other. They parse different languages: though there is a large subset of PEGs which are also CFGs, neither is a proper superset of the other - there are things you can parse with a CFG which cannot be parsed with a PEG, and vice-versa.

OTOH, when comparing LL and LR, we can conclusively say that LR are more expressive than LL because they're a proper superset - any language that can be parsed by an LL(k) grammar can be parsed by an LR(k) grammar (for same k).


> there are things you can parse with a CFG which cannot be parsed with PEG

This is not known. It is conjectured to be true because if all context-free languages could be parsed with a PEG grammar it would imply you'd have a linear-time boolean matrix multiplication algorithm, but no one has ever constructed a concrete example of a language which is context-free but not parsable with a PEG.


the other one is known, though, isn't it? there's an example of a peg parsing a non-context-free language aⁿbⁿcⁿ in ford's dissertation?

Yes.
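
For reference, the grammar given there (quoting from memory, so the notation may be slightly off) uses a syntactic predicate to get the context-sensitivity:

    S <- &(A 'c') 'a'+ B !.
    A <- 'a' A? 'b'
    B <- 'b' B? 'c'

The &-predicate checks, without consuming anything, that the a's are balanced by b's; B then checks the b's against the c's, and !. requires end of input.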


