LALRPOP, an LR(1) parser generator for Rust (smallcultfollowing.com)
130 points by brson on Sept 29, 2015 | 26 comments



Is there any particular reason these days to favour a parser generator over a straight-up backtracking parser combinator library like attoparsec?

Seems like the whole reason for LALR(1) parsers was performance, and parsing no longer takes a significant share of time compared to the old days, while helpful error feedback, which any parser generator approach tends to be bad at, is at a premium.


Parsing speed is no longer important inside of compilers, but it still matters inside of text editors and IDEs. In those contexts, you have to choose between the extra complexity of managing parser generation and the extra complexity of avoiding too many reparses.


Unfortunately, you can't really use a parser generator for editors and IDEs unless they're structured editors; otherwise the parser has to be able to gracefully handle broken parses.


Incremental parsing is designed for just this use case. The last major work in the area that I know of is Tim Wagner's PhD thesis http://www.cs.berkeley.edu/Research/Projects/harmonia/papers...


> Parsing speed is no longer important inside of compilers, but it still matters inside of text editors and IDEs.

Maybe! I agree with the second half there, but I think the first half is not totally spot on. For _small_ projects modern machines will parse faster than you can perceive, but as source code grows, a pokey parser will out. It becomes increasingly frustrating when the parser sub-system is tied up with early analysis, making it difficult to modify without getting deeply involved in the rest of the compilation steps.


When your grammar is very simple and performance is important, but not important enough to justify hand-writing a parser. This is common when, for example, ingesting data.


The point of LALR, as opposed to LR, is performance, but one of the advantages of bottom-up parsers over backtracking combinators is that ambiguities in the syntax can be detected at compile time.


It might be nice for us less formally trained persons to include a small description of what an LR(1) parser is. I tried looking it up on Wikipedia, but the article isn't that clear either, IMHO.


The main contrast is with LL. LL(1) is trivially parsed with recursive descent; you start with the root production in the grammar, and you never need more than one token of lookahead to figure out which alternate to parse next. This maps naturally to turning productions into functions and alternation choices into switches on the next token. The parse tree is built from the top down, like a pre-order traversal, because the current production is chosen before the constituent productions and terminals are parsed.

LR parsers build the tree from the bottom up. Productions / terminals get pushed onto a stack (shift) until there's enough information to decide what parent production to use, and children are popped off and replaced with the parent (reduce). The tree gets built post-order; the parents are built after the children are already built. The hard bit is figuring out whether the productions on the stack are enough to compose a parent rule or not.

LR parser generators figure out a state machine for making shift / reduce decisions quickly.
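
To make the LL(1) half concrete, here's a minimal recursive-descent sketch in Rust for a toy grammar (the grammar, token type, and all names are my own invention for illustration): each production becomes a function, and one token of lookahead picks the alternate.

    // Toy grammar:
    //   expr -> term ('+' term)*
    //   term -> NUMBER | '(' expr ')'
    #[derive(Debug, Clone, PartialEq)]
    enum Token { Num(i64), Plus, LParen, RParen }

    struct Parser { tokens: Vec<Token>, pos: usize }

    impl Parser {
        fn peek(&self) -> Option<&Token> { self.tokens.get(self.pos) }

        fn bump(&mut self) -> Option<Token> {
            let t = self.tokens.get(self.pos).cloned();
            self.pos += 1;
            t
        }

        // expr -> term ('+' term)*
        fn expr(&mut self) -> Result<i64, String> {
            let mut acc = self.term()?;
            // One token of lookahead decides whether to keep going.
            while self.peek() == Some(&Token::Plus) {
                self.bump();
                acc += self.term()?;
            }
            Ok(acc)
        }

        // term -> NUMBER | '(' expr ')'
        fn term(&mut self) -> Result<i64, String> {
            match self.bump() {
                Some(Token::Num(n)) => Ok(n),
                Some(Token::LParen) => {
                    let v = self.expr()?;
                    match self.bump() {
                        Some(Token::RParen) => Ok(v),
                        other => Err(format!("expected ')', got {:?}", other)),
                    }
                }
                other => Err(format!("expected number or '(', got {:?}", other)),
            }
        }
    }

    fn main() {
        // (1 + 2) + 3, already tokenized
        let tokens = vec![
            Token::LParen, Token::Num(1), Token::Plus, Token::Num(2),
            Token::RParen, Token::Plus, Token::Num(3),
        ];
        let mut p = Parser { tokens, pos: 0 };
        println!("{:?}", p.expr()); // Ok(6)
    }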



Thank you. I found that to be a perfect refresher before reading the article in this story.

Does anyone know if there's a plan to move Rust's own parser to a Rust-native tool? Right now it apparently uses antlr4 along with some custom Rust code:

https://github.com/rust-lang/rust/tree/master/src/grammar


That's just an alternative implementation of Rust's grammar for verification purposes. Rust's parser is entirely hand-coded inside rustc (in Rust).


Ah, I thought that was something new. I remember reading something about a manual parser a while back. Thank you for clarifying.

Would there be a benefit in moving rustc from a hand-coded parser to something like the parent project (LALRPOP)? IIRC the rationale for hand-coding the parser was mostly speed (and a wish to avoid dependencies external to Rust)?


Seeing this bug filed today (a rather interesting/weird breaking change is needed to fix it), it seems plain sanity should prefer a generated parser.

https://github.com/rust-lang/rust/issues/28777

I'm not a compiler hacker, though, so I don't really know how to weigh it.


I literally took a compiler class just to learn what LR parsing is, and then I had to spend an additional ~6 months of my free time to write an LR parser generator that I found intellectually satisfying and that didn't leave question marks in my brain.

In other words, it's not something I expect you can read a couple sentences about and hope to understand well...


I find LR parsers to be conceptually simple--they're the same idea as recursive descent parsers, with the optimization that the many stack frames that result from "recursively descending" are bundled up into a single stack frame, corresponding to an 'item set' that contains every parse item that you might recursively descend into. You do not actually add a stack frame until you consume a token of input and "call" the next item set based on which token it was.

So, if you have an item X that might call rules Y and Z, instead you have an item set {X, Y, Z}. This has the additional advantage that arbitrary left-recursion is handled naturally, since you're discarding the 'call graph' between rules when you build these sets.

To deal with the compressed stack frames that you've now built, you construct a goto table for each item set that tells you what to do based on what item has been returned to you, from the observation that only certain items could have 'called' certain other rules in the first place.

Finally, you add lookahead to your goto tables so that, in case of ambiguity, you can peek ahead to the next token and make your decision that way. This will not solve all ambiguity but it solves most cases in practice.

This is a very hand-wavy description, but I hope it is useful if you decide to look further into the Wikipedia article. Unfortunately I have never seen a complete description of an LR(1) parser that takes the conceptual approach, and I don't have time to write one here.
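
That said, the shift/reduce mechanics themselves fit in a few lines. Here's a toy hand-rolled sketch in Rust for the grammar E -> E '+' NUM | NUM (my own illustration; it reduces greedily by pattern-matching the top of the stack rather than consulting a generated state table, which happens to work for this grammar but is exactly the decision a real LR(1) generator precomputes):

    #[derive(Debug, Clone)]
    enum Sym { Num(i64), Plus, E(i64) } // terminals plus the one nonterminal

    fn parse(input: &[Sym]) -> Option<i64> {
        let mut stack: Vec<Sym> = Vec::new();
        let mut pos = 0;
        loop {
            let n = stack.len();
            // Reduce E '+' NUM => E (checked first so it wins over NUM => E).
            if n >= 3 {
                if let (Sym::E(a), Sym::Plus, Sym::Num(b)) =
                    (&stack[n - 3], &stack[n - 2], &stack[n - 1])
                {
                    let v = a + b; // semantic action for the reduction
                    stack.truncate(n - 3);
                    stack.push(Sym::E(v));
                    continue;
                }
            }
            // Reduce NUM => E.
            if let Some(Sym::Num(x)) = stack.last() {
                let v = *x;
                stack.pop();
                stack.push(Sym::E(v));
                continue;
            }
            // Nothing reducible on top: shift the next token, or stop.
            if pos < input.len() {
                stack.push(input[pos].clone());
                pos += 1;
            } else {
                break;
            }
        }
        // A successful parse leaves exactly one E on the stack.
        match stack.as_slice() {
            [Sym::E(v)] => Some(*v),
            _ => None,
        }
    }

    fn main() {
        // 1 + 2 + 3, already tokenized
        let input = [Sym::Num(1), Sym::Plus, Sym::Num(2), Sym::Plus, Sym::Num(3)];
        println!("{:?}", parse(&input)); // Some(6)
    }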


Sorry, but I find this more confusing than anything. I would explain it more in terms of nondeterministic parsing (i.e. delaying the choice of the production rather than picking the production immediately) or in terms of "bottom-up" parsing (going from the string to the start symbol rather than the other way around). That said, it's way easier to understand what it should be doing than it is to understand it well enough to write one yourself.


How do LR(1) parsers compare to PEG?


LR(1)-parseable context free grammars are more convenient to write than parsing expression grammars in my experience, partially because PEGs are completely unambiguous. PEG parsers often exist just because it's easier to implement them. Also, LR(1) parsers and LALRPOP generally operate on token streams and not plain strings, whereas PEG parsers need to encode lexing within the grammar, which is a pain. So if you were writing a compiler, I would recommend using an LR(1) parser over a PEG parser.


PEGs can sometimes be pretty unintuitive. For example, consider the grammar

    ("a" / "aa") "a"
In a traditional BNF setting, this grammar could match the string aa or the string aaa. However, this PEG will only accept the string aa; it cannot match the string aaa.

(This is because of how backtracking works in PEGs—or, more precisely, how it doesn't work. With the input string aaa, we'll first try to match ("a" / "aa") against the input in a left-biased way, which succeeds because "a" matches. Then we'll try to match the second expression—"a"—which also matches. Now we are done with the grammar but still have the last a in the input, so the grammar fails. There were untried alternatives, but they only appeared in earlier parts of the grammar which had already 'succeeded', so we never go back and try them.)
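
You can see this by transcribing the PEG directly into code. Here's a minimal hand-rolled sketch in Rust (my own toy encoding, not any particular PEG library): ordered choice is just if/else, and once a choice succeeds it is never revisited.

    // The choice ("a" / "aa"): returns how many bytes it consumed, or None.
    fn choice(input: &str) -> Option<usize> {
        if input.starts_with('a') {
            Some(1) // "a" wins; once this succeeds, "aa" is never retried
        } else if input.starts_with("aa") {
            Some(2) // dead code: any input starting with "aa" starts with "a"
        } else {
            None
        }
    }

    // The whole grammar: ("a" / "aa") "a", matching the entire input.
    fn grammar(input: &str) -> bool {
        match choice(input) {
            Some(n) if input[n..].starts_with('a') => input.len() == n + 1,
            _ => false,
        }
    }

    fn main() {
        println!("{}", grammar("aa"));  // true
        println!("{}", grammar("aaa")); // false: one 'a' left over, no backtracking
    }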

Additionally, both LR and PEG parsers take linear time to parse a given input, but PEG/packrat parsing has much larger constant factors. On small examples, this might not be a big deal (what's 0.2 seconds compared to 0.02 seconds when loading a file?) but when doing large amounts of parsing it can definitely be a minus on the PEG/Packrat side.


Your points aren't incorrect, but I'd consider many of your negative points to be positives.

I'd say PEGs are easier to write as they're unambiguous. It's easy to understand what they do. I work on languages for a living, and I still have trouble wrapping my head around what an LR parser is doing.

LR parsers often exist just because that's how we've done things for decades and it's what everyone knows from their university language courses.

LR parsers operate on a stream of tokens, so you have to force a distinction between lexing and parsing, which isn't always natural and just seems like an unnecessary complication.

If you were writing a compiler, I'd recommend a PEG.


> I'd say PEGs are easier to write as they're unambiguous.

While that is technically true, it doesn't solve the actual issue of ambiguity; it just defines it away.

If you have a case in your language where two syntax rules both match, it's confusing to users because you have to arbitrarily decide that one of them is correct. The best-known example of this is the dangling else ambiguity: https://en.wikipedia.org/wiki/Dangling_else

Sure, PEGs by definition are unambiguous, but only because they arbitrarily decide that the first option always "wins." The language itself might still be ambiguous, but you aren't aware of it because the parser always chooses the first option.

Put another way, with PEGs you never know if:

    a -> b / c;
is equivalent to:

    a -> c / b;
You also don't know if there are rules that are entirely unreachable.
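
As a toy illustration of both points, suppose b matches "ab" and c matches "a" (alternatives I've made up for the sketch); transcribed into Rust, swapping the order changes what the rule consumes, and one ordering leaves an alternative unreachable:

    // a -> b / c, with b = "ab" and c = "a". Returns bytes consumed.
    fn b_then_c(input: &str) -> Option<usize> {
        if input.starts_with("ab") { Some(2) }      // try b first
        else if input.starts_with('a') { Some(1) }  // then c
        else { None }
    }

    // a -> c / b: same alternatives, opposite order.
    fn c_then_b(input: &str) -> Option<usize> {
        if input.starts_with('a') { Some(1) }       // c always wins on 'a'...
        else if input.starts_with("ab") { Some(2) } // ...so b is unreachable here
        else { None }
    }

    fn main() {
        // The two orderings consume different amounts of "abc":
        println!("{:?} vs {:?}", b_then_c("abc"), c_then_b("abc")); // Some(2) vs Some(1)
    }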

Besides this, packrat parsing (the PEG parsing algorithm) is significantly more expensive than LL/LR parsing. Packrat parsing takes O(input length) memory -- significantly more than the O(tree depth) space of LL/LR.

I talk more about these issues in my blog article: http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why...


Hey, thanks for the great blog posts, but your blogging platform is frustrating to use on mobile because it has swipe left/right to change articles in addition to code blocks that don't wrap. I'm having about a 10% success rate with horizontal scrolling in the code blocks in Chrome on Android; the other 90% of the time I'm unintentionally loading a different article. Just something to consider. Personally, I'd drop the swipe navigation, since I'm far more likely to use a search bar or index to find posts than flip through like a magazine. Thanks again; you did a better job explaining LL and LR than any of my professors did.


Thanks for the feedback, and I'm sorry it's so frustrating. I just tweaked my Blogger theme to use the desktop theme on mobile -- hopefully this will be an improved experience!


thanks!


Sounds fun.



