Which Parsing Approach? (2020) (tratt.net)
52 points by tomas789 on May 5, 2023 | 18 comments



Generated lexers are still worth it but handwritten RD won on the parsing front. No other historical debate gets this Weekend at Bernie's treatment the way parser generators do. Is it from dated undergrad + research material? There must be a reason all the references and approaches in this article are decades old and parser combinators get a single dismissive footnote. Pretty sure that Guy Steele quote even predates PEGs.

Bison and YACC are a mess. Sure, go and complicate your build by introducing code gen steps and ancient spaghetti DSLs with terrible tooling and an extra learning curve. Anything to avoid writing a readable parser in the same language as the rest of your compiler. The amount of ink spilled over parsing and parser generators is phenomenal when you consider how trivial parsing is compared to the other compilation phases. Look at source code for large compilers and you'll find the handwritten parser is some of the easiest code to understand in the project. Lack of left recursion is a trivial limitation in practice.
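To make the left-recursion point concrete: a left-recursive rule like "expr : expr '-' term | term" just becomes a loop in a hand-written parser. A minimal Python sketch, with the token list and tuple trees invented purely for illustration:

    # Minimal sketch: the left-recursive rule
    #   expr : expr '-' term | term
    # rewritten as a loop in a hand-written recursive-descent parser.
    # The token list and tuple trees here are invented for illustration.
    def parse_expr(tokens):
        node = parse_term(tokens)           # first operand
        while tokens and tokens[0] == '-':  # fold trailing "- term"s left-associatively
            tokens.pop(0)
            node = ('-', node, parse_term(tokens))
        return node

    def parse_term(tokens):
        return int(tokens.pop(0))           # terms are plain numbers in this toy grammar

    # parse_expr(['1', '-', '2', '-', '3']) == ('-', ('-', 1, 2), 3)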


I think the parsing debate is just one example of a spectrum I've noticed a lot of developers (or indeed people in general) lie along: at one end are those who love complexity, abstraction, and generality, and who indulge heavily in theory, while at the other end are those who want simple, practical solutions even if those might be considered "hacks" by people at the opposite end. The former are obviously those advocating for parser generators and all the theory behind them, while the latter go straight for handwritten RD.

> Is it from dated undergrad + research material?

If creating research material is treated as the goal, then that is what gets produced, in abundance.

> Lack of left recursion is a trivial limitation in practice.

The same goes for context-sensitivity.


The front end of compilers gets a lot of attention because the problem looks complex but is easy in practice, which is why there is so much material trying to teach the topic.

On that note, what resources would you suggest for becoming familiar with the backend of a compiler? The CS classes I took mostly had a frontend focus, with the backend only covered at a high level and without delving too deep into the implementation details.


I have a similar struggle and find you are largely limited to learning piecemeal. I have found the best learning resources are the technical docs, specifications, proposals and source code of major language projects and VMs. Research languages and associated research papers. Conference talks by compiler authors (Rust has some good ones).


> The amount of ink spilled over parsing and parser generators is phenomenal when you consider how trivial parsing is compared to the other compilation phases.

Trying to find good resources for implementing type checking was a pain when I was doing my MSc project. Maybe it's partly due to how few projects get to serious stages like type checking or optimization passes.


> Generated lexers are still worth it but handwritten RD won on the parsing front.

Packrat parsers make a very compelling case. I'm not sure the debate is over.


An interesting article about parsing techniques that were developed half a century ago, when there were still severe memory limitations. It would be nice if it at least mentioned that these techniques were developed because back-tracking was not considered a valid option for parsing: it was simply not feasible to store a whole file in memory. Nowadays the whole source code of the Linux kernel easily fits in RAM.

For many applications, using a back-tracking parser with some memoization/caching is feasible. IParse Studio [1] is an example of an interpreting parser that re-parses a grammar and an input on each change to the input, and it demonstrates that this works for rather complex grammars [2] on realistic input [3].

[1] https://fransfaase.github.io/MCH2022ParserWorkshop/IParseStu... [2] https://fransfaase.github.io/MCH2022ParserWorkshop/C_grammar... [3] https://fransfaase.github.io/MCH2022ParserWorkshop/scan_pc.t...
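To illustrate the back-tracking-plus-caching idea (this is not IParse itself, just a toy packrat-style sketch in Python over an invented one-rule grammar):

    from functools import lru_cache

    # Toy packrat-style sketch: each rule returns the position after a successful
    # match, or None on failure, and results are memoized per input position so
    # back-tracking does not blow up.
    # Grammar: S -> 'a' S | 'a'   (one or more 'a's)
    def make_parser(text):
        @lru_cache(maxsize=None)
        def S(pos):
            if pos < len(text) and text[pos] == 'a':
                end = S(pos + 1)       # alternative 1: 'a' S
                if end is not None:
                    return end
                return pos + 1         # alternative 2: just 'a'
            return None
        return S

    parse = make_parser('aaaa')
    assert parse(0) == 4               # consumed all four 'a's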


Here is another example of an LR parser that reads the grammar on the fly and immediately parses the input (which, among other things, also defines its own grammar) with it: https://marketplace.visualstudio.com/items?itemName=Practal....

But it seems the LR approach is just not flexible enough for my needs, so I am experimenting with replacing it with parser combinators + Earley parsing. The LR approach also has the disadvantage of having to precompute tables, which is fine for a single file; but if you want to deal with modules, I am not sure how to compute these tables incrementally, which is kind of a prerequisite for a good user experience.


"I suspect that parsing performance worries date back to the period when parsing techniques were under heavy development. LR parsing was invented in 1965, a time when computers were painfully slow [19] and resource poor. "

"When combined with a couple of other techniques to squeeze the statetable’s memory footprint [22], even the most puny modern machine can run an arbitrary LR parser at impressive speeds."


> works (...) on realistic input

However, sometimes the input is generated by code. Think of e.g. huge switch statements. It's certainly not nice if a parser breaks down on that.


The source file isn't the main memory concern; it's the memoization.


There is no need for full memoization. The above JavaScript implementation caches, for each non-terminal, the last location at which a parsing attempt was made and the outcome of that attempt: either false, or true together with an abstract syntax tree and the next location after the part of the input that was parsed.
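Roughly, that is a single cached slot per non-terminal rather than a table over every (non-terminal, position) pair. A Python sketch of the idea (names invented, not the actual JavaScript code):

    # Sketch of a single-slot cache per non-terminal.
    class SingleSlotCache:
        def __init__(self):
            self.slots = {}  # non-terminal name -> (position, outcome)

        def lookup(self, nonterminal, position):
            entry = self.slots.get(nonterminal)
            if entry is not None and entry[0] == position:
                return entry[1]  # (False,) or (True, ast, next_position)
            return None

        def store(self, nonterminal, position, outcome):
            # Overwrites the previous attempt: only the most recent one is kept.
            self.slots[nonterminal] = (position, outcome)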


Personally, I swear by recursive descent parsing mixed with TDOP for expressions. Since I first tried that combination I haven't looked back.
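For anyone who hasn't seen TDOP (Pratt parsing): the expression core is tiny. A rough Python sketch, with simplified tokens and made-up binding powers:

    # Rough TDOP/Pratt sketch: binding powers drive expression parsing; the rest
    # of the language is handled by ordinary recursive descent around it.
    BINDING_POWER = {'+': 10, '-': 10, '*': 20, '/': 20}

    def parse_expression(tokens, min_bp=0):
        left = tokens.pop(0)  # a bare number, in this toy grammar
        while tokens and BINDING_POWER.get(tokens[0], 0) > min_bp:
            op = tokens.pop(0)
            right = parse_expression(tokens, BINDING_POWER[op])
            left = (op, left, right)
        return left

    # parse_expression([1, '+', 2, '*', 3]) == ('+', 1, ('*', 2, 3))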


As do I. OP with partial-order precedence, mixfix ops, op predicates for white-space sensitivity, and op-fragment chains for n-ary mixfix. RD with backtracking, computed predicate rules, and backtracked "parse scoped" data. The thought of exploring a performance-oriented one in Julia (monomorphized multiple-dispatch JIT) occasionally comes to mind; if I ever see a "get the rep right and the problem gets much easier" story for graph rewriting at LLVM scale, I might try for a fully extensible compiler bootstrap.


>>> Over time, I’ve come to view LR parsing as equivalent to static typing: occasionally annoyingly restrictive, but providing enough static guarantees to be worth the annoyance for important software.

That's also one of the points of using parser tools instead of hand-coded recursive descent parsers: the rule validation is done by the tool, so a lot of pitfalls can be avoided.


I think this may be one of the best overviews of parsing algorithms I've come across! It is concise, and it points out the various approaches and their advantages/pitfalls in practice.

In general, it seems like most articles on parsing push one approach or another too hard, either because people have cargo-culted themselves into believing whatever they learned first is best, or because they only understand one approach.

Parsers are just plain tough imo! It is a simple problem with a tricky solution space!

Complete overviews with citations like this are hard to come by in my opinion.


Thank you! A very nice overview and plausible reasoning.


(2020)



