I think this is a good illustration of why parser-generator middleware like yacc...

jasone · 2024-09-02T05:20:57.000000Z

Hard disagree. Yacc has unnecessary footguns, in particular the fallout from using LALR(1), but more modern parser generators like bison provide LR(1) and IELR(1). Hand-rolled recursive descent parsers as well as parser combinators can easily obscure implicit resolution of grammar ambiguities. A good LR(1) parser generator enables a level of grammar consistency that is very difficult to achieve otherwise.

thomasmg · 2024-09-02T07:53:36.000000Z

> Hand-rolled recursive descent parsers as well as parser combinators can easily obscure implicit resolution of grammar ambiguities.

Could you give a concrete, real-life example of this? I have written many recursive-descent parsers and never ran into this problem (Apache Jackrabbit Oak SQL and XPath parser, H2 database engine, PointBase Micro database engine, HypersonicSQL, NewSQL, Regex parsers, GraphQL parsers, and currently the Bau programming language).

I have often heard that Bison / Yacc / ANTLR etc. are "superior", but mostly from people who didn't actually have to write and maintain production-quality parsers. I do have experience with the above parser generators, e.g. for university projects and Apache Jackrabbit (2.x). I remember that in each case, the parser generators had some "limitations" that caused problems down the line. I then had to spend more time working around those limitations than doing productive work.

This may sound harsh, but well that's my experience... I would love to hear from people that had a different experience for non-trivial projects...

masfuerte · 2024-09-02T13:31:36.000000Z

If you start with an unambiguous grammar then you aren't going to introduce ambiguities by implementing it with a recursive descent parser.

If you are developing a new grammar it is quite easy to accidentally create ambiguities and a recursive descent parser won't highlight them. This becomes painful when you try to evolve the grammar.
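
To make this concrete, here is a minimal sketch using the classic dangling-else grammar, `stmt := 'if' cond 'then' stmt ['else' stmt] | ID`. The grammar, the token format, and the `parse_stmt` helper are made up for illustration and not taken from any project mentioned in this thread. The grammar is ambiguous, but the hand-written parser never says so; it quietly binds each `else` to the nearest `if`.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement; return (tree, next_pos).

    Hypothetical grammar (ambiguous on purpose):
        stmt := 'if' cond 'then' stmt ['else' stmt] | ID
    """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "if":
        if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
            raise SyntaxError("expected 'then'")
        cond = tokens[pos + 1]
        then_branch, pos = parse_stmt(tokens, pos + 3)
        else_branch = None
        # Implicit resolution: an 'else' is consumed by whichever 'if'
        # is currently being parsed, i.e. the innermost one.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return ("stmt", tok), pos + 1


tree, _ = parse_stmt("if a then if b then x else y".split())
print(tree)
# ('if', 'a', ('if', 'b', ('stmt', 'x'), ('stmt', 'y')), None)
```

The other valid reading, with the `else` attached to the outer `if`, is never considered and nothing is reported. An LR generator given the same grammar would typically flag a shift/reduce conflict at this point.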

tgv · 2024-09-02T12:25:54.000000Z

The original comment says that using yacc/bison is "fundamentally misguided." But parser generators make it easy to add a correct parser to your project. It's obviously not the only way. Hand-rolling has a bunch of pitfalls, and easily leads to apparently correct behavior that does weird things on untested input. Your comment then is a bit like: I've never had memory corruption in C, so Rust/Java/etc. is for toy projects only.

thomasmg · 2024-09-03T09:21:40.000000Z

> Hand-rolling has a bunch of pitfalls

I'm arguing that this is not the case in reality, and asked for concrete examples... So again I ask for a concrete example... For memory corruption, there are plenty of examples.

For parsing, I know one example that led to problems. Interestingly, it was about using a state machine that was then modified (manually) and the result was broken. Here I argue that using a handwritten parser, instead of a state machine that is then manually modified, would not have resulted in this problem. Also, there was no randomized testing / fuzz testing, which is also a problem. This issue is still open: https://issues.apache.org/jira/browse/OAK-5367

tgv · 2024-09-03T17:33:26.000000Z

There's no reason for concrete examples, because the point was about the fundamental misguidedness of parser generators, not about problems with individual parser generators or the nice things you can do in a hand-rolled one. But to accommodate you, ANTLR gives one on its home page: "... At Twitter, we use it exclusively for query parsing in Twitter search... Samuel Luckenbill, Senior Manager of Search Infrastructure, Twitter, inc."

Also, regexps are used very often in production, and that's definitely a parser-generator of sorts.

The memory corruption example was an analogy, but to spell it out: it's easier and faster to write a correct parser using flex/bison than by hand, especially for more complex languages. Parser generators have their use, and are not fundamentally misguided. That you might want to write your own parser in some cases does not diminish that (nor vice versa).

tgv · 2024-09-02T07:33:14.000000Z

Same. LR(k) and LL(k) are readable and completely unambiguous, in contrast to PEG, where ambiguity is resolved ad hoc: PEG doesn't have a single definition, so implementations may differ, and the original PEG uses the order of the rules and backtracking to resolve ambiguity, which may lead to different resolutions in different contexts. Ambiguity does not leap out to the programmer.
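
A small sketch of the ordered-choice point, assuming the original PEG semantics (first matching alternative wins, no retry after a choice has succeeded). The combinators `lit`, `choice`, `seq` and `matches` are hypothetical helpers written for this example, not the API of any PEG library.

```python
def lit(s):
    """Match the literal string s at pos; return the new position or None."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def choice(*alts):
    """PEG ordered choice: commit to the first alternative that matches."""
    def parse(text, pos):
        for alt in alts:
            end = alt(text, pos)
            if end is not None:
                return end
        return None
    return parse


def seq(*parts):
    """Match each part in order; fail as soon as one part fails."""
    def parse(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def matches(rule, text):
    """True if rule consumes the whole text."""
    return rule(text, 0) == len(text)


# Same alternatives, different order.
short_first = seq(choice(lit("a"), lit("ab")), lit("c"))
long_first = seq(choice(lit("ab"), lit("a")), lit("c"))

print(matches(short_first, "abc"))  # False
print(matches(long_first, "abc"))   # True
```

Swapping the two alternatives changes which inputs are accepted, and neither version produces a warning. In `short_first` the alternative `"ab"` can never contribute to a match here, which is the kind of thing that does not leap out when reading the rules.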

OTOH, an LL(1) grammar can be used to generate a top-down/recursive descent parser, and will always be correct.
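
As a rough illustration of that mapping, here is a sketch for a toy LL(1) grammar: `E -> T E'`, `E' -> '+' T E' | empty`, `T -> NUM | '(' E ')'`. The grammar and the `parse_expr` function are assumptions made for this example. Each nonterminal becomes one function, and every decision is made by looking at exactly one token.

```python
def parse_expr(tokens):
    """Evaluate a token list for the toy LL(1) grammar:
         E  -> T E'
         E' -> '+' T E' | empty
         T  -> NUM | '(' E ')'
    """
    pos = 0

    def peek():
        # One token of lookahead; None marks end of input.
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
        pos += 1

    def term():
        nonlocal pos
        tok = peek()
        if tok == "(":
            expect("(")
            value = expr()
            expect(")")
            return value
        if tok is not None and tok.isdigit():
            pos += 1
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    def expr():
        value = term()
        # E' is written as a loop: '+' selects the first production,
        # any other lookahead selects the empty one.
        while peek() == "+":
            expect("+")
            value += term()
        return value

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"trailing input at {peek()!r}")
    return result


print(parse_expr(["1", "+", "(", "2", "+", "3", ")"]))  # 6
```

Because the lookahead sets of the alternatives are disjoint, there is never a point where the parser has to guess or prefer one rule over another.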

HelloNurse · 2024-09-02T13:14:28.000000Z

A large portion of this consistency is not making executive decisions about parsing ambiguities. The difference between "the language is implicitly defined by what the parser does" and "the grammar for the language has been refined one failed test at a time" is large and practically important.

tannhaeuser · 2024-09-02T08:49:36.000000Z

I think it would be interesting and appropriate to hear about and link to the reflections of the original awk authors (Aho, Kernighan, Weinberger et al.), considering they were also experts in yacc and other compiler-compiler tools from the 1977–1985 era and authors of the dragon book. After all, awk syntax was the starting point for JavaScript including warts such as regexp literals, optional semicolons, for (e in a), delete a[e], introducing the function keyword to a C-like language, etc. I recall at least Kernighan talked about optional semicolons as something he'd reconsider given the chance.

Levitating · 2024-09-02T06:00:10.000000Z

And GNU is notorious for their use of yacc. Even gnulib functions like parse_datetime (primarily used to power the date command) rely on a yacc-generated parser.

bonzini · 2024-09-02T08:12:22.000000Z

That's mostly for historical reasons. Nobody felt the need to switch and do all the work needed to avoid breaking edge cases.

GCC used to have Bison grammars but it switched to recursive descent about 20 years ago. The C++ grammar was especially horrible.