
Parser combinators are less powerful than parser generators in terms of expressiveness and efficiency. And we've known about parser generators since the 70s.

So I'm not sure what the point is of the title of the article.




(One of the authors here.) Parser generators are generally good for one thing: parsing programming languages. For more complex formats, where you have to carry state around, or for binary formats, they're extremely cumbersome to use. I often meet people who tell me they want the parser generator to end all parsers, but for most real-world formats you'll have to hack around the generator's limitations. Theoretically, parser combinators are less expressive, but they're just functions: you can write whatever you want inside a subparser and integrate it with the rest. That approach has worked well for us; we've written a lot of complex parsers with nom.
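
To make the "they're just functions" point concrete, here's a rough sketch in nom's function-combinator style (nom 7-ish API; the length-prefixed format and helper names are invented for illustration, so treat it as an assumption rather than real nom docs):

    use nom::{
        bytes::complete::{tag, take},
        character::complete::digit1,
        combinator::map_res,
        sequence::preceded,
        IResult,
    };

    // A plain function acting as a subparser: parse a decimal length field.
    fn length(input: &str) -> IResult<&str, usize> {
        map_res(digit1, |s: &str| s.parse::<usize>())(input)
    }

    // Use the parsed length to decide how much payload to consume -- the kind
    // of stateful, length-prefixed format that is awkward in a grammar file.
    fn length_prefixed(input: &str) -> IResult<&str, &str> {
        let (rest, n) = preceded(tag("len="), length)(input)?;
        let (rest, _) = tag(";")(rest)?;
        take(n)(rest)
    }

    fn main() {
        assert_eq!(length_prefixed("len=5;hello world"), Ok((" world", "hello")));
    }

A grammar file has no natural place for "read N characters, where N came from an earlier field", but as ordinary functions the two pieces compose directly.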


I thought most real-world compilers tend to use hand-written parsers rather than generators? I'm by no means an expert, but that's something I've heard, and I know it to be true for many real-world compilers.



They probably should have used this instead: http://scottmcpeak.com/elkhound/


For C and especially C++, this makes a lot of sense. There's so much context dependence in the latter when you get to templates.


Ya, heck, not just C++, but Scala and C# too. You can tune the heck out of error recovery when you write a parser by hand, as well as support IDE services.

Using a combinator or generator makes sense when you really care just about getting trees out as fast as possible; e.g. what a command line batch compiler or interpreter does.


I suppose that's because they take the standpoint that they have so much test code available that any conflicts with the grammar are easily found.

However, in general, you can easily shoot yourself in the foot with a handwritten parser, because you can't see the conflicts by looking at the code locally.


It's usually because getting proper (as in user-friendly) error reporting is a massive weak spot with every parser generator I've seen to date.

Conflicts are generally one of the least interesting problems in writing a parser unless you're designing a new language, and especially when hand-writing recursive descent parsers, potential conflicts/ambiguities tend to become obvious quickly.

For prototyping a new language, sure, use a parser generator to test the grammar.

I too fall in the category of loving the idea of parser generators (and having written a few) but always falling back on hand-written parsers because the generators I've seen have all been inadequate. I hope that changes some day.


> For more complex formats, where you have to carry state around, or binary formats, they're extremely cumbersome to use.

That's not a theoretical issue, but a practical one.

> you'll have to hack around the generator's limitations

You're probably thinking of LALR(1)-class generators. But we've had GLR parser generators for a long time now; they're very flexible and have few limitations. See e.g. [1].

[1] http://scottmcpeak.com/elkhound/


How do you parse ASN.1 with this generator? Or JPEG?


Or especially Semantic Designs' tool that gets a ton of mileage out of GLR:

http://semanticdesigns.com/Products/DMS/DMSToolkit.html


It may be true that parser generators are superior from a technical point of view (I can't comment), but parser combinators sure can be nice to use. I very recently used nom to implement a parser for a structured text format, it feels wonderfully expressive and you can knock together a reasonably sophisticated parser in a matter of hours. Performance-wise, it is zero-copy and chomps through a 100MB file in less than half a second on my crappy laptop, so plenty fast enough for most situations.


The nicest thing about parser generators, IMO, is that they can warn you if your grammar is ambiguous.


> So I'm not sure what the point is of the title of the article.

The rest of the paper elaborates on the title.


And if you really care about error recovery, error messages, incremental processing, and tooling integration, writing a parser by hand via recursive descent is always a sure bet. There are good reasons why many production PLs don't even use generators.


Parser combinators let you put a breakpoint on a rule and get a call trace. (Same with plain old recursive descent or anything just involving function calls.)

That alone pretty much wipes out any remaining advantages of parser generators, in which the "cow" of your rules has been turned into a "hamburger" state machine that is very difficult to follow, usually with very poor debug support compared to the maturity of the rest of the tooling. ("How come if I add a variant to this rule, I get a reduce/reduce conflict in these five other rules elsewhere? Waaaah ...")

Lest there be any doubt: GCC uses a hand-written recursive descent parser for C++. (Meg and a half of code and increasing.)

That's virtually a proof that just parsing with functional decomposition is good enough for anything.

Another thing is that with functional parsing, you can use the exception handling (or other non-local, dynamic control transfers) of the programming language for recovery and speculative parsing. This parse didn't work? Chuck the whole damn branch of the parse with a dynamic return, and try going down another rabbit hole.
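
Here's a minimal sketch of that speculative-parsing idea in a hand-written recursive descent parser (Rust rather than an exception-based language, so the "dynamic return" is an early Err plus a rewind of the saved position; the grammar and names are invented for illustration):

    struct Parser<'a> {
        input: &'a [u8],
        pos: usize,
    }

    #[derive(Debug)]
    struct ParseError;

    impl<'a> Parser<'a> {
        // Try one alternative; if that branch fails anywhere, rewind and try the other.
        fn expr(&mut self) -> Result<String, ParseError> {
            let saved = self.pos;
            match self.parenthesized() {
                Ok(e) => Ok(e),
                Err(_) => {
                    self.pos = saved; // chuck the whole branch, go down the other rabbit hole
                    self.ident()
                }
            }
        }

        fn parenthesized(&mut self) -> Result<String, ParseError> {
            self.byte(b'(')?;
            let inner = self.ident()?;
            self.byte(b')')?;
            Ok(format!("({})", inner))
        }

        fn ident(&mut self) -> Result<String, ParseError> {
            let start = self.pos;
            while self.pos < self.input.len() && self.input[self.pos].is_ascii_alphabetic() {
                self.pos += 1;
            }
            if self.pos == start {
                return Err(ParseError);
            }
            Ok(String::from_utf8_lossy(&self.input[start..self.pos]).into_owned())
        }

        fn byte(&mut self, b: u8) -> Result<(), ParseError> {
            if self.input.get(self.pos) == Some(&b) {
                self.pos += 1;
                Ok(())
            } else {
                Err(ParseError)
            }
        }
    }

    fn main() {
        let mut p = Parser { input: b"(abc)", pos: 0 };
        println!("{:?}", p.expr()); // Ok("(abc)")
        let mut p = Parser { input: b"xyz", pos: 0 };
        println!("{:?}", p.expr()); // Ok("xyz")
    }

Every rule is just a method, so a breakpoint on expr() gives exactly the call trace described above.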


That's more a proof that the grammar of C++ is absolutely terrible. More modern languages such as rust have been carefully designed to be parsable by non-necronomicon-level code.


> That's more a proof that the grammar of C++ is absolutely terrible. More modern languages such as rust have been carefully designed to be parsable by non-necronomicon-level code.

I don't believe your comment is fair or correct; it even strikes me as quite naive. Considering GCC's case, what makes a parser complex is not the language grammar itself, but all the requirements placed on the compiler to enable it to churn out intelligible warnings and error messages, which means supporting typical errors as extensions of the language grammar.

Furthermore, GCC's C++ parser supports half a dozen different versions of C++, along with all the error and warning messages that come with targeting one language standard while using a feature it doesn't support. Rust has no such requirement, nor will it any time soon.

Your comment strikes me as the old, tired tendency to throw baseless complaints at tried-and-true technology by claiming that green, untested tech is somehow better when it has yet to satisfy real-world requirements.


You are incorrect: the grammars of both C and C++ are fundamentally hard to parse, even setting aside the extra complexity of giving nice error messages and doing good recovery. E.g. they require context to parse (https://stackoverflow.com/a/41331994/1256624, https://en.m.wikipedia.org/wiki/Most_vexing_parse), and C++ is undecidable to parse correctly (http://blog.reverberate.org/2013/08/parsing-c-is-literally-u...).


> That's more a proof that the grammar of C++ is absolutely terrible.

Oh, I took that for granted as the basis of my remarks: far from a random choice on my part.


File that next to the 'proof' that the grammar of English is absolutely terrible. More modern languages such as Esperanto have been carefully designed to be learnable by regular human beings.

I know which language I'd rather be fluent in.


What is this trying to say? If a crucial portion of your job depends on the ability to parse a language (and it does, if you're a programmer who uses an IDE), then that's a point in favor of a language that's LL(1) rather than context-dependent. Making an analogy to natural languages here isn't relevant.


In the context of the discussion, it has been pointed out that parser generators are not used in practice in multiple very successful compilers for very successful real-world languages. The grandparent to my comment claimed that the fact that GCC uses a hand-written recursive descent parser is proof that that approach to parsing is good enough for anything. The parent claimed that no, it's proof that the grammar of C++ is 'terrible'.

My point was that grammatical purity doesn't appear to correlate particularly well in the real world with the success/popularity of a language. My analogy to natural languages was relevant in that context. If parser generators are not well suited to some very successful existing languages, it's not particularly useful to blame that on the languages and point to currently niche (and relatively young) languages that don't have that 'problem'. In the real world, the most used languages have, and will continue to have for the foreseeable future, 'terrible' grammars, so there will continue to be a need for parsing techniques that can handle them.

I'd actually speculate further that the analogy to natural languages is relevant in that it may not be a coincidence that the most used languages have some of the most complex grammars. Why that might be the case is an interesting question to think about.


> that's a point in favor of a language that's LL(1) rather than context-dependent

and 5 points in favor of a language that's LL(0) rather than LL(1). The logical conclusion of your argument is to use Lisp everywhere.


Lisp is LL(1). If we have #, we don't know what we're looking at until we read the next character, like #s structure, #( vector, #= circle notation, etc.

Actually there can be an integer between the two; but that doesn't change what kind of syntax is being read so it arguably doesn't push things to LL(2).

Other examples: seeing (a . we don't know whether this is the consing dot notation, or the start of a floating-point token.

Speaking of tokens, the Lisp token conversion rules effectively add up to LL(k). 12345 could be an integer or symbol. If the next character is, say, "a" and the token ends, we get a symbol. Basically if we see a token constituent character then k more characters have to be scanned before we can decide what kind of token and consequently what object to reduce to.


That variety of Lisp with its particular brand of syntactic sugar is LL(1), and its lexing LL(k). Because of its circumfix notation, a Lisp language can be LL(0) when the prefix symbols and tokens are defined appropriately. That was the point of my original comment.


Are they? I thought parser combinators made parsing context-sensitive grammars easier than parser generators do.


I think it's more a response to the wave of attacks against unsafe parsers, such as those used to read subtitles in media players.


Well, we gladly trade power and efficiency all the time for more intuitiveness and ease of use in computing...


On top of that, we even have some parsers that are formally verified or validating, which could be tied into Rust.

https://arxiv.org/pdf/1105.2576.pdf

http://gallium.inria.fr/~xleroy/publi/validated-parser.pdf

http://users.cecs.anu.edu.au/~aditi/esop09_submission_16.pdf

One of those handled most of the C99 standard.



