Hacker News new | comments | show | ask | jobs | submit login

I don't think it's ever been the case that C front ends used lex/yacc. Whether it's "best practice" I suppose depends on who you ask! (You'll certainly learn it with parser generators in university because writing a hand-written parser takes too long for an undergrad class, and they will only cover a tiny subset of C if anything.)

You can see here that the original C compilers use a hand-written expression parser using the Shunting Yard algorithm, and I'm sure the rest of the front end is hand-written too:

http://www.oilshell.org/blog/2017/04/22.html

C predates lex/yacc so the syntax wasn't really designed with them in mind.

Some other potential problems:

(1) Distinguishing types and variables in C:

https://eli.thegreenplace.net/2011/05/02/the-context-sensiti...

https://en.wikipedia.org/wiki/The_lexer_hack

(2) The preprocessor. While it doesn't prevent you from using lex/yacc for the C parser, this is an entirely separate lexer and parser that is always written by hand AFAICT.

Awk, on the other hand, is almost always parsed with lex/yacc. The language was almost designed around those tools.

I think if you're writing say your first compiler you should try to use lex and yacc. I think you can solve the context-sensitivity problem but I haven't done it myself. Operator precedence probably requires some extensions or maybe you can encode the many levels of C precedence in the grammar.

One thing I found from implementing a shell parser [1] is that the POSIX spec uses yacc grammars, even if the languages aren't best parsed with bottom-up parsing. For example, all shells except bash use top-down parsing.

Analogously, the C grammar may "manually" encode the operator precedence, even though the original C parsers and current ones like GCC/Clang actually use an operator precedence algorithm like the Shunting Yard algorithm, not yacc's shift-reduce algorithm.

e.g. this grammar manually encodes it:

https://www.lysator.liu.se/c/ANSI-C-grammar-y.html

Not that yacc can't parse expressions -- it's just something interesting I found.

Also, for shell, there are a bunch of parsing rules that are necessarily listed separately from the grammar, and I'm sure the same is true for C.

[1] http://www.oilshell.org/blog/tags.html?tag=parsing-shell#par...




Thank you very much for all that information - much appreciated!

The Portable C Compiler was written using Yacc. [1]

>I think if you're writing say your first compiler you should try to use lex and yacc.

I've actually started with something higher level still: BNFC [2], which has taken me quite a long way quite quickly. It generates lexer, parser, operator precedence and AST along the lines of the Appel books for a variety of backends: C, Java, OCAML, Haskell etc.

But now I'm running into bugs and limitations in BNFC. I may fix some of those if I can, but at some point I will probably dump BNFC and then hand-edit the generated Flex/Bison.

Maybe that's what people do eventually - start with a high-level tool and then move lower?

[1] https://en.wikipedia.org/wiki/Portable_C_Compiler

[2] https://github.com/BNFC/bnfc


OK BNFC is interesting! I think that kind of tool is the holy grail, but they all have limitations. You kind of have to design your language AROUND it, and C obviously doesn't fit in that category.

For example, I mentioned that Awk was designed with yacc.

It depends how ambitious you want to be. You can tweak C to be more easily parsed, but then you will parse fewer real programs.

It's not surprising to me at all that BNFC will have problems with C. So I guess I would say you should use high level tools like BNFC / lex / yacc until you can't.

FWIW here is another little problem you will run into with C:

https://en.wikipedia.org/wiki/Dangling_else

I guess the point is that parser generators and even grammars themselves are leaky abstractions! To effectively use parsing tools you have to be familiar with their internals and the underlying algorithm. If there are two levels of code generation like BNFC, then that makes this task harder!

I think it makes sense to start high and go low. It also makes sense to start low-level and go high. Once you are sick of writing parsers by hand, you will understand how to write a better parser generator than the ones that currently exist :)




Applications are open for YC Winter 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: