
The context sensitivity of C’s grammar - yan
http://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-c%e2%80%99s-grammar-revisited/
======
eliben
Cool to see this posted here. Just for completeness: this article is part 2 of
the one here:
<http://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar/>,
so if the issue interests you, start with that one.

~~~
yan
Thanks for the great blog; always make sure to get to it first in my reader.

------
acqq
A piece of advice to those who, like the OP, are encountering these topics for
the first time: try to gain the historical perspective, and you'll understand
everything much better. Find the sources of the original C compilers, marvel
that they are probably smaller than Yacc and Lex (at least that's how I
remember them), and then understand that neither Yacc nor Lex was ever needed
for these C compilers. You learn about Yacc and Lex in school more because of
their "educational value" than because they are the easiest tools for making a
C compiler or parser.

When the original authors wrote the compiler, the consequence of not having a
context-free grammar was just a few more lines of code. Their goal was
certainly not academic "parser" purity. Parsing was one of the rather
uninteresting parts of making UNIX and C some 40 years ago.

~~~
sedachv
Amateurs use parser-generators.

Professionals write recursive-descent parsers.

Real men code finite state machines by hand:
<http://galileo.phys.virginia.edu/classes/551.jvn.fall01/fsm.html>

But seriously, the mystery and ignorance that continues to surround Yacc to
this day needs to be fought. If you're using Yacc, you're doing it wrong. As
far as parser-generators go, it is totally obsolete, and even when it wasn't,
it was the wrong choice for almost all parsing problems. I learned this one
day when someone recoded a Yacc-based parser I wrote in recursive-descent
style, and it took fewer lines of code than my specification.

If you need to write a really simple, or a really complicated, parser,
recursive-descent is usually the way to go. You might try Henry Baker's META
(<http://home.pipeline.com/~hbaker1/Prag-Parse.html>) as a first alternative.
If you really, really feel the need for a parser-generator, take a look at
parsing expression grammars/packrat parsers:
<http://pdos.csail.mit.edu/~baford/packrat/>
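To make the recursive-descent suggestion concrete, here is a minimal sketch in C. It is not taken from any of the linked pages; the toy grammar, the function names (`parse_expr`, `parse_term`, `parse_factor`, `evaluate`) and the choice to evaluate while parsing are all assumptions made for illustration. Each grammar rule becomes one function, which is why such parsers tend to stay short.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical toy grammar, chosen only for illustration:
 *   expr   := term   (('+' | '-') term)*
 *   term   := factor (('*' | '/') factor)*
 *   factor := NUMBER | '(' expr ')'
 * Error handling is deliberately minimal in this sketch.
 */

static const char *cursor;            /* next unread character */

static long parse_expr(void);

static void skip_spaces(void) {
    while (isspace((unsigned char)*cursor)) cursor++;
}

static long parse_factor(void) {
    skip_spaces();
    if (*cursor == '(') {
        cursor++;                     /* consume '(' */
        long inner = parse_expr();
        skip_spaces();
        if (*cursor == ')') cursor++; /* a real parser would report a missing ')' */
        return inner;
    }
    char *end;
    long value = strtol(cursor, &end, 10);
    cursor = end;                     /* advance past the digits that were read */
    return value;
}

static long parse_term(void) {
    long value = parse_factor();
    for (;;) {
        skip_spaces();
        char op = *cursor;
        if (op != '*' && op != '/') return value;
        cursor++;
        long rhs = parse_factor();
        if (op == '*') value *= rhs;
        else if (rhs != 0) value /= rhs; /* sketch: division by zero is skipped */
    }
}

static long parse_expr(void) {
    long value = parse_term();
    for (;;) {
        skip_spaces();
        char op = *cursor;
        if (op != '+' && op != '-') return value;
        cursor++;
        long rhs = parse_term();
        value = (op == '+') ? value + rhs : value - rhs;
    }
}

/* Entry point: evaluate an arithmetic expression held in a C string. */
long evaluate(const char *text) {
    assert(text != NULL);
    cursor = text;
    return parse_expr();
}
```

Calling `evaluate("1 + 2 * 3")` walks `parse_expr`, then `parse_term`, then `parse_factor`, so operator precedence falls out of the call structure rather than from a table.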

But the real lesson here is to design your syntax to be easy to parse and
manipulate in the first place. C's grammar is the clearest testament to how
little thought Dennis Ritchie put into the design of C.

~~~
ZoFreX
I have an interest in making parsers, but all mine so far are either hacks or
made with ANTLR. Does this fall under "doing it wrong"? (I presume the answer
is yes) How would you recommend I go about learning how to do it right?

~~~
acqq
Just don't "make parsers." Make something that does something better than
some other software. The parser is just one piece of what you make. Once you
have the goal, use whatever you like to make the parser piece, but produce
the results you wanted. That's how you really learn something.

~~~
ZoFreX
Thanks for the condescending advice... one of the last parsers I wrote was for
a website with 125,000 active users, they are far from the be-all and end-all
for me.

------
wglb
So using yacc and lex to do C compilation is a newish idea. Using yacc makes
the hard part harder and the easy part easier (dgc).

What is missing from the discussion in a generally good article here is that
it isn't so much a "hack to the lexer": you put information in the symbol
table (scope, type) and pass yacc a symbol with its token type attached
(variable, typename, etc.).

As another commenter here says, C compilers weren't originally done with yacc
or any other parser generator; they were recursive descent.
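The symbol-table feedback described above could look roughly like the following C sketch. Everything here is hypothetical: the names `declare`, `classify`, `enter_scope`, `leave_scope`, the token kinds, and the fixed-size table are assumptions for illustration, not the interface of yacc or of any real compiler. The idea is only that the lexer asks the symbol table what an identifier currently means before handing a token to the parser.

```c
#include <assert.h>
#include <string.h>

/* Token kinds the lexer can hand to the parser for a bare name. */
enum token_kind { TOK_IDENTIFIER, TOK_TYPENAME };

#define MAX_SYMBOLS 64

struct symbol {
    const char *name;
    int is_typedef;     /* nonzero if the name was introduced by typedef */
    int scope_depth;    /* nesting level at which it was declared */
};

static struct symbol symbols[MAX_SYMBOLS];
static int symbol_count = 0;
static int current_depth = 0;

void enter_scope(void) { current_depth++; }

/* Leaving a scope discards every name declared inside it. */
void leave_scope(void) {
    assert(current_depth > 0);
    while (symbol_count > 0 &&
           symbols[symbol_count - 1].scope_depth == current_depth)
        symbol_count--;
    current_depth--;
}

/* Called by the parser's declaration actions. */
void declare(const char *name, int is_typedef) {
    assert(symbol_count < MAX_SYMBOLS);
    symbols[symbol_count++] =
        (struct symbol){ name, is_typedef, current_depth };
}

/* Called by the lexer: the innermost declaration of the name wins. */
enum token_kind classify(const char *name) {
    for (int i = symbol_count - 1; i >= 0; i--)
        if (strcmp(symbols[i].name, name) == 0)
            return symbols[i].is_typedef ? TOK_TYPENAME : TOK_IDENTIFIER;
    return TOK_IDENTIFIER;  /* unknown names are treated as plain identifiers */
}
```

With this in place, the grammar itself can stay context-free over two distinct terminals, and the context sensitivity lives entirely in the table lookup.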

------
jrockway
Does the parser really need to know this information? XX yy; only means one
thing: declare( type: XX, variable: yy). Making that work is the job of
something else down the pipeline that knows about the concept of types and
symbol tables.

As an example, Emacs syntax-highlights both "YY xx" and "int xx" the same way.

~~~
judofyr
Yes, the parser needs to know about this information. Example:

    (A) * B


This is either A multiplied by B, or a cast of the dereferenced value of B to
type A: <http://en.wikipedia.org/wiki/The_lexer_hack>
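Both readings can occur in one valid C file, which is a small sketch of why the parser cannot decide from the token sequence alone. The function names below (`as_cast`, `as_product`, `check_both_readings`) are made up for this illustration; the expression `(A) * B` is spelled identically in both functions.

```c
#include <assert.h>

typedef int A;              /* at file scope, A names a type */

int as_cast(void) {
    double value = 3.9;
    double *B = &value;
    return (A) * B;         /* cast: dereference B, then convert to int */
}

int as_product(void) {
    int A = 6;              /* this local variable hides the typedef */
    int B = 7;
    return (A) * B;         /* multiplication of two ints */
}

/* The same token sequence yields two different results. */
void check_both_readings(void) {
    assert(as_cast() == 3);      /* 3.9 truncated to int */
    assert(as_product() == 42);  /* 6 * 7 */
}
```

Which parse tree applies depends only on what `A` was declared as in the enclosing scope.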

~~~
jrockway
Fair enough. Would the problem be fixed if the C standard said that *x
(without spaces) had to be dereferencing and that x * y (with spaces) had to be
multiplication?

~~~
demallien
It was as if a million C programmers cried out and were suddenly silenced.

Significant whitespace is evil, at least in the context of C where it is not
significant anywhere else.

~~~
jrockway
Whitespace is significant in C. Consider the difference between "f o o" and
"foo" :)

~~~
demallien
Idon'tknowwhatyou'retalkingabout! :)

~~~
eru
You like your Commodore 64 BASIC (or Ye Olde Fortran)?

------
wbhart
I've been using an LALR(1) parser generator by Paul Mann
<http://highperware.com/> which gets around this context sensitivity problem.
I've had to hack it a bit to handle scopes, but this wasn't a major problem.

It generates an amazing lexer/parser, which can do something like a million
lines of code a second. But he hasn't open sourced the parser generator nor
ported it from Windows. I've been trying to convince him to do so, and he came
oh so close recently.

He's been trying to figure out how to make money from it if he open-sources
it. As it is, everyone seems to be totally ignoring it despite its
sophistication compared with flex/bison.

~~~
wbhart
Ah, the binaries are unavailable at present. He's working on a new version and
says it'll be up in a week or so.

~~~
wbhart
Apparently they're back now!

------
paulbmann
Consider this context sensitive C code:

    typedef unsigned int uint, uint * uintptr;

Here is how to specify a sample grammar to handle the above statement with a
new product called LRSTAR. Note, this grammar has no conflicts and is LALR(1).
Here is the grammar:

    <identifier> => lookup ()

    Declaration -> VarDecl1 /','... ';'
                -> typedef VarDecl2 /','... ';'

    VarDecl1    -> Type... <identifier>

    VarDecl2    -> Type... <identifier> => defterm(2,{typedef})

    Type        -> SimpleType...
                -> Type '*'

    SimpleType  -> char
                -> int
                -> short
                -> unsigned
                -> {typedef}

You can download LRSTAR at <http://HighperWare.com>.

------
paulbmann
The LRSTAR binaries are up now with sample projects at
<http://HighperWare.com>

