
Scandalous Weird Old Things About The C Preprocessor (2015) - AndreyKarpov
http://blog.robertelder.org/7-weird-old-things-about-the-c-preprocessor/
======
klodolph
A nit: the comment about how the C preprocessor can't be described by a
context-free grammar is followed by a comment about how # directives are
sensitive to where they are in the line. That's actually a non sequitur! It's
totally possible for context-free grammars to be line-oriented. The "context"
in this sense has a very technical meaning: the left side of a production
rule must have only one symbol (i.e. no context).

~~~
robertelder
Can you give me an example grammar to help visualize your statement? I can
probably make a note or clarification. I think one other point I forgot to
mention was that the production rule for the # directives would have to look
something like

    
    
        include-directive:
            beginning-of-file # INCLUDE ...
            newline-from-previous-line # INCLUDE ...
    

Since 'beginning-of-file' isn't really something you can reduce to a token I
justified that it would be context sensitive.

~~~
chubot
FWIW, I haven't thought too much about the case of the preprocessor, but this
series of blog posts might help:

[http://trevorjim.com/how-to-prove-that-a-programming-
languag...](http://trevorjim.com/how-to-prove-that-a-programming-language-is-
context-free/)

My initial thoughts are that the property of being context-free only refers to
a parser, not to the lexer. A lexer can be line-oriented, as mentioned.

The blog post mentions the lexer issue. The definition of a context-free
language is fuzzy because you could have an infinitely powerful lexer, but he
gives a precise definition.

~~~
klodolph
In the strictest sense, "context-free" applies to either a grammar or a
language. A language is context-free if there exists a context-free grammar
for it, but there might be several grammars for the same language.

The lexical syntax is usually a "regular language" or quite close to one; this
includes C. All regular languages are also context-free: the set of regular
languages is a strict subset of the context-free languages. Regular languages
can be recognized using regular expressions.

And again, there's nothing wrong with being line-oriented. That's not "more
powerful" in any sense of the term. Recognizing newlines, SOF, EOF, etc. are
really simple things (in the language hierarchy sense) that you can do in
regular expressions without using any extensions that make them non-regular
(like backreferences), therefore they can be converted to a context-free
grammar.
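
To make the "line-oriented is still regular" point concrete, here is a minimal
sketch of a three-state machine that decides whether a line is a directive
line. The function and state names are made up for illustration, and it
assumes only spaces and tabs count as leading whitespace.

```c
#include <stdbool.h>

/* States of a tiny DFA; the names are illustrative only. */
enum line_state { LEADING_SPACE, DIRECTIVE, ORDINARY };

/* Returns true when the first non-whitespace character of `line` is '#'.
 * Equivalent to matching the regular expression ^[ \t]*# on one line. */
static bool is_directive_line(const char *line)
{
    enum line_state state = LEADING_SPACE;

    for (; *line != '\0' && *line != '\n'; ++line) {
        if (state != LEADING_SPACE)
            break;                      /* DIRECTIVE and ORDINARY are final */
        if (*line == ' ' || *line == '\t')
            continue;                   /* still skipping indentation */
        state = (*line == '#') ? DIRECTIVE : ORDINARY;
    }
    return state == DIRECTIVE;
}
```

No stack or backtracking is involved, which is why this kind of check fits
inside a regular (and therefore context-free) description.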

Commonly, a programming language will have regular expressions for its lexer
and a context-free grammar for its parser. Since the lexer is technically also
context-free, it's possible to combine the lexer and parser into a single
stage, but this is a bad idea for various software engineering reasons.

Some older languages, like C, have grammars that are not context-free (that's
not the same as "context-sensitive", which is a particular technical term).
One major reason C's grammar is not context-free is that it has to distinguish
between typedefs and other identifiers.
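
A small sketch of the typedef problem: the same tokens `T * x` parse as a
declaration or as an expression depending on what `T` was declared to be
earlier. The function names are made up for illustration.

```c
/* Same token sequence "T * x", two parses depending on what T names. */

static int star_as_declaration(void)
{
    typedef int T;      /* T names a type here */
    int value = 5;
    T * x;              /* declaration: x is a pointer to int */
    x = &value;
    return *x;
}

static int star_as_multiplication(void)
{
    int T = 6, x = 7;   /* T names a variable here */
    return T * x;       /* expression: T multiplied by x */
}
```

A parser has to consult a symbol table to tell the two apart, which a plain
context-free grammar cannot express.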

Of course, a language as a whole is almost never context-free, but we usually
use "context-free" to mean that the parser and lexer are context-free.

~~~
chubot
I don't think most languages in widespread use have lexers that are regular.
Here the same author gives the example of Python's indentation requiring a
stack: [http://trevorjim.com/python-is-not-context-
free/](http://trevorjim.com/python-is-not-context-free/)
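
A rough sketch of why indentation needs a stack, written in C rather than
taken from any real Python implementation. The function name, the fixed depth
limit, and the "list of indentation widths" input are all assumptions made for
the example.

```c
#include <stddef.h>

#define MAX_DEPTH 64   /* fixed capacity for brevity; real nesting is unbounded */

/* Given the indentation width of each successive line, count how many
 * INDENT and DEDENT tokens a layout-sensitive lexer would emit.
 * Returns 0 on success, -1 on an inconsistent dedent or overflow. */
static int count_layout_tokens(const int *widths, size_t n,
                               int *indents, int *dedents)
{
    int stack[MAX_DEPTH];
    size_t top = 0;

    stack[0] = 0;                       /* column 0 is always open */
    *indents = 0;
    *dedents = 0;

    for (size_t i = 0; i < n; ++i) {
        int w = widths[i];
        if (w > stack[top]) {           /* deeper: open a new block */
            if (top + 1 >= MAX_DEPTH)
                return -1;
            stack[++top] = w;
            ++*indents;
        } else {
            while (w < stack[top]) {    /* shallower: close blocks */
                --top;
                ++*dedents;
            }
            if (w != stack[top])        /* must land on an open level */
                return -1;
        }
    }
    while (top > 0) {                   /* end of input closes the rest */
        --top;
        ++*dedents;
    }
    return 0;
}
```

The lexer has to remember every enclosing indentation level, which is the
part a finite automaton cannot do for arbitrary nesting depth.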

Also Haskell: [http://trevorjim.com/haskell-is-not-context-
free/](http://trevorjim.com/haskell-is-not-context-free/)

I'm pretty sure in JavaScript that the /s+/ regex syntax can't be recognized
with a regular language. Perl is of course insane -- it's not even statically
parseable [1]. Shell isn't context-free.

The theme of this set of posts is that most languages used in practice aren't
expressible with CFGs, and I agree. I didn't prove it but he shows how you
would prove it.

[1]
[http://www.oilshell.org/blog/2016/10/20.html](http://www.oilshell.org/blog/2016/10/20.html)
(see the end of this post for links about C++, Perl, Make)

------
jandrese
> The C99 standard is about 500 pages, but only 19 of them are dedicated to
> describing how the C preprocessor should work

IMHO This is not a mistake. This is a subtle indication about how much you
should be doing with the C preprocessor.

People who try to do crazy clever tricks with the preprocessor are usually
just creating a huge problem for themselves down the road, especially when
porting over to a different OS or compiler.

If you're considering doing complex metaprogramming with the preprocessor you
need to step back and reconsider your approach and ask yourself who is going
to maintain it once you've moved on.

~~~
microcolonel
You don't need to get into "complex metaprogramming" to hit preprocessor bugs
in many implementations. I've been finding C preprocessor bugs to iron out in
glslang over the course of more than a year, and many of them are triggered by
perfectly reasonable typos.

The C preprocessor has just enough power for most of what it exists for
(smoothing over build-time oddities, code which should/must be manually
inlined with a macro to perform as desired).

Yes, you can abuse it and produce Bournegol, but there are many decent reasons
to use most features of the C preprocessor, and the way it is defined is
largely ergonomic, aside from the whitespace gotchas around function macros.

Robert is fully entitled to lament the process of implementing the
preprocessor. Far be it from me to disagree. However, none of these examples
seem weird if you just follow the intended implementation pattern: execute
each step separately (at least as a model of behaviour) and you will have the
desired result.

Even the "let me destroy your world with this example" example is
straightforward as long as you follow the process.

-2: Digraphs and trigraphs are expanded

-1: Line continuations are collapsed

0: Comments are removed or ignored

1\. A macro is defined, `foo` is `<stdio.h>`

2\. The first non-whitespace character of a line is #

3\. After the #, the next contiguous non-whitespace characters form "include",
followed by a macro token, `foo`, which we expand to `<stdio.h>`

4\. Splice in the included file, and continue preprocessing

5\. We're done, really, it was quite straightforward.
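
The steps above can be sketched with a macro-expanded include. This is a
minimal example, assuming a compiler such as GCC or Clang (how the expanded
tokens are reassembled into a header name is implementation-defined), and
`format_answer` is a made-up helper that only proves `<stdio.h>` really was
included.

```c
#define foo <stdio.h>
#include foo            /* expands to: #include <stdio.h> */

/* Uses snprintf and size_t, both of which come from <stdio.h>. */
static int format_answer(char *buf, size_t n)
{
    return snprintf(buf, n, "%d", 42);
}
```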

I think Robert's pain and confusion rests solely on his desire to define C as
a context free grammar, when C parsers are really character-oriented state
machines.

------
FrozenVoid
The most interesting things about the C preprocessor are compile-time
zero-overhead tuples and variadic macro overloading. E.g. you can pass (a,b,c)
as a single argument to function(a) macros, selectively alter the tuple
members, and perform clever tricks which look like Lisp code blocks unrolled
at compile time (visible with gcc -E). Boost.Preprocessor and Order use it to
create very powerful abstractions. See
[http://www.boost.org/doc/libs/1_64_0/libs/preprocessor/doc/i...](http://www.boost.org/doc/libs/1_64_0/libs/preprocessor/doc/index.html)
[https://github.com/rofl0r/order-pp](https://github.com/rofl0r/order-pp)
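
A minimal sketch of the tuple idea. The macro names here are hypothetical and
are not the Boost or Order APIs, and the tuples are assumed to have at least
two elements.

```c
/* Hypothetical helper macros; the names are not from Boost or Order. */
#define FIRST_(a, ...)  a
#define REST_(a, ...)   (__VA_ARGS__)
#define FIRST(tuple)    FIRST_ tuple    /* FIRST((1,2,3))  -> 1       */
#define REST(tuple)     REST_ tuple     /* REST((1,2,3))   -> (2,3)   */
#define APPLY(f, tuple) f tuple         /* APPLY(g, (1,2)) -> g (1,2) */

/* An ordinary function, used to show a tuple being unpacked into a call. */
static int add3(int a, int b, int c) { return a + b + c; }
```

The parenthesized list travels as one macro argument, and placing a
function-like macro name directly in front of it unpacks it again, so
`APPLY(add3, (1, 2, 3))` becomes a plain `add3 (1, 2, 3)` call after
preprocessing.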

Variadic overloading of function macros is a technique from
[http://stackoverflow.com/questions/11761703/overloading-
macr...](http://stackoverflow.com/questions/11761703/overloading-macro-on-
number-of-arguments) combined with _Generic which overloads by type. It allows
things like print("abc",1.2,3) to transform into
printf("%s","abc"),printf("%f",1.2),printf("%d",3) without any overhead in the
generated code.
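
A rough sketch of how the two techniques combine, assuming a C11 compiler.
This is not the code from the linked wiki: `FMT`, `PICK` and the `PRINTn`
macros are names invented for the example, only three types are handled, and
it stops at three arguments.

```c
#include <stdio.h>

/* Overload by type: choose a format string from the argument's type.
 * Only char *, const char *, int and double are covered in this sketch. */
#define FMT(x) _Generic((x),            \
    char *:       "%s",                 \
    const char *: "%s",                 \
    int:          "%d",                 \
    double:       "%f")

#define PRINT1(a)       printf(FMT(a), (a))
#define PRINT2(a, b)    (PRINT1(a), PRINT1(b))
#define PRINT3(a, b, c) (PRINT1(a), PRINT2(b, c))

/* Overload by argument count: the arguments shift the candidate names so
 * that the right PRINTn lands in the NAME slot. The trailing 0 keeps the
 * variadic part non-empty. */
#define PICK(_1, _2, _3, NAME, ...) NAME
#define print(...) \
    PICK(__VA_ARGS__, PRINT3, PRINT2, PRINT1, 0)(__VA_ARGS__)
```

With these definitions, `print("abc", 1.2, 3)` expands to three `printf`
calls joined by comma operators, and the whole selection happens before any
code is generated.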

see
[https://www.reddit.com/r/frozenvoid/wiki/voidh](https://www.reddit.com/r/frozenvoid/wiki/voidh)
Edit: and the tuple members are just symbolic tokens, they can contain
anything from code to nested tuple trees. (code can be encapsulated in
[https://gcc.gnu.org/onlinedocs/gcc/Statement-
Exprs.html](https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html))

------
komerdoor
I once started experimenting with (GNU) C macros:
[https://gist.github.com/machuidel/d7cc099ddc4970c6ddf4](https://gist.github.com/machuidel/d7cc099ddc4970c6ddf4)

It became an abomination consisting of many C preprocessor hacks, and it was
impossible to debug. I never put it online (and I never will).

In the end if you want to use C at a more abstract level you may as well use
Nim (not C, but it compiles to C89), C++ (of course) or write your own code
generator (using libclang with annotations for ex.).

~~~
FrozenVoid
D mixin templates/mixins are another alternative: you can include arbitrary
strings from functions running at compile time. Theoretically they're more
powerful than C macros, but type safety (template parameters) makes some
constructs awkward, unless you create and parse strings as tokens manually
(basically reading/writing many strings to achieve what the C preprocessor
does with symbolic token composition).
[https://dlang.org/mixin.html](https://dlang.org/mixin.html)
[https://dlang.org/spec/template-mixin.html](https://dlang.org/spec/template-
mixin.html) Templates can also include mixins: template liter(String
s){mixin(s);}

------
rwmj
[The scandal isn't that we're still using it?]

I actually used the fact that:

    
    
        #define return(r) ...
    

only expands for:

    
    
        return (something);
    

and not for:

    
    
        return something;
    

in a (non-serious non-production) program I wrote. I felt slightly dirty, but
it made sense in the context -- tracking stack frames in order to do precise
garbage collection of C code.
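
A minimal sketch of the trick. The counter is a hypothetical stand-in for
whatever stack-frame bookkeeping the real program did, and the function names
are made up. It relies on a function-like macro only expanding when its name
is followed by `(`.

```c
static int traced_returns = 0;   /* hypothetical counter, for illustration */

/* Function-like macro: only expands when `return` is followed by `(`.
 * The `return` inside the replacement is not expanded again. */
#define return(r) return (traced_returns++, (r))

static int with_parens(int v)    { return (v + 1); }  /* macro fires   */
static int without_parens(int v) { return v + 1; }    /* macro ignored */

#undef return   /* keep the trick from leaking into later code */
```

Calling `with_parens` bumps the counter and calling `without_parens` leaves
it alone, even though both return the same value.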

~~~
naasking
That sounds interesting. Any links or a further description?

~~~
rwmj
Not anything released yet, but the idea is not new. I stole it from OCaml
(possibly it came from elsewhere before that):
[https://caml.inria.fr/pub/docs/manual-
ocaml/intfc.html#sec42...](https://caml.inria.fr/pub/docs/manual-
ocaml/intfc.html#sec423)

~~~
naasking
Ah, I was thinking it might be a more complex series of tricks to integrate GC
more easily with C. For instance, redefining some C keywords, like 'return',
to inject a prologue for freeing tagged locals or something.

------
cryptonector
[https://news.ycombinator.com/item?id=10945552](https://news.ycombinator.com/item?id=10945552)

------
raarts
Well I like it because of all the things you can DO with it, not for all the
things you can do wrong with it.

------
PeCaN
For better or worse, I wrote
[https://github.com/alpha123/cpplib](https://github.com/alpha123/cpplib) which
is actually used in a commercial product.

The C preprocessor can be remarkably useful for metaprogramming.

