
The Little C Function From Hell - gus_massa
http://blog.regehr.org/archives/482
======
scott_s
If this sort of thing is interesting to you, the author and his colleagues
wrote an excellent PLDI 2011 paper about their research which systematically
checks various compilers for exactly these kinds of bugs: "Finding and
Understanding Bugs in C Compilers":
<http://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf>

Perhaps my favorite part of that paper is the fact that there is no ground
truth (section 2.6). They discover bugs by testing random programs against
multiple compilers. If the results from any of the compilers disagree, then
there must be a bug. (They guarantee that the inputs are legal.) In theory,
it's possible that all of the compilers could be wrong in the same way, which
means they wouldn't discover a bug. In practice, this is extremely unlikely.
But you can't know for sure. (In practice, they never saw an instance where
there were three different results from three compilers; at least two of the
compilers always agreed.)

~~~
Deestan
> They discover bugs by testing random programs against multiple compilers. If
> the results from any of the compilers disagree, then there must be a bug.

How does this make sense?

If the result differs from the specification, it is a bug.

If the result is unspecified in the specification, the different compilers can
differ as much as they want without any of them being considered buggy.

~~~
silentbicycle
C compilers can be buggy, particularly when you start working with vendor-
supplied compilers for embedded platforms. A colleague was furious when he
realized that his board's compiler _didn't support function pointers_.

~~~
joshAg
that does seem like a rather large omission. how did he work around that
issue?

~~~
pmjordan
I could imagine a DSP architecture that doesn't intrinsically support indirect
jumps. (especially as DSPs frequently use the Harvard memory model) That would
make implementing function pointers tricky. I'd probably work around this by
making a set of dispatch macros that expand into a giant switch block where
each case is a (static) function call. The other option would be self-
modifying code, which is annoying to do, to say the least, particularly for
Harvard systems.

~~~
kragen
If your CPU supports keeping function return addresses on a stack that you can
push other things onto, you can do an indirect jump by pushing the address you
want to jump to and then "returning" to it. That's a lot easier than self-
modifying code or massive switch statements, and just as easy on Harvard as on
von Neumann architectures.

------
raverbashing
A little C tip I learned from hard experience

NEVER, EVER, NOT IN A MILLION YEARS use a signed int/char etc., unless you are
_200% certain_ you're doing the right thing (that is, you need it for something
specific)

You WILL have problems, period.

"Oh it's just a matter of knowing the C spec" then please go ahead as I grab
the popcorn.

~~~
barrkel
I've seen more problems from unsigned ints than signed ints (in particular,
people doing things like comparing with n-1 in loop boundary conditions).
There's a reason Java, C# etc. default to signed integers. Unsigned chars, I
have no quibble (and Java, C#, use an unsigned byte here).

~~~
rcfox
Unsigned integer overflow has defined behaviour in C, while signed overflow
doesn't. Is it really better to protect people from a simple logical error by
exposing them to possible undefined behaviour?

With signed integers, you'll run into the same problem with comparing to n+1
at INT_MAX or n-1 at INT_MIN.

~~~
barrkel
0 is a really common value for an integer variable in programs. INT_MAX and
INT_MIN are not.

It's just my experience. Don't get too wound up about it ;)

------
acqq
Before starting reading everything, note that the author assumed when writing
the article that (provided sizeof( int ) > sizeof( char )):

    char c = (char)128;
    printf( "%d\n", c );

should always be -128, whereas his commenter Mans (March 4, 2011 at 4:15 am)
points out that this conversion is _implementation-defined_ according to the
standard, that is, compiler authors are free to decide what the result of such
a conversion should be.

~~~
dchest
Should the implementation-defined behavior always be the same? That is,

    char a = (char)128;
    char b = (char)128;

Will a == b for every implementation?

~~~
lmm
Yes; that's the distinction between "implementation-defined" and "undefined"
behaviour.

~~~
beagle3
While that's how it is generally understood, there is nothing stopping the
implementation from defining the behaviour as:

    conversion is rounded up to the nearest multiple of 20 on odd lines.
    conversion is rounded down to the nearest multiple of 17 on even lines.

Older versions of gcc had such (fully standard-compliant) behaviour when you
used #pragma -- it could launch rogue or nethack, among other things -- but
later versions succumbed to implementing useful pragmas.

------
Dove
You know, not long ago I took somebody's "how well do you know C?" quiz, and
it was full of this sort of question -- what C does with overflows and
underflows in various circumstances. And I must admit, I felt like I had been
asked what happens, exactly and specifically, to a particular piece of memory
after dividing by zero. "I don't know, I try to avoid doing that!"

I don't know. I can admire the analysis, but I don't understand the motive. Do
people really write code that relies on this sort of behavior? Or is it just
trivia for trivia's sake?

~~~
rcfox
If it was a week or two ago, the quiz was probably from the same person as
this article.

He's not just doing it for fun; he's a professor at the University of Utah,
and he's researching this area, looking for bugs in compilers. In fact, he's
developed a tool for this: <http://embed.cs.utah.edu/csmith/>

These tiny bits of strange code are condensed versions of what you might see
in the wild, especially after preprocessing.

Nobody's doing ++x > y, but they do something that looks reasonable like
foo(x) > bar(x), where foo() and bar() return chars.

~~~
nitrogen
_Nobody's doing ++x > y, but they do something that looks reasonable like
foo(x) > bar(x), where foo() and bar() return chars._

I might write something like "++x > y"; preincrement followed by comparison is
a common operation.

------
sltkr
To be absolutely accurate, the program still invokes implementation-defined
behaviour. From the C standard on casting integers: “if the new type is signed
and the value cannot be represented in it; either the result is
implementation-defined or an implementation-defined signal is raised”.

Therefore, the author's conclusion that “the behavior is well-defined and
every correct compiler must emit this output” is plain wrong. A correct
compiler might emit a signal instead of outputting anything.

(However, printing 1 for the last case is still wrong, because there is no
possible way for ++x to yield a value greater than INT_MAX, so this cannot be
consistent with any implementation-defined behaviour.)

~~~
sltkr
^ In the above, I meant CHAR_MAX instead of INT_MAX. Oops!

------
dchest
Just tried with Plan 9-derived compilers shipped with Go, and there's no bug
(provided that my re-implementation of the test case is correct).

------
Arcticus
Dan Saks did a great set of presentations called "C and C++ Gotchas" on these
types of things at the 2006 Embedded Systems Conference in San Jose.

Sorry, I couldn't find a link that wasn't behind a paywall, but here is one
for reference.

[http://eetimes.com/electrical-engineers/education-
training/t...](http://eetimes.com/electrical-engineers/education-
training/tech-papers/4125901/C-and-C--Gotchas)

------
WalterBright
The Digital Mars C compiler returns the correct answer with or without
optimization turned on.

------
krollew
"That’s a lot of trouble being caused by a two-line function. C may be a small
language, but it’s not a simple one." I disagree. C is very simple. You need
to know that C is not supposed to work the same on every platform, and I guess
the behaviour you have tested is not defined by the standard. I think every
good C programmer knows when he has to be cautious because behaviour may be
platform-dependent. As a matter of fact, you did as well. I guess the case
you've studied is not common in real code. If so, there is a very easy
solution:

    #if ((char)CHAR_MAX) + 1 > ((char)CHAR_MAX)
     /* some code here */
    #else
     /* some code here */
    #endif

There is no problem for me.

~~~
cperciva
Every problem has a solution which is simple, elegant, and doesn't work. This
is it.

Section 6.10.1, paragraph 4: _"... For the purposes of this token conversion
and evaluation, all signed integer types and all unsigned integer types act as
if they have the same representation as, respectively, the types intmax_t and
uintmax_t..."_

Inside preprocessor directives, your chars aren't chars any more.

~~~
krollew
So what are they? Why doesn't it work? Your quote doesn't clarify.

~~~
cperciva
Your "char" in a preprocessor directive is either a uintmax_t or an intmax_t.
Either way, it's going to end up as #if 128 > 127 or #if 256 > 255 -- so the
first case will always end up being included.

~~~
krollew
Is that standard behaviour, or does just one compiler do so? Does char have
another meaning outside the preprocessor? Thanks anyway.

~~~
cperciva
Outside of the preprocessor, the char type is a one-byte integer (whether it's
signed or not is implementation-defined).

~~~
krollew
So the good solution is to make a test that finds it out, for example in a
configuration script, set a proper preprocessor constant, and test that
constant instead.

------
DanielBMarkham
I confess to just scanning the code and being cold on C right now, but isn't
++x a _post_ increment? That is, it occurs after the rest of the expression
has been evaluated? (as opposed to x++, which is a pre-increment) Just
guessing that the perhaps the issue is when the actual overflow is occurring.

~~~
vhf
It is the opposite, like in many other languages. e.g.

    int i = 0;
    printf("%i %i", i++, ++i); // prints "0 2"

Same goes in C, C++, Java, PHP, ...

[EDIT] Turns out this is a bad example, as "the order in which function
arguments are evaluated is undefined" (cf. below). The correct version is:

    int i = 0;
    printf("%i", i++); // prints 0
    printf("%i", ++i); // prints 2

~~~
decklin
Actually, the order in which function arguments are evaluated is undefined.
<http://c-faq.com/expr/comma.html>

You can use a comma in other expressions to introduce a sequence point:
<http://c-faq.com/~scs/cgi-bin/faqcat.cgi?sec=expr#seqpoints>

~~~
vhf
Thanks for your clever comment, and to those raising similar concerns.

You surely are right. I wanted to give a quick example; turns out it was a bad
one. Next time I'll write:

    int i = 0;
    printf("%i", i++); // prints 0
    printf("%i", ++i); // prints 2

~~~
hythloday
_turns out it was a bad one_

Given that you were making a point on an article about the complexity of C,
I'd say it was an unintentionally excellent example.

~~~
sirclueless
If you really want to make your brain hurt, there was an article on HN a while
ago about the following statement:

    i = i++;

Evaluating that expression takes a _real_ dive into the guts of the C spec.

------
TwoBit
Summary: ++x for signed char is supposed to convert x to int before the ++,
but some compilers get it wrong.

------
robot
The user must know about overflows and act accordingly. Compiler behavior may
naturally change based on optimisations since it is undefined territory.

------
malkia
That's undefined behavior, the compiler can do whatever it pleases to. It can
even print 666, and be done.

~~~
AceJohnny2
No it's not. I had the same thought, but the author carefully points out from
the standard: "The usual arithmetic conversions ensure that two operands to a
“+” operator of type signed char are both promoted to signed int before the
addition is performed."

He does assume sizeof(int) > sizeof(char), which is true on all platforms he
has tried. It would be undefined on an AVR or other microcontroller where
sizeof(int) == sizeof(char) though.

~~~
nova
_undefined on an AVR or other microcontroller where sizeof(int) ==
sizeof(char) though._

Just noting that ints are 16 bits wide in AVR-GCC, unless you use the -mint8
option, which violates the C standard.

~~~
JoachimSchipper
Absolutely true. Note, though, that sizeof(int) == sizeof(char) _is_ allowed
if you have 16-bit chars.

~~~
cperciva
16-bit chars aren't allowed in modern versions of C -- CHAR_BIT is required to
be 8.

~~~
JoachimSchipper
Thanks for the update! (For the curious: C99 and POSIX both require CHAR_BIT
== 8.)

~~~
pascal_cuoq
C99 only requires CHAR_BIT to be at least 8 (5.2.4.2.1:1). POSIX requires it
to be exactly 8.

~~~
cperciva
Oops, quite right. I missed the "equal _or greater_ in magnitude" line when I
was reading that section.

