
When you see a heisenbug in C, suspect your compiler’s optimizer - rglovejoy
http://esr.ibiblio.org/?p=1705
======
antirez
Switching off the compiler optimizer may just "hide" the bug because memory
accesses are reordered in a way that makes the bug (for instance a memory
corruption or some allocation problem) less likely to trigger a segmentation
fault.

Assuming you have found a compiler bug every time a bug disappears when you
switch off the optimizer is not a good idea. 99.99999% of the time the problem
is in your code.

Experienced programmers usually will continue to think the bug is in their own
code unless they can _prove_ otherwise.

~~~
dkersten
Exactly. Select is probably not broken.

If the program works at some compiler optimization levels and not others, then
think about what the optimizer is doing and how this may change the
circumstances in which the bug appears. I agree that it is probably a memory
corruption issue and that by turning off optimizations, you are hiding the
symptom and not fixing the bug.

I think there should be a law that states: _if you use a language like C or
C++, you must ensure it compiles cleanly with all warnings turned on AND runs
without error under a tool like Valgrind_.

There are simply too many places where bugs may creep in to leave it to
chance. The tools exist - use them!

~~~
imurray
Very good advice. I assume my C code does have memory blunders until I have
run it extensively through valgrind, after which I might begin to believe any
other analysis I have done that suggests the code is correct.

I also tend to test a build linked with gcc’s mudflap:

    gcc -g -fmudflap -lmudflap

Your program will run much faster than under valgrind. I have had bugs that
have been missed by valgrind but caught with mudflap and vice versa. Don’t try
to link with mudflap _and_ run under valgrind at the same time, though:
valgrind won’t work.

~~~
bd_at_rivenhill
Many thanks, I've used valgrind a lot, but never heard of mudflap, I'll give
it a try.

~~~
imurray
Depending on your distribution it may not be installed automatically with gcc.
The package you need is called libmudflap0-4.4-dev in Ubuntu 9.10.

------
nkurz
My experience goes in a different direction: when you see different behavior
when compiled with and without optimization, suspect a memory allocation error
or overrun. While compiler optimization bugs do exist, I've more often found
that the problem is real.

If you are working with your own code and care whether it works:

      1) Turn on all compiler warnings
      2) Change your code so it compiles clean
      3) Run under Valgrind (or equivalent)
      4) Address all reported errors, specifically 
         whitelisting them if necessary.
      5) If you find a bug, don't stop until you've found
         the cause.  You're done when you understand what
         caused the bug to appear, not when the symptoms go away. 
      6) Use open source tools, since otherwise you'll be
         tempted to blame some unspecified 'bug in the compiler'.
         (not that ESR would be using any other)
      7) If it is a compiler bug, report it, along with 
         the smallest test case you can generate.
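
As an illustration of the kind of error steps 3 and 4 are meant to catch, here
is a minimal sketch of a string copy. The function name copy_string is
hypothetical and not taken from the thread. The comment describes the
off-by-one variant that may appear to work at one optimization level and fail
at another, and that Valgrind's memcheck reports as an invalid write.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the thread): returns a heap copy of s,
 * or NULL if allocation fails. The caller frees the result. */
char *copy_string(const char *s)
{
    size_t len = strlen(s);

    /* The classic overrun is malloc(len): one byte short, so the
     * terminating '\0' written below would land past the end of the
     * block. Allocating len + 1 leaves room for the terminator. */
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, s, len);
    copy[len] = '\0';
    return copy;
}
```

Whether the one-byte-short version crashes depends on what happens to sit
after the allocation, which is why changing optimizer settings can make the
symptom come and go.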

~~~
wheels
This is definitely along the right path. One proviso that I'd note: in my own
debugging, when I've run into something that goes away at a different level of
optimization, it's usually uninitialized variables / memory.
Valgrind is the shizzle.

------
jgrahamc
It seems strange to me to instantly suspect that your compiler's optimizer is
at fault. Which is more likely: you've found a bug in a compiler used the
world over, or you've screwed up memory access?

Since C makes it easy to overrun memory, it's pretty easy to make horrible
mistakes that have seemingly random consequences.

The fact that the bug changes when you change optimizer settings, add trace
statements, or add debug code would make me suspect memory corruption first.

In fact, I think it's a good assumption to always begin suspecting your own
code.

------
sreque
I have an example of a bug I found that looked like an optimization bug but
was really a user bug. Someone had the bright idea of sprintf-ing to a string
and using the string itself as one of the format arguments to the sprintf
call. When compiled with gcc without optimization the code ran fine, but when
compiled with -O1 the string would end up garbled. The problem (and advantage,
performance-wise) of C is that most of the time you do something wrong, the
behavior is undefined, whereas most higher-level languages will spend the CPU
cycles to protect you from yourself.
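
The original code is not shown, so the following is only a sketch of the
pattern and one way to avoid it. The C standard leaves sprintf undefined when
copying takes place between overlapping objects, which is the case when the
destination is also a format argument. The helper name append_tag and the
" [tag]" format are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: appends " [tag]" to the string in buf, which has
 * room for cap bytes in total. tag must not point into buf.
 * Returns 0 on success, -1 if the result would not fit. */
int append_tag(char *buf, size_t cap, const char *tag)
{
    /* The risky form is: sprintf(buf, "%s [%s]", buf, tag);
     * buf is both the destination and a source there. Writing at the
     * end of the existing string avoids any overlap. */
    size_t used = strlen(buf);
    if (used >= cap)
        return -1;

    int n = snprintf(buf + used, cap - used, " [%s]", tag);
    if (n < 0 || (size_t)n >= cap - used) {
        buf[used] = '\0';   /* undo the partial write on truncation */
        return -1;
    }
    return 0;
}
```

For example, calling append_tag on a 32-byte buffer holding "status" with the
tag "ok" leaves "status [ok]" in the buffer.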

Another example I can't remember the details of, but it was related to the
fact that gcc adds code to zero-initialize your stack on first access unless
optimizations are turned on. Code that checked for null pointers worked fine
until optimizations were turned on, at which point it was discovered that a
variable was being used uninitialized.
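
A sketch of how a NULL check can come to depend on an uninitialized variable.
This is not the code from the incident; the struct and the function name
last_node are hypothetical. The comment marks where the missing initializer
would go wrong.

```c
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Hypothetical example: returns the last node of a list, or NULL for an
 * empty list. */
struct node *last_node(struct node *head)
{
    /* Declaring this as "struct node *last;" with no initializer is the
     * bug pattern: for an empty list the loop never assigns it, so the
     * function would return an indeterminate value and the caller's
     * NULL check would be testing garbage. */
    struct node *last = NULL;

    for (struct node *p = head; p != NULL; p = p->next)
        last = p;

    return last;
}
```

If the uninitialized version happens to find a zero in that stack slot in one
build, the caller's check passes by luck, and a different optimization level
can take that luck away.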

------
liuliu
Except in multithreaded programming, I haven't found any bugs related to the
compiler's optimizer. The truth is, most heisenbugs I have found are related to
memory access. It is just too risky to assume it is a compiler bug. I always
think the other way around: unless you can prove otherwise (by generating the
assembly code and a possible scenario), the bug is in your code.
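
The comment does not say which multithreading problems were involved, so this
is only one common case, sketched under that assumption: a stop flag polled in
a loop. With a plain bool, unsynchronized access from two threads is a data
race, and an optimizer is allowed to read the flag once and keep the loop
spinning. The names stop_requested, request_stop and spin_until_stopped are
hypothetical; the sketch uses C11 atomics.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stop flag shared between threads. Static storage, so it
 * starts out false. Declared as a plain "bool", the load in the loop
 * below could legally be hoisted out of the loop by the optimizer. */
static atomic_bool stop_requested;

/* Called by the thread that wants the worker to stop. */
void request_stop(void)
{
    atomic_store(&stop_requested, true);
}

/* Spins until a stop is requested or max_iters is reached; returns the
 * number of iterations performed. */
unsigned long spin_until_stopped(unsigned long max_iters)
{
    unsigned long iters = 0;

    /* atomic_load forces a fresh read of the flag on every pass. */
    while (iters < max_iters && !atomic_load(&stop_requested))
        iters++;

    return iters;
}
```

The iteration cap is there only so the sketch can be exercised from a single
thread; a real worker would be started with a thread library and would do
actual work inside the loop.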

------
tptacek
-O3 isn't "riskier". It's just harder to debug under.

~~~
regehr
-O3 is indeed riskier. All experienced embedded systems developers know this: embedded compilers tend to be much buggier than compilers for desktop platforms. But desktop compilers are buggy too. Over the last couple years my group has reported 190 bugs to compiler development teams. A lot of these bugs turn up only at higher optimization levels. If you search on my email address "regehr@cs.utah.edu" as bug reporter in either GCC or LLVM's bugzilla, you can see plenty of examples.

~~~
tptacek
I defer to the gentleman with the University research project dedicated to
finding compiler bugs. :)

~~~
nitrogen
If he's still teaching it, I recommend his advanced embedded systems class to
anyone at utah.edu who wants to gain practical experience with such compiler
errors.

~~~
regehr
Thanks :). I'll be teaching it in the Fall.

It was actually this class that motivated the whole compiler bug-finding
project. The quality of the average embedded compiler is appalling; students
trip on codegen bugs all the time.

Of course as many people are pointing out in this thread, most of the time the
compiler is not to blame when changing optimization options changes program
behavior.

------
ja27
It's not quite the same as the 'heisenbug' but I've seen a couple of cases
over the years where a mysterious problem was resolved by moving to the latest
update of the C/C++ runtime. It's weird how many enterprises are fine with
running years behind on maintenance on that.

------
Vivtek
I found an optimizer bug once. I can't remember the exact circumstances, but
it had to do with some fancy inline incrementing I was doing. The very concept
that I'd found a bug in _somebody else's code_ floored me.

I was young.

------
jmaynard
I didn't expect a kind of Spanish inquisition...

The terminally curious may download a file containing the assembler output,
and the C source, of the offending file from
<http://www.hercules-390.org/esamebug.zip> . This corresponds to revision 5627
of the Hercules emulator as found in the Subversion repository at
svn://svn.hercules-390.org/hercules/trunk . The emulator itself is at
<http://www.hercules-390.org> .

The routine is in the generated assembler as z900_load_multiple_long.

~~~
jgrahamc
Can you also give us a test case?

~~~
jmaynard
I might be able to build one to be executed under Hercules...let me work on
that. My current test case is under NDA.

~~~
jmaynard
Done. You can get it at <http://www.hercules-390.org/lmg-test.zip> .

~~~
jgrahamc
Cool. Any particular things I need to know? e.g. which platform should I try
this on?

Also, have you run valgrind against the test?

~~~
jmaynard
The test has only been shown to fail on Mac OS X Snow Leopard with gcc 4.2.1.
I haven't been able to make it fail on any other platform. Because of the code
involved, I suspect it won't fail at all on anything but 32-bit Intel.

I've never run valgrind...it'll be interesting to see just what it does to
Hercules execution speed. mudflap, too. Getting that built into the code might
get even more interesting.

------
WalterBright
Optimizer and code gen problems are pretty hard for users to diagnose
correctly, so most compiler vendors put a high priority on fixing them.

------
ivan_w
I found the bug (I think). It's not "really" an optimizer bug per se
(although it is CLEARLY triggered by the optimizer).

Always treat __asm with caution!

--Ivan

