When you see a heisenbug in C, suspect your compiler’s optimizer (ibiblio.org)
40 points by rglovejoy on Feb 13, 2010 | 47 comments

Switching off the compiler optimizer may just "hide" the bug because memory accesses are reordered in a way that makes the bug (for instance a memory corruption or some allocation problem) less likely to trigger a segmentation fault.

To think that every bug suppressed by switching off the optimizer is a compiler bug is not a good idea. 99.99999% of the time the problem is in your code.

Experienced programmers usually will continue to think the bug is in their own code unless they can prove otherwise.

Exactly. Select is probably not broken.

If the program works at some compiler optimization levels and not others, think about what the optimizer is doing and how that may change the circumstances under which the bug appears. I agree that it is probably a memory corruption issue and that by turning off optimizations you are hiding the symptom, not fixing the bug.

I think there should be a law that states: if you use a language like C or C++, you must ensure it compiles cleanly with all warnings turned on AND runs without error under a tool like Valgrind.

There are simply too many places where bugs may creep in to leave it to chance. The tools exist - use them!
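
As a concrete case of what those tools catch, here is a minimal sketch of the classic off-by-one heap allocation. The function name `dup_string` is hypothetical, made up for illustration. The version below is the correct one; the comment marks the line that, written without the `+ 1`, memcheck would typically report as an invalid write just past the end of the heap block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a heap copy of s, or NULL on allocation
 * failure. The caller owns the result and must free() it. */
char *dup_string(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);

    /* The "+ 1" reserves room for the terminating '\0'. Writing
     * malloc(len) here is the classic one-byte heap overrun: it often
     * appears to work, and may only misbehave at some optimization
     * levels or with some allocators. */
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, s, len + 1);   /* copies the terminator too */
    return copy;
}
```

Building with `gcc -g -Wall -Wextra` and running the result under `valgrind` is one way to check a routine like this; the warnings alone will not flag a wrong allocation size.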

Very good advice. I assume my C code does have memory blunders until I have run extensively through valgrind, after which I might begin to believe any other analysis I have done that suggests the code is correct.

I also tend to test a build linked with gcc’s mudflap:

   gcc -g -fmudflap -lmudflap
Your program will run much faster than under valgrind. I have had bugs that were missed by valgrind but caught by mudflap, and vice versa. Don’t try to link with mudflap and run under valgrind at the same time, though; valgrind won’t work.

Many thanks, I've used valgrind a lot, but never heard of mudflap, I'll give it a try.

Depending on your distribution it may not be installed automatically with gcc. The package you need is called libmudflap0-4.4-dev in Ubuntu 9.10.

I worked at a shop where I wrote C for almost 2 years and we ran into this case twice. It was a compiler bug in the optimizer, and only happened when it tried to also optimize the way specific structures were laid out in memory. Using the zero-index array at the end of a struct to get a pointer to the following buffer in this case caused the offset to be wrong and we were over running our buffer.
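
For readers who haven't met the idiom: the "zero-index array at the end of a struct" is a trailing array used to reach a buffer allocated right after the header. Below is a minimal sketch using the standard C99 flexible array member (`data[]`) rather than the older `data[0]` extension. The `struct packet` type and `packet_new` function are hypothetical names, not the code from that shop. The payload's offset comes entirely from the struct layout, which is the thing a layout-related compiler bug would get wrong.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header followed directly by a variable-length payload. */
struct packet {
    size_t len;             /* number of payload bytes */
    unsigned char data[];   /* C99 flexible array member, must be last */
};

/* Allocate header and payload in one block and copy len bytes from src.
 * Returns NULL on failure; release the result with free(). */
struct packet *packet_new(const void *src, size_t len)
{
    assert(src != NULL || len == 0);

    /* Guard against size_t overflow in the size computation below. */
    if (len > SIZE_MAX - sizeof(struct packet))
        return NULL;

    /* sizeof(struct packet) does not count any elements of data[]. */
    struct packet *p = malloc(sizeof(struct packet) + len);
    if (p == NULL)
        return NULL;

    p->len = len;
    if (len > 0)
        memcpy(p->data, src, len);
    return p;
}
```

Checking `offsetof(struct packet, data)` against the address of `p->data` in a small test is a cheap way to confirm the compiler and the code agree on where the buffer starts.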

This was xlC though.

  Experienced programmers usually will continue to think the
  bug is in their own code unless they can prove otherwise.
s/Experienced/Mature/ Because ESR has plenty of experience. The kind of reasoning in the OP comes from arrogance and the narcissism of valuing one's self-concept as a competent individual over getting the job done efficiently.

If someone other than ESR had written this article, would you have reacted the same way?

Actually, that's pretty much accurate. I initially didn't even realise this was written by ESR and just assumed the author was very inexperienced. ESR however should know better, though to be honest I don't even know if the guy still writes code. I hate to stoop to ad hominem attacks, but re-reading it knowing that it's him I get the impression this was a petty reaction to some feud with a GCC dev. In any case I'd love to know who the hell is upvoting this "story".

Yes, when I started reading, my first thought was, "wow, sounds like someone pretty inexperienced." After I noticed it was written by ESR, that changed to, "wow, sounds like someone pretty arrogant."

The fact of the matter is that a compiler like gcc is used by thousands (tens of thousands? more?) of people almost daily. Usually you have to be doing some pretty crazy stuff to find a bug in it. Bugs that go away when you turn off optimizations are usually either race-condition or memory-access related.

Or uninitialised local variables, which are affected by the difference in register allocation, but really you should be enabling the corresponding warning.
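
A minimal sketch of that failure mode, with a made-up function name (`count_negative`). The version below is correct; the comment marks the initializer whose absence would make the result depend on whatever was left in the register or stack slot. GCC's `-Wall` includes `-Wuninitialized`, though how reliably the warning fires has historically depended on the optimization level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: count the negative entries in v[0..n-1]. */
size_t count_negative(const int *v, size_t n)
{
    assert(v != NULL || n == 0);

    /* Dropping "= 0" leaves count indeterminate. An unoptimized build
     * may happen to find a zero in that stack slot; an optimized build
     * may keep count in a register holding leftover data. */
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (v[i] < 0)
            count++;
    }
    return count;
}
```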

No, unless the author had proven themselves in other contexts to be narcissistic and immature, I would have given them the benefit of the doubt.


It's a simple fact that ESR is the author of the article in question.

Especially if it involves multithreading. Altering the timing of things means everything.

Turning off the optimizer can also hide memory barriers you forgot you'd need, by doing extra loads or stores rather than keeping an outdated value in a register.
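
A rough sketch of the kind of barrier meant here, written with C11 `<stdatomic.h>` (which postdates this thread; at the time you would have used compiler builtins or platform primitives). The names `publish` and `try_consume` are hypothetical. The release store and acquire load keep the compiler from caching the flag in a register and order the payload write before the flag becomes visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;         /* plain data, guarded by the flag below */
static atomic_bool ready;   /* static storage: starts out false */

/* Producer side: write the data, then publish it. */
void publish(int value)
{
    payload = value;
    /* Release: the payload write cannot be reordered after this store. */
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: returns true and fills *out once data is published. */
bool try_consume(int *out)
{
    assert(out != NULL);
    /* Acquire: a real load on every call, and the payload read below
     * cannot be reordered before it. */
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

In real use `publish` and a loop around `try_consume` would run on different threads. With a plain `bool` flag, an optimizer is allowed to hoist the load out of that loop, while an unoptimized build reloads it each time, which is why the bug can vanish at `-O0`.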

That said, I have seen exactly one optimizer bug that I know of. Back in 1993, Borland C++ completely omitted one of my inline destructors from the binary. I had to review the assembly to convince myself I wasn't imagining things.

I've seen my fair share of compiler (and even assembler) bugs, but they were almost all in customised, proprietary compilers for game consoles. Those get nowhere near the number of eyeballs that GCC does. 99 times out of 100, weird bugs are my own stupid fault. The ratio is even higher for multithreaded code.

Very true. It doesn't prove anything that -O0 works while -O3 doesn't. It's still almost certainly a bug in your code.

I don't think the percentage is 99.99999% though. Compiler bugs do exist. I have seen several, more than four, in 12 years of C programming.

Well, the percentage is a way of saying: You do not encounter a compiler bug in day-to-day-programming. 'four compiler bugs in 12 years' is just another way of saying: You do not encounter a compiler bug in day-to-day-programming.

(Note the 'day-to-day'. Day-to-day programming and 'all the programming you do in x*10 years of C programming' are very different sets of code.)

As an example...

I once wrote some MPI code for class. It ran properly at -O0, though the compiler warned of a variable that was declared but never used. Compiling at a level other than -O0 or removing that variable declaration from the source code caused the program to segfault immediately. It turned out to be a memory error somewhere else in the program (I forget exactly what, but it was over my misunderstanding of some part of the message passing calls).

It depends on the context as well. You are much more likely to find a bug in gfortran for Solaris SPARC than in gcc on x86 Linux, for example. The only time I keep the possibility of a compiler bug in mind is when working with the gcc 4.x compiler on Windows.

I don't have the experience of ESR, but I find the advice a bit dangerous if taken as a general one. Especially, the idea that a heisenbug is often caused by a compiler: most likely, the heisenbug is not a heisenbug at all, but just less visible depending on the compiler flags. That was the case for the vast majority of "heisenbugs" I have encountered in C.

FWIW, I do tend to think the problem's in my code until proven otherwise. That's why I spent a week tearing my hair out trying to find the problem.

And, for the naysayers, Hercules is built with -W -Wall.

While it's not an everyday occurrence, it happens a lot more often than you would believe. In the Linux kernel, there is actually an official list of compiler versions that produce faulty code.

My experience goes in a different direction: when you see different behavior when compiled with and without optimization, suspect a memory allocation error or overrun. While compiler optimization bugs do exist, I've more often found that the problem is real.

If you are working with your own code and care if it works:

  1) Turn on all compiler warnings
  2) Change your code so it compiles clean
  3) Run under Valgrind (or equivalent).
  4) Address all reported errors, specifically 
     whitelisting them if necessary.
  5) If you find a bug, don't stop until you've found
     the cause.  You're done when you understand what
     caused the bug to appear, not when the symptoms go away. 
  6) Use open source tools, since otherwise you'll be
     tempted to blame some unspecified 'bug in the compiler'.
     (not that ESR would be using any other)
  7) If it is a compiler bug, report it, along with 
     the smallest test case you can generate.

This is definitely along the right path. One proviso that I'd note: Most of the time in my own debugging when I've run into something that goes away at a different level of optimization it's uninitialized variables / memory. Valgrind is the shizzle.

It seems strange to me to instantly suspect that your compiler's optimizer is at fault. Which is more likely: you've found a bug in a compiler used the world over, or you've screwed up memory access?

Since C makes it easy to overrun memory, it's pretty easy to make horrible mistakes that have seemingly random consequences.

The fact that the bug changes when you change optimizer settings, add trace statements, or add debug code would make me suspect memory corruption first.

In fact, I think it's a good assumption to always begin suspecting your own code.

I have an example of a bug I found that looked like an optimization bug but was really a user bug. Someone had the bright idea of sprintf-ing to a string and using the string itself as one of the format arguments to the sprintf call. When compiled with gcc the code would still run fine, but when compiling with -O1 the string would end up garbled. The problem (and advantage, performance-wise) of C is that most of the time you do something wrong, the behavior is undefined, whereas most higher-level languages will spend the CPU cycles to protect you from yourself.
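
The pattern described is something like `sprintf(buf, "%s%s", buf, suffix)`, which the C standard leaves undefined because the destination overlaps one of the source arguments. Here is a minimal sketch of one safe alternative, with a hypothetical helper name (`append_text`): format into the unused tail of the buffer instead of rereading the destination.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append suffix to the NUL-terminated string in
 * buf (capacity cap bytes). suffix must not point into buf. Returns
 * false and leaves buf unchanged if the result would not fit. */
bool append_text(char *buf, size_t cap, const char *suffix)
{
    assert(buf != NULL && suffix != NULL);

    size_t used = strlen(buf);
    size_t need = strlen(suffix);
    if (used >= cap || need >= cap - used)
        return false;

    /* Source and destination do not overlap: the write starts at the
     * existing terminator, and suffix is a separate object. */
    snprintf(buf + used, cap - used, "%s", suffix);
    return true;
}
```

Formatting into a temporary buffer and copying afterwards would work as well; the point is only that no argument may alias the output.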

Another example I can't remember the details of, but it was related to the fact that gcc adds code to zero-initialize your stack on first access unless optimizations are turned on. Code that checked for null pointers worked fine until optimizations were turned on, at which point it was discovered that a variable was being used uninitialized.

Outside of multithreaded programming, I haven't found any bugs caused by a compiler's optimizer. The truth is, most heisenbugs I have found were related to memory access in some way. It is just too risky to assume it is a compiler bug. I always think the other way around: unless you can prove otherwise (by generating the assembly code and a plausible scenario), the bug is in your code.

-O3 isn't "riskier". It's just harder to debug under.

-O3 is indeed riskier. All experienced embedded systems developers know this: embedded compilers tend to be much buggier than compilers for desktop platforms. But desktop compilers are buggy too. Over the last couple years my group has reported 190 bugs to compiler development teams. A lot of these bugs turn up only at higher optimization levels. If you search on my email address "regehr@cs.utah.edu" as bug reporter in either GCC or LLVM's bugzilla, you can see plenty of examples.

I defer to the gentleman with the University research project dedicated to finding compiler bugs. :)

If he's still teaching it, I recommend his advanced embedded systems class to anyone at utah.edu who wants to gain practical experience with such compiler errors.

Thanks :). I'll be teaching it in the Fall.

It was actually this class that motivated the whole compiler bug-finding project. The quality of the average embedded compiler is appalling; students trip on codegen bugs all the time.

Of course as many people are pointing out in this thread, most of the time the compiler is not to blame when changing optimization options changes program behavior.

Embedded compilers are much worse. You have to pick and choose which stable version of gcc-4.x you can safely use. My AVR projects have been broken by compiler changes.

But it's much more than the optimizer. Even code generation at -O0 can be broken by assumptions about alignment, insn size, etc. This usually happens when you're using a very new or very old part and the gcc developers make assumptions based on their limited dev board setups.

All appreciation should be paid to those gcc developers as it is a very difficult job they do for free. Thanks!

I know little about C compilers, but if -O3 resulted in segfaults while other settings did not, how is it not riskier? Are the different optimization levels independent such that this bug could have appeared anywhere, but just happened to be in one of the strategies used by -O3?

Because, like a number of other people have pointed out upthread, it can make the impact of memory corruption problems more immediate.

There is also the possibility that if your program has an aliasing bug, the bug may only cause failure at -O2 but not at -O0 or -O1.

See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35653, for example.
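
For anyone who hasn't been bitten by this: GCC enables `-fstrict-aliasing` at `-O2` and above, so code that reads an object through a pointer of an incompatible type can appear to work at `-O0` and break when optimized. Below is a minimal sketch, assuming a 32-bit IEEE 754 `float`; `float_bits` is a hypothetical name. The cast shown in the comment is the buggy form, and `memcpy` is the well-defined replacement.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is 32 bits wide, as on common desktop platforms. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: return the raw bit pattern of f. */
uint32_t float_bits(float f)
{
    /* Buggy form, undefined under the aliasing rules:
     *     return *(uint32_t *)&f;
     */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* well defined; compilers usually
                                         reduce this to a register move */
    return bits;
}
```

If rewriting the offending code is not practical, building with `-fno-strict-aliasing` is the usual workaround, at some cost in optimization.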

It's not quite the same as the 'heisenbug' but I've seen a couple of cases over the years where a mysterious problem was resolved by moving to the latest update of the C/C++ runtime. It's weird how many enterprises are fine with running years behind on maintenance on that.

I found an optimizer bug once. I can't remember the exact circumstances, but it had to do with some fancy inline incrementing I was doing. The very concept that I'd found a bug in somebody else's code floored me.

I was young.

I didn't expect a kind of Spanish inquisition...

The terminally curious may download a file containing the assembler output, and the C source, of the offending file from http://www.hercules-390.org/esamebug.zip . This corresponds to revision 5627 of the Hercules emulator as found in the Subversion repository at svn://svn.hercules-390.org/hercules/trunk . The emulator itself is at http://www.hercules-390.org .

The routine is in the generated assembler as z900_load_multiple_long.

Can you also give us a test case?

I might be able to build one to be executed under Hercules...let me work on that. My current test case is under NDA.

Done. You can get it at http://www.hercules-390.org/lmg-test.zip .

Cool. Any particular things I need to know? e.g. which platform should I try this on?

Also, have you run valgrind against the test?

The test has only been shown to fail on Mac OS X Snow Leopard with gcc 4.2.1. I haven't been able to make it fail on any other platform. Because of the code involved, I suspect it won't fail at all on anything but 32-bit Intel.

I've never run valgrind...it'll be interesting to see just what it does to Hercules execution speed. mudflap, too. Getting that built into the code might get even more interesting.

Optimizer and code gen problems are pretty hard for users to diagnose correctly, so most compiler vendors put a high priority on fixing them.

I found the bug (I think). It's not "really" an optimizer bug per se (although it is CLEARLY triggered by the optimizer).

Always treat __asm with caution!

