
When you see a Heisenbug in C, change your compiler's optimization level (2010) - adamnemecek
http://esr.ibiblio.org/?p=1705#
======
ohazi
This is a little misleading.

Yes, when you see something like this, changing the optimization level may
indeed influence whether or not the bug is _visible_ , and is a useful tool
for figuring out what went wrong. But these days it's pretty rare that the
optimizer is _at fault_. Usually what you'll find is that the optimizer is
indeed making a valid optimization, but your code is relying on unspecified C
or C++ behavior that just _happens to work_ when you're at -O0.

Your code is still the thing that's technically incorrect.
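
A minimal sketch of the pattern (hypothetical function names; one common case, not the bug from the article): an overflow check that itself depends on signed overflow. It tends to behave as intended at -O0, but an optimizer is entitled to assume `x + 1` never wraps and fold the test to 0.

```c
#include <limits.h>

/* Relies on signed overflow, which is undefined behaviour. An optimizer may
   assume x + 1 never wraps and reduce this whole function to "return 0". */
int next_would_overflow_ub(int x) {
    return x + 1 < x;
}

/* Well-defined: compares against the limit instead of overflowing. */
int next_would_overflow(int x) {
    return x == INT_MAX;
}
```

Only the second version is guaranteed to return 1 for `INT_MAX`, whatever the optimization level.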

~~~
evanphx
Ordering is basically unspecified behavior in C. An optimizer's best friend
is the ability to reorder, perform CSE, etc.

C and C++ are specified as the behavior of a straight-line sequence of code
run in isolation. Basically, any concurrent C/C++ program treads into the
unspecified behavior area. Thus it's not the optimizer's fault, but C/C++
certainly aren't helping out at all.

Even the mechanisms that are commonplace to make a concurrent C/C++ program
run correctly basically wade into unspecified behavior, for instance using
asm("lock esp"), asm("mfence"), etc.

TL;DR: concurrency is unspecified in C.

~~~
detrino
C11/C++11 define a memory model and provide assorted primitives.

~~~
letzjuc
Indeed. It should be:

>TL;DR: concurrency is unspecified in _C99_.

Since it is specified in C11's memory model.
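
A small sketch of what that looks like in practice, assuming a C11 toolchain with `<stdatomic.h>` (the names `publish`, `try_consume`, `payload` and `ready` are made up for illustration). The release store and acquire load give the compiler and the CPU a specified ordering to respect, with no inline asm involved.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* ordinary, non-atomic data */
static atomic_bool ready;  /* static storage, so it starts out false */

/* Writer side: fill in the data, then publish it with a release store. */
void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader side: an acquire load that observes true also makes the
   earlier write to payload visible. */
bool try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

The two functions are meant to be called from different threads; the thread setup is left out to keep the sketch short.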

------
CJefferson
> “OK, set your optimizer to -O0,”, I told Jay, “and test. If it fails to
> segfault, you have an optimizer bug. Walk the optimization level upwards
> until the bug reproduces, then back off one.”

That is 99.9% rubbish. By which I mean in 99.9% of cases, when code works at
-O0 and not at higher optimization levels your code has a bug, almost
certainly involving invoking undefined behaviour.

Certainly it is possible to find optimiser bugs, I have done it myself, but it
is many times more common for your code to have bugs. Now, you could argue
that optimizers shouldn't make such extensive use of undefined behaviour,
which leads to these kinds of bugs, but it is perfectly valid for a C compiler
to do so.

~~~
cpncrunch
Agreed. Many times I've seen a bug disappear when I've added printfs or turned
on -g, but I've never, ever seen an optimization bug in a C/C++ compiler.

The only time I've seen optimization bugs is in the emscripten compiler, but
I'm sure those will be ironed out eventually (and the current version seems
pretty stable).

~~~
infogulch
That has happened to me as well. There's a bug until I add a printf or enable
debugging, or (paradoxically) go from -O0 to -O2. What's going on here? Is
there some specific class of UB that I'm invoking to get this infuriating bug?

~~~
cpncrunch
I think it's usually just accessing out of bounds memory / uninitialized read,
and -g/-O just happens to change the memory layout enough to cause/prevent the
crash.

------
greenyoda
I've had a long career of writing C code that runs on many different operating
systems and CPU architectures, and I don't think I've ever personally stumbled
across a problem that was caused by a compiler bug. In my experience,
heisenbugs (bugs that seem to disappear when you try to debug them)[1] are
usually caused by bad memory references that corrupt different areas of memory
- sometimes fatally, sometimes without visible effect - when the program is
compiled in debug vs. release mode. For example, optimized code might put more
variables in registers, thus changing the set of variables that are vulnerable
to being clobbered by a buffer overflow.
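
As a rough illustration of the kind of bad write described above (a sketch; `copy_name` is a hypothetical helper, not code from any project mentioned here): an unbounded `strcpy` into a small stack buffer overwrites whatever the compiler placed next to it, and that neighbour differs between debug and release builds. A bounded copy removes the overflow instead of moving it.

```c
#include <stddef.h>
#include <string.h>

/* Copies at most dst_size - 1 bytes of src into dst and always
   NUL-terminates. Returns 1 if src fit entirely, 0 if it was truncated. */
int copy_name(char *dst, size_t dst_size, const char *src) {
    if (dst_size == 0)
        return 0;
    size_t len = strlen(src);
    size_t n = len < dst_size - 1 ? len : dst_size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return len == n;
}
```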

Rushing to blame a bug in the compiler is, much more often than not, a way to
waste a lot of time while debugging. Most commonly, we are the cause of our
own bugs.

[1]
[https://en.wikipedia.org/wiki/Heisenbug](https://en.wikipedia.org/wiki/Heisenbug)

~~~
dalke
While I'm fuzzy on the details now, one problem we had, back in the 1990s, was
in C++ code which used comments with "\"s to make it look pretty.

    
    
       /// I am a comment! \\\
       a = 5;
    

The SGI OCC compiler accepted it just fine, as did most of the other
compilers. It wasn't until we compiled on HP, or perhaps AIX, that we had a
compiler which elided the "\" + newline to give

    
    
       /// I am a comment! \\a = 5;
    

and caused the downstream code to fail.

As for heisenbugs specifically, I agree with you. I just wanted to tell a
story. :)

~~~
cnvogel
I had a similar story I'd like to share. A colleague had written out the
initialization data for some embedded component, and helpfully included the
ASCII representation for some of the (mixed) binary/readable stuff we sent. So
he wrote something along the lines of...

    
    
                unsigned char initdata[] = {
                   0x01,  // some value, e.g. 513
                   0x02,  // as 16bit, little endian
                   0x40,  // ASCII @
                   0x5c,  // ASCII \
                   0x42,  // ASCII *
                   0x09,  // number 9
                  (...)
                };
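
For what it's worth, the trap in both stories is that backslash-newline splicing happens before comments are stripped, so a `//` comment ending in a backslash absorbs the following source line; in a table like the one above, that would silently drop the next byte. A hedged sketch of one way around it (the array name `initdata_safe` is made up for illustration) is to use block comments:

```c
/* A block comment ends at its own closing delimiter, so a backslash
   inside it cannot pull the next source line into the comment. */
const unsigned char initdata_safe[] = {
    0x01, /* some value, e.g. 513 */
    0x02, /* as 16bit, little endian */
    0x40, /* ASCII @ */
    0x5c, /* ASCII \ */
    0x42, /* the byte that a spliced line comment would swallow */
    0x09, /* number 9 */
};
```

Checking `sizeof initdata_safe` against the expected length at build time is a cheap way to notice a missing element.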

------
spc476
And the punchline: the heisenbug in question was in the code being compiled,
not the compiler
([http://esr.ibiblio.org/?p=1705&cpage=1#comment-248245](http://esr.ibiblio.org/?p=1705&cpage=1#comment-248245)).

~~~
to3m
I think if the code changes due to adding or removing -g, that's definitely a
bug in the compiler!

If the generated code does things that can't be represented in the debug info,
there are plenty of better solutions, with generating useless debug info being
an obvious one. Useless debug info is annoying, but far from the end of the
world, not least because just about any experienced programmer will be used to
it by now...

~~~
GFK_of_xmaspast
The only "compiler bugs" I have seen in 20+ years of coding have, with the
exception of the times the compiler itself crashed, all been user
error, and most of those errors are things like using uninitialized values
(-g, at least on some compilers, sets memory to 0, instead of 'whatever's on
the stack at the moment') or stack corruption that corrupted different parts
of the stack at different optimization levels.

~~~
to3m
-g should really be controlling the emission of debug info, and nothing else, I would have thought? This is independent of the amount of optimization being performed.

In fact the code part of the result should be identical in both cases, and
ideally at runtime it should be impossible to tell the difference between -g
and not (since presumably the loader won't load/map the debug info even if
it's embedded).

------
svec
When you see a Heisenbug in C, change your mind: it's NOT the compiler. It's
you.

At least that's been my experience in almost 20 years of programming a lot of
C and C++.

I've found my share of compiler bugs, particularly in embedded compilers, but
99%+ of the time the Heisenbugs are mine: stray pointers, stack overflows,
buggy concurrency, unexpected interrupt timing, etc.

When you see a Heisenbug in C, it's probably NOT:

1. The hardware.
2. The compiler.
3. The OS.

It's not you (compiler), it's me.

~~~
masklinn
> When you see a Heisenbug in C, change your mind: it's NOT the compiler. It's
> you.

Also known as "select isn't broken".

------
mannykannot
In the article's comments, the author of the code in question says "I 'solved'
the problem by sticking in one trace call that made the problem go away. That
kind of fix makes me vaguely nauseous, but ugly working code beats pretty
broken code every time." He does not appear to have an explanation of the
error that the optimizer is allegedly making, or for how this change fixes it.

If you have made a 'fix' but you don't know what problem it solves and how it
does so, you probably haven't fixed anything, and you can expect further
trouble from the root cause.

~~~
stusmall
I remember an almost impossible to diagnose heisenbug I had on an embedded
device years ago. It was a small embedded system with no MMU and very poor
tooling. It took 3 of us stuck in a room working long hours together for 2
weeks to trace it down. I ended up having to port portions of the code base to
an easier to use system and luckily the bug ended up being in that subset of
the code. The real kicker on this bug is we could never reproduce it with
optimizations turned off so it was a constant battle between adding in
diagnostics and trying to keep the bad behavior intact. Every little change we
would make would shift where the memory corruption was and the reordering from
the optimizations made the changes unpredictable at times. Ahhh, shitty
memories.

I can't express the joy when the source of the bug ended up not being my
fault.

That bug chase in particular made me extremely wary of people who alter
systems to fix bugs without understanding exactly why it made the issue go
away. It might not have fixed it but just moved it.

~~~
detrino
So what caused the bug?

~~~
stusmall
Writing to an uninitialized pointer.

------
cjensen
No no no no! I've had dozens of heisenbugs in my career. I've only seen
compiler bugs once or twice.

------
cwzwarich
In my experience (which comes from developing compilers and bringing up
operating systems on new architectures with untested compilers), compiler bugs
are rarely 'heisenbugs' in the traditional sense, meaning bugs that go away
when you attempt to debug them.

Compiler bugs tend to be more deterministic, where some code is being
generated incorrectly and it will do the wrong thing every time you execute
that code. Changing the optimization settings may cause the bug to go away,
but it's still deterministic given the same compiled binary. One exception
would be a bug where a compiler is reordering instructions across a memory
barrier, but those are even rarer than normal compiler bugs, since compilers
are generally paranoid about reordering anything with barrier intrinsics or
inline assembly.

------
noselasd
My version of this would be "When you see a Heisenbug in C, run your program
through valgrind".

Which is a good idea even if you don't observe any bugs.

~~~
svec
Amen to that!

------
malkia
This is what helps me find bugs and debug release code (MSVC specific):
put this around a function to temporarily disable optimizations:

    
    
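      // MSVC has no push/pop form of this pragma: "" with off disables all
      // optimizations, and "" with on restores the /O settings from the command line.
      #pragma optimize("", off)
    
      void the_function()
      {
      }
    
      #pragma optimize("", on)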
    

Usually some kind of bisection, or a lucky guess, is needed to figure out that
it's really a compiler bug.

------
huhtenberg
And when you see a heisenbug in MSVC++, just rebuild your project.

I don't know about newer versions, but VS 2010 and earlier have a nasty trait
of spitting out wrong binary code now and then, even with all optimizations
off and edit-and-continue and incremental linking disabled. You would just make a
little change, build, launch and get a crash in the middle of nowhere with the
most bizarre stack trace. Then rebuild it and it will magically go back to
normal. For a project of 100k lines this is a weekly, sometimes daily,
routine. Very trust inspiring.

~~~
grkvlt
probably just stale object files, hardly unique to MSVC

------
ambrop7
Actually, -g should not change the output of the compiler in any other way
than the presence of debug symbols. I can confirm that right here, where I'm
building code for an ARM microcontroller, the resulting .bin files produced by
objcopy are identical (but the .elf files are not).

Granted, when the debug symbols end up in the final ELF binary which is
executed by an OS, it can still affect execution in subtle ways such as load
times.

------
chewychewymango
It's probably worth pointing out that the person with the bug in the story,
Jay Maynard, is also Tron Guy.

------
fromdoon
I once had the opportunity to investigate a heisenbug. We could see that
the bug disappeared after turning the optimization off, but our clients were
not satisfied and wanted a more detailed explanation and exact root cause
analysis.

We were using a proprietary C compiler tool chain provided by our vendor and
did not get much help from them either.

Finally, we had to sit down, get the assembly from the disassembler, and go
through all 800 lines of it. And we found the bug, sitting quietly in one of
the pipelines.

I am not a compiler guy, but that day I understood the beauty of compiler
optimization.

------
kgabis
I once had a bug in my program where it would work correctly only on -O3 (not
quite correctly, but good enough for me to not notice). After a lot of head
scratching I ran clang's static analyzer [0] and it found it to be an
uninitialized variable. At -O3 the compiler was reusing another variable, which
happened to be very close to what I needed.

[0] [http://i.imgur.com/lqs4FNn.png](http://i.imgur.com/lqs4FNn.png)
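
A tiny sketch of this class of bug (the function name `sum` and its parameters are hypothetical, not the code from the screenshot). Without the initializer, the accumulator starts from whatever was left in its register or stack slot, so the result can look almost right at one optimization level and wrong at another.

```c
#include <stddef.h>

/* Adds up count values. */
int sum(const int *values, size_t count) {
    /* The buggy form is "int total;" with no initializer: reading it is
       undefined, and the starting value changes with the optimization level. */
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```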

------
nitrogen
A better approach: when you see a Heisenbug in C, run your program through
Valgrind's memory and threading checking tools.

------
walshemj
I have had one large piece of PL1/G code - a whole mess of mappers for a map
reduce based billing system where rebuilding the entire code base made one bug
go away.

~~~
foxhill
by go away, you of course mean hidden to be found and solved later? bugs do
not typically just fix themselves.

~~~
dalke
Can't say anything about that project, but there could be a build dependency
error where a change didn't get propagated all the way through. The "bug"
isn't so much a coding bug but a synchronization bug that can be fixed by
doing a full rebuild.

For example, on one system I used there was a ~2 second difference between my
desktop machine, the build machine, and the NFS host for the directory I was
working in. This would sometimes lead to:

    
    
        make: warning: Clock skew detected. Your build may be incomplete.
    

If I saved a .C file just after the .o file was written, the .C might still
have a timestamp older than the .o, so it would not be included in a
rebuild.
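
A toy model of that failure, assuming the usual rule that a target is rebuilt only when a prerequisite's timestamp is strictly newer (the helper names `needs_rebuild` and `stamped_at` are made up, and this is not how make is implemented):

```c
#include <stdbool.h>
#include <time.h>

/* The timestamp rule in miniature: rebuild only if the source is
   strictly newer than the object file. */
bool needs_rebuild(time_t src_mtime, time_t obj_mtime) {
    return src_mtime > obj_mtime;
}

/* Timestamp a file receives when the host stamping it has a clock that
   is off by skew_seconds relative to the true time. */
time_t stamped_at(time_t true_time, long skew_seconds) {
    return true_time + skew_seconds;
}
```

With the object written at true time 1000 on an accurate clock and the source saved at true time 1001 on a clock running 2 seconds slow, the source is stamped 999, so `needs_rebuild` returns false and the stale object survives.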

This required occasional manual deletes of the .o file to make sure that it
was building correctly, and we probably did a full rebuild of the entire code
every once in a while to reduce these sorts of problems. (This was back in the
c-front days, with 15 minute compilation times.)

------
GFK_of_xmaspast
It is not:

1. lupus
2. a tumor
3. a compiler bug

I'm not at all surprised to see ESR being enough of a scrub to go to 'compiler
bug!' as his first guess.

