Hacker News
When you see a Heisenbug in C, change your compiler's optimization level (2010) (ibiblio.org)
52 points by adamnemecek on March 1, 2014 | hide | past | favorite | 82 comments


This is a little misleading.

Yes, when you see something like this, changing the optimization level may indeed influence whether or not the bug is visible, and is a useful tool for figuring out what went wrong. But these days it's pretty rare that the optimizer is at fault. Usually what you'll find is that the optimizer is indeed making a valid optimization, but your code is relying on unspecified C or C++ behavior that just happens to work when you're at -O0.

Your code is still the thing that's technically incorrect.


Ordering is basically unspecified behavior in C. An optimizer's best friend is the ability to reorder, perform CSE, etc.

C and C++ are specified as the behavior of a straight-line sequence of code run in isolation. Basically, any concurrent C/C++ program treads into the unspecified behavior area. Thus it's not the optimizer's fault, but C/C++ certainly aren't helping out at all.

Even the mechanisms that are commonplace for making a concurrent C/C++ program run correctly basically wade into unspecified behavior, for instance using asm("lock esp"), asm("mfence"), etc.

TL;DR: concurrency is unspecified in C.


C11/C++11 define a memory model and provide assorted primitives.


Indeed. It should be:

>TL;DR: concurrency is unspecified in _C99_.

Since it is specified in C11's memory model.


> TL;DR: concurrency is unspecified in C.

One of the many reasons I'm eagerly the 1.0 release of Rust.


I think you accidentally a word.


Just a Heisenbug, he was using a browser written in C or C++.


Looks like we've found a compiler bug. Who has Stallman's email?


sigh Yeah, my edit window expired. Very eagerly awaiting. :-P


>TL;DR: concurrency is unspecified in C.

Does the same argument apply to binary code run on a preemptive operating system? The actual execution order of your code is still unspecified, as the OS can interrupt it whenever it wants.


Mostly. The OS and hardware will make certain guarantees about ordering and concurrency, and if you want more you need to specifically ask them for more. C makes almost no guarantees, but the basic concept is the same. Use the provided extensions to get safe order and concurrency.


This is absolutely correct, and the article worries me for that reason. C programmers shouldn't learn "Heisenbug == compiler error" -- that's rarely the case. Compilers aren't generally formally verified, but they're also very heavily tested, and it will rarely happen that you expose a compiler bug. It's far more likely you're relying on UB in some way.


The author states that this has only happened 3-4 times in 30 years. I don't see anything that would make a reasonable programmer suspect that this is always the case or even usually the case.


Indeed, if you re-interpret the article's usage of the phrase "optimizer bug", then the advice becomes useful: if changing the optimization level fixes your broken code, you have an undefined-behavior bug.


Either that, or a race condition.


I'm not sure that's necessarily the case. See my comment above about sequential consistency.


Aren't those undefined behavior bugs?


Imagine two threads that safely lock a resource, increment it, copy the value, and then unlock it. This is entirely defined, but also a race condition.

Race conditions are a lot like unsanitized input. They don't cause problems by themselves, but if you make incorrect assumptions it's easy to write incorrect code.


It's data races specifically that are undefined behaviour.


Technically yes, but the C++ spec gives the optimizer a lot of leeway, in ways that aren't necessarily safe.

Few programmers have a deep enough understanding of the C++ spec to realize things like how the optimizer is free to ignore NULL checks after a memory location has been accessed. They just see a `pointer == NULL` passing, then code failing on a null pointer.

And the problem is that they shouldn't need that level of understanding to write C++. The optimizer should be doing things like warning about unnecessary null checks that could degrade performance; that would notify the developer that there's a problem.

But that would cause issues for people working with large amounts of legacy code, so it's not done.


"Ignoring NULL checks after a memory location has been accessed" -- what do you mean by that?


Standard example:

    void f(struct some_struct* p) {
        int x = p->some_field;
        
        /* ... */
        
        if (p != NULL) {
            /* this block might be executed even if p = NULL */
        }
    }
Because reading `p->some_field` is already undefined behavior unless `p != NULL`, the compiler is free to assume that `p != NULL` is always true, and might avoid the check.

If the memory access doesn't crash the program for whatever reason (maybe it got reordered somewhere else or eliminated as dead code or whatever, I dunno), then if you call that function with a NULL pointer, you fall into undefined behavior that might manifest as that check that you put right there being skipped.


> If the memory access doesn't crash the program for whatever reason (maybe it got reordered somewhere else or eliminated as dead code or whatever, I dunno), then if you call that function with a NULL pointer, you fall into undefined behavior that might manifest as that check that you put right there being skipped.

No; if `p` is NULL, this function has undefined behavior. Full-stop. It is a 100% meaningless function as soon as `p` is NULL, because the "NULL check" happens after the pointer is dereferenced.

So the issue isn't that the compiler can make incorrect optimizations -- the compiler makes optimizations that are entirely correct, assuming that the code that you wrote isn't meaningless.


The code isn't meaningless, it just exhibits undefined behaviour. This doesn't make the program wrong, it just means the standard has nothing to say about what precisely might happen. If the compiler chooses to infer the presence of an equivalent to VC++'s __assume (http://msdn.microsoft.com/en-us/library/1b3fsfxw%28v=vs.90%2...) from the presence of undefined behaviour, it's within its rights to do so, but this particular approach is by no means mandatory.

In fact, most of the compilers I've used actually don't do this, and are (in my view) all the better for it.

See also this rant on gcc's strict aliasing, borne of the same philosophy: http://robertoconcerto.blogspot.co.uk/2010/10/strict-aliasin...


> The code isn't meaningless, it just exhibits undefined behaviour.

That's what undefined behavior means. Semantically speaking, the C language assigns no meaning to that function if the input pointer is NULL, and is therefore "wrong" by any reasonable definition of the word if it is NULL -- so the compiler is free to make an array of optimizations based on the fact that the input pointer is not NULL.


I think that interpretation is too strict. The standard is fairly clear on what the result of undefined behaviour might be, defining it as:

``behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements.

``NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).'' (italics mine)

This sounds like a long way from "meaningless" in my book. To my reading, the purpose of undefined behaviour appears to be to avoid unduly constraining implementations by not mandating behaviour that could be inefficient, costly or impossible to provide.

You (or anybody else!) may disagree on how far this inch given could or should be taken. But I think the fact the standard explicitly suggests that undefined behaviour could do something reasonable is evidence that programs producing undefined behaviour do not necessarily have to be considered meaningless.

(As a concrete example I have worked on one system where NULL was a pointer to address 0, and where address 0 was readable. Not only that, but in fact address 0 actually contained useful information, and some system macros used it. It was some kind of process information block and so there was a whole family of macros that looked like "#define getpid() (((uint32_t * )0)[0])", "#define getppid() (((uint32_t * )0)[1])", that sort of thing. I'd say this is rather odd, but the standard would appear to allow it. (However, perhaps needless to say, gcc was not the system compiler.))

(See also, the approved manner for using objc_msgSend, since time immemorial.)


On the contrary. The standard is perfectly clear that anything can happen when you write code that employs UB.

> behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements.

NO REQUIREMENTS -- so, semantically, programs that employ undefined behavior are completely meaningless.

With regard to the second quotation, the part you italicized is nice, but the part you didn't is just as important:

> Possible undefined behavior ranges from ignoring the situation completely with unpredictable results

The entire quotation basically says "when you write code with undefined behavior, anything can happen; the results can be unpredictable, or they can appear to be sensible. An error could also be triggered." But that's the point: no behavior is specified. There's no restriction on what might happen.

Take the function above that has a NULL dereference when the input pointer is NULL. That function could be compiled in such a way that it writes an ASCII penguin to stdout if the input pointer is NULL; it's totally within its rights to do that. Your mental model of how C programs work is entirely inaccurate if you expect undefined behavior in C to do something that you deem sensible.


With your summing up of the quotation, I think you're again being too strict. If the standard doesn't define undefined behaviour, which it doesn't - well, what then? You claim this renders any program that invokes undefined behaviour meaningless; I claim (as I think the standard wording implies) that this simply means the standard doesn't define the results, which then necessarily depend on the implementation in question.

(It may be OK for anything to then happen, but as a simple question of quality - and common decency ;) - an implementation should strive to ensure that the result is not terribly surprising to anybody familiar with the system in question. And I'm not really sure that what gcc does in the face of undefined behaviour, conformant though it may be, passes that test.)


In C99 there are three levels of not fully specified behaviour:

1. Undefined: anything is permitted, the standard imposes no requirement whatsoever. A typical example is what happens when a null pointer is dereferenced.

2. Unspecified: anything from a constrained set is permitted. Examples include the order of evaluation of function arguments (all must be evaluated once though any order is allowed).

3. Implementation-defined: the implementation is free to choose the behaviour (possibly from a given set), but must document its choice. An example is the representation of signed integers.


That would've already crashed at the p->some_field if p == NULL.

> the compiler is free to assume that `p != NULL` is always true

Only until the first point in the "..." part where it cannot prove that p has not been modified.


If x is unused, the compiler is allowed to remove the assignment. If it does that optimization after it removes the "redundant" null pointer test, the optimizer has legally altered your program to remove the null pointer check you thought you had.

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

Please read that article, and the rest in the series. Undefined behavior is far more pervasive than you think.


Why would you expect the NULL pointer check to do anything? The function, as given, is meaningless if `p` is NULL. The problem isn't that compilers are too aggressive in their optimization -- the problem is that people who are learning C don't usually learn what it means for the behavior of a program to be undefined.


> If x is unused, the compiler is allowed to remove the assignment

This and the Apple SSL bug makes me think that optimising compilers should be far more explicit (i.e. emitting messages or even warnings) about what they're doing than they are now, because it seems far too much about the optimisation process is being hidden and not transparent enough. Not only unreachable code, messages like "result of computation x is never used", "if-condition assumed to be {false, true}", "while-loop condition always false - body removed", etc. would be extremely useful for detecting and fixing these problems.


GCC can produce all of those warnings.


There have been Linux kernel security bugs where a NULL dereference might not result in a crash, because a malicious user-space program had mapped some accessible memory at address zero. The kernel then went on to skip the optimized-out null check, use vtable pointers in this null page and eventually execute arbitrary code in ring 0.


> That would've already crashed at the p->some_field if p == NULL.

1. Typically yes, and that is exactly why the compiler can remove the check after it, it's already undefined behavior so it can assume the optimistic case (and remove the check).

2. But it is not always true that it would crash. p->some_field might happen to be in a valid memory location to be accessed. This doesn't happen normally because low memory addresses (0-1024, say) tend to not be accessible by userspace programs. But I am not aware of a spec ensuring that. An OS could in theory let your program map memory at address 4, in which case p->some_field would succeed if p is NULL and the offset is 4.


See a certain Linux kernel exploit: http://lwn.net/Articles/342330/


Isn't it often the case that the optimizations throw out sequential consistency guarantees?

That would introduce non-determinism in concurrent executions that might have nothing to do with the semantics of the program.

For example, I'm under the impression that the most recent C++ standard basically says "if you have no data races then all will be well" even though they may be benign data races like dual assignments of the same value.


C doesn't have any sequential consistency. Even without any compiler optimizations, the hardware will reorder things across cores. Subtly on x86, massively on Alpha. You can't use multiprocessing without some kind of external guarantees.


In the embedded industry, the most common reason for things working when disabling optimizations is violating the ANSI aliasing rules. Embedded programmers type-pun all the time and don't realize it's undefined behavior in C. Most compilers have an option to disable the ANSI aliasing rules, but it really hurts optimization opportunities.

I saw a large codebase completely break when turning on intermodule inlining. The cause: more opportunities for the compiler to find pointers that aren't allowed to alias.


Yes, and most often it's clearly warned about, up to the point where lines are completely buried in type-casts, just to shut up the (well-meaning) compiler. Because, you know, we strive for 0-warning compilations!


In this case it might be useful if compilers (or a test compiling tool) had a "go crazy for undefined behaviour" option. This might be difficult to implement but could do things like randomise sort order where it's undefined. Similar to fuzz-testing an application's input, this would fuzz-test the compilation phase.


> “OK, set your optimizer to -O0,”, I told Jay, “and test. If it fails to segfault, you have an optimizer bug. Walk the optimization level upwards until the bug reproduces, then back off one.”

That is 99.9% rubbish. By which I mean in 99.9% of cases, when code works at -O0 and not at higher optimization levels your code has a bug, almost certainly involving invoking undefined behaviour.

Certainly it is possible to find optimiser bugs, I have done it myself, but it is many times more common for your code to have bugs. Now, you could argue that optimizers shouldn't make such extensive use of undefined behaviour, which leads to these kinds of bugs, but it is perfectly valid for a C compiler to do so.


Agreed. Many times I've seen a bug disappear when I've added printfs or turned on -g, but I've never, ever seen an optimization bug in a C/C++ compiler.

The only time I've seen optimization bugs is in the emscripten compiler, but I'm sure those will be ironed out eventually (and the current version seems pretty stable).


In nearly 2 decades I have seen only one compiler bug, in MSVC6, which was related to large switch statements.

The most successful way of finding what's causing these "heisenbugs" is to NOT change the binary, but just load it in the debugger as-is, make a guess at where it's failing, and set a few breakpoints/watchpoints. Either the compiler is not generating the code that you expect, or your expectations of the language were wrong, and both cases are fairly clear from looking at the generated code.


That has happened to me as well. There's a bug until I add a printf or enable debugging, or (paradoxically) go from -O0 to -O2. What's going on here? Is there some specific class of UB that I'm invoking to get this infuriating bug?


I think it's usually just accessing out of bounds memory / uninitialized read, and -g/-O just happens to change the memory layout enough to cause/prevent the crash.


At some point many hundreds of years ago, the law became too complicated for people to navigate on their own, so lawyers (and Law French) were created. How long will it be before there are paid language lawyers, arguing for compilers vs. programs in an ANSI/ISO "court"? There's probably money to be made in "optimizer trolling," come to think of it.


I've had a long career of writing C code that runs on many different operating systems and CPU architectures, and I don't think I've ever personally stumbled across a problem that was caused by a compiler bug. In my experience, heisenbugs (bugs that seem to disappear when you try to debug them)[1] are usually caused by bad memory references that corrupt different areas of memory - sometimes fatally, sometimes without visible effect - when the program is compiled in debug vs. release mode. For example, optimized code might put more variables in registers, thus changing the set of variables that are vulnerable to being clobbered by a buffer overflow.

Rushing to blame a bug in the compiler is, much more often than not, a way to waste a lot of time while debugging. Most commonly, we are the cause of our own bugs.

[1] https://en.wikipedia.org/wiki/Heisenbug


While I'm fuzzy on the details now, one problem we had, back in the 1990s, was in C++ code which used comments with "\"s to make it look pretty.

   /// I am a comment! \\\
   a = 5;
The SGI OCC compiler accepted it just fine, as did most of the other compilers. It wasn't until we compiled on HP, or perhaps AIX, that we had a compiler which elided the "\" + newline to give

   /// I am a comment! \\a = 5;
and caused the downstream code to fail.

As for heisenbugs specifically, I agree with you. I just wanted to tell a story. :)


I had a similar story I'd like to share. A colleague had written out the initialization data for some embedded component, and helpfully included the ASCII representation for some of the (mixed) binary/readable stuff we sent. So he wrote something along the lines of...

            unsigned char initdata[] = {
               0x01,  // some value, e.g. 513
               0x02,  // as 16bit, little endian
               0x40,  // ASCII @
               0x5c,  // ASCII \
               0x2a,  // ASCII *
               0x09,  // number 9
              (...)
            };


C99 explicitly does \-splicing of lines before comment processing:

  //\
  this is a comment

  /\
  / this is also a comment
GCC has an option (-Wcomment, enabled by -Wall) to warn about this and other likely mistakes involving comments (e.g. nested /* */ comments).


It's been a few years since I've really had one, but corrupting or using up the stack space was the cause of a couple of cases of seemingly weird behavior in C.


And the punchline: the heisenbug in question was in the code being compiled, not the compiler (http://esr.ibiblio.org/?p=1705&cpage=1#comment-248245).


Indeed, one wonders whether that revelation might have been the entire reason for this belated post to HN. I don't particularly feel sorry for him, but ESR does get a hard time around here.


I think if the code changes due to adding or removing -g, that's definitely a bug in the compiler!

If the generated code does things that can't be represented in the debug info, there are plenty of better solutions, with generating useless debug info being an obvious one. Useless debug info is annoying, but far from the end of the world, not least because just about any experienced programmer will be used to it by now...


The only "compiler bugs" I have ever seen in 20+ years of coding have all, with the exception of the times the compiler itself crashed, been user error, and most of those errors are things like using uninitialized values (-g, at least on some compilers, sets memory to 0, instead of 'whatever's on the stack at the moment') or stack corruption that corrupted different parts of the stack at different optimization levels.


-g should really be controlling the emission of debug info, and nothing else, I would have thought? This is independent of the amount of optimization being performed.

In fact the code part of the result should be identical in both cases, and ideally at runtime it should be impossible to tell the difference between -g and not (since presumably the loader won't load/map the debug info even if it's embedded).


When you see a Heisenbug in C, change your mind: it's NOT the compiler. It's you.

At least that's been my experience in almost 20 years of programming a lot of C and C++.

I've found my share of compiler bugs, particularly in embedded compilers, but 99%+ of the time the Heisenbugs are mine: stray pointers, stack overflows, buggy concurrency, unexpected interrupt timing, etc.

When you see a Heisenbug in C, it's probably NOT:

1. The hardware. 2. The compiler. 3. The OS.

It's not you (compiler), it's me.


> When you see a Heisenbug in C, change your mind: it's NOT the compiler. It's you.

Also known as "select isn't broken".


In the article's comments, the author of the code in question says "I 'solved' the problem by sticking in one trace call that made the problem go away. That kind of fix makes me vaguely nauseous, but ugly working code beats pretty broken code every time." He does not appear to have an explanation of the error that the optimizer is allegedly making, or for how this change fixes it.

If you have made a 'fix' but you don't know what problem it solves and how it does so, you probably haven't fixed anything, and you can expect further trouble from the root cause.


I remember an almost impossible to diagnose heisenbug I had on an embedded device years ago. It was a small embedded system with no MMU and very poor tooling. It took 3 of us stuck in a room working long hours together for 2 weeks to track it down. I ended up having to port portions of the code base to an easier-to-use system, and luckily the bug ended up being in that subset of the code. The real kicker on this bug is we could never reproduce it with optimizations turned off, so it was a constant battle between adding in diagnostics and trying to keep the bad behavior intact. Every little change we would make would shift where the memory corruption was, and the reordering from the optimizations made the changes unpredictable at times. Ahhh, shitty memories.

I can't express the joy when the source of the bug ended up not being my fault.

That bug chase in particular made me extremely wary of people who alter systems to fix bugs without understanding exactly why it made the issue go away. It might not have fixed it but just moved it.


So what caused the bug ?


Writing to an uninitialized pointer.


No no no no! I've had dozens of heisenbugs in my career. I've only seen compiler bugs once or twice.


In my experience (which comes from developing compilers and bringing up operating systems on new architectures with untested compilers), compiler bugs are rarely 'heisenbugs' in the traditional sense, meaning bugs that go away when you attempt to debug them.

Compiler bugs tend to be more deterministic, where some code is being generated incorrectly and it will do the wrong thing every time you execute that code. Changing the optimization settings may cause the bug to go away, but it's still deterministic given the same compiled binary. One exception would be a bug where a compiler is reordering instructions across a memory barrier, but those are even rarer than normal compiler bugs, since compilers are generally paranoid about reordering anything with barrier intrinsics or inline assembly.


My version of this would be "When you see a Heisenbug in C, run your program through valgrind".

Which is a good idea even if you don't observe any bugs.
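A hedged sketch of what that looks like in practice (the file and program names are placeholders): write a small buggy program, compile it with debug info, and run it under Valgrind's default Memcheck tool, which reports uninitialized reads, heap overruns, and leaks even when the run appears to "succeed":

```shell
# Create a tiny program with a heap overrun / uninitialized read.
cat > heisen.c <<'EOF'
#include <stdlib.h>
int main(void) {
    int *p = malloc(3 * sizeof(int));
    int x = p[3];              /* one past the end, never written */
    free(p);
    return x == 42;            /* decision depends on garbage */
}
EOF

cc -g -O1 heisen.c -o heisen

# --track-origins says where uninitialized values came from;
# --error-exitcode makes the run fail if Memcheck found anything.
valgrind --error-exitcode=1 --track-origins=yes ./heisen
```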


Amen to that!


This is what helps me find bugs, and debug release code (MSVC-specific) - put this around a function to temporarily disable optimizations:

  #pragma optimize("", off)

  void the_function()
  {
  }

  #pragma optimize("", on)
Usually some kind of bisection, or a lucky guess, is needed to figure out whether it's really a compiler bug.


And when you see a heisenbug in MSVC++, just rebuild your project.

I don't know about newer versions, but VS 2010 and prior have a nasty trait of spitting out wrong binary code now and then, even with all optimizations off and edit-and-continue and incremental linking disabled. You would just make a little change, build, launch and get a crash in the middle of nowhere with the most bizarre stack trace. Then rebuild it and it will magically go back to normal. For a project of 100k lines this is a weekly, sometimes daily, routine. Very trust-inspiring.


probably just stale object files, hardly unique to MSVC


Actually, -g should not change the output of the compiler in any other way than the presence of debug symbols. I can confirm that right here, where I'm building code for an ARM microcontroller: the resulting .bin files produced by objcopy are identical (but the .elf are not).

Granted, when the debug symbols end up in the final ELF binary which is executed by an OS, it can still affect execution in subtle ways such as load times.


It's probably worth pointing out that the person with the bug in the story, Jay Maynard, is also Tron Guy.


I once had the opportunity to investigate a heisenbug. We could see that the bug disappeared after turning the optimization off, but our clients were not satisfied and wanted a more detailed explanation and an exact root-cause analysis.

We were using a proprietary C compiler tool chain provided by our vendor and did not get much help from them either.

Finally, we had to sit down, get the assembly from a disassembler, and go through the whole 800 lines of it. And we found the bug, sitting quietly in one of the pipelines.

I am not a compiler guy, but that day I understood the beauty of compiler optimization.


I once had a bug in my program where it would work correctly only on -O3 (not quite correctly, but good enough for me to not notice). After a lot of head scratching I ran clang's static analyzer [0] and it found the cause to be an uninitialized variable. On -O3 the compiler was reusing another variable, which happened to be very close to what I needed.

[0] http://i.imgur.com/lqs4FNn.png


A better approach: when you see a Heisenbug in C, run your program through Valgrind's memory and threading checking tools.


I have had one large piece of PL1/G code - a whole mess of mappers for a map-reduce-based billing system - where rebuilding the entire code base made one bug go away.


by go away, you of course mean hidden to be found and solved later? bugs do not typically just fix themselves.


Can't say anything about that project, but there could be a build dependency error where a change didn't get propagated all the way through. The "bug" isn't so much a coding bug but a synchronization bug that can be fixed by doing a full rebuild.

For example, on one system I used there was a ~2 second difference between my desktop machine, the build machine, and the NFS host for directory I was working in. This would sometimes lead to:

    make: warning: Clock skew detected. Your build may be incomplete.
If I did a save to a .C file just after the .o file was written, then it might still have a timestamp which was older than the .o's, so it was not included in a rebuild.

This required occasional manual deletes of the .o file to make sure that it was building correctly, and we probably did a full rebuild of the entire code every once in a while to reduce these sorts of problems. (This was back in the c-front days, with 15 minute compilation times.)


Not necessarily.

A decent percentage of the time someone is experiencing a "heisenbug" it's because there are some stale object files sitting around, and the build state has gotten inconsistent (makefile bugs, bad tracking of dependencies, or what have you).

You go to build debug, everything works, you shrug, make clean, remake, and then you're golden again.

Obviously that's really not the way to go, though, because you risk mistaking an uncommon crash for "just had to rebuild, I guess" - but if you're comfortable reading assembly or what-have-you, you can usually confirm the "stale object files" thing easily -

and then fix your build ;)


Ironically, no; as dalke comments below, I think it must have been some odd build effect over several hundred modules of code.


It is not:

1. lupus 2. a tumor 3. a compiler bug

I'm not at all surprised to see ESR being enough of a scrub to go to 'compiler bug!' as his first guess.



