
GCC null pointer check removed - Tomte
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90949
======
ChuckMcM
Wow, I am in awe of Jeff Law at the moment :-). I am glad they found the issue
and are fixing it.

I was helping a high school student learn how compilers work and we started by
opening up GCC and I was astonished at just how mindbogglingly complex it had
become. We closed that down and opened up the delightful Bellard "Tiny C"[1]
which was comprehensible at the level where this student was working. We
created a new 'bool' type for that compiler.

[1] [https://bellard.org/tcc/](https://bellard.org/tcc/)

~~~
andrepd
>we started by opening up GCC and I was astonished at just how mindbogglingly
complex it had become

It's the price of being good ;)

~~~
monocasa
Well, and just being ancient. LLVM is good too, but a lot cleaner. For
instance (from GCC's docs):

> Reload is the GCC equivalent of Satan. See [gccsource:reload.c],
> [gccsource:reload1.c], and [gccsource:reload.h] if you have a brave soul.

[https://gcc.gnu.org/wiki/reload](https://gcc.gnu.org/wiki/reload)

~~~
andrepd
AFAIK, gcc still runs circles around clang.

~~~
musicale
g++ drove me insane with its baroque error messages, but clang++ did not.
However, I hear that g++ has improved.

~~~
LukeShu
I avoid C++ and stay in straight C as much as I can. GCC's C error messages
have definitely improved in response to clang (especially around deep
preprocessor stuff). I had assumed that the C++ messages had likewise
improved, and thought that I had heard that as well.

Then I actually got to test that while working on Envoy proxy. I was
disappointed to get utterly incomprehensible error messages. I'm
"GCC-by-default", so a coworker suggested I try switching to clang++... and
sure enough, clang++ gave me decent error messages.

------
saagarjha
As the other commenters have pointed out, this is a bug in GCC and not an
aggressive optimization typical of compilers exploiting undefined behavior.
Perhaps the title can reflect this somehow? "Bug in GCC removes null pointer
check", etc.?

~~~
a1369209993
> aggressive ["]optimization[" ...] exploiting undefined behavior

That's still a compiler bug.

But yes, this is a bug in GCC and the title should show that.

~~~
saagarjha
Exploiting undefined behavior is something compilers can and will do; that's
not a bug. (Specifically, a compiler may choose to omit code when it can prove
that executing it would invoke undefined behavior. That's not what happened
here.)

------
asveikau
I think I am already not alone in saying this on this thread, but that was a
_very_ fast investigation and bug fix. I am impressed.

Not at all the expectation I would have for facing a compiler bug. I guess it
also helps that the report seemed clear and high quality. (Even had a revision
where it was introduced.)

~~~
rkangel
"Silently miscompiles code in a very bad way" is a fairly unusual bug,
particularly for a well-known platform (which is also a credit to gcc).

~~~
asveikau
Yes, but humans have natural limits in time and attention. This is fast
turnaround.

------
klodolph
This one is interesting... it looks like sibling call optimization was a bit
too aggressive, or non-nullness is being propagated incorrectly.

For those who understand UB: the problem is not that 'node' is NULL. The
compiler can "prove" that 'node' is non-NULL in 'main', but that result is
then propagated to 'module' in 'walk', which is not correct because 'main' is
not the only call site of 'walk'.

~~~
tedunangst
But the proof that it's non null is the deref in main?

I'm not sure I understand why people say this has nothing to do with the
"already dereferenced, therefore not null" UB optimization. How else is the
compiler inferring that the pointer is not null?

~~~
klodolph
> But the proof that it's non null is the deref in main?

This is an incorrect interpretation; it only proves that 'node' is non-null.
Recall that 'walk' is called twice: once for 'node', and once for
'node->child'. The compiler bug (and this IS a compiler bug, make no mistake)
is that the proof that 'node != NULL' is incorrectly treated as a proof
that 'module != NULL'. That is not sound, and the bug manifests as incorrect
program behavior during the recursive call 'walk(module->child, ...)'.

~~~
tedunangst
Yeah. The bug is that the compiler propagated not null from main.node to
walk.module.

But how did the compiler determine main.node is not null?

~~~
klodolph
> But how did the compiler determine main.node is not null?

The test case in the bug happens to use UB to generate the constraint, but
this is just an example of how to trigger the bug. The bug itself is not
related to UB, and there are other ways to generate the same constraint.

Here is an example on godbolt showing what I mean:
[https://godbolt.org/z/pRW97L](https://godbolt.org/z/pRW97L)

You'll notice that I removed the UB, but the bug is still there. You can see
that the 'return;' statement is not emitted because it is colored with a white
background. So the bug is not related to UB at all.

~~~
saagarjha
> You can see that the 'return;' statement is not emitted

Uh, what? The code looks correct to me:

    
    
      func:
        test rdi, rdi
        je .L7
        mov esi, 1
        jmp walk
      .L7:
        ret
    

There's a conditional branch on $rdi that jumps to .L7 (which returns) when
node is NULL.

~~~
klodolph
You're looking at 'func', the error is in 'walk'. Here is the top of 'walk':

    
    
      walk:
        push rbx
        mov rbx, rdi
        test esi, esi
        je .L3
        mov rdi, QWORD PTR [rdi]
        call walk
    

Note that esi = cleanup = 1, so the branch to .L3 is not taken and the
recursive call to 'walk' is made. The crash will occur on the instruction
before the call,

    
    
        mov rdi, QWORD PTR [rdi]
    

Because rdi = 0 on the second call to 'walk'.

~~~
saagarjha
Oh, yes, walk is still broken. I was focusing on func because you had added
that and it coincidentally also had a white return statement.

------
chowells
Ouch, this is annoying. But it's a bug. Sometimes those happen, and they get
fixed.

This is totally unlike the user hostility that results from playing gotcha
with UB which has also resulted in removing null pointer checks. Those aren't
bugs and don't get fixed.

~~~
a1369209993
Those _are_ bugs and the compiler developers' refusal to fix them is precisely
the user hostility you are complaining about.

~~~
ogoffart
No, they are not. When your program invokes UB, it is a bug in your program,
not in the compiler.

The compiler authors are helping you with things like -fsanitize=undefined,
and they even let you define some behavior that is normally undefined with
flags like -fwrapv, -fno-strict-aliasing, and -fno-delete-null-pointer-checks.

But you are asking the compiler to optimize your program; if you give it an
invalid program to optimize, you get an invalid result. There is no bug there.

------
singron
I couldn't reproduce with 9.1.0 on Arch Linux since it's patched:
[https://git.archlinux.org/svntogit/packages.git/commit/trunk...](https://git.archlinux.org/svntogit/packages.git/commit/trunk/bz90949.patch?h=packages/gcc&id=7c2f84ac385521ffeb9ce7f34cc919031e717392)

However the gcc 9.1.0 in nix/nixos is still affected (but not the default gcc,
which is still gcc7).

EDIT: nix just updated to gcc 9.2.0. I was using a slightly outdated release:
[https://github.com/NixOS/nixpkgs/pull/66836](https://github.com/NixOS/nixpkgs/pull/66836)

~~~
saagarjha
If anyone wants to try this themselves:
[https://godbolt.org/z/j4YJ0c](https://godbolt.org/z/j4YJ0c)

------
mjcohen
Reading that thread, I'm really impressed by the knowledge shown by the people
that worked on it.

~~~
saagarjha
Compiler engineers are a smart bunch :)

~~~
bagol
I wish I were a compiler engineer...

~~~
jacquesm
Stop wishing, go building. It is hard but doable.

If you want an easy start re-implement an interpreted language first, then
write a compiler for it.

------
monocasa
Stuff like this is why I really like seL4's approach to verification. They
take the compiler out of the TCB by parsing and verifying the underlying
binary.

~~~
topspin
I believe NaCl and PNaCl analyze machine instructions (NaCl) or 'bitcode'
(PNaCl) to sandbox binaries.

~~~
monocasa
They do, but it's a little different. (P)NaCl still allows memory-safety and
data-verification bugs that don't compromise the integrity of the sandbox.
seL4 goes a lot further by verifying that the total behavior of the
application matches a machine-readable formal specification. So (P)NaCl
probably would have allowed the bug in question to get into production,
whereas seL4 would have caught it as a build error.

------
tinus_hn
Scary. Does this have a chance of introducing security issues in (mis)compiled
code?

------
slovenlyrobot
Obligatory Dan Bernstein presentation:
[https://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf](https://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf)

(note: portrait slides!)

~~~
MaxBarraclough
> Wikipedia: “An optimizing compiler is a compiler that tries to minimize or
> maximize some attributes of an executable computer program.”

> So the algorithm designer (viewed as a machine) is an optimizing compiler?

> Nonsense. Compiler designers have narrower focus. Example: “A compiler will
> not change an implementation of bubble sort to use mergesort.” — Why not?

(No answer is given.)

An optimising compiler is _permitted_ to replace your bubble-sort with a
merge-sort, provided apparent program behaviour is unchanged.

The only reason real compilers aren't likely to do this is that the
transformation is too sophisticated and too specific for current compiler
technology.

> compiler designers take responsibility only for “machine-specific
> optimization”. Outside this bailiwick they freely blame algorithm designers

This is wrong.

It's wrong in the shallow sense: in a typical modern compiler, most
optimisations are performed at the level of an IR, not at the level of the
particular target machine language. Common subexpression elimination, for
instance.

It's also wrong in a deeper sense: compilers may apply optimisations which
improve time complexity. That is to say, they may transform the algorithm
itself. The target machine has no bearing there.

Consider the following unoptimised loop:

    
    
        int found = 0;
        for (size_t i = 0; i != n; ++i) {
            if (arr[i]) {
              found = 1;
              // break;
            }
        }
    

An optimising compiler may be capable of essentially uncommenting the
commented-out 'break'. This changes the time-complexity of the scan in an
obvious way.

This optimisation could even be applied by a source-to-source optimiser, with
no awareness of the target machine.

(We can nitpick about possible tradeoffs regarding branch-prediction when
adding the 'break', but I think my point is clear.)

------
nullc
I wonder if they'll get around to fixing this miscompilation bug:
[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90348](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90348)

~~~
nullc
It isn't particularly helpful to downvote a comment to -3 without actually
expressing what you found to be incorrect about it.

I linked to an issue where all current GCC versions emit incorrect,
memory-corrupting output from a pretty boring, correct input; it has been
sitting idle for almost 4 months. The issue -- a morally similar one of
applying a constraint to the wrong scope -- was found in production code, in
the wild; it's not some machine-generated test case. Currently the finders are
working around it with -fno-stack-reuse, but it's unclear how much other
software is impacted... Should I be lobbying Linux distros to recompile
everything with -fno-stack-reuse? Hard to say: there isn't an easy way to
instrument the compiler to detect when the bug is being triggered.

A rapid and highly knowledgeable response to miscompilation is what I've
historically experienced from GCC, and that's what's exhibited on the headline
issue, but seemingly not in the case I linked to.

~~~
caf
That's definitely an interesting issue, though I don't think it's too similar
to the one in the original post. It's not about applying a constraint to the
wrong scope: the liveness analysis has correctly deduced that a block-scope
variable in a loop is not live from one iteration of the loop to the next, but
the address of that variable has been incorrectly kept live across loop
iterations.

------
rurban
There is a reason people are still staying with gcc-4.8. The pass in question
(ssa-tree-copy) is just horrible, especially in the latest versions. I had to
blacklist gcc-9 because of similar issues with it. I didn't know it already
started with gcc-7.

------
mjevans
Again: there really needs to be a warning (elevatable to an error, as usual)
for all removed or "optimized out" code. A programmer added it for a reason:
either that reason is invalid and the code can be removed, OR the compiler is
doing something the programmer did not intend and is thus exhibiting undesired
behavior.

~~~
ogoffart
This is not possible.

    
    
       void foo(int x) {
           if (x)  do_something();
           else do_something_else();
       }
    
       void fn1() { foo(0); }
       void fn2() { foo(1); }
    
    

When the code is inlined, the compiler can propagate constants and optimize
the `if` away. Do you really expect two warnings there?

These warnings would fire all the time.

And anyway, this does not apply to this particular bug, which is just a
compiler bug (and has been fixed). The warning was there all along:

    
    
        This is free software; see the source for copying conditions.  There is NO
        warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

