
What Every C Programmer Should Know About Undefined Behavior #1/3 - zeugma
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
======
aidenn0
I still remember one of my favorite linux bugs was due to the NULL behavior.

It was roughly this

read-from-p;

if(p != NULL) write-to-p;

since read-from-p is undefined if p is null, gcc could (and did) legally
optimize out the NULL check, so you could end up writing to NULL.

[edit] I noticed that this case of bug is actually mentioned in the regehr
article that is linked.

~~~
ramidarigaz
If the compiler can optimize that away, what's the proper method for checking
p?

~~~
panic
Check p _before_ using it. The issue is when you have code like the following,
where the behavior of the first statement is only defined when the condition
in the second is false:

    
    
        int x = *p;
        if (p == NULL) {
          // p can't be NULL without having triggered
          // undefined behavior in the first line, so
          // this code is removed by the compiler.
          return;
        }
        // ...
    

The fix is to stop dereferencing p before you know whether it's NULL or not:

    
    
        if (p == NULL) {
          // Moving the check before the dereference
          // avoids undefined behavior and resolves
          // the issue.
          return;
        }
        int x = *p;
        // ...

------
jergosh
"You don't know. The computer may levitate." is what they used to say in my C
course at the uni.

~~~
1amzave
I've heard "nasal demons" in some C circles (as in, "the compiler may cause
demons to fly out your nose").

------
caf
_It is worth noting that unsigned overflow is guaranteed to be defined as 2's
complement (wrapping) overflow, at least by Clang and other commonly available
C compilers._

This is actually guaranteed by the standard (although the language used is not
"2s complement", but "repeatedly adding or subtracting one more than the
maximum value that can be stored in the type until the value is in range").

 _C requires that these sorts of type conversions happen through unions: using
pointer casts is not correct and undefined behavior results._

Actually, loading a member of a union other than the one most recently stored
to is undefined behaviour too - it's just that using a union in this way falls
into the _"...both Clang and GCC nail down a few behaviors that the C standard
leaves undefined."_ category. (And most other compilers besides - there is
much historical usage of this).

------
brimpa
It would be nice to have a code snippet for each of the examples. I'm a fairly
experienced C++ developer but my knowledge of compilers is, admittedly,
lacking and I just want to make sure I'm on the same page.

Otherwise, a great read.

PS– Is it just me or is LLVM coming up more and more these days?

~~~
pixdamix
There's this patch from the linux-kernel:

\- <http://lkml.org/lkml/2009/7/17/187>

\-
[http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6...](http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a3ca86aea507904148870946d599e07a340b39bf)

I've put a gist file demonstrating the problem:
<https://gist.github.com/968723>

    
    
      #include <stdio.h>
      
      struct agnx_priv {
          char demo;
      };
      
      struct ieee80211_hw {
          struct agnx_priv *priv;
      };
      
      struct pci_dev {
          struct ieee80211_hw* dev;
      };
      
      struct ieee80211_hw* pci_get_drvdata(struct pci_dev *pdev) {
      
          if(!pdev)
              return NULL;
      
          return pdev->dev;
      }
      
      static void agnx_pci_remove (struct pci_dev *pdev)
      {
          struct ieee80211_hw *dev = pci_get_drvdata(pdev);
          struct agnx_priv *priv = dev->priv;
      
          if (!dev) return;
      
          // This will segfault if the previous if is optimized
          printf("%c\n", priv->demo);
      }
      
      int main(int argc, char **argv)
      {
          // (struct pci_dev *)argv[1] is just to avoid compiler optimisations
          // =~ NULL if no args are passed.
          struct pci_dev * pdev = (struct pci_dev *)argv[1];
      
          agnx_pci_remove(pdev);
          return 0;
      }
    

Compile with:

    
    
      gcc -O2 -ggdb -Wall -Wundef -fno-strict-aliasing \
        -fno-common -fdelete-null-pointer-checks test-deref.c \
        -o test-deref
    

and

    
    
      gcc -O2 -ggdb -Wall -Wundef -fno-strict-aliasing \
        -fno-common -fno-delete-null-pointer-checks test-deref.c  \
        -o test-deref

------
olov
Great article, shame it got optimized away before I had the chance to re-read
it.

edit: seems Blogger turned their -O knob to eleven
<http://status.blogger.com/>

------
KonradKlause
Why does LLVM generate "ud2" instructions? WTF?

~~~
pascal_cuoq
Because it can. This is undefined behavior we are talking about.

~~~
KonradKlause
Sorry, I don't get it. The Intel manual says on UD2:

"Raises an invalid opcode exception in all operating modes."

What is here undefined? LLVM must not generate such instructions except it it
really wants such a exception. (Like Linux's panic() does on x86)

~~~
schrototo
The _original piece of C code_ has undefined behaviour, meaning LLVM can
generate anything it wants. It happens to generate ud2 instructions (because
it's better to crash hard and fast) but it could just as well print "puppies
puppies puppies" a million times.

~~~
repsilat
> it's better to crash hard and fast

I'm not so sure about that. If you need to squeeze out a little more
performance, code that is "technically undefined" can be more portable than
dropping to ASM.

I think LLVM should emit a warning on code with undefined semantics and
generate DWIM instructions instead of UD2s.

~~~
ToastOpt
It really depends on your motives. If you will be needing to port to new
platforms in the future, it's better to have a hard-and-fast crash now, so you
can learn and avoid the undefined behavior now, rather than face a large bug
backlog years down the line.

But you're correct that sometimes it can be expedient to exploit such
technically undefined behavior. (I've committed this sin myself, most commonly
in serializers/deserializers)

