
What Every C Programmer Should Know About Undefined Behavior (2011) - BuuQu9hu
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html
======
mpweiher
"You're holding it wrong", compiler-writer's version.

IIRC, the Linux kernel has a rule that as a kernel developer, you are not
allowed to break userspace, _even if you think userspace is wrong_. This is an
apologia for why compiler writers think they should not have to be held to
similar standards.

The argument is always something along the lines that UB in the C standard
means they get to do whatever they want, including such nonsense as
eliminating entire loops because the last access is out of bounds or going
down an if-else-branch other than what the if says.

That this argumentation is specious in the extreme is shown by the fact that C
compilers worked fine (and with less code breakage) _before there even was a C
standard_, so _all_ behavior was "UB".

The unstated assumption is that the optimizations enabled by this are
absolutely necessary. That overstates the importance of compiler optimizations
in general, and certainly the importance of these particular optimizations
(see Proebsting's Law [2]).

There may be code-bases where these extra optimizations are beneficial or even
essential, fine, let's add a -Ogoforit or -Oletsgocrazy flag.

Rebuttal here: _What every compiler writer should know about undefined
behavior or "Optimization" based on undefined behavior hurts performance_[1]

Short summary is that this trend towards exploiting UB breaks a _lot_ of code
that has been working for decades, and the fixes required to bring it in
line with the new regime imposed in order to enable these new optimizations
often make it slower than the old code without those optimizations. Yay.

[1]
[http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_201...](http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29.pdf)

[2] [http://proebsting.cs.arizona.edu](http://proebsting.cs.arizona.edu) 'I
may be best known for "Proebsting's Law", which asserts that compiler
optimizations have yielded annual performance gains an order of magnitude
worse than hardware performance gains. The law probably would have gone
unnoticed had it not been for the protests by those receiving funds to do
compiler optimization research.'

~~~
haberman
Others have gone down this path and realized it's not as easy as it sounds:
[http://blog.regehr.org/archives/1287](http://blog.regehr.org/archives/1287)

~~~
mpweiher
Yes, but it depends a lot on what you consider "down this path". John's
approach seemed to be to define the UB, and to me it looks like he found out
why it was left as UB in the spec.

Not exploiting the UB for optimizations in ways that break lots of code is
actually trivially easy. We have been doing it for 30+ years; we just have
to stop adding new optimizations that break code, and possibly revert some of
the ones that are already in there.

~~~
gpderetta
> Not exploiting the UB for optimizations in ways that break lots of code is
> actually trivially easy

Here lies madness. What optimizations are not allowed exactly? Strict
aliasing? Assuming dereferenced pointers are not null? Assuming all array
accesses are always in bound? Assuming that integers don't wrap? Assuming that
allocated memory is not accessed after a free? Reusing function frame slots of
variables with non overlapping lifetimes? Not laying down local variables in
the most obvious way? Caching and reordering non-volatile memory accesses?
Accessing local variables of functions that have returned?

I'm sure you will say "obviously not allowed" to the first few and "obviously
allowed" to the last few, but all of these optimizations (and more) did break
some code at some point.

~~~
mpweiher
> Here lies madness.

Nah. You're looking at it the wrong way. You are saying madness lies in going
there. We don't have to "go there". We can simply not _leave_ there. Not doing
something is easy. In fact, it's my secret super-power. If you want, I'd be
happy to teach you.

~~~
haberman
Compilers already have these optimizations. Doing nothing would be leaving
those optimizations in.

If what you want is a straightforward translation of C into machine code,
there is always -O0.

~~~
mpweiher
> Compilers already have these optimizations.

git co <previous-hash>

You're welcome!

~~~
haberman
You're the person who wants this. Maybe you should do that? Also make sure
you're willing to give up support for newer language standards, CPU
architectures, and countless bugfixes (even C++98 support was getting pretty
noticeable bugfixes in GCC 10 years ago).

------
Unklejoe
Two things:

1\. What's the proper way to do a NULL check? In the first example, they show
how a NULL check could be optimized away, but I didn't see where they
explained how to do it the right way. I might have just missed it though.

2\. It would be nice if compilers offered a special optimization flag that
performs most optimizations, but also “fills in the gaps” for non standard
behavior by behaving how people expect. I realize that this would essentially
be like each compiler implementing its own version of the standard in a way,
but the reality is that if 99% of the people are making an assumption about
how the compiler is going to do something, maybe that’s how the compiler
should do it? Perhaps like a “speed+security” optimization flag. GCC already
has some flags to handle certain questionable things such as strict aliasing,
etc… Maybe something already exists, IDK. I guess the real solution would be
to change the actual standard.

EDIT: A comment elsewhere provides a good response to my second point. It was
along the lines of "which optimizations are okay then? Most optimizations
broke code at some point".

I guess my response would be to only disable the ones that are the most common
trouble makers. Another factor could be the performance gain vs amount of bugs
it causes. I am reconsidering my second point though...

~~~
lmm
> 1\. What's the proper way to do a NULL check? In the first example, they
> show how a NULL check could be optimized away, but I didn't see where they
> explained how to do it the right way. I might have just missed it though.

You need to not do *p at all if p is NULL (i.e. move the "int dead = *p"
after the early return).

------
prodigal_erik
Are these all _undefined_ rather than merely unspecified or implementation-
defined? I'm a little suspicious that they don't explain the difference, but
I'm having trouble finding a definitive list of the places where the standard
allows nasal demons.

~~~
lomnakkus
They're undefined. (From a quick skim.)

 _However_ , an implementation _could_ define various UBs however it wants.
The difference between "undefined" and "implementation-defined" is really that
an implementation doesn't have to do anything even remotely sensible for
"undefined" behavior, but it has to define _some ("sensible") behavior_ for
"implementation-defined" behavior. Usually, "implementation-defined" behavior
also tends to remain consistent across versions, whereas "undefined" behavior
usually has no such constraints. For example, for "undefined" behavior,
version 5.1.2.3 of the compiler may differ drastically from 5.1.2.4 in terms
of optimizations which rely on UB.

~~~
Sharlin
Basically:

 _Undefined_ \- the compiler may assume this can never happen, and if it does
happen, it can "poison" the whole execution including whatever logically
happens _before_ the undefined part.

 _Unspecified_ \- the behavior must be deterministic and cannot affect the
semantics of other well-defined parts of the program, but the implementation
need not guarantee any single behavior.

 _Implementation-defined_ \- like above, but the implementation must document
the exact behavior.

~~~
to3m
The compiler may also assume undefined behaviour can happen, and do something
documented that programmers would probably expect.

Even though the standard imposes no specific requirements, _it explicitly
calls that out as something you could do_ :
[http://port70.net/~nsz/c/c11/n1570.html#3.4.3](http://port70.net/~nsz/c/c11/n1570.html#3.4.3)

But for some reason nobody seems to ever think this is an option.

~~~
lmm
It wouldn't improve benchmark performance, which is what C compiler writers
care about. And at this point in time any programmer who wants that kind of
behaviour has probably moved on from C.

~~~
lomnakkus
Well, that and defining the behavior would sometimes require _absurd_ amounts
of runtime checking (e.g. on signed bit shifts, etc.).

------
new299
Nice summary.

If stability/security is important in your codebase, it's of course worth
avoiding undefined behavior, and compiling with warnings as errors (-Werror in
gcc).

However it's also really worth running your test suite through Valgrind [1]
too sometimes. This instruments your code to check for use of uninitialized
variables and other non-deterministic behavior at runtime.

I used to run this under continuous integration, and flag builds with issues.

[1] [http://valgrind.org/](http://valgrind.org/)

~~~
DSMan195276
I agree that all warnings should generally be turned on and snuffed out
(though there are a few innocuous warnings that I turn off sometimes), however
I wouldn't recommend using -Werror to force that unless it's only enabled for
dev builds. Warnings are generally impossible to fully predict: different
compilers give different warnings, the set enabled by default changes between
versions, and individual warnings differ between compilers. And of course,
there's no wrong way for the compiler to do it, because what to warn about is
subjective in the first place.

That can become really annoying for an end user looking to just compile and
run your software, not debug why it fails to build because their setup outputs
a warning somewhere that may not even be an error. And turning off -Werror
tends to then require delving into Makefiles (or whatever other build system
they chose to use), and at that point most people are probably just not going
to bother and call it broken.

Again, I'm _not_ saying you shouldn't attempt to snuff out and fix every
warning that you can, but you shouldn't have your default build rely on
compiling with no warnings. There are just too many variables unaccounted for
to expect your software to compile without warnings in every possible
configuration.

I also agree 100% with running test suites through Valgrind - I do that as
well and it's a very good way to guard against memory leaks being introduced.
(Of course, good structure and design techniques are always the best defense,
but there's nothing wrong with a good check that things are behaving well.)

~~~
ryandrake
Another reason to avoid -Werror is if you pull in third party code that you
share your build options with. Even large stable open source projects
regularly spew hundreds of warnings when you aggressively turn them on.

I'll generally build my own code with all warnings, but use as close as I can
to the package maintainer's build settings for their code. Even then, you can
get warnings because the third party developer doesn't use your exact
compiler.

------
throwaw190ay
Let's tell it like it is: the only reason C was more popular than anything
else in the 70s/80s is that, unlike its competition, implementing a C
compiler didn't require paying a big license fee to a vendor. C is a horrible
language and now most developers are stuck with it, because the Linux kernel,
like most systems, is built with C. Rust might save us from C ultimately, but
that's clearly not for tomorrow.

~~~
zzzcpan
Let's try this again. As a language, C is not that horrible, and it doesn't
have to be as sloppy and as insecure as it is in popular compilers. In fact, I
believe 99% of C code is not performance critical and can benefit from
simple runtime checks to ensure memory safety. Not that much is needed to
"save us from C". But ultimately no one cares; inventing and playing with new
languages is way more fun.

------
charles-salvia
Not to mention that arithmetic operations (including division) that overflow
signed integers result in undefined behavior.

~~~
mnarayan01
That's mentioned in Part 1: [http://blog.llvm.org/2011/05/what-every-c-
programmer-should-...](http://blog.llvm.org/2011/05/what-every-c-programmer-
should-know.html).

------
oldgun
Really nice summary of undefined behavior. Just realized this is posted by
Chris Lattner himself.

------
Kenji
//int dead = *P; // deleted by the optimizer.

If a compiler does that, it is broken. Dereferencing P clearly has side
effects (i.e. segfault if P does not point into valid memory). A compiler must
prove that its optimisation does not affect the program execution. This is
clearly a change that does. It cannot optimize away operations with side
effects. And if this is a valid optimisation according to the standard, then
the standard is broken.

EDIT: If segfaulting is not a part of C, but a part of the hardware
architecture, then the compiler shouldn't assume that a dereferenced pointer
is not NULL. Either way, the compiler should behave in a way to avoid the
least expected outcomes.

~~~
Unklejoe
Does this actually count as a side effect?

I didn't think that dereferencing a variable could ever be the sole cause of a
side effect. I thought that side effects were only relevant in the context of
a store (except for maybe volatile variables in which there are some
ordering/barrier constraints applied as well). I realize that in practice,
dereferencing NULL can definitely cause a crash, but I don't know if it counts
as a "side effect" in the typical "C" sense of the phrase.

I know that modifying the value of a variable which is accessed by means of a
reference passed as an argument could definitely have side effects (since
another unrelated function might read that data) and thus a store can’t be
completely optimized away (but still can be reordered), but I don’t see why
“dead” can’t be optimized away in this case.

In this case “dead” is a local variable, thus it can never be accessed by
anyone outside of the function. Therefore, if there are no reads of the
variable after that single store, I don’t see why the store can’t be
completely eliminated.

In fact, this is a common issue when dealing with microcontrollers where
configuration registers are sometimes written by assigning a value to a
dereferenced address constant. That's why they get cast as volatile.

~~~
jschwartzi
If your dereference compiles into a read from a register, then in the case
that the register is clear-on-read like some interrupt status registers I've
seen, the dereference will have unintended side effects.

~~~
Unklejoe
Agreed. However, I believe that in C, standard variable types are assumed to
not have such side effects on a read, which is why such variables must be
explicitly declared as "volatile".

