
A one word change to the C standard to make undefined behavior sane again - signa11
http://blog.metaobject.com/2018/07/a-one-word-change-to-c-standard-to-make.html
======
MaxBarraclough
The author concedes that he isn't a compiler engineer, but doesn't seem to
think that perhaps compiler engineers have good reason for having their
compilers generate code that behaves bizarrely if there's undefined behavior.

There is, of course: performance.

If you want a compiler that generates code which behaves intuitively even
under UB, that requirement is at odds with compiler optimisation. Compiler
optimisation generally means doing strange and unexpected things to your code,
and the reason C has undefined behaviour in the first place (unlike, say,
Java) is to maximally enable compiler optimisation.

In other words, C values performance over safety, and that's part of the
point.

If you want to catch undefined behaviour at runtime and treat it as an error,
the tooling for doing that is better than ever: Clang's 'UBSan', for instance.

 _Edit_

Aside: I have a personal favourite 'incident' pertaining to exactly this
tension (between performance and intuitive behavior). Torvalds vents his
frustrations on the GCC mailing list, regarding legal-but-counterintuitive
behaviour of code generated for the Alpha architecture:
[https://gcc.gnu.org/ml/gcc/2012-02/msg00038.html](https://gcc.gnu.org/ml/gcc/2012-02/msg00038.html)

~~~
jstimpfle
> C has undefined behaviour in the first place (unlike, say, Java) is to
> maximally enable compiler optimisation.

My understanding is that C has undefined behaviour where there is no
universally sensible implementation across all architectures.

> Compiler optimisation generally means doing strange and unexpected things

In the first place it should make code faster. I'm no one to judge, but the
general reception seems to be that exploiting undefined behaviour (by assuming
it will never happen) rarely brings obvious speed advantages. "Strange and
unexpected things" can be a bad tradeoff for a tiny speed advantage.

For example, I'll boldly state that optimizing out a "if (p == NULL)" only
because p was dereferenced earlier never makes sense. Code containing such a
check is never performance sensitive enough to justify the tradeoff (or it's
really bad code).
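A minimal sketch of that pattern (the struct and field names here are made up, not from any real codebase):

```c
#include <stddef.h>

/* Because the first line dereferences dev, a UB-exploiting compiler may
 * infer that dev != NULL and silently delete the check below -- the same
 * shape as the well-known Linux TUN driver incident. */
struct device { int flags; };

int read_flags(struct device *dev) {
    int flags = dev->flags;   /* UB if dev == NULL ...                   */
    if (dev == NULL)          /* ... so this test may be optimized away, */
        return -1;            /* and this early return never taken       */
    return flags;
}
```

Notably, GCC even has a dedicated flag, -fno-delete-null-pointer-checks, to disable exactly this inference, which suggests removing it as a special case is indeed feasible.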

And without being a compiler engineer I assume that it should be easy to
change compilers to NOT do this optimization (I assume the not-NULL inference
is a special case in the code that can simply be removed).

> In other words, C values performance over safety, and that's part of the
> point.

Many people value C because of _control_ and allowing one to not paint oneself
into a corner. Performance is only a side effect.

~~~
iainmerrick
_I'm no one to judge, but my impression is that the general reception is that
exploiting undefined behaviour (by assuming it will never happen) rarely
brings speed advantages._

I’ll play devil’s advocate here, even though I strongly dislike the current
direction of the major C compilers.

This aggressive interpretation of undefined behavior apparently can enable
some important optimizations. Here are a couple of examples:

First, say you have an index variable where you increment it (i += n) and
check it’s within bounds (i >= start && i < end). It could be useful to assume
that i won’t overflow and wrap, so some of the bounds checks can be optimized
out. (The downside is that if you want to explicitly check for overflow, it’s
tricky to do so correctly.)
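To sketch the "tricky to check correctly" point: the natural after-the-fact test, `if (i + n < i)`, is itself UB for signed int, so the compiler may delete it. A portable check has to run before the addition (the function name here is illustrative):

```c
#include <limits.h>

/* Returns nonzero if i + n would overflow int. The addition itself is
 * never performed, so no undefined behavior is ever executed. */
int add_would_overflow(int i, int n) {
    if (n > 0)
        return i > INT_MAX - n;
    return i < INT_MIN - n;
}
```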

Second, say you have a function that takes several pointers and copies data
between them. It can be useful to assume those pointers aren’t aliased, so we
can assume that memory isn’t changing in unexpected ways, and therefore it’s
safe to optimize out some repeated reads.

I do feel like those cases could be handled in different ways, though. In the
second case, perhaps the compiler could insert a runtime check for aliasing,
and branch to either a fast path or a slow path as appropriate.
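That fast-path/slow-path idea might look something like this sketch (names are illustrative; the overlap test goes through uintptr_t because relational comparison of pointers into different objects is itself not defined by ISO C):

```c
#include <stddef.h>
#include <stdint.h>

static void copy_fast(int *restrict dst, const int *restrict src, size_t n) {
    for (size_t k = 0; k < n; k++)   /* restrict promises no aliasing, */
        dst[k] = src[k];             /* so this loop vectorizes freely */
}

static void copy_slow(int *dst, const int *src, size_t n) {
    for (size_t k = 0; k < n; k++)   /* conservative: src must be */
        dst[k] = src[k];             /* re-read every iteration   */
}

void copy_dispatch(int *dst, const int *src, size_t n) {
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    if (d + n * sizeof(int) <= s || s + n * sizeof(int) <= d)
        copy_fast(dst, src, n);      /* ranges are disjoint */
    else
        copy_slow(dst, src, n);      /* ranges may overlap  */
}
```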

I’m also not sure that C really needs to try to be the fastest and most highly
optimized language, even though it’s widely regarded as being very fast. As
far as I know C isn’t the most popular language for scientific supercomputing,
for example, where the users really do care about performance.

~~~
MaxBarraclough
> not sure that C really needs to try to be the fastest and most highly
> optimized language

Sure it does. That's the whole point: performance and low-level access over
safety and convenient features. These are the defining attributes of the C
language (along with its minimalism).

If you want a safe and convenient language, why are you using C?

If you don't think C should be as fast as possible, which language _should_?
How would that language differ from C?

It's not C's fault that people use it when they shouldn't.

> In the second case, perhaps the compiler could insert a runtime check for
> aliasing, and branch to either a fast path or a slow path as appropriate.

I don't get you. Are you referring to the dangers of the 'restrict' keyword?
Otherwise, it's quite legal to pass aliasing pointers, and the compiler is
free to introduce a check for aliasing.

If you're thinking of the strict aliasing rule, I think you're misunderstanding
it. It has to do with misusing types. To quote Wikipedia:

> pointer arguments in a function are assumed not to alias if they point to
> fundamentally different types, except for char* and void*, which may alias
> to any other type
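A sketch of the sanctioned workaround when you do need to reinterpret bytes across "fundamentally different types" (this assumes 32-bit IEEE-754 float, which virtually all current platforms provide):

```c
#include <stdint.h>
#include <string.h>

/* memcpy (or byte-wise access through unsigned char *) is the defined
 * way to type-pun; *(float *)&u would violate strict aliasing. */
float bits_to_float(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);   /* compilers compile this to a register move */
    return f;
}
```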

~~~
iainmerrick
_If you want a safe and convenient language, why are you using C?_

Plenty of reasons, besides raw speed! If I just want speed and don’t care
about safety at all, I’d be using assembly, right?

C has (somewhat) strong typing which catches some errors at compile time. It
has higher-level constructs than assembly, like function calls and loops. It’s
exceptionally portable (although you have to be wary of undefined behavior and
platform-specific weirdness).

Speed is definitely an important part of the story but it’s by no means the
only part.

_If you don't think C should be as fast as possible, which language should?
How would that language differ from C?_

Something like FORTRAN maybe? No pointers, as I understand it, hence no
aliasing problems.

Unrestricted pointers aren’t necessarily great for speed, but they’re great
for flexibility, which makes C a great systems language. FORTRAN might do
matrix operations faster, but C is better for wrangling a fiddly file system
data structure. Again, it’s not all about speed.

 _it’s quite legal to pass aliasing pointers, and the compiler is free to
introduce a check for aliasing_

Right, what I’m getting at is, the compiler is also free _not_ to add a check,
and most compilers do not in fact add a check, because they care more about
speed.

In fact most C compilers seem to go further; aliased pointers of different
types can lead to undefined behavior, so the compiler assumes that this
behavior _cannot happen_ , and therefore that the pointers _cannot be
aliased_. That seems to me a logical leap too far.

~~~
MaxBarraclough
> If I just want speed and don’t care about safety at all, I’d be using
> assembly, right?

Not really. C is a reasonable choice of language for high-performance code.
There may be instances where your hand-written assembly can outperform
compiled C, but it's far from guaranteed, and of course even if you did have
such a guarantee, it might not be worth the productivity impact.

Even performance-critical codebases like the Unreal Engine don't use much
assembly, as far as I know.

> Speed is definitely an important part of the story but it’s by no means the
> only part.

Agree.

> Something like FORTRAN maybe? No pointers, as I understand it, hence no
> aliasing problems.

Indeed, but my understanding is that the performance wins there aren't exactly
night-and-day. C is pretty close to as fast as it gets.

> Right, what I’m getting at is, the compiler is also free not to add a check,
> and most compilers do not in fact add a check, because they care more about
> speed.

I was unclear - I meant the compiler is permitted to add a check and then use
a special optimised path in the case that the pointers do not alias.

It’s quite legal to pass aliasing pointers. When compiling a function which
has two pointer-to-int parameters, the compiler _is not_ permitted to assume
that they do not alias, unless you use the 'restrict' keyword (which is rarely
seen as it's so risky).
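A small sketch of the difference (function names are arbitrary):

```c
/* Without restrict, b might equal a, so the compiler must reload *a
 * before returning; with restrict, the caller promises no aliasing
 * and the return value folds to the constant 1. */
int store_then_read(int *a, int *b) {
    *a = 1;
    *b = 2;
    return *a;   /* 2 if a == b, 1 otherwise -- must be reloaded */
}

int store_then_read_restrict(int *restrict a, int *restrict b) {
    *a = 1;
    *b = 2;
    return *a;   /* may be compiled as "return 1" */
}
```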

> aliased pointers of different types can lead to undefined behavior

Correct (with the special exceptions of void* and char*). Because you
shouldn't be doing that.

------
oddity
I propose that it wasn't a wording change that sparked the current “craziness”
of optimizing C compilers, but rather the combination of two things:

1) Faster CPUs and more memory allowing compilers to step beyond mostly just
peephole-optimization to do optimizations that were once prohibitively
expensive.

2) The general convergence of mainstream processors on more or less identical
characteristics (16/32/64 bit registers, MMUs giving the appearance of a flat
memory space, code exists in memory, 2s complement, etc...) causing
programmers to forget how truly diverse computers can be.

In effect, as a collective, we’ve forgotten _why_ these parts of C were never
defined and now that compilers are smart enough to optimize code in magical
ways but not smart enough to perfectly verify that the programmer didn’t make
a mistake, we have this situation of terrible UX and programming pitfalls.

I’m conflicted, and it feels wrong to say this, but maybe it is time to accept
that all the world’s more or less an x86 and spin off a more well defined C
that’s perf-optimal under some compilation settings for x86-likes. Everyone
else not in the sanctioned-land would write code in their own dialect of C
with well defined characteristics or limitations for their now-obscure
architecture. There are a couple other issues that would need to be resolved,
but in some sense, this is already the world we live in.

Except, what about this new web architecture? Do we consider the problem of
compiling C to it impossible until we’ve extended it to the point that the
semantics are similar to x86?

~~~
iainmerrick
Arguably also 3) the desire to look good on benchmarks, and to take the
performance crown from FORTRAN.

~~~
pjmlp
It surely takes the security exploit crown.

------
CJefferson
The biggest problem is, the most classic C bug is writing an array out of
bounds. It would be almost impossible to efficiently add bounds checking to
all C arrays, so let's assume we are going to leave that as undefined
behaviour.

That out of bounds write could go anywhere. Into any other variable. Changing
the code of the program (if it isn't write protected). So the most common C
bug can lead to absolutely any change (hypothetically) to the running program.
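A defined-behavior simulation of that point (a real out-of-bounds write is UB, so even this much predictability isn't guaranteed; here both "variables" deliberately live inside one buffer so the stray write stays legal):

```c
#include <string.h>

/* Two pretend variables, a[0..3] and b[0..3], adjacent in memory.
 * Writing "one past the end of a" lands in b -- the classic way an
 * out-of-bounds write silently corrupts an unrelated variable. */
unsigned char overwrite_demo(void) {
    unsigned char mem[8];
    memset(mem, 0, sizeof mem);
    unsigned char *a = mem;       /* pretend: unsigned char a[4]; */
    unsigned char *b = mem + 4;   /* pretend: unsigned char b[4]; */
    a[4] = 42;                    /* out of bounds for "a", in bounds for mem */
    return b[0];                  /* observes the corruption: 42 */
}
```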

That's one of the reasons writing text to "explain" what undefined behaviour
can do is so hard.

~~~
mpweiher
What you describe is pretty much exactly the range of "undefined" behavior one
would expect, and also in the range of what used to be "permissible":

you just write to that address

What happens then is hard to predict, and therefore in a sense the behavior
after that becomes "undefined". This is fine. What's not fine is the compiler
taking some action other than

1. just writing to that address (ignoring), or

2. emitting a compiler error and terminating, or

3. doing something else that is "characteristic" of the environment, and
_documenting it_

Current optimizing compilers don't do any of these things, and also not
something that in any way resembles any of these options.

------
praptak
This could prevent some useful optimizations. The following can be a single
rotate instruction but only when relying on undefined behavior of shifts
(otherwise the code has to produce zero for i outside the sensible range)

    
    
        (n << i) | (n >> (sizeof(n) * CHAR_BIT - i))
    

So, does it fall under "demons out of your nose"?

~~~
pg314
Of course that code is broken (relies on undefined behavior) when i is zero,
which is a totally reasonable input for a rotate function. A nice illustration
of the pitfalls of undefined behavior.

You can find a correct implementation in a blog post by John Regehr [1].

[1]
[https://blog.regehr.org/archives/1063](https://blog.regehr.org/archives/1063)
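For reference, the usual UB-free formulation (sketched from memory of that discussion, assuming a 32-bit operand): masking the shift count keeps both shift amounts in range, including i == 0, and mainstream compilers still recognize the pattern as a single rotate instruction.

```c
#include <stdint.h>

uint32_t rotl32(uint32_t n, unsigned i) {
    /* i & 31 and -i & 31 are always in [0, 31], so neither shift is UB;
     * for i == 0 both shifts are by zero and the result is just n. */
    return (n << (i & 31)) | (n >> (-i & 31));
}
```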

------
kazinator
This article is a pretty ignorant diatribe.

Undefined behavior simply means that the standard doesn't have requirements
for what is being expressed in the program.

The possible range of actual behaviors doesn't define what UB is; that wording
just clarifies that UB doesn't necessarily mean that there is a software
defect: it is possible to behave in a documented manner characteristic of an
implementation. This refers to documented extensions.

The way to fix C isn't to monkey with the definition of UB, but to target
specific cases where there aren't requirements and, like, put requirements
there.

~~~
mpweiher
> The possible range of actual behaviors doesn't define what UB

Hmm...you might want to actually _read_ an article before declaring it an
"ignorant diatribe".

The current standards do say "possible". The _first_ ANSI/ISO standard said
"permissible". It also wasn't a "NOTE" back then.

If the standard currently said what I would like it to say, why on earth would
I advocate changing it?!?

And once again, the wording I want is the way it was in ANSI C '89. So I am
advocating changing the wording back to what it was...and lo-and-behold, C
compilers were more sensible back then.

So please inform yourself.

~~~
kazinator
A lot of things were better in the ANSI C/C90 era; but the definition of
undefined behavior is essentially the same.

"Permissible" is a poor choice of word; "possible" is a legitimate
clarification.

The reason "permissible" is poor is that it wrongly hints at the possibility
that some behaviors in response to UB may be "impermissible". Which, in turn,
wrongly suggests that UB is something other than the absence of requirements.

Compilers were more sensible, but not because of squabbling over some minor
wording over the definition of undefined behavior.

It was never the case that undefined behavior was intended to be constrained
in any way.

------
jmmcd
"Possible" v "Permissible" is not the problem. What seems to be needed (based
on reading the post but no other knowledge) is the phrase "including but not
limited to".

~~~
KayEss
I don't think that is what the article is arguing for. Theirs would be more
akin to "absolutely limited to", as they don't want anything outside of the
examples. Unfortunately for them the examples are still open enough to allow
for what compilers do (things like remove code that the author intended be
left in).

Personally I'm OK with the more aggressive optimisations, but they are still
very controversial -- I think more so in the C than the C++ community (I'm in
the C++ one).

~~~
mpweiher
> Unfortunately for them the examples are still open enough to allow for what
> compilers do (things like remove code that the author intended be left in).

Though one could argue that, I don't really think this is the case.

Certainly "ignoring" it isn't the case when you record the fact that there is
UB and then optimize based on it.

They also aren't issuing a diagnostic and terminating, either at compile or
runtime.

So the remaining option is "behave in a manner characteristic of the
environment". Which is fairly loose/rubbery, and I'd be happy to tighten it
further. I still don't think it allows the current behavior even as written
though, because I don't see how that is "characteristic of the environment".
But the stinger is that the behavior has to be _documented_ , which very
little of the UB exploitation currently is.

------
zhivago
The argument hinges instead on managing to drop the word 'unpredictable' from
what he derives from the standard.

If he kept that word intact his argument would collapse.

~~~
mpweiher
Not at all. UB is unpredictable because you don't know what will happen just
from the source text, for example when you write beyond array bounds.

However, it can be perfectly predictable what code will be emitted: a store.

------
pif
Oh no! One more post from someone who can't code as well as he thinks and
looks for a scapegoat! Please, just stop!

