
“Fiercely resist any further broadening of the scope of the C UB problem” - Sindisil
https://lists.debian.org/debian-devel/2016/04/msg00269.html
======
ComputerGuru
I don't know if it's because I convinced myself a long time ago that I'll
never be smart enough to write correct C/C++ code (though that hasn't (yet)
stopped it from being my primary development language) or if it's just because
I'm a secret masochist, but I disagree with this proposal/suggestion.

Regardless of who is right and who is wrong in this matter, I think if
everyone took a step back we could at least agree that it makes absolutely no
sense to fix this on a (single Linux) distribution level. For Debian to
configure/patch compilers on their platform to "narrow" undefined behavior is
insane and ineffectual. Software isn't "validated" on/against a particular OS,
it's validated on a compiler basis.

Breaking this assumption introduces a massive schism. While Debian is an
amazing distro with plenty of clout (I'm a FreeBSD guy, but Debian comes
second), it's terrifying to imagine a new generation of "cross-platform" C/C++
software that can only be verified working on Debian (or with Debian's
fork/re-configured compiler). We've come so close to making truly
cross-platform C++ code a reality (even bringing Windows, I repeat WINDOWS, into the
fold) with C++11 (and the subsequent releases) and it's, in my humble opinion,
utter folly to try and change the way code will fundamentally compile
_depending on the distribution you run_.

If Debian cares, make a proposal to the C++ committee, bribe^H convince
members to see their way (or threaten^H blackmail^H show them the dangers of
continuing down the road they're on). Heck, fork C/C++ and call it E or C+++
or c-safe or something - or more reasonably - write a tool to convert C to
rust or D-without-the-standard-library and announce only tools in that/those
languages will be allowed in the standard distribution. But for Heaven's sake,
please don't try to redefine C.

~~~
linuxlizard
"even bringing Windows, I repeat WINDOWS, into the fold"

I am not a C++ expert but have tinkered with C++11. How will portable C++ work
with Windows' UTF-16 (wchar_t) and other systems using UTF-8 (char)?

Is there a way to have standard C++ portable across Windows' UTF-16 and
!Windows UTF-8 without #ifdef'ing a char wrapper?

edit: UTF-16 not UCS-16

~~~
ComputerGuru
Modern versions of Visual Studio allow you to configure your project as one of
ASCII/UNICODE/MBCS. MBCS (multi-byte character set) is UTF-8 compatible. TCHAR
becomes char (instead of the usual wchar_t) and _most_ APIs you call via their
once-ANSI names (e.g. CreateFile being #ifdef'd to CreateFileA) will actually
accept UTF-8 instead. Unfortunately, APIs that were created after ASCII was
declared dead, and so have no xxxxA counterpart, still need manual mapping to
the xxxxW functions instead (which is just plain stupid on MS's part).

------
jabl
The obvious problem with the "The second is to ask what is most useful"
approach is that if you ask 1000 C programmers, 990 are not clueful enough to
give any sane answer and will pollute your answer database with noise, and of
the remaining 10, 5 are language lawyers and favor approach #1 anyway, and the
other 5 will give 20 different answers (Note: above numbers completely made
up).

So then the compiler developers give up and implement stuff 1) according to
the standard document 2) such that performance on SPEC cpu is maximized.

~~~
_delirium
A related comment:
[http://blog.regehr.org/archives/1287](http://blog.regehr.org/archives/1287)

~~~
jabl
Indeed. The only realistic way out of this conundrum AFAICS is some kind of
consensus around "friendly C"/"boring C" or such that compilers then can
implement, whether it's an annex to the official C standard or some other kind
of spec.

That, or then everybody switches to Rust. :) (hey, I can daydream can I?).

~~~
pcwalton
And everyone will have to accept the performance loss--and there will be
performance loss. This is acceptable for many applications, don't get me
wrong. But it will be a tough sell.

If you want to get people using a Friendly C, you need to start convincing the
biggest _users_ of C and C++ that they should give up performance for a
simpler dialect of the language. Like politicians responding to voters,
compiler authors respond to what their customers demand. Up to now, their
customers have demanded performance. It's not their fault for listening to
them.

~~~
nkurz
_If you want to get people using a Friendly C, you need to start convincing
the biggest users of C and C++ that they should give up performance for a
simpler dialect of the language._

Why would C++ programmers have to give up anything? You may notice that
the frequent complaint (and title of the article) is "undefined behavior in
C", and the hypothetical replacement language is "Friendly C", not "Friendly
C++". From the point of view of most who are troubled, rightly or wrongly, C++
isn't considered relevant to the problem or the solution.

I think part of the "divide" is that compiler writers (and probably C++
programmers) are more likely to lump C and C++ together. This makes sense, as
C++ is mostly a superset of C, and since many of the optimizations being
questioned operate at the level of internal intermediate language that's the
same for both. But many C programmers don't view them as being the same
language at all, and have no opinion on how C++ compilers should operate.

Perhaps what needs to be questioned is the assumption that the needs of C and
C++ programmers can effectively be served by the same compiler?

(I realize I've responded strongly to two of your comments in a row. I care
strongly about the issue, but don't intend this to be an attack. Rather than
trying to convince you that you're wrong, my goal is just to understand what
produces the gap between our viewpoints.)

~~~
pcwalton
Do you have any UB-sensitive optimizations in mind that affect only C but not
C++, or vice versa? I can't really think of any.

~~~
nkurz
_Do you have any UB-sensitive optimizations in mind that affect only C but not
C++, or vice versa?_

I presume there are some, but they aren't the sort of thing I have in mind.
I'm talking more about the expectations that the users of each language have.
For whatever reason, C programmers are much more likely to complain about
"broken" optimizations than are C++ programmers.

Of course this isn't absolute, but I think it's undeniably a pattern. I'd
guess this is because C has a heritage of being "portable assembly", and thus
many programmers expect a 1:1 correspondence between the code they write and
the finished product, and are startled when that correspondence doesn't hold.

In the case of explicit null checks being removed, and of loops being removed
such that memory is never zeroed, I think they have a point. Perhaps there is
some way to apply different levels of optimization to the code that the
programmer writes versus "generated" code?

------
haberman
I can see both sides of this. But looking at this argument:

> There are two ways to evaluate the C specification's rightness and
> properness. [...] The second is to ask what is most useful. And there again
> the C committee have clearly failed.

Just two days ago on HN we saw this article:
[https://news.ycombinator.com/item?id=11468603](https://news.ycombinator.com/item?id=11468603)
In which it says:

> no matter the kind of software, performance is almost always worse than our
> customers would like it to be.

That is why all of this is happening. There is a market demand for
performance. Compilers increasingly exploiting UB in C are just a
manifestation of that market demand.

~~~
jerf
You don't really understand the UB problem if you model this in your head as
some sort of simple set of rules that everybody really should have known all
along, and it's not a big deal to fix the UB in your code. You have to make
sure you're modeling it as A: a set of rules so arcane and byzantine that
virtually nobody understands them, and to a first approximation, no
non-trivial well-defined C program has ever been written and B: that this was a de
facto _change_ in C's nature... in a very real way, the language changed out
from underneath people, without them expecting it.

You can't just shrug this away with "Well, if you want performance...",
because people in fact _don't_ want abstract "performance"... they want the
language they were truly writing in to perform well, not for what is de facto
a different dialect of the language to suddenly appear and replace the
language they were using.

~~~
pcwalton
> You can't just shrug this away with "Well, if you want performance...",
> because people in fact don't want abstract "performance"... they want the
> language they were truly writing in to perform well, not for what is de
> facto a different dialect of the language to suddenly appear and replace the
> language they were using.

Is wanting a for loop setting an array to zero to optimize into memset
optimizing the language they were truly writing in? I think it is. But that
optimization frequently depends on undefined behavior.

UB exploitation usually exists because people filed bugs on compilers
complaining that they didn't optimize some case they expected to optimize.

~~~
nkurz
_Is wanting a for loop setting an array to zero to optimize into memset
optimizing the language they were truly writing in? I think it is. But that
optimization frequently depends on undefined behavior._

Yes, this is a good optimization, since it efficiently does what the
programmer intended. The bad optimization is removing the security-essential
loop altogether when the compiler notices that the result appears unused,
leaving sensitive information susceptible to later attack.

 _UB exploitation usually exists because people filed bugs on compilers
complaining that they didn't optimize some case they expected to optimize._

I doubt that anyone has ever filed a bug saying "I explicitly wrote a loop to
zero memory, but the compiler failed to optimize it out." If you know of one,
please point to it. I think you are throwing out the baby (intentional C) with
the bathwater (autogenerated C++).

~~~
pcwalton
> I doubt that anyone has ever filed a bug saying "I explicitly wrote a loop
> to zero memory, but the compiler failed to optimize it out."

That optimization is a natural consequence of SROA and DCE. If you claim you
don't want those optimizations, I don't know what to tell you. Those
optimizations are some of the most basic, critical optimizations any modern
C/C++ compiler does and throwing them away can easily result in at least a 2x
performance loss.

~~~
cwzwarich
The optimization actually has very little to do with SROA. In LLVM there is a
special loop idiom recognition pass designed to detect memset / memcpy loops
that works after ordinary SSA conversion.

------
haberman
One area that worries me that I never see anyone talk about is LTO between C
and C++. How do you even reason about that?

I asked a C++ and compiler expert about this once and he told me: "I believe
both GCC's and LLVM's LTO will happily cross this barrier, so it doesn't offer
you any real protection from their optimizers."

There is no single standard that defines what UB is for a mixture of C and
C++. There are clearly some parts of C++ that are trying to improve
interoperability with C, like "standard layout" classes. The best you can do
is try to simultaneously follow the rules for both languages when you mix the
two.

~~~
DSMan195276
Personally, I think this is blown out of proportion a bit - There are things
to complain about, but LTO problems are really only caused by other issues.
But with that said, what you described isn't really a worry:

LTO is done on the compiler's internal representation of the code (i.e. GIMPLE
or LLVM IR). This representation is generated according to the rules of each
language, and optimizations are performed on this representation instead of
the original source. Both C and C++ (and anything else) are converted into
these representations. LTO simply keeps the GIMPLE or LLVM IR around until
link time, and then when the program is linked, optimizations are performed
over the entire representation of the program. Crossing the language barrier
shouldn't be a problem, because the GIMPLE has its own rules to follow to make
sure everything still functions the same. Once you reach this point, both
languages are already compiled in the practical sense; they're just not actual
machine code yet. I would expect, however, that because of the differences
there are far fewer optimization opportunities to be taken advantage of.

~~~
haberman
Ok let's look at a specific example then.

    
    
        // foo.c
        #include <stdlib.h>
    
        typedef struct { int x; } s;
    
        s *make_s() {
          s *ret = malloc(sizeof(*ret));
          ret->x = 5;
          return ret;
        }
    
        void free_s(s *val) {
          free(val);
        }
    
    
        // bar.cc
    
        // This class is standard-layout and matches "s" from C.
        class C {
         public:
          int getX() { return x_; }
         private:
          int x_;
        };
    
        extern "C" C* make_s();
        extern "C" void free_s(C* c);
    
        int main(void) {
          C* c = make_s();
          int ret = c->getX();
          free_s(c);
          return ret;
        }
    

C++ says you can only call a method on an object whose lifetime has begun. Can
C begin the lifetime of a C++ object?

~~~
Kristine1975
_> Can C begin the lifetime of a C++ object?_

Yes. C++14 Standard §3.8p1:

 _...An object is said to have non-vacuous initialization if it is of a class
or aggregate type and it or one of its members is initialized by a constructor
other than a trivial default constructor... The lifetime of an object of type
T begins when:_

 _— storage with the proper alignment and size for type T is obtained, and_

 _— if the object has non-vacuous initialization, its initialization is
complete._

Since class "C" is standard-layout and has a trivial default constructor (here
the implicitly-declared one), assuming that malloc allocates memory that is
suitably aligned for "C", the function make_s returns a pointer to an instance
of C++ class "C" whose lifetime has begun.

Edit: Thanks for asking btw. I use similar code in a project of mine but never
actually checked whether it is standard-conforming.

------
kyberias
Glossary: UB = Undefined behavior LTO = Link time optimization

------
zokier
I'm getting a bit annoyed by these sorts of whiny posts that float up
occasionally. Here is a simple three-step program if you don't like what the C
language is today:

1) Write your own damn spec with no "problematic" UB. Hookers and blackjack
are optional

2) Write a compiler for your shiny new language.

3) Start writing/porting code to your new language

There you have it, UB problem solved once and for all.

I'm not sold on Regehr's work, but at least he is doing _something_ with
his Friendly-C proposal. This has the added benefit that we get something more
concrete to discuss and debate instead of vague "optimizations breaking
_my_ code are bad".

It is completely infeasible to try to turn back time on C and somehow
magically make compilers deduce programmer intent from some random crap that
you throw at them. Like it or not, computers are based on rules, and standards
are the best way we have to establish those rules among a large number of
parties.

~~~
Kristine1975
Designing your own language isn't even necessary. There's Rust, Ada, Modula,
Pascal... if you're feeling masochistic, use COBOL ;-)

~~~
dalke
The Wikipedia page for COBOL lists several places where programming constructs
lead to undefined behavior.

------
robertelder
The issue of creeping optimizations causing problems far in the future is one
of the reasons that I suggested using unsigned integers in C as much as
possible:

"Compiler authors will likely support this as long as possible, but the peer
pressure of needing things to go faster and faster will likely push them to
exploit more and more undefined behaviour to their advantage in the future.
Their argument will be: 'After all, who has sympathy for those who don't
follow the standard?'."

From: [http://blog.robertelder.org/signed-or-unsigned-part-2/](http://blog.robertelder.org/signed-or-unsigned-part-2/)

Lots of people disagree with me, but it's nice to have fewer UB issues to worry
about far in the future.

------
Glyptodon
UB being the various "undefined behavior" bits of C?

~~~
cpeterso
Yes. Compiler developers like it; everyone hates it. :)

~~~
twoodfin
You like it when aliasing rules allow your compiler to copy the target of a
pointer into a register, rather than read it from memory over and over in case
an intervening write through another pointer could have overwritten it.

------
sneak
Amazing that in 2016, the best programmers in the world have not chosen to
solve the "display mailing list archives on a webpage without terrible line
wrapping" problem yet.

~~~
oldmanjay
It was hard to think about, it should be hard to read

------
StringyBob
I'm a digital hardware (chips) guy. We use compilers (synthesis tools) and
normally won't get a second chance to recompile if the logic gates in our
silicon chip are wrong as a result of compiler bugs or misinterpretation of
source.

We automatically distrust the compiler (synthesis tool) to do the right thing.
We formally prove that the 'compiled' output (logic gates) that will be
manufactured matches the source code of the design (Verilog/VHDL), using
tools written independently of the compiler.

This isn't easy, and I know the problem space is larger, but does anyone ever
do this for software?

~~~
nkurz
SQLite is usually (and correctly) held as an example of thorough software
testing:
[https://www.sqlite.org/testing.html](https://www.sqlite.org/testing.html)

Perhaps you could offer a sense of how this compares to the hardware testing
practices you use?

~~~
StringyBob
In hardware design you typically test functionality through logic simulation
or emulation (effectively running the design in a computer simulation or an
FPGA), use test harnesses, look for code coverage, run unit tests, random code
fuzzing, code assertions etc. You might also do formal checks for some
assertions, e.g. to avoid deadlocks.

A secondary check is that the source that you functionally tested is logically
equivalent to what you manufacture. Here you are not checking your code; the
issue is trust of the compiler and its optimisations during synthesis. It
needs to be redone if you recompile. That's the step I don't really ever see
in software development: if I use a different compiler option or underlying
instruction set architecture to the SQLite dev team, do I still trust my
binary?

Of course the level of paranoia is far higher in hardware where it costs
multiple millions of dollars to crank out a new spin of a chip!

~~~
dalke
> if I use a different compiler option or underlying instruction set
> architecture to the SQLite Dev team, do I still trust my binary?

If that is critical, you can join the SQLite Consortium for $75K/year and get
access to the test suite. There's also an option to pay SQLite developers to
"run TH3 on specialized hardware and/or using specialized compile-time
options, according to customer specification, either remotely or on customer
premises." The TH3 test harness is an aviation-grade test suite.

The level of paranoia for aviation software is also rather high.

------
forrestthewoods
"But sadly it seems that the notion that our most basic and widely-used
programming language should be one that's fit for programming in is not yet
fully accepted."

Ouch. I usually hear that about C++, not C. =[

~~~
twoodfin
My guess is that many of the optimizations causing increasing trouble were
introduced to improve C++ code performance in pursuit of the "zero-cost
abstraction" goal. But many are just as applicable to C (and may even operate
below the level at which a compiler like gcc or clang knows the difference).

~~~
ArkyBeagle
The whole "zero-cost abstraction" thing puzzles me. Abstraction is a means to
an end, not an end unto itself.

~~~
rbancroft
We want abstractions to make things easier to do, but we don't want to pay for
them with CPU cycles... that's what a zero-cost abstraction means. It's not
about adding in useless abstractions. The ideal is software that is faster to
write and easier to understand and maintain, while offering uncompromised
execution performance.

------
hinkley
I have to say, strictly as an uninformed outsider, this bit of nastiness is
just pushing Go, Rust and even Swift higher on my priorities list.

I don't think I'm the only one.

~~~
ktRolster
Well then, don't look too deeply at the nastiness in Go, Rust, and Swift.

~~~
dsfuoi
Can you elaborate? Hopefully they don't have straight UB, do they?

~~~
zokier
Well, Rust for example does not have a formal, well-defined semantics to start
with, so it's kinda fuzzy about UB too. And of course `unsafe` is another story
altogether.

~~~
mastax
Can you elaborate? The rust docs [1] seem to disagree:

    
    
        Unlike C, Undefined Behavior is pretty limited in scope in Rust. All the core language cares about is preventing the following things:
    
        - Dereferencing null or dangling pointers
        - Reading uninitialized memory
        - Breaking the pointer aliasing rules
        - Producing invalid primitive values:
            - dangling/null references
            - a bool that isn't 0 or 1
            - an undefined enum discriminant
            - a char outside the ranges [0x0, 0xD7FF] and [0xE000, 0x10FFFF]
            - A non-utf8 str
        - Unwinding into another language
        - Causing a data race
    

And all of those things require `unsafe`, so safe rust cannot do any of them
(barring compiler or unsafe library or OS bugs).

Edit: And I must admit I don't know much about language theory or formal
definition, but there is also a self-described formal grammar [2].

[1]: [https://doc.rust-lang.org/nomicon/races.html](https://doc.rust-lang.org/nomicon/races.html)
[2]: [https://doc.rust-lang.org/grammar.html](https://doc.rust-lang.org/grammar.html)

------
dsfuoi
C is designed to be capable of running on vastly different architectures, and
it allows the programmer to use that flexibility even when it actually isn't
needed, or is harmful.

I think the C coding mindset should be this: if an approach requires code that
isn't unambiguously defined, change the approach. If this means more boring
coding, to circumvent cute tricks, so be it. If you need that extra 5% speed,
use assembly instead of bending C.

~~~
astrobe_
The problem is that people often don't even know what is UB and what is not.
People don't even always know what is portable and what isn't, to begin with.
Some even seem to believe that "it's written in C, so it's portable".

~~~
dsfuoi
I strongly agree. There is a serious lack of respect for the C rules even among
those who teach, and this is passed on to their students. It is really hard to
convince them to change their ways after that.

A couple months ago, as a curiosity, I watched a few videos in a series on
programming in C in Windows environment. The teacher was a serious programmer,
but the first thing that went out the window was strict aliasing. After that
assumptions about integer sizes and ranges started creeping in. It became
apparent that the teacher knew C only superficially: signedness, integer
promotions and the usual arithmetic conversions were treated like a nuisance. If
the code compiled and ran, it was good. Those videos were the first C
programming experience for at least several hundred people.

------
pgeorgi
For coreboot, I started making gcc (we already use our own toolchain for
various reasons) emit a symbol that induces a link-time error, because their
current approach (__builtin_trap()) is fine for undefined behavior in
userspace programs that can segfault, but not for firmware that silently
hangs.

[https://review.coreboot.org/14364](https://review.coreboot.org/14364)

------
alexeiz
On the contrary, I'm currently implementing LTO in our project because it
allows us to uncover and fix a whole slew of latent problems related to ODR
violations. At the same time I'm enabling UB sanitizers, so I'm confident that
LTO will actually make our code more robust and reliable. The usefulness of
LTO goes beyond better optimization, and it should not be ignored.

------
TorKlingberg
There has been very little new undefined behavior added to C recently. The
only one I can think of is INT_MIN % -1, but most people seem to think that
was a good idea. All compilers were already treating it as UB, even though it
wasn't.

[http://blog.regehr.org/archives/175](http://blog.regehr.org/archives/175)

------
ArkyBeagle
So the real argument is whether LTO is on by default? Because if it's the
source of trouble and it's _off_ by default, then there's only trouble when
you choose to use it.

------
ktRolster
_This narrative of 'fault' has two very serious problems._

That can be said almost any time people are trying to find fault. Better to
just look for a solution, and not worry about whose fault it is.

------
jbb555
I agree with this wholeheartedly.

I can see the C language being forked into the one we are getting and the one
users actually want.

------
cpeterso
What are the particular compiler behaviors or optimizations they would
disable?

~~~
SQLite
How about allowing pointers to be compared against NULL after they have been
passed to free(): "free(x); if( x!=NULL ){...}" That was completely harmless
(and a useful idiom) for 45 years, and now suddenly it is flagged as UB and
has to be changed. Why?

Wouldn't it be great if "memset(0,p,0)" was a harmless no-op? It was for time
out of mind. But no more.

For bonus points: Can we have a #pragma that tells the compiler to abort with
an error if the target machine uses any representation for signed integers
other than twos-complement?

~~~
Kristine1975
_> Can we have a #pragma that tells the compiler to abort with an error if the
target machine uses any representation for signed integers other than twos-
complement?_

This should do it (off the cuff):

    
    
      #include <limits.h>
      typedef int NOT_TWOS_COMPLEMENT[(unsigned int) -1 == UINT_MAX ? 1 : -1];
    

In C++ use static_assert.

~~~
dsfuoi
I don't see how that would work. Casting -1 to an unsigned type is independent
of representation and will always give the max unsigned value.

C has _Static_assert.

------
vmorgulis
ubsan could help a lot:
[http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html](http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html)

~~~
Kristine1975
And ASAN (the Address Sanitizer). I always have them turned on for development
builds (slowdown of about 100%). Only release builds have them switched off.

Recently: Use-after-free of a memory block. ASAN told me:

1. Where and in which thread the use-after-free occurred (including
stacktrace).

2. Where and in which thread the memory was deallocated (including
stacktrace).

3. Where and in which thread the memory had originally been allocated
(including stacktrace).

Fixing the bug took about 30s with that info.

