
Into the Depths of C: Elaborating the De Facto Standards [pdf] - jsnell
http://www.cl.cam.ac.uk/~pes20/cerberus/pldi16.pdf
======
TickleSteve
I've been using C for over 20 years and I'm sure I would be caught out by
these...

...but...

The fact that these curiosities are not an issue in day-to-day work, and that C is (one of) the most popular languages around today, means that they aren't too serious.

When you have knowledge of the hardware and are working at that level day-in, day-out, issues like this really don't bother you that much.

(I do realise that this is a slightly contrarian view these days, but there is
an awful lot of unjustified C-bashing around currently).

~~~
pkhuong
[http://cacm.acm.org/magazines/2016/3/198849-a-differential-a...](http://cacm.acm.org/magazines/2016/3/198849-a-differential-approach-to-undefined-behavior-detection/fulltext)

There are 8,575 C or C++ packages in Wheezy; this tool found definite UB bugs in 40% of them.

How would you know that your code is _sometimes_ misbehaving because of UB?

~~~
TickleSteve
Not denying there are issues... just saying that C isn't the only widespread language with issues. At least C's problems are widely known.

~~~
D-Coder
The Potzrebie car isn't the only car with issues. At least the Potzrebie's
doors-falling-off problem is widely known.

~~~
Gibbon1
If the compiler writers' union decides that the spec allows for 'format the hard drive on signed overflow', we can always change the spec to 'don't format the hard drive on signed overflow'.

~~~
pcwalton
It's not that simple. The compiler writers have caused security issues in the Linux kernel before. And the compiler writers are _right_: the undefined behavior that they exploit exists for important performance-related reasons.
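
The best-known instance is the 2009 tun.c bug, where a dereference preceded the null check, so GCC was entitled to delete the check. A simplified, self-contained sketch with stand-in types and names, not the actual kernel code:

    struct sock;                               /* opaque; enough for the sketch  */
    struct tun_struct { struct sock *sk; };    /* stand-in for the kernel struct */
    
    int tun_poll_sketch(struct tun_struct *tun)
    {
        struct sock *sk = tun->sk;  /* dereference happens first: UB if tun   */
        if (!tun)                   /* is NULL, so the compiler may infer     */
            return -1;              /* tun != NULL and delete this check,     */
        (void)sk;                   /* turning a kernel oops into an exploit  */
        return 0;
    }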

~~~
pizlonator
To say they are _right_ is an overstatement, I think. There is insufficient
discussion of the empirically measured performance benefits of specific forms
of UB.

Some kinds of UB could be turned into something stricter, like reading a bad pointer. This either traps or returns some value. It won't format your hard drive unless you install a trap handler that formats your hard drive, but that's none of the spec's business. Traps can happen due to timers, so if arbitrary traps meant UB then every instruction would be UB. Even if the spec punts on defining what a trap is, that's still an improvement over saying it's UB.

The same goes for division and modulo. In the corner cases, they will either return some value or trap. They won't format your hard drive.
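
For concreteness, the corner cases are division by zero and INT_MIN / -1, whose mathematical result overflows. A minimal sketch of what real hardware does:

    #include <limits.h>
    
    /* ISO C calls both corner cases UB, but in practice hardware either
       traps (x86's idiv raises SIGFPE for both) or returns some value
       (AArch64 gives 0 for b == 0 and wraps INT_MIN / -1 to INT_MIN).
       Neither outcome formats your hard drive. */
    int divide(int a, int b)
    {
        return a / b;   /* UB if b == 0, or if a == INT_MIN && b == -1 */
    }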

The most profitable "true" UB is stuff like TBAA, but smart people turn that
off.
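
TBAA being type-based alias analysis, i.e. the strict-aliasing rules; "turning it off" means building with -fno-strict-aliasing, as the Linux kernel does. A sketch of the classic violation and the defined alternative (function names here are just for illustration):

    #include <string.h>
    
    /* Under strict aliasing the compiler may assume a float* and an
       unsigned* never refer to the same object, so type-punning through
       a pointer cast is UB and TBAA may miscompile it: */
    unsigned bits_of_bad(float f)
    {
        return *(unsigned *)&f;       /* strict-aliasing violation */
    }
    
    /* The defined alternative (assuming unsigned and float have the same
       size, as on mainstream platforms); compilers lower the memcpy to a
       single register move, so it costs nothing: */
    unsigned bits_of_ok(float f)
    {
        unsigned u;
        memcpy(&u, &f, sizeof u);
        return u;
    }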

Do you know what the performance benefits are of other kinds of UB? Do you know how many of those perf benefits (like maybe being able to take some shortcuts in SROA) can't be recovered by changing the compiler (i.e. you'd get the same perf, but the compiler follows slightly different rules)? Maybe I'm not so well read, but I hardly ever hear of empirical evidence that proves the need for UB, only demos that show the existence of an optimisation in some compiler that would fail to kick in if the behaviour was defined.
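
The usual demo of that sort is a loop like the sketch below: because signed overflow is UB, the compiler may assume the counter never wraps, so it knows the loop runs exactly n + 1 times and can unroll or vectorise it. With defined wraparound it would have to allow for n == INT_MAX making the loop infinite.

    /* Illustrative sketch; assumes a[0..n] is valid. */
    long sum_inclusive(const int *a, int n)
    {
        long s = 0;
        for (int i = 0; i <= n; i++)   /* overflow of i would be UB, so */
            s += a[i];                 /* the compiler may assume the   */
        return s;                      /* loop terminates               */
    }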

Also, if there were perf benefits of the really gnarly kinds of UB, I would
probably be happy to absorb the loss in most of the code I write. If I added
up all of the time I've wasted fixing signed-unsigned comparison bugs and used
that time to make WebKit faster, then I'd probably have made WebKit faster by
a larger amount than the speed-up that WebKit gets from whatever corner-case
optimisation the compiler can do by playing fast and loose with signed ints.
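
The kind of bug meant here comes from the usual arithmetic conversions silently turning the signed operand unsigned, as in this small sketch:

    #include <stdio.h>
    
    int main(void)
    {
        int n = -1;
        /* sizeof yields size_t (unsigned); on typical platforms n is
           converted to a huge unsigned value, so the test is false: */
        if (n < sizeof(int))
            puts("what you expected");
        else
            puts("what you actually get");   /* this branch runs */
        return 0;
    }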

I suspect that UB is the way it is because of politics: you can't get everyone to agree on what will happen, nobody wants to lose an optimisation that they spent time writing, and so we punt on good semantics.

------
paxcoder
To write portable code, I wouldn't study de facto definitions of de jure undefined behavior, except to see whether I could cover every possible one, and only if all the alternatives were inferior.

~~~
pm215
Indeed not, but enquiring into what the in-the-wild de facto beliefs about behaviour are might help in deciding what the de jure rules should be changed to, or what a compiler implementation ought to do if it cares how it behaves on the vast mass of code out there that commits undefined behaviour, wittingly or otherwise...

~~~
paxcoder
Undefined behavior has a purpose: not specifying implementation details makes it easier to write new implementations, and for a wider variety of platforms. "De facto standards" take away this freedom, so ideally you'd want to reject reliance on UB, but I see your (second) point about that not always being practical. I guess "be conservative in what you do, be liberal in what you accept from others". Just make sure that your foundations are strong (pun intended) or the whole house will be an ECMAScript.

------
kerneis
If you think you know C quite well, here is one of the studies the authors ran
to elaborate their semantics on corner cases of the language:
[http://www.cl.cam.ac.uk/~pes20/cerberus/notes50-survey-discu...](http://www.cl.cam.ac.uk/~pes20/cerberus/notes50-survey-discussion.html)

> If you zero all bytes of a struct and then write some of its members, do
> reads of the padding return zero? (e.g. for a bytewise CAS or hash of the
> struct, or to know that no security-relevant data has leaked into them.)

(and 14 other questions)

Webpage of the project:
[http://www.cl.cam.ac.uk/~pes20/cerberus/](http://www.cl.cam.ac.uk/~pes20/cerberus/)

~~~
blastrat
I knew C quite well; haven't written any for years. The statements "zero all bytes of a struct" and "reads of the padding" contain enough ambiguity that this alone answers the question. Not to mention the ambiguity in the words "read" and "write" as they pertain to C, since they already have a "std" meaning that's not the same as lvalue or rvalue, so what exactly do they mean here?

And if you think you can answer the question without resolving the
ambiguities, that answers some other questions.

~~~
kerneis
I believe the questions in this study (I did not write it, I only know the authors) were deliberately open-ended, allowing for comments on the specifics. A previous, much longer version contained code examples to comment on, but it proved too detailed for people to complete.

Moreover, the study was explicitly not about ISO C: "We were not asking what
the ISO C standard permits, which is often more restrictive, or about obsolete
or obscure hardware or compilers. We focussed on the behaviour of memory and
pointers. This is a step towards an unambiguous and mathematically precise
definition of the de facto standards: the C dialects that are actually used by
systems programmers and implemented by mainstream compilers."

Here is an actual example of a comment on this question:

    
    
        I would expect this code to work:
        
        #include <assert.h>
        #include <string.h>
        
        struct foo
        {
            char a;
            double b;
        };
        
        int main(void)
        {
            struct foo p;
            struct foo q;
            memset( &p, 0, sizeof( p ) );
            memset( &q, 0, sizeof( q ) );
            p.a = 1;
            q.a = 1;
            assert( memcmp( &p, &q, sizeof( struct foo ) ) == 0 );
        }

~~~
Too
Are memset and memcmp compatible with strict aliasing? Intuitively it seems like that would be a gap in the aliasing rules. Although void* is allowed to alias anything, so maybe it works through that. I've never seen memset, memcpy or memcmp on anything but char* in production code.

~~~
gpderetta
The aliasing rules only talk about dereferencing pointers. Void* can't be
dereferenced so it has no interaction with the aliasing rules. You might be
thinking about char _, and yes, you are allowed to dereference char_ to
inspect the bytes of an object, which is what the various mem* functions do
under the hood.
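
Concretely, accesses through character types are allowed to alias any object, which is what makes a portable memset writable at all. A minimal sketch in plain ISO C (my_memset is just an illustrative name):

    #include <stddef.h>
    
    /* A toy memset: an unsigned char* may legally alias any object
       type, so writing through it is defined behaviour no matter what
       'dest' really points to. */
    void *my_memset(void *dest, int c, size_t n)
    {
        unsigned char *p = dest;
        while (n--)
            *p++ = (unsigned char)c;
        return dest;
    }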

