
The Problem with Friendly C - Mindless2112
http://blog.regehr.org/archives/1287
======
Since, according to Chandler Carruth, the aim for Clang is to not do
optimizations based on undefined behaviour without a corresponding instrument
in ubsan[0], I don't see much traction on this well-defined/boring C effort.

You know, in my experience, things would be great if people actually turned on
warnings. I write all my new code with -Weverything with a few noise
categories turned off. Everybody should build code at this level from day 1.

[0]
[http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html](http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html)

~~~
npsimons
> Everybody should build code at this level from day 1.

On the project I run, the CI will reject your code in CR unless it compiles
under two different compilers with all but a handful of warnings turned on as
errors, runs and passes tests, and passes various linters, including a check
for undocumented types, variables, parameters and return values. IMNSHO, every
project should be like this. Once it's set up, it's a breeze and doesn't
impede development speed at all.

------
to3m
If ((uint32_t)x<<32)==x on one system, and ((uint32_t)x<<32)==0 on another,
why is that a problem? The issue isn't that one system behaves differently
from another. The issue is that compilers take it upon themselves to decide
that because code would behave differently on these different systems - which
is why the standard leaves it undefined - it can be considered to fall into
some kind of black hole that allows the compiler to do whatever it likes.

But why can't it just produce something different on each system?

Allow me to call it as I see it: the modern interpretation of undefined
behaviour is bullshit. What compilers do today should be the recourse of
absolute last resort, and the sort of thing that makes its authors feel bad.
But it seems to be treated as a matter of course.

I don't know what to say.

Mandatory reading: [http://robertoconcerto.blogspot.co.uk/2010/10/strict-aliasing.html](http://robertoconcerto.blogspot.co.uk/2010/10/strict-aliasing.html) ("Everyone is fired");
[https://groups.google.com/forum/#!msg/boring-crypto/48qa1kWignU/o8GGp2K1DAAJ](https://groups.google.com/forum/#!msg/boring-crypto/48qa1kWignU/o8GGp2K1DAAJ) ("these people simply don't understand what
most C programmers want"); [http://blog.metaobject.com/2014/04/cc-osmartass.html](http://blog.metaobject.com/2014/04/cc-osmartass.html) ("please
don't do this, you're not producing value")

~~~
tormeh
The, to me, obvious thing to do with code with undefined behaviour is to emit
an error and refuse to compile it. Programmers should never rely on undefined
behaviour.

Then again, I think languages shouldn't have undefined behaviour and that
programmers who use languages with undefined behaviour deserve what they get.

~~~
scott_s
The problem, as Regehr (the author of this submission) has pointed out many
times before, is more subtle than that. Because certain behavior is considered
undefined, a compiler is allowed to assume the code it is compiling _is_ well
defined, and to optimize accordingly. That can cause simple bugs to have
mysterious effects. For example:

    
    
        a->thing = 42;
        if (a == NULL) {
          return;
        }
    

Obviously that code is wrong; I should check for NULL before using a. But
because I dereferenced a, the compiler can assume a must not be NULL, and just
removes the null check as dead code. This scenario has caused problems in the
kernel, and made bad bugs even worse.

~~~
tamana
But the compiler could just as well emit a warning ("useless NULL comparison")
instead of blithely assuming that the programmer intentionally wrote a bogus
expression. That wouldn't handle every case of UB, but it would handle many.

~~~
superuser2
A warning on removed dead code isn't helpful, because dead code is
legitimately removed all the time. No one would ever heed it.

~~~
slavik81
If you ignored dead code produced by macros and limited the check to function
scope, I think you'd mostly eliminate false positives.

------
userbinator
I think the most important point of Friendly C is not to define _the_
behaviour for what would otherwise be undefined, but to define _a_ behaviour;
from this point of view, it would be unnecessary to argue over the examples he
mentions, like memcpy() vs memmove() and integer shifting --- it suffices
that every implementation define the behaviour, and what that behaviour
precisely is can differ between them. A lot differs already between platforms,
and so all this basically means is documenting those differences. This agrees
with the spirit of the language as a "high level assembler", removes all the
undefinedness, and would be entirely unsurprising to most programmers.

_The situation gets worse when we start talking about a friendly semantics
for races and OOB array accesses._

I don't have anything to say about races, but OOB array accesses should simply
do what you'd expect the hardware to do: attempt to access memory at the
location the array indexing equation gives. It may segfault, or it may access
some other contents in memory. That's what any C programmer would probably
expect.

~~~
caf
You have to look deeper for the OOB array access question. For example,
imagine I have code like this:

    
    
      if (a > 1)
        b++;
      array[b] = 0;
      c = a + 10;
    

Under your suggested semantics, can the compiler use the value loaded from `a`
at the point of the if() statement to calculate `a + 10`? Or does it have to
emit a reload of `a` after the array access, in case `array[b]` was an OOB
access that overwrote `a`?

~~~
nkurz
I'm not seeing the deeper question, whereas what 'userbinator' said still
makes sense to me.

The compiler's implementation-defined behavior would simply be: "if you tell
me to write a value to an address, I'll generate assembly that tries to write
that value to that address". That's all. If the address is outside the
allocated range of the array, there is no guarantee that it's safe to write
this value, or that it won't break something else.

So in your example, no, the compiler is not required to reload 'a'. The
compiler is allowed to presume that the array[x] syntax will have no effect on
the value of a local variable, whether that variable is stored in a register
or on the stack. The compiler is not guaranteeing safety, just best effort.

~~~
caf
When you're saying that the compiler should be allowed to optimise under the
assumption that OOB array accesses don't clobber other variables, this means
that OOB array accesses can make very weird things happen indeed. For example:

    
    
      if (a > 1)
          b++;
      array[b] = 0;
      if (a > 1)
          foo(a);
    

We might see foo() called with argument 0 when that's apparently impossible -
the OOB write has apparently "reached forward" (it can also "reach backward").

This is why it ends up just being "undefined behaviour" - to do better, you
either have to somehow exhaustively document all the kinds of weird things
that can happen ("it writes to the memory" isn't enough, because of the way
that can interact with the optimiser), or you have to unreasonably constrain
the optimiser.

------
jerf
Tone: I do not mean this as sarcasm or merely chasing fashion, I'm quite
serious. As both theory and practice are showing, you're never going to be
able to get the consensus you want out of C. There's no "saving" C... not
because that's somehow mathematically impossible, but simply because the
project is too staggeringly large for us to even wrap our heads around. It
would literally be easier to get people to start using another language...

... so, why not do that then? We have, for perhaps the first time in 40 years,
a candidate for a systems language that can truly replace C, that has truly
different semantics (i.e., not C++, which is still profoundly C with a lot of
stuff bodged on the side). I'd suggest trying to use Rust, and working with
that team to nail down whatever remaining issues may yet be undefined in Rust
that may cause trouble in the future. Whatever remaining practical problems
there may be (and my impression is that that isn't really a long list), work
on resolving them.

Again, I am not being sarcastic or cynical; it is truly my estimate that it
would be easier to get people behind that than to fix C at this point.
Probably by a good two or more orders of magnitude. Obviously we're not going
to rewrite all existing code. Obviously there's a lot of code still to be
written that is so deeply embedded in C that there's no practical alternative
to adding more C to it, even in a world of FFI support and such. But if the
people who care about the idea of BoringC or Friendly C start getting behind
Rust, getting their hands dirty with it, and doing what the Mozilla project is
doing to find places where they can start slotting subsystems in cleanly to
existing code bases, you may just be able to start creeping out of the mess
we're in now.

And... who knows. If this becomes acceptable, then generally considered a good
idea, then best practice, then perhaps even something you need to do unless
you want people to think poorly of you... perhaps real change will prove to be
less intractable than we thought. People consistently overestimate change in
the short term but underestimate in the long term. It is, perhaps, not too
much to hope for that huge swathes of our fundamental systems could be running
on Rust in 20 years, instead of C.

So I'll reemphasize once again... yes, I know I'm proposing a staggeringly
enormous change. The only thing that it has going for it is that I still think
it's easier than the staggeringly-enormous-squared other choice.

_Something's_ gonna happen. After all... what's the alternative? That C is
still the foundational language of the entire computing world in 2035?

Seriously?

~~~
chubot
I'll take the bet that in 2035 it's still going to be C/C++ (or a C derivative
like Boring C), because rewriting all that code is an economic impossibility.
There's a very long way to go before the rate of foundational Rust code
written exceeds the rate of foundational C/C++ code written. And even if you
manage to have 100% Rust and 0% C/C++ code being written, you still have a
huge legacy to rewrite, which would literally cost billions of dollars (ALL
operating systems, ALL browsers, ALL interpreters (Python/PHP/Perl/...), ALL
compilers (GCC, LLVM), ALL web servers, etc.)

Honestly I think your view of technological adoption is fairly naive -- I
don't see much content here other than "everyone get behind Rust!".

IMO the more realistic approach is a systems approach: make it so that badly
written C code doesn't completely hose your system. I like the application
compartmentalization work (Chrome style, DJB style), and capability work like
Capsicum. And LangSec work in making safe parsers. The trusted computing base
has to be reduced. Not every line of C code should run in a trusted context!!!
Principle of least privilege. We know (or should know) all this stuff.

That is many orders of magnitude less work and cost, and I think it's actually
feasible. Coherent and secure systems architecture is more achievable than
everybody writing perfect C code or everybody switching to another language.

Another thing the Rust community should be working on is easy and efficient
IPC with C programs. So you can rewrite a secure core in Rust and communicate
with legacy C/C++ running in an untrusted OS context.

And also fixing the mess that is Linux containers, so it isn't so difficult to
secure them (e.g., see Docker's security issues).

~~~
dikaiosune
I haven't used it yet, but apparently Rust's FFI with C (in both directions)
is quite good:

[https://doc.rust-lang.org/book/ffi.html](https://doc.rust-lang.org/book/ffi.html)

~~~
chubot
That doesn't do IPC though, right? When I say IPC, I mean where you have a
Rust process and a C/C++ process running with different privileges. When
people say FFI, they mean Rust code and C code running in the _same_ process.

I thought Chromium had a library for this (for multiple C++ processes), but I
guess it is sort of ad hoc now?

[https://www.chromium.org/developers/design-documents/inter-process-communication](https://www.chromium.org/developers/design-documents/inter-process-communication)

[https://www.chromium.org/Home/chromium-security/education/security-tips-for-ipc](https://www.chromium.org/Home/chromium-security/education/security-tips-for-ipc)

~~~
steveklabnik
[https://github.com/pcwalton/gaol](https://github.com/pcwalton/gaol)

~~~
chubot
This seems like a great start, but it would be far more useful if there was a
C/C++ side. I only see Rust examples, which makes me think it is for Rust <->
Rust communication.

Though it's great to see this, because just rewriting in Rust -- while
fantastically expensive -- isn't necessarily enough. You need multiple
security measures. The grandparent comment was mistaken about this.

------
petke
I think this is a fundamental problem with C: partly of the language itself,
but also of the C programming culture of insisting on very low levels of
abstraction.

Bugs and undefined behaviour are too easy for programmers to write when the
level of abstraction is low. There is too much boilerplate to get wrong. But
too low a level is also bad for an optimizer: it can't assume it understands
the programmer's intention for the code.

C++ tries to fix some of this by creating a higher level language and library
on top of C. Low level code is considered unsafe, when higher level
abstractions can be used as replacements.

Some examples of replacing low level with high level. Raw (owning) pointers
and manual memory management are replaced with RAII value semantics and
occasionally smart pointers. Raw loops are replaced by container iteration and
better yet by STL algorithms. Casts are replaced by templates. Threads and
mutexes are replaced by tasks and async, etc.

Eventually the idea is to subset C++ so that we can rip the C out of C++.

[https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md)

~~~
mtdewcmu
I don't think C is bad for optimizers. What language consistently generates
faster machine code than C?

~~~
kyllo
Fortran, but Fortran is also quite low-level. It can do some optimizations
that C can't because it disallows pointer aliasing.

~~~
nwmcsween
restrict?

------
xamuel
You have to choose whether or not you want the compiler to generate code that
spends time doing things that the programmer didn't ask for.

If you write a library with a function which accepts variables x and y and
computes x[y] (or x<<y), there's no way for the compiler to know in advance
whether the user will pass in well-behaved values for x and y. If you insist
on the behavior being defined in pathological cases, then the compiler _must_
add some kind of branching logic in there to check what's passed in at
runtime. In other words, spend time doing things the programmer didn't ask
for.

Maybe when we make more progress with formal proof-generating languages, we
can create a "friendly" C where the compiler refuses to compile the code until
it's accompanied by formal proofs of UB-avoidance.

~~~
Animats
_" Maybe when we make more progress with formal proof-generating languages, we
can create a "friendly" C where the compiler refuses to compile the code until
it's accompanied by formal proofs of UB-avoidance."_

That's entirely feasible. I headed a team which did that for a dialect of
Pascal over three decades ago.[1] It's since been done for Modula-3, Java,
Spec#, and Microsoft Windows drivers.

One reasonable thing to do is to have the compiler generate run-time
assertions for every statement which has a precondition for defined behavior.
All pointer dereferences get "assert(p != 0)". The shift problem in the
original article gets "assert(n >= 0); assert(n < 32);". Now everything has
run-time checks.

Then you try to optimize out the run-time checks, or at least hoist them out
of loops. A simple prover can remove about 90% of the checks without much
effort.

[1]
[http://www.animats.com/papers/verifier/verifiermanual.pdf](http://www.animats.com/papers/verifier/verifiermanual.pdf)

------
praptak
Fun fact about large shifts being undefined.

With a 32-bit x, the expression x << b | x >> (32 - b) can be translated into
a single "rotate x left by b bits" instruction. But that is only possible if
shifts of 32 or more bits are treated as undefined.

~~~
pbsd
Not true. In fact, modern x86 backends recognize the safe rotation idiom and
also translate it to a single ROL instruction:
[http://goo.gl/NCMEVu](http://goo.gl/NCMEVu)
[http://goo.gl/9Ytsz2](http://goo.gl/9Ytsz2)
[http://goo.gl/vbzHNW](http://goo.gl/vbzHNW)

------
ryandrake
I must be missing something here, but can't you simply not invoke undefined
behavior in your program? That way, you don't have to worry about what your
compiler will do when it encounters undefined behavior.

~~~
marcosdumay
That's possible.

It requires extreme care (ensuring no memory leaks is a walk in the park by
comparison), and it will make your code much less performant. But it's
possible.

------
David
It's easier to expand than contract. When the same decision is faced multiple
times, it will be made multiple ways, and it's very difficult to remove one of
those options once it's in use. On the other hand, if you remove the decision
by standardizing on one of the options, you can always allow the other option
later.

Working in Perl, I run into this a lot. When 'there is more than one way to do
it' is a driving principle, it's easy to be inconsistent in usage and design.
There's a tradeoff here -- it's easier to write code, because you can pretty
much write your solution however it pops into your head. However, it's harder
to read and maintain, because you have to recognize many many different
patterns of usage. When code has been around for a while, it starts to
accumulate an eclectic blend of patterns, which can be frustrating to run into
when you're trying to fix that code.

How do other people deal with this? Part of it is language choice, I'm sure.
Large companies tend to use languages with less stylistic freedom, which helps
teams write code more consistently. But within a given language, how do you
balance freedom and consistency, so that people can understand each others'
code effectively without being overly burdened by restrictions about how they
can write code?

~~~
tamana
Working in say Java doesn't really force consistency, it just pushes the
inconsistency up a level. Programmers still get painfully "clever" and
idiosyncratic.

------
aplorbust
Being an ignorant fool with an uninformed opinion, I would like to see a C
compiler that is evaluated and critiqued not only based on the warnings and
errors it generates but, more importantly, on the assembly it generates.

Namely, how compact and readable is the generated asm? When we read the asm,
can we easily follow what the compiler has done and _why_?

As an ignorant fool, in my mind C is still a shorthand for writing assembly,
to save old programmers from continuing to be, or new programmers from
becoming, "assembly language programmers". Obviously many years have passed
and "C" has become an institution and means much more to so many people.
Apologies to those people. I am just a fool.

I see the _theoretical_ "C compiler" as nothing more than a code generator,
spitting out assembly. Obviously the _practical_ C compiler is very different.
Based on the way it's used, it seems inextricably linked with a "preprocessor"
(a glorified sed).

It is said that asm has a "one to one" relationship with machine code.
Theoretically, we can look at asm and determine its machine code equivalent
without any "clever" algorithms.

My humble, ignorant fool's opinion is that it would be better if C had a
closer relationship to the asm the "C compiler" generates.

Maybe not "one to one" but at least "predictable, boring".

Forgive me for having opinions about overly complex things few people can
comprehend (e.g., a "modern C compiler"). I am just an ignorant fool who likes
code generators. Especially ones that output assembly.

~~~
sanxiyn
I agree it would be an interesting exercise to write a compiler optimized for
assembly readability. But I don't think there is enough demand for such a
thing so that it will get written.

It would also be interesting if those who think there is enough demand started
a crowdfunding campaign to prove that there is.

~~~
mistercow
> But I don't think there is enough demand for such a thing so that it will
> get written.

Isn't that basically what we have academia for?

~~~
sanxiyn
I don't think there is enough academic prestige for a compiler optimized for
assembly readability either.

------
crawshaw
I applaud the author for reaching the daunting part of this project and
reacting with humility appropriate to the difficulty. I hope they continue in
this effort, because they sound like the right person for the job.

There are going to be a lot of trade-offs in any friendly C spec. Appreciation
for both sides of a trade-off is valuable in making good decisions.

------
cfv
How exactly would performance degrade by defining a virtual machine for C
programs to run in, where assurances are given that all of the standard
behavior is fully defined ("this operation either fails, or gives THIS
result")? It sounds like it could be massively useful, at least for
non-realtime applications.

~~~
lmm
pcwalton claims a factor of 3x to 5x elsewhere in the thread.

Honestly I think the biggest problem here is social. Every programmer thinks
they're smarter than others. If you give them a button marked "remove all
safety checks, increase performance by 1%", they'll press it. Those who
wouldn't press it have probably already moved on from C.

------
guelo
The Android team idea didn't make any sense to me.

~~~
account921
I didn't get that either. They're shipping a totally broken locale.h and
should be the last people ever to get involved in C language design.

------
jokoon
I wonder what Ritchie would think of this.

Also, if friendly C is a problem, does that mean that friendly C++ is in the
realm of the impossible?

------
analognoise
Target a virtual architecture (LLVM IR). Run the original code in an emulator
for the target system (QEMU); give warnings where the two differ, with a
switch for the virtual architecture to mimic the behavior of whatever
architecture you need.

You get friendly C, a new way to think about warnings, and your old code can
still be compatible with the new virtual architecture.

Next up, world hunger.

~~~
def-
You can't simply run code for all possible inputs.

~~~
analognoise
No, but you can certainly exercise it well. It might serve as a
degree-of-confidence measure. Combined with property verification (so, formal
verification), you'd get a damn good idea that they'd be close.

