
Re: C as used/implemented in practice - davidtgoldblatt
http://article.gmane.org/gmane.comp.compilers.llvm.devel/87749
======
nathanb
I have difficulty accepting "let's replace C with X", where X is a memory-
managed language. As a systems programmer (I write SCSI driver code in C), I
can't overemphasize how important it is to be able to address memory as a flat
range of bytes, regardless of how that memory was originally handed to me. I
need to have uint8_t* pointers into the middle of buffers which I can then
typecast into byte-aligned structs. If your memory manager would not allow
this or would move this memory around, that's a non-starter.
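
To make that concrete, something like this (the struct and names are invented, and a real driver also worries about endianness):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical on-the-wire header; __attribute__((packed)) is the
       GCC/Clang spelling for "byte-aligned, no padding". */
    struct cmd_header {
        uint8_t  opcode;
        uint8_t  flags;
        uint16_t length;
    } __attribute__((packed));

    /* Point a struct at an offset in the middle of a raw buffer handed to
       us by another layer -- no copying involved. (Strictly speaking this
       bumps into C's aliasing rules, which is part of what the article is
       about, but it's everyday systems code.) */
    static struct cmd_header *header_at(uint8_t *buf, size_t offset)
    {
        return (struct cmd_header *)(buf + offset);
    }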

I don't stick with C because I love it. If I'm writing something for my own
purposes, I use Ruby. I've written some server code in Golang (non-
production), and it's pretty nifty, even if the way it twists normal C syntax
breaks my brain. I even dabble in the dark side (C++) personally and
professionally from time to time. And in a previous life, I was reasonably
proficient in C# (that's the CLR 2.0 timeframe; I'm completely useless at it
in these crazy days of LINQ and the really nifty CLR 4 features...and there's
probably even more stuff I haven't even become aware of).

But none of those languages would let me do what I need to do: zero-copy
writes from the network driver through to the RAID backend. And even if they
did, the pain of rewriting our entire operating system in Go or Rust or
whatever would be way more than the alleviated pain of using a "nicer"
language.

(We never use 'int', by the way. We use the C99 well-defined types in
stdint.h. Could this value go greater than a uint32_t can represent? Make it a
uint64_t. Does it need to be signed? No? Make sure it's unsigned. A lot of
what he's complaining about is sloppy code. I don't care if your compiler
isn't efficient when compiling sloppy code.)

~~~
spoiler
I agree, and I'd like to add that it's not just this particular author: most
people who criticise C for its "insecurities" use sloppy code when they do so,
which always bothers me. I'm far from being a C fan (I'm also a Ruby fanboy),
but programming languages aren't _safe_, only code can be safe, and that
depends entirely on the developer.

Yes, it's "easier" to introduce some bugs in C than Ruby (or Go, or whatever),
but that's because whoever wrote that code with the bug didn't know C well
enough. Is that C's fault? Same can be said about any language, really.

If you don't know that String#match returns nil on unsuccessful matches and
try to call MatchData#[], you'll get an NPE (something along the lines of
"undefined method `[]' for nil:NilClass"). This is very similar to
dereferencing a NULL pointer in C[1].

[1]: I know dereferencing a NULL pointer in C is undefined behaviour, but your
program will crash (if you're lucky) when you work with NULL pointers you
weren't expecting.
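
For non-C folks, the C side of that analogy looks roughly like this (strstr returns NULL when there's no match):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *hit = strstr("hello", "xyz");  /* no match -> NULL */
        printf("%c\n", hit[0]);                    /* NULL dereference: UB, usually a crash */
        return 0;
    }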

~~~
dbaupp
This is nonsense. C has a very weak type system and very weak runtime
guarantees, making it much easier to introduce problems with no indication
that something's up. Other languages with strong type systems and/or stronger
runtime checks eliminate large classes of bugs that are very easy to trigger
in C.

So, yes, it is C's "fault" that it doesn't protect against classes of bugs
that many other languages do. Sure, those languages have _some_ of the same
bugs that C does, but they're missing most of the very worst ones and that's
really powerful. For example, a garbage collector protects against accessing
dangling pointers: it's just not something the programmer has to worry about
at all.

Rejecting criticisms of C's safety inadequacies with "just code better"/"just
learn the language better" doesn't work in practice: there have been too many
high-profile vulnerabilities in C software, many of which would've been _much_
harder to trigger in other languages.

~~~
cremno
> C has a very weak type system and very weak runtime guarantees, making it
> much easier to introduce problems with no indication that something's up.

Here's an interesting example I stumbled upon a few weeks ago:

[https://stackoverflow.com/questions/31037149/type-safety-for-complex-arithmetic-in-c99/](https://stackoverflow.com/questions/31037149/type-safety-for-complex-arithmetic-in-c99/)

Converting float _Complex to float doesn't require any diagnostic, even though
the imaginary component is silently discarded. Clang has a warning for it (not
enabled by default or by -Wall/-Wextra); current GCC versions don't.
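
A minimal sketch of the conversion in question (as far as I can tell it compiles silently at the default warning levels):

    #include <complex.h>
    #include <stdio.h>

    int main(void)
    {
        float _Complex z = 3.0f + 4.0f * I;
        float f = z;            /* imaginary part silently discarded */
        printf("%f\n", f);      /* prints 3.000000 */
        return 0;
    }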

~~~
asveikau
I'm sure both people using C99 support for complex numbers are really bothered
by this.

~~~
stephencanon
Nope, never been a problem. Can't speak for the other guy.

~~~
FreeFull
Never been a problem for me either.

------
byuu
I understand that in some cases, these heroic compiler optimizations can offer
significant performance increases. We should keep C around as it is for when
said performance is critical.

But surely, we can design a language that has no undefined behavior, without
substantial deviations from C's syntax, and without massive performance
penalties. This language would be great for things that prize security over
performance.

And the trick is, we don't need to rewrite all software in existence in a new
language to get there! C can be this language; all we need is a special
compilation flag that replaces undefined behavior with defined behavior.
Functions called inside a function's arguments? Say they evaluate left-to-
right. Shift right on signed types? Say it's arithmetic. Size of a byte? Say
it's 8 bits. memset(0x00) on something going out of scope? If the developer
said to do it, do it anyway. Underlying CPU doesn't support this? Emulate it.
If it can't be emulated, then don't use code that requires the safe flag on
said architecture. Yeah, screw the PDP-11. And yeah, it'll be slower in some
cases. Yes, even _twice_ as slow in some cases. But still far better than
moving to a bytecode or VM language.
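
To make the memset(0x00) case concrete, a sketch of the pattern that gets optimized away today, plus the usual workaround (names invented):

    #include <string.h>

    /* Workaround in today's C: call memset through a volatile function
       pointer so the compiler can't prove the store is dead. (C11's
       optional Annex K also has memset_s for this.) */
    static void *(*volatile memset_keep)(void *, int, size_t) = memset;

    void handle_login(void)
    {
        char password[64];
        /* ... read and use the password ... */

        /* A plain memset() here may be deleted as a dead store, since
           `password` is about to go out of scope -- exactly the case
           argued about above. Going through the volatile pointer keeps it. */
        memset_keep(password, 0x00, sizeof password);
    }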

And when we have guaranteed behavior of C, we can write new DSLs that
transcode to C, without carrying along all of C's undefined behavior with it.

You want to talk about writing in higher-level languages like Python and then
having C for the underlying performance-critical portions? Why not defined-
behavior C for the security-critical and cold portions of code, and undefined-
behavior C for the performance-critical portions?

Maybe Google wouldn't accept the speed penalty; but I'd happily drop my
personal VPS from ~8000 maximum simultaneous users to ~5000 if it greatly
decreased the odds of being vulnerable to the next Heartbleed. But I'm not
willing to completely abandon all C code, and drop down to ~200 simultaneous
users, to write it in Ruby.

~~~
pcwalton
> But surely, we can design a language that has no undefined behavior, without
> substantial deviations from C's syntax, and without massive performance
> penalties.

…Including undefined behavior around memory allocation, in particular use-
after-free?

What to do about that is the big question, in my mind. Other forms of UB can
mostly be patched up straightforwardly with a clean design (though there are
some tough questions around bounds checks). But when it comes to UAF, there
are basically three ways you can go about this and still remain a runtimeless
systems language:

1. Compromise on "no UB" for use-after-free. UAF remains undefined behavior.
Some variants of Ada with dynamic memory allocation have this, and I believe
many Pascals did this. It's a popular approach in many new systems languages,
like Jonathan Blow's Jai.

2. Disallow dynamic allocation. This is the approach taken by SPARK and other
hardened variants of Ada.

3. Allow dynamic allocation, but statically check it with a region system.
This is Rust's approach. Eliminating memory safety problems in this way while
avoiding a GC is pretty much unique to that language, though it's obviously
influenced by many other systems that came before it (C++, Cyclone).

All of the options have serious downsides. Option (1) opens you up to what has
become, in 2015, a very common RCE vector. Option (2) is very limiting and
pretty much restricts your language to embedded development. Option (3) has
large complexity and expressiveness costs (though once you've paid the cost
you can get data race freedom without any extra work, which is nice).
Altogether it's a really difficult problem with tough tradeoffs all around.
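
For concreteness, here's roughly what the use-after-free case looks like; nothing in the language flags it, and calling through the dangling pointer is the classic path from UAF to RCE (names invented):

    #include <stdlib.h>

    struct session {
        void (*on_close)(void);
    };

    static void normal_close(void) { /* ... */ }

    void demo(void)
    {
        struct session *s = malloc(sizeof *s);
        if (!s) return;
        s->on_close = normal_close;

        free(s);

        /* Use after free: the allocator may already have handed this memory
           to someone else, so this call goes wherever the reused bytes
           point. Undefined behavior. */
        s->on_close();
    }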

~~~
byuu
> Including undefined behavior around memory allocation, in particular use-
> after-free?

There are obviously going to be limits to what can be done. If you access
memory out of bounds, you get "bad data" if the address is mapped by the OS, or
a crash if it's not. That's a clear bug, and we can't make C a language that is
incapable of producing programs with bugs. I don't really think of this as
"undefined" ... we define very clearly that one of two things happens, based
on the OS' memory layout. That's very different from GCC's understanding,
where undefined == "if I want to have the program upload a cat picture to
Reddit instead of shift a signed integer right, then that's what I'll do."
(facetious, but you get the idea. Many of GCC's 'optimizations' cause outright
security vulnerabilities, and defy all logic, like deleting chunks of code
entirely.)

We want the most logical thing to happen when a user does something, not a
completely unexpected thing just because it happens to make some compiler
benchmark test look a little better.

> Other forms of UB can mostly be patched up straightforwardly with a clean
> design

I'm betting there aren't any C programmers out there that know 100% of the
behaviors that are undefined. I've been programming for 18 years, and I got
bit the other day because I had "print(sqlRow.integer(), ", ",
sqlRow.integer());" ... where the .integer() call incremented the internal
read position. MinGW decided to evaluate the second call first, and then the
first one, so my output ended up backward. You may think that one's obvious,
just like I might think that a shift by more bits than the integer type holds
being undefined is obvious, but there are people that would be surprised by
both.
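
A plain-C version of that surprise, with next_value() standing in for sqlRow.integer():

    #include <stdio.h>

    static int counter = 0;
    static int next_value(void) { return ++counter; }

    int main(void)
    {
        /* The order in which the two argument expressions are evaluated is
           unspecified, so a conforming compiler may print "1, 2" or "2, 1". */
        printf("%d, %d\n", next_value(), next_value());
        return 0;
    }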

Stating that function arguments evaluate left-to-right, just like "operator,"
does in expressions, would be an infinitesimal speed hit on strange systems,
and no speed hit at all on modern systems that can just as easily use an
indirect move to set up the stack frame.

And if you have a processor that can't do arithmetic shift right, which would
be _extremely_ rare, then generate that processor's equivalent of "((x & m) ^
b) - b" after the shift.

~~~
pcwalton
> I don't really think of this as "undefined" ... we define very clearly that
> one of two things happens, based on the OS' memory layout. That's very
> different from GCC's understanding, where undefined == "if I want to have
> the program upload a cat picture to Reddit instead of shift a signed integer
> right, then that's what I'll do."

I don't think there's much of a difference in practice between the two. If you
admit "bad data" into your language, you very quickly spiral into true
undefined behavior. For example, call a "bad data" function pointer—what
happens then? (This is basically how UAF tends to get weaponized in the wild,
BTW.) Or use the "bad data" to index into a jump table—what happens then?

~~~
byuu
> I don't think there's much of a difference in practice between the two. If
> you admit "bad data" into your language, you very quickly spiral into true
> undefined behavior.

Well, by your definition, it would indeed be basically impossible to turn C
into a well-defined language. You'd have to make absolutely radical changes to
memory management, pointers, etc.

So then, can we at least agree that it would be a good idea to _minimize_
undefined behavior in C? Sure, we can't fix bad pointer accesses, and I get
why this stuff was there in the '70s. But modern CPUs have largely homogenized
on some basic attributes. How about we decide that "two's complement has won"
and thus clearly define what happens on signed integer overflow? How about we
state that "a byte is 8-bits"? And so on ... all of the things that are true
of basically every major CPU in use, and that would be _exceedingly_ unlikely
to ever change again in the future.

And this can still be offered as an optional flag. But when enabled, it's just
a little bit of added protection against an oversight turning into a major
security vulnerability, and at virtually no cost.
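
For what it's worth, pieces of this flag already exist; with GCC or Clang you can opt in today, roughly like so (flag list from memory, check your compiler's docs):

    /* wrap.c -- compile with something like:
     *   cc -fwrapv -fno-strict-aliasing -fno-delete-null-pointer-checks wrap.c
     *
     * -fwrapv defines signed overflow as two's-complement wrapping, so this
     * addition is well-defined under that flag (and UB on overflow without it). */
    int wrapping_add(int a, int b)
    {
        return a + b;
    }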

~~~
ectoplasm
Out of curiosity, are there expensive runtime checks needed to handle signed
integer overflow, or do processors actually know about two's complement? If you
had to check every single time you touched an int, that could be bad. In
general, what UB can be defined with static fixes, and what UB can only be
defined with dynamic fixes?

~~~
spc476
It depends on the architecture. The VAX could be set (on a function-by-
function basis) to either ignore 2's complement overflow, or automatically
trap. The Intel x86 line can trap, but you have to add the INTO instruction,
possibly after each math operation that could overflow. I don't think the
Motorola 68k could trap on overflow. The MIPS has two sets of math operations,
one that will automatically trap on 2's complement overflow, and a set that
won't (and at the time, the C compiler I used only used the non-trap
instructions).

That's why the C standard is so weaselly about overflow: it varies widely per
CPU.
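
To answer the runtime-check half of the question: on hardware that won't trap, defining overflow as anything other than wrapping means emitting a comparison or two around each risky operation, roughly like the portable pre-check below (GCC and Clang also expose builtins such as __builtin_add_overflow that compile down to a flags check):

    #include <limits.h>
    #include <stdbool.h>

    /* Detect whether a + b would overflow *before* doing the signed add
       (the add itself would be UB if it overflowed). */
    bool add_would_overflow(int a, int b)
    {
        return (b > 0 && a > INT_MAX - b) ||
               (b < 0 && a < INT_MIN - b);
    }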

------
jeffreyrogers
For those who don't know: Chris Lattner[1], who wrote this post, is the primary
author of LLVM and more recently of Swift, so he knows a bit about what he's
talking about :)

[1]:
[https://en.wikipedia.org/wiki/Chris_Lattner](https://en.wikipedia.org/wiki/Chris_Lattner)

------
carlosrg
Until I see really big open-source projects like WebKit or Clang itself
moving to Swift or whatever, anything I read about moving to "better systems
languages" is like reading a letter to Santa Claus. I doubt C++ is going
anywhere, especially when C++ itself is not standing still but keeps evolving
(C++11, 14, 17...) while maintaining backwards compatibility.

~~~
pjmlp
What about early versions of Mac OS being written in Object Pascal, only to be
rewritten in C for the pleasure of the UNIX hordes?

~~~
GFK_of_xmaspast
I wasn't a mac user until later, are you saying that was done for reasons of
ideology?

~~~
pjmlp
Yes. Mac OS was initially written in a mix of Object Pascal and Assembly.

Even Photoshop 1.0 was; check the available source code.

Object Pascal was the inspiration for Turbo Pascal 5.5's OOP extensions.

After a few OS releases, UNIX was already gaining users in the industry, and
pressure from users for C compiler availability grew.

Apple introduced a new SDK with C and C++ support, including a C++ framework.
Afterwards, the new OS APIs were written in C.

Also by this time, Apple did their first attempt at the UNIX market with A/UX.

------
pjmlp
"My hope is that the industry will eventually move to better systems
programming languages, but that will take a very very long time..."

\-- Chris Lattner

Yes, a very long time. Modula-2 was born in 1978, and we can go back even
further, to Algol and Lisp.

------
mcguire
" _In the first example above, it is that 'int' is the default type people
generally reach for, not 'long', and that array indexing is expressed with
integers instead of iterators. This isn’t something that we’re going to 'fix'
in the C language, the C community, or the body of existing C code._"

The majority of that message is pretty well said, but this particular part
leaves me cold. The problem _isn't_ that 'int' is the default type rather than
'long', nor is it that array indexing isn't done with iterators. (Ever merged
two arrays? It's pretty clear using int indexes or pointers, but iterators can
get verbose. C++ does a very good job, though, by making iterators look like
pointers.) The problem is that, in C, the primitive types don't specifically
describe their sizes. If you want a 32-bit variable, you should be able to ask
for an unsigned or signed 32-bit variable. If you want whatever is best on
this machine, you should be able to ask for whatever is word-sized.
Unfortunately, C went with char <= short <= int <= long (, long long, etc.); in
an ideal world, 'int' would be the machine's word size, but when all the
world's a VAX, 'int' means 32 bits.

That is one of the major victories with Rust: most primitive types are sized,
with an additional word-sized type.

~~~
Gibbon1
Then again, with C99 you do have stdint.h, which gives you exact-width types
as well as minimum-width types. And others.
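
A quick sketch of the families in question:

    #include <stdint.h>

    int32_t       a;  /* exactly 32 bits (exact-width) */
    uint64_t      b;  /* exactly 64 bits, unsigned */
    int_least16_t c;  /* at least 16 bits (minimum-width) */
    int_fast32_t  d;  /* at least 32 bits, whatever is fastest on this machine */
    uintptr_t     e;  /* unsigned type wide enough to hold a pointer */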

------
mrpippy
For the for loop example, is there some reason why clang doesn't output a
warning like "Does 'i' really need to be signed? If so, explicitly make it a
'signed int'. Otherwise, change it to be unsigned"?

~~~
porges
The point here is not particularly about signedness, it's that UB allows
better optimizations to be performed.

If overflow is defined to wrap around then it's potentially an infinite loop
(take N == MAXVALUE). With overflow defined as UB you can say the loop
executes exactly N times (because you're not allowed to write code that
overflows).

So UB is both bad and a source of power :)
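
The canonical illustration (a sketch, not the article's exact loop):

    /* With signed `i`, the compiler may assume ++i never overflows, conclude
       the loop runs exactly N+1 times, and unroll or vectorize on that basis.
       If overflow were defined to wrap, N == INT_MAX would make the loop
       infinite, and that assumption would be invalid. */
    void scale(float *a, int N)
    {
        for (int i = 0; i <= N; ++i)
            a[i] *= 2.0f;
    }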

~~~
cowsandmilk
> The point here is not particularly about signedness

But in the case of C, that is what it is about: unsigned integers have defined
overflow behavior, so you only get the UB (and the optimizations) when you use
a signed integer.

~~~
jeffreyrogers
Yes, but more generally UB allows optimizations that wouldn't be allowed
otherwise. The whole reason C has so many undefined behaviors is for the
benefit of compiler writers.

~~~
Joky
I believe that is not the original reason: most UB is about funky HW. For
instance, the non-wrapping signed int can be explained by the fact that C does
not assume you have two's-complement hardware.

~~~
jcranmer
More specifically, some early hardware would trap on signed overflow. A lot of
undefined behavior in C actually comes from "some machine would cause a trap",
and C predates the invention of precise trapping in out-of-order processors.
The possibility of traps is generally the difference between undefined
behavior and unspecified/implementation-defined behavior.

------
nikanj
Once again, [http://research.microsoft.com/en-us/people/mickens/thenightwatch.pdf](http://research.microsoft.com/en-us/people/mickens/thenightwatch.pdf)

------
ryanmarsh
Have we lost sight of the fact that when we talk about a programming language
we're really talking about how to put bits into CPU registers?

~~~
kd0amg
We aren't. Most programming languages do not expose CPU registers to the
programmer (because they are not semantically important), and most programmers
do not think in terms of moving bits in and out of registers (and typically
don't have much to gain from doing so). CPU registers are just an accident of
implementation. Programming as a general activity is not inherently tied to a
register-based CPU. In fact, programmers are often _happy_ to have
compilers/machines remove some put-bits-into-register activities.

------
JustSomeNobody
What would be considered "security critical"? SSH? IPTables? Linux kernel?

~~~
AgentME
Personally, as someone who doesn't want to be hacked, I would think anything
I run that connects to the internet and isn't already very well sandboxed.

