Hacker News new | comments | show | ask | jobs | submit login
Defining the Undefinedness of C (2015) [pdf] (illinois.edu)
124 points by gbrown_ 11 months ago | hide | past | web | favorite | 88 comments

Here's something I'd love to know about undefined behavior in C: is this something specific to C, or is it something that any similar language would have to contend with?

It seems like problems crop up when you combine a fairly low-level language with an emphasis on performance, a specification that explicitly calls out implementation-defined and undefined semantics, and very highly optimizing compilers.

None of the "safer C" languages seems to have gained much traction. Is that because the problem really is intractable, or just because maintaining a decent level of compatibility with C is too hard? The major selling point of a safer C is being able to reuse lots of existing C code, after all.

Note that other languages probably have the same pitfalls, just latent for now -- their specs are silent about undefined operations, or they don't have specs at all, and their compilers don't optimize as aggressively as GCC or Clang.

From what I've seen, many of these behaviors appear because the semantics of the language are too low-level for the things that people are doing with it. It's not that compiler writers are maliciously optimizing things into nonsense, it's that there are things that clearly need to be optimized.

One example that comes to mind is looping over each element in an array. C doesn't actually have this concept in the language. Instead, you use an idiom that's so common that you'll read it as looping over an array:

    for (int i=0; i<n; i++) { do_something_with(arr[i]); }
The compiler can also read this common idiom and find out that it's going to be doing a bunch of sequential memory accesses, which it can optimize. That's great. Except, what if the pointer arr is near the maximum value for its type and it'll overflow when you add i times the size of the array element to it? Then you're doing some very non-sequential accesses.

This doesn't make sense in context: you wouldn't have an array that's split across the end and the beginning of addressable memory. You'd rather ignore that case so you can optimize the really, really common case of iterating over an array.

But a better-designed language can give you a way to just say "for each element in this array".

Rust aims to be as efficient as C, but does not have the goal of letting you directly reuse C code. Its surface of UB is much smaller than C's, and easier to define, because the language is more expressive of what you actually intend to do.

I think it was an unintended consequence of the wording chosen by the ANSI C89 committee. The reality of pre-ANSI C was that if you wrote ‘+’, the expectation was that you'd get the target machine's ‘add’ instruction, no more and no less. It might overflow, it might not; it might crash or hang your machine on overflow or trap values — but that is all your problem, not the language's or compiler's. I worked on a commercial compiler at the time, and at the time, people took ‘undefined behaviour’ as acknowledgement that sometimes the effect of generated code could be surprising to people unfamiliar with the particular system, not as encouragement to screw the programmer.

I've never really understood why this couldn't be slotted in as "implementation defined behavior". `add` has a well defined meaning on every platform just because it differs shouldn't give the compiler license to change the meaning of my program for that platform. It should however platform specific optimizations based on overflow, etc.

Could somebody provide me with an example of truly undefinable behavior? (Perhaps data races?)

> truly undefinable behavior

Memory corruption is pretty undefinable. Imagine writing data through a wild pointer. The program might segfault immediately, or compute a wrong result but not crash, or crash at an arbitrary later point, or delete all your files, or...

It makes no sense to try to enumerate all that might happen, so this cannot really be implementation defined (where the implementation must document what will happen).

Right, but it is not the plus operation that does this. You can reserve undefined behaviour to things like "write to a bad pointer".

Then a particular ABI could say "if your pointer overflows (by a our system specific meaning of overlow) then you have a bad pointer". Merely doing a pointer increment just gets you implementation-defined behaviour. The UB only happens if you plough on.

Such a setup wouldn't do much to save us from pointer bugs. But it would make optimizers more predictable.

When C was first written, it wasn't necessarily possible on some architectures to get traps to trigger at the exact instruction on some hardware. Thus, any potentially-trapping instruction had to be undefined instead of implementation-defined.

That's a good explanation for invalid memory accesses and divisions by zero but I'm not aware of many architectures where addition overflow traps (it can trap on MIPS but there's an other instruction that simply wraps around). That being said I don't have an encyclopedic knowledge of instruction sets.

I could be wrong though, after all compilers back then were a lot less clever than now so maybe they couldn't really make use of this information. I'd be curious to get a first hand testimony as to why certain UB exist in the first place.

As someone that got introduced to C via SmallC and Turbo C 2.0 for MS-DOS, and avid reader of "The C User's Journal", one of the reasons was that ANSI C89 was supposed to standardize existing variants outside UNIX.

No compiler vendor back then wanted to give up on their own specific deviations, given their existing customer base, proprietary OSes and extensions.

So many of those vendor specific features got tucked away into implementation specific and UB buckets.

This is not first hand, but the commonly given reason for undefined signed integer overflow is that it allows for mathematically correct reasoning: x < x + 1 is true for all x. Or rather, the compiler may assume that it is true and optimize accordingly, since the programmer is trusted to avoid the undefined overflowing case.

"Mathematically correct" doesn't seem like the right way to think about it, since the whole problem is that it isn't correct!

Yes, but it wouldn't be correct the other way round either. Wraparound on signed overflow is a well-defined operation on CPUs, but it is nonsensical. In almost no computation will it give a result that is meaningful in the context of what the program is trying to do. If you try to compute the average of two positive numbers as (a + b) / 2 but get a negative result, it would be of no use if this were "well-defined".

So from the point of view of a C compiler that exploits undefined signed overflow, if your code could overflow, it was probably already buggy, and you should have fixed it.

(One problem with this line of reasoning is that it kind of also applies to unsigned numbers, but there C goes the other way.)

The VAX. There's a bit in the condition codes register that indicates if overflow traps or not.

Then there's the matter if ((char )0) causes a trap or not; on systems without a MMU, it probably won't trap [1]. It's differences like overflow and memory references that cause most of the UB you see in C.

[1] On Linux, you can map a page to address 0. But that you can, doesn't mean you should.

Which "add" instruction? Integer (signed or unsigned), floating-point? How big are the operands? What if the operands are of different types? What if the CPU doesn't have an "add" instruction? Or, perhaps more plausibly, doesn't have a divide instruction?

C code defines run-time behavior (or fails to define it in some cases), not CPU instruction sequences. If you want assembly language, you know where to find it.

> Which "add" instruction?

I believe the intention was whichever one matched the types. If you're adding 2 16-bit ints, it would use the 16-bit add instruction, if it existed or a larger one (then, in this case, cast it back down to 16-bits when appropriate). If it's a floating-point type, it would get a floating point add.

[Edit: The compilers usually wouldn't provide a 16-bit int type if they didn't have the instructions to deal with it.]

> What if the operands are of different types?

C89 has well defined promotion rules. You'd convert the operands appropriately, then use the equivalent add.

> What if the CPU doesn't have an "add" instruction?

Then it would probably end up as a function call to a system specific function. Either way, it wouldn't do anything particularly unexpected for the system.

> If you want assembly language, you know where to find it.

C was originally just used as a high-level assembly language. I suspect people found it easier to reason about, even if it basically used the system instruction's behaviour.

Also C was designed for PDP-7/PDP-11 style machines with 8-bit bytes, 16-bit words and the like. It was ported to a variety of other architectures which had to make assumptions and adaptations. These live in UB.

One somewhat unfortunate side effect of the pervasiveness of C and Unix-style machines is that it encouraged CPUs to optimize for C rather than the opposite way around. The use of GPUs for non-graphics programming is a welcome improvement.

Point of order: the PDP-7 had 18-bit words (and no smaller addressable unit).

Thanks for the correction.

This is a really good question. Each language adopts its own philosophical stance on the matter. Java is an excellent example of a language that tries to minimize undefined behavior, as much as C maximizes it. Even in the case of data races, the range of machine behavior is quite constrained (it can't break type safety, for example). This approach has some cost in performance, but Java is still performant. We don't need to worry much about programs breaking as Java compilers optimize more aggressively, although in earlier days there were sloppy programs that depended on the specific behavior of the specific JVM they were written on (see Cliff Click's writings for more on this).

Probably the majority of languages are philosophically the same as Go; in single-threaded operation there's not much undefined behavior, but once you go multithreaded and have a data race, all bets are off. The "memory model" document is not very precise, but mostly states that if you avoid data races, things will be ok.

All languages with FFI to C, of course, inherit C's approach to undefined behavior through that mechanism.

> This is a really good question. Each language adopts its own philosophical stance on the matter. Java is an excellent example of a language that tries to minimize undefined behavior, as much as C maximizes it. Even in the case of data races, the range of machine behavior is quite constrained (it can't break type safety, for example). This approach has some cost in performance, but Java is still performant. We don't need to worry much about programs breaking as Java compilers optimize more aggressively, although in earlier days there were sloppy programs that depended on the specific behavior of the specific JVM they were written on (see Cliff Click's writings for more on this).

The Java memory model is usually cited as a success, but it's success is far more qualified than it looks at first glance. The original memory model was horribly broken and no compiler actually implemented it. Java 5 introduced the modern memory model (in particular, the notion of data-race-free being the underlying basis for the memory model in the specification), but its attempts to pin down what could happen in the face of data-races is still inaccurate in terms of what compilers do and insufficient for what people would like them to be able to do. The C/C++ model of "controlled" data races in the forms of relaxed atomics (and, to a lesser degree, release/acquire and release/consume) are still considered incorrect and in active development.

I think Java is a terrific role model. Defining the multithreaded memory model was a long and painful process, but the end result is excellent.

It just doesn't quite fully fill the role of C for systems programming, as it uses GC rather than manual memory management.

Any language which allows unprotected memory access will potentially have "undefined behaviour" if you start plowing through random addresses. Things like divisions by 0 are also commonly UB.

Some CPU ABI even have instructions which, when used with certain operands, are undefined. For instance the ARM Thumb "BL" instruction is encoded as two successive "pseudo-instructions", IIRC if you break the pair the behaviour is undefined. There are other instructions which are either undefined of implementation-defined, in particular instructions which attempt to load or store the PC register.

C takes it a bit over the top, for instance overflow doesn't have to be undefined behaviour (it could be left as implementation-defined for instance, which is a lot less nasty).

Yeah, it's not at all clear to me why so many things need to be undefined rather than implementation-defined.

It could let the compiler get rid of entire branches if it can statically assert than an overflow or other UB is guaranteed (assuming that it's not a coding error but that the check is done somewhere upstream and the branch would be unreachable).

That might seem a bit aggressive and risky but it's not rare to write funky macro code where you hope the compiler will be clever enough to get rid of the cruft.

One case I can come up with is if you have some code that does, say:

    int foo(int a) {
        int next = a + 1;

        return increment(a);

    int increment(int a) {
        if (a == INT_MAX) {
            return 0;

        return ++a;
In this situation if the compiler inlines "increment" it can also get rid of the "if" because it can assume that it'll always be false. If it wasn't false then "int next = a + 1" triggers UB so the compiler can do whatever it wants.

Now, if this is an actual coding error, that's nasty. If however I wrote foo that way because I'm sure that "a" is a small integer with no risk of overflow then it's a fine optimization.

If overflow is implementation defined then the optimization is illegal, in case of overflow "next" is implementation defined but the test should run.

Note that here I put the "next = a + 1" before the call to increment, but it would still be a valid optimization if the call was done after the call to increment (with the original value of a, that is):

    int foo(int a) {
        int next, ret;

        ret = increment(a);

        next = a + 1;

        return ret;
UB "travels backwards in time", so to speak. If the compiler knows for a fact that "a == INT_MAX" could cause an UB anywhere in the code, it has the right to assume that it can't happen if it allows him to optimize some more.

So that "next = a + 1" is essentially a note to the compiler, promising that "++a" won't overflow?

To me that seems really reckless! If you need that note, make it a compiler pragma or something, rather than trying to sneak it in by repurposing some existing code.

Now if overflow were implementation-defined, the compiler would have to assume that INT_MAX + 1 wraps to INT_MIN (say). So as written, the "if (a == INT_MAX)" line would stay. That's what I would expect to happen on reading the code, and I'm confident that that's what most reasonable people would expect too.

If you want to skip the "if (a == INT_MAX)" check, just take it out.

I'm sure optimizations like this are important for squeezing a few extra percent out of the standard benchmarks. But for most ordinary programmers they're just a minefield you have to tiptoe carefully around.

I wouldn't write code like this on purpose, rather I'd genuinely need "next" for some other computation and the compiler could then use that to infer that I know "a" not to overflow. If I wanted to make it an "annotation" then I'd use "assert(a < INT_MAX);" which would serve the same function to the compiler and make sense to the next coder reading it (or me in two months). Actually "assert" is probably not the right one since it could get removed with NDEBUG, but something like that.

This is a very simple example of course, but maybe this "increment" code is actually used in many other parts of the codebase where the test actually makes sense.

It's not rare for instance to start a C function with "if (param == NULL) return;", just in case. If the compiler inlines the function and sees that the caller dereferences "param" it can assume that it's not null and remove the test.

But you're right that it makes it easy to shoot yourself in the foot. I remember a strange bug in the linux kernel source that was "hidden" by such an optimization, the code looked something like that:

    int bar(somestruct *ptr) {
        int somevalue = ptr->val;

        if (ptr == NULL) {
            return -EINVAL;

        // Rest of the function here
The real code was a little more complicated than that obviously, but at a glance it might look that the function won't run if "ptr" is NULL. GCC however noticed the dereference before the test and happily decided that "of course this coder knows what they're doing, ptr can't be NULL!" and removed it.

It can take a while to debug code like that.

That code ought to be a compile error. "On this code path, your code assumes ptr is null and not-null at the same time!"

Not all UB can be detected at compile time, but when it can be detected, it should be flagged to the user.

But what if it's the result of some inlining or simply macro expansion? It's a tough call to make.

If it's from inlining, then the compiler has a bug. If it's from a macro, then stop using macros in a broken fashion. Let it be an error.

How would that be a compiler bug? Assuming g() is never called with a NULL pointer, this code should not produce any compile-time errors:

    int f(int *x)
      if (x == NULL) return 0;
      return *x;
    int g(int *p)
      return (*p) + f(p);

This is essentially what caused the Cap'n'Proto remote vulnerability[1] 6 months back. The compiler optimized away an overflow check that directed control to an error handling path, eliding the code that checked for the exact error. Fun.

1: https://news.ycombinator.com/item?id=14163111

But at the time C89 invented 'undefined behaviour', C compilers didn't (couldn't) do that. This modern interpretation simply didn't occur to anyone at the time. If it had, people would have screamed bloody murder, and the language simply would not have passed. Compare dmr's criticism of the 'noalias' proposal: “a license for the compiler to undertake aggressive optimizations that are completely legal by the committee's rules, but make hash of apparently safe programs”.

In the accompanying Rationale, the committee explicitly calls out places where the standard intentionally diverged from existing practice; they did not do so for UB. They thought “the most serious semantic change” was mandating value-preserving rather than unsigned-preserving integer promotion.

Undefined means : the compiler can omit this case and assume it never happens. Take integer overflow for instance: an implementation-defined behavior means in case of overflow the compiler can either wrap, crash, or have a saturation at max value. Undefined behavior means the compiler can assume it doesn't happen and optimize with this information in mind.

For instance, the following code :

    for(i = 0; i<max; i++){
        for(j = i; j<max; j=j*2){
            //do something
With the undefined behavior, the compiler assume that j will never overflow, which means it can re-order the loops if he wants, or unfold it, or any other optimization if he desires so. But if it takes the possible overflow into account, it cannot optimize anything since the inner loop could be an infinite loop if `max` is big enough.

> Undefined means : the compiler can omit this case and assume it never happens.

That's not the whole picture though. The standard actually says that anything can happen in undefined cases.

> However, if any such execution contains an undefined operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).

The standard doesn't specify what the resulting program should do at all. It's equally valid to e.g. emit code that fails gracefully instead totally ignoring the undefined branch.

Anything can happen because the compiler will assume it doesn't happen. "Undefined behavior" essentially means, "If you try to do this, you have written a program that is not C, and you should not be handing it to a C compiler expecting it to do anything correctly."

If you were to try to write a program that accurately detected whether the given input is a meaningful C program, you'd have to solve the Halting Problem (and more, once you take IO into account.) Thus undefined behavior must exist in the places where your program is syntactically valid but has no definite semantics. (And you can't enumerate all of these places in a Turing complete language.)

Not even a C compiler will be able to tell you if it actually has been given C code; all it can say is "I parsed it, and here're some semantics for it."

No, undefined behavior is still valid C (if it wasn't, it would be an invalid construct, not a valid conatruct with UB.)

It's just C whose actual behavior is unspecified (and, because the standard is outright perverse, makes the entire program's behavior undefined) by the standard and, if you are especially lucky, explicitly defined by the particular C implementation. But maybe not even then.

> > Undefined means : the compiler can omit this case and assume it never happens.

> That's not the whole picture though. The standard actually says that anything can happen in undefined cases.

Well, that's right the standard says the compiler can do anything they want, including erasing all data from your hard drive or summoning a demon in your nose [1], but in practice I've never seen any compiler attempt such things.

What compilers usually do with UB is assuming it will never happen and use this invariant as an optimization hint.

[1] : http://catb.org/jargon/html/N/nasal-demons.html

That's a good example, thanks!

Reminds me of the Java binary search bug: https://research.googleblog.com/2006/06/extra-extra-read-all...

I think the correct code here would be something like:

    for (j = i; j <= (max+1)/2; j = j*2)
Or something like that. Obviously that's rather awkward, with the risk of a sneaky off-by-one bug. (Edit to add: heh, just noticed I'm assuming max+1 won't overflow!) It would be great to have a C-like language that encourages safer code like that.

The silver lining in that Java bug is that it is guaranteed to compute the same wrong negative indexes every time on every platform, and it is guaranteed to throw ArrayIndexOutOfBoundsException when the array is attempted to be indexed. Though the code is wrong, it is at least consistently wrong and loudly wrong. The same cannot be said for a similar bug implemented in C.

Well nobody prevents you from writing a compiler that will have this exact same behavior for the C language. In practice it doesn't happen because this «good citizen» compiler will produce code being slower than GCC or Clang on all benchmark, and since nowadays people mostly use C when performance is critical, this compiler will not be used by anyone.

Why even bother writing C if you all you achieve is Java, PyPy or JavaScript level in term of performance ?

I forget where I heard this formulation, probably in a paper about C semantics:

A useful way to think of C is as TWO languages. One language is the language of chars, ints, structs, arrays, functions, loops, if, etc. That language is fairly self-contained and could be implemented in a variety of ways.

The other is the language of memory. Pointers, non-checked array indexing, unsafe unions, guaranteed data layout, memory access alignment, arrays decaying to pointers, zero-width arrays, casting (especially the equivalent of reinterpret_cast<> in C++).

You could implement just the first language, without the second. You would have a lot of latitude in doing that. You could add all sorts of safety checks. So technically this kind of thing isn't fundamental to "C".

But it would be harder to port this language to new processors efficiently, and programmers wouldn't be able to give you hints about how to make things fast. Guaranteed data layout breaks the abstraction of the language -- but that's a good thing, for performance, and talking to hardware.

If you overconstrain the language, then some platforms would have to add a lot of checks that are not necessary on other platforms.

So if you mean "similar" by "a portable language that is efficient", then to some degree yes, because the problem of abstracting hardware is ill-defined. But if you just want to implement a safe C-like language, that's certainly possible. You could pretty much express everything you want to. You could rewrite the Linux kernel in that language probably (except maybe for interacting with mmapped hardware). But it would be slower.

I guess there is argument to how much slower it would be. It probably wouldn't matter for 90% of the code, but it would matter a lot for the hot code.

Summary: C works at two different abstraction levels that can be thought of as two languages. Hardware is a leaky abstraction.

> is it something that any similar language would have to contend with?

it's something any language could _potentially_ have to deal with. C's penchant for undefined behavior comes from two factors, i think:

1. it's ancient. we've learned a lot of lessons since its inception about how to design a language. it also comes from a time when CPU architectures varied wildly -- even within a given manufacturer's product line.

2. the developer culture during the time when C was being designed was of the "i don't want a language to tell me how to write code. if it's possible in assembly, i want to be able to do it in C." mentality.

> is that because the problem really is intractable?

by "safer C", i assume you're talking about the likes of rust et al.

C is used largely in the embedded space, and the embedded space, i think, is notoriously stuck in their ways. most embedded projects accrue a lot of third-party code (the kernel being a big one) and that code is written in C. so you already need C developers. why wander into unexplored territory, especially when it's at the cost of an additional necessary skill from your developers?

by "safer C", i assume you're talking about the likes of rust et al.

I'm actually thinking of variants of C that allow existing code to be used with no or minimal modifications. The "Friendly C" proposal that another commenter linked to is a good example: https://blog.regehr.org/archives/1180

It seems like Rust is getting some traction, and that's great, but it's certainly not a drop-in replacement for C.

I can't say for certain why none of the "safer C"s have caught on, but I can say that they all have significant compromises and/or impracticalities. Often, performance cost is one of them. Today though, you can use the safe numerics[1] library along with SaferCPlusPlus[2] (shameless plug) as a high performance solution to address undefined behavior in your C code. This solution preserves most of your C code intact, and there is an auto-translation assistance tool[3] in development to further reduce the effort required. (SaferCPlusPlus currently has a dependency on the standard library, so may not be usable on the most restricted embedded platforms[4].)

If performance is not a concern, just compiling with the appropriate sanitizers[5] is an easy option (though not a complete solution).

But note that all of these solutions take the position that not only is the undefined behavior itself a problem, but also the operation which instigated the undefined behavior. And there is good reason for that. If you just want the undefined behavior to be defined, then the question is, what exactly are you trying to achieve by doing so? Are you hoping to guarantee deterministic behavior while maintaining portability?

[1] https://github.com/robertramey/safe_numerics

[2] https://github.com/duneroadrunner/SaferCPlusPlus#safercplusp...

[3] https://github.com/duneroadrunner/SaferCPlusPlus-AutoTransla...

[4] https://github.com/duneroadrunner/SaferCPlusPlus/issues/3

[5] http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

There have been lots of solutions like "SaferCPlusPlus" over the years. In my view, the real reasons why these have not been adopted widely are (1) aversion to any sort of runtime overhead greater than zero; (2) cognitive overhead of introducing new language machinery, type-level or otherwise; (3) interoperability with existing code; (4) few people want to be the first to adopt these kinds of tools in production due to the perceived risk; (5) denial that memory safety problems are indeed problems.

If you're going for wide usage, I think it's actually easier to start over with a new language rather than to try to graft safety features onto C++. A lot of low-level folks don't realize it, but C and C++ are no longer the behemoths they used to be. If you look at the languages people are switching to rather than just absolute usage share, the usage chart [1] paints a very different picture.

[1]: https://blog.sourced.tech/post/language_migrations/eigenvect...

Interesting chart, in spite of what I answered to you in another thread, I have noticed something along those lines regarding C++'s usage on UWP.

From the prominent place it used to have at Microsoft presentations, back in the Windows 8 days, it seems to be relegated to high performance UWP componentes and all presentations, including HoloLens ones, are done in C#.

Also the pivoting from Google regarding Brillo, by dropping the planned C++ Framework and instead adopting the Java one from Android.

I think all your points are probably right (and insightful). But I think in this case the (immediate) goal isn't really popular adoption of this solution by existing C (or maybe even C++) programmers.

In part, it is simply a practical, relatively low cost solution to address the problem of invalid memory access in existing C/C++ code (and thus, for example, remote code execution vulnerabilities in internet facing code). Clearly we've still not reached the point where critical internet infrastructure vulnerabilities are perceived as costly enough to justify actually solving the problem (as you say in point 5). And collectively, we've not yet concluded that cheaper mitigation techniques aren't a "good enough" solution. At least for now.

But the costs (of critical vulnerabilities) are still going up, and there may come a time when some conclude that they have no choice but to properly address the issue. And some may opt for the least expensive option (i.e. something like SaferCPlusPlus rather than a full rewrite in another memory-safe language).

Of course, more appealing solutions might emerge before that time comes.

But there is also a longer term goal. I see a safe dialect of C++ (like SaferCPlusPlus) perhaps relevant in a future where code is liberated from the "language silo" it's trapped in, via auto-translators. For example, I don't expect C programmers to switch to SaferCPlusPlus en masse (although I'll probably continue to encourage it :). The idea is that legacy C code would be auto-translated to SaferCPlusPlus (with or without the support/consent of the original author). Kind of like how the tor guys used to build their ("safer") version of firefox with the sanitizers enabled[1], and didn't need the permission of the firefox developers to do it.

But it doesn't have to be restricted to just legacy C/C++ code. Modern C++ is so powerful that lots of languages could be translated to (a safe dialect of) it. I think. Presumably C++ would be a more appealing language if it had easy access to, say, all the existing Java code via auto-translation. This of course would apply to any language, but I think its power and performance makes C++ a good target for auto-translation. I'm sure other languages would make good targets too.

Basically, in the future I expect auto-translation quality to be a bigger factor wrt language relevance than marketing or mind-share.

[1] https://blog.torproject.org/blog/tor-browser-55a4-hardened-r...

C will always be around, unless we get rid of UNIX and POSIX compatibility layers.

Which given Microsoft's sudden love to stay relevant, means that we are at the edge of an UNIX monoculture.

So any improvement to make C safer is more than welcome.

My preference would be to have something like Frama-C be part of ANSI C.

Virtually nobody who is not already using Frama-C would start using it as a result of it being added to ANSI C.

Ask industry C programmers "why aren't you using theorem provers to prove the correctness of your C code?" The number of them who will give you "because ANSI didn't standardize them" as an answer is zero.

I know, but since we are not going to get any UNIX/POSIX variant out there replaced with C++, let alone Rust, we better get ways of improving its safety. Even it means just pilling up safety band-aids.

Even Microsoft caved in regarding not supporting C newer than C89, and now have on their Visual Studio roadmap to improve the C support beyond what ANSI C++ requires.


They also rewrote their User-Mode Driver Framework (UMDF) from COM/C++ back into C, https://docs.microsoft.com/en-us/windows-hardware/drivers/wd...

I am really looking forward to Joe Duffy's keynote at Rustcon, regarding systems programming improvements.

This is also what helped C win mindshare in the early days of UNIX's adoption across companies.

After UNIX vendors started selling their SDKs, the C SDK was a must have anyway, given the relationship of UNIX and C.

However, getting the Lisp, Fortran, Pascal, Modula-2, Ada compilers, meant buying them as additional modules, each worth several thousands of whatever currency.

The majority of them sold by third parties.

> 1. it's ancient. we've learned a lot of lessons since its inception about how to design a language.

True, but even so, one of the most damning things about C is that it wasn't even "state of the art" back when it was designed. See https://pastebin.com/UAQaWuWG

Very interesting, thanks for the link!

The problem with C's undefined behavior is not the undefined behavior itself, but that todays optimizing compilers use UB as a hint to detect "impossible" code paths which then causes the behavior to really become undefined and unpredictable, also many notionally UB constructions are used in lot of existing code, because before such aggressively optimizing compilers the results were somewhat predictable.

Many other languages have concept of UB, sometimes hidden by different wording, even languages that usually target virtual machines (ANSI CL uses the phrase "is an error" for essentially same concept). On the other hand for various VMs the undefined behavior is usually understood as "anything that does not compromise the VM integrity" and similar thing should be true for UB on CPUs (although you should read Intel's errata that contain "causes undefined behavior" as "allows privilege escalation")

Yes, there are a number of optimizations that require UB, or something like it. For example, say you have this valid C or Rust code:

    if (my_signed_int + 1 > my_signed_int) {
Since signed integer overflow is UB in C, C compilers can optimize out the conditional, but since Rust specifies that overflow may wrap, the compiler can’t assume the condition is true.

To be clear here, overflow is a "program error" in Rust; it's specified to panic in debug builds, and any build where it doesn't panic is defined as wrapping two's compliment.

I tried to toss this into godbolt, but the optimizer is too good of course, it will optimize away the entire if if my_signed_int is a literal. Here's one without optimizations: https://godbolt.org/g/tBZcTA

It seems to compare 1 to 1, and then branch. My guess, though this isn't my area of specialty, is that this is because it is also assuming that overflow can't happen, making this always true.

Clang, interestingly enough, does the load, add, and compare https://godbolt.org/g/gPjgVc

Rust does do the loop condition (well, comparing `my_signed_int` to INT_MAX) even with optimizations: https://godbolt.org/g/dnBEJi

(You need to put a side effect such as I/O in the conditional branch for it not be optimized away, "pub" before the function declaration for it not to be eliminated as dead code, and pass `my_signed_int` as an argument so the condition isn't constant-folded).

Clang OTOH optimizes away the condition with even the lowest optimization level (-Og) turned on, as you'd expect: https://godbolt.org/g/8ftkED

Ah, nicely done.

Division by zero is UB. This lets the compiler assume it does not happen and thus it won't have to emit additional instructions to check for it.

To be eliminate the check at compile to it would have to be able to infer the range of the values of the divisor.

> and their compilers don't optimize as aggressively as GCC or Clang.

Some languages use GCC or LLVM as optimizing backends. So they can suffer from just the same UB if the language itself does not prevent it.


> Division by zero is UB. This lets the compiler assume it does not happen and thus it won't have to emit additional instructions to check for it.

Right, but it doesn't then follow that it can then "optimize" out the rest of your program. It could allow it to compile to what's expected and rely on the platform to handle corrupt program state. Which is what happens if you trigger run time UB anyway. It can also be implementation defined, because virtually all modern platforms trap on division by zero which would lead to more predictable optimizer behavior.

>Here's something I'd love to know about undefined behavior in C: is this something specific to C, or is it something that any similar language would have to contend with?

How similar is similar? Pascal? Ada? Spark Ada? Rust? Friendly C (https://blog.regehr.org/archives/1180)?

Yes, I'd say those are similar. Systems languages, no mandatory GC.

That Friendly C proposal is terrific, I'd dearly love to use that. Basically, use sane and conservative optimizations for most code, with the option of using aggressive optimizations for hotspots. Kind of similar to Rust -- mostly safe by default, more dangerous stuff available when you need it.

The Friendly C idea seems to be basically dead. For an update of what Regehr is pursuing now, see https://blog.regehr.org/archives/1520 (Undefined Behavior in 2017).

IIRC Safe Rust has no UB, `unsafe` does, according to the Nomicon:

* Dereferencing null or dangling pointers

* Reading uninitialized memory

* Breaking the pointer aliasing rules

* Producing invalid primitive values:

- dangling/null references

- a bool that isn't 0 or 1

- an undefined enum discriminant

- a char outside the ranges [0x0, 0xD7FF] and [0xE000, 0x10FFFF]

- A non-utf8 str

* Unwinding into another language

* Causing a data race

These are all guarantees unsafe code must uphold or there are no guarantees anymore.

IIRC LLVM IR also has UBs.

I wonder if it's possible to sneak UB into normal "safe" Rust code, by leveraging LLVM optimizations?

Yes, though those are considered bugs in the Rust compiler. As the goal of Rust is to forbid memory unsafety in safe code, the Rust developers accept the burden of working around LLVM-related UB (which sometimes is quite difficult, see e.g. this longstanding UB bug related to how LLVM translates certain numeric casts: https://github.com/rust-lang/rust/issues/10184 ).

Assuming no compiler bugs, no.

UB isn't something "caused" by optimizations, it's something that exists in the code before optimizations, optimizations can just trigger nasal demons. So you shouldn't be able to write UB in safe Rust assuming no compiler bugs.

(And assuming that any unsafe libraries being leveraged are bug free)

Its totally possible, but its considered a compiler bug if it does happen. In the same way that its possible to segfault java if there's a bug in the jit. There's an on going project to verify the semantics of the Rust language and its safe abstractions[0] which should make it easier to choose which optimizations are legal.

0: http://plv.mpi-sws.org/rustbelt/

One word, portability. Since C is relatively low level, allows direct access to mem, primitive types correspond to actual machine types, etc. It can be crippling for a low level language that targets many many architectures to define undefined behavior. For example, it is undefined behavior to overflow a signed integer. Now, I can't think of a modern processor that doesn't use two's complement so this behavior is relatively well defined, but theoretically if there was a processor that doesn't use two's complement and signed integer overflow was well defined in C, then the compiler would have to insert a workaround which could hinder performance, have unintended side effects, or just be impossible to workaround. One of the best features of C is how portable it is, it is the defacto standard for any risc architecture, be it a tiny microcontroller or a beefy modern processor, a C compiler targets it. The cost of this portability is undefined behavior.

C occupies a niche, and has the advantage of inertia, lots of people working to improve compilers, and more programmers familiar with it.

> something I'd love to know about undefined behavior in C: is this something specific to C

Other languages with undefined behavior: Algol 68, Fortran and Lisp.


I'm young and don't know much about Fortran. Care to elaborate?

That team’s awesome. One of few groups in formal methods using rewriting logic (Maude) instead of things like Coq or Isabelle/HOL. They seem to move faster on semantics as a result. They also build their own modified logic called matching logic on top that they claim is better than separation logic.


More interesting, their use of these tools allowed them to make their C semantics executable in a style similar to GCC:


I want this group to do one for Rust and SPARK as a reference spec for certifying compilers for those. Also useful for the diverse compilation concept to counter Karger's compiler/compiler subversion. On a side note, they also have a company that they use to fund and apply their work:


One of the courses I took involved working with K, and I certainly left the course thinking "I would love to use this tool in the future if I had a project to do with it." It is actually a fairly amazing tool, even if the instructions for using it involve "start by downloading it again, because we keep fixing bugs".

One problem with building semantics for Rust is that the language doesn't have a sufficiently formal specification yet. LLVM IR also has a similar problem in that the community has generally somewhat resisted pinning down precise semantics of undefined behavior (this is changing, though, in large part because the imprecision is causing problems).

"It is actually a fairly amazing tool"

Since I watch formal methods from a distance, I can't really tell how good their approach is versus Microsoft's VCC w/ separation logic, Coq specifications, and so on. I just note they seem to have it easier than most like Coq with their method more flexible than separation logic. How hard do you think it would be for a non-formalist programmer to learn K and start modeling interpreters or compilers in it? Anything you'd recommend to a person trying to learn it in terms of background material?

"One problem with building semantics for Rust is that the language doesn't have a sufficiently formal specification yet."

My solution there was to make the semantics represent the current state of Rust and/or LLVM at any moment. They got the advantage of being the standard implementation. So, the spec needs to do what they do so users don't get inconsistent behavior. Then, the spec work might find errors or validate proposals that can feed back into reference implementations. Finally, as with KCC, the implementation might be useful for looking into compiler errors or subversions.

> How hard do you think it would be for a non-formalist programmer to learn K and start modeling interpreters or compilers in it? Anything you'd recommend to a person trying to learn it in terms of background material?

I am not a formal methods person, and I have very little grounding in it. I still felt quite happy using it and came out of the course thinking "gee, I think I might actually like to use this if I ever had a need," which is the opposite anecdote I hear from people using a lot of formal methods tooling. You probably do need a basic grounding in semantics, though, things like big-step versus small-step operational semantics. Beyond that, the K framework itself has a tutorial on actually building the semantics in K. (See http://www.kframework.org/index.php/K_Tutorial and the corresponding directories within its tooling).

Maybe making Frama-C part of the C standard would also be an improvement.

> using rewriting logic (Maude) instead of things like Coq or Isabelle/HOL. They seem to move faster on semantics as a result.

This statement is close to nonsensical. What are you comparing it to? Where is a comparable executable semantics project in Coq or Isabelle?

This is great work, and they are definitely using the right tool for the job. But the field of "formal methods" is enormous, and you seem to be saying that apples are better than oranges.

Here is a Coq formalization of C11: http://robbertkrebbers.nl/research/ch2o/ It only covers a large fragment of C11, but they formalized the operational, axiomatic, and executable semantics and proved that these correspond to each other.

Yes, and there are other formalizations as well. But none of them (as far as I am aware) formalize the undefined parts in the way the featured article does. I think comparing projects with different goals and then saying "the differences must be due to the tools used" isn't solid reasoning.

"What are you comparing it to? Where is a comparable executable semantics project in Coq or Isabelle?"

In the others, they tend to do formal semantics of interpreters or compilers then extract them to ML or something. This group uses an executable, rewriting engine to do an executable semantics of a lot of languages which seems to require less people. Perhaps they're doing less on the verification side but what I've seen make me think more should try such methods.

Most work I see posted in formal methods isn't done with rewriting tools like Maude. The ones I see use it get some interesting results for effort put in such as the SCOOP verification. So, I took the time to highlight that they use a different approach with quite a bit of tooling to use that people might want to look into trying. My 2nd reply, jcranmer, had exactly the kind of positive experience I was thinking readers might get out of a follow-up.

"But the field of "formal methods" is enormous, and you seem to be saying that apples are better than oranges."

You wouldn't notice it's enormous with the number of people that use a relatively-small set of tools. Convergence is usually good since ecosystems get built that can make new work easier. Coq and Isabelle/HOL are good examples where people keep plugging in prior work to new work. There's also quite a few more that don't get as much attention. Hence, me highlighting two (Maude and K) that had a series of practical results out of research using it.

My good god. Do you actually believe that this scenario reflects a weakness in C or a C compiler?

We detached this subthread from https://news.ycombinator.com/item?id=14857128 and marked it off-topic.


In the grandparent's solution, I should be able to e.g. collection.map, the type system and/or hints from myself should inform the compiler whether or not there are side-effects, if runtime bounds checks are needed, if this is something that can and should be turned into SIMD or concurrent threads. OR I should be able to write some mildly-portable assembly-type algo using intrinsic functions that represent precisely what I want the machine to do, and expect that is precisely what will be done (and deal with porting this to whatever platforms I support). In neither case is there a concept of undefined behavior.

Modern C/C++ is at a level of abstraction where I'm writing explicit instructions for a hypothetical machine, the compiler will consider explicit details in my implementation as implicit suggestions in order to map what I've asked it to do to what the hardware is capable of, and it's up to me to have to be aware of the consequences.

It's not the worst problem of course, but it's not a non-problem.

Your approach is emblematic of programming as a culture of production versus a culture of understanding. I don't agree with any approach which characterizes a language as good or bad based on it's ability to read minds and make decisions based on the unknown.

When you mandate safe behavior based on any expert rules-based or heuristic system you have to leave the unsafe option open which is _just_ as bad.

Integral overflow and volatile memory are the 2 most problematic kinds of UB in C, because they can appear basically everywhere. Rust has features to deal with them.

Who cares about what Rust does to be perfectly honest?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact