A Guide to Undefined Behavior in C and C++ (2010)

banachtarski · on Nov 4, 2018

For those viewing this thread, one year after this article was written, this was standardized:

https://en.cppreference.com/w/cpp/numeric/fenv

kstenerud · on Nov 4, 2018

And therein lies a major problem with c and c++ compilers:

It's effectively impossible to write bug free code. Bugs in c and c++ usually trigger undefined behavior. It is therefore impossible to write a conforming program, which makes any guarantees in the spec meaningless.

I've hit heisenbugs like these that only trigger when optimized, and resist write(), out(), fflush(), etc and it's infuriating.

Or even worse: programs that no longer work when compiled on a newer compiler. With other languages, you are at least spared from this kind of code decay.

But everyone's writing compilers to the spec so tough :/

banachtarski · on Nov 4, 2018

This is a myopic viewpoint. Undefined behavior is critical for code that needs to be fast. The premise of languages like C and C++ is that the airgap between the language's abstract execution and memory model and the hardware's is thin to non-existent.

*(edit accidentally submitted early)

In this case, the UB is due to the compiler's ability to reorder statements. This is such a fundamental optimization that I can't imagine you're really suggesting that a language without this optimization capability is a "problem." Rearranging instructions is critical for pretty much any superscalar processor (which all the major ones are), and I hate to imagine the hell I'd be in if I had to figure out the optimal load/store ordering myself.

saagarjha · on Nov 4, 2018

> In this case, the UB is due to the compiler's ability to reorder statements.

No, the undefined behavior arises because dividing by zero is undefined. Languages wishing to avoid this particular bug can define division by zero to trap or to have some other guaranteed behavior, after which the compiler is required not to reorder those statements. In this case reordering the instructions is legal because having undefined behavior in your program makes literally anything the compiler does legal.

banachtarski · on Nov 4, 2018

My point still stands in that I don't want the compiler to check for division by zero if I don't ask it to.

saagarjha · on Nov 4, 2018

Sure, then use C or C++ and check for it yourself. But if you mess up, your program is invalid, so there's that. If you don't like that, write your own assembly by hand to convert this to implementation-defined behavior instead of undefined behavior.
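A minimal sketch of what "check for it yourself" can look like in C. The helper name checked_div is hypothetical, not something from the article or the thread; it refuses the two int divisions that the language leaves undefined instead of performing them:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: divides a by b only when the result is defined.
 * Returns false and leaves *out untouched for b == 0 and for
 * INT_MIN / -1 (the quotient does not fit in an int), both of which
 * are undefined behavior for plain int division in C. */
static bool checked_div(int a, int b, int *out)
{
    if (b == 0)
        return false;
    if (a == INT_MIN && b == -1)
        return false;
    *out = a / b;
    return true;
}
```

The cost is two comparisons per division, paid only where the caller opts in, which is the trade-off being argued about above.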

lmm · on Nov 4, 2018

> The premise of languages like C and C++ is that the airgap between the language's abstract execution and memory model and the hardware's is thin to non-existent.

Which is not really the case for processors from the last 20+ years.

> In this case, the UB is due to the compiler's ability to reorder statements. This is such a fundamental optimization that I can't imagine you're really suggesting that a language without this optimization capability is a "problem." Rearranging instructions is critical for pretty much any superscalar processor (which all the major ones are), and I hate to imagine the hell I'd be in if I had to figure out the optimal load/store ordering myself.

UB is not necessary for that though. E.g. define that integer division by zero leads to the process receiving a terminal signal. That could be implemented just as efficiently (trapping on division by zero is free at the hardware level in modern processors, and the C standard gives broad freedom for signals to be received asynchronously, so instructions could still be reordered), but would close the door to silent memory corruption and arbitrary code execution: unless the programmer has explicitly defined how the signal is handled, their program is noisily terminated.

kstenerud · on Nov 4, 2018

My point is that it is humanly impossible to write a bug-free program. In C and C++, bugs usually manifest themselves in UB.

To make matters worse, compilers, ever searching for diminishing returns on performance improvements, have been steadily making the CONSEQUENCES of UB worse, to the point that even debugging is getting harder and harder. These languages are unique in their growing user hostility.

llukas · on Nov 4, 2018

> growing user hostility

You have excellent tooling (*sanitizer, static analysis, valgrind, ${WHATEVER}) and abstractions provided to do handholding for you (ie. unique_ptr).

Most of that was created in last few years.

pjmlp · on Nov 4, 2018

Completely wrong.

Lint was created in 1979, and during the mid-90s we already had Insure++, Purify and others, ages before the free-beer alternatives.

Yet being free still doesn't get the large majority of C and C++ developers to actually use them, as shown at one of the CppCon talks, where a tiny 1% of the attendees confirmed using any kind of analysers.

gpderetta · on Nov 4, 2018

Lint is not a whole program static analyzer and being free is a big deal.

Your -second- third paragraph is, unfortunately, still correct though.

pjmlp · on Nov 4, 2018

Somehow other professionals manage to pay for what goes into their toolboxes, music case, kitchen knives, ....

minipci1321 · on Nov 4, 2018

How well do the attendees of CppCon represent the developer base of C++ projects which are in dire need of these tools?

Maybe they are so skilled the need does not arise?

jcranmer · on Nov 4, 2018

One of the most eye-opening papers in this regard was the integer overflow checking paper (which Regehr was coauthor on), which found that every library tested invoked undefined signed integer overflow. This includes SQLite, infamous for its comprehensive testing, and various libraries whose sole purpose was to check if an operation would overflow without invoking undefined behavior.

Belief that you are skilled enough to write C/C++ code that doesn't exercise undefined behavior either shows that you don't know what is undefined behavior or that you believe you are the best programmer to have ever lived.
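For illustration, a sketch of an overflow pre-check that stays inside defined behavior. This is not code from the paper or from any of the libraries it tested, and the function name is hypothetical. The point is that the tempting test `a + b < a` performs the overflowing addition first, so the check itself is undefined; comparing against the limits before adding avoids that:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: reports whether a + b would overflow an int,
 * without ever computing a + b. Every subtraction below stays in range:
 * INT_MAX - b is only evaluated for b > 0, and INT_MIN - b only for
 * b <= 0. */
static bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;
    return a < INT_MIN - b;
}
```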

esrauch · on Nov 4, 2018

Even people who work on the language spec can't really avoid writing code that hits UB without the help of sanitizers.

pjmlp · on Nov 4, 2018

The number of CVEs found per month in highly skilled projects with deep review processes, like the Linux kernel, proves otherwise.

As for how well, usually conferences like CppCon have the top of the tops.

nickpsecurity · on Nov 4, 2018

On Linux side, these slides I got from Alex Gaynor illustrate your point really well:

https://events.linuxfoundation.org/wp-content/uploads/2017/1...

There hasn't been much in terms of changes. Languages immune to classes of vulnerabilities by default and/or sound checkers that can catch all of them seem necessary. And by seem necessary, I mean massive, empirical evidence that most developers can't code safely without such alternatives even on critical, widely-used, well-funded projects.

banachtarski · on Nov 4, 2018

Well don't use it. You aren't the target customer because for me, fixing the performance bottleneck is a lot harder than finding a divide by zero, and I certainly don't want to pay for the compiler to check things like divide-by-zero without me asking it to. When I don't care about performance, I reach for a scripting language or something. It's a tool, don't get all emotionally worked up about it.

pjmlp · on Nov 4, 2018

We are entitled to be emotional about it, because we all have to use tools which have the misfortune to be written in C derived languages.

Even if I don't touch those languages for production code, the safety of my whole stack depends on how well those underlying layers behave, and how responsible the developers were about writing secure code.

Which, as proven by buffer overflows in IoT devices, is not much.

xmiller · on Nov 4, 2018

So where is your production code written in Ada or Pascal?

pjmlp · on Nov 4, 2018

Ada cannot tell.

Pascal was replaced by Java and C#.

jacoblambda · on Nov 4, 2018

Ada absolutely can tell. Divide by zero throws an exception and buffer overflows are caught by essentially every modern compiler.

pjmlp · on Nov 4, 2018

Sure it can, but that wasn't the question. Rather what happened to the code I have written.

NDAs make us not able to tell about stuff.

What magic variant of C or C++ compiler are you using that throws errors on buffer corruption, unless you are speaking about using code with debugging mode enabled in production instead of using a proper release build.

jstimpfle · on Nov 4, 2018

> buffer overflows are caught by essentially every modern compiler.

If you use high-level arrays with bounds-checking, which are not always fast enough, and can become a maintenance burden. If they should be absolutely secure (like, std::vector isn't - it can be moved behind your back), they also require GC.

dagenix · on Nov 4, 2018

> don't get all emotionally worked up about it.

Not constructive

pjmlp · on Nov 4, 2018

I was always able to write fast code in languages like Object Pascal and Ada, while being safe from C and C++ UB cargo cult.

jstimpfle · on Nov 4, 2018

Cannot speak for Ada. Only looked through a few tutorials once or twice and did not pursue further, I think mainly because I did not like the verbosity.

> Object Pascal

FWIW I've worked for 6 months with a large Delphi code base (which is Object Pascal right?) and I really wanted to like it (and I did like quite a few aspects to it). Note that I was employed to improve performance, and I was competent enough about performance to have achieved speedups of 100x to 1000x for the parts I was working on. So, I'm not saying you can't write performant code in Delphi. But here are a few annoyances I can remember:

- Need to typedef Type of Pointer-To-Type for anything before using it in function arguments: type Foo = record .. end; type PFoo = ^Foo; type PPFoo = ^^Foo. This is not only annoying. It's also extremely hard to read function signatures like function Foo(a: PFoo; b: PPBar) compared to function Foo(A: ^Foo; b: ^^Bar) IMHO.

- Pointer types not really type-checked. Why? It would be so easy to do.

- No macros, which are incredibly important (need experience to use them well? Yes, but still).

- Object oriented culture pervasively used in libraries (for example, deep deep inheritance trees), leading to bad maintainability and bad performance. Tons of badly thought out features, weird syntactic tricks. Reminds of C++.

Pretty sure there were more. As I said there are good things in it that are not in C, like better compilation performance and some parts to the syntax. But especially the first three are show stoppers for me. (The OO culture is painful as well, but you don't need to buy into it if you can do without the libraries).

Of course, the Delphi code I wrote is just as unsafe as if it was in C.

pjmlp · on Nov 4, 2018

> Delphi code base (which is Object Pascal right?)

Object Pascal was created by Apple with feedback from Niklaus Wirth for Lisa and Mac OS implementation.

Other Pascal vendors eventually copied the extensions, most notably Borland.

When they released Delphi, they kept calling it Object Pascal, although most of what was in Apple's MPW and Turbo Pascal variants is mostly legacy.

As for the rest I will have to agree to disagree.

- I love that I already had those OOP features and modules in 1992 vs bare bones C;

- I consider proper TDD (Type Driven Programming) a good practice;

- pointers are type checked, not sure what you mean here

- No macros is a plus, the large majority at ISO C++ is creating safer alternatives for each use case

- OOP is quite useful in many use cases. I loved Turbo Vision and OWL.

> Of course, the Delphi code I wrote is just as unsafe as if it was in C.

Naturally one can disable all safety buttons and go full speed, but here lies the beauty of the Algol lineage of languages.

Type safe by default and if one really requires that extra mile, then escape hatches are in place.

Thing is, for 99% of most applications that is largely unnecessary.

jstimpfle · on Nov 4, 2018

> - pointers are type checked, not sure what you mean here

They weren't here with Borland. Maybe it was one of the many optional compiler switches, so I'll take that back.

> No macros is a plus, the large majority at ISO C++ is creating safer alternatives for each use case

That's just wrong. Learn how to use the tool and use it when it makes sense. There are a LOT of situations where the easiest for maintainability by far is to abstract at the syntactic level. The Rust guys acknowledge this as well. Even the Haskell compiler (GHC) has a C-style preprocessor built-in. For example, I use macros for data definitions, to create identifiers, to create strings from identifiers, to automatically insert additional function call arguments like size information or source code location...
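To make the "data definitions, identifiers, strings from identifiers" part concrete, here is a sketch of the common X-macro pattern. It is not taken from my code base; every name in it (TOKEN_KINDS, token_kind_name, and so on) is made up for illustration:

```c
#include <stddef.h>

/* Hypothetical token list: the single place where the data is defined. */
#define TOKEN_KINDS(X) \
    X(TOKEN_IDENT)     \
    X(TOKEN_NUMBER)    \
    X(TOKEN_STRING)

#define AS_ENUM(name)   name,
#define AS_STRING(name) #name,

/* Expands to: TOKEN_IDENT, TOKEN_NUMBER, TOKEN_STRING, TOKEN_KIND_COUNT */
enum token_kind { TOKEN_KINDS(AS_ENUM) TOKEN_KIND_COUNT };

/* Expands to: "TOKEN_IDENT", "TOKEN_NUMBER", "TOKEN_STRING", */
static const char *const token_kind_names[] = { TOKEN_KINDS(AS_STRING) };

/* Maps a kind to its spelling; out-of-range values give "unknown". */
static const char *token_kind_name(enum token_kind kind)
{
    if ((size_t)kind >= TOKEN_KIND_COUNT)
        return "unknown";
    return token_kind_names[kind];
}
```

Adding a token is a one-line change, and the enum and the name table cannot drift apart.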

> that extra mile

You mean 100x - 1000x in speed?

> Thing is, for 99% of most applications that is largely unnecessary.

Your initial comment was explicitly about performance. And I disagree that 99% of applications should accept a 100x - 1000x decrease in speed (and harder maintenance by far, if you ask me!) (corollary: less purity and joy in programming, by far), or even a 10x for that matter, just to get some more safety. I mean, safety is nice and all, but it's not all the reasons why I'm doing this. YMMV.

EDIT: I now understand that you mean "99% of most applications", while I read it "99% of (or most) applications". I disagree with the implicit statement that you should write 99% in a "safe" language and the rest in a systems language. It is well known that you can never easily know where the bottleneck is or where the next will be. And it is well known that it is very laborious to accommodate multiple languages, or incompatible code styles, in a single project. (I've also heard some negative stories about integrated scripting languages, for example from the games industry, but I don't work there...)

And in the end, speed is actually not the primary reason why I prefer C...

pjmlp · on Nov 5, 2018

Macros can be easily replaced by other means, that is what Java and .NET do via annotations and compiler plugins.

Rust macros are tree based and integrated into the compiler, not a text substitution tool running before the compiler, which cannot reason about what is happening.

I don't get where that 100x - 1000x factor comes from, most compiled languages aren't that much slower than C, especially if you restrict to ISO C.

If C extensions are allowed in the benchmarks game, then those other languages also have a few tricks up their sleeves.

For example, I can disable all checks in Ada via Unchecked_Unsafe pragmas and in the end there will hardly be any difference to generated C code.

The big difference is that I am only doing that in that function or package that is required to run like a 1500cc motorbike at full throttle to win the benchmark game, while everything else in the code is happy to run under allowed speed limits.

jstimpfle · on Nov 5, 2018

> Macros can be easily replaced by other means, that is what Java and .NET do via annotations and compiler plugins.

There are cases where text substitution is the right thing to do. How do you do the things I mentioned, for instance? Java in particular is an infamous example, requiring lots of hard-to-maintain boilerplate. Tooling helps in some cases to write it, but can't help reading it, right?

Some examples from my current code

    #define MAKE(x, y, z) [x] = { #x, y, z }

    #define MSG_AT_EXPR(...) _msg_at_expr(__FILE__, __LINE__, __VA_ARGS__)

    #define PARSE_LOG() \
            if (doDebug) \
                    MSG_AT(lvl_debug, currentFile, currentOffset, \
                           "%s()\n", __func__);

    #define BUF_RESERVE(buf, alloc, cnt) \
            _buf_reserve((void**)(buf), (alloc), (cnt), sizeof *(buf), 0, \
                         __FILE__, __LINE__);

    #define CLEAR(x) mem_fill(&(x), 0, sizeof (x))
    #define SORT(a, n, cmp) sort_array(a, n, sizeof *(a), cmp)

    #define RESIZE_GLOBAL_BUFFER(bufname, nelems) \
            _resize_global_buffer(BUFFER_##bufname, (nelems), 0)

In Delphi I've had to manually write all these expansions, resulting in less maintainable code. Go look in the linux kernel, I'm sure there are tons of examples that you'd be hard pressed to replace by a cleaner or safer specialized means.
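As a self-contained sketch of how macros in this style get used: the SORT definition below matches the one listed above, but sort_array is only a stand-in that forwards to qsort(), and format_at / FORMAT_HERE are hypothetical names invented for this example rather than my real helpers.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the sort_array() named above: forwards to qsort(). */
static void sort_array(void *base, size_t n, size_t elem_size,
                       int (*cmp)(const void *, const void *))
{
    qsort(base, n, elem_size, cmp);
}

/* The macro supplies the element size, so a call site cannot get it wrong. */
#define SORT(a, n, cmp) sort_array((a), (n), sizeof *(a), (cmp))

/* Three-way comparison of two ints without overflowing subtraction. */
static int cmp_int(const void *lhs, const void *rhs)
{
    int l = *(const int *)lhs;
    int r = *(const int *)rhs;
    return (l > r) - (l < r);
}

/* Hypothetical formatter that receives the caller's source location. */
static int format_at(char *buf, size_t size, const char *file, int line,
                     const char *msg)
{
    return snprintf(buf, size, "%s:%d: %s", file, line, msg);
}

/* The macro inserts __FILE__ and __LINE__ of the call site automatically. */
#define FORMAT_HERE(buf, size, msg) \
    format_at((buf), (size), __FILE__, __LINE__, (msg))
```

A call such as SORT(values, count, cmp_int) reads like a generic function, and FORMAT_HERE(buf, sizeof buf, "parse error") records where it was written without the caller typing the location by hand.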

> I don't get where that 100x - 1000x factor comes from, most compiled languages aren't that much slower than C, especially if you restrict to ISO C.

It's not so much about the language, but what you do with it and how you structure your code. Or, actually, how you structure the data. OOP culture is bad for performance.

If I were to choose the single best resource for this kind of argument, it would be Mike Acton's talk from CppCon 2014 on YouTube, if you want to watch it. Note that I'm not his fanboy. These are experiences I've made on my own to a large degree. And the arguments apply just as well to maintainability if you ask me.

> The big difference is that I am only doing that in that function or package that is required to run like a 1500cc motorbike at full throttle to win the benchmark game, while everything else in the code is happy to run under allowed speed limits.

And so the rebuttal is: No. If your code is full of OOP objects you can micro-optimize a certain function like crazy, but the data layout and the order of processing are still wrong.

To give another anecdata, for my bachelor's thesis I had to write a SAT solver for clauses of length <= 3 in Java. I modeled clauses as POD objects holding 3 integers (the 2nd and 3rd of which could be -1). My program could do about 10M clauses before memory was getting tight and it was doing only GC for at least a minute before it would finally die. Note that all objects are GC'ed reference types in Java (as you probably know).

I then converted it to highly unidiomatic Java by allocating 3 columns of unboxed integers of length #clauses, instead of allocating #clauses POD objects. The object overhead went away, so I could do about twice as many clauses before memory was used up. And when it was used up, since there was basically no GC overhead, the program died immediately (after a few seconds of processing, without a minute of GC). The downside was that maintainability was drastically worse since I was using only the most primitive building blocks of Java, and none of its "features".

If that had been in C, I could have just stayed with the first approach, since C has only "value types". It would have been performant from the start. C would have yielded a more maintainable program since I would not have had to fight the language. I could also have chosen the second approach, and it would have been easier to write and read than the Java code (which required tons of boilerplate).
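Roughly what the two layouts look like in C. This is a sketch written for this comment, not the thesis code, and all type and function names are hypothetical:

```c
#include <stddef.h>
#include <stdlib.h>

/* Layout 1: array of structs. Each clause is a plain value stored inline,
 * so an array of them has no per-object header and no extra indirection. */
struct clause {
    int lit[3];                /* unused slots hold -1, as described above */
};

/* Counts the literal slots of a clause that are actually in use. */
static int clause_length(const struct clause *c)
{
    int n = 0;
    for (int i = 0; i < 3; i++)
        if (c->lit[i] != -1)
            n++;
    return n;
}

/* Layout 2: struct of arrays, one column per literal position. */
struct clause_columns {
    int *lit[3];
    size_t count;
};

/* Releases all columns; safe to call on a partially initialized value. */
static void clause_columns_free(struct clause_columns *cols)
{
    for (int i = 0; i < 3; i++) {
        free(cols->lit[i]);
        cols->lit[i] = NULL;
    }
    cols->count = 0;
}

/* Allocates three zero-filled columns of `count` ints (count must be
 * nonzero). Returns 0 on success, -1 if any allocation fails. */
static int clause_columns_init(struct clause_columns *cols, size_t count)
{
    cols->count = 0;
    for (int i = 0; i < 3; i++)
        cols->lit[i] = NULL;
    for (int i = 0; i < 3; i++) {
        cols->lit[i] = calloc(count, sizeof *cols->lit[i]);
        if (cols->lit[i] == NULL) {
            clause_columns_free(cols);
            return -1;
        }
    }
    cols->count = count;
    return 0;
}
```

With the first layout, struct clause clauses[N] is one contiguous block of ints; the second layout is the column version I had to build by hand in Java.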

pjmlp · on Nov 5, 2018

You know that Mike Acton is now working on Unity's new C# compiler team, right?

And yes he is also having a role on the new ECS stack, which is just another form of OOP, for anyone that bothers to read the respective CS literature.

Had you implemented your SAT solver in a GC language like Oberon, Modula-3, Component Pascal, Eiffel or D, among many other possible examples, then you wouldn't need such tricks as they also use value types by default, just like C.

jstimpfle · on Nov 5, 2018

I know and as far as I know he's trying to improve performance there. If you actually bother to watch the video you will find him ranting hard against mindless OOP practices.

ECS as I understand it is pretty much not OOP. My idea of it I would call Aspect-oriented, i.e. extracting features from artifacts and stuffing them in global tables, which of course separate data by shape. If you look on wikipedia, the first association you will find is also Data-oriented programming (the term from the talk; it is about tables-shaped and cache-aware programming and I believe it was also coined by Mike Acton).

Data-oriented programming stands particularly opposed to OOP which the games industry has found to scale miserably.

pjmlp · on Nov 5, 2018

Yes, I did watch that talk back when he did it. I always follow CppCon talks.

Then you should also watch the talks he did later at Unite, after joining Unity.

As I mentioned regarding ECS, on CS literature.

For example,

"Component Software: Beyond Object-Oriented Programming"

https://www.amazon.com/Component-Software-Object-Oriented-Pr...

First edition (1997) used Component Pascal, C++ and Java, while the 2nd edition replaced Component Pascal with C#.

"Component-Based Software Engineering: Putting the Pieces Together"

https://www.amazon.com/Component-Based-Software-Engineering-...

ECS and Data-oriented programming aren't the same thing.

jstimpfle · on Nov 5, 2018

"Unity at GDC - A Data Oriented Approach to Using Component Systems" https://www.youtube.com/watch?v=p65Yt20pw0g

xmiller · on Nov 4, 2018

But the authors of large, popular code bases chose C or C++.

Anyone is free to rewrite Apache in Ada, but for some reason it isn't happening.

the_why_of_y · on Nov 4, 2018

Do so many people write code in JavaScript today instead of countless other high level languages because JavaScript is technically superior and better designed than any other language, or because browsers and the web provide a ubiquitous runtime platform?

pjmlp · on Nov 4, 2018

Ever heard of this little thing called money?

Since free UNIX brought C to the masses, and Bjarne made C++ as a means to never have to touch bare C after his encounter with BCPL, many people have chosen these languages because they were the languages that came with an OS SDK.

So now unless we get some nice lawsuits, companies will keep picking the easy path.

jstimpfle · on Nov 4, 2018

So, are you saying it's been difficult to get Ada or Free Pascal or Java up and running on a Unix system in the last 10 or 20 years?

pjmlp · on Nov 4, 2018

What I am stating is that to re-write existing systems, regardless how rotten they might be, someone needs to pay for the work to happen.

What many tend to forget on those "X re-written in Y" posts.

Pay Per Hour * Total Hours Effort = Money Spent on Rewrite

Additionally what I am saying is that languages that come with the OS SDK have first class privileges and experience shows that 99% of the companies won't look elsewhere.

For example, in commercial UNIX days, you would get C and C++ compilers as part of the base SDK. The vendors that had support for additional languages like Ada, had them as additional modules.

So any dev wanting to push for language X needed to make a case why the company needed to put extra money on the table instead of going with what they already got.

A similar process happens on mobile and Web platforms nowadays, you either go with what is available out of the box or try to shoehorn an external toolchain and then deal with all integration issues and additional development costs ($$$) that might arise.

jstimpfle · on Nov 4, 2018

Many many free software projects are started by people who don't get paid for it. Those people start their project in whatever language they want. If someone wants to write webserver software in Java or an OS in Object Pascal, they can do it.

Successful projects may get financial support from companies later. I doubt that these companies are overly selective towards "obviously bad languages". I don't buy that there are any mechanisms in place to get cynical or outraged about. Maybe it _is_ just that some languages are more productive.

shin_lao · on Nov 4, 2018

| It's effectively impossible to write bug free code.

What does that even mean? It's impossible to have bug free code in any language. Bug in the libraries, the compiler, in the OS, in the hardware...

nwmcsween · on Nov 4, 2018

It's actually good that your code breaks with new compilers, as it means the code is bogus anyway. The alternative to UB is either a strict spec that will be dog slow on $arch or abstracting away everything and making it complex.

pjmlp · on Nov 4, 2018

Being dog slow is very much dependent on the use case.

Yes it will be slower than taking every shortcut in the name of performance.

What really matters is, do the execution speed and memory footprint meet the requirements?

If the user is happy to get their data in 100ms, with a requirement of 300ms max, getting it in 10ms is hardly an advantage.

jjnoakes · on Nov 4, 2018

If you can deliver responses 10x faster than required, you can scale with 10x fewer resources (or at least presumably some factor of fewer resources) and still stay within the user requirements.

I'd say that's a nice advantage.

slededit · on Nov 5, 2018

These most recent optimizations are nowhere near 10x faster. The only one that can even come close is autovectorization and that can be done without heavy reliance on UB.

If your loop is worth autovectorizing, the 2 instructions to check for pointer aliasing and other showstoppers are not material.
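A hand-written sketch of that kind of check, to show how small it is. This is an illustration with hypothetical function names, not what any particular compiler emits; it also assumes that converting pointers to uintptr_t gives comparable addresses, which holds on mainstream flat-memory targets but is not promised by the C standard:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero when the n-element ranges at a and b share any byte. */
static int ranges_overlap(const float *a, const float *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    size_t bytes = n * sizeof(float);
    return pa < pb + bytes && pb < pa + bytes;
}

/* Fast path: restrict tells the compiler dst and src never alias,
 * so it is free to vectorize the loop. */
static void scale_restrict(float *restrict dst, const float *restrict src,
                           size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Fallback: no aliasing promise, correct for overlapping ranges too. */
static void scale_any(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* One cheap overlap test up front selects the version that is safe to run. */
static void scale(float *dst, const float *src, size_t n, float k)
{
    if (!ranges_overlap(dst, src, n))
        scale_restrict(dst, src, n, k);
    else
        scale_any(dst, src, n, k);
}
```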

pjmlp · on Nov 4, 2018

Not everyone is going to be the next Google, Facebook, Crytech, EA, ...

There are better ways to waste money than spending it on YAGNI features.

jstimpfle · on Nov 4, 2018

Consider Parkinson's law... Compare git to mercurial... Consider how many successful Java command line programs are there?

pjmlp · on Nov 5, 2018

There are plenty of them at the enterprise level.

Just a couple of months ago I re-wrote several Korn shell scripts doing ETL related tasks into a couple of saner Java CLI programs at customer request.

jstimpfle · on Nov 5, 2018

And the responsiveness is good? Isn't there always this terrible startup lag? Could we rewrite the C implementation of e.g. git in Java to get a program that is just as fast (e.g., instantaneous response for most operations)?

pjmlp · on Nov 5, 2018

Surely, you know that there are native code compilers for Java, right?

jstimpfle · on Nov 5, 2018

Is this practical? What are the limitations? Is this Java spirit? Why doesn't everybody use it? What does this do about startup time?

My initial point in the meantime was only that performance does matter. And that for some reason I cannot recall any (open source or free software) CLI programs written in Java off the top of my head. While there are free Java implementations easily available, no?

pjmlp · on Nov 5, 2018

Many don't use them, because they are commercial tools, and many developers nowadays don't like to pay for software.

The only limitation is that for reflection code one needs to white list which classes end up on the binary.

All major third party commercial JVMs always had the capability of AOT to native code, it was just taboo at Sun.

Oracle has another point of view and thus kept the Maxine project alive, rebranded it into Graal, and now those that don't like to pay for developer tools can also enjoy AOT compilation to native code via SubstrateVM, GraalVM and Graal integration into OpenJDK.

Just Windows support is not yet fully done for the time being.

Their long term roadmap is to rewrite the C++ parts of OpenJDK in Java itself, also known as Project Metropolis.

OpenJDK 11 also adopted a feature already common in other commercial JVMs, which allows a JVM to dump a JIT image before exiting. Which then allows for AOT-like startup from the 2nd execution onwards.

Also Java isn't the only safe alternative to C, those that don't mind lack of generics can just pick Go instead of dealing with C.

Which then we already have several high profile projects using it, including Google's exploratory microkernel based OS.

MauranKilom · on Nov 4, 2018

Correct, it depends on the use case. Nobody is saying you should write everything in C++. But there is plenty of software that actually uses your CPU for data processing where speed is absolutely crucial. If you know that every second (or 100 ms) saved in every functionality your software offers will eventually matter, would you still choose the more pessimistic performance guarantee?

Did you know (at the point where the entire architecture was chosen) that you will get 100 ms with a safer language? What if you got 1000 ms and C++ got 100 ms?

pjmlp · on Nov 4, 2018

Thing is, C and C++ aren't the only languages with those capabilities, e.g. CPU for data processing.

Want to live in the danger zone and disable bounds checking in e.g. Turbo Pascal?

Surround the critical performance code path with {$R-} and {$R+}, while enjoying safe bounds checking everywhere else.

saagarjha · on Nov 4, 2018

This isn't necessarily true. Many compiled languages these days have a strong specification that guarantees a lack of undefined behavior for the vast majority of code, yet remain relatively performant–think Swift, Rust, and the like.

otabdeveloper2 · on Nov 4, 2018

"Undefined behavior" doesn't mean "buggy". It simply means stuff that's CPU-specific or compiler-specific. C++ has a standard, unlike languages like Rust or Python. This is a good thing, because compiler-specific crap doesn't magically go away if you avoid standards and just declare one implementation as a "reference".

saagarjha · on Nov 4, 2018

> "Undefined behavior" doesn't mean "buggy". It simply means stuff that's CPU-specific or compiler-specific.

Yes, it does. The behavior you are describing is "implementation specific", and it is ok to have this in your program provided you know what your implementation will do. It is illegal to have any undefined behavior in a well-formed C/C++ program.

tom_ · on Nov 4, 2018

The standard doesn't define well-formedness, nor does it consider undefined behaviour illegal necessarily. It simply has nothing to say about what happens when undefined behaviour is invoked.

And it is OK to invoke it if you know what your implementation will do. The standard even gives documenting the behaviour as an option.

(I wonder how many people worry about supplying clang a file that doesn't end in a new line? That is undefined behaviour, and yet you know exactly what's going to happen: you'll get a warning, if compilation continues the code will build as if the new line were there, and clang won't delete your source file, even though it would be perfectly within its rights to.)

MauranKilom · on Nov 4, 2018

> I wonder how many people worry about supplying clang a file that doesn't end in a new line? That is undefined behaviour

It no longer is, since C++11. Check Phase 2.2 here: https://en.cppreference.com/w/cpp/language/translation_phase...

> clang won't delete your source file, even though it would be perfectly within its rights to.

UB allows the execution of the compiled program to wipe your hard drive, but it certainly does not give your compiler that right. I mean, the standard doesn't say what side effects invoking a compiler is allowed to have (because that's out of scope), so none of it can be predicated on UB.

(Mandatory mention: https://github.com/munificent/vigil)

tom_ · on Nov 5, 2018

Thanks for the clarification. An outbreak of good sense? I hope it's contagious. It is still undefined behaviour in C11.

Not sure I agree with you about UB - missing line-endings is a parse-time issue, so the intent appears to be that the compiler (or interpreter) is free to do what it likes even at this stage, before your program is even ready to run.

otabdeveloper2 · on Nov 4, 2018

When people talk of UB in C++ they always mean "implementation specific".

saagarjha · on Nov 4, 2018

No, they don’t. These are separate concepts with different behavior. “Undefined behavior” is illegal to have in C/C++ programs and includes things like division by zero and out-of-bounds array access. “Implementation specific” behavior is legal but allowed to differ, such as querying the size of an int.

esrauch · on Nov 4, 2018

You really don't need to go very far to get into UB instead of implementation defined: signed int overflow is already UB and the program is permitted to do literally anything if you ever accidentally make one of your ints too large.
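A small sketch of the contrast, with hypothetical function names. Unsigned arithmetic is defined to wrap, so the first function is fine for every input; the signed version has to test against INT_MAX before adding, because the addition itself is where the UB would happen:

```c
#include <limits.h>

/* Unsigned arithmetic is defined to wrap modulo UINT_MAX + 1,
 * so this is well defined even for x == UINT_MAX. */
static unsigned wrapping_increment(unsigned x)
{
    return x + 1u;
}

/* Signed overflow is undefined, so the limit is checked before the
 * addition; at INT_MAX the value saturates instead of overflowing. */
static int saturating_increment(int x)
{
    return (x == INT_MAX) ? INT_MAX : x + 1;
}
```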

ilovecaching · on Nov 4, 2018

This is why if you're currently writing C or C++, please, please look into Rust.

Titus Winters just did a great talk at Pacific++ outlining how difficult it is to understand the quirks of C++. Not even Bjarne can keep all of it in his head. The dice are completely loaded against you: it is far too easy to introduce bugs, unintended behavior, and vulnerabilities in your code.

Rust isn't a silver bullet that will suddenly fix your life, but it doesn't have to deal with the craziness of C undefined behavior and doesn't have to be backwards compatible. It's designed from the ground up to make comprehensive sense, and safety is a feature enforced for you by the compiler in many cases.

banachtarski · on Nov 4, 2018

> This is why if you're currently writing C or C++, please, please look into Rust.

Ugh if you read the article, you would know that the UB described here is the same in Rust. There is this massive group think about which languages are safe and/or fast, and whenever an article like this comes up, invariably someone says "Rust!" without really looking at the problem itself.

Personally, I am in wait-and-see mode on Rust. Without better generics support (aka template <size_t> or similar) and a number of other things, it still doesn't meet my bar for writing fast generic code, and who knows how complex it will be by the time it does? I say the jury is still out.

newacctjhro · on Nov 4, 2018

> Ugh if you read the article, you would know that the UB described here is the same in Rust.

Rust doesn't have a formal memory model yet, but it's already known that UB in Rust is quite restricted:

https://doc.rust-lang.org/nomicon/what-unsafe-does.html

Most importantly, UB in Rust should only arise if you're writing unsafe code (barring compiler bugs). Typically, most Rust code is safe. This is a huge win.

saagarjha · on Nov 4, 2018

> Ugh if you read the article, you would know that the UB described here is the same in Rust. There is this massive group think about which languages are safe and/or fast, and whenever an article like this comes up, invariably someone says "Rust!" without really looking at the problem itself.

Rust does have vocal proponents, but I don't see how "the UB described here is the same in Rust". Rust should trap consistently when dividing by zero.

oconnor663 · on Nov 4, 2018

> the UB described here is the same in Rust

I don't think this is correct. Dividing by zero in Rust produces a panic, and panics are observable side effects that may not be reordered with other side effects. This is sometimes annoying, because it can prevent the compiler from eliminating bounds checks in some cases, but it avoids the problems mentioned in this article.

ilovecaching · on Nov 4, 2018

> Ugh if you read the article, you would know that the UB described here is the same in Rust.

Ugh, if you took the time to learn Rust before criticizing it, you'd know that Rust only allows undefined behavior in unsafe blocks.

> Without better generics support

So you have no idea what generics are. Templates aren't generics, they're fancy code generators expanded at compile time, which means they're incredibly loosely typed and allow you to do things like have scalar parameters like size_t n (because they're basically macros). C++ templates are essentially a form of duck typing that has led to an increased need for specification, hence concepts in C++20.

Rust's parametric polymorphism and type classes (aka generics) follow in the ML and Haskell tradition, in that type checking is like a constraint program that is run at compile time. When a generic item is invoked, the specification of that type is immediately checked for consistency at the call site; otherwise the typechecking fails.

red75prime · on Nov 4, 2018

UB is a way to better optimize code by (silently) offloading the burden of proving correctness onto the programmer. Rust prefers to be more explicit about it. Const generics will surely not change that.

flohofwoe · on Nov 4, 2018

It's not like you stumble over UB in your daily work in C/C++. You know it exists, you know which areas to avoid, and you compile and test on the 3 "big" compilers (gcc, clang, cl.exe). On clang, test with UBSan enabled.

UB isn't "craziness"; it's just a detail to be aware of.

A compiled binary that worked correctly won't suddenly run into undefined behaviour, that can only happen if you upgrade compiler versions or change compilation flags. But this only happens while the application is still actively developed and tested.

the_why_of_y · on Nov 4, 2018

That presumes that your testing process will find the input that actually triggers the UB in your program before the release.

The CVE history of the major browser vendors indicates that this is currently an unsolved problem for their large C++ applications.

flohofwoe · on Nov 4, 2018

Of course, and that's exactly where Rust makes sense. Few pieces of software are as safety-critical and complex as a browser though, that's why I'm not a fan of the general "UB scaremongering" ;)

s3cur3 · on Nov 4, 2018

Full-time C++ game dev of ~5 years here. I absolutely run into UB issues at least once a month, sometimes as often as a couple of times in a week. I doubt our codebase is of below-average quality either. UBSan is a godsend here... without it, this stuff was truly hell to debug.

oconnor663 · on Nov 4, 2018

> A compiled binary that worked correctly won't suddenly run into undefined behaviour, that can only happen if you upgrade compiler versions...

Upgrading the compiler version and causing undefined behavior sounds pretty sudden. Here's my favorite story along those lines: http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-...

ilovecaching · on Nov 4, 2018

It isn't just UB. It's ADL, ODR, templates, multiple inheritance, etc. The de facto compatibility with C and its own backwards compatibility are also both curses on C++. C++ also blew its complexity budget in a lot of questionable ways that just make it hard to keep C++ code maintainable. When things become unmaintainable, they become buggy.

humanrebar · on Nov 4, 2018

> It's not like you stumble over UB in your daily work in C/C++...

I don't know. I often help colleagues out with things like preprocessor nonsense and violations of the One Definition Rule.

A lot of it has to do with difficult build and packaging tools. Actually standardizing a module system will be an important step in the right direction.

beefhash · on Nov 4, 2018

> This is why if you're currently writing C or C++, please, please look into Rust.

But please, please also look into its platform support table[1], too. If you're the slightest bit opposed to the Windows/macOS/Linux hegemony, you may want to reconsider Rust.

[1] https://forge.rust-lang.org/platform-support.html

ericpauley · on Nov 4, 2018

What exactly are you saying? Rust can be built with zero OS/kernel dependencies at all. Case in point: people have written entire operating systems on bare metal in Rust.

xmiller · on Nov 4, 2018

Are there any large programs like BIND written in Rust? If so, what is their actual security record?

jcranmer · on Nov 4, 2018

Several bits of Firefox are written in Rust now. Looking through the bug reports, there has yet to be any CVE in any of the Rust-rewritten components.

nickpsecurity · on Nov 4, 2018

Trust-DNS is one in that space. We won't know for a while what its actual security record is. We do have some evidence from fuzzing, where bugs led to panics rather than the worse effects seen in C. I think an objective test would compare the effects of fuzzing Rust and C libraries to see how often Rust just panicked. The study should use different types of libraries that use the language in different ways. If Rust's claims are true, most problems will involve 3rd-party libraries that aren't Rust, unsafe Rust, and/or compiler errors that break safety. People doing the study could just grab popular C apps vs. similar stuff in Rust on GitHub that's actually maintained. Also, look at amateurish and unmaintained stuff to see if those still have minimal vulnerabilities.

That's how I'd do it anyway.

pjmlp · on Nov 4, 2018

There is a whole OS, good enough?

gbw · on Nov 4, 2018

If you mean Redox, that is hardly a production OS (yet).

People don't have infinite time and only search for security issues in software that's actually in use.

Which is one of the reasons that C/C++ appear more often in CVEs.

pjmlp · on Nov 4, 2018

The OP asked for an example of a large program like BIND. I gave him/her an OS, which naturally also contains something like BIND, but apparently we now move the goalposts to how many users it has.

I love how people hand-wave away the fact that even projects like the Linux kernel, with their stringent processes, aren't able to avoid CVEs, to the point that it became the major focus of the Linux Kernel Security Summit 2018.

Yet we all know that only newbies make major errors in C. /s

sethammons · on Nov 4, 2018

Depends on popularity and use.

pjmlp · on Nov 4, 2018

The question was about an example for a large program like BIND.

sureaboutthis · on Nov 4, 2018

Which one?

otabdeveloper2 · on Nov 4, 2018

"Undefined behavior" in C++ just means stuff that's either CPU-specific or compiler-specific.

Rust has no undefined behavior because it only supports one CPU and has only one compiler implementation.

When Rust supports a dozen architectures and has three different compiler implementations then Rust will be just as full of UB as C++. Except with the added "bonus" of all the UB being brand-new and unspecified, unlike the well-known portability pitfalls of C++.

pcein · on Nov 4, 2018

I am happily running Rust generated code on my ARM microcontrollers. Rust uses LLVM for code generation and can support any architecture which LLVM supports.

steveklabnik · on Nov 4, 2018

> When Rust supports a dozen architectures

"Rust now available on 14 Debian architectures" https://lists.debian.org/debian-devel-announce/2018/11/msg00...

pjmlp · on Nov 4, 2018

There are plenty of systems languages with a very tiny percentage of UB compared with all the languages that allow for copy-paste of C code.

That's because they prefer to lose a couple of ms rather than pay the price that ultimate performance might bring.

It's like real life: you can do 300 km/h on a 1200cc motorbike, or you can do it in a car with a reinforced structure, seatbelts, and airbags.

Even with diminished chances of survival, I'd rather be in the car.

otabdeveloper2 · on Nov 4, 2018

UB is an implementation issue, not a language issue. UB is a fact of life that cannot be removed by overspecifying a language -- if your language spec has too much implementation-specific stuff in it, then compiler writers will simply ignore it. Many such examples. The best you can do is to document the implementation-specific stuff extensively.

Also, implementation-specific details aren't always about maximizing performance. Compilers implement languages slightly differently because of architecture and OS differences and because it's easier that way for the people writing compilers.

pjmlp · on Nov 4, 2018

UB is not implementation specific at all, the ISO documentation is quite clear about which is which.

Other systems languages manage to do it quite well.

It was really a shame that Bell Labs gave UNIX away for a symbolic price for 10 years before being allowed to actually sell it.

Otherwise we wouldn't be having these kinds of C-related discussions.

dagenix · on Nov 4, 2018

> Rust has no undefined behavior because it only supports one CPU ...

That is not true. Rust works on multiple architectures.

otabdeveloper2 · on Nov 4, 2018

There's a massive chasm between "works on" and "supports".

Retra · on Nov 5, 2018

That doesn't seem relevant -- Rust works on and supports multiple architectures. And there's plenty of UB in unsafe code.

anderskaseorg · on Nov 4, 2018

(2010)

duneroadrunner · on Nov 4, 2018

Yeah, with modern C++, you can largely choose to avoid using elements that are prone to undefined behavior. For example, rather than native integers, you could use a compatible integer class that checks for division by zero [1]. Or one that checks for overflow too [2]. The Core Guidelines lifetime checker aims to (eventually) make native pointers and references memory safe via (severe, but not quite as severe as Rust) usage restrictions. And when you need to circumvent the restrictions, you can use an unrestricted smart pointer with run-time checking [3][4].

[1] shameless plug: https://github.com/duneroadrunner/SaferCPlusPlus#primitives

[2] https://github.com/boostorg/safe_numerics

[3] https://github.com/duneroadrunner/SaferCPlusPlus#registered-...

[4] https://github.com/duneroadrunner/SaferCPlusPlus#norad-point...

pjmlp · on Nov 4, 2018

I agree. The problem is ensuring that everyone at the company, and every third-party dependency (many of which are available only as binary libs), actually uses those modern C++ constructs.

the_why_of_y · on Nov 4, 2018

Interesting project; are you aware of any actual users?

Meanwhile, in the real world, C++ programmers typically use operator+(int,int) with UB on overflow because it's conveniently built into the language.

The problem with C++ isn't that doing the right thing is impossible, it's that doing the wrong thing is the default, with no dependencies and no syntactic overhead.

jcelerier · on Nov 4, 2018

> Meanwhile, in the real world, C++ programmers typically use operator+(int,int) with UB on overflow because it's conveniently built into the language.

In more than a decade of coding with C++ I have never been bitten by a signed overflow bug which is UB. However I have been hit by unsigned underflow (e.g. if(2u - 3u > whatever)) way too often even though it is perfectly "legal" from the point of view of the language.