Prof. Regehr did not find problems with SQLite. He found constructs in the SQLite source code which under a strict reading of the C standards have “undefined behaviour”, which means that the compiler can generate whatever machine code it wants without it being called a compiler bug. That’s an important finding. But as it happens, no modern compilers that we know of actually interpret any of the SQLite source code in an unexpected or harmful way.
At some point some popular compiler is going to make a subtle but important change to some undefined behavior that's not going to be immediately obvious as to its repercussions, and the fallout will be massive. It boggles my mind the mental contortions people will go through to justify what is essentially an argument of "it hasn't caused a problem yet" while ignoring that it's caused many problems already, just not that they've noticed or that have affected them.
> At some point some popular compiler is going to make a subtle but important change to some undefined behavior that's not going to be immediately obvious as to its repercussions, and the fallout will be massive.
Wait until you realize that the size of a byte in C is not fixed. Someday a processor will come along where a byte is 6 bits, and the fallout will be massive. (Really, it's happened before.)
No, actually, that processor would not become popular because no one would use it.
The reality is unless you are using a formally-defined language like ML, you are relying on undefined behavior in your language.
The number of bits in a byte is specified by the CHAR_BIT macro, which is required to be at least 8. It's commonly larger than 8 on C compilers for some DSPs, but almost universally 8 for hosted implementations.
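For code that quietly assumes 8-bit bytes, a minimal sketch of turning that assumption into a build-time check instead of a silent portability hazard (C++11 static_assert; C11 spells it _Static_assert):

    #include <climits>   // defines CHAR_BIT

    // Refuses to compile on any implementation where a byte is not 8 bits.
    static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");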
Did you see the short-lived attempt to create "friendly C" a few months ago? [1]
There are languages that have a few undefined corners, and then there's C. A sufficiently large difference in quantity becomes a difference in quality. C is qualitatively worse than most modern languages with its undefined behavior. (Granted, some modern languages escape by having the one implementation, which is then the definition. But still, that's less undefined than C.)
It was too ambitious. Some of those examples, like shifting by 32, are about trying to remove even unspecified behavior. If you could reduce most cases of undefined behavior down to "what some real machine would do", it would be an enormous help, and wouldn't be anywhere near as hard.
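(A minimal sketch of why the shift-by-32 case in particular resists a single definition: the hardware itself disagrees about the answer.)

    #include <cstdint>

    std::uint32_t shift_left(std::uint32_t x, unsigned n) {
        // Undefined in C and C++ whenever n >= 32. Real machines disagree
        // about what it "should" mean: x86's SHL masks the count to 5 bits,
        // so a shift by 32 acts like a shift by 0, while 32-bit ARM uses the
        // low byte of the count and yields 0. Mandating either answer
        // penalizes the other architecture with an extra check or mask.
        return x << n;
    }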
For example you could say that division by zero will either give a result or trap, but that it can't do anything else, and the code path cannot be ignored. Or that an uninitialized variable is equivalent to initializing it with a semi-random number.
Even if an out of bounds array access will cause untold chaos, you can at least specify that it will cause that chaos at X bytes past the base of the array.
Merely creating an invalid pointer would be, on architectures where it doesn't trap, 100% harmless.
And for crying out loud, is forgetting to terminate a string literal with a " still undefined? There are so many bits of undefined behavior that are easy to remove.
Because uninitialized variables are allowed to change values arbitrarily over their (nonexistent) "live range". Your proposal would force them into having a real live range, with a stable value. See this section for examples of the kinds of optimizations this opens up:
There's also the issue that having a stable value forces the register allocator to keep a live range for the undefined variable, which can cause unnecessary spills or remat in other places. Especially on 32-bit x86, this can be a problem.
> Because uninitialized variables are allowed to change values arbitrarily over their (nonexistent) "live range".
Even specifying that they have a new arbitrary value on each access would be a big improvement over the status quo. It wouldn't allow nasal demons. Since LLVM seems to already have these semantics, it makes a good argument that it wouldn't hurt C's performance to tighten the spec at least that much.
But I'm not seeing how C or C++ code gets you in a situation where you're purposefully doing arithmetic or bitwise operations on uninitialized variables. If it almost never happens, it doesn't need to optimize particularly well. What parts of the STL can be faster by treating uninitialized variables as impossible?
> There's also the issue that having a stable value forces the register allocator to keep a live range for the undefined variable, which can cause unnecessary spills or remat in other places.
This would only happen when the variable is accessed multiple times. In which case having to keep the value safe is no worse than if it actually had been initialized. I don't see how this is a problem.
> What parts of the STL can be faster by treating uninitialized variables as impossible?
What I'm mostly thinking of is allowing unused branches to be pruned. The STL tends to get inlined really heavily, which results in a whole pile of IR being emitted for what look like very simple operations. Based on the actual parameters and state, the optimizer then wants to prune out as much dead code as possible to reduce i-cache footprint, and sometimes time as well.
Take small string optimization. That optimization requires a branch on length in almost every string operation. But if you have a std::string of constant size, you don't need the heap spilled code to be emitted at all. Usually the only way to work this out is inlining + constprop + DCE. That's where undefined value semantics are really helpful: a branch on an undefined value can be completely removed as undefined behavior, which can make its branch targets unreachable, and allow them to be removed, and so on recursively. Undefined values allow entire CFG subtrees to be eliminated in one fell swoop, which is an extremely powerful technique for reducing code size.
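A rough sketch of the shape of it (a toy type for illustration, not actual STL code):

    #include <cstddef>

    // Toy stand-in for a small-string-optimized string; real std::string
    // layouts differ, this just shows the branch structure.
    struct tiny_str {
        char        buf[16];   // in-place storage for short strings
        char*       heap;      // heap storage for long strings
        std::size_t len;
        bool        is_small;
    };

    char first_char(const tiny_str& s) {
        // Nearly every operation branches on the small/heap flag.
        if (s.is_small)
            return s.buf[0];
        return s.heap[0];      // the heap path
    }

    char demo() {
        tiny_str s{};
        s.is_small = true;
        s.buf[0] = 'h';
        // After inlining, constant propagation proves is_small == true, so
        // dead-code elimination can drop the heap path entirely. And if
        // is_small had never been written at all, the branch would be on an
        // undefined value, so the compiler may pick either arm and delete
        // the other, plus everything reachable only from it.
        return first_char(s);
    }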
I don't have a precise example off the top of my head as to where this kicks in in the STL, but I strongly suspect it does.
I still don't understand. Why would length be undefined if that's how you tell whether a string is small or not?
Even if you can remove one of the branches because you know if the string is small, the logic of "this branch can't happen" -> undefined -> delete sounds more complex than "this branch can't happen" -> delete.
The order is undefined -> "I, the compiler, declare this branch can't happen" -> delete. The middle step is valid because "undefined behavior" permits the compiler to make that declaration, then act on it. If you don't want it to do that, use defined behaviors only, which is a great deal easier said than done. Partially because of how hard it is to avoid it in your own code, and partially because it is shot through all the other library code (which as pcwalton points out, courtesy of aggressive inlining, is also your code).
I was skeptical about all this about six months ago myself, but the continuous stream of articles on this topic, plus the spectacular and highly educational failure of Friendly C (and not just that it failed, but why it failed, which is why I posted that exact link) has satisfied me. It is also part of why I've stepped up my own anti-C rhetoric since I've been so convinced... as bad as I thought C was, it really is, no sarcasm, not merely "hating", even worse than I thought. I am honestly scared to use it professionally and pretty much refuse to touch it without good static analysis support.
I just don't understand how "std::string of constant size" leads to the compiler inferring that a variable is undefined along a particular code path. In particular, on the not-taken code path, what is the uninitialized variable being branched on?
Edit: Is it figuring out that the pointer in the string object is uninitialized? That doesn't seem any easier than reasoning about the length value, and I don't see how it would lead to "br undef".
LLVM has been doing this optimization since 2008, and the commit message contains the justification. LLVM commit #60470:
    Teach jump threading some more simple tricks:

    1) have it fold "br undef", which does occur with
       surprising frequency as jump threading iterates.
    ...
Chris didn't cite any specific numbers in the log, but I believe him when he says it actually happens. You should be able to run "opt -O2 -debug-only=jump-threading" and see where it does :)
Related, with some actual numbers to prove it helps, LLVM commit #138618:
    SimplifyCFG: If we have a PHI node that can evaluate to NULL and do a load or
    store to the address returned by the PHI node then we can consider this
    incoming value as dead and remove the edge pointing there, unless there are
    instructions that can affect control flow executed in between.

    In theory this could be extended to other instructions, eg. division by zero,
    but it's likely that it will "miscompile" some code because people depend on
    div by zero not trapping. NULL pointer dereference usually leads to a crash so
    we should be on the safe side.

    This shrinks the size of a Release clang by 16k on x86_64.
For the price of removing some of the rusty nails sticking out of C, I'll happily pay a few 16KB on a binary as large as clang. Plus in most or all of those cases, the optimization would still be possible even with stable uninitialized values. It just needs to be done in a different way. You might be able to get 13 of those 16KB back with a very minor amount of work.
> For the price of removing some of the rusty nails sticking out of C, I'll happily pay a few 16KB on a binary as large as clang.
Lots of LLVM users won't. The competition between LLVM and GCC is (or was) pretty brutal.
> Plus in most or all of those cases, the optimization would still be possible even with stable uninitialized values. It just needs to be done in a different way.
Without some justification, I doubt that's possible. How? Are you familiar with all of the passes of LLVM, how they interact, and with the code patterns generated by the STL?
I trust Chris in that he didn't add the jump threading optimization for no reason, which is the main one that matters here. If he says that this occurs with "surprising frequency", sorry, but I'm going to trust him. Compiler developers are usually right about the impact of their optimizations. Submit a patch to LLVM to remove it if you like, but I highly doubt it'll go through. If it did go through, Rust (and perhaps Swift) would probably revert it, as we ensure that we don't emit UB in the front end, so losing the optimization hurts us for no reason.
The goal is to refine the semantics of C. Removing a valid optimization doesn't help that.
It's not that I think Chris is wrong; it's that I think other optimizations related to dead code have gotten better over the years, and focused effort could improve them more.
I think "refining the semantics of C" is a doomed effort that we shouldn't undertake. We'll lose performance for very little benefit, if market game theory even made it possible (and it doesn't: benchmarks of C compiler performance are much more important to customers than friendliness of the C dialect). We should just be moving away from C instead, or if we must stick with C we should invest in dynamic checks for undefined behavior.
My position, by the way, is hardly uncommon among longtime C compiler developers.
Do you think C managed to get things exactly right? Or should we add more kinds of undefined behavior?
What do we do about the fact that undefined behavior actually makes it harder or impossible to write efficient code in some cases, like checking for integer overflow?
Even if your goal is performance over anything, there are a whole lot of undefined behaviors that have absolutely zero performance benefit.
> What do we do about the fact that undefined behavior actually makes it harder or impossible to write efficient code in some cases, like checking for integer overflow?
This is a good example of why we should be moving away from C. :) Signed overflow being undefined is basically necessary due to a self-inflicted wound from 1978: the fact that "int" is the easiest thing to type for the loop index when iterating over arrays. Nobody is going to go through and fix all the C code in existence to use unsigned, so we're stuck with that UB forever. The realistic solution is to start migrating away from C (and I have no illusions about how long that will take, but we'll never get there if we don't start now).
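For concreteness, a minimal sketch of the overflow-check problem: the natural-looking test performs the (undefined) overflowing add first, so a defined check has to compare against the representable range before adding.

    #include <climits>

    // Looks reasonable, but the signed add is UB on overflow, so the
    // compiler may simplify the whole thing to (b < 0), which never
    // detects overflow for positive b.
    bool add_overflows_naive(int a, int b) {
        return a + b < a;
    }

    // Defined everywhere: check against INT_MAX / INT_MIN before adding.
    bool add_overflows(int a, int b) {
        if (b > 0) return a > INT_MAX - b;
        return a < INT_MIN - b;
    }

(GCC and Clang also provide __builtin_add_overflow, which maps straight onto the hardware overflow flag, but that's an extension rather than portable C.)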
> Even if your goal is performance over anything, there are a whole lot of undefined behaviors that have absolutely zero performance benefit.
Sure. But compilers don't exploit those in practice (because compiler authors are not language lawyers for fun), so they're basically harmless in practice. They're essentially just spec bugs for the committee to fix in the next revision.
It just seems like such a waste to abandon C instead of smacking compiler/spec writers and getting them to specify behavior that was de-facto specified for decades.
Totally disagree, and it's disappointing to see you suggest "smacking compiler writers" after this discussion which clearly established why the change you suggested does not work. This is not the fault of compiler writers like Chris, at all. They understand how the entire set of optimization passes fits together and do not throw in optimizations for no reason. They are doing what needs to be done to get performance out of a language written in 1978 with design mistakes. They are doing this because their customers demand it.
This entire discussion is predicated on the idea that undef values having a live range is something that actually prevents bugs, which I find extremely dubious. This is not like the "null dereference is UB" thing in the Linux kernel (which I also found to be overstated in importance, but it has caused at least one problem). I can't think of a single bug that this has caused, other than a compiler bug in Rust that was only exposed by people intentionally trying to write code that would break--that is, the bug had no practical effect. It strikes me as "friendly C lawyering" for little benefit.
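(For reference, the shape of that kernel issue, as a minimal sketch rather than the actual code: the pointer is dereferenced before the null check, so the compiler may conclude the check is dead and delete it.)

    struct device { int flags; };

    int device_flags(const device* d) {
        int f = d->flags;    // dereference happens first; if d were null,
                             // this is already undefined behavior...
        if (d == nullptr)    // ...so the compiler may assume d != nullptr
            return 0;        //    and remove this check and early return
        return f;
    }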
> They are doing what needs to be done to get performance out of a language written in 1978 with design mistakes.
Then smack the spec writers for never removing undefined behavior. You don't have to smack both, but the two groups are working together to make C more dangerous every day.
> They are doing this because their customers demand it.
If you show a chart of how often the customer's code will stop working because of these optimizations, perhaps they will stop demanding it. But even if it's implemented, there should be a --reduce-undefined-optimization flag.
> I can't think of a single bug that this has caused
If it was defined, then the Debian RNG bug likely would not have happened.
If an undefined behavior is rare, that's a good reason to flatly remove it from the spec. If we could have 10 undefined behaviors instead of 200, the world would be better off. There are undefined behaviors you can hit during preprocessing, for crying out loud.
(Also even in the most extreme version, you wouldn't need to count it as live until the first access, and if that's the last access then there's no live range at all.)
> If it was defined, then the Debian RNG bug likely would not have happened.
No. The Debian bug was not the result of a compiler optimization based on undefined behavior. It was due to OpenSSL doing something dodgy [1]. The compilers, being conservative, all declined to do any optimizations based on the undefined behavior. It was Valgrind that called OpenSSL out for the uninitialized use of memory. This is not a bug in Valgrind either! If the C standard defined live ranges for uninitialized values, then Valgrind would still be right to flag uses of uninitialized values, because they usually indicate bugs, and Valgrind is a bug finding tool. The Debian bug occurred because OpenSSL was doing something silly, Valgrind was doing its job properly pointing out that silly thing that OpenSSL was doing, and the fix was dangerously incorrect.
They got a report of dodgy code. Looking at the code, it was intentional but unimportant and also violated the C standard. So they removed it, and they removed it badly.
If it had been valid by the C standard, it is not unlikely that they would have kept it. Then there would have been no disaster. They would not have gotten confused about what was valid and what wasn't.
Extra entropy to mix in isn't that silly.
Edit: In other words, making "Is this code valid?" more complex to answer can have bad effects even in the absence of aggressive compilers.
> Even specifying that they have a new arbitrary value on each access would be a big improvement over the status quo.
This is approximately what LLVM describes undef as, and it indeed leads to nasal demons. This description enables a single value to both pass a bounds check, and then subsequently go out of bounds!
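A minimal sketch of that failure mode (illustrative; whether a particular compiler actually miscompiles it depends on the optimizer):

    // 'i' is never initialized. Under undef semantics, every read of it
    // may independently be any value.
    int lookup(const int* table, int n) {
        int i;                    // uninitialized
        if (i >= 0 && i < n)      // read #1 may appear to pass the check...
            return table[i];      // ...while read #2 indexes out of bounds
        return -1;
    }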
You're right, in certain cases it would still cause trouble, but it would be a lot fewer cases than 100%. Reading only once would be safe, and passing it to another function would produce a value that can no longer change unexpectedly.
It's far more likely for a major compiler to exploit more undefined behavior for optimizations than for a processor with a bizarrely sized byte to become popular. The major compilers already do this.
> The reality is unless you are using a formally-defined language like ML, you are relying on undefined behavior in your language.
This is not true, not using the definition of "undefined behavior" provided in the C standard.
"while ignoring that it's caused many problems already"
Really? I believe Mr. Hipp is claiming that it hasn't. Do you have evidence to the contrary?
I don't think the argument is that these cases should be ignored -- they are now corrected, after all. It is that most of these cases should be treated as low priority compared to issues that are creating observable problems.
I take kbenson to be saying that what Mr. Hipp claims isn't a problem for SQLite is or has been a problem for other projects. That is, other projects have been bitten by relying on undefined behavior (such as what a variable's initial value might be) that may have been consistent across a variety of implementations, and then suddenly wasn't. Perhaps it's safe to say that today we don't need to address these issues on this project in this language, but that's taking on technical debt and setting ourselves up for work in the future (that may be difficult to identify at that time).
That was exactly my point, and I think the alternative interpretation makes no sense in light of the portion of my comment saying "just not that they've noticed or that have affected them."
> Perhaps it's safe to say that today we don't need to address these issues on this project in this language, but that's taking on technical debt and setting ourselves up for work in the future (that may be difficult to identify at that time).
Yes, that is succinctly saying something I was just implying. Even if a current C program is verified as having absolutely no problems with any current compiler, given the way it uses undefined behavior and the way the compilers interpret it, it's not safe to assume it will remain in that state, by the nature of the problem being examined. Undefined behavior is undefined, and thus may change. Now, any language may decide to change how something works, so every program has to deal with this at some level, but again by the nature of this problem, there's much less assurance that the problem won't be a small, subtle change, possibly in one compiler, which is missed until it's widespread.
This would be much less of an issue if there were a specific subset of C, enabled with a flag, that defined most of the undefined behavior. It would probably hurt portability in some respects, but I hope we've finally reached a place where portability is accepted as secondary to security.
I was making a general point using his statement as the impetus, not a specific point in the case of SQLite. Undefined behavior (specifically, developers relying on it being consistent) has caused many problems in the past, and will undoubtedly do so in the future. Relying on behavior which by its definition is impossible to rely on is not a situation I think we should be defending.
> The disagreement is not over whether or not [undefined behavior] is a problem, but rather how serious of a problem. Is it like “Emergency patch – update immediately!” or more like “We fixed a compiler warning” or is it something in between. –Hipp
In this specific case with SQLite it certainly is more of a fixed-compiler-warning. However, you could imagine some undefined behavior in an SSL implementation requiring the emergency-patch level as well. I agree with you that undefined behavior "in most cases should be treated as low priority".
Can you suggest what that change might be related to? I, too, think it's less likely than one might think.
This discussion seems to miss a couple of things. SQLite is embedded quite often, by programs written in C; would embedding a Rust library and possibly runtime fix things? The parent program would still potentially have defects and those defects could impact SQLite.
SQLite is also rather mature; I don't see how you can compare a rewrite to mature code. Take OpenSSL as an example, they aren't tossing it, they are fixing it, it's a much more shallow lift to fix it.
I'm all for some big Rust programs to prove its case though: a mailer, a DNS daemon, some sort of database. Something useful cut from whole cloth, and ideally something we have historically not done well. I don't think a rewrite is it though.
> Can you suggest what that change might be related to? I, too, think it's less likely than one might think.
Aliasing rules[1]? Compilers apparently don't agree on that now, so even if they were to coalesce on the same treatment of that undefined behavior, at least one of them would have to change.
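For example, the classic type-punning pattern, which compilers differ on in how aggressively they exploit (a sketch; memcpy is the defined route):

    #include <cstdint>
    #include <cstring>

    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "illustration assumes a 32-bit float");

    // Violates strict aliasing: reads a float object through an unrelated
    // pointer type. An optimizer doing type-based alias analysis may assume
    // the two pointers never refer to the same memory.
    std::uint32_t bits_punned(float f) {
        return *reinterpret_cast<std::uint32_t*>(&f);
    }

    // Defined: copy the object representation instead.
    std::uint32_t bits_copied(float f) {
        std::uint32_t u;
        std::memcpy(&u, &f, sizeof u);
        return u;
    }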
> This discussion seems to miss a couple of things. SQLite is embedded quite often, by programs written in C; would embedding a Rust library and possibly runtime fix things?
I'm not advocating for Rust as much as I'm advocating against C, or at least against C as it currently exists and is implemented with so much undefined behavior. I've argued elsewhere in this discussion that a special subset of C with as much undefined behavior as possible specifically defined, which could be enabled through a flag, would do wonders (even if at the expense of some portability).
> Take OpenSSL as an example, they aren't tossing it, they are fixing it, it's a much more shallow lift to fix it.
It's much more shallow to review and patch it. Let's not kid ourselves that it will be fixed when they are done with it. There will likely be bugs regardless of the language used to implement a crypto library. Does that mean we should ignore when one language allows an entire class of bugs that another does not, especially when it's for a crypto library?
Let me make my case another way. What if C never caught on as the dominant language, or C was originally defined in a stricter manner, and most of the utilities and tools we take for granted were instead implemented in a language that had fewer undefined corners, allowed less undefined behavior, and thus had fewer bugs and security problems? Would the resultant small performance hit (due to being unable to optimize around the undefined behavior) outweigh the added stability and security, or would we have been better off? I think we would have been better off, and I think, since C is still widely in use, it's not too late to make that case.
Undefined behavior is not always identifiable through static analysis. Obviously it can be checked against at runtime, but that's actually quite expensive. It would, for example, include bounds checks for everything, and overflow checks on all signed arithmetic.
That's not the worst of it: the truly intractable part is preventing use-after-free UB. The only ways to do this are (a) remove malloc from your language; (b) add a lifetime system (incompatible with all existing C libraries); (c) add a garbage collector (which most projects written in C will not accept for performance reasons).
I think that's mostly solved by reframing the goal from "remove/disallow all undefined behavior" to "remove as much of the commonly needed undefined behavior as possible, such that most programs need not really use it". Perfect is the enemy of good.
Are we arguing the same thing? If UAF is the cause of security problems, and UAF is undefined behavior now, redefining it under a special mode to mean "this is an error (whether or not the compiler enforces it)" at least clarifies the situation, and lets any included static analysis report an error as needed and as it is able.
That is, I'm not arguing that as much undefined behavior as possible should be made legal, just that it should be defined. That definition may very well be "you are not allowed to do this. Don't do this."
And once you've eliminated that, you have DoS bugs like forcing infinite loops, abandoned (referenced but unused) memory leaks, and worst-case hash table insertions. All of those are serious attacks for anything with a resource budget.