You seem to be arguing "It's not possible to catch all cases of this". I'm arguing "It's possible to catch some cases of this, and those should be caught and fixed".

In this specific case, the compiler saw an offset being added to a pointer, and since pointer overflow is undefined behavior, it inferred that the programmer couldn't possibly have specified undefined behavior, so there must have been no overflow. Since there was "no overflow", the compiler decided that the conditional statement testing for overflow could never be true, and removed that branch.
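
Sketched in code, the pattern is roughly as follows (an illustrative sketch only; the identifiers echo the check quoted later from the bug report, not its actual source):

    #include <stddef.h>

    int validate(char *segmentStart, size_t offset) {
        char *target = segmentStart + offset;  /* if this goes out of bounds,
                                                  behavior is undefined */
        if (target < segmentStart)             /* intended wrap-around check */
            return -1;                         /* compiler deletes this branch:
                                                  overflow "cannot" happen */
        return 0;
    }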

In the specific example submitted to HN, the compiler has a few options:

First, it can assume the programmer is infallible and will never make a mistake, and use that to actually change the code as defined for performance reasons.

Second, it can assume the programmer is fallible, and that without additional information, it is unsafe to alter the code based on assumptions of programmer infallibility for safety reasons.

Finally, it can assume the programmer is fallible, but keep the information around in some manner (tagged as "maybe true") so that it can be reported on later, while not using it for additional optimizations. This keeps the safety of the second option while adding advantages: it would be trivial to note optimizations it could have done, but did not because it could not reason adequately about one or more required assumptions. It would also be trivial to use that same data to note occurrences of statements where it knows undefined behavior may be encountered if the programmer is not vigilant about the inputs (that is, it need not report everything, just what it can definitively find). That is, it's trivial because the hard work of detecting the problem cases is already being done.

Currently many C compilers default to optimizing for performance, the first option above. They could, and many think should, at a minimum default to safety instead. We know people aren't infallible, so acting like they are has no basis in reality.

That's not to say every optimization has to be thrown out. There should be a distinction between what can be assumed because of mathematical properties and constraints the compiler enforces, and constraints the programmer is merely assumed to follow correctly. There are clearly cases where the compiler can optimize based on knowledge it has. If you cast an unsigned char to an int, and add another unsigned char to it prior to doing any other operation, you can assume there will be no overflow. You can assume the same of short on platforms where int is twice as big as a short.
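
For instance, a minimal sketch of that unsigned char case:

    int add_bytes(unsigned char a, unsigned char b) {
        return (int)a + b;  /* each operand is at most 255, so the result is
                               at most 510, which always fits in an int: the
                               compiler can prove no overflow is possible */
    }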

The bottom line is that compilers are assuming the truth of expressions that aren't necessarily true, and that it takes extraordinary effort from programmers to ensure those assumptions hold, as evidenced by the continual discovery of violations in enterprise-grade libraries and applications.




> First, it can assume the programmer is infallible and will never make a mistake, and use that to actually change the code as defined for performance reasons.

You keep saying that it changed the code. It didn't. It compiled exactly what the code said. You dislike C, and that's OK, but that doesn't mean that a C compiler compiling C code into machine language that is semantically equivalent to the C code is somehow changing the program. It's not.

> That is, it's trivial because the hard work of detecting the problem cases is already being done.

No, it's not, you are still committing the same fallacy as before. The compiler doesn't "collect knowledge about undefined behaviour in the program", because that is useless knowledge for the compiler.

The compiler collects knowledge that helps it reason about the program. A big part of that is tracking the range of values variables can take on. That is information that is useful in selecting how to express certain code in machine code. Undefined behaviour plays into this because operations that are defined to have undefined behaviour have no results that need to be considered as possible values of the variable that the result is stored in. So, if the compiler sees an if(x>0&&y>0){ x+=y;x/=2; }, with x and y being ints, it can derive that x will be positive, and therefore, for example, the division can be compiled to a right-shift.
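
Spelled out and annotated, that example is:

    int halve_sum(int x, int y) {
        if (x > 0 && y > 0) {
            x += y;  /* signed overflow is undefined, so the compiler may
                        assume the sum did not wrap: x stays positive */
            x /= 2;  /* with x known positive, this can compile to a plain
                        right-shift; signed division would otherwise need a
                        fix-up for negative values */
        }
        return x;
    }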

There is no code in the compiler that goes "well, there is this pointer dereference, so let's remove the NULL-check branch". It's rather that the dereference limits the range of possible values of the pointer to non-NULL values, which a later stage then uses to determine that the branch cannot ever be taken, and thus can be eliminated. The same "knowledge" could be inferred from an assignment of a constant, for example, or from a preceding check ... the compiler tracks the value, not whether undefined behaviour could happen.
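
A minimal sketch of that two-stage reasoning:

    int load(int *p) {
        int v = *p;      /* dereference: p's tracked value range now excludes
                            NULL, since dereferencing NULL is undefined */
        if (p == NULL)   /* a later pass sees the range excludes NULL,  */
            return -1;   /* so this branch is dead and gets eliminated  */
        return v;
    }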

> Currently many C compilers default to optimizing for performance, the first option above. They could, and many think should, at a minimum default to safety instead. We know people aren't infallible, so acting like they are has no basis in reality.

That's completely tautological. Every programming language assumes the programmer to be infallible. Every compiler and interpreter does what the program means according to the language spec, and if the programmer fails to express what they mean in the language, the program will do the wrong thing.

Now, there is an argument to be had over which language semantics are easier to reason about than others, and over how to construct languages that make reasoning about code as easy as possible. Some of that can be applied by adding additional restraints in a C compiler on top of the language specification, to make the language as understood by the compiler easier to reason about (while still staying within what the C spec defines, so as to stay compatible with existing, correct C code).

But there are two major problems with your reasoning here:

(1) Distinguishing programming mistakes from legitimately optimizable code is far harder than you think. You are just handwaving through that part, but that's actually the hard part. You will either miss a lot of optimization opportunity, or you will catch close to none of the relevant mistakes. If you think you have a solution that tells those cases apart much better than current compilers do, please write a paper about it; compiler writers certainly will be interested.

(2) Performance is actually kind of central to C. If you don't need performance, you probably should just not be writing C in the first place. And if you actually need performance, just erring on the side of safety isn't necessarily gonna cut it. The question is not whether you could add all of the safety features of, I dunno, Python, to C. The question is what you would expect the result to look like. There is a reason why C code tends to be faster than Python, and part of that is the lack of safety.

To maybe get an idea of why compilers exploit the fact that signed overflow is undefined behaviour, this article seems to give a good overview: https://kristerw.blogspot.de/2016/02/how-undefined-signed-ov...
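
One classic illustration of the benefit (a sketch in the spirit of such articles, not a quote from it):

    void fill(int *a, int n) {
        /* If signed overflow wrapped, "i <= n" with n == INT_MAX would loop
           forever. Treating overflow as impossible, the compiler knows the
           loop runs exactly n + 1 times, which enables optimizations such
           as widening i to a 64-bit register on x86-64. */
        for (int i = 0; i <= n; i++)
            a[i] = 2 * i;
    }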


> You keep saying that it changed the code. It didn't. It compiled exactly what the code said.

From the submitted article: "Thus, the compiler removes this part of the check." It did not compile what the code said, it compiled what it determined it needed to compile. That determination included an assumption about what values a variable could hold, based on the premise that overflow, being undefined, could not have happened. That's the whole point.

The code specified a test of whether "target < segmentStart", and the compiler determined that test could never be true and removed it. We have in this bug report direct evidence that the compiler was too aggressive in its assumptions, as the condition is indeed possible. It was too aggressive for exactly the reasons I have been going over: the condition can only be "never true" as long as the programmer protects the actual values being used from being large enough to cause an overflow in the prior statement.

> It's rather that the dereference limits the range of possible values of the pointer to non-NULL values, which a later stage then uses to determine that the branch cannot ever be taken, and thus can be eliminated.

The check in question (in the compiler in question) specifically uses an assumption that the programmer will prevent an overflow which would be undefined. That is the assumption of infallibility I'm referring to.

> That's completely tautological. Every programming language assumes the programmer to be infallible.

No, they don't. If they did, Java, Rust and just about every dynamic language would never do bounds checks. One of the main reasons for a type system is to force the programmer to follow rules to prevent mistakes.

> Distinguishing programming mistakes from legitimately optimizable code is far harder than you think.

At this point I'm referring to a specific type of optimization that they are doing that I think rests on shaky presumptions. Removing that optimization is not hard work. It may be hard for the community to stomach, depending on how much performance impact it has.

> Performance is actually kind of central to C. If you don't need performance, you probably should just not be writing C in the first place. And if you actually need performance, just erring on the side of safety isn't necessarily gonna cut it.

I gave a specific example for how to get the same performance if this particular type of optimization was made more conservative. Performance is important, but when it comes to performance or correctness, correctness should win. Full stop.

Very little of what I'm referring to at this point is theoretical. I'm referring to real-world situations, mostly the one these comments are in response to. Your comments seem to indicate you think this situation isn't possible. Can you clarify whether you think the bug report is wrong, or whether I'm incorrect in my assessment of what the bug report is saying, or whether I'm misinterpreting your point? At this point, I'm under the impression that much of what I'm stating is fact, so I'm not sure how to interpret statements such as "It compiled exactly what the code said." as anything but wrong, but that's not getting us anywhere.


> It did not compile what the code said,

Yes, it did--that seems to be your fundamental confusion.

What that code means is determined by the specification of the C language, and only the specification of the C language. You constantly imply stuff that you think, or hope, or would prefer the code means, but that is completely irrelevant to the question of what the code actually means. Just because some intuitive reading of the characters that make up the code makes you assume it should mean a certain thing does not make it so.

That code does not mean "check for overflow", no matter how much you wish it did. And because it doesn't mean that, the compiler didn't translate it as that either.
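
For contrast, here is one well-defined way to express that intent (a sketch; segmentEnd is a hypothetical pointer to one past the end of the segment):

    #include <stddef.h>

    /* Check the offset against the available space *before* the addition. */
    char *locate(char *segmentStart, char *segmentEnd, size_t offset) {
        if (offset > (size_t)(segmentEnd - segmentStart))
            return NULL;                 /* would point past the end */
        return segmentStart + offset;    /* now provably in bounds */
    }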

> No, they don't. If they did, Java, Rust and just about every dynamic language would never do bounds checks. One of the main reasons for a type system is to force the programmer to follow rules to prevent mistakes.

You are completely missing the point, essentially due to the same confusion as above. I didn't say that those languages don't have bounds checks. I said that they assume that the programmer is infallible. Every programming language specifies exactly what each syntactic construct means, which syntactic constructs don't mean anything, what the runtime behaviour is, and where it is undefined. That is what makes a programming language a programming language. It is the job of the programmer to translate what they mean into the syntax of the respective programming language. If the programmer makes a mistake in this translation, the program will be wrong, and it will not do what the programmer meant it to do, no matter which programming language they are using--in that sense, every programming language expects the programmer to be infallible.

The difference between programming languages is not whether they allow you to make mistakes (they all do, and always will), but how difficult it is (mentally) to avoid making them.

> I gave a specific example for how to get the same performance if this particular type of optimization was made more conservative. Performance is important, but when it comes to performance or correctness, correctness should win. Full stop.

That's completely beside the point. Nobody is saying we should have incorrect code (well, OK, some misguided people probably do, but they aren't really part of this discussion). The question is how we are going to achieve that, and that is ultimately a question of economics: what is the easiest/cheapest way to get the greatest amount of software into a state where its execution matches what the programmer intended? Just claiming that we should throw infinite resources at the problem doesn't actually make it disappear.

> Can you clarify whether you think the bug report is wrong, or whether I'm incorrect in my assessment of what the bug report is saying, or whether I'm misinterpreting your point?

Really none of those, I think. I think the way you think about the problem is just confused, which makes it difficult to nail down why exactly your suggested solutions aren't really solutions.

> I'm under the impression that much of what I'm stating is fact, so I'm not sure how to interpret statements such as "It compiled exactly what the code said." as anything but wrong, but that's not getting us anywhere.

I hope I maybe managed to explain that above? I think that's really at the core of your confusion: You are mixing up what you intuitively think things mean and what things mean according to the appropriate formal definition in the respective context. But code in particular does not mean anything, except for what the formal specification of the respective language defines, and that can deviate arbitrarily far from your intuitive understanding.

It's a bit like false friends in natural languages: just because you know a word from one language doesn't mean the same word cannot mean something completely different in another language, and it's just confused to use the vocabulary of one language to determine the meaning of a sentence in a different language.


> Yes, it did--that seems to be your fundamental confusion.

No, it compiled what it determined it had to, based on the C standard. There is a difference. The code, as written, specified a certain set of actions to be taken. The compiler determined that some of those directions need not be translated to machine code, and thus did not translate them, but they were specified nonetheless.

To say that the compiler did not remove any code, or directions to be carried out, when translating to machine code, is to subscribe to a torturous and unuseful definition of the terms we have been using.

Of the actions specified by the programmer in the source file, one was optimized out in the translation of that source specification to machine code. This change alters the execution path of the program when it is present, to such a degree that without the optimization the program would halt almost immediately, but with the optimization it performs an out-of-bounds memory access.

We are not arguing whether the C standard allows this. We are arguing whether the C compilers should do this. There is a distinct difference. Stating that no code was removed has been extremely unhelpful to this conversation, regardless of whether you think it is a technically correct statement. In the generated machine code, a condition of a branch statement does not exist in the version with optimization, but does without it.

This particular optimization relied on a statement that, depending on values not knowable to the compiler at compilation time, may or may not have resulted in undefined behavior. That makes it a poor optimization to carry out.

> Just because you know a word from one language doesn't mean the same word cannot mean something completely different in another language, and it's just confused to use the vocabulary of one language to determine the meaning of a sentence in a different language.

Perhaps you could actually address a point I've made instead of arguing over the words used. You are arguing over a technicality of the wording instead of the topic at hand.

Feel free to reply, I'll read it, but I'm done with this conversation beyond that.


> The code, as written, specified a certain set of actions to be taken.

That's a nonsensical statement. "The code, as written" doesn't have any meaning, other than perhaps what you make up in your mind, which is not a useful reference for discussion, unless you also explain what you interpret it to mean.

I understand that maybe you do not actually mean this literally, and that you are perhaps just using somewhat imprecise language to get the idea across--the problem is that the details you are not spelling out are exactly what this discussion is all about.

> To say that the compiler did not remove any code, or directions to be carried out, when translating to machine code, is to subscribe to a torturous and unuseful definition of the terms we have been using.

No, quite to the contrary. Those definitions might not be useful for day-to-day programming work, but they are exactly the definitions that you need to clearly discuss compiler behaviour, because those are the definitions that the compiler is using, and the compiler is using those definitions because they match the concepts of how you build a compiler.

> Of the actions specified by the programmer in the source file, one was optimized out in the translation of that source specification to machine code. This change alters the execution path of the program when it is present

No, there is no "change"; that's just confused language. There is a difference between compilation results, but neither of them is in any way the "real" thing with the other being "changed"; they are both equally valid mappings from C to machine code, with one arguably being closer to the intention of the programmer and thus maybe more useful in this specific case.

> We are not arguing whether the C standard allows this. We are arguing whether the C compilers should do this.

The problem is that those are inextricably interlinked, because the compiler must still stay within the bounds of the standard, and still produce code with reasonably good performance.

> Stating that no code was removed has been extremely unhelpful to this conversation, regardless of whether you think it is a technically correct statement.

The point is not that it's a technically correct statement, the point is that that's not necessarily how the compiler "thinks", so it's often unhelpful in discussing compiler behaviour to talk about "removing code".

> In the generated machine code, a condition of a branch statement does not exist in the version with optimization, but does without it.

It just so happens that in this case, the compilation result without optimization was closer to the programmer's intention than the one with it. But the usefulness of this observation is severely limited, because in other cases the exact opposite could be true: the programmer wrote something different from what they meant, and the compiler, in some situation, produced code that still matched the intention of the programmer ...

> This particular optimization relied on a statement that, depending on values not knowable to the compiler at compilation time, may or may not have resulted in undefined behavior. That makes it a poor optimization to carry out.

Except that if a C compiler avoided all optimizations for which this is true, a lot of code would be a lot slower. You seem to be seeing only some specific cases for which the performance difference is negligible and the risk of the optimization is obvious to you, and your imprecise use of language doesn't make discussing this any easier. What you don't seem to realize is how much perfectly safe optimization a C compiler does that it cannot easily, if at all, distinguish from this arguably dangerous case. That is why the compiler can only choose between producing unnecessarily slow code in many cases, or using the current strategy and occasionally producing code that does something other than what the programmer had in mind.

> Perhaps you could actually address a point I've made instead of arguing over the words used. You are arguing over a technicality of the wording instead of the topic at hand.

Your point is incoherent because you are using imprecise language, which makes it difficult to address. That's why I am addressing your imprecise use of language first.



