There are four main rules about pointers in C. The first two are pretty basic rules that shouldn't be controversial:
* You cannot use a pointer outside of its lifetime (e.g., use-after-free is UB).
* You cannot advance a pointer from one object to another object (so out-of-bounds is UB, even if there is another live object there).
The third rule is one that causes issues, but needs to exist given how C code works in practice:
* The pointer just past the end of the object is a valid pointer for the object, but it cannot be dereferenced. It may be identical in value to a pointer for another object, but even then, it still cannot be used to access the second object (a short sketch follows this list of rules).
The final rule is simultaneously necessary for optimization to occur, not explicitly stated in C itself, and stated only vaguely here, in large part because trying to come up with a formal definition is insanely challenging:
* You cannot materialize a pointer to an object out of thin air; you have to be "told" about it somehow.
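To make rules 2 and 3 concrete, here's a minimal sketch; the layout question in the comment is exactly that, a question, not something the standard promises:

    #include <stdio.h>

    int main(void) {
        int a[4] = {0};
        int b = 7;

        int *end = a + 4;        /* one-past-the-end: legal to form (rule 3) */
        /* *end = 1; */          /* but dereferencing it would be UB */
        /* int *oob = a + 5; */  /* and merely forming this pointer is UB (rule 2) */

        /* end may or may not compare equal to &b, depending on how the
           implementation lays out a and b; even if it does, end still may
           not be used to access b. */
        if ((void *)end == (void *)&b)
            printf("same value, still not a pointer to b\n");

        return 0;
    }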
The immediate corollary of rule 4, its most obvious instantiation: if a variable never has its address taken, then no pointer may modify it without triggering UB. And that's why the rule is necessary to state: without it, anything that might modify memory would be a complete barrier to optimization. In a language without integer-to-pointer conversions, there is no way to violate this rule without also violating rules 1-3. But with integer-to-pointer conversions, it is possible to adhere to rules 1-3 and still violate this rule, and thus it becomes an important headache for any language that permits this kind of conversion.
So how do we actually give it a formal semantics? Well, the first cut is the simple rule that no pointer may access a variable whose address is never taken. Except that's not really sufficient for optimization purposes; in LLVM, all variables start out with their address taken, so the optimizer needs to reason about when all uses of the address are known. So you take it to the next level and rule that, so long as the address doesn't escape and you can therefore track all known uses, it's illegal for anyone to come up with any other use. So now you need to define escaping, and the classic definition turns out to describe a data-dependence relationship.
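Roughly, the distinction looks like this; external here is a hypothetical function defined in some other translation unit:

    void external(int *);            /* hypothetical: defined elsewhere */

    void demo(void) {
        int a = 1;                   /* address never taken: every use of a is visible here */
        int b = 2;
        int *pb = &b;                /* address taken, but pb never leaves this function,
                                        so the compiler can still enumerate every use of b */
        int c = 3;
        external(&c);                /* &c escapes: from here on, unknown code may hold a
                                        pointer to c, and writes through unknown pointers
                                        may alias it */
        *pb = a + c;
    }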
Let me take a little detour. In the C++11 memory model, one of the modes that was introduced was release/consume, which gives release/acquire-style ordering to any access that is data-dependent on the consume load. It was added to model dependent loads, which need a fence only on Alpha processors. It turns out that no compiler implements this mode; all of them simply strengthen consume to acquire. That's because implementing release/consume would require disabling every optimization that might not preserve data dependence, of which there is a surprising number. You could get away without doing that if you first proved that the code wasn't in a chain that required preserving data dependence, but that's not really possible for any peephole-level optimization.
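For reference, this is the kind of pattern release/consume was meant to serve; a minimal C11 sketch (in practice, as noted, compilers just treat the consume load as an acquire load):

    #include <stdatomic.h>

    struct node { int payload; };

    static _Atomic(struct node *) shared;

    /* Writer: initialize the node, then publish it with a release store. */
    void publish(struct node *n) {
        n->payload = 42;
        atomic_store_explicit(&shared, n, memory_order_release);
    }

    /* Reader: the dereference n->payload is data-dependent on the consume load,
       so it is ordered after the writer's initialization without a full acquire
       barrier on most hardware (Alpha being the notorious exception). */
    int read_payload(void) {
        struct node *n = atomic_load_explicit(&shared, memory_order_consume);
        return n ? n->payload : -1;
    }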
And this is where the tension really comes into play. For pointers, it's easy to accept that preserving data dependence is necessary, and to special-case them accordingly. But the semantics you need in order to adhere to rule 4 now says you must do the same for integers, which is basically a non-starter for many optimizations. So the consequence is that the burden of the mismatch needs to lie on integer-to-pointer conversions (which, as I've established before, are already the element that causes the pain in the first place; additionally, in terms of how alias analysis is computed inside the compiler, they're also where you're going to be dealing with the fallout anyway).
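A tiny example of why extending dependence-tracking to integers is a non-starter: perfectly ordinary integer simplifications erase the dependence. The function below is only an illustration:

    #include <stdint.h>

    int table[16];

    int lookup(uintptr_t i) {
        /* Syntactically the index depends on i, but any compiler will fold
           (i - i) to 0, and with it the dependence disappears. Semantics that
           attached provenance (or consume-style ordering) to integers would
           have to forbid this kind of rewrite. */
        return table[i - i];
    }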
In summary, as you work through the issues to develop a formal semantics, you find that a) pointers need to have some sort of provenance; b) compilers are unwilling to give integers provenance; c) therefore pointers aren't integers, and everything that assumes they are is wrong (this affects both user code and compiler optimizations!); and d) this is all really hard, at the level of needing academic research into semantics.
Is N2676 the final word on pointer provenance? No, it's not; as I said, this is hard and there's still more research that needs to be done on the different options. The status quo, in terms of semantics, is broken. The solution needs to minimize the amount of user code that is broken. Maybe N2676 is that solution; maybe it isn't. But refusing to clarify the situation is unacceptable, and suggests to me a lack of comprehension of the (admittedly complex!) issues involved.
To clarify my position, I am not advocating for the situation not to be clarified. What I am advocating is that the resolution be that pointer provenance is severely limited: roughly speaking, pointers cast to/from integers might alias, and memory-based optimizations should not be permitted when two pointers have been converted to/from integers. I further claim that something like this is the only way to resolve the ambiguity that is consistent with the charter. Additionally, getting rid of pointer provenance (or equivalently: well-defining the behavior of the various integer/pointer conversions) is the only way to, as you want, "minimize the amount of user code that is broken", because much user code assumes that integer/pointer conversions behave according to what TFA calls the concrete semantics.
I dispute that there is a "need" for pointer provenance, so much as a desire on the part of compiler developers to honor the sunk cost of various optimizations that relied on particular interpretations of the ambiguity.
> memory-based optimizations should not be permitted when two pointers have been converted to/from integers.
Just to be clear, what you are saying is that it is not legal to transform this function:
int foo() {
    int x = 5;
    bar();
    return x;
}
into this function:
int foo_opt() {
    bar();
    return 5;
}
And the end result of disallowing that kind of optimization is that effectively any function that transitively calls an unknown external function [1] gets -O0 performance. Note that one of the memory optimizations is moving values from memory to registers, which is an effective prerequisite of virtually every useful optimization, including the bread-and-butter optimizations that provide multiple-× speedups.
I suspect many people--including you--would find such a semantics to have too wide a blast radius. And as soon as you start shrinking the blast radius, even to cover such "obvious" cases as variables whose address is never taken, you introduce some notion of pointer provenance.
That's how we arrived at the current state. We've given everything the "obvious" semantics, including making the "obvious" simplifying assumptions (such as a variable whose address is never taken not being reachable through pointers). But we have discovered--and this took decades to find out, mind you--that using "obvious" semantics causes contradictions [2].
[1] An unknown external function must be assumed to do anything that it is legal to do, so a compiler is forced to assume that it is converting pointers to/from integers. And if that is sufficient to prohibit all memory-based optimizations, then it follows that calling an unknown external function is sufficient to prohibit all memory-based optimizations.
[2] And just to be clear, this isn't "this is breaking new heroic optimizations we're creating today", this is a matter of "30-year-old C compilers aren't compiling this code correctly under these semantics."
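As a concrete illustration of [2], here is a variant of the much-discussed example from the provenance papers; whether the branch is even taken depends on how the implementation lays out x and y:

    #include <stdio.h>

    int x = 1, y = 2;

    int main(void) {
        int *p = &x + 1;      /* one-past-the-end of x: legal to form (rule 3) */
        int *q = &y;
        if (p == q) {         /* may be true if y happens to sit right after x */
            *p = 11;          /* under concrete semantics this stores to y */
            printf("y = %d\n", y);
        }
        return 0;
    }

Mainstream compilers have been observed to fold the comparison, or to keep using the old value of y after the store, on the grounds that a pointer derived from &x cannot alias y; under a concrete, provenance-free reading, both of those are miscompilations.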
I would say that that transformation is legal; it doesn't even involve pointers. I don't think anything I said precludes escape analysis. How would provenance come into play here?
#include <stdint.h>

void bar() {
    int y = 0;
    int *py = &y;
    uintptr_t scan = (uintptr_t)py;
    while (1) {
        scan++;
        char *p = (char *)scan;
        /* look for the byte pattern of a (little-endian, 4-byte) int with value 5 */
        if (p[0] == 5 && p[1] == 0 && p[2] == 0 && p[3] == 0) {
            *(int *)p = 3;
            break;
        }
    }
}
This code will scan the stack looking for an int whose value is 5 and replace it with 3. It's only undefined behavior if there's some notion of provenance: there's no pointer arithmetic; the stepping happens entirely on integers. There's not even a strict-aliasing violation (since char can read anything). And yet, this code is capable of changing the value of x in foo to 3.
> I don't think anything I said precludes escape analysis. How would provenance come into play here?
Could another approach be taken, where local variables are considered implicitly “register”? In that case this simple example has no problem whatsoever. The issue does still arise, unnecessarily, if the address of a local is taken but does not escape, but that ought to be rare.
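For what it's worth, register already expresses roughly that guarantee today, since taking the address of a register variable is a constraint violation; a sketch:

    int sum10(void) {
        register int acc = 0;       /* no pointer anywhere can alias acc... */
        /* int *p = &acc; */        /* ...because this line would be rejected outright */
        for (int i = 0; i < 10; i++)
            acc += i;
        return acc;
    }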
It simplifies code generation. LLVM has no address-of operator; instead, every object (be it a global variable or a local alloca) is represented by its address, and "taking the address" just means using that address directly rather than loading the value stored at it. Also, this means that all lvalues start out living in memory, which dramatically simplifies the bookkeeping needed for lvalues that aren't easily representable as registers (e.g., struct values).
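A rough picture of what that means for a trivial function; the IR in the comments is approximate, but the shape (alloca, store, load, then promotion by mem2reg/SROA) is the standard pipeline:

    int plus_one(int n) {
        int result;          /* -O0 lowering: %result = alloca i32                */
        result = n + 1;      /*               store i32 %add, ptr %result         */
        return result;       /*               %r = load i32, ptr %result ; ret %r */
        /* mem2reg later promotes %result to an SSA value once it can prove the
           address is only ever used by these loads and stores (i.e., it never
           escapes), which is exactly the rule-4 reasoning described above. */
    }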