A Provenance-aware Memory Object Model for C [pdf] (open-std.org)
56 points by signa11 41 days ago | 41 comments

This proposal, specifically its decision not to use the "concrete semantics", violates several of the stated principles in the C2X charter [0].

Specifically, "1. Existing code is important, existing implementations are not": the document relies heavily on what mainstream compilers currently do, making things easier for current implementations rather than for existing code.

"3. C code can be non-portable" the proposed document seems to preclude the use of C as a "high-level assembler" in that it makes it seemingly undefined to do many simple conversions between a machine word and a pointer.

"4. Avoid “quiet changes.”" This seems to be a quiet change, depending on how one interprets the current ambiguity.

"6. Keep the spirit of C" Specifically (a) Trust the programmer. (b) Don't prevent the programmer from doing what needs to be done. (c) Keep the language small and simple.

"8. Codify existing practice to address evident deficiencies." The document acknowledges that existing code and compiler behavior have diverged.

"11. Maintain conceptual simplicity." This is clearly much more complicated than the "concrete semantics" discussed and previously assumed given the prior ambiguity.

In contrast, I don't see anything in the charter indicating the standards committee should try to retrofit existing implementations' overzealous optimizations onto the language spec. The closest are 6(e) "make it fast", but the explanation there is talking about something different than opportunistic optimizations on a fixed architecture; and "5. A standard is a treaty between implementor and programmer", but that appears to be talking about just the minimum sizes of various integer types.

[0] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2086.htm

For what it's worth, I agree with a lot of the criticism you raise. But I think it's a bit misdirected to blame this proposal - the C implementors and standards people went down this road a long time ago, and arguably the proposal cleans it up and puts it on a sounder footing.

What we now realize is that the C abstract machine is a much more complicated and slippery idea than was originally hoped. The strictly typed world does not blend smoothly with an assembly-like model where pointers are simply integers used to index into memory. But staying strictly in the latter model is also not viable, as it precludes some optimizations that we really want to do.

Incidentally, this is one of the big things holding up a formal spec for Rust as well, as Rust also has a strictly typed fragment (even more so than C) but C-like pointers as well. The semantics of all this more or less inherit from LLVM, but as we're finding out, LLVM is not particularly sound when it comes to these difficult cases, namely converting back and forth between pointers and integers, and doing nontrivial manipulation of those integers.

> C implementors and standards people went down this road a long time ago

I'm aware of that, but just because something was done before doesn't justify it being done more aggressively in the future. This isn't really an academic or tilting-at-windmills complaint either; most existing code, for better or for worse, uses pointers more broadly than the strict sense the standard allows, and more aggressive optimizations in this area are all but guaranteed to break existing code in dangerous ways.

> But staying strictly in the latter model is also not viable, as it precludes some optimizations that we really want to do.

Does that really make it "not viable" though? I would counter that a strictly typed model is not viable in that it precludes optimizations I want to do, as a programmer! For example: xor linked lists, doing anything with shadowed memory, using packed structures or more generally any misaligned loads on architectures that support it (which is most of the important ones), binary search trees over pointers, parsing nearly any binary file format or network protocol without lots of memcpys... and that's just off the top of my head

>this is one of the big things holding up a formal spec for Rust as well

I'm not familiar with LLVM's internals, but does anything preclude these optimizations being gated by an attribute that could be set (by default) by a Rust frontend but not (by default) by a C frontend? Gating with attributes is already done with e.g. __attribute__((may_alias)), so it would seem that this ought to be doable. Heck, if you give me an opt-in __attribute__((concrete_semantics)) and keep the defaults UB, I'd be more than content.

You forget that basically _all_ int<->ptr casts were undefined before and that compilers already violated the concrete semantics.

But kudos for finding an official source of misinformation.

What’s more, the committee already gave license to implementations to be “overzealous”, as you call them. That did contradict the standard, but they never bothered to fix it.

So it’s not the first time the C committee contradicts itself and doesn’t bother to resolve the contradiction. So yes, you’re entitled to feel you’ve been lied to before, they’re just being honest now.

GCC has been using http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm to justify its use of provenance, even when it contradicts the standard.

There are four main rules about pointers in C. The first two are pretty basic rules that shouldn't be controversial:

* You cannot use a pointer outside of its lifetime (e.g., use-after-free is UB).

* You cannot advance a pointer from one object to another object (so out-of-bounds is UB, even if there is another live object there).

The third rule is one that causes issues, but needs to exist given how C code works in practice:

* The pointer just past the end of the object is a valid pointer for the object, but it cannot be dereferenced. It may be identical in value to a pointer for another object, but even then, it still cannot be used to access the second object.

The final rule is simultaneously necessary for optimization to occur, not explicitly stated in C itself, and one I'm stating vaguely, in large part because trying to come up with a formal definition is insanely challenging:

* You cannot materialize a pointer to an object out of thin air; you have to be "told" about it somehow.

So the immediate corollary of rule 4, the most obvious instantiation of it: if a variable never has its address taken, then no pointer may modify it without reaching UB. And that's why it's necessary to state: without this rule, then anything that might modify memory would be a complete barrier to optimizations. In a language without integer-to-pointer conversions, there is no way to violate this rule without also violating rules 1-3. But with integer-to-pointer conversions, it is possible to adhere to rules 1-3 and violate this rule, and thus this becomes an important headache for any language that permits this kind of transformation.

So how do we actually give it a formal semantics? Well, the first cut is the simple rule that no pointer may access a no-address-taken variable. Except that's not really sufficient for optimization purposes; in LLVM, all variables start with their address taken, so the optimizer needs to reason about when all uses of the address are known. So you take it to the next level and rule that so long as the address doesn't escape and you can therefore track all known uses, it's illegal for anyone to come up with any other use. So now you need to define escaping, and the classic definition suddenly shifts back to describing a data-dependent relationship.

Let me take a little detour. In the C++11 memory model, one of the modes that was introduced was the release/consume mode, which expressed a release/acquire relationship for any load data-dependent on the consume load. This was added to model the cases where you only need a fence on the Alpha processors. It turns out that no compiler implements this mode; all of them pessimize it to a release/acquire. That's because implementing release/consume would require eliminating every optimization that might not preserve data dependence, of which there is a surprising number. You could get away without doing that if you first proved that the code wasn't in a chain that required preserving data dependence, but that's not really possible for any peephole-level optimization.

And this is where the tension really comes into play. For pointers, it's easy to understand that preserving data dependence is necessary, and special-case them. But now your semantics to adhere to rule 4 also says that you need to do the same to integers, which is basically a non-starter for many optimizations. So the consequence is that the burden of the mismatch needs to lie on integer-to-pointer conversions (which, as I've established before, is already the element that causes the pain in the first place; additionally, in terms of how you compute alias analysis internally in the compiler, it's also where you're going to be dealing with the fallout anyways).

In summary, as you work through the issues to develop a formal semantics, you find that a) pointers have provenance, and need to have some sort of provenance; b) compilers are unwilling to give integers provenance; c) therefore pointers aren't integers, and everything assuming such is wrong (this affects both user code and compiler optimizations!); and d) this is all really hard and at the level of needing academic-level research into semantics.

Is N2676 the final word on pointer provenance? No, it's not; as I said, it's hard and there's still more research that needs to be done on different options. The status quo, in terms of semantics, is broken. The solution needs to minimize the amount of user code that is broken. Maybe N2676 is that solution; maybe it isn't. But to refuse clarification of the situation is unacceptable, and suggests to me noncomprehension of the (admittedly complex!) issues involved.

To clarify my position, I am not advocating for the situation not to be clarified. What I am advocating is that the resolution be that pointer provenance is severely limited: roughly speaking, pointers cast to/from integers might alias, and memory-based optimizations should not be permitted when two pointers have been converted to/from integers. I further claim that something like this is the only possible way to resolve the ambiguity that is consistent with the charter. Additionally, getting rid of pointer provenance (or equivalently: well-defining the behavior involved with various integer/pointer conversions) is the only way to, as you want, "minimize the amount of user code that is broken", because much user code assumes integer/pointer conversion happens according to what TFA calls concrete semantics.

I dispute that there is a "need" for pointer provenance, so much as a desire on the part of compiler developers to honor the sunk cost of various optimizations that relied on particular interpretations of the ambiguity.

> memory-based optimizations should not be permitted when two pointers have been converted to/from integers.

Just to be clear, what you are saying is that it is not legal to transform this function:

  int foo() {
    int x = 5;
    return x;
  }
into this function:

  int foo_opt() {
    return 5;
  }
And the end result of disallowing that kind of optimization is that effectively any function that transitively calls an unknown external function [1] gets -O0 performance. Note that one of the memory optimizations is moving values from memory to registers, which is an effective prerequisite of virtually every useful optimization, including the bread-and-butter optimizations that provide multiple-× speedups.

I suspect many people--including you--would find such a semantics to have too wide a blast radius. And if you start shrinking the blast radius, even to include such "obvious cases" as address-not-taken, you introduce some notion of pointer provenance.

That's how we arrived at the current state. We've given everything the "obvious" semantics, including making the "obvious" simplifying assumptions (such as address-not-taken not being addressable by pointers). But we have discovered--and this took decades to find out, mind you--that using "obvious" semantics causes contradictions [2].

Take a look at the quiz that prompted this discussion: <https://www.cl.cam.ac.uk/~pes20/cerberus/notes50-survey-disc...> and <https://www.cl.cam.ac.uk/~pes20/cerberus/notes30.pdf>. There are a lot of cases where pointer provenance crops up in which asking C experts "does this work" or "should this work" ends in head-scratching. Indeed, if you compare several claimed formal semantics of C, there are several cases where they disagree.

[1] An unknown external function must be assumed to do anything that it is legal to do, so a compiler is forced to assume that it is converting pointers to/from integers. And if that is sufficient to prohibit all memory-based optimizations, then it follows that calling an unknown external function is sufficient to prohibit all memory-based optimizations.

[2] And just to be clear, this isn't "this is breaking new heroic optimizations we're creating today", this is a matter of "30-year-old C compilers aren't compiling this code correctly under these semantics."

I would say that that transformation is legal, it doesn't even involve pointers. I don't think anything I said precludes escape analysis. How would provenance come into play here?

Suppose I invent bar as follows:

  void bar() {
    int y = 0;
    int *py = &y;
    uintptr_t scan = (uintptr_t)py;
    while (1) {
      scan++;
      char *p = (char *)scan;
      if (p[0] == 5 && p[1] == 0 && p[2] == 0 && p[3] == 0) {
        *(int *)p = 3;
        break;
      }
    }
  }
This code will scan the stack looking for an int whose value is 5 and replace it with 3. It's only undefined behavior if there's some notion of provenance: there's no pointer arithmetic, since all the arithmetic happens on integers. There's not even a strict aliasing violation (since char can read anything). And yet, this code is capable of changing the value of x in foo to 3.

> I don't think anything I said precludes escape analysis. How would provenance come into play here?

Escape analysis is a form of pointer provenance.

> I would say that that transformation is legal, it doesn't even involve pointers.

you don't know that void b() isn't implemented as

    void b() {
       int* ptr = make_a_valid_pointer_from_an_integer(1638541351);
       *ptr = 10;
    }
with 1638541351 sometimes being the address of x above?

Could another approach be taken, where local variables are considered implicitly “register”? In that case this simple example has no problem whatsoever. It does arise unnecessarily if the address of a local is taken but does not escape, but that ought to be rare.

Great example, can I please borrow that? Is a link sufficient attribution?

Go ahead.

"...in LLVM, all variables start with their address taken, so the optimizer needs to reason about when all uses of the address are known."

Why is this? Does it simplify SSA construction, or is there another reason?

It simplifies code generation. LLVM has no address-of operator; instead, you start with the address of an object (be it a global variable or a local alloca), and you take the address by using the address instead of loading the value at the address. Also, this means that all lvalues start by living in memory, and this dramatically simplifies bookkeeping you need for lvalues that may not be easily representable as registers (e.g., struct values).

this: https://www.ralfj.de/blog/2020/12/14/provenance.html is quite instructive as well (just in case you were not aware of it)

Disregard this. I was wrong.

First, this change is reducing undefined behavior* (EDIT: see below), and that undefined behavior (most casts between pointers and integers) did _not_ work reliably in practice. And in fact, the semantics they chose is rather sensible.

But yes, some code with UB which works for now might break. But that’s true whenever you upgrade your C compiler or use optimizations, so for correctness you shouldn’t do either! People working on certified embedded software have known that for long.

The real problem is that C users have been lied to. They’ve been taught the lie that C pointers are just addresses, that memory is an array of bytes, that relying on this lie is robust, and that this combination is fast. But this combination is impossible.

EDIT: officially some behavior was implementation-defined, but implementations made it undefined anyway — either officially or via bugs; links in subthread.

Then I think I was wrong. To me this looks like it _added_ a whole lot of UB and made the rules of object provenance way more complex.

>The real problem is that C users have been lied to.

I think the real problem is that the committee is creating a language that the users neither need nor understand. The users want a language where pointers are, in fact, just addresses.

Why don’t they just disable optimizations then? I think that’s because they also want performance, even when it requires pointers to not be just addresses.

And pointers weren’t addresses before either. My favorite example of “pointers aren’t addresses” is this kind of code:

  int a[…]; int b[…];
  /* … */
  a[i] = 0;
  b[j] = 1;
  return a[i];

Ignoring this discussion for a moment, wouldn’t you expect `return a[i];` to optimize to `return 0;` here? After all, `a` and `b` are disjoint arrays! I’m sure many would be disappointed if the compiler did not do that optimization, right?

If I wanted the compiler to generate code for `return 0;` I'd write it that way. If I wanted to use C as my macro assembler, I very much expect the compiler to _not_ make this transformation.

If I wanted a language with clever optimisations I'd go for something way higher level, like Lisp or Haskell. Most optimisations are only really useful to let you program generically without a headache: optimising code generated by macros or by monomorphisation/type specialisation/template instantiation/whatever you call it is genuinely valuable. Is it worth it otherwise? For me the difference between gcc -O1 and -O3 suggests it is not, in the domain where C is typically used.

Optimizations don't really matter here. Optimization can't cause non-conforming code to be emitted (for a non-buggy compiler). A non-optimizing compiler could emit exactly the same code. If they don't, that's just inefficient code gen by the non-optimizing compiler.

The point of this is to reduce miscompilations caused by pointer-int-pointer casts, use-after-free crashes, etc.

Maybe I misunderstood and overreacted. To me it looked (and looks) like it introduces a whole lot of new UB and makes pointers even more abstract.

If it helps, anything that introduces UB for optimizations (like forbidding assuming that 'int y,x;' implies an order in memory) is also good for security.

Mainly because it lets you use systems with type-safe pointers (some acronyms are PAC, BTI, CHERI) - the regular "concrete semantics" aka "whatever current pointers let you get away with" has a lot of issues! But also because compiling with extra security is not popular if it causes performance regressions.

> anything that introduces UB for optimizations (like forbidding assuming that 'int y,x;' implies an order in memory) is also good for security.


> But also because compiling with extra security is not popular if it causes performance regressions.

Yes. So while UB could theoretically be (in some cases) good for security, in practice it is absolutely not, because no tooling exists to introduce runtime checks light enough to be used in production. So UBs can and do reveal and/or amplify latent bugs, or even introduce new ones if new UB are introduced in new specifications.

PAC/BTI are definitely used in production, UBSan is fast enough to ship in production if you want, and there's more tools being used that I can't talk about here.

> I don't think the world has a need for faster c code. But is in dire need for less broken software. This is trying to do the opposite.

And that's ok. There are tools for that at the reach of those who place a premium on those requirements.

For those who favour interoperability, simpler mental models, efficiency, and more importantly a tool with a proved track record, there's C.

And there are plenty of reasons why C is the de facto standard in spite of all alternatives.

Are you genuinely claiming that this makes for a SIMPLER mental model (as compared to the implicit "concrete" model)? I don't see how you can justify that...

If you are trying to mentally model how the keys you are pressing will be turned into the machine instructions that will ultimately get executed, yes. But that is no excuse for not looking at assembly.

But you can’t compare to the concrete model, because it’s not the status quo. That ship already sailed with C89 and strict aliasing. Ptr2int casts were never legal, and support was inconsistent.

For instance, GCC sometimes tracks pointer provenance through integers; this proposed change will make that clearly illegal.

The concrete model was raised explicitly as a possible option, why can't I compare to that? TFA says that the status quo is ambiguous.

>Ptr2int casts were never legal

Citation on this? The existence of uintptr_t seems to say otherwise.

>That ship already sailed

In principle, I see no reason preventing future versions of the standard from defining previously-undefined behavior. That said, you are probably right in a realpolitik sense.

Should have been more careful — in C and C++ they were implementation-defined. But both theory and practice did not mandate a concrete model — GCC already violates that and implements a mishmash of PVI and PNVI.

From N2176, Sec., p. 5-6:

> An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.

> Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.

AFAICT, that doesn’t even guarantee that a roundtrip is the identity — which C++ mandates and is still the case: http://eel.is/c++draft/expr.reinterpret.cast#5

EDIT: GCC does have those restrictions in place already: https://gcc.gnu.org/onlinedocs/gcc-11.1.0/gcc/Arrays-and-poi...

and since a while (earliest relevant docs I could find easily): https://gcc.gnu.org/onlinedocs/gcc-3.1.1/gcc/Arrays-and-poin...

EDIT2: LLVM doesn’t document this “implementation-defined behavior” (or most others), even if docs are explicitly required by the standard; but it does use pointer provenance in practice already, but not in a way that’s easy to explain.

UNIX and C being available as free beer since their inception (actually 6th edition) kind of outnumbers all the other possible reasons.

> UNIX and C being available as free beer since their inception (actually 6th edition) kind of outnumbers all the other possible reasons.

That would only explain why C featured as one of the choices on the table.

But from that point on, people like you and me made conscious decisions based on their requirements and preferences, and C frequently was everyone's top choice.

Just like JavaScript is frequently everyone's top choice for the browser, and PHP is frequently everyone's top choice for making their websites dynamic in most ISPs.

Quality is surely not the common factor to all those top choices, rather being the platform owners, with all that it entails to the surrounding ecosystem.

> Just like JavaScript is frequently everyone's top choice for the browser

No, not really, and frankly this feels like a disingenuous comparison done in bad faith. I'm sure you know that, unlike C, JavaScript is pretty much the only option on the table if your goal is to produce anything remotely resembling production code. Also, as you certainly are aware, that's far from the case with C. In fact, you should certainly be aware that one of the most popular compiler toolkits there is, and one which has been the de facto standard for a few decades now, offers C only as a front-end, on par with languages such as Ada, which already was around for close to a decade before C's first international standard was published.

Thus, you'll need to come up with an explanation of why people repeatedly pick C over a well-established language explicitly designed to meet the requirements you favour, one which has been around for four decades, is standardized and widely available, and enjoys some adoption in some fields. But in spite of all that, everyone still picks C over it.

C is the only option in UNIX/POSIX platforms like 99% of the time.

The other being C++, which happens to be, for better or worse, almost copy-paste compatible with C89, and is also a UNIX child born a couple of offices down the corridor.

Interesting that you pick Ada, given how outrageously expensive the compilers used to be, versus the freebie C compiler that already came on the box, and how cleverly you wash out the complaints in the FOSS community that won't touch GNAT due to licensing FUD.

Who is acting in bad faith here?


I don't get what you are trying to tell me. People use C. Yes, I get it. I actually argued in favour of that: embedded development, device drivers and kernels.

None of these need speed. They need reliability and maintainability. The standards committee is prying the language out of the hands of these people, who need, in your words, "interoperability, simpler mental models" - which is sabotaged by efforts like this one.

