The CompCert C Compiler

lisper · 2024-05-26T19:54:37

I have never understood why the C community does not rally around what seems to me to be the obvious answer to the main problem with C: stop conflating pointers and arrays, or at least deprecate it. Make arrays a separate data type, not a synonym for a pointer, and require the system to track the size of arrays either at compile time or at run time (or both). Make a[x] be a bounds-checked array reference rather than a synonym for an unchecked dereference of a+x. It seems stupidly obvious to me that this is the Right Thing. Yes, it would be a non-backwards-compatible change, but the benefits seem to me to vastly outweigh the costs.

elteto · 2024-05-26T20:01:06

Because you can't do that and still call the resulting language C. It would be a different language incompatible will _all_ existing C code.

Pointer decay is a fundamental mechanic in C. If you use arrays in C, and you pass said arrays as an argument to a function, then you are dealing with pointer decay.

nine_k · 2024-05-27T04:53:58

OK, don't call it C, call it C[*].

It will accept any existing C program unmodified.

With a per-file switch, say, #define __FEATURE_WEAK_ARRAYS, it will start to discriminate T* and T[], make it an error to mix them in assignments, or to pass one instead of the other if both the function definition and the function call are in files with this feature on. It will not complain about functions defined elsewhere.

Then, say, #define __FEATURE_STRICT_ARRAYS the compiler will complain about mixing arrays and pointers as function arguments, no matter where the function is defined. It would require updated stdlib headers, for instance.

Additionally, #define __FEATURE_MULTI_ARRAYS would enable syntax for fixed-size multidimensional arrays, Fortran-style. Now uint8[3][2][10] foo; would allocate 60 elements, and access to foo[1][2][3] would involve one memory dereference, not three.

More support would be needed: sizeof, support for slices, safe array copies and length checks. Nothing extraordinary.

Having this implemented would make a terrific master's degree work.

klodolph · 2024-05-27T07:34:52

There are a million little places in existing codebases where you start with an array, and then you work with pointers to elements inside the array. So it may be a lot of work to retrofit some of your existing code. If you are not retrofitting enough of your existing code, then you’d be serving greenfield projects—and greenfield projects can use Rust or something.

There are also probably a ton of edge cases you haven’t thought about yet. For one thing, your T[] would be a different beast depending on where it appears—if you declare a variable in a block as T[], it’s an array, but it sounds like your proposal has different semantics for T[] in function parameters—it’s a reference type.

    void f(int x[]) {
      int y[] = {1, 2, 3};
      y = x; // is this allowed?
    }

I’m not trying to fight over the specifics of your proposal. I just want to illustrate that the language design is a tapestry, and you’re pulling at one of the threads.

There are a few proposals I see like this that circle around. This is not the first array improvement proposal I’ve seen for C. There are also lambdas / closures, which are surprisingly untenable in C when you really dive into it. There’s sum types / discriminated unions in Go, and higher-kinded types in Rust. For each of these features, you can find languages which already have these features, giving you all sorts of templates for how to build it, and yet it’s still such a pain in the ass to add these features to the languages which lack them.

KerrAvon · 2024-05-27T05:25:57

It’s more complex than that in practice. You need to support variable length arrays, which means you need to pass length information with the pointer, which doubles the size of array arguments. And that has ripple effects; you can’t do that indiscriminately without memory bloat and severe runtime impact, so you need to annotate argument. And you don’t have complete memory safety. And it takes a lot of manual programmer effort to get there.

This has been studied for decades; there have been many attempts to build a safer version of C. And eventually the C standard committee will have to decide on an approach or lose out to newer languages at an increasingly accelerated rate.

nine_k · 2024-05-27T08:02:24

The fun part here is that most allocators already keep the size of the chunk within the header of the chunk, before the pointer they return to the caller of malloc(3). (Yes, jemalloc does not, it keeps the size info in a different way.) BTW I mentioned this as a master's graduation project, not a realistic way forward, because, well, it's not needed.

I'd say that C will lose relevance slowly, more and more, as much as Zig will gain relevance, hopefully to the point of becoming the default choice, and having key parts of Linux kernel ported to it. Not Rust, which mostly replaces C++; even though it can venture on the C territory, its not comfortable nor seamless there. Zig is so seamless, it can even compile your C code along the way. It can do gnarly stuff like handling memory-mapped control registers with relative ease, and with much fewer footguns than C.

C is old, and its age shows. It needs to gradually retire, the way Fortran-77 did.

Koshkin · 2024-05-29T00:10:24

> C will lose relevance slowly, more and more, as much as Zig will gain relevance

I have felt a bout of nostalgia for the years past reading this.

n_plus_1_acc · 2024-05-27T14:52:57

Rust always passes slices as (ptr, len), and it seems to have to runtime performance impact.

steveklabnik · 2024-05-27T16:00:32

What impact are you thinking of?

n_plus_1_acc · 2024-05-29T18:49:13

GP claimed that passing arrays as (ptr,len) would have a dramatic performance impact. I used Rust as a counterexample.

steveklabnik · 2024-05-30T08:26:45

Oh! I think you had a typo: “to” instead of “no.” I thought you were saying there is a performance impact. Thanks for clarifying :)

n_plus_1_acc · 2024-05-30T10:34:55

I didn't notice, thanks. I agreee that made it confusing.

n_plus_1_acc · 2024-05-30T10:35:31

*no runtime impact

pklausler · 2024-05-27T05:23:42

Why do you think that foo[1][2][3] requires more than one memory reference?

nine_k · 2024-05-27T07:45:46

AFAIK in regular C a T[][][] foo would be a the same as T ** foo, so it gets implemented as a pointer to an array of pointers to arrays of pointers, each pointing to contiguous allocations of multiple Ts, each not necessarily near the other. So you need three dereferences to get to an element.

  T value = foo[1][2][3];  // becomes:
  foo ->  (T **)
          (T **) -> (T *)
          (T **)    (T *)
          ...       (T *) -> (T)
                    (T *)    (T)
                    ...      (T)
                             (T) <-- This one!

This allows for jagged arrays, yay! So useful.

This is in stark contrast to a Fortran-style array, which is allocated as one contiguous piece, all dimensions folded up for linear access with one dereference.

ynik · 2024-05-27T09:38:27

You are mistaken there. An array `int arr[10][20][30];` is a single contiguous (stack) allocation.

I recommend you read up on what pointer decay actually does; it's more complicated than replacing all arrays with pointers!

In particular, the type of an `arr` expression (after decay) is `int(*)[20][30]`. Decay only ever changes the top-level (outermost) type! And the type of `&arr` is even `int(*)[10][20][30]` -- using `&` or `sizeof` prevents decay from happening. Pointers to arrays are rarely used because using decay is more idiomatic (and because their syntax is unwieldy), but they still exist and would be safer than using decay.

pklausler · 2024-05-27T14:24:01

Nope.

skissane · 2024-05-27T11:21:37

> Because you can't do that and still call the resulting language C

You could if you introduced a new type of "safe array".

e.g.

int[] is a traditional C array which decays to int*

int[@] is a "safe C array" which is syntactic sugar for struct { size_t __count; int* __items; }, and as such can't decay

int[] and int[@] would not be directly interoperable, except by converting both to int* – maybe casting an int[@] to int* would automatically extract the __items member.

(The [@] syntax was chosen at random, if you don't like it, pick something else.)

pif · 2024-05-27T11:34:31

You can define such a struct yourself. You do not need to modify the language.

adrianN · 2024-05-27T13:28:28

By having it defined at the language level you increase the chances that your dependencies use the same struct.

SpaghettiCthulu · 2024-05-27T12:21:24

But your struct cannot provide an indexing operator (right?)

orf · 2024-05-27T15:39:27

So why do we have so many issues with arrays then, since everyone can individually fix it themselves? Hmm…

mhh__ · 2024-05-26T21:32:44

I don't know how much C code actually uses it tbh.

Most things just take a pointer.

Wouldn't be easy by any means but it could be done at the scale of the Linux kernel if anyone cared enough.

eru · 2024-05-27T04:33:43

The Linux kernel already uses not standard C, but a dialect of C defined by the GCC flags they use.

But if you are willing to use something-like-C-but-not-shackled-by-backwards-compatibility, then why stop at arrays and pointers? Just move all the way to D or Zig (or even Rust). These are all languages designed (partially) so that you can port an existing C system bit-by-bit over into them.

Many people who can afford that, are doing that, of course. And that's why you don't really hear much about backwards incompatible developments for C. What would be the point?

elteto · 2024-05-26T23:14:30

If you use arrays in C you are using pointers and pointer decay. There’s not a lot of useful C code that doesn’t use arrays.

mhh__ · 2024-05-26T23:44:19

That's true but this is mostly the easy case where things decay in relatively trivial fashion as opposed to the "they're different but actually the same" aspect which is genuinely a bitch of change.

LeFantome · 2024-05-27T00:55:31

I think that is part of his point. Most functions take a pointer, even if they are expecting an array.

lisper · 2024-05-26T20:11:03

> Because you can't do that and still call the resulting language C.

Says who? Non-backwards-compatible changes are made to language standards all the time. It's not pain-free, but neither is the status quo.

Besides, who cares what the language is called? Change the name if that's what it takes, but stop conflating pointers and arrays. The cost of that has been literally billions of dollars in losses due to buffer overflows over the decades.

colonwqbang · 2024-05-26T21:00:23

There are many good "better C" alternatives already. Rust, zig, D. They address other common types of bug like resource leaks, overflows, use after free. If you're ok with rewriting code, you have terrific options.

And of course there is C++, the most famous attempt to fix C in a somewhat compatible way by adding more features. It is debatable whether this effort resulted in a better language. C++ has all the bits needed to check array bounds by default but chooses not to do so...

The problem is that the huge existing stock of C code is written in C and not rust, zig, D, etc. The same would be true for your proposed "better C" language and any other incompatible iteration of C.

If you can come up with a way to add these guarantees to C without needing significant rewriting, I can assure you that most C programmers would be very interested.

lisper · 2024-05-26T21:12:00

> There are many good "better C" alternatives already. Rust, zig, D.

All of these are very different from C. What I'm proposing is just one small change to the existing C language.

> If you can come up with a way to add these guarantees to C without needing significant rewriting, I can assure you that most C programmers would be very interested.

I can pretty much guarantee that they would not because this is easy: phase in the changes. Start by turning array-pointer conflation into a mandatory warning rather than undefined behavior or whatever it is now. Then wait a few years. Then turn it into an error that you can muffle using -C2024 or whatever.

I actually don't know whether array-pointer conflation is required by the standard or if it's undefined behavior (I'm pretty sure it's one or the other). But if it's the latter then you don't actually have to change the standard to make this happen, all you need to do is write a compiler that does the Right Thing. AFAIK no such compiler exists. But there is just no excuse for this:

    % gcc -v
    Apple clang version 14.0.0 (clang-1400.0.29.202)
    ...
    % more test.c
    int main () {
      int x[10];
      return *(x+20);
    }
    % gcc -Wall test.c
    %

rfoo · 2024-05-26T21:20:26

There are more than one billion lines of C code which hasn't been updated in the last decade and is still in use.

Who is going to update these code once the "do the right thing"-compiler become available?

Oh, and the worst part: some of them may already be bug-free due to 15 years (if not more) of people trying to make money by selling exploits to surveillance vendors or who knows. But there's certainly high-impact bugs left. Now what, refactor the code to use the fancy eliminate-spatial-memory-corruption C variant and introduce a few UAF by the way?

lisper · 2024-05-26T21:24:43

The authors and maintainers of that code. Any error or warning flagged by this change is something that really ought to be changed anyway because any unchecked array or pointer dereference is a potential security risk.

ncruces · 2024-05-27T09:02:40

That's patently false, or we wouldn't resort to all kinds of tricks to convince the (moderately smart) Go compiler to elide bounds checks (that we know to be unnecessary) from inner loops where they degrade performance significantly.

112233 · 2024-05-27T05:44:42

> any unchecked array or pointer dereference is a potential security risk.

I take exception to this. Of course if I write once-test never, copying from google results and trying to hit jira metrics, then any safety feature in the language will filter out some of the toxic waste code I am producing.

If secure code is designed and engineered, like any other secure technical system would, the language used does not matter so much, but, unsurprisingly, needs to be easy to reason about formally.

rfoo · 2024-05-26T21:36:46

Have you read the source code of Xpdf (the thing being exploited in the famous NSO Apple iMessage 0-click blah blah)?

I did (because I wrote an exploit for the bug after the Google blogpost, for curiosity), the code looks disgusting. The author (one poor guy) does not have the code in an online VCS and instead dump a source tarball every few months (or years). The upstream vulnerable code was fixed months after news outbreak.

My conclusion is if Apple had a choice it won't end up in iOS at all. Clearly, Apple already paid a lot of maintaining cost in this case (fixing bugs before upstream did), but what you asked for is a whole new level.

VogonPoetry · 2024-05-26T22:48:11

I think you conflating bugs. Apple doesn't use Xpdf as the basis of the PDF Framework. The NSO bug exploited a bug in the JBIG2 file format. The same implementation of this code was included in both Xpdf and the Apple PDF code. That is why Xpdf needed to also fix the same code.

rfoo · 2024-05-27T05:41:52

Of course I mean the fact they use the JBIG2 part of Xpdf. You don't have to use Xpdf for processing PDF in order to use these `class JBIG2*` and you don't even need to patch it.

LukeShu · 2024-05-26T22:09:47

There is a well-maintained in-VCS less-disgusting version of Xpdf. It's called Poppler, and Apple chooses to not use it.

duskwuff · 2024-05-26T23:06:24

Popper is GPL licensed, so Apple cannot use it.

(If you're going to say "but they could use it if they relicensed all of iOS as GPL": don't be daft.)

medo-bear · 2024-05-27T08:35:29

I guess that is Apple's problem

retrac · 2024-05-26T21:25:26

In assembly, arrays and structs reduce to base address, plus offset times scaling factor. C provides a thin veneer over that. Scaling derived from size of type. Offset is the array index. The basic programming model of C is to view memory as an array.

eru · 2024-05-27T04:35:53

It's tempting to view C as a thin veneer over assembly.

Alas going there ignores all the nice undefined behaviour landmines the standard has buried for you.

lisper · 2024-05-26T21:38:33

Yes, I understand that. But the topic at hand is a compiler whose "intended use is the compilation of life-critical and mission-critical software written in C". The idea of using C in life-critical and mission-critical software is risible as long as the language definition requires it to have gaping security holes, especially when the single biggest contributor to this problem is fairly easy to fix, at least from a technological point of view if not a political and sociological one.

nequo · 2024-05-26T21:58:12

My understanding is that the primary purpose of CompCert is to make formally verified code that is extracted into C also get compiled by a compiler that is formally verified to preserve the intended semantics.

So CompCert seems to me to aim to help mission-critical software to move away from C, and possibly into Coq/Isabelle/etc., except for the purposes of compilation to machine code.

lisper · 2024-05-26T22:12:32

That is a noble goal, but I don't see how it can possibly achieve the intended result as long as the C standard is a fundamentally borken as it is.

I tried to download CompCert so I could try it out, but they only have a source distribution and to build it you need Coq and OCaml and a few other things because of course CompCert is not written in C. No one in their right mind would write mission-critical software in C.

seabird · 2024-05-27T02:43:53

If your original source is provably transpilable to C with no UB, and your compiler provably compiles that C without any bungling, then you've made it. The standard is bad and I want to see the end of C before I'm dead, but this isn't a stretch, and C is just a detail in the process.

LeFantome · 2024-05-27T03:59:05

I am mostly in the other side of this argument. C is an insane language in my view ( in a modern context ).

That said, what “mission critical software” are you using that is not running on an OS written in C?

eru · 2024-05-27T04:43:22

> > No one in their right mind would write mission-critical software in C.

> That said, what “mission critical software” are you using that is not running on an OS written in C?

I'm not sure that's relevant?

If you have a piece of mission critical software, almost all the time you run it on an existing OS like Windows or Linux. You don't _write_ a new OS just for your one piece of software.

Of course, that OS had to be written at some point in the past (and is still being worked on). Presumably that writing was (and is) being done by people not 'in their right mind'. But that shouldn't concern you.

The problem with C is not that you can't write secure-ish software at all; the problem is that this is insanely difficult, and that the trade-offs aren't worth it. Especially for new software.

For software that I get from some third-party, like the OS, I only care about its quality (and price); I don't care about the trade-offs and pains the authors had to endure. If they want to use C in the privacy of their own bedroom, that's up to them.

Of course, Linux in 2024 is written in C, mostly because Linux in 2023 was written in C, then 2022, etc all the way back to the 1990s. There's a lot of path dependence. Back in the 1990s C was a more reasonable choice to write your new OS in. Especially if it was a clone of Unix, C's original home and killer app.

airbreather · 2024-05-27T10:08:06

0Yeah, so I am a functional safety engineer, but for industrial process plant and machinery. We don't use C, traditionally it is safety controllers, specialised PLCs.

Most safety PLC's boot into a hypervisor that boots an OS (Wind River Linux or something) that runs a program that might be your complied config, or runs a program that runs your configuration (eg code you wrote).

So what languages does it seem likely were used for all those extra layers between your code and the CPU?

And I am talking the sort of controllers that supervise LNG plants, large buildings where they might have more than one elevator in any shaft, prevent overpressuring pipelines and creating environmental disasters and so on.

I would be more comfortable personally if I could write a c program and compile it knowing that the compiled code will run on the bare metal, at least then there are not a couple of closed source proprietary layers of abstraction between me and the processor.

Note : in case you wonder what the difference between a regular PLC and a safety PLC is, a safety PLC has a fuckload more diagnostics. For a safety system PLC, faults aren't the problem, it is dangerous undetected faults. A detected dangerous fault will trip to a shutdown immediately and is an availability issue, not a safety issue.

But, guess what language the firmware that does these diagnostics is written in? I don't know, but I doubt strongly it is one of the 5 PLC languages specified in IEC 61131, so that leaves it likely to be C.

pjmlp · 2024-05-27T08:26:19

Historical reasons, my Windows, Android, macOS, iOS devices have plenty of OS code written in C++.

Even those criticial OSes that refuse to move beyond C, most likely are using C compilers written in C++.

lelanthran · 2024-05-27T06:43:51

> No one in their right mind would write mission-critical software in C.

Read my previous response to you[1]: you clearly haven't worked on systems that would kill people if things went wrong.

[1] https://news.ycombinator.com/item?id=40488277

lisper · 2024-05-27T15:51:46

> you clearly haven't worked on systems that would kill people if things went wrong

True, but I have worked on a system that would have cost hundreds of millions of dollars if things went wrong. And they did go wrong, though we managed to save the asset. So I do have some relevant experience here.

Yes, if you put enough effort into it and deploy into a non-adversarial environment, you can get the odds of success pretty close to 100%. But then you also get the Therac-25 every now and then.

But mainly you get an endless stream of buffer overflows that lets hackers steal people's bank accounts. That's not life-and-death, but it's a significant societal cost nonetheless.

user2342 · 2024-05-27T11:43:36

> My understanding is that the primary purpose of CompCert is to make formally verified code that is extracted into C also get compiled by a compiler that is formally verified to preserve the intended semantics.

Thats my understanding too. Code is written in high level systems generating C as output. C becomes rather an implementation detail in a hopefully, more or less completely verified tool chain.

lelanthran · 2024-05-27T06:41:43

> The idea of using C in life-critical and mission-critical software is risible as long as the language definition requires it to have gaping security holes,

And yet, even though C has been the primary language for safety and life-critical software for decades, with billions of lines of code written to control things where failure results in loss of human life, there has been no significant loss of human life due to the C language.

Throughout the 80s, 90s, 2000s and 2010s C has been the primary language used to control industrial machinery that would kill people on software failure, munitions that would kill people on software failure[1], vehicles that would kill people on software failure, medical devices that would kill people on failure ... and out of these billions of deployments, with billions of lines of code, offhand I can think of only one instance where a different language would have prevented 3 deaths.

I'm not saying that C is safe, but it is clear from the statistics that the danger is very very highly overrated. There is a much greater danger in rewriting battle-tested systems just for the sake of rewriting.

[1] An industry I worked in, btw.

medo-bear · 2024-05-27T08:43:01

> I'm not saying that C is safe, but it is clear from the statistics that the danger is very very highly overrated. There is a much greater danger in rewriting battle-tested systems just for the sake of rewriting.

From 80->90->00->10->20s reading and writing C seems less and less magical, including for exploit writers. In 10 years exploits might even be written willy-nilly by an LLM. One of the reasons why writing safe and secure code requires thinking few steps into the future.

pjmlp · 2024-05-27T07:09:10

So you are well aware that isn't regular C that gets written, rather something that most HN readers would run away from if forced to write such kind of C.

lelanthran · 2024-05-27T08:03:38

> So you are well aware that isn't regular C that gets written, rather something that most HN readers would run away from if forced to write such kind of C.

I'm not sure what your point is.

It isn't always standard C, if that's what you're trying to say.

It's usually not a hosted implementation, but sometimes it is. It's usually done within industry regulated guidelines, but not always.

The fact is, the "not always" bit matters, because the body of C code controlling actions where human lives matter is so large that there is still a substantial body of standards-compliant C code that doesn't kill people!

The claim being made is contrary to the large body of evidence that we have.

pjmlp · 2024-05-27T08:13:51

My point is MISRA, AUTOSAR, DO-178C and plenty of other ones, alongside the industry moving on into hardware memory tagging as the only means to fix C for such kind of security critical systems, short of using something else if possible as suggested by upcoming cybersecurity laws.

lelanthran · 2024-05-27T09:12:48

> My point is MISRA, AUTOSAR, DO-178C and plenty of other ones, alongside the industry moving on into hardware memory tagging as the only means to fix C for such kind of security critical systems,

I dunno how relevant that is.

The argument was "Irresponsible to use C for critical systems"

The counterpoint is "Despite being the primary language for critical systems, negligible failures have been attributed to the language."

I'm basically saying this: How do you explain both that severe reaction to using C AND the historically negligible failure rate of the language itself?

RealityVoid · 2024-05-27T11:00:39

> How do you explain both that severe reaction to using C AND the historically negligible failure rate of the language itself?

You CAN get from point A to point B by riding a horse, but why would you when cars are a faster alternative?

But to the point, many failures have been attributed to the language, most of security bugs stem from the C's lack of memory safety.

_Fortunately_ the reason not a whole lot of deaths can be attributed dirrectly to C is the fact that:

- The safety critical sw is has multiple redundancies baked in, including at the HW level that would safeguard against fatal outcomes.

- Safety critical SW is tested intensley. This proves the "common" cases of usage, but in my experience still fails for long-tail events.

- Memory corruption issues would most of the cases "mearly" lead to resets instead of wrong program output.

- Thinking about SW that is deployed in large numbers, if we admit that memory corruption issues happen in very special cases ( see 2nd pct) then the sudden appearence of a bug could _very_ easily be bundled as a fluke instead of a bug and we would probably not be able to distinguish the failure leading to death as being attributed to C. (since " it works fine on my machine" in 99.99% of the cases)

bdw5204 · 2024-05-27T14:17:48

I'd say the main reason why there's a phobia of using C would be that it is a language that requires skilled programmers. C programmers are not interchangeable cogs which means corporations that are dependent upon C code have to hire the best programmers not the ones with the best "soft skills" or the ones with the "right" physical appearance.

The one thing corporations wanted, above all, was to increase the supply of programmers who won't break everything. This explains the trends towards safety in programming languages and it also explains why OOP became so popular. It also explains the push over the last 10-15 years or so for everybody to learn to code. Anything that is hard reduces the number of potential programmers which is bad for business's bottom line.

pjmlp · 2024-05-27T12:01:32

Very easy, the only way C can be resposibly used in critical systems is to put so many guard rails and training wheels in place, that it no longer looks like C in first place.

It is only negligible for those that don't have to fix CVE issues.

Which is why we have all those ongoing security laws, companies and goverments have finally started to map money burned due to those CVE fixes.

colonwqbang · 2024-05-27T09:59:36

You are just not using the correct compiler flags. Compile with gcc -O2 -Wall -Werror, and you will see that GCC rejects this program due to the out of bounds condition. No need to trash talk GCC when it already does what you want it to.

lisper · 2024-05-27T15:43:53

I'm actually running clang (look at the second line of the transcript). I just invoke it with gcc out of thirty years of habit.

colonwqbang · 2024-05-27T17:18:21

I see! For recent clang, the flag -Warray-bounds-pointer-arithmetic did the same thing for me. So if missing bounds checking for static arrays was your main gripe with C, you can rejoice.

balnaphone · 2024-05-27T11:14:03

Perhaps that is inexcusable. GNU gcc version 13.2.0 (with -O2, as documented) does report a problem.

    $ cat tst.c
    int main () {
      int x[10];
      return *(x+20);
    }
    $ gcc -Wall -O2 tst.c
    tst.c: In function ‘main’:
    tst.c:3:10: warning: array subscript 20 is outside array bounds of ‘int[10]’ [-Warray-bounds=]
        3 |   return *(x+20);
          |          ^~~~~~~
    tst.c:2:7: note: at offset 80 into object ‘x’ of size 40
        2 |   int x[10];
          |       ^

HelloNurse · 2024-05-27T16:13:33

These are easy mode arrays, with size and offset known at compile time. Receiving x as an int* parameter to a function, with no way to know its length automatically, would be more realistic.

flohofwoe · 2024-05-27T06:42:12

> All of these are very different from C.

With the changes you have in mind, that new "C+" would be much closer to Zig than C. For a backward compatible bounds-checking proposal see: https://discourse.llvm.org/t/rfc-enforcing-bounds-safety-in-...

This basically just associates a pointer and a length via new (and optional) annotations.

pjmlp · 2024-05-27T07:06:38

That is what Microsoft is already doing since the Windows XP SP2 security task force, Apple came a bit late into the game.

The

    void foo(int *__counted_by(N) ptr, size_t N);

with SAL

    void foo(_In_reads_bytes_(N) int *ptr, size_t N);

https://learn.microsoft.com/en-us/cpp/code-quality/understan...

But given how much long time ago XP SP 2 was, and how many people actually use them, unless forced at their job, that is quite telling how much people care.

rurban · 2024-05-29T12:04:07

The proper simple syntax IMHO should better be something like:

    void foo(int ptr[n], size_t n);

with ptr[n] not copying the full array, just that n is the size.

you can try it now with:

    #include <stddef.h>
    #include <stdio.h>
    int ptr[6] = {0,1,2,3,4,5};
    #define N sizeof(ptr)/sizeof(int)
    
    void foo(int ptr[n], size_t n) { // error: ‘n’ undeclared here (not in a function)
        for (unsigned i=0; i<n; i++)
            printf("%d ", ptr[i]);
    }
    void main(void) {
        foo(ptr, N);
    }

instead of compile-time:

    #include <stddef.h>
    #include <stdio.h>
    int ptr[6] = {0,1,2,3,4,5};
    #define N sizeof(ptr)/sizeof(int)
    
    void foo(int ptr[N], size_t n) {
        for (unsigned i=0; i<n; i++)
            printf("%d ", ptr[i]);
    }
    void main(void) {
        foo(ptr, N);
    }

The Linux kernel [restrict .n] syntax is just too weird, almost perl-like, inventing new magic glyphs. And deviating from the normal restrict meaning.

Gibbon1 · 2024-05-27T09:47:08

What bothers me is a function that takes both a ptr and a length is taking a phat pointer.

What feels bad is you could add standard phat_ptr_t to the C library. But they refuse to do even that.

flohofwoe · 2024-05-27T11:02:18

Such a bounds-checking language extension needs to be able to annotate existing libraries without changing their API, otherwise it's not all that useful.

Gibbon1 · 2024-05-27T20:37:07

What I've noticed about C and it's problems with safety is the discussion always assumes that C will be replaced real soon now. So the problem is really about existing code bases.

After 40 years of that I think that was a bad assumption.

I also think that with annotations you can fix code mechanically.

You got

   void foo(int *__counted_by(N) ptr, size_t N);

That could be replaced mechanically by

   void foo(sized_buf_t buffer);

And if it can't that's already a big problem.

flohofwoe · 2024-05-28T15:55:50

Yes, easy to do and a good idea in your own code but maybe not an option in a library that needs to remain API and ABI compatible with the previous version of the library that used separate ptr and size arguments (for whatever reasons - for instance Apple might want to harden system libraries without breaking existing applications).

flohofwoe · 2024-05-27T14:54:49

PS (too late for edit so I'm replying to myself): with the above Clang extension you can also define your own phat_ptr_t struct and still associate the length with the pointer. Not sure if that's also possible with the Microsoft extension, for instance this is copied from the proposal text:

    typedef struct {  
      int *__counted_by(count) buf;
      size_t count;  
    } sized_buf_t;

pyrolistical · 2024-05-26T22:14:48

You should take a closer look at zig. While superficially the syntax is very different, what zig really is, c but more specific

lisper · 2024-05-26T22:20:46

No, I get that. I was referring to the syntax. Syntax matters.

Zig is hands-down a better language than C, and (I'll take your word that) it fills the same niche as C, but it is still a different language with its own idioms and lore and conventions. It is not C-with-tweaks. It cannot be compiled by an extant C compiler. Code written under my proposal would be legal C code under the current standard (but not the other way around).

[EDIT] Actually, that turns out not to be true. You'd need to change the behavior of SIZEOF or provide some other way of getting the size of dynamically allocated arrays at run time, since this information would now be maintained by the compiler.

unwind · 2024-05-27T07:37:12

Of course sizeof already has that capability since C has variable length arrays. They are kind of being phased out, but they are there.

You can do this:

    #include <stdio.h>

    int main(int argc, char *argv[])  {
        unsigned int lens[argc];
        for (int i = 0; i < argc; ++i)
            lens[i] = (int) strlen(argv[i]);
        printf("Computed %d lens into %zu bytes of array\n", argc, sizeof lens);
        return 0;
    }

Very contrived pointless example, but still.

riku_iki · 2024-05-27T05:16:58

> C++ has all the bits needed to check array bounds by default but chooses not to do so...

std::array::at does bound check.

grumpyprole · 2024-05-27T06:04:07

Yes, but they've made the convenient lightweight syntax not bounds check. Defaults matter in languages. Other languages make you use an esoteric function e.g at_unsafe to skip bounds checking.

riku_iki · 2024-05-27T06:37:50

its probably because per contract std::array has to behave exactly like C array for legacy ops. Also this is not zero cost abstraction, so programmer has a choice: ultimate performance or extra safety.

retrac · 2024-05-27T11:47:42

* = [] and friends can be overloaded in C++. So just about any kind of data structure can masquerade as items[3] or *value.

Opinions differ on whether this is the great strength or fatal flaw in C++.

pjmlp · 2024-05-27T08:29:41

While a bummer, most C++ compilers do have a build flag to enable bounds check in operator[]().

Which most sane compilers will do for you in debug builds.

colonwqbang · 2024-05-27T23:02:36

Adding a special function for safe indexing doesn't really count, we can do that in C too. "By default" means the most natural and common way to index an array, namely the [] syntax.

riku_iki · 2024-05-27T23:14:17

> "By default" means the most natural and common way to index an array, namely the [] syntax.

Sorry, I challenge your authority to decide what is "most natural way".

admax88qqq · 2024-05-26T21:19:27

> huge existing stock of C code

I feel like the amount of effort that has been spent so far to make C safe and fix bugs due to C not being safe is greater than the effort that would have been required to rewrite all existing C code into memory safe languages.

But I think secretly C programmers don't want memory safety. Dealing with pointers and remembering to malloc and free are part of what makes them feel more skilled/elite than those other programmers who have garbage collection and bounds checking.

optymizer · 2024-05-26T23:19:37

It's not that I don't want memory safety or that I feel superior - what I want is to write the fastest possible portable code. That's what C does, and nothing more.

Memory management, array bounds checking and a bunch of other 'safe' features have a price that I'm not willing to apply broadly and redudantly to all of my software.

I'm going for speed, that's why I'm using a Ferrari. Corolla's are fast and safe - use those, don't lobby for Ferrari to add safety to their cars at the expense of speed.

There are hundreds of languages. Use those. Write transpilers for C code for software that shouldn't have been written in C because it had to be safe. That would be a better use of your time.

admax88qqq · 2024-05-28T14:33:40

C is not a fast language outside of microbenchmarks.

If you’re writing large systems Java is the fastest language. Go benchmark Jetty vs Apache when serving non trivial web apps. Java is actually amazingly fast but it feels slow due purely because it starts up slower, startup time is not an issue for long running applications.

Heck just look at Apache Lucene the gold standard of full text search.

optymizer · 2024-05-30T11:54:15

You must be trolling. This is just patently false. Please don't spread misinformation.

Your comment is confusing "fast enough" with "fastest". Java is fast enough for lots of applications and that's fine, particularly because large systems are usually I/O bound, but it makes no sense to conclude that it is therefore faster than C.

I've been writing Java code for the past 10 years. Do you know how people speed up Java applications? They write the code in C, compile it as a library and use JNI to invoke it.

I would recommend you brush up on your CS fundamentals.

tialaramex · 2024-05-26T23:55:06

But that's not what C does, You've been, at best, misled.

What C does is assume that you're willing to sacrifice correctness to make the compiler simpler which is quite different from what you described.

In practice this has a negative consequence for performance as well as safety.

optymizer · 2024-05-27T00:28:02

Having written a compiler for a subset of the C99 standard, I'm going to disagree here.

Array bounds are not being checked on every array access not because it would make compilers too complex.

Correctness is being sacrificed mostly for speed or portability on future CPUs.

There are examples of language features that simplify compiler writing, however.

For example, type promotion from char to int is a feature that reduces the number of cases one would have to deal with when implementing the type system in a compiler, but it's there because it doesn't sacrifice neither performance nor portability.

pjmlp · 2024-05-27T08:31:48

Yet every other systems programming language never had any issues with enabling bounds checks, their only failure was not having a free beer OS to come along for the ride.

rfoo · 2024-05-27T10:34:18

I have to constantly fight against rustc and LLVM to convince it to eliminate bounds check in hot loop when I'm writing high performance Rust and it's a cursed experience I hope nobody has to.

Other replies in this thread mentioned they have similar problems writing Go, I don't know to what extent it applies, in my limited experience working on Go codebases I never see such issues.

pjmlp · 2024-05-27T11:48:38

I am writing code since 1986, in my experience most of those cases are mostly a I feel good kind of thing, and have contributed zero to the project delivery acceptance criteria.

When it does in fact cause an issue with project delivery acceptance testing, the issue is solved by making use of profiling tools, and cirugically disable bounds checking, which most systems languages since the dawn of time also support.

rfoo · 2024-05-27T17:50:47

Well, I would have no idea if a bounds check is eliminated at all (and who wants to care??), if it does not show up in profiling results.

Unfortunately for what I do I had to do this a lot. I guess that's also why I'm not seeing it in Go, never tried to write a query engine in Go.

pjmlp · 2024-05-27T19:13:19

Well, does something like CERN TDAQ/HLT count?

The algorithms, networking protocols and thread scheduling are much more relevant, that the bounds checking done in the C++ data structures.

As for writing query engines with bounds checking languages, there are several examples.

eru · 2024-05-27T04:44:57

> Array bounds are not being checked on every array access not because it would make compilers too complex.

That might be true, but you could still specify something slightly less exploit heavy than 'undefined behaviour'. Eg you could make out-of-bounds access into implementation defined behaviour.

jstimpfle · 2024-05-27T08:42:40

There is no way to predict what will happen if your program is accessing random memory at runtime, especially if it's a write access. To specify what would happen on a write to random memory would fill books that basically lay out most of the internals of the compiler and also the host OS.

eru · 2024-05-27T09:09:19

You could at least define it not to travel backwards in time.

Undefined behaviour in C infects the whole execution, not just what comes afterwards.

jstimpfle · 2024-05-27T10:01:35

Can you clarify what you mean? Is it defined to "travel backwards in time"? I suspect not.

tialaramex · 2024-05-27T22:13:20

Is the situation here that you're unaware of time travel UB optimisations in C and C++?

https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...

eru · 2024-05-28T01:46:53

Thanks for digging out the link, so I don't have to!

jstimpfle · 2024-05-28T09:39:05

No I'm aware of examples like these, I just asked for clarification what they mean by "define it to not travel backwards in time". To me this sounds nonsensical.

I'm not deep into compiler construction, but to me these examples seem just like a logical consequence of what UB is -- it's a (runtime) situation that the compiler is not required to take into consideration. It can opt to not emit code to treat these situations at all, etc -- effectively assuming they don't happen. The point is to allow the compiler to blindly dereference a pointer even when it can't prove that the pointer is valid. Or to allow it to implement arithmetic on a register of bigger size (assuming the computation doesn't overflow), etc.

Now, depending on how optimizers are written, the compiler may end up inadvertently detecting UB and optimizing out entire branches of code, just by virtue of how the optimizer works internally. You can bet that the compiler doesn't think much of e.g. what is earlier or later in time, when doing optimizing transformations.

Of course a "miscompilation" (of code that is buggy in the first place) is an unfortunate situation and a diagnostic would be better. Compilers should improve (and they probably do). Compilers should be friendly and give unsurprising results and good diagnostics as much as possible.

But to "define it to not travel backwards in time" right in the spec would probably be very hard and might negate the point of UB in the first place. It would require doing the work of compiler authors, which are the people responsible to figure out how to make _their_ compiler solid and ergonomic while also offering the optimizations people want. This is already a hard task for the authors of a specific compiler, and probably not something that you can easily define in a language spec!

And for balance, I've never consciously had to deal with a miscompilation like this, and I write C and C++ in professional capacity almost every day. Instead, most bugs I deal with are of the most trivial kind, you hit a segmentation fault, quickly navigate to the piece of code where there is still some initialization stuff missing, and fill it in. Or there is a logic bug that is entirely unrelated to UB, those are in fact, typically, more difficult to find and fix.

Note that while I'm by no means an exceptional programmer (not that I think you think that of me). I simply want to solve a problem. And while developping I introduce bugs and even UB sometimes (even though it seems to be quite rare if I can trust sanitizers). I'm actually sophisticated enough to develop in debug mode, with most optimizations turned off, and this might be one explanation why I've never hit an annoying situation like this.

To me, these stories are fascinating, and I think they should be taken seriously. But their effect on online forums is mostly to heat up discussions.

tialaramex · 2024-05-28T12:22:08

This comes up now because SG21 (the Contracts Study Group) have a proposal for C++ 26. Proponents of this work would like to portray it as a crucial safety improvement - you can now write a pre-condition contract and, hypothetically, this could be enforced to deliver a meaningful safety improvement over just documenting the same requirement on a web page nobody reads.

But of course the proposed C++ 26 Contracts rely on C++ expressions. In C++ the expressions are themselves full of potential UB footguns, including signed overflow and illegal pointer de-reference. Thanks to time travel, this means adding the "safety" pre-condition may actually make your software much more dangerous not safer.

One proposed way to defuse this somewhat is to prohibit that time travel. Your contract expressions might still be UB but the idea is to promise by fiat that if so this doesn't actually time travel and destroy previously correct parts of the software.

I genuinely don't know what will happen there and can offer no predictions. In terms of what would be amusing as a spectator I hope either SG23 (Safety) explicitly says this is a terrible idea but WG21 ships it anyway or, equally funny, SG23 endorses the current unsafe nonsense as safe and then a subsequent committee has to establish a "Safety but really this time" Study Group to replace SG23 in a few years when it's thoroughly discredited.

> most bugs I deal with are of the most trivial kind, you hit a segmentation fault, quickly navigate to the piece of code where there is still some initialization stuff missing, and fill it in

Sure. C++ is such a bad language that most of your bug fixing is stuff which wouldn't even happen in a better language. Rust's std::mem::uninitialized<T>() is ludicrously dangerous, so it's deprecated (as well as unsafe) and yet C++ not only does this, it's silently the default for the built-in types. Hilarious. My sense is that a correct fix for this won't land for C++ 26 although maybe Barry can get the stars to align and prove me wrong.

jstimpfle · 2024-05-28T12:32:35

See, I don't mind UB on signed integer overflow for example. You make it sound like a terrible terrible thing. I know it's not defined (and there is a rationale for keeping it undefined even assuming 2's complement). So I don't rely on it.

Quite honestly I don't recall signed overflow to happen, like ever. It's probably happened at some point but I really don't recall. I'm not trying to make it happen because I don't have a use for it. It's not useful anyway to have a number wrap around from e.g. 2^31-1 to -2^31. It is useful however to wrap from UINT_MAX to 0 (modular arithmetic), and this is in fact defined.

Of course, if you write "if (x < x + 20)" and turn the optimizer to -O7, then the compiler will run the body unconditionally, even though assuming signed overflow the test should fail when x equals INT_MAX. Woah, I'm crushed. That condition is exactly what I needed to write.

> Sure. C++ is such a bad language that most of your bug fixing is stuff which wouldn't even happen in a better language. Rust's std::mem::uninitialized<T>() is ludicrously dangerous, so it's deprecated (as well as unsafe) and yet C++ not only does this, it's silently the default for the built-in types. Hilarious. My sense is that a correct fix for this won't land for C++ 26 although maybe Barry can get the stars to align and prove me wrong.

I mean I could just write "#error Unimplemented" to get a compile time error but I'm not bothering. It seems what you describe as a terrible memory safety bug is simply my way of browsing to the next piece of code that I need to work on. Go figure...

Are you still developing C/C++ code? I get the impression you've given up on it and have jumped on the Rust train a hundred percent. At least there is a huge disconnect between the pictures you paint and my own development experience from daily practice.

But to make it clear again, I'm obviously not opposed to having the compiler issue an error whenever it's able to detect UB statically. In fact, this is how it should be.

tialaramex · 2024-05-28T15:08:40

> Of course, if you write "if (x < x + 20)" and turn the optimizer to -O7, then the compiler will run the body unconditionally

You seem very confident how the compiler will react to UB, I wouldn't be. You also seem unduly confident that you can spot such a footgun and wouldn't pull the trigger.

> It seems what you describe as a terrible memory safety bug is simply my way of browsing to the next piece of code that I need to work on. Go figure...

It's Undefined Behaviour, and you're just quietly confident that it'll be fine. Which it will until it isn't one day (and maybe that day was yesterday).

> I mean I could just write "#error Unimplemented" to get a compile time error but I'm not bothering.

A compile time error seems like a weird choice. Why write such an error only to immediately have to fix it? In Rust I'd write todo!() when I need to come back and actually provide a value or write some more code here later, that way it only blows up if this code actually executes.

> Are you still developing C/C++ code?

Not in anger for several years. I write Godbolt-sized samples to make a point sometimes.

> But to make it clear again, I'm obviously not opposed to having the compiler issue an error whenever it's able to detect UB statically. In fact, this is how it should be.

All the popular C and C++ compilers provide a great many flags you can set to get more of these diagnostics you're "obviously not opposed to". How many are you using today? How many did you try and then turn back off because of all the "false positive" diagnostics about things you knew were a bad idea but have preferred not to think about because hey, it seems like it works, right ?

jstimpfle · 2024-05-28T16:34:05

> A compile time error seems like a weird choice. Why write such an error only to immediately have to fix it? In Rust I'd write todo!() when I need to come back and actually provide a value or write some more code here later, that way it only blows up if this code actually executes.

Well that's exactly what I get too by doing nothing and noticing the segfault when running my debug build. Sure, I get it, it's UB and there could be "time travel" and what not. But in practice I seem to get my segfault, so that's just how I end up developing. If it wouldn't work, I could write my own todo() macro, nothing magical about it right?

> All the popular C and C++ compilers provide a great many flags you can set to get more of these diagnostics you're "obviously not opposed to". How many are you using today? How many did you try and then turn back off because of all the "false positive" diagnostics about things you knew were a bad idea but have preferred not to think about because hey, it seems like it works, right ?

I compile with -Wall on Linux and -W4 on MSVC. If I'm not seeing bugs in the integration tests, there is for most domains very little economic incentive to setup various static analyzers etc, so I rarely do that. I run -fsanitize on some of my stuff from time to time just for kicks, but haven't gotten enough value out of it, which is why it's not a habit for me.

But since you mentioned it I went ahead and ran -fsanitize=undefined -fsanitize=address on a test program of the multi-threaded queue I'm working on, which is a bit performance-oriented -- on my older desktop computer it persists > 2M individual messages/sec (600MB/s) to a single disk, with to-memory message submission latencies of < 300nsecs for 99th percentile, < 2usecs for 99.9th percentile and < 30usecs for 99.99th percentile. The test program runs for ~6 seconds, submitting 16M messages (4GB of data), with 4 concurrent readers receiving the messages as soon as they come in. 178 fsync() calls were done by the enqueuer threads or the dedicated flusher threads. There are various internal buffers (a couple MB) and multiple internal message stores (1 optimized for fast submission / 1 for dense storage), and a couple low-contention mutexes but also some wait-free stuff.

-fsanitize didn't find a single UB (I double checked that the detection does work in principle by introducing a signed-overflow bug and a null-pointer dereference as well as an OOB memory dereference). And it found 3 leaks of 1 byte, which seem to be false positives: all related to smaller structures (more than 1 byte) that I allocated and freed correctly. That's all it reported.

I then went on to test using valgrind, which notably reported 0 leaked bytes, and otherwise only reported tons of spam exclusively related to printf-family calls. IIRC these are common false positives due to library mismatches or something like that. You can get rid of them, but I won't bother now.

This is the first time I tried static and runtime analyzers on this project, other than -Wall. In other words, it seems that just by fixing bugs and adding code until it worked, I produced a software of ~5K lines of C code that performs quite well and has 0 bugs or UB uncovered in the good hour of work that I put in.

steveklabnik · 2024-05-28T14:06:41

The sort of thing your parent is talking about is being presented to the committee as an example of things that need to be considered, so it appears to be a serious enough issue to at least seriously discuss.

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p32...

jstimpfle · 2024-05-28T17:04:52

There appears to exist a guy with a formal background who is interested in submitting a paper about formal verification and static analysis and stuff. Impressive work, but I really don't know what to take home from the existance of this, or what argument this supports.

In my sibling post I merely want to illustrate that all these concerns have little bearing on my day-to-day work (which mostly doesn't need to be certified, and is not related to the defense industry or similar). Some of these I perceive as FUD, as said I know that you can provoke these situations but I've never personally encountered nasal daemons in practice, and I feel quite productive, am not spending a lot of time on bugs, so why bother.

tialaramex · 2024-05-28T23:55:22

Gabriel Dos Reis is specifically an old colleague of Bjarne Stroustrup (they worked at the same University, Texas A&M) who is now working for Microsoft on C++ tooling and so on.

So, one way to understand these papers is that Microsoft (at least some parts of it) thinks unsafe Contracts are worse than no Contracts. Now, would that mean they just won't implement an unsafe Contracts feature shipped in a C++ 26 document? Maybe. Would these fixes get it over the line? Maybe.

jstimpfle · 2024-05-29T08:33:04

I'm taking a closer look, but from the looks of it I'm not a fan of adding yet another sub-language with differing syntax and semantics. This leads to complexity, it's a path to madness.

Without being involved -- I have no intention using any of these Contracts in whatever form. I will say though that I wouldn't care if there is UB in the contract language (just like there is in the normal language). I would prefer the variant with UB if it is simpler and more aligned with the language core. Removing the UB here is an academic exercise. Safety absolutists are uncompromising about the goal of correctness and provability. They are blind to the pragmatic issues created by the idealism. Contracts in either form could probably improve correctness by a lot, like 99% or whatever. So why should I care about the paper which could in theory bring the remaining 1%? It doesn't affect me pragmatically.

The flaw with either is that this is only in theory. In practice, I will never create enough formal contracts in to significantly improve correctness. Whatever system it is. Why? The costs are just too damn high, the only way to achieve 100% correctness when considering also pragmatic concerns, is to just not write any code.

My approach of just coming up with a simple design (not in code), trying to implement it in the most straightforward way, and fixing the code until it works, as described in my other comment, seems to have achieved something very close to correctness (maybe even 100%? Probably not).

Again, I'm not saying that UB is good or should be tolerated. I don't want it in my programs and if I find an instance of UB I'll try hard to get rid of it. However there is a reason why UB exists in C/C++ (as well as many other languages that may not have as much of it, but still have a lot of it even when not defined explicitly). And alternative approaches, trying to prevent UB mechanically, come with a cost that may not be worth it depending on what you're working on. I feel strongly like it isn't worth it for me. If you're building a fully verified or certified product, tradeoffs are likely different.

If we're citing big names, here is a well known person describing their view, which I find myself agreeing a lot with.

https://www.youtube.com/watch?v=EJRdXxS_jqo

rurban · 2024-05-29T12:08:28

Gabi is the Visual C compiler maintainer, not just tooling. The only sane person in the C++ ISO committee (_besides the sdcc maintainer, who has no power at all_).

rfoo · 2024-05-26T21:50:55

The problem here is people seldomly get paid rewriting existing C code into memory safe lamguages. While once in a while someone would be annoyed enough and pay for a fixing-C effort for a little.

Do you have suggestions on how to fix the incentive?

admax88qqq · 2024-05-26T22:28:24

I don't think that's the root problem. I think C programmers don't believe C is a problem. New software is started every day on C. There's no excuse for that and no financial incentive to do it.

If the engineers actually admitted that C is not a safe languages for shipping software, then we could at the very least freeze the existing code and write everything _new_ in a same language. But we don't. Engineers still go starting brand new greenfield projects in C which is just insane.

rfoo · 2024-05-27T05:47:43

> then we could at the very least freeze the existing code and write everything _new_ in a same language

Sure, if you are willing to help, here's my wishlist:

- I wish we can freeze libssh and write everything new in Rust.

- I wish we can freeze CPython and write everything new in Rust.

- ...

Can you do it for free? At work I'm busy maintaining old projects in C++ and writing new ones in Rust. Since I'm not getting paid to rewrite or maintain our dependencies full-time I can't do above. Oh, I'm not paid to initiate an effort to freeze our old projects either.

If this sounds too harsh:

- I wish we can freeze ZMK [1] and write everything new in Rust (or Zig, though it's not memory safe, whatever).

That's about one of my hobbies and I always wanted to do it.

[1] https://github.com/zmkfirmware/zmk

optymizer · 2024-05-26T23:22:33

Just because you don't understand it, doesn't mean it's insane. It just means that your view of the world is different from others.

girvo · 2024-05-27T01:12:54

> But I think secretly C programmers don't want memory safety

Having just come out of embedded firmware land: it's not secret, a few members of my team were pretty open about either not caring about or not wanting memory safety. But the added productivity that Nim gave us outweighed their complaints in the end

usrnm · 2024-05-26T20:41:09

> The cost of that has been literally billions of dollars in losses due to buffer overflows over the decades.

How much, do you think, would rewriting all existing C code cost?

junon · 2024-05-26T20:59:56

Nothing, if new standards are opted in via a #pragma.

eru · 2024-05-27T04:46:03

I'm not sure that's this easy, because of the copy-and-paste nature of C's "libraries" via pre-processor #include directives.

junon · 2024-05-27T06:35:07

Not really. We already have `#pragma once` which is per-file, not per-translation-unit.

lisper · 2024-05-26T21:04:15

Why do you think that would be necessary?

usrnm · 2024-05-27T10:05:08

It's the only way to really solve the problem. Simply creating a safer alternative won't help, they already exist. The real problem is the vast ocean of already existing critical unsafe code

candiodari · 2024-05-26T21:40:07

The whole reason C code is used is that it can be used for free. In other words, infinitely more than they're spending now. More than even the CCP is willing to spend to protect state secrets.

vbezhenar · 2024-05-27T09:44:06

Is it main problem with C? I never had any issues with arrays.

For me problems with C are as follows:

1. Inadequate standard with lots of unspecified behavior. I want behavior matching my architecture. Wrap my integers, always. I want my bit fields to be laid out in a predictable way, so I can actually use that feature to work with protocols rather than using shifts and bitwise operations. I must not research whether compiler will throw away my `while (true)` loop. Compiler must be predictable.

2. Bad standard library backed by the language standard (so I can't just throw it away, compiler will replace my loops with memcpy or will replace any standard library call with particular semantics).

3. Unexpected runtime cost. For example global variables are initialized to zero and that requires special code inserted before main.

4. Bad syntax. Switch must not allow fall-through, not by default anyway. `()` must be treated as function with zero arguments, not as a function with unspecified arguments.

5. Lack of namespaces and modules.

I definitely don't want compiler to track array lengths. C is not Rust.

JonChesterfield · 2024-05-27T11:31:47

1. Is what gets in my way the most when writing C. The whole undefined behaviour story is a wreck.

Not hugely relevant but zero initialised globals don't imply code inserted before main. They get allocated at link time into a continuous blob which is zeroed by the loader. Non-zero values take up space in the binary but also involve no code generation. Global constructors, C++ style, usually do involve walking an array of function pointers before main though.

lieks · 2024-05-27T11:44:21

The Rust/memory safety crowd does exaggerate its importance somewhat. Arrays aren't actually that common in some types of applications, especially kernels, which is what C was designed to implement.

Null-terminated strings are bad, but there's nothing about the language (other than the standard library and literals) that forces you to use them. There are several alternative implementations[a][b].

1 is true, but the modern interpretation of UB is against the spririt of the original standard, even though it's according to the letter of it. It was meant to mean exactly what you're talking about, but got optimized into oblivion later.

2 is what I'd say is the main problem with C. The library is truly awful, to the point where I treat all of it except the mem* functions and maybe exit deprecated.

3 is not true; look up how the BSS segment works (it might be true on Windows, I don't know, but that's not really a very good C implementation, since it's for C++ primarily).

4 is true for a reason; pre-standard C didn't have function prototypes. This was invented by the standard, to much rejoicing. But they needed compatibility, so that's where it comes from. Switch cases are really labels, and that's why they fall through. If C was more actively developed (read: less stable) these would have already been fixed, by deprecating them and creating replacements.

5 is not something I usually care about much. Write one header file per module, don't include headers from other headers (declare anything you need in the header itself, the compiler allows it) and just prefix each exported function with foo_ or something like that. It's not that much trouble, and you get multiple .c-files pre module, which some languages don't allow.

The thing is that, in order to fix most of those issues, you'd either break backward-compatibility (and fork the laguage) or break most existing implementations (and lose all support). Making a new language is much easier.

As for tracking lengths, I mostly agree. If the length is statically known, that's perfectly fine. But dynamic lengths shouldn't have hidden tracking. Still, the C99 solution isn't that great, and I can't see how else to do it.

[a] https://nullprogram.com/blog/2023/10/08/ [b] there was a fat-pointer string library, but I can't find it now

vbezhenar · 2024-05-27T13:20:12

#3 is true for embedded. When you power your MCU, you'll get RAM with 0xff. Something will insert code to zero bss section. Usually it's so-called system initialization code which will transfer control to libc initialization code. So you'll have your MCU wasting cycles (if you don't really need zero-initialized globals).

I'm not aware of operating systems implementation, but I doubt that there's some magic to provide zero-initialized memory. To zero out memory, you need CPU work. Probably OS won't reuse memory because of security reasons and will need to erase it any way.

That's not a big issue, of course. With some linker tinkering you can declare truly uninitialized global variables. Not "standard C", but whatever.

cozzyd · 2024-05-27T14:50:22

You can change this zero initialization easily enough on a microcontroller (either by removing that code or modifying the linker map).

glitchc · 2024-05-27T13:53:10

> The Rust/memory safety crowd does exaggerate its importance somewhat. Arrays aren't actually that common in some types of applications, especially kernels, which is what C was designed to implement.

I'm not sure if this is true. All kernels rely on stacks and queues to manage processes and scheduling. They're all implemented as arrays.

naasking · 2024-05-27T14:09:37

> I'm not sure if this is true. All kernels rely on stacks and queues to manage processes and scheduling. They're all implemented as arrays.

It's not true AT ALL. Show me a kernel that doesn't deal with buffers. Buffers clearly have array semantics. Even microkernels with a minimal API surface map and copy between buffers.

nesarkvechnep · 2024-05-27T11:58:43

Is [b] https://github.com/antirez/sds?

lieks · 2024-05-27T19:32:07

Yes! That's the one.

juliangmp · 2024-05-28T09:01:36

I 100% agree with the first part, "undefined behavior" should have never become an excuse for the compiler authors to remove error handling from our code as an "optimization".

But its still beyond me how in decades of programming the C people somehow haven't had the idea of a pointer+size type. Whether you call it a slice, span or a view, doesn't matter. Why do I have to manually pass a size everywhere, it's such a common case, why is there not a common solution??

Are the people writing the standard just afraid of change? Seems to me, especially when it comes to things that could resemble new syntax. Sidenote why does _Atomic look like that?

pif · 2024-05-27T11:36:10

> I want behavior matching my architecture.

Then use assembly. C is meant to abstract CPU architecture away.

rstuart4133 · 2024-05-27T22:01:54

> Then use assembly. C is meant to abstract CPU architecture away.

Maybe. Looking back, I'd describe C as a portable assembler. K&R only optimised CPU architecture away when it was efficient to do so. For example, it wasn't efficient to specify the signed-ness of a char - so they didn't. Likewise the result of modulo ('%') on a negative number was undefined. Even the size of an int was undefined. In all these cases you get what the hardware gives you - just like you do in assembler.

I think that was the right decision. It was the best that could be done at the time. The price of the abstractions languages put in place in the name of safety and convenience meant they were a poor fit for embedded and system work. Consequently C has had decades of dominance in it's niche, a success that speaks for itself. Now a plausible challenger has arisen, but it's taken forever in internet time.

What I will criticise is what came after K&R. In K&C Undefined Behaviour meant "you get what the hardware gives you". Since what the hardware did was always well defined with a given compiler and arch everything the behaviour of your code was 100% deterministic. Then the compiler writers seeking to produce faster code twisted the definition to mean "if the behaviour is undefined we can do whatever we damned well please in order to generate faster code". And with that, at different optimisation levels code started to do different things.

The C++ plumbed the depths of this idiocy by definite infinite loops as UB. I have no idea why - embedded computers, services on modern OS's and even OS's themselves have infinite loops at their core - loops running until the electrons are removed is exactly the behaviour the programmer is depending on. It's idiocy because the halting problem (and ergo whether a particular loop is infinite) is famously not decidable. So they've undefined behaviour in a way that itself isn't defined. In practical terms, this means you write an infinite loop you are making a bet some you've made so convoluted a future version of the optimiser won't spot it. It's a bet some the occasional person has lost when they moved to a new version of the compiler. It is possible to signal to the compiler the loop must stay regardless, but you have to be a language lawyer to know how, and most don't.

So yes for today's C you are right, it is not assembler. In some ways it's objectively worse, as assembly is always well defined.

pif · 2024-05-28T09:33:02

> In K&C Undefined Behaviour meant "you get what the hardware gives you".

No, what you are talking about is "implementation-defined". Out of all trades, programmers should know better than everybody else to code against the specification rather than the implementation. If your code worked with K&R compiler by chance, and not by contact, then your code is wrong. Period.

rstuart4133 · 2024-05-28T11:26:47

> No, what you are talking about is "implementation-defined".

Fair enough, with one qualification. K&R uses the phrase "behavior is undefined" very frequently. Almost every use in fact means what you term "implementation-defined". The one exception I could find is reading an the uninitialised variable, where what you read may be truly undefined. In K&R they feel to need to clarify that particular usage of "undefined" by saying "have undefined (i.e., garbage) values". Even in that case, you are getting whatever the hardware gives you.

For example, K&R says "The effect of the call is undefined if the number of arguments disagrees with the number of parameters in the definition of the function". That only makes sense if it means "The effect of the call is implementation defined" because we make such calls it all the time, and rely on the outcome being completely deterministic. Or in their description of comparing pointers: "But the behavior is undefined for arithmetic or comparisons with pointers that do not point to members of the same array." And indeed it is not defined by C, but what happens is well defined by many implementations and embedded C programmers rely on it.

Contrast that to the current meme "if the compiler detects undefined behaviour, it may do whatever it damned well pleases and that may change depending on optimisation level or version, or even on the order you compile your modules" doesn't appear in K&R. If it did mean that the K&R compiler would be fully entitled to remove an if and it's else when it depended on the comparison between pointers to different arrays. And if it detected a negative shift it could delete the instruction entirely instead of emitting it and letting the hardware do whatever it does with negative shifts. The old C compilers did not do that, ever, because K&R's "behaviour is undefined" did not give that sort of licence. Renaming K&R's "behaviour is undefined" to "implementation defined" and appropriating "undefined behaviour" for entirely different usage seems like word games to me.

pif · 2024-05-28T14:18:20

> The old C compilers did not do that, ever, because K&R's "behaviour is undefined" did not give that sort of licence.

This extrapolation of yours is not warranted! The first compilers were simpler, and optimizations came later. The fact that people got used to simple compilers does not make their code correct with respect to the standard, sorry!

rstuart4133 · 2024-05-28T23:17:47

> make their code correct with respect to the standard,

The point I'm making is about the standard. Somehow, instead of "undefined behaviour" (using your definition) generating a compile time error, we drifted into letting it mean compile time non-determinism. By compile time non-determinism, I mean we let the compiler generates machine instructions that don't reflect the code.

I've spent a good chunk of my life battling problems introduced by non-determinism. Giving compilers a licence to generate more of it strikes me as pure insanity.

And now you come along, and say the problem isn't the standard, it's the programmers not being language lawyers. It's their fault for not memorising a standard so complex it literally gives compilers permission to base their decision on whether it can generate non-deterministic code on a undecidable puzzle - the halting problem. ffs.

To anchor the conversation back to the original point about C not being assembler, I want all languages I use to be glorified assemblers. Which is to say I want the way the code is translated to the next level down to be clear and transparent. Granted, clever optimisers it something clever and faster, but that those clever and faster instructions should perfectly emulate what the clear and transparent translation would have done.

To the compiler writers throwing their hands up, protesting that limits them too much: this is literally what the hardware does now. It takes a stream of instructions and mangles it by compiling to micro ops, reordering them, executing them in parallel, makes guesses about branch directions and speculatively executing code, hell it sometimes even executes both the if and else sides of a conditional. It manages to extract amazing speed out of the serial instruction stream, sometimes executing 10 in parallel, and yet it manages to preserve the exact meaning of all the original instructions, even those defined 40 years ago by the 8086.

I imagine the response will be "but hardware has options we don't, we need more flexibility to get the speed". It would be a fair point. But their solution to that dilemma was not so fair: twist the standard by redefining terms like "undefined behaviour" so they could get the flexibility they needed. In doing so they turned the standard maze of hidden "undefined behaviour" foot guns. The standards committee seems to think their primary job is producing something for the compiler writers. It isn't. Their primary job is to define a language programmers used to create complex, reliable computer systems. A standard full of foot guns isn't conducive to that.

pif · 2024-05-30T12:57:48

> Somehow, instead of "undefined behaviour" (using your definition) generating a compile time error,

I'm sorry, but this sentence shows how little you understand about undefined behaviour. Undefined behaviour cannot generate compile time errors in the general case, as the compiler does not have all the information. The only thing the compiler does is assume you wrote your program correctly, which you seem to refuse to accept as a responsibility of yours.

As an example, a simple statement like "int a = b + c;" cannot cause a compile-time error about integer overflow, because the values of b and c are unknown. The hardware will know them at execution time, but the compiler does not. And as soon as you state your expectation that overflow mimic the underlying architecture, you are wrong again because you choose the wrong tool: unless you are developing a compiler, you are not allowed to use "C" and "architecture" in the same sentence.

> And now you come along, and say the problem isn't the standard, it's the programmers not being language lawyers.

You said it a bit more bluntly that I would, but you are right. The problem lies on the developer's side. Mind you, I found myself making this error several times; I'm not pretending to be a master programmer, not at all! But every time the only solution was to accept responsibility and modify my code accordingly.

Unless you accept that your expectations are wrong, you are condemned to live the same stress over and over and over! Good luck to you!

rstuart4133 · 2024-06-01T06:29:59

> As an example, a simple statement like "int a = b + c;" cannot cause a compile-time error about integer overflow, because the values of b and c are unknown.

Then there is no problem, because to use your terminology what happens is implementation defined. The compiler will emit the appropriate instructions and I will get whatever the hardware gives me. That is the K&R behaviour I said I was happy with.

The problem happens precisely under the conditions you side stepped: when the compiler does know the behaviour is undefined. The standard says the runtime can assume that undefined behaviour never happens; therefore, the compiler can do whatever it pleases should the UB condition arise. If the compiler takes that option it generally does whatever is fastest, which is typically to delete the code.

Consider this example:

    if (buffer + len >= buffer_end || buffer + len < buffer)
        loud_screaming_panic("len is out of range\n");

The second part of that condition can be true if len is negative or huge (beyond the end of the array), but adding a negative number to a pointer and indexing beyond the end of an array both yield undefined behaviour. Ergo the compiler is free to do what it damned well pleases with that condition, and gcc chose to delete it.

Unfortunately, the example is real. The buffer was a packet received from the internet, "len" was a field read from that packet. If the code accepted a garbage value in len, it would then index beyond the array and bad things would happen. The programmer put a test in there to prevent that. Gcc deleted it. A CVE was the result.

The point being: gcc knew the code triggered undefined behaviour. I guess there are three options when you see code like that: do what it did with your "int a = b + c;" example and just emit the instructions (which is what it would do with -O0), or print an error and refuse to compile the code because it has UB, or merrily do whatever it damned well pleased without so much as a warning and continue on its merry way. The first option is K&R behaviour. The second is very Rust like. Either of those would have been fine in this case. The third option lead to disaster, and I think the standard allowing it is inexplicable.

And now we get to the point where I have to say it: I don't think you know as much about C as you think you do. The C standard defines a lot of undefined behaviour, so much it would be near impossible to avoid it even if you wanted to. Worse as it happens, C programmers very deliberately don't avoid it. Instead they exploit it by assuming K&R behaviour.

Take your illustration as an example, "int a = b + c;". You conveniently assumed the compiler knew nothing about "b" and "c", but lets say we had some if tests above that constrained "b" and "c" values so that "b + c" must overflow. In that case the standard says a is undefined, and it can throw everything that depends on it's value away. But as a C programmer you must know C programmers regularly rely on the fact that integer overflow yields exactly the same result on all C implementations. For example, if we are calculating a checksum of a buffer modulo 256, every C programmer will just "uint8_t checksum" and sum every byte into it. But if you are right about overflow being UB (and you are), then every C compiler could put a check to see if the next add did overflow "checksum", and stop the loop immediately if so. There is no need to tell the programmer about this because it is UB, and the compiler can do whatever it damned well pleases in the name of speed.

No compiler does particular optimisation of course. Not because the standard doesn't allow it, but because they would be lynched. Unfortunately the threat of lynching didn't stop them from deleting the code above.

> Unless you accept that your expectations are wrong, you are condemned to live the same stress over and over and over! Good luck to you!

I didn't say you were wrong about what the standard says. I'm saying the stance taken by the standard on how undefined behaviour can be handled is insane. This attitude taken by the committee is one of the factors leading to government agencies saying C should be dropped. I am also saying you are wrong when you claim the compiler could not emit an error when it classified something as UB, and then deleted code or whatever. I think that is self evident from the points I made above.

vbezhenar · 2024-05-27T13:24:31

Using just assembly is not realistic.

I'd love C-like language which compiles to assembly in an understandable way. I mean that optimizations are certainly useful and needed. I want to write `24 * 60 * 60` and expect for compiler to multiply those values at compile time. I want to write debug logging which could be disabled for release build and all relevant variables and functions be excluded. Basically there are reasonable optimizations which I would expect for compiler to do.

Anyway it's not like there's a choice in reality. Manufacturer provides SDK and examples with C, using another language for most projects is not realistic.

pif · 2024-05-28T09:37:09

> Using just assembly is not realistic.

And using C as fully deterministic is idiotic, considering the effort that went into specifying in great details everything that is NOT deterministic.

> I'd love C-like language which compiles to assembly in an understandable way.

Which is orthogonal to its purpose. A compiler's job is to provide the fastest code that implements what your (correct) code specified. Readability is pointless for generated code, as it not meant to be read nor modified.

lieks · 2024-05-26T22:28:53

C99 did that.

    void f(int len, char s[len]);
    void g(float m[static 16]);

The first one is for variable-length arrays, the second one for fixed-length. TCC even implements array bounds-checking.

But because a lot of people were stuck using C89 for several decades (due to old compilers), and the syntax isn't that great, nobody even knows they exist.

Personally, I think Dennis Ritchie's[a] proposed syntax[b] was much better:

    void f(char s[?]);

[a] The creator of C.

[b] https://www.bell-labs.com/usr/dmr/www/vararray.html

segfaultbuserr · 2024-05-27T02:47:47

> and the syntax isn't that great, nobody even knows they exist.

Linux kernel maintainers certainly do - they even invented a non-standard extension of C99's VLA notation (their own invention, not implemented in GCC)... In newer versions of the package "Linux man-pages", most libc functions are documented using a non-standard extension of that syntax. For example, the prototype of memcpy() now reads [0]:

    void *memcpy(void dest[restrict .n], const void src[restrict .n], size_t n);

It means this function accepts two arrays (dest[] and src[]), with n bytes of data type "void", src[] is read-only ("const"), and both src[] and dest[] are non-overlapping ("restrict"). Two non-standard notations are used here: a "void" array with elements of unknown type (not allowed in C99), also, ".n" means "a variable in the argument list, but defined after this variable" (which is not allowed in C99).

The equivalent C99 definition would be something similar to:

    void *memcpy(size_t n, char dest[restrict n], const char src[restrict n]);

As more people are exposed to these man pages, the C99 syntax hopefully will have more publicity. Finally, C23 interestingly states that:

> 15. Application Programming Interfaces (APIs) should be self-documenting when possible. In particular, the order of parameters in function declarations should be arranged such that the size of an array appears before the array. The purpose is to allow Variable-Length Array (VLA) notation to be used. This not only makes the code's purpose clearer to human readers, but also makes static analysis easier. Any new APIs added to the Standard should take this into consideration.

[0] https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/...

[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2611.htm

guenthert · 2024-05-27T07:57:17

> TCC even implements array bounds-checking.

Optionally (command line option '-b') though and (somewhat perplexingly) only at run-time. The documentation hints at the run-time performance (and code size) impact.

lieks · 2024-05-27T11:15:34

TCC doesn't do any optimization at all, so it shouldn't be that perplexing.

But any compiler that does constant propagation, inlining and (ideally) integer range analysis can optimize away most run-time bounds-checks already. If GCC or clang did it, it would probably be fast enough.

pornel · 2024-05-26T22:31:10

This article is 15 years old now, and nothing has changed:

https://digitalmars.com/articles/C-biggest-mistake.html

(and of course the problem wasn't new 15 years ago either.)

It wasn't fixed then. It won't be fixed now.

C is valued for not changing. C is valued for backwards compatibility with the most obscure platforms with unmaintained compilers.

The C userbase is self-selected to like C exactly the way it is.

pjmlp · 2024-05-27T08:35:35

The irony is that C does change, we are at C23 now, but not in the ways that would actually improve its safety.

dzaima · 2024-05-26T21:35:19

For the people who actually care about using C, there can be no unified concept of "array". Any single "the system" will just plain and simply be unusable in a good majority of situations.

Some people will want to store the size in a type smaller than size_t (and potentially place it not adjacent to the data pointer for better packing in a struct; or perhaps even bit-pack it). Some will want to place the size relative to the data instead of the pointer (esp. flexible array members). Some will want to store half (or a third, etc) of the element count, the array being used as multiple back-to-back arrays. Then you'll have questions on pointers in the middle of an array, indexable by positive and negative indices. Never mind the pretty significantly increased register shuffling of having to pass the size across functions.

For projects that are fine with doing things in the single way forced upon you and don't care about how things are done as long as it's safe, C is already rather clearly not the language to choose.

C++ as-is can already pretty trivially allow adding bounds-checked array types, and compilers can even be configured to warn on all pointer indexing (https://godbolt.org/z/W8sqGW5sh), achieving the entirety of your proposal while not locking people into a single data structure. (granted, some may not want to expand to C++ "just" for one or two features (me included), but here allowing customizable array data structures is basically the only sane option, and C would have to take a rather significant amount from C++ to allow for such itself)

lisper · 2024-05-26T22:06:40

> Some people will want to store the size in a type smaller than size_t

So? My point here is that this should be the compiler's responsibility, not the programmer's. Why could not a compiler bitpack the same way -- or even better than -- a programmer could?

dzaima · 2024-05-26T22:13:10

The compiler cannot change the length field's size if a reference to the array (or a struct where it's contained) is ever passed to an unknown function, as that function has to be able to read the length from memory based just on the spelled-out type.

Not a problem when passing the array by value (i.e. two registers of the data pointer and length), but then any packing automatically does not apply anyway.

lisper · 2024-05-26T22:29:39

I don't understand this. Why does it matter if you're calling an unknown function? Why would an unknown function be unable to get the length? All you would need to do is to change the behavior of SIZEOF to make it aware of dynamically allocated arrays.

dzaima · 2024-05-26T22:41:04

I mean in the case of an array type that tracks its length at runtime. Take:

    typedef struct {
      uint32_t arr1_len, arr2_len;
      int* arr1;
      int* arr2;
    } Foo;

That's, on a 64-bit system, a 24-byte structure. Were it written as a struct of two array fields, the compiler couldn't choose a layout as efficient while maintaining being able to get a pointer to each field. Never mind that the compiler would likely not be omniscient enough to be able to tell that the structure is never used with arrays exceeding 2^32 elements.

Perhaps you mean to keep using regular pointers for non-trivial heap-stored things, but I'd imagine that makes up a pretty significant amount of cases with buffer overflow potential.

actionfromafar · 2024-05-26T22:48:46

Or add another sizeof keyword.

flohofwoe · 2024-05-27T06:24:29

Once you have "proper" arrays, you'll also need "array-references" e.g. fat pointers carrying a length (aka slices), and if you want to avoid unsafe pointer/array conversions you'll also need a typed allocator function, and all of that also requires a new stdlib and probably an extended ABI, or at least a standard for how the new types are layed out in memory, and passed into and out of functions. At that point the whole thing is so different from C that we should give it a new name - maybe, I dunno "Zig"? ;)

PS: There was actually a quite recent bounds-checking proposal by (I think) Apple Clang folks that works with annotations and IMHO looks pretty good (in the sense of "I would actually use it in my libraries"):

https://discourse.llvm.org/t/rfc-enforcing-bounds-safety-in-...

pjmlp · 2024-05-27T06:57:52

Yes and even Dennis Ritchie failed to have WG14 caring about them.

Not that his proposal was perfect, but WG14 didn't even bothered to improve upon it.

zik · 2024-05-27T07:46:05

This has been done quite a few times before, but the resulting languages are not C.

Also saying that's "the main problem with C" seems to miss the mark. As a C programmer I wouldn't call this the crux of any particular problem. It's weird and problematic in multiple ways but surely not "the main problem".

fijiaarone · 2024-05-27T02:36:47

That’s like saying “the problem with aircraft is that they keep crashing when there is not enough lift under the wings, so the obvious answer is to get rid of wings”.

Only, it would even be stupider if your grasp of programming was as great as your understanding of aerodynamics.

lisper · 2024-05-27T05:41:44

Ironic that you would pick that metaphor because I happen to have both a pilot's license and a Ph.D. in computer science. So I really snookered someone.

LegionMammal978 · 2024-05-26T20:07:48

In fact, per the standard, a+x is already an array reference: you aren't allowed to go outside the bounds of the array immediately containing the pointee, on pain of UB.

(Arrays do exist in the object model, and you can take pointers to them of type T (*)[N]; you just can't copy them around by value, and the name of an array decays to its first element pointer.)

Compilers just typically don't track array bounds at runtime because of (a) performance and (b) big ABI incompatibilities. There's nothing in the language itself that stops them.

lisper · 2024-05-26T21:17:29

> on pain of UB

But this is exactly the problem. UB can be anything, including nothing. This should be at the very least an optional warning. But here is what happens with a fairly current C compiler:

    % gcc -v
    Apple clang version 14.0.0 (clang-1400.0.29.202)
    ...
    % more test.c
    int main () {
      int x[10];
      return *(x+20);
    }
    % gcc -Wall test.c
    %

fweimer · 2024-05-26T21:40:02

These kinds of warnings need optimizations enabled.

    $ gcc -O2 -Wall -c t.c 
    t.c: In function ‘main’:
    t.c:3:14: warning: array subscript 20 is outside array bounds of ‘int[10]’ [-Warray-bounds]
        3 |       return *(x+20);
          |              ^~~~~~~
    t.c:2:11: note: at offset 80 into object ‘x’ of size 40
        2 |       int x[10];
          |           ^

Not sure how to get Clang to warn. It clearly recognizes the undefined behavior.

dzaima · 2024-05-26T21:52:22

Clang appears to have -Warray-bounds-pointer-arithmetic for this, though not enabled on -Wall nor -Wextra. (fwiw, clang has -Weverything to turn on literally every warning, including conflicting ones, for finding what flags there are)

lisper · 2024-05-26T21:57:36

Yeah, Clang warns on x[20] but not *(x+20) even with -Wall and -O2. It's kinda weird.

dzaima · 2024-05-27T13:28:54

Ah, the -Warray-bounds-pointer-arithmetic warning is actually just about the pointer addition; thus even that doesn't warn on *(x+10) on the 10-element array, as the construction of the past-the-end pointer is still valid, and seemingly no warning checks bounds validity for actual dereference.

LegionMammal978 · 2024-05-27T00:34:05

You can also make it throw an error at runtime:

  $ gcc -fsanitize=undefined -o test test.c
  $ ./test
  test.c:3:10: runtime error: load of address 0x7ffea07fded0 with insufficient space for an object of type 'int'
  0x7ffea07fded0: note: pointer points here
   b2 55 00 00  d8 df 7f a0 01 00 00 00  c8 df 7f a0 fe 7f 00 00  00 00 00 00 00 00 00 00  62 0f 73 ea
                ^

> But this is exactly the problem. UB can be anything, including nothing.

There's nothing stopping compilers from implementing the semantics you want (when not crossing an ABI boundary), and indeed, they've been adding more gradual hardening options that can be used in production. It's just that there's little demand for universal bounds-checking on arrays, and some users even want more flexible accesses, e.g., for operations like container_of.

On the other hand, people can and have made experimental forks of the Rust compiler to turn all panics into immediate LLVM-level UB, but the mere existence of such an option doesn't mean that Rust's bounds-checking is now worthless, as you seem to be implying for C.

Xeamek · 2024-05-26T20:06:33

C programmers NO! You can't do that, that would break compatibility!

also C programmers Wow, You are trying to compile this old program on new toolchain?

Here is a list of 10k errors due to changed defaults flags for compilation, dependencies breaking by going from 9.4.1 to 9.4.2 and also, your code contains platform-specific extensions anyway.

But C is portable guys! It really is!

pdw · 2024-05-26T20:44:17

That makes sense when you realize that what C folks care about more than anything else is _ABI compatibility_. Changes to language or toolchain are acceptable as long as they don't change the ABI.

E.g. consider the hemming and hawing about a 64-bit time_t. That's a tiny change in comparison, and one that's obviously unavoidable and with a strict deadline. And yet...

zzo38computer · 2024-05-26T21:44:17

> E.g. consider the hemming and hawing about a 64-bit time_t.

I had received a compiler warning about this when trying to compile a program on Raspberry Pi (it was a old version; I did this a few days before they implemented 64-bit time_t on Raspberry Pi). Fortunately I was able to add a macro to allow it to work on computers without 64-bit time_t. Other than that, the program I compiled worked perfectly, although I wrote the program on PC and did not specifically try to make it work with Raspberry Pi. So, a C code is portable, although sometimes a few changes are required to make the program to work.

But I think that 64-bit time_t is a good idea.

kukkamario · 2024-05-26T20:59:48

64-bit time has already been resolved in newer 32-bit Linux versions. Issue isn't that changing ABI couldn't be done. It is that no one wants to update OS and custom software of a 15-year-old embedded system that still works. Archeology to find correct instructions to build a working OS image is challenge itself and then there is a need to adapt them to more modern tools. Been there, done that, and it wasn't fun...

jeroenhd · 2024-05-26T20:34:44

But that's the thing, code bases are full of non-standard C, but they never break because the C standard changes, only because their weird hacks are falling out of favour. Fixing a design flaw in C will break all existing code, instead of the code of the dozens of companies relying on weird hacks they wrote ten years ago.

Once Zig becomes stable, I think it may have a chance of slowly fixing existing C code bases, by its virtue of being able to co-exist with existing code bases. On the other hand, a lot of C programmers don't want to change because they don't see a problem, so the presence of those will probably ruin any chance Zig has of improving the situation.

samatman · 2024-05-26T21:37:47

Other than the pessimistic finish, my prediction is that this is exactly what will happen.

The change will be a generational one. C will never entirely disappear, but over time, C code will be old code. People will occasionally write new C for what amounts to aesthetic reasons, and there will be a robust niche (doubtless well-paid) of maintainers of legacy C projects. All the kernels currently written in C will still exist, and still be composed mostly or entirely of C.

But new work in embedded, drivers, implementation of programming languages, netork stacks, will be in Zig instead. With some Rust as well, but I figure we've seen enough of that arc to tentatively conclude that what Rust is displacing is mostly C++, not C.

But when I say generational, I mean it. You're quite right that there is a legion of C programmers who like it and intend to stick with it. But they'll retire one day.

jeroenhd · 2024-05-27T06:57:57

Except for maybe in kernel projects, C code already is old code. As far as I can tell, (modern) C++ is the oldest common language for new projects.

Rust is already part of the Linux kernel, being used for rather complex things like GPU drivers in Asahi. Android rewrote their Bluetooth stack into Rust and Windows is actively replacing existing operating system components with Rust as well. I don't think we'll need to wait for Zig to start replacing old C code, it's already being replaced with Rust now.

I think Zig would've been a better choice for some parts, but it's taking too long to become stable for it to be included in large projects. Still, it may be useful for people working on embedded stuff, as embedded code seems to be stuck with 90s C when it comes to language support, if someone can manage to write a compiler backend for that specific embedded chip.

pjmlp · 2024-05-27T08:39:18

In many places outside the UNIX FOSS sphere and hardcore embedded devs, C++ keeps replacing C already, that includes Apple, Google and Microsoft OSes.

While Rust might be a better alternative, C++ is already much better than raw C.

qiqitori · 2024-05-27T05:23:58

> Once Zig becomes stable

Zig was introduced in 2016. If it still isn't stable 8 years later, I wouldn't hold my breath.

jeroenhd · 2024-05-27T06:50:42

Rust began in 2006 and became stable in 2015. Zig still has a year at the very least!

Zig lacks the corporate backing Rust had, though, so it's hard to say when they'll get stable.

Regardless, even if it'll take Zig a decade, I'm sure it'll have a stable release with decent memory management as part of the language spec before C will.

pjmlp · 2024-05-27T07:14:58

Zig is basically the Modula-2 features (1978) in a more C like friendly packaging, with an additional metaprogramming capabilities.

It needs a bit more than only corporate backing for taking off, something that makes it unavoidable, specially given the alternatives.

glitchc · 2024-05-27T13:51:56

An outright ban would break a lot of existing code out there. Deprecation may not help, as people can simply turn off the guard if optional or use an older compiler if not. The problem with C is the massive mountain of legacy code that operates all of our computer systems. C is everywhere, in some form or other.

zzo38computer · 2024-05-26T21:51:16

Arrays are a separate data type than pointers, although in many contexts you can use an array where a pointer is expected and it will work, and this feature is useful.

Bounds-checking can sometimes be useful, and can perhaps have a switch to control its working.

Some instruction sets (such as Flex) have tagged memory. In Flex, a pointer contains the address of a block (and a pointer can also be designated is read-only, disallowing writing through the pointer). There is also a "reference" consisting of a pointer and a displacement; I suppose this "reference" can be used to implement C pointers. If you use this, then the computer will automatically do bounds-checking and will result in an error if you make an access that is out of bounds.

Tagged memory also allows to easily find the pointers in memory without needing to know the types of variables, which can be helpful for some uses, e.g. to detect improper use of realloc/free. Furthermore, null pointer can be zero (without the tag bits), which is automatically an error when used as a pointer because it is not a pointer.

pmalynin · 2024-05-27T16:46:18

Maybe not the C community at large, but companies / teams / projects that care about security have done this. For example, iBoot uses Firebloom which is basically C with fat pointers and bounds checking.

mattgreenrocks · 2024-05-27T12:52:32

Conflation of arrays and pointers is only the tip of the iceberg. A big problem lies in the many varieties of undefined behavior and the relative ease one can invoke it, even when trying to avoid it.

abtinf · 2024-05-27T02:31:03

Are you referencing this?

https://www.digitalmars.com/articles/C-biggest-mistake.html

lisper · 2024-05-27T05:39:43

marmakoide · 2024-05-26T21:52:56

That would break most C code handling hardware directly, like on MCUs

lisper · 2024-05-26T21:56:20

pjmlp · 2024-05-27T08:42:05

Most likely because while they call C, what they actually write is macro Assembler looking code, full of compiler specific language extensions doing MCU intrisics, instead of Assembly opcodes.

MaxBarraclough · 2024-05-27T08:13:53

See also Walter Bright's 2009 article C's Biggest Mistake, on this topic.

https://www.digitalmars.com/articles/C-biggest-mistake.html

Recent discussion: https://news.ycombinator.com/item?id=40392371

Discussion thread about the article, 10 months ago: https://news.ycombinator.com/item?id=36564535

pjmlp · 2024-05-27T06:59:49

Because typing &myarray[0] is too much typing. /s

Even Dennis Ritchie tried to add fat pointers, but failed to have the new C overlords care about his proposals.

Despite what we might think about their feature set, it is quite relevant that Alef, Limbo and Go were designed by the UNIX folks, and all of them don't repeat the same mistake with arrays and strings.

psychoslave · 2024-05-27T07:03:35

Interesting hint on Dennis Ritchie attempt. Here is a related thread: https://news.ycombinator.com/item?id=39677581

tombert · 2024-05-26T22:48:53

When I was writing my proposal to get into a PhD program, I had to do a crash course in formally verified applications. The focus of the program is actually in Isabelle, but Coq is similar enough (in a hand-wavey kind of way) to where it was relevant to what I was writing about, and I stumbled across a few formally verified things with Coq.

I became slightly obsessed with CompCert, but it felt like a "real" program that was utilizing proper formal verification techniques. It seemed so cool to me that there can be (to some extent), and "objectively correct" version of a C compiler. I still think it's very cool; I wish people would geek out about this stuff as much as I would sometimes.

dang · 2024-05-26T22:32:25

CompCert C a formally verified optimizing compiler for a large subset of C99 - https://news.ycombinator.com/item?id=27644356 - June 2021 (1 comment)

CompCert – A formally verified C compiler - https://news.ycombinator.com/item?id=18968125 - Jan 2019 (57 comments)

Closing the Gap – The Formally Verified Optimizing Compiler CompCert [pdf] - https://news.ycombinator.com/item?id=13046449 - Nov 2016 (10 comments)

CompCert: A formally verified optimizing C compiler - https://news.ycombinator.com/item?id=9130934 - March 2015 (62 comments)

CompCert - Compilers you can formally trust - https://news.ycombinator.com/item?id=2619650 - June 2011 (28 comments)

nullifidian · 2024-05-27T01:46:33

I think it's important to note that you can't use it commercially.

"The INRIA Non-Commercial License Agreement is a non-free license that grants you the right to use the CompCert verified compiler for educational, research or evaluation purposes only, but prohibits any commercial use.

For commercial use you need a Software Usage Agreement from AbsInt Angewandte Informatik GmbH."

f1shy · 2024-05-27T14:24:44

The name of the company is joke, for those who do bot know, it is read same as absinthe in german, where the company is.

drdrey · 2024-05-28T16:20:54

it's also short for Abstract Interpretation, right?

JonChesterfield · 2024-05-27T11:51:11

I think compcert uses this https://github.com/jhjourdan/C11parser as the parser. It's a lalr grammar with hooks for the usual lexer hack. I suspect that with a symbol table would make a cheap and reliable C to AST compiler. Somewhere on my todo list.

gergo_barany · 2024-05-27T15:13:23

What would you want to use it for, and why not use an existing solution? My first choice would be Frama-C (frama-c.com), which has served me well in the past.

JonChesterfield · 2024-05-27T16:12:03

Personally, I want to compile C with vector types to x64 and to amdgpu at the same time, for running code on APUs like the MI300A. I have a suspicion checking that idea works in practice will be easier building from a grammar than from clang.

gergo_barany · 2024-05-27T17:34:29

So you want to add new builtin types for vectors? Yes, building on a parser sounds like a good approach. Otherwise, Frama-C would give you a nice detailed AST, but it might be complicated to extend this way.

Sounds like a cool project, good luck!