Been there, done that, in 2012. The array syntax can be more C-like and interfaces can be backwards-compatible.[1] Many people have been down this road. Selling it is harder than doing it.
Ah, thanks. I didn't know about this proposal. I am trying to push for similar things.
The main obstacles are people coming from MSVC or C++ who do not know variably modified types, and people convinced that VLAs are always bad. This then leads to many bad attempts at fixing the problem instead of simply using arrays that know their run-time length. While we still miss a bit of compiler support (I am working on it), this already helps today: https://godbolt.org/z/4a45xq5hr
(Update: Of course, the use of references in the proposal above and the motivation are a bit obscure. In any case, VM-types will not be optional in C23 anymore. And usage and interest are going up.)
Attacker-controlled sizes are always bad, and this is also true for heap allocations. With stack clash protection this becomes a DoS for VLAs (same as for heap allocations). I am not saying that VLAs are always the right choice, but in many cases they are better than the next-best alternative.
It's not about attacker control, it's about correctness of your code. Without VLAs then static analysis of the call graph (and absence of recursion) is sufficient to prove maximum stack depth; with VLAs then much deeper analysis is needed, if it's possible at all.
Your first reaction is "sure, even an old Intel 80286 chip can easily do memory-safe C arrays". Because if your code runs in '286 protected mode, using a (yes, scarce) separate segment register for each array, and you don't botch loading the array's base, limit, etc. into the segment descriptors, then your regular array-access code (in assembly, C, or whatever) gets memory-safe array accesses for "free".
We made variably modified types mandatory in C23. Compiler support for bounds checking is improving (via UBSan). Static analysis is improving (a bit). Flexible array members can now be secured using length information provided by an attribute. So yes, things are moving in the right direction. For the version after C23 I am relatively sure we will see a bounded pointer type.
The question is whether the fat pointer types will be useful and accessible enough for developers to migrate to them from the pointer+size combination. VMTs, which are considered the current best practice (recommended by CERT too), have a bad name due to automatic VLAs and the whole unbounded-stack-allocation debate, are not compatible with C++, and, most notably, Microsoft and some other vendors, like CompCert, refuse to implement them. All of this makes programmers less likely to use them. Also, most teaching material for C is stuck on C89isms, which doesn't help (just ask a student how to pass a 2D array to a function). I would love for fat pointers to enter the standard (either Walter Bright's or Dennis Ritchie's syntax is fine, though a `lengthof` operator is absolutely necessary imo), but if Microsoft and other vendors are not going to implement them and compatibility with other languages (C++, SystemC, OpenCL, ISPC, etc.) is poor, I'm afraid that we will continue to see the confusing pointer+size method.
I think fat pointers are relatively straightforward. VLAs and VMTs are now supported by many compilers (with some exceptions), even very small ones. Microsoft did not, for a long time, implement anything after C89 and wanted people to use C++. They have now caught up and I hope that they will implement VMTs as well. Microsoft Research has CheckedC and I hope at some point someone there will realize that VMTs are very similar to what they have there, except with better syntax.
C11 threads are now available in VS preview, so there is still hope, though one can't be sure of anything in these times (it's still funny that tcc has support for VMTs and MSVC doesn't). As for the fat pointer discussion, it would be a net positive in the standard (even better if there was an easy way to get access to the length, without using `sizeof`+division). Also, thank you for your contributions to the standard, I'm looking forward to seeing what people will cook with N3003 once compiler support lands!
Supposedly it's already in use: "The -fbounds-safety extension has been adopted on millions of lines of production C code and proven to work in a consumer operating system setting."
Is there a reason discussion around C is phrased like this? My reaction to that quote is, "yeah, it could possibly have some issues we're not really sure of, but it seems reasonably battle-tested too". It sends a bit of a mixed message, to me at least.
The C committee (unlike the C++ committee) usually only considers proposals which have implementations to show. And the more real-world experience there is with the extension, the better.
Also consider that this extension is designed in a way that existing code can be annotated without requiring drastic changes, and that it has been designed to remain ABI compatible.
A 100% watertight solution most likely requires new language features (or even a completely new language like a "Rust--") that would violate both of those requirements.
That's not really a problem though. Pretty much any non-trivial real-world codebase isn't pure standard C, many are absolutely riddled with non-standard extensions and it works just fine (you'll need to build and test on all supported compilers and platforms anyway).
For that Clang extension above it looks like it's possible to annotate source code without breaking compilers that don't support the extension by defining a handful of dummy macros.
IMHO the actual strength of C is that compilers can (and do) explore beyond the standard on their own.
The real blocker is that the various solutions are almost certainly never ABI-compatible with existing code, and for most people it's unrealistic to recompile the world (and even if you can forbid inline asm elsewhere, libc is nasty). (edit: I suppose WASM is sort of forcing that though, but unfortunately it didn't take the opportunity to allow dynamically fixing this)
A solution is mostly possible with a segmented allocator, which is quite reasonable on 64-bit platforms (32-bit allocation ID, 32-bit index within the allocation).
But keep in mind that "buffer overflow within a struct" is often considered a feature.
I think the best approach is "design a new, 'safe' language that compiles to reasonable C code, and make it easy to port to that new language incrementally".
The last item of a struct could be a variable sized array. That's sound,
and enough for things such as network packets. It's important to avoid bikeshedding.
The "buffer overflow within a struct" is used a lot when you have built a "header" struct that you are just pinning to the start of some record, and the bottom of the struct has a line like:
char data[0];
The purpose is to give you a handle on the remainder of the data even though you don't know the size beforehand.
Technically this is not valid under strict C, but it's also incredibly common.
There must be an expression which can be evaluated to determine the length of the array, and which can thus be used for checking. Without that, the code has little chance of working, since something had better define the size of that array.
> If you can't tell how big something is at all, the program is broken and will probably fail randomly.
That's not what I said at all. It's the exact opposite of what I said. Why this strawman? I already replied above and explained very clearly that I'm talking about when the size is known but not via the struct itself:
>>> the expression doesn't have to come from the same struct, though. It could be provided somewhere else.
>> It might very well be straightforward to obtain, just not located in that struct itself.
The size could be communicated in a different struct, no? Or passed back to the caller via a pointer argument? Or a million other ways beside the same struct itself?
>> And if the code isn't available to you to change?
> Then you don't get the improvements in safety yet.
Huh? This isn't a limitation with current implementations like -fbounds-safety. It's just a limitation with the proposal I was pointing out this issue with [1]. The existing implementations decorate the function/usage sites rather than the struct, which gives you access to information outside the struct. And there's no need to change every single use of that struct, which you obviously don't generally have access to.
I'm saying to deal with it. Change the code to be compatible. It's not that important to keep it the way it is.
Now you're referring to better designs, which is great. Have the best of both worlds if that's possible.
But when you were just pointing out that difficulty, my response is that it's a very small difficulty so that's not a big mark against the idea. If it was that proposal or nothing, that proposal would be much better than nothing, despite the forced code changes to use it.
> I'm saying to deal with it. Change the code to be compatible. It's not that important to keep it the way it is.
> But when you were just pointing out that difficulty, my response is that it's a very small difficulty so that's not a big mark against the idea.
In what alternate timeline do we exist where HNers believe you can just recompile the entire world for the sake of any random program? Say you're a random user calling bind() or getpeername() in your OS's socket library. Or you're Microsoft, trying to secure a function like WSAConnect(). All of which are susceptible to overflows in struct sockaddr. Your proposal is "just move the length from 3rd parameter into the sockaddr struct" because "it's not that important to keep these APIs the way they are"?! How exactly do you propose making this work?
I can't believe you think changing the world isn't a big deal.
So say I'm on board and decide sockaddr Must Be Changed. Roughly how long do you think it will be from today before I can ship to my customers a program using the new, secure definition?
And how does the time and effort required compare against the more powerful implementation that's already out there?
The C standard defines which array accesses are valid or not in the C abstract machine. This definition isn't simple at all. A C implementation can in principle add runtime machinery to check all accesses during execution. C implementations generally don't, due to performance and ABI compatibility reasons. But C the language doesn't prohibit it. Most existing programs making use of "buffer overflow within a struct" probably aren't actually conforming C programs.
Without knowing the context and purpose of the function, which the author doesn’t state, it could actually be perfectly sensible. For example, -1 (or any negative int value) could be used to mark unused entries in the array, and entries beyond the current array size are simply unused by definition.
It works because the type (char (*buf)[n]) knows the dynamic size 'n'. So the compiler can simply add a bounds check to an array access (*buf)[i] if instructed to do so.
The safety story is not complete though: if you pass the wrong size to 'foo', this is not detected (this would be easy to add to compilers, and I submitted a patch to GCC which would do this): https://godbolt.org/z/T8844e1z8
(ASAN still catches the problem in this case, but ASAN does not work consistently and has a high run-time overhead.)
Basic idea is to have a system call that allows library writers to get the bounds of a pointer. This way they can ensure they're not writing too much data to a location.
Another idea I've implemented in userspace is to create an allocator that allocates a page (via mmap) then set protections on the page before and after. The pointer returned aligns the end at the next page. If a write goes beyond the end of the pointer, it bumps into the protected page, and causes a fault. Then you can handle this fault, and detect an overflow.
An even stricter version of this is to add protection to the page the allocated pointer is assigned to. On _every_ write you get a fault, and can check that it's not out-of-bounds.
All of these methods are slow-as-hell, but detect any memory issues. While slow, they are faster than valgrind (not badmouthing it, it's an amazing tool!). So the recommendation is to use it in testing and CI/CD pipelines to detect issues, then switch to a real allocator for production.
Only if the stride is small enough to not skip over the guard page, surely? Unless you're setting the entire address space to protected, for any given base pointer BP there's a resulting address BP[offset] that lands on an unprotected page.
It might be interesting to expose an instruction that restricts the offset of a memory operation to e.g. 12 bits (masking off higher bits, or using a small immediate) to provide a guarantee that a BP accessed through such an instruction cannot skip a guard page; but that would of course only apply to small arrays, and the compiler would have to carry that metadata through the compilation.
> In this case you'd need to store that the max value is bufsize at an earlier point in the program, not the current value. A full implementation would get quite complicated quite quickly.
We're almost 1/4 into the 21st century. As someone who has attempted this as an exercise in C, please, just learn a little bit of Rust and move on to modern problems. The analogy I like to use is adjusting all the springs on a mattress vs buying a memory foam mattress and letting better material do its job.
I’m using rust and it’s not always clear how to solve this problem at compile time in this language either. If your array sizes are const then it’s doable but that’s only a subset of the usage because often arrays must have some known variable size.
Since you seem to know how to do it, can you please tell me how to check the bounds of the arrays at compile time if the actual underlying value is not known until runtime? I guess I could make a type-level algebraic field and add, subtract, multiply, divide my generics but this seems like a huge pain in the butt.
If anyone knows how to perform comptime algebraic bounds checking in stable rust lmk
I don't get this: why use both the annotation and the bounds checks? Once I annotate the function's parameters, the compiler can safely assume what the annotations state. Of course it must then check that the condition is guaranteed at the call site. Just like you don't have to check that index >= 0 because it's declared as unsigned, but it is the caller's duty not to call the function with -1, for example.
The annotations say "the bound of the pointer/array p is expression X", but if you subsequently write for (int i = ...) { p[i]++; } the compiler has to perform a bounds check unless it can prove that i never exceeds the bound given by the expression.
e.g. (using the implemented and used in the real world bounds safety extension in clang[1])
void f1(int *ps __counted_by(N), int N) {
for (int j = 0; j < N; j++) ps[j]++;
}
void f2(int *ps __counted_by(N), int N) {
for (int j = 0; j < 10; j++) ps[j]++;
}
In the function f1 the compiler can in principle prove that the bounds checks aren't needed, but in f2 it cannot, and so failing to perform a bounds check would lead to unsafe code.
Making the bounds of a pointer explicit does not mean that you no longer need bounds checks; it just means that now you know what the bounds are, so you can now enforce them. Then, as an implementation detail, you optimise the unnecessary checks out. The enforcement aspect is required for correctness: if you cannot prove at compile time that a pointer operation is in bounds, a bounds check is mandatory.
Later it suggests changing just the declaration, adding a parameter annotation in this way:
int do_indexing(int *buf,
                const size_t bufsize [[arraysize_of:buf]],
                size_t index);
It suggests that, with this function declaration and the former function body, the compiler should emit a warning:
> Similarly the compiler can diagnose that the first example can lead to a buffer overflow, because the value of index can be anything.
This clashes with my understanding of why we would use the annotation in the first place. You seem to hold the same feelings, as this is analogous to your f1(), which you agree the compiler should consider safe.
Instead TFA seems to advocate that the correct function body should be:
> Now the compiler can in fact verify that the latter example is safe. When buf is dereferenced we know that the value of index is nonnegative and less than the size of the array.
> "memory safety" is not the point of C and never has been.
You're right, of course; sadly, people still use C to write software other than bootloaders and kernel memory managers, for which C's "you figure it out" approach works well. Instead, you see command-line utilities parsing configuration files and user input, all written in a language that lacks modern facilities for bug detection and program safety.
With the readability of C programs, criticism of Rust's admitted ugliness is a bit weird. C is a horrible language already, no need for Rust influences to make it any worse.
However, this isn't about "let's turn C into Rust". This is "let's take one of the obviously good ideas Rust brought to the mainstream and see if we can add it to C". In this example, by using attributes, which are already part of C.
If you like your memory safety bugs and buffer overrun vulnerabilities, nobody is mandating that you use these attributes. These suggestions are for other people to make their code better, not to take away your DOS system memory model.
The robustness principle applied to memory? Like when the browser adds the missing closing HTML tags instead of telling the dev to fix their code? We sure we want that?
Not every html tag has a closing tag. "<br />" parsing is supported in html5 because recognizing self-closing xml tags is necessary due to people publishing xhtml with an html mime type. But to be clear, it's incorrect: <br>...</br> is wrong because <br> is a void element, so the closing tag itself is erroneous.
In terms of validation: html5 has a specified parsing algorithm, you can validate html against it. That validation is not xml validation. Again, because html is not xml.
There is a very clear specification about how html tags are parsed, trying to reason about html as if it is xml is just as (if not more) incorrect than trying to reason about C as if it were C++. e.g.
void f();
Is a different type in C than it is in C++, but if you do this you can't turn around and complain that C and C++ see it as having a different type when the syntax has different meanings.
I don't necessarily disagree, but...is this worse than what we already have? My argument against this sort of thing for a language like C is that "do what you think I mean instead of what I say" might not actually result in what I _actually_ mean if the compiler guesses wrong, but undefined behavior isn't generally going to do what I want either, so I'm not sure this would be any worse.
Some people depend on undefined behavior, but there's generally not much sympathy for it, because it is in the name.
On the other hand, many people have depended on interpretations of intent.
Changing the behavior of one is much easier for the community to swallow than the other.
An easy example is how early versions of IE failed to correctly implement the box model for sizing elements. For backwards compatibility reasons, IE6 and on would revert to the old, incorrect behavior if parsing the html document put it into 'quirks mode', and later on this behavior was added as an optional CSS box-sizing property.
The modern insistence on (absolute) memory safety feels like another symptom of regression. It is like going back to insisting you can and should be absolutely safe from HIV by making certain lifestyle choices. In the '80s we knew that the answer was being safer, not being safe.
I am all for making C and C++ safer by adding mitigations for the most common security relevant bugs. Either on language level, or on tooling level or on OS level or on CPU level. Or a combination of them. But since making a language absolutely memory safe doesn't make it automatically impossible to have bugs at all (even security relevant bugs), we should consider everything a trade-off.
This is a silly argument. Making a language _memory safe_ protects against one class of bugs and that has merit in and of itself. The fact that there are more classes does not negate the value of that. To borrow(lol!) your own analogy here: it makes $language Safer. Not safe.
I'm new to properly native languages, but having worked with Rust for 6 months, I've yet to encounter a single crash that wasn't on part of my error handling.
I _love_ C and I've written one C program every year for the last decade or two. I generally write something really stupid, and there are going to be crashes that smash the stack. As in: I'll fudge up and return a pointer to the stack from a function, basically.
This proves nothing, of course. But Rust won't even let me fudge things up that way. This means that I have more time to spend fixing those other classes of bugs and that is the win!
I have been full time in C++ development for >15 years. I can count the number of times I had memory correctness issues that weren't discovered and fixed trivially on one hand, all of them very early in my career. Like your anecdote, of course mine doesn't prove anything.
I don't want to sound like I'm downplaying your career, but what kinds of projects did you work on for 15 years to have an experience this positive?
For instance, I find it incredible that you never had to debug and work around a third-party vendor DLL that you didn't have source for but that was leaking memory like crazy. This is just one example of something that can hardly be "fixed trivially", as you don't have the source to modify, and it is extremely common in some fields (i.e. embedded).
I have been working on a monorepo C++ codebase with on the order of 1000 person years of development on it.
And no, we didn't have major issues with memory leaks either. About on par with what I have seen in garbage collected languages. RAII works quite good most of the time.
I take it that your project being a monorepo means you have access to the code and can change it when deemed necessary?
I get how that would be characterized as "trivially fixable", but I assure you that this is not the kind of projects people complain about when they discuss memory safety issues.
Consider that some people need to send emails to vendor companies begging them to stop segfaulting, corrupting the stack, or leaking memory. You're lucky if it gets fixed in a few months, because that means you wouldn't have to seek alternatives, which would be even more time consuming. In conclusion, there are many people out there dealing with memory safety issues which are anything but "trivially fixable".
Microsoft and Firefox have cited around 70% of their security bugs being memory-safety bugs, and they probably have good tooling and whatnot. There are a few languages with good C/C++ FFI that are a better choice for memory safety, so the tradeoff there isn't very high. I grant there may be ABI edge cases or whatever, but C/C++ is no longer viable or necessary for a good portion of software.
Well 3rd party libraries that are sloppy may be fixed by a psychological method…if you had an autotest environment that fuzzed the inputs and checked for correctness, you could label each library as “robust” or “weak” and leave it at that.
Then people could decide which ones to use based on the label alone. This would be an incentive for people to fix their libraries.
Then the process of normal attrition would take care of all the sloppy libraries.
I've been working in a similar environment, and once a month (or more) someone comes to me with a crash they don't understand. It usually takes a couple of days to debug (since the easy bugs don't make it to me), and every single one is some kind of undefined behavior (temporary lifetime confusion, dangling reference, callbacks modifying containers during caller iteration, ...)
Your analogy would work if every person having sex got HIV. Buffer overflows are deeply rooted in C, to the point that in the past some standard functions (like gets()) made buffer overflows practically unavoidable.
>It is like going back to insisting you can and should be absolutely safe from HIV by making certain lifestyle choices. In the '80s we knew that the answer was being safer, not being safe.
This is a singularly bad analogy. HIV used to be a death sentence, but now we have effective treatments and prophylactics.
What's the PrEP equivalent for buffer overflows? There isn't one.
Oh, there are lots of PrEP equivalents. For example NX. Or virtual memory more in general. Or static analysis. Etc etc. If you think the modern ecosystem is as memory bug prone as it was in the 80s, you are being incredibly naive.
[1] http://animats.com/papers/languages/safearraysforc43.pdf