This is a long paper and the author has 2 main claims:
1) C's popularity has more to do with the cognitive ease of memory addresses as a conceptual model for inspection and change. The author claims the memory-address mental model overshadows runtime performance.
2) switching to "safe" languages like Java/C#/Rust is not necessary. With no changes to or violations of the existing C Language specification, a new/different implementation (compiler) can add more runtime safety checks, similar to managed languages. An example from the paper:
>Consider unchecked array accesses. Nowhere does C define that array accesses are unchecked. It just happens that implementations don’t check them. This is an implementation norm, not a fact of the language.
Those 2 ideas look orthogonal but he ties them together at the end.
I'll take some poetic license (e.g. a little exaggeration) to reword the author's idea to help spur discussion...
Consider the idea of the Sufficiently Smart Compiler[1] that claims that a "slow" and "high-level" language like Python/Ruby could be theoretically analyzed and compiled to be as fast as C or handcrafted assembly.
In a way, the author is coming from the opposite direction. If you had a "Sufficiently Smart Runtime" for a new C Language compiler implementation, it could (theoretically) do all sorts of extra checks and bookkeeping that wouldn't require any changes to C source code and wouldn't violate the existing C Language standard. (E.g. Imagine a new C runtime that did many checks similar to Valgrind + UBSAN + ASAN + debugger memory fences, etc.)
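To make the idea concrete, here's a hedged sketch of what such a checking runtime might do for just one operation, the unchecked array access from the paper's example. The helper name `checked_load` is invented for illustration; the point is only that the semantics of valid accesses are unchanged, so no C source needs to change.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper a "Sufficiently Smart Runtime" could emit in place
   of a raw a[i] load: valid accesses behave exactly as before, and an
   out-of-bounds access traps instead of being silently undefined. */
int checked_load(const int *a, size_t len, size_t i)
{
    if (i >= len)
        abort();            /* trap, the way a managed runtime would */
    return a[i];
}
```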
Would the program execution be slower? Well, yes, but that's not really an issue, because according to the author's claim #1, what programmers really like about C is the mental ease of accessing memory addresses. The performance is important, but it's a secondary benefit -- according to the author.
Excellent, I think the author would do well to re-frame the question as you have. If nothing else, to put it more clearly into the space of provable compilation.
When I transferred into the "Oak" group that later became the "Java" organization, the team I was on was looking at whether or not you could write an OS in Java sort of in spite of its safety rules. This sort of concept has been revisited by Rust with its safe/unsafe modal operation.
What both of those efforts have in common is that determining safety may be impossible at the construct level but provable if you were to exhaustively search all possible outcomes.
What the paper and your comment add to the discussion is the intriguing idea that you could create a 'safe' backend (say the equivalent of the JVM) as a target for a C compiler. And code that could not be compiled would be flagged for later analysis. Much like VHDL can express hardware that cannot be synthesized, you might end up with a C compiler that could compile code that could not be executed. It could be fun to spend a bit of time poking around that rabbit hole.
Indeed. Even the "obvious" example of the compiler inserting bound checks in the generated code does not work with the well-known method of marking the beginning of a variable-length memory block at the end of a struct using an array of some fixed size, say, 1 (or even 0).
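For readers who haven't run into it, this is the classic "struct hack" the parent is describing, in a minimal sketch (the `msg` type and `msg_new` name are invented here). The declared bound of 1 is deliberately a lie, which is exactly why a naive compiler-inserted bounds check would reject perfectly valid, widely deployed code:

```c
#include <stdlib.h>
#include <string.h>

/* A fixed-size trailing array stands in for a variable-length block.
   The allocation below reserves room past data[1], so indexing beyond
   the declared bound is intentional and correct. */
struct msg {
    size_t len;
    char data[1];           /* really len bytes, see msg_new() */
};

struct msg *msg_new(const char *src, size_t len)
{
    struct msg *m = malloc(sizeof *m + len);  /* extra room past data[1] */
    if (!m) return NULL;
    m->len = len;
    memcpy(m->data, src, len);  /* "out of bounds" per the declared type */
    return m;
}
```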
The problem is that it isn't a new idea. People keep trying it, as shown below. Unfortunately, C wasn't so much designed as it was a modified version of something (BCPL) that was the only collection of features Richards could get to compile on his crappy hardware. It's not designed for easy analysis or safety. So all the attempts are going to hit problems in what legacy code they can support, their performance, or even their effectiveness at reliability/security in a pointer-heavy language. Compare that to Ada, Wirth's stuff, or Modula-3 to find they don't have that problem, or have much less of it, because they were carefully designed, balancing the various tradeoffs. Ada even meets the author's criteria for a safe language with explicit memory representation, despite him saying safe languages don't have that.
To back that up with references, first is a bunch of attempts at safer C's or C-like languages with performance issues. The next two are among the most recent and practical attempts at memory safety for C apps, as far as CompSci goes. The last one is an Ada book that lists, chapter by chapter, each technique its designer used to systematically mitigate bugs or vulnerabilities in systems code.
I'm a long-time C programmer, and I was struck by how clumsy and error-prone any manipulation of C strings turns out to be. It's really hard to look at a mass of strlen/strcpy/memcpy/etc. and see just what is happening. Contrast that with, say, BASIC or JavaScript, where string manipulation is easy, natural, and bug-free.
I'm going to disagree about the mental ease of programming in C, and a large part of that is difficulty in building useful abstractions around the pointer model.
That particular problem (strlen/strcpy/memcpy) comes from the problems of the standard library string functions. It can be solved by creating your own string library. Then string manipulation is easy.
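As a hedged sketch of what "creating your own string library" typically means in practice (the `str_t` type and `str_append` name are invented here, not from any particular library): a counted string whose length and capacity travel with the data, so callers never touch strlen/strcpy directly.

```c
#include <stdlib.h>
#include <string.h>

/* A tiny counted-string type: growth and NUL-termination are handled in
   one place instead of at every call site. */
typedef struct {
    char  *buf;
    size_t len, cap;
} str_t;

int str_append(str_t *s, const char *src)
{
    size_t n = strlen(src);
    if (s->len + n + 1 > s->cap) {
        size_t cap = s->cap ? s->cap * 2 : 16;
        while (cap < s->len + n + 1) cap *= 2;
        char *p = realloc(s->buf, cap);
        if (!p) return -1;   /* caller decides how to handle exhaustion */
        s->buf = p;
        s->cap = cap;
    }
    memcpy(s->buf + s->len, src, n + 1);  /* copies the NUL too */
    s->len += n;
    return 0;
}
```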
That falls over as soon as you integrate with anybody else's C code, including the operating system APIs, and with C string literals :-(
If it was as easy as you say, it would have happened.
And heaven knows I wrote my own string packages, one after the other, and so did everyone else. I eventually abandoned all of them. C's abstraction abilities are simply not good enough to do a decent string encapsulation.
No other language solves this perfectly either, certainly not in a way that interoperates _across_ languages and environments.[1] Which is pretty much the whole point of the article. But what C excels at is the ability to write code which can examine and work with the representation of most string-like objects exported from any environment. The difficulty of doing so is a function of how opaque and complex the alien implementation is.
I gave up on trying to solve strings in C applications a long time ago, too, much as you have. I did so not because I found C too inexpressive, but because I realized that I was trying to shoe-horn too many concepts into a "string". A string is almost by definition the wrong data structure--either too abstract or not abstract enough--for almost everything. Not coincidentally, that was about the same time I stopped abusing regular expressions for parsing data.
[1] Even C++ didn't solve this. We're still in the midst of a std::string ABI compatibility break in the C++ ecosystem. Granted, it's been about 12 years since the last one, but these last fairly long because systems software (i.e. infrastructure software) has a really long tail.
shrug It doesn't fall over. I've done it, the openBSD team has done it. DJB has done it. Maybe something is wrong with your implementation that I can help you with?
OpenBSD takes a fairly minimalist approach, which is vaguely described here: http://www.freebsdforums.org/forums/showthread.php?threadid=... They basically replace the unsafe functions with things that are easier to use. Their idea is that it isn't the format of the C-string that causes security issues (null-terminated string), it's the poorly defined functions (with weird corner cases that are hard to get right). It's worked well for their use cases.
DJB did something similar in qmail, I don't recall the details but you can look at the source code as easily as I can, and it eliminated security problems.
When I'm working in Java, I find that most of my string parsing uses the split() function. This is a pain in C, because even if you had a split() function you'd need to deal with memory allocations. Most of these are solved with a memory pool. In my own library, I also added runtime, grammar-based parsing functionality. So to parse a CSV line you might do something like this:
    char *g = "S -> WORD | WORD , S;"
              "WORD -> [^,];";
    results = parsegram(g, inputString);
Grammar parsing + memory pools makes string parsing in C easier than in Java. The biggest difficulty with this kind of library is that to do it right, you need to be something of a unicode expert, and that's tough.
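Since `parsegram` is the commenter's own library, here is the simpler half of the idea, a `split()` that allocates out of a single pool, sketched in plain C (the function name and ownership convention are invented for illustration): one allocation for the copied text, one for the pointer table, freed together.

```c
#include <stdlib.h>
#include <string.h>

/* split() with pool-style ownership: the caller frees parts[0] (the
   copied text) and parts (the table), and every piece goes with them. */
char **split(const char *s, char sep, size_t *count)
{
    size_t len = strlen(s), n = 1;
    for (const char *p = s; *p; p++)
        if (*p == sep) n++;
    char  *copy  = malloc(len + 1);
    char **parts = malloc(n * sizeof *parts);
    if (!copy || !parts) { free(copy); free(parts); return NULL; }
    memcpy(copy, s, len + 1);
    size_t i = 0;
    parts[i++] = copy;
    for (char *p = copy; *p; p++)
        if (*p == sep) { *p = '\0'; parts[i++] = p + 1; }
    *count = n;
    return parts;           /* free(parts[0]); free(parts); when done */
}
```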
Here's roughly what that would look like using Bernstein's C string library (which was not only used in qmail).
#include "stralloc.h"
...
static stralloc s, t;
...
if (!stralloc_ready(&s, 0)) die_nomem();
if (!stralloc_copys(&s, "hello")) die_nomem();
if (!stralloc_copy(&t, &s)) die_nomem();
if (!stralloc_cats(&t, "world")) die_nomem();
if (!stralloc_cat(&t, &s)) die_nomem();
Yes, that does work. But it's not without problems, not the least of which is that it's just not attractive to look at. For example, concatenating "hello" and "world" allocates memory, when it should instead give you a "helloworld" string literal. In fact, simply initializing `s` with a string literal needlessly allocates memory, and that's antithetical to performance. Calling die_nomem() leaks memory if it does anything but terminate the program. All those tests for memory exhaustion are tedious.
> Even such a simple use case is fraught with major problems:
>
> 1. who allocates needed memory?
>
> 2. who free's it?
That's also a major feature. It allows people to write systems that are resilient in the face of tight memory limitations. It's not cool when a language forces string operations to allocate & duplicate memory willy-nilly.
> 3. can the compiler constant fold cat("hello","world") ? Does the result wind up allocating memory anyway?
I fail to see how that's a major problem. Why are you concatenating string literals? How common is that?
> 4. what about the lack of function overloading to handle the permutations?
I consider lack of overloading to be a feature. Overloading is one of the things that are way too easily abused, and it makes code auditing harder than it needs to be. Please just type out the different function names so I can see exactly what is going to be called when I read the code. Or use the sprintf family of variadic functions.
It's the opposite. I've seen lots of code written in C that pretends to be out-of-memory safe. I've never once seen such a program that actually is out-of-memory safe. Invariably the code paths triggered by malloc returning NULL are never exercised.
With a GC and exceptions you can theoretically be quite resistant to OOM conditions, not that anyone really cares.
> I've never once seen such a program that actually is out-of-memory safe. Invariably the code paths triggered by malloc returning NULL are never exercised.
sqlite takes care to correctly deal with out of memory conditions. It has explicit tests for that code too. See section 3.1, Out-Of-Memory Testing, of [1].
Now I found my first program that actually tests it properly :)
I knew you had to systematically drive the code through every OOM codepath to even have a shot at doing that in an unmanaged language. Sadly a lot of C code is written by people who think:
    if ((ptr = malloc(sizeof(struct foo))) == NULL)
        return -1;
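The sqlite approach mentioned above is worth sketching, because it's the part most C codebases skip: route every allocation through a wrapper whose Nth call can be forced to fail, then rerun the tests for N = 1, 2, 3, ... so every OOM code path actually executes. This is a hedged reconstruction of the general technique, not sqlite's actual code; all names are invented.

```c
#include <stdlib.h>

/* Fault-injecting allocator for OOM testing.  fail_after == -1 means
   never fail; fail_after == N means the (N+1)th call returns NULL. */
static int fail_after = -1;

void set_malloc_failure(int nth) { fail_after = nth; }

void *test_malloc(size_t n)
{
    if (fail_after == 0)
        return NULL;          /* simulated memory exhaustion */
    if (fail_after > 0)
        fail_after--;
    return malloc(n);
}
```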
One of the things with tight memory systems is that you don't use malloc to begin with, if you can avoid it. C gives you the option.
When you're concatenating strings, you already have storage for those strings. Maybe you can re-use that storage. Maybe you have a static buffer. Maybe you have a fixed size buffer on the stack and the stack use is bounded.
A language that forces you into making redundant duplicates onto the heap is terrible in these situations.
And yes there are programs that try to deal with failing mallocs. Again, C gives you the option.
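A minimal sketch of the "reuse storage you already have" point (the function name is invented): concatenation into a caller-provided buffer, stack or static, with an explicit result instead of a hidden heap copy.

```c
#include <stddef.h>
#include <string.h>

/* Concatenate into storage the caller already owns.  No allocation;
   running out of room is an ordinary, checkable return value. */
int cat_into(char *dst, size_t cap, const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    if (la + lb + 1 > cap)
        return -1;            /* caller decides what "out of room" means */
    memcpy(dst, a, la);
    memcpy(dst + la, b, lb + 1);  /* includes the NUL */
    return 0;
}
```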
Very, very few C programs can handle running out of disk space. This includes the operating system(s). Get close to filling up the disk, and try various things.
Just recently, I was having a lot of trouble with Windows Update hanging. I finally noticed that free disk space was low. Freed up more space, and WU started working again.
For fun, try:
#include <stdio.h>
int main() { printf("hello world\n"); return 0; }
and redirect stdout to a file on a device that is full. Amazingly, it succeeds!
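The "success" is mostly stdio buffering: the bytes made it into the buffer, and nobody looks at the stream again. A sketch of what actually detecting the failure looks like (the `write_line` name is invented); the error only surfaces if you flush and check, which almost no hello-world ever does.

```c
#include <stdio.h>

/* Write a line and actually find out whether it reached the device. */
int write_line(FILE *f, const char *s)
{
    if (fputs(s, f) == EOF)
        return -1;
    if (fflush(f) == EOF || ferror(f))
        return -1;            /* this is where the ENOSPC finally shows up */
    return 0;
}
```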
I assume you're referring to OpenBSD here, they didn't use snprintf(). They used asprintf(), which solves the problem of who should allocate (but not who should free).
"That means that we have been going through the tree cleaning out all calls to sprintf(), strcpy(), and strcat(). Instead, these things are being rewritten to use asprintf(), snprintf(), strlcpy(), and strlcat()."
These functions will take care of buffer-size checking, and reallocation if necessary. For cases where you need to interface with pre-existing libraries, you can return a cstring(). Make it a function/macro to enable you to change the struct definition in the future:
#define ktCstr(x) (x)->str
then you can pass it into write() or whatever you need:
... and end up with silent truncation unless you happen to always remember to use only C library functions with explicit length arguments (and which do not assume NUL-terminated strings).
Look, I get that there is a place for C, but string manipulation is absurdly bad and error-prone.
Hi! I can't imagine how you understood what I wrote. I specifically said to not use those C library string functions.
I fully admitted that string manipulation is absurdly bad and error-prone, then built on that by showing a way to make it better. Use ktStrcat() instead of strcat(), then you don't have to worry about truncation. Use ktSprintf() instead of snprintf(), then you don't have to worry about truncation. I wish you had understood.
Yes, I agree. If everyone would just avoid those C stdlib functions everything would be peachy. :)
I was agreeing with you, but just adding caveats. :)
Well, except... some problems surface when interfacing with "things" (libraries, OS'es) written by other people... and there's no escaping those problems, fundamentally. It's C.
Of course UTF-8 was invented with the express purpose of being "C-compatible", but... what happens if you have a string with a NUL in it and you pass that to the POSIX (I think?) printf function as an argument for a "%s" format string? Well, it gets truncated. Did you mean for that to happen, or didn't you? Who knows? That's the problem.
Honestly, I'm not trying to win "internet points" or something. It's just that C, as I'm trying to point out, is a bad language for almost everything that's required of a "user-facing" language these days. Write the thing in C#, Java, O'Caml, Qt[1], or Haskell, or whatever... but please don't think you need to write in a sort of weird approximation of the old PDP.
[1] Yeah, yeah, not a language, but it's at least an ecosystem that seems to be moderately successful.
This problem was actually solved, but almost nobody uses the solution. Safe variants of most of those string, memory, io, wchar, stdlib and misc functions are defined in the C11 standard's Annex K (finally, after 9 years), but nobody is using them; instead people propose to keep using the known-unsafe variants, like the truncating versions with an 'n' -- snprintf rather than the safe variant sprintf_s.
glibc, BSD, Darwin, musl, newlib: nobody cares to implement the safe bounds-checking variants. They rely solely on the compile-time size checks, which fail to check any dynamic boundaries.
Only Microsoft, Android, Cisco and Embarcadero implement the safe libc functions.
I recently took over Cisco's safelibc (MIT licensed) and extended it to more platforms, all the C11 APIs, and an improved testsuite. And boy was I surprised to find so many missing APIs, upstream libc bugs, and wrong APIs everywhere. Flawless were only musl and the BSDs. But musl is lacking with its errno handling and of course has zero C11 Annex K. Only ReactOS has a proper testsuite for their libc. Glibc is somewhat ok, but I still find crashes daily.
No. The major reason not to use it was _FORTIFY_SOURCE, with its compile-time checks for compile-time-known buffer sizes and its accompanying _chk functions.
This leaves out all dynamic buffers.
You cannot mix PTR + LONG args without serious compile-time errors
I don't have any idea how _FORTIFY_SOURCE works, other than that it is GCC-specific and as such has no place in ANSI C.
What I know is that having something like strcpy_s() does not provide any actual safety, because with the prototype "strcpy_s(char * restrict s1, rsize_t s1max, const char * restrict s2)" there is no guarantee that s1max is a valid size for s1.
This is what the _chk functions do. In most cases the compiler knows the compile-time size of s1.
But in dynamic cases the _s functions are far better than the truncating 'n' versions. Read the rationale.
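Since Annex K availability varies by libc, here's the semantic difference sketched by hand (the `my_strcpy_s` name is invented; this mimics but is not the standard's exact behavior): the _s style fails loudly when the destination bound is too small, where the 'n' style silently truncates.

```c
#include <stddef.h>
#include <string.h>

/* Hand-rolled _s-style copy: a too-small bound is a hard error with no
   partial result, instead of a silently truncated string. */
int my_strcpy_s(char *dst, size_t dstmax, const char *src)
{
    if (dst == NULL || src == NULL || dstmax == 0)
        return -1;
    size_t n = strlen(src);
    if (n + 1 > dstmax) {
        dst[0] = '\0';        /* constraint violation: no partial result */
        return -1;
    }
    memcpy(dst, src, n + 1);
    return 0;
}
```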
I think the mental model isn't the issue, it's that the C Standard Library is very anemic. When writing a C application either you're using a big library like APR or GLib or you're rolling your own, and since rolling your own is a pretty big, complicated, and fraught proposition it's no surprise bugs creep in. Furthermore you can't really interop with other libraries if they also rolled their own data structures, because theirs probably aren't like yours. Consequently libraries tend not to do that at all, settling for things like NULL-terminated lists and special, opaque data structures.
I feel like if someone wants to throw C a life vest, they should start with a meaningful standard library that engineers can build on to provide functionality we pretty much consider standard now (HTTP libraries, JSON libraries, database libraries) with a consistent interface.
It's not just the mental ease, it's also the physical typing ease (and in some cases, the possibility).
For example, he points out that to connect C to existing parts of the system (which is the OS and OS level tools), all you have to do is call the functions. If you want to call a C library from a Java program, it's a lot more work. Furthermore, C has the capability of understanding Java structures (although it's awkward), but Java has no way of understanding C structures from within the language. There is no way to model a driver I/O port in Java, but in C there is.
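A small sketch of what "understanding foreign structures" looks like in practice: given raw bytes exported by some other environment, C can reassemble the fields by hand, provided you know the layout. The layout here (a 16-bit id followed by a 16-bit little-endian value) is invented purely for illustration.

```c
#include <stdint.h>

/* Decode a foreign record from raw bytes, byte by byte, without
   depending on this compiler's own struct layout or endianness. */
struct record { uint16_t id; uint16_t value; };

struct record decode(const unsigned char *raw)
{
    struct record r;
    r.id    = (uint16_t)(raw[0] | (raw[1] << 8));   /* little-endian */
    r.value = (uint16_t)(raw[2] | (raw[3] << 8));
    return r;
}
```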
The paper is worth thinking about. If you are creating a language, take interoperability between already existing languages into consideration. JNI is ok, but think how much better it could be if it did auto-marshalling of objects!
I've been out of the C world for a long, long time but it seems to me that anywhere that C's pointer arithmetic and ability to cast pointers to/from other types is objectively appealing, that's going to be one of the cases that a compiler can't understand.
Of course there's always the subjective "everything looks like a nail" usage as well, which makes every problem seem like a pointer problem because you've never tried to think of them as anything other than a pointer problem. I'm sure you could cater to that usage with a proper runtime but really, it doesn't hurt to try new things sometimes...
In my case, nearly 100% of the C code I write is for embedded systems. Casting a hex literal to a pointer type that is a volatile hardware register is better than dropping into asm....
So yes, compilers will always have a hard time understanding device drivers and such unless you turn hardware device concepts into language primitives.
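The cast in question looks roughly like this. On a host we can't touch a real datasheet address, so a static variable stands in for the hardware register here; the address 0x40000000 in the comment is a made-up example, not any particular chip's.

```c
#include <stdint.h>

/* A stand-in for a memory-mapped register.  On real hardware the macro
   would cast the datasheet address instead:
       #define STATUS_REG (*(volatile uint32_t *)0x40000000u)
   volatile tells the compiler every access matters and must not be
   cached or reordered away. */
static volatile uint32_t fake_status;
#define STATUS_REG (*(volatile uint32_t *)&fake_status)

void set_ready(void) { STATUS_REG |= 1u; }
int  is_ready(void)  { return (STATUS_REG & 1u) != 0; }
```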
A more correct thing would probably be to create linker scripts that expose symbols for the registers. It's probably not worth the trouble now but the hypothetical compiler would understand it better.
There still needs to be a description of the underlying hardware behavior somehow. The hardware engineers often give you a somewhat correct Excel sheet or force you to look at the HDL to figure it out.
C's popularity is due to the fact that it is predictable within certain bounds (single thread or limited concurrency).
No GC pauses, no weird runtime crashes due to a strange constructor, no gigantic exception chains, etc.
The only languages in the TIOBE index that can even try to make that claim are: C at #2, C++ (if you subset it) at #3, Objective-C/Swift (#18/#11), Assembly at #14, Ada at #29, and maybe FORTRAN (#35).
That's not a lot of options if you need runtime predictability. Basically C, C(with additions), C(with additions), assembly(hack, spit), Ada (okay), and FORTRAN (God help you).
Even now, that means C or Ada--and the first free Ada compiler was 1992.
Yes, Rust is coming. But it's got a way to go yet.
The idea that C is predictable is in my view a sign of someone who hasn't got to know C really well.
The trends around undefined behaviour will hopefully put a bullet in the head of this idea for good. It's extremely hard to look at C and reason about what an optimising compiler will turn it into.
Malloc is not more predictable than a GC pause. Both malloc and free can take unpredictable amounts of time. If anything it's less predictable because modern GCs at least have pause time targets, but mallocs never do. You just don't notice it because people don't tend to measure malloc latency. In turn that's because malloc pauses only affect memory allocation operations, they don't stop every thread, which is a benefit it's true, but it's not about predictability and more about UI latency.
C not having exceptions doesn't make it more predictable. It just means that if something goes wrong you get a useless and probably corrupted core dump. The number of times I've been able to fix a bug in a piece of managed code given only a stack trace from the end user is huge. The number of times I've been able to fix a bug given "Segmentation fault" with no other info is zero.
> The trends around undefined behaviour will hopefully put a bullet in the head of this idea for good. It's extremely hard to look at C and reason about what an optimising compiler will turn it into.
Sure when you turn on -Oinfinity. Nobody does that in embedded unless they are hard pressed on some metric (RAM size, generally, or CPU flops occasionally).
Overall, though, C is really fairly predictable. Unsigned arithmetic does what you expect--the fact that signed arithmetic doesn't under higher optimizations is a fairly recent phenomenon (and not an uncontroversial one). Variables go where you expect. Pointers act like you expect. Casting and precedence sometimes sneak up on you, but parentheses generally manage that.
Const has issues at the boundary cases. Trying to stuff something into ROM and then telling the rest of the system that "really-no-you-cant-cast-that" can make things tricky with "incompatible pointer" issues.
Floating point arithmetic, though, is just a disaster.
> Malloc is not more predictable than a GC pause.
Ayup. And what's the first thing real-time embedded folks do? Throw out malloc (which is library, not language, but that's pedantic). Real-time-embedded systems tend to allocate all memory statically, up-front. Or they use a custom malloc that they control the behavior of.
> C not having exceptions doesn't make it more predictable. It just means that if something goes wrong you get a useless and probably corrupted core dump.
Predictable and useful are orthogonal.
And the fact that I can't attach to the running state of a crashed program is a failure of TOOLS, not the language. The fact that I can't attach to a system that crashed, examine the state, fix what I need to, and continue is a fault of the people who make C IDEs. There is no reason other than lack of monetary incentive that this cannot be done.
> C's popularity is due to the fact that it is predictable within certain bounds (single thread or limited concurrency).
Your post, and reading a discussion further down about Rust's reference counting, has made me realize something primitive that Rust is getting right--a real move forward--which even those who don't enjoy the default "safety switch" being flipped from C (like me) may agree.
The C model for memory in time and space is so clean for heap data and the function call stack for one thread (plus global registers), but C has no community-understood/concurring model when it comes to concurrency.
Rust, older C++ libraries, C malloc implementations, and others are all alluding to the simplest memory model for multiple threads, which is reference-counted pointers, IMO. Basically, use a separate type of pointer for heap data, where the max size of the heap is divided by whatever binary power of 2^p processors exist.
Rust folks or other languages are welcome to add more ownership semantics or whatever, but the whole family of languages could benefit by this extension to the lingua franca of C.
We may not even need to add a new nominal pointer type to C, just by fiat understand and expect shared, free store objects to always live inside the lowest 1/p th portion of the word address space.
Once upon a time it was common to write non-consing Lisp code precisely in order to get predictable behaviour; I think that it worked pretty well. Non-consing code won't have GC pauses; it won't have weird runtime crashes; and it probably wouldn't have gigantic exception chains unless it needs them.
The fact that there are these other languages with the same properties means that predictability isn't the real reason, right? It's that it also is sparse in its specification and easy to implement a compiler for.
I was more referring to the historical perspective of how C became popular, many have fallen by the wayside. Though there are certainly current alternatives to C besides those on TIOBE.
Also there are real-time extensions to current GC'd languages like Java.
(Though standard C isn't very predictable timing-wise either, or suited to real-time work)
And, with the exception of Ada and Pascal, most of those languages have been dead for at least 20 years--for various good reasons.
And, please do remember that Apple switched away from Pascal when writing its operating systems in spite of an enormous code base. That's pretty damning--apparently C's "undefined behavior" didn't seem to matter.
So, we're back to: the only alternative to C is Ada.
> Though there are certainly current alternatives to C besides those on TIOBE.
Let me make it easy. Give me a list of languages that have been used to build an operating system in a product in the last 20 years. It doesn't have to be Linux, even a small RTOS counts.
I'll start the list:
C family--C, C++, ObjC/Swift
Forth(?)--probably counts as it runs on pretty bare metal
Ada--not sure anybody has used it to build an OS, but I don't debate that they could
Rust--has a feature set of articles about this
Pascal--the original Lisa and Macintosh OS (probably stretching that 20 year limit a bit).
I am happy with your list of languages to write an OS in, maybe add D and Oberon. I'd point out that you can also use managed languages, see MS Singularity, or the various Lisp and Smalltalk operating systems, or the UCSD P-system, etc - there is a list at https://en.m.wikipedia.org/wiki/Language-based_system .
Counting new commercial operating systems is not a useful benchmark as they are very rare, and we already agreed that the alternatives are not popular.
[1] http://wiki.c2.com/?SufficientlySmartCompiler