I was hoping for a more thorough survey of the compilers and their history. Notably missing is any mention of Intel’s compiler, the Cray compiler, or the competition between Borland and Watcom C compilers. There was a big span of time when GCC wasn’t the only game in town! And also no mention of the more fringe compilers like TCC or the newer cproc. I understand it was meant to be brief, but this felt a bit too brief for my taste.
I was definitely hoping for a look at modern compilers. This article was written last month, yet its history ends with the release of LLVM. There's quite a lot of development in small C compilers lately!
- TinyCC, SCC (Simple C Compiler) and Kefir are all fairly serious projects in active development.
- QBE is a new optimizing backend much simpler than LLVM; cproc and cparser are two of the C compilers that target it, in addition to its own minic.
- There's the bootstrapping work of stage0, M2-Planet, and mescc.
- chibicc is an educational C compiler based on the author's earlier 8cc compiler. The author is writing a book about it.
- lacc is another simple compiler that works well, although development appears to have stalled.
I think a lot of these projects are inspired by the problem that GCC and Clang/LLVM are now gigantic C++ projects. A pure C system like OpenBSD or Linux/musl/BusyBox ought to be able to compile itself without relying on C++. We really need a production quality open source C compiler that is actually written in C. I'm hopeful one of these compilers can achieve that someday.
- Though freeware and not FOSS, there is also Pelles C over in Windows land, with almost full C23 support.
- For small 8-bit systems, SDCC is an excellent choice, supporting even C23 features! Its lead maintainer is also a committee member with really useful contributions to the standard.
- I have heard the RISC OS compiler is pretty cool and supports modern standards. That one uses the Norcroft frontend.
I agree with you that we need a production-level C compiler written in C. But that is not a simple task, and the C community nowadays prefers to engage in infighting over pedantic issues or Rust rather than working together. A simple example is the lack of a modern library ecosystem, while everyone and their mother has their own custom build system. Even though C is sold as a performant language, there isn't a single parallelism library like oneTBB, Kokkos, or HPX over in C++ land. Don't get me started on vendors not offering good standards support (Microsoft, macOS libc, OpenBSD libc)...
One correction though, cparser uses libfirm as a backend, not qbe. Also the author of chibicc has stopped writing that book AFAIK.
Bonus non-C-based entries:
- The Zig community is working on arocc. Judging by the awesomeness of zig cc, this is really good news.
- Nvidia offers their EDG-based nvc with OpenACC support for free these days, which is cool.
"This work is based on the observation that in cases where returning a stack-allocated value is desired, the value’s lifetime is typically still bounded, it just needs to live a little bit longer than the procedure that allocated it. So, what would happen if we just don’t pop the stack and delay it until one of the callers resets the stack, popping multiple frames at once? It turns out that this surprisingly simple idea can be made to work rather well."
> QBE is a new optimizing backend much simpler than LLVM; cproc and cparser are two of the C compilers that target it, in addition to its own minic.
I thought cparser targeted libFirm. That's what their GitHub page says [0].
"It acts as a frontend to the libFirm intermediate representation library."
> We really need a production quality open source C compiler that is actually written in C.
I honestly think cproc or cparser are almost there already. For cproc, you mostly just need to improve the quality of code optimization, and it's really QBE you'd need to change for that. For example, unnecessary multiplications by powers of 2 could become left shifts (edit: IDK if it's cproc or QBE that's responsible for this, actually), and instruction selection could be improved so that subtraction is always something like "sub rax, rdi" rather than "neg rdi / add rax, rdi" [1]. It also doesn't take advantage of x86-64 addressing; e.g. it will do a separate addition and then a multiplication instead of "mov eax, [rdi + rsi * 8]".
For cparser, I notice slightly higher quality codegen; libFirm just needs more architecture support (e.g. AMD64 support appears to work for me, but it's labeled as experimental).
You have the cproc compiler, which does use the QBE backend. It generates much faster code than tcc, since there are some basic optimization passes. On bz2 compression, with crude and basic testing, I got ~70% of the speed of gcc 13.1. tcc code is really slow; I am thinking of a QBE backend for tcc.
I would use that everywhere instead of the grotesquely and absurdly massive and complex gcc (and written in that horrible C++!). I would rewrite some code hot spots in assembly. But it means those extra ~30% of performance are acutely expensive; at the least they could have been careful to keep gcc written in simple and plain C99 (with benign bits of C11) to reduce the technical cost. Yeah, switching gcc to C++ is one of the biggest mistakes in open source software ever (hopefully that mistake is not related to B. Gates's donations to the MIT Media Lab revealed by the pedophile Epstein's files... if all that is true, though; not to mention that would explain the steady ensh*tification of GNU software).
The problem is Linux, which requires bazillions of gcc extensions to even compile correct kernel code nowadays. You can see clang (which is no better, actually even worse) playing catch-up with gcc for all the extension creep the kernel is getting.
All that stinks of corpo-backed planned obsolescence, aka some kind of toxic worldwide scam.
> Standard C library: newlib (850k lines), glibc (1.2M lines), musl (82k lines) or uClibC-ng (251k lines)
A big shout out to the musl developers. I was debugging a program's use of posix_spawn and went down a rabbit hole of looking at all the libc implementations from the BSDs to the ones mentioned above, and found that musl was by far the most elegant and easiest to follow. It's the sort of code we all wish we could write the first time.
FTA: “After a new compiler was written for this new language, it was renamed to C, and the rest is history. This was a significant breakthrough, as, until then, kernels were written in Assembly.”
Among C crowds there is this urban myth that before C came to be, no one whatsoever was using high level languages to write operating systems, all poor souls cranking out assembly code.
I took a look at the original K&R book not long ago, and was surprised how sparse of a reference it was.
For non-Unix platforms, the only header mentioned is <stdio.h>, but it does include such functions/macros as isupper, system, calloc, cfree, exit, and _exit.
For Unix, there is some mention of other headers, but many functions don't seem to be associated with headers at all since they return `int` and as such don't need a declaration. Headers are mostly for constants, typedefs, and structs.
I also noticed that: long float is equivalent to double, unsigned is only allowed on plain int, and scanf %h is a top-level specifier instead of a modifier.
There's one other thing that's different in those early Cs: there was no union, and structs worked differently. Effectively, a field in a struct was just an offset, and all field names were part of one global namespace of offsets, so you could use any field name with any struct pointer. That is how you did the equivalent of unions: the v6/v7 Unix kernels used it for block device structures, which all began with the same first fields (a block queue) followed by per-device stuff.
K&R C was almost perfect, with its simplicity, minimalist elegance, and power. It would have been perfect if it was not for the addition of “->” which was unnecessary; it was done perhaps to simplify the compiler.
K&R C didn't have a `void` return type, nor did it even check the number of function arguments. Later C versions may have been bloated, but K&R's simplicity was often at odds with a minimal level of protection against human stupidity. (Or you may have a different definition of K&R C; the name is normally used for what was described in the first edition of TCPL in 1978.)
It's weird to see you downvoted for this mere opinion. I agree - C has collected a LOT of garbage over the years. C++ is even worse and if you've ever used Smalltalk or Common Lisp, it barely even seems like OO.
"->" is taken from the IBM PL/I language, which was an important source of various syntax elements of C (the other 2 main sources being BCPL and Algol 68).
It was added because the indirection operator "*" has the wrong position in C: it is prefix instead of postfix, unlike in the languages where it first appeared, i.e. Euler and Pascal (Pascal inherited it from Euler; both are Wirth languages).
Had "*" been postfix, ".*" could have always been used instead of "->". As it is, "->" eliminates a pair of parentheses that is needed when the indirection is followed by other postfix operators, like array indexing or structure member selection, which happens very frequently.
IIRC, in one of his papers about the history of C, Dennis Ritchie recognized several mistakes in the design of C, and one of them was the prefix "*", besides others like the precedence of "&" and "|".
In PL/I "->" was the only indirection operator, while CPL had no explicit indirection, the indirection was implicit whenever a pointer was used in an expression, like with the C++ references. The CPL pointers (which were named references, "pointer" is a term introduced later by PL/I) could use a special assignment operator, which assigned the address of an expression, not its value.
While "->" is somewhat redundant, its sole purpose being to eliminate some parentheses, it is no more redundant than e.g. the existence of both "for" and "while" or of "do ... while" when "break" exists, or of the comma operator, or of several other redundant features of C, which offer alternative ways of writing the same thing, with slightly shorter variants for certain use cases. While C has much less redundancy than Perl (which is proud of it), it still has much more than I consider good for a programming language.
When structures were added to C, they realized that the wrong position of "*" requires extra parentheses, but by then there must have been plenty of code written in late B and early C that used "*". So instead of redefining "*", they chose the easier path of taking structures from PL/I together with all the syntax elements used for them in PL/I, i.e. the keyword "struct" and the operators "." and "->".
Hey, array indexing is postfix, so let's just use the old "arrays decay to pointers" staple, but obviously in reverse:
    #include <stdio.h>

    typedef struct {
        int a;
        int b;
    } foo;

    int main(void) {
        foo one = { 1, 2 };
        foo * const p = &one;
        p[0].b = 13; /* Look ma, no -> ! */
        printf("Now we have { %d, %d }\n", one.a, one.b);
        return 0;
    }
Of course one could go all nasty and flip the base and index:
0[p].b = 13;
But hey that's just being silly.
In all seriousness, thanks a lot for the historic review of C syntax! Very interesting, as someone who really likes C enough to (ab)use it whenever possible.
I know that a->b is equivalent to (*a).b, but still the difference makes sense to my brain which first learned assembly. It's much "lighter-feeling", not having to evaluate the "entire" structure just to get to a single field. :) Weird, but I really like this little piece of syntactic sugar. Yum.
Yes, I believe there was a variant where struct members had global identifier scope. This means that in order to compile x->mem or x.mem, you don't need to know the type of x at all: mem alone tells you the offset to use. In this context, the distinction between -> and . matters because you don't have the type to tell whether an indirection is needed.
It's surprising that unified -> and . wasn't widely added as a compiler extension. Most compilers already do the adjustment as part of error recovery, and most of them accept much more dubious type errors by default (things that never were part of any C standard). GDB handles it just fine, though.
Personally I like the distinction between them, as it gives me context for the object being a pointer, which helps with "at-a-glance" reading of clean code.
I discovered long ago that the -> . dichotomy makes it essentially impossible to refactor code by switching a type to/from a value/ref type. It is a major factor in C being a very difficult language to refactor code in. Long lived C code still retains its original design mistakes because of that.
Haha, even returning a structure still seems a bit fancy to me. I remember at the time, thinking, how can they return all that data and went straight to whatever '-S' was in those days (-E?) to look at how it worked in assembly.
It's worth noting that Stallman wanted $150 (in 1987 dollars, so over $400 today) for a tape containing a copy of GCC, clearly showing the difference between "free as in speech" and "free as in beer".
The money for the tape presumably covered the cost of the tape and of access to a computer and a tape drive and the time spent by someone to write the tape and handle packing and shipping, with a small markup.
From the GPL v2 licence:
"You may charge a fee for the physical act of transferring a copy,.."
That ($400 in today's value) isn't an unfair price for doing all the work necessary to ship out a tape. It's actually quite cheap. If I were to do that, using the tapes we had back then, I would charge more (I would have to take time from my working hours to go through that process).