Hey everyone,
I'd like to point out that the C-- domain is no longer cminusminus.org; the historical site can be found on Norman Ramsey's homepage here: http://www.cs.tufts.edu/~nr/c--/ ! It also has actual reading material / papers!
The cminusminus domain is no longer the real one (though it does have more modern CSS), and it lacks links to all the informative papers!
C-- is very similar overall to LLVM IR. There are crucial differences, but broadly you could think of them as equivalent representations that you can map between fairly directly (albeit that's glossing over some important details).
In fact, a few people have been mulling the idea of writing an LLVM IR frontend that would basically be a C-- variant. LLVM IR has a human-readable format, but it's not quite a programmer-writable format!
C-- is also the final representation in the GHC compiler before code generation (i.e., it feeds the "native" backend, the LLVM backend, and the unregisterised GCC C backend).
There's probably a few other things I could say, but that covers the basics. I'm also involved in GHC dev and have actually done a teeny bit of work on the C-- related bits of the compiler.
C-- is intended for higher-level languages than LLVM is. E.g., the latter still doesn't have a modern infrastructure for interfacing with garbage collectors. Also, C-- is a lot easier to read in textual form than LLVM IR. Finally, the use of C++ and templates makes the code size of LLVM absolutely enormous (a 20MB binary).
That said, so much work goes into LLVM supporting new platforms, optimizations, etc, that it's probably easier to hack around LLVM's limitations than use C--, etc.
You probably don't need LLVM to work with LLVM IR. In fact, I imagine that there could be a market for a "low-level" IR with a more compact implementation, sort of similar to Oberon "slim binaries" or something like that. The fact remains, though, that the "pragmatics" stuff around LLVM IR (GC and other interfaces) still seems a little bit problematic.
I'm a Luddite, though; I like self-hosting native compilers, and I'm not terribly fond of the idea of having to pack a 20MB blob of C++ code with my code either. (There are a lot of apps that could profit from dynamic translation at run-time, but using LLVM for that feels unnecessarily heavyweight. I wish the algorithms and passes from LLVM were available in the form of some high-level DSL that you'd be able to translate into whatever language you use and automatically adapt to whatever data structures you use in your code. It doesn't sound exactly impossible.)
Small is beautiful. How much cache does your phone have? How long does it take to compile a 20MB binary? How long does it take it to load all those symbols into a debugger? How long does it take to process all those relocations if you are linking dynamically? How big will your overall binary be if every component is 20MB?
When I first entered college in 1988 there was a small DOS compiler floating around called C-- back then, which I got from some BBS (yes, a BBS, how antiquated!), probably in 1989. It was a mix of a subset of C and proto-assembly. I have looked for it a few times over the years, and this one isn't it, although it has some similar ideas. It makes me wonder how many other little-known C-- projects there are.
YES! That's the one I was thinking about too. I remember making some simple games in this as a kid and I learned a lot of assembly using this at the time also because, if I recall, you actually referenced registers and such right in with your other code - or something strange like that.
According to this, C-- is still a large part of the Glasgow Haskell Compiler. It looks like (Fig 5.2) code goes into C-- before being translated to LLVM.
This thing is written in ML, not in Haskell, and the GHC docs themselves claim that Cmm (the GHC thingy) "is rather like C--. The syntax is almost C-- (a few constructs are missing), and it is augmented with some macros that are expanded by GHC's code generator". So "GHC's Cmm/C--" sounds rather like an independent reimplementation of a superset of a subset of "this C--".
"As of May 2004, only the Pentium back end is as expressive as a C compiler. The Alpha and Mips back ends can run hello, world. We are working on back ends for PowerPC (Mac OS X), ARM, and IA-64. Let us know what other platforms you are interested in."
I own the cminusminus.com domain. Was planning on using it for a blog, mainly to post horrible c/c++ code snippets I come across. If anyone wants it, let me know.
In case anyone [else] is wondering, RFC 1035 disallows domain names starting or ending with "-", so there is no registering C--.com, alas:
<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
<let-dig-hyp> ::= <let-dig> | "-"
<let-dig> ::= <letter> | <digit>
<letter> ::= any one of the 52 alphabetic characters A
through Z in upper case and a through z in lower case
<digit> ::= any one of the ten digits 0 through 9
I'm involved in GHC dev (and thus, incidentally, C-- dev as it exists in GHC), and I may be spending a lot of time helping improve GHC's code gen over the coming year (GHC being essentially the most widely used C-- compiler on the planet).
I felt the same way about "C++": postincrement returns C. This seems more like it should be a predecrement (--C), because it does not return C, but rather assembly.
EDIT: changed, I said decrement and meant increment.
Predecrement is fairly unidiomatic in C, much less commonly used than in C++. Typically it's only really used if you're doing something tricky that actually depends on the semantics of predecrementing, usually in an array reference of the x[--i] variety. It's idiomatic in C++ I'd guess because of operator overloading: postdecrement/postincrement may produce large unused object copies that not all compilers will optimize away, so using the pre- versions by default has become idiomatic in that community.
It's actually pretty idiomatic in optimised C, for the same sort of reason (when all extra instructions count), e.g., this sort of loop:
for (i = LEN - 1; i >= 0; --i) { ... }
I've carried this into a lot of other languages, as it's equally readable to the alternatives, when ordering isn't important. In (mainly older) optimised code where ordering is important, you'll often still see this, with a second incrementing variable so that the exit condition retains the compare to zero (avoiding a variable comparison); although that part is separate from the use of the pre-decrement.
I'm not sure if either of these make much difference with modern CPUs and compilers. Certainly not worth worrying about for the most part.
It's untrue that this idiom started with C++. This style is preferred in K&R, which predates optimising compilers. Since then its use among C programmers has probably been force of habit, but it is definitely idiomatic. Its use in K&R makes it about as idiomatic as it is possible to be.
That isn't the style used in K&R, though. When K&R introduces a for loop for the first time, in section 3.5, it uses postincrement in the third clause, and sticks to this style throughout.
They do also use predecrement/preincrement, but only in assignments or comparisons, where it actually semantically matters that the increment/decrement is "pre":
while (*--np == '/')
    *np = '\0';
In contexts where it doesn't matter, like the 3rd clause of a for loop, or just incrementing a variable as a standalone operation, they always default to x++.
I just flicked through my copy of K&R (second edition), and at the beginning it clearly prefers preincrement. The introduction of for loops comes before that of ++, and the first for loop in the book is actually
for (fahr = 0; fahr <= 300; fahr = fahr + 20)
...
Each for loop after the introduction of ++ uses preincrement (where applicable) for a while, and then the style shifts to postincrement.
EDIT: specifically, increments are deliberately prefix ("For the moment we will stick with the prefix form") until the full introduction of increment and decrement operators in section 2.8, after which they are postfix.
The EDG C/C++ frontend (used by a lot of commercial C++ compilers) still supports this, and Comeau's compiler still relies on this for code generation. EDG also implements a different template instantiation model than other C++ compilers, partially so that it can function with a C++-agnostic linker.
I think it's part of the quaint appeal of the article that C-- only supports i386, but the way the world moved did not require C--: the FSF-supported GCC supports something like 50 architectures, plus many more that aren't officially FSF-supported. And GCC is hardly the only C compiler out there. The world just didn't move in a direction that directly required C--.
Now, as for current ideas and projects similar in concept to C--, that is interesting to think about.
Maybe the startup lesson is that sometimes, if you try to interpose yourself as a middleman, even if you do a good job of it and appear to be a good idea, it just doesn't work. It would be interesting to dissect the C-- experience and figure out why.
GCC did already exist when the C-- project was started, so that part didn't really change. The C-- team's hypothesis was that some kinds of languages (especially functional languages) would benefit from a cross-platform layer that is a better compilation target than C, partly through being a bit more low-level. The two previously popular routes were native-code compilers and compile-via-C compilers. Native-code compilers have the disadvantage that porting and maintaining them on N architectures is significant work, and compile-via-C compilers have the disadvantage that C isn't a very convenient compilation target for many language features, especially for efficiently compiling some common FP language features. The hope was that C-- would be a nice middle ground: portable like C, but nicer than C as a target.
Imo, that hypothesis did have some legs, but people are now instead usually using LLVM in various ways to accomplish it. LLVM's intermediate representation isn't really properly cross-platform, but it can be sort of hacked to be used for that purpose.
I'm not really a language implementor, more of a spectator of language implementations, so someone with more experience would give a better answer.
My impression is that the GHC people, at least, still think that C-- is a nicer IR than LLVM-IR for their purposes. But they are slowly moving more things to LLVM anyway, because the LLVM project as a project has, in the meantime, built up a lot more momentum and infrastructure. In the early 2000s this wasn't obvious, but in 2013 it's clear that LLVM has an ecosystem, institutional support, resources to maintain ports, etc., while C-- didn't manage to get the same traction.
From my personal experience, LLVM is tied a bit too closely to the C ABI, which makes it difficult to implement some FP features cleanly. One example is forgoing C's stack discipline to implement double-barreled continuations.
It looks like OCaml's C-- predates this C--, but has had an influence on it.
From Xavier Leroy, one of the lead OCaml developers [1]:
I think I'm the one who coined the name "C--" to refer to a low-level,
weakly-typed intermediate code with operations corresponding roughly
to machine instructions, and minimal support for exact garbage
collection and exceptions. See my POPL 1992 paper describing the
experimental Gallium compiler. Such an intermediate code is still in
use in the ocamlopt compiler.
I had many interesting discussions with Simon PJ and Norman Ramsey
when they started to design their intermediate language. Simon liked
the name "C--" and kindly asked permission to re-use the name.
However, C-- is more general than the intermediate code used by
ocamlopt, since it is designed to accommodate the needs of many source
languages, and present a clean, abstract interface to the GC and
run-time system. The ocamlopt intermediate code is somewhat
specialized for the needs of Caml and for the particular GC we use.
RPython, the Python subset that PyPy is written in, is compiled to C. (That said, it is a fairly restrictive subset, and the compiler is very much designed for interpreters and nothing else.)
When compiling a language you have to compile it "to" somewhere. You could pick assembly, but assembly is so low level that it changes bit by bit depending on the processor. You want a stable target that's "low level" enough to let you optimize things for the machine you'll be working with, but not so low level as to be a moving target.
You could pick C, but C is actually still fairly complex and has lots of undefined behavior. You could pick JVM, but then you'll pick up the entirety of the JVM architecture which is quite large and may include many things you're not interested in.
C-- is another choice. It's decidedly lower-level than C (and thus far lower than the JVM), higher-level than assembly, and was crafted, as far as I know, under the deep influence of the problem of compiling the pure functional language Haskell (or, more specifically, its System FC-style and STG intermediate languages).
In the meantime, LLVM took a similar place in this hierarchy and has probably taken off much more than C-- has. GHC, a Haskell compiler, is in fact moving its compilation pathway that way.
Cool! The first parser/compiler I wrote was for C-- (the version that is a small subset of C, not this one). I hadn't even heard a mention of the different C-- languages for a few years now.
Say you're writing a compiler for a language in Haskell, and you want to generate machine code rather than having it be interpreted. Is C-- a natural choice on this platform? Or might LLVM be a better choice?
Relatedly: I have a few toy C-- snippets you can compile and benchmark using GHC, in a talk I gave a few months ago: https://bitbucket.org/carter/who-ya-gonna-call-talk-may-2013... https://vimeo.com/69025829
I should also add that C-- in GHC <= 7.6 doesn't have function arguments, but in GHC HEAD / 7.7 and soon 7.8, you can have nice function args in the C-- functions. See https://github.com/ghc/ghc/blob/master/rts/PrimOps.cmm for GHC HEAD examples, vs https://github.com/ghc/ghc/blob/ghc-7.6/rts/PrimOps.cmm for the old style.