C-- (cminusminus.org)
553 points by msvan on Oct 27, 2013 | hide | past | favorite | 72 comments


Hey everyone, I'd like to point out that the C-- domain is no longer cminusminus.org; the historical site can be found on Norman Ramsey's homepage here: http://www.cs.tufts.edu/~nr/c--/ It also has actual reading material / papers!

The cminusminus.org domain is no longer the canonical one (though it has more modern CSS), and it lacks links to all the informative papers!

C-- is very similar overall to LLVM IR. There are crucial differences, but you could think of them as roughly equivalent representations that you can map between (albeit that's glossing over some important details).

In fact, a few people have been mulling over the idea of writing an LLVM IR frontend that would basically be a C-- variant. LLVM IR has a human-readable format, but it's not quite a programmer-writable format!

C-- is also the final representation in the GHC compiler before code generation (i.e. before the "native" backend, the LLVM backend, or the unregisterised GCC C backend).

There are probably a few other things I could say, but that covers the basics. I'm also involved in GHC dev and have actually done a teeny bit of work on the C-- related bits of the compiler.

Relatedly: I have a few toy C-- snippets you can compile and benchmark using GHC, in a talk I gave a few months ago: https://bitbucket.org/carter/who-ya-gonna-call-talk-may-2013... https://vimeo.com/69025829

I should also add that C-- in GHC <= 7.6 doesn't support function arguments, but in GHC HEAD / 7.7, and soon 7.8, you can have nice function args in C-- functions. See https://github.com/ghc/ghc/blob/master/rts/PrimOps.cmm for GHC HEAD examples, vs https://github.com/ghc/ghc/blob/ghc-7.6/rts/PrimOps.cmm for the old style.


Code examples: http://www.cs.tufts.edu/~nr/c--/download/c--exn.pdf

Slides: http://www.cs.tufts.edu/~nr/c--/download/c--exnslides.ps.gz

Audio: http://wino.eecs.harvard.edu:8080/ramgen/nr-pldi00.rm

The Manual: http://www.cs.tufts.edu/~nr/c--/extern/man2.pdf

The manual contains the specification and a few code examples. It looks easy to learn, but it's a little different from other languages.


Could someone enlighten me as to what the advantage of this is over LLVM IR?

Edit: Ok, I've found the following SO thread: http://stackoverflow.com/questions/3891513/how-does-c-compar...


If you want to know why it exists despite LLVM-IR existing, one answer is that at the time C-- was launched, LLVM-IR didn't yet exist.

As for how they compare, the answers to this question, on why Haskell didn't use LLVM (in 2009, it didn't) get into it a bit: http://stackoverflow.com/questions/815998/llvm-vs-c-how-can-...


C-- is intended for higher-level languages than LLVM is. For example, the latter still doesn't have modern infrastructure for interfacing with garbage collectors. Also, C-- is a lot easier to read in textual form than LLVM IR. Finally, the use of C++ and templates makes the code size of LLVM absolutely enormous (a 20MB binary).

That said, so much work goes into LLVM supporting new platforms, optimizations, etc, that it's probably easier to hack around LLVM's limitations than use C--, etc.


You probably don't need LLVM to work with LLVM-IR. In fact, I imagine that there could be a market for a "low-level" IR with a more compact implementation, sort of similar to Oberon "slim binaries" or something like that. The fact remains, though, that the "pragmatics" around LLVM-IR (GC and other interfaces) still seems a little bit problematic.

I'm a Luddite, though; I like self-hosting native compilers, and I'm not terribly fond of the idea of having to pack a 20MB blob of C++ code with my code either. (There are a lot of apps that could profit from dynamic translation at run-time, but using LLVM for that feels unnecessarily heavyweight. I wish the algorithms and passes from LLVM were available in the form of some high-level DSL that you could translate into whatever language you use and automatically adapt to whatever data structures your code uses. It doesn't sound exactly impossible.)


llvm-general, and even more so llvm-general-pure, gives you a starter kit for building something like that: http://hackage.haskell.org/package/llvm-general http://hackage.haskell.org/package/llvm-general-pure


> Finally, [C-- is better than LLVM because] its use of C++ and templates makes the code size of LLVM absolutely enormous (20MB binary).

In a world where my phone has 2GB of RAM, I don't understand why 20MB for an optimizing compiler is in any way onerous or unreasonable.


Small is beautiful. How much cache does your phone have? How long does it take to compile a 20MB binary? How long does it take it to load all those symbols into a debugger? How long does it take to process all those relocations if you are linking dynamically? How big will your overall binary be if every component is 20MB?


Downloading 20 MB here and 20 MB there adds up pretty quickly if you're using a cellphone data connection like I am.


When I first entered college in 1988 there was a small DOS compiler floating around called C-- back then, which I got from some BBS (yes, a BBS, how antiquated!), probably in 1989. It was a mix of a subset of C and proto-assembly. I have looked for it a few times over the years, and this one isn't it, although it has some similar ideas. It makes me wonder how many other little-known C-- projects there are.



YES! That's the one I was thinking about too. I remember making some simple games in this as a kid and I learned a lot of assembly using this at the time also because, if I recall, you actually referenced registers and such right in with your other code - or something strange like that.


Thank you, thank you, thank you!


Wow - I used this one too (vividly remember the "fire" effect demo) and have looked for it a few times to no avail. Nice find :-)


According to this, C-- is still a large part of the Glasgow Haskell Compiler. It looks like (Fig 5.2) code goes into C-- before being translated to LLVM.

http://www.aosabook.org/en/ghc.html


Uh, I believe that the GHC C-- (Cmm) is one of the dozen different C--es that have nothing in common with this project save for the name.


Given that Simon Peyton Jones is listed as a co-author on several of the papers, I find that a bit too coincidental to be true.


This thing is written in ML, not in Haskell, and the GHC docs themselves claim that Cmm (the GHC thingy) "is rather like C--. The syntax is almost C-- (a few constructs are missing), and it is augmented with some macros that are expanded by GHC's code generator". So "GHC's Cmm/C--" sounds rather like an independent reimplementation of a superset of a subset of "this C--".


It literally can't be an "independent reimplementation" if one person played a major role in both of them.


Domain specific dialect?


The GHC C-- is a subset of the full Norman Ramsey C--.


"As of May 2004, only the Pentium back end is as expressive as a C compiler. The Alpha and Mips back ends can run hello, world. We are working on back ends for PowerPC (Mac OS X), ARM, and IA-64. Let us know what other platforms you are interested in."


I own the cminusminus.com domain. Was planning on using it for a blog, mainly to post horrible c/c++ code snippets I come across. If anyone wants it, let me know.


In case anyone [else] is wondering, RFC 1035 disallows domain names starting or ending with "-", so there is no registering C--.com, alas:

    <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]

    <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>

    <let-dig-hyp> ::= <let-dig> | "-"

    <let-dig> ::= <letter> | <digit>

    <letter> ::= any one of the 52 alphabetic characters A
      through Z in upper case and a through z in lower case

    <digit> ::= any one of the ten digits 0 through 9


Hello! Very very cool.

I'm involved in GHC dev (and thus incidentally C-- dev as it exists in GHC). And I may be spending a lot of time helping improve GHC's code gen over the coming year (GHC being essentially the most widely used C-- compiler on the planet).

Your remark intrigues me!


Email me joshliptzin at gmail dot com


According to this infographic, a C-- was an influence on JavaScript.

http://www.georgehernandez.com/h/xComputers/Programming/Medi...

If this has any truth (perhaps a different c--?) I'd like to know which one is being referred to.


> perhaps a different c--?

Yes, the graph talks about http://sourceforge.net/projects/cmmscript/ not the C-- used in/extracted from GHC.

There's at least one other C-- used in the OCaml compiler.


That's almost certainly a different one. It's described as a scripting language that appeared in '92. This C-- only came around in '97...

It's quite an obvious name, I'd expect there to be many more called C--!


That must be the Cmm we did in the early 90s. That engine became the basis for (probably) most JavaScript uses up until 10 years ago. I wrote too much about it here: http://www.brent-noorda.com/nombas/history/HistoryOfNombas.h...


It mentions C-- as a scripting language; I'd guess they're different things.


Another take on portable assembly languages is Dan Bernstein's qhasm: http://cr.yp.to/qhasm.html

An overview here: http://cr.yp.to/qhasm/20050129-portable.txt


The originality (and practicality!) of choosing the name "C - -" leaves me speechless.


I felt the same way about "C++": postincrement returns C. This name seems more like it should be a predecrement (--C), because it does not return C, but rather assembly.

EDIT: changed, I said decrement and meant increment.


Predecrement is fairly unidiomatic in C, much less commonly used than in C++. Typically it's only really used if you're doing something tricky that actually depends on the semantics of predecrementing, usually in an array reference of the x[--i] variety. It's idiomatic in C++ I'd guess because of operator overloading: postdecrement/postincrement may produce large unused object copies that not all compilers will optimize away, so using the pre- versions by default has become idiomatic in that community.


It's actually pretty idiomatic in optimised C, for the same sort of reason (when all extra instructions count), e.g., this sort of loop:

    for (i = LEN-1; i >= 0; --i) { ... }

I've carried this into a lot of other languages, as it's equally readable to the alternatives, when ordering isn't important. In (mainly older) optimised code where ordering is important, you'll often still see this, with a second incrementing variable so that the exit condition retains the compare to zero (avoiding a variable comparison); although that part is separate from the use of the pre-decrement.

I'm not sure if either of these make much difference with modern CPUs and compilers. Certainly not worth worrying about for the most part.

It's untrue that this idiom started with C++. This style is preferred by K&R, which predates optimising compilers. Since then, its use among C programmers has probably been force of habit, but it is definitely idiomatic. Its use in K&R makes it about as idiomatic as it is possible to be.


That isn't the style used in K&R, though. When K&R introduces a for loop for the first time, in section 3.5, it does it like so, and sticks to this style throughout:

    for (i = 0; i < n; i++)
        ...
This is the kind of code I run across most commonly (almost exclusively) in pre-1990s C. It's also the style used in the old Unix sources. For example, take a look at the source code to 'nohup' or 'mount' from 5th Edition Unix, 1974: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V5/usr/source/s... http://minnie.tuhs.org/cgi-bin/utree.pl?file=V5/usr/source/s...

They do also use predecrement/preincrement, but only in assignments or comparisons, where it actually semantically matters that the increment/decrement is "pre":

    while(*--np == '/')
            *np = '\0';
In contexts where it doesn't matter, like the 3rd clause of a for loop, or just incrementing a variable as a standalone operation, they always default to x++.


I just flicked through my copy of K&R (second edition), and at the beginning it clearly prefers preincrement. The introduction of for loops comes before that of ++, and the first for loop in the book is actually

    for (fahr = 0; fahr <= 300; fahr = fahr + 20)
        ...
Each for loop after the introduction of ++ uses preincrement (where applicable) for a while, and then the style shifts to postincrement.

EDIT: specifically, increments are deliberately prefix ("For the moment we will stick with the prefix form") until the full introduction of increment and decrement operators in section 2.8, after which they are postfix.


Well, the really early C++ compilers would output C code and daisy chain on a C compiler.

That trend didn't last very long, though.


The EDG C/C++ frontend (used by a lot of commercial C++ compilers) still supports this, and Comeau's compiler still relies on this for code generation. EDG also implements a different template instantiation model than other C++ compilers, partially so that it can function with a C++-agnostic linker.


I notice that whatever CMS they're using for the website seems to be turning double hyphens (--) into dashes (–), which is unfortunate.


I'm glad I'm not the only one that noticed this.


Actually, it turned out to be totally impractical: you can't register the domain c--.org, and you can't usefully Google for "C--". Live and learn.


I must be missing something since I see the line "The specification is available as DVI, PostScript, or PDF.", but cannot find any download link.


I think you'll have better luck with this page: http://www.cs.tufts.edu/~nr/c--/


This is pretty old. I don't know if anyone besides the GHC team uses it?


I think it's part of the quaint appeal of the article that C-- only supports i386. The world just didn't move in a direction that directly required C--: FSF-supported GCC supports something like 50 archs (and many more not officially FSF-supported), and GCC is hardly the only C compiler out there.

Now, as for current ideas and projects similar in concept to C--: that is interesting to think about.

Maybe the startup lesson is that sometimes, if you try to insert yourself as a middleman, even if you do a good job of it and appear to be a good idea, it just doesn't work. It would be interesting to dissect the C-- experience and figure out why.


GCC did already exist when the C-- project was started, so that part didn't really change. The C-- team's hypothesis was that some kinds of languages (especially functional languages) would benefit from a cross-platform layer that is a better compilation target than C, partly through being a bit more low-level. The previous two popular routes were native-code compilers and compile-via-C compilers. Native-code compilers have the disadvantage that porting and maintaining them on N architectures is significant work, and compile-via-C compilers have the disadvantage that C isn't a very convenient compilation target for many language features, especially for efficiently compiling some common FP language features. The hope was that C-- would be a nice middle ground between compiling via C and compiling to native code: a target nicer than C, without the per-architecture work of going native.

IMO, that hypothesis did have some legs, but people now usually use LLVM in various ways to accomplish it. LLVM's intermediate representation isn't really properly cross-platform, but it can be sort of hacked into serving that purpose.


Do you think LLVM IR covers this ground adequately, despite not being cross-platform?


I'm not really a language implementor, more of a spectator of language implementations, so someone with more experience would give a better answer.

My impression is that the GHC people, at least, still think that C-- is a nicer IR than LLVM-IR for their purposes. But they are slowly moving more things to LLVM anyway, because the LLVM project as a project has, in the meantime, built up a lot more momentum and infrastructure. In the early 2000s this wasn't obvious, but in 2013 it's clear that LLVM has an ecosystem, institutional support, resources to maintain ports, etc., while C-- didn't manage to get the same traction.


From my personal experience, LLVM is tied a bit too closely to the C ABI, which makes it difficult to implement some FP features cleanly. One example is forgoing C's stack discipline to implement double-barreled continuations.


OCaml also internally uses an intermediate code representation called C-- (or Cmm). I do not know if the two have any relationship.


It looks like OCaml's C-- predates this C--, and had an influence on it.

From Xavier Leroy, one of the lead OCaml developers [1]:

    I think I'm the one who coined the name "C--" to refer to a low-level,
    weakly-typed intermediate code with operations corresponding roughly
    to machine instructions, and minimal support for exact garbage
    collection and exceptions.  See my POPL 1992 paper describing the
    experimental Gallium compiler.  Such an intermediate code is still in
    use in the ocamlopt compiler.

    I had many interesting discussions with Simon PJ and Norman Ramsey
    when they started to design their intermediate language.  Simon liked
    the name "C--" and kindly asked permission to re-use the name.

    However, C-- is more general than the intermediate code used by
    ocamlopt, since it is designed to accommodate the needs of many source
    languages, and present a clean, abstract interface to the GC and
    run-time system.  The ocamlopt intermediate code is somewhat
    specialized for the needs of Caml and for the particular GC we use.
[1] http://article.gmane.org/gmane.comp.lang.caml.inria/9436/


Wow, I was surprised by the content of the website. I think this is how a website should be.

First they are talking about the problem and then they present the solution.

I also like the words that are marked bold.

This is how interaction design should be done (imho).


For another take on this area, I recommend reading about Pillar from Intel. http://dl.acm.org/citation.cfm?id=1433063


This seems really awesome, but I have no idea what it does.

I know that Python compiles to C and that Clojure compiles to JVM (or even to JavaScript).

My cartoon:

  scripting lang --> programming lang --> native code
Honestly, I have never experimented with Assembly language much except for COOL (http://en.wikipedia.org/wiki/Cool_(programming_language)) and TOY (http://introcs.cs.princeton.edu/java/52toy/).


Python does not compile to C (well, there might be such a compiler, but it is not the standard mode of operation).

In CPython, Python compiles to bytecode, which is then interpreted by the Python interpreter (which itself is written in C).


RPython, the Python subset that PyPy is written in, is compiled to C. (That said, it is a fairly restrictive subset, and the compiler is very much designed for interpreters and nothing else.)


Python does not compile to C. In CPython, it compiles to VM bytecode for a VM written in C (analogous to Clojure compiling to JVM bytecode).


When compiling a language you have to compile it "to" something. You could pick assembly, but assembly is so low-level that it varies from processor to processor. You want a stable target that's "low level" enough to let you optimize for the machine you'll be working with, but not so low-level as to be a moving target.

You could pick C, but C is actually still fairly complex and has lots of undefined behavior. You could pick JVM, but then you'll pick up the entirety of the JVM architecture which is quite large and may include many things you're not interested in.

C-- is another choice. It's decidedly lower-level than C (and thus far lower than the JVM), higher-level than assembly, and was crafted, as far as I know, under the deep influence of how to compile the pure functional language Haskell (or more specifically, its System FC-style and STG intermediate languages).

In the meantime, LLVM took a similar place in this hierarchy and has probably taken off much more than C-- has. GHC, a Haskell compiler, is in fact moving its compilation pathway that way.


Cool! The first parser/compiler I wrote was for C-- (the version that is a small subset of C, not this one). I hadn't even heard a mention of the different C-- languages for a few years now.


Would C-- be a good choice for JIT machine code generation, or is it mostly for static compilation?


How does C-- compare to the CIL of .NET?


As far as I can tell, some similar features, but overall completely different.

For one, I think C-- still looks a lot like C, i.e. it's still an imperative language. CIL is a stack language (also with a built-in object system).


Where can I find a code example?


If the language is called C--, how come the website is named Cminusminus?


Is the 'minus minus' being converted into a dash throughout?


I saw it in a few places, but often it was typeset C - -


Say you're writing a compiler for a language in Haskell, and you want to generate machine code rather than having it be interpreted. Is C-- a natural choice on this platform? Or might LLVM be a better choice?


Use llvm-general. http://hackage.haskell.org/package/llvm-general

Idris uses llvm-general to have a simple LLVM backend. Also, llvm-general is probably the nicest and most thorough LLVM binding you'll find.



