
C as an Intermediate Language (2012) - jlturner
http://yosefk.com/blog/c-as-an-intermediate-language.html
======
cokernel_hacker
Professional compiler engineer here: C is a mediocre intermediate language.

Let's start with an excellent quote from Wittgenstein. "The limits of my
language mean the limits of my world."

Using C as your intermediate language means that your expressiveness is
limited to valid C programs. This is workable but only if your language can be
mapped to C in _useful_ ways.

For example, let's say your language guarantees proper tail calls, as Scheme
does. How would you get this behavior from a C compiler? You will never be
able to get it reliably across optimization levels, compilers, etc.

Guaranteed tail calls are just the tip of the iceberg; there are many more
features which cannot be reasonably mapped onto C.

Real compiler IRs increase your expressivity beyond what the C language
designers decided was important.

~~~
pcwalton
Exactly, and this is what most people who work on compilers discovered long
ago. It's disappointing to see "just compile to C!" still so popular on HN.

I would cite GC as the most prominent example of a feature that is
incompatible with compiling to C. It is impossible to write a competitive,
high-performance tracing GC on top of C: mandatory register spills at safe
points and conservative stack scanning both have huge downsides.

~~~
ndesaulniers
> It's disappointing to see "just compile to C!" still so popular on HN.

Posts on C get blindly upvoted on HN and proggit. I guarantee it. Methinks
it's like the History Channel for programmers: learning about their
forerunners and how they had to bang rocks together to make fire.

It's also how I've gotten all of my karma. Unfortunately, the abyss has also
looked into me, as I now write C code for a living...

------
kragen
As far as I can tell, on all the axes that LLVM excels as an intermediate
language (ease of getting started, debugging support, many backends,
optimization, flexibility), C is even better. C is easier to get started with,
has easier debugging support, has more backends, better optimization, and more
flexibility.

To take a simple example, I have here a 2428-line C program generated by
compiling Linus Åkesson's Game of Life in BF
([http://www.linusakesson.net/programming/brainfuck/](http://www.linusakesson.net/programming/brainfuck/))
into C using Daniel B. Cristofani's dbf2c.b
([http://www.hevanet.com/cristofd/brainfuck/dbf2c.b](http://www.hevanet.com/cristofd/brainfuck/dbf2c.b)),
which is a BF compiler written in BF. Compiling these 2428 lines of C to
machine code using tcc 0.9.25 takes 20ms on my 1.6GHz Atom netbook. Most of
this is about 16ms of tcc overhead (startup and shutdown time); the rest is
compiling several hundred thousand lines of C per second with tcc. You should
get several million lines of C per second with tcc on a modern machine.

This isn't optimized code; it's roughly equivalent to gcc -O0 output,
typically about 3×–5× slower than optimized code. But that's still enormously
better than interpretation overhead.

(Using decent optimization levels with GCC makes it take several seconds to
compile, because GCC's optimizer doesn't deal well with enormous functions.)

dbf2c.b, the C-generating BF compiler, is 892 bytes of BF code when stripped.
Now, I'm not saying you should write your compilers in deliberately obfuscated
programming languages in as few bytes as possible; I'm saying that the fact
that this is even possible at all should give you really good feelings about
how easy it is to compile things to C.

~~~
pcwalton
What? How does C have "better optimization" than LLVM? As far as I can tell,
this seems clearly false: you can't do custom alias sets, for instance, in C,
whereas you can in LLVM. None of your examples mention optimization at all
(and BF is also totally unrepresentative of anything in the real world).

Also, C may have easier debugging support, but it's significantly worse than
having exact control over the contents of your DWARF DIEs. Making it easy to
have a bad debugging experience isn't a plus as I see it.

~~~
kragen
C has better optimization than LLVM because if you generate LLVM IR, you can't
compile it with icc or Visual Studio. (Or GCC, but LLVM probably already
equals or beats GCC at optimization.)

You're probably right that LLVM also has better optimization than C in other
respects, because of things like the possibility of better alias analysis.

I agree that BF is unrepresentative of anything in the real world. Thank
goodness.

I can certainly see how debugging support could be better if you can go beyond
the support C gives you.

------
dom96
Indeed. C is brilliant as an intermediate language. One of the greatest things
about it is that once you generate C code, it's not much of a stretch to also
generate C++- or Objective-C-specific code and take advantage of the libraries
written in those languages.

Despite this, there are many people out there who view languages that compile
to C as being inferior. I still don't understand it.

~~~
yokohummer7
There are many things that are present in assembly but lacking in C. For
example, how do you detect integer overflow in C without resorting to compiler
extensions?

~~~
jjnoakes
You stub out the few architecture specific routines you need and let someone
write them for their architecture in asm if it doesn't exist already.

A minimal porting effort for the best of all worlds.

~~~
pcwalton
The best of all worlds, except that you lose your optimizations.

~~~
jjnoakes
I would like to see the benchmarks before I agree to that.

You might not get the best optimization possible if you have to leave it up to
the C compiler with inline assembler... But that mix worked well for C and C++.

And Rust can still optimize things like deciding which integer operations need
the overflow check, adding it only where static analysis fails.

~~~
pcwalton
No, it doesn't work well for C and C++ at all. Turn off InstCombine and see
what performance you get in LLVM for an idea of what will happen if the
compiler doesn't know about the semantics of + and -.

~~~
jjnoakes
We are only talking about a subset of additions and subtractions to begin
with.

How often will idiomatic rust code need overflow checks?

Also, if the UB operations are defined by target assembler (instead of in C++)
then you may not even need extra overflow checks, depending on the assembler
semantics. This would only be needed in the places where static analysis
failed, of course.

One could also use the native compiler options, if available, to avoid the
issue altogether (like -fwrapv).

~~~
pcwalton
> We are only talking about a subset of additions and subtractions to begin
> with.

No, you're talking about all signed addition.

> How often will idiomatic rust code need overflow checks?

You'd be surprised. Go look at the implementations of containers in the
standard library.

> Also, if the UB operations are defined by target assembler (instead of in
> C++) then you may not even need extra overflow checks, depending on the
> assembler semantics. This would only be needed in the places where static
> analysis failed, of course.

I don't know what this means exactly. If you're saying you should implement a
static analysis to try to eliminate unneeded overflow checks, then this is a
huge burden.

> One could also use the native compiler options, if available, to avoid the
> issue altogether (like -fwrapv).

If you're dependent on GCC/clang extensions to C, then you're not compiling to
C. You're compiling to a front end to GIMPLE/LLVM. At that point there's no
benefit over just compiling to GIMPLE or LLVM in the first place.

~~~
jjnoakes
> No, you're talking about all signed addition.

No, I'm talking about the ones that won't overflow. Some can be statically
shown to not need a runtime check.

> If you're dependent on GCC/clang extensions to C, then you're not compiling
> to C.

I'm not saying that. I'm saying require either the target compiler (which the
user is in control of) to provide a flag to give signed overflow the proper
semantics (which would be useful if the user happens to be using clang or
gcc), or fall back to requiring some asm for the target system which provides
the needed semantics.

This would all be up to someone to configure for the target machine. Not a big
deal once per architecture.

> At that point there's no benefit over just compiling to GIMPLE or LLVM in
> the first place.

Well if GIMPLE or LLVM was available for all of the targets that C compilers
are, then I'd agree.

But they aren't. So I don't.

------
FraaJad
A few languages that generate C that are currently popular are:

* Nim

* Vala (GObject backend)

* Purescript (technically it has a C++ backend, but it is worth mentioning here for the very clean C++ it produces!)

------
felixangell1024
It's also insanely easy to generate code for. In addition, you don't have to
build an entire LLVM toolchain to get it to work on your system -- LLVM is a
nightmare to set up on Windows.

------
neopallium
Here is a list of compilers that can generate C code for different languages:

[https://github.com/dbohdan/compilers-targeting-c](https://github.com/dbohdan/compilers-targeting-c)

------
jjnoakes
Portability and interoperability are key in industry.

I think all of the new languages are amazing these days. I am learning Rust,
Clojure, Haskell, and many more.

But I cannot use any of them at work. I'm in a position to influence teams of
smart programmers, but the only language I would be able to use is one that
fits into the existing infrastructure, which is natively compiled shared
libraries on esoteric Unix platforms.

So no LLVM. No JVM. No Haskell. I could possibly get away with a lisp or
scheme that compiled portable C if I wanted. But that doesn't excite me as
much.

I would kill for a Rust to human-readable C++ transpiler. I think it could be
used immediately by many.

I may have to write one.

~~~
qznc
I believe porting LLVM to esoteric Unix platforms would be easier than a C
backend. Less fun though.

~~~
vmorgulis
Porting nothing is better.

~~~
jjnoakes
Sure. Port nothing and keep writing in suboptimal languages. It is what we do
now.

But does it hurt to strive for better? I don't think it does.

~~~
vmorgulis
Yes, it does.

For me, C as an intermediate vs LLVM language is a "worse is better" or "good
enough" issue.

LLVM is far more powerful but more complex. C is simpler and already works
everywhere.

------
Artlav
This raises the question of which would be better: compiling your favorite
language to C and using something like Emscripten to produce JS, or compiling
your code directly to JS.

So far as my experiments went, a Dhrystone test that scores 3000 natively
drops to 1300 when compiled via C, and that C compiles down to 850 worth of
JS. Which suggests that a direct-to-JS compiler might be worth the effort...

~~~
gokr
You can play around with Nim to get some numbers on this, since it can compile
to C, C++, or JS.

------
keithnz
I really wish this were more common, as language choices on embedded platforms
are often VERY limited, i.e., C.

Though on embedded systems you'd also want more constraints on the backend,
like not using dynamic memory, or perhaps being able to specify that the
output code is MISRA compliant.

------
ilaksh
If you guys are interested in languages that compile to C, the best
programming language I know of compiles to C and it is called Nim.

~~~
felixangell1024
EDIT0: It appears I was wrong; for some reason I was under the impression that
DMD produced C code. Ignore this comment! :)

Another language that is widely used by hobbyists and in production is D
which, as far as I know, produces C code too.

EDIT: I should say there are a few D compilers, DMD, LDC, and some other one
by GNU(?).

I think that DMD is the main compiler, but I could be wrong. LDC generates
LLVM code.

~~~
gmfawcett
There are three compilers for D (DMD, GDC, LDC), but none of them uses C as an
intermediary language.

------
vmorgulis
If the target compilers are GCC and clang, C++ is an even better intermediate
language.

