Hacker News
C as an intermediate language (2012) (yosefk.com)
133 points by yinso 10 months ago | 60 comments

I considered compiling D to C. The trouble is when you need to do things like exception handling, adjustor thunks, define things that are implementation defined or undefined in C, etc. I figured I'd spend too much time struggling with that, and besides, it would make compilation slow.
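
To make the exception-handling pain concrete: one common way to lower try/catch to plain C is setjmp/longjmp. This is only a minimal sketch of that general approach, not how D or any compiler mentioned here does it; `try_frame`, `throw_exc`, `checked_div` and `safe_div` are hypothetical names made up for the illustration.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical runtime support a C-emitting compiler might generate:
   a linked stack of the currently active "try" frames. */
struct try_frame {
    jmp_buf env;
    struct try_frame *prev;
};

static struct try_frame *current_frame = NULL;
static int pending_exception = 0;

/* "throw": jump back to the innermost active try frame. */
static void throw_exc(int code)
{
    if (current_frame == NULL)
        abort();                      /* uncaught exception */
    pending_exception = code;
    longjmp(current_frame->env, 1);
}

/* A callee that may throw (code 1 stands for division by zero). */
static int checked_div(int a, int b)
{
    if (b == 0)
        throw_exc(1);
    return a / b;
}

/* Lowered form of: try { return a / b; } catch { return fallback; } */
int safe_div(int a, int b, int fallback)
{
    struct try_frame frame;
    int result;

    frame.prev = current_frame;
    current_frame = &frame;
    if (setjmp(frame.env) == 0) {
        result = checked_div(a, b);   /* try body */
        current_frame = frame.prev;   /* pop the frame on normal exit */
    } else {
        current_frame = frame.prev;   /* pop the frame, then run the catch body */
        result = fallback;
    }
    return result;
}
```

Even this toy version needs a frame stack and care about which locals are live across the jump, which hints at why a real implementation (destructors, nested handlers, rethrow) takes a lot of effort.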

> Without doubt, today the answer is C++ and Objective-C – languages whose first compilers emitted C code.

I wrote the first native C++ compiler for the PC. It was a huge win over the cfront based PC compilers, which were nearly unworkable.

Another problem I failed to mention is you don't have control over the C compiler the customer is using. It may behave differently, have bugs in it, etc., and the customer will blame you. Being unable to fix the C compiler means you'll be fixing the problem in the wrong place, making things pretty hard on you.

I run into these sorts of problems with D's C++ interface. In particular, how "long long" and "long" are mangled when they are the same size, and when "long long" is used vs when "long" is used. It's an endless whack-a-mole problem.
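
A small C11 sketch of why this is awkward: `long` and `long long` stay distinct types even on platforms where they have the same size, so anything keyed on the type (such as C++ name mangling, where the Itanium ABI uses `l` for `long` and `x` for `long long`) can't just go by width. `TYPE_NAME` and the helper functions are names invented for this example.

```c
/* C11 _Generic selects on the type of the expression, not on its size. */
#define TYPE_NAME(x) _Generic((x), \
    long: "long",                  \
    long long: "long long",        \
    default: "other")

/* 0L has type long, 0LL has type long long. */
const char *name_of_long(void)      { return TYPE_NAME(0L); }
const char *name_of_long_long(void) { return TYPE_NAME(0LL); }

/* 1 on typical 64-bit Linux/macOS (LP64), 0 on 64-bit Windows (LLP64). */
int long_and_long_long_same_size(void)
{
    return sizeof(long) == sizeof(long long);
}
```

So a binding generator has to track which spelling the C++ side used, not just how many bits it occupies.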

Well, you don't have to generate raw C code and deal with all the compilers. It might be possible if your language doesn't have machine types, but if it does you have to do what C software usually does: test the compiler for various things and generate a header to rely on. Maybe even supply a ./configure script.
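
As a rough illustration of the "test the compiler and generate a header" idea, a probe program can be built with the customer's C compiler and asked to describe its own integer types. This is just a sketch; the `MYLANG_` macro names and `write_config_header` are hypothetical.

```c
#include <limits.h>
#include <stdio.h>

/* Writes a small config header describing this C compiler's basic types
   into buf. Returns what snprintf returns: the number of characters that
   the full header needs, excluding the terminating NUL. */
int write_config_header(char *buf, size_t size)
{
    return snprintf(buf, size,
        "#define MYLANG_CHAR_BIT %d\n"
        "#define MYLANG_SIZEOF_INT %d\n"
        "#define MYLANG_SIZEOF_LONG %d\n"
        "#define MYLANG_SIZEOF_LONG_LONG %d\n"
        "#define MYLANG_SIZEOF_POINTER %d\n",
        CHAR_BIT, (int)sizeof(int), (int)sizeof(long),
        (int)sizeof(long long), (int)sizeof(void *));
}
```

The generated C would then include that header and pick its machine types based on the macros instead of guessing.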

> I figured I'd spend too much time struggling with that, and besides, it would make compilation slow.

Is that only because of the extra compilation step, or are there other things making it slower?

Things that make it slow:

1. Compile the code.
2. Generate strings for C code (C is much larger than intermediate code). Translating things like integers to strings is slow.
3. Write the file out, as opposed to just keeping the data in memory.

C compilers are already slow (due to the preprocessor and phases of translation), so add your compiler on top of that, and you're going to have an unfixably slow solution.

If you were writing D from scratch today, would you go for a full back end solution or would you target the LLVM?

There's already LDC, a D compiler using the LLVM. The 3 D compilers (DMD, GDC, LDC) nicely complement each other.

The Mercury compiler has a backend which compiles to C. This is currently described as "high-level" C, because the generated code sort-of (relatively speaking) looks like code that a C programmer might write. But back in the early days (mid 90s?) it emitted "low-level" GNU (not ANSI/ISO) C, which basically looked like assembly code for an abstract machine. Kind of like the "compilation strategy" described in TFA, taken a number of steps further.

One of the nice things about GNU C is that various GCC-specific extensions can be taken advantage of. For example ISO C does not require a compiler to optimise tail recursive calls, which are heavily used in functional (and logic) programming languages, where recursion is used instead of loops. It also lets you access CPU registers directly, and write inline assembly code.
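
One such GCC extension is "labels as values" (`&&label` and `goto *`), which lets generated code jump between basic blocks of an abstract machine without growing the C stack. The sketch below is only an illustration of that extension, with an ISO C fallback; it makes no claim about what Mercury's backend actually emits, and `sum_to` is a made-up example.

```c
/* Sum 1..n using jumps between "basic blocks" instead of calls, so the
   C stack never grows no matter how many iterations run. */
long sum_to(long n)
{
    long acc = 0;
#if defined(__GNUC__)
    /* GNU C extension: &&label yields a label's address as a void *,
       and goto *ptr jumps to it. */
    void *next = &&loop;
    goto *next;
loop:
    if (n <= 0) {
        next = &&done;
        goto *next;
    }
    acc += n;
    n -= 1;
    next = &&loop;
    goto *next;
done:
    return acc;
#else
    /* Portable ISO C fallback: an ordinary loop. */
    while (n > 0) {
        acc += n;
        n -= 1;
    }
    return acc;
#endif
}
```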

This becomes difficult when you want to do something that C doesn't really handle (well), such as working with the processor flags or passing control to another function while leaving the stack intact (objc_msgSend).

If you can then insert some asm, that's good, otherwise you hit a brick wall.

See: https://cr.yp.to/qhasm/20050129-portable.txt

This. C is a great intermediate language for languages that have semantics close to C. Otherwise, not so much.

Another language that famously took this route is Eiffel.

EiffelStudio uses a VM like environment for the develop-compile-debug cycle, and then compilation to native code via the platform's system C compiler for deployment.

Eiffel seemed like a really good language. I wish it had succeeded more and become mainstream. I've tried it out some with EiffelStudio earlier. Their support for developing GUI apps, at least on Windows, also seemed good, based on what I tried of it.

The book "Object-Oriented Software Construction" by Bertrand Meyer, Eiffel creator, is also very good. I had read most of it some years ago.

I remember compilation being rather slow though, and deleting the EIFGENS folder frequently when something stopped working. This meant all the C code had to be regenerated and then recompiled, which took a while.

I still think it was a great language for learning programming.

In addition to the ones already mentioned here, pypy, a tracing jit for python2 and python3, is another project which gets compiled to C. It's written in RPython, which gets compiled down to C.

RPython includes a jit generator that can be used to speedup new languages written in it. Pixie [0] is an example of another language written in RPython.

[0]: https://github.com/pixie-lang/pixie

Cython is another Python-like language that gets compiled to C; it's a great tool for getting big performance bumps for very little effort.

In my experience, by the time I start getting performance improvements with Cython (typing everything, removing Python function calls, etc.), I seem to end up with code that's bigger and messier than if I had just written a C extension or used f2py.

Pixie was abandoned by the author mostly because he realized he couldn't achieve the performance that Clojure has on the JVM.

> [...] pypy, a tracing jit for python2 and python3, is another project which gets compiled to C.

That's interesting. How does it work? I think most JIT compilers emit assembly directly and then execute it. Does PyPy generate C code while your program is running?

To add to the other answer, PyPy is a "Meta-JIT compiler": it generates JITs, it isn't itself a JIT.

You might want to read PyPy's documentation wiki[1] and the "Tracing the Meta-Level" paper from the PyPy authors[2].

PyPy consists of a Python interpreter written in the RPython language. RPython is a statically typed language that is easy for an optimizing compilation framework to work with. It also happens to be a valid subset of Python, but that isn't actually that important.

The RPython translation framework has three ways to run RPython code:

* The first way is by using an ahead-of-time compiler to convert the RPython to C and then passing that to a C compiler (gcc or clang).

* The second way is by compiling the RPython down to RPython bytecode and running that through an RPython bytecode interpreter. (The RPython bytecode interpreter is written in C)

* The third way is the trace compiler. It takes in a linearized trace of the program execution (as observed by the RPython bytecode interpreter) and generates optimized machine code for this trace.

So, going back to PyPy as a concrete example, this is what is going on:

PyPy is a Python interpreter written in RPython. The pypy executable contains two versions of this python interpreter, both created by the RPython translation framework. The first version is the one where the rpython code for the python interpreter gets compiled to C. The second version is one where the rpython code for the python interpreter is converted to rpython bytecode (which is stored in the data section of the pypy binary) and where this bytecode is in turn executed by the rpython bytecode interpreter.

When you run a python program with pypy, it starts being executed by the first interpreter. When this interpreter detects a hot loop in the python program, it transfers control to the RPython bytecode interpreter. This bytecode interpreter executes for one iteration of the loop, and records a trace of what rpython bytecode instructions were executed in this iteration. Then, it uses the jit-compiler to directly generate machine code for this trace, and transfers execution to that.

If it goes according to plan, the base C interpreter will have speed comparable to CPython (or slightly slower), the tracing interpreter will be very slow (but for just one iteration) and the machine code for the traces will be very fast.
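
To illustrate just the "detect a hot loop, then record one iteration" part, here is a toy sketch in C. It is not PyPy's implementation: the bytecode, `HOT_THRESHOLD`, `run` and `sum_loop` are all invented for the example, and it stops at recording the trace instead of generating machine code from it.

```c
#include <stddef.h>

enum op { OP_ADD, OP_INC, OP_JLT, OP_HALT };

struct insn { enum op op; int arg; };

#define HOT_THRESHOLD 3   /* hypothetical; real JITs use much larger counts */
#define MAX_TRACE 16

struct vm_result {
    long acc;                  /* final accumulator value */
    enum op trace[MAX_TRACE];  /* opcodes recorded during one hot iteration */
    size_t trace_len;
};

/* Toy program: do { acc += i; i++; } while (i < n); */
static const struct insn sum_loop[] = {
    { OP_ADD, 0 }, { OP_INC, 0 }, { OP_JLT, 0 }, { OP_HALT, 0 }
};

struct vm_result run(const struct insn *prog, long n)
{
    struct vm_result r = { 0, { OP_HALT }, 0 };
    long i = 0;
    size_t pc = 0;
    int back_jumps = 0, recording = 0, recorded = 0;

    for (;;) {
        struct insn in = prog[pc];
        if (recording && r.trace_len < MAX_TRACE)
            r.trace[r.trace_len++] = in.op;   /* linear trace of executed ops */
        switch (in.op) {
        case OP_ADD: r.acc += i; pc++; break;
        case OP_INC: i++; pc++; break;
        case OP_JLT:
            if (i < n) {                      /* taken jump = loop back-edge */
                if (recording) {
                    recording = 0;            /* one full iteration recorded */
                    recorded = 1;
                } else if (!recorded && ++back_jumps == HOT_THRESHOLD) {
                    recording = 1;            /* loop is hot: start tracing */
                }
                pc = (size_t)in.arg;
            } else {
                recording = 0;                /* loop exited */
                pc++;
            }
            break;
        case OP_HALT:
            return r;
        }
    }
}
```

With `run(sum_loop, 10)` the loop becomes hot after three back-edges and the recorded trace is ADD, INC, JLT; a real trace compiler would now turn that straight-line sequence (plus guards) into machine code.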


You might be wondering: why bother with so many different steps?

1) Why write a python interpreter in rpython and then run that on an rpython jit compiler instead of just writing a JIT compiler for Python?

The PyPy devs had already tried that with the Psyco JIT compiler.

2) Why generate C code for the first interpreter if you already have a JIT compiler that can generate machine code?

The JIT compiler can only compile linear traces, and can't compile whole functions.

3) Why not use C as an intermediate language for the JIT compiler as well?

While C is a workable intermediate language for method-at-a-time compilation, it is less suited for trace compilers.

PyPy is written in RPython, which can be compiled to C. PyPy itself emits machine instructions like a regular JIT interpreter.

There was a big Hacker News thread on Pixie a while back with the author, Tim Baldridge, about why he eventually stopped the project. A neat idea.

Chicken Scheme does this. https://www.call-cc.org/

And Bigloo Scheme. http://www-sop.inria.fr/indes/fp/Bigloo/doc/bigloo-3.html

And Gambit Scheme. http://gambitscheme.org/

There seems to be a theme here...

I have always wondered why C is used as an IL and not something like C--, which is more portable and less system call specific. I think that LLVM IR already serves this purpose well and is really easy to program in by hand.

I'm still learning how to build a language, and compiling to c seems like the easiest thing to do, but I'm curious if it's more difficult or not.

If my language is close to c (functions, scope, variable, types), can I take advantage of it so it's less work in my compiler to let the c compiler catch errors, or must I rewrite a full parser?

All I want to do is add pythonic indenting, range loops, maps, and geometric types with their operators (a little similar to shader languages).

As described, you’ll almost definitely need to write a parser. However, if your type system is very similar to C’s you may be able to just lower from the AST to C and let it worry about semantic analysis and code generation.
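
For a feel of what "lower from the AST to C" means for something like a range loop, here is a minimal sketch. The `range_loop` node and `emit_range_loop` are hypothetical, and a real compiler would emit the body recursively instead of taking it as a ready-made string.

```c
#include <stdio.h>

/* Hypothetical AST node for: for <var> in range(<start>, <stop>): <body> */
struct range_loop {
    const char *var;      /* loop variable name */
    int start, stop;      /* half-open range [start, stop) */
    const char *body_c;   /* already-lowered C text for the loop body */
};

/* Emits the equivalent C for-loop into out. Returns what snprintf
   returns: the length the full text needs, excluding the NUL. */
int emit_range_loop(char *out, size_t size, const struct range_loop *loop)
{
    return snprintf(out, size,
        "for (int %s = %d; %s < %d; %s++) {\n"
        "    %s\n"
        "}\n",
        loop->var, loop->start,
        loop->var, loop->stop,
        loop->var,
        loop->body_c);
}
```

Given `{ "i", 0, 3, "total += i;" }` this produces an ordinary `for (int i = 0; i < 3; i++)` loop, and any type error inside the body is then reported by the C compiler against the generated text.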

The biggest loss is that a lot of error and debug info will be relative to the C rather than your language. This means the user will be exposed to the internals.

Probably the biggest win, is that it’s almost trivial to integrate C, and potentially C++ libraries into your language. This has been hugely beneficial to Nim, for example. They have written a full front end though so they don’t have the error issue above.

Like most things it’s a trade off.

Not mentioned in that list of downsides is that you need to at least embrace the calling conventions, threading models, and memory models that C supports, at least for the flavors you target.

> The biggest loss is that a lot of error and debug info will be relative to the C rather than your language. This means the user will be exposed to the internals.

In other words, sometimes the end user has to debug a bunch of code they didn't write. I wouldn't underestimate how much a pain this can be.

> Probably the biggest win, is that it’s almost trivial to integrate C, and potentially C++ libraries into your language.

Again, subject to adhering to threading models, etc. If you're writing a language as a hobby, this may not be a big concern, but at scale, someone will be using memcpy or memory barriers or thread-local memory somewhere. It is certainly possible to support those use cases, but I'd be careful about overstating how trivial that is.

I would say that emitting LLVM IR is easier than C. I have done a few toy compiler projects with LLVM and it is very nice. I've never done a compiler that emits C, but I can't imagine it being any easier. At least you'd have to write more boilerplate code.

You also can't create a JIT-compiled REPL using C as easily.

The thing about llvm ir is you get a code builder "for free" from llvm, where you're just calling functions to build your ir. However, when emitting C you have to build that yourself and take care to always emit valid code. Not sure how much difference that actually makes though.

I'd actually argue the other way. If you emit invalid C code, you'll at least be told by the compiler (though the error might be pretty obscure). LLVM tends to assume that you are emitting valid IR and panics when you don't.

BTW this isn't a criticism of LLVM (far from it). It's expected to be used as a backend so any invalid IR is, pretty much by definition, a programmer error.

At the quick and dirty level, IMO C wins here as you can just splat the C you're building to a file and inspect it. With LLVM you need to get through the hurdle of building valid IR before you can serialise it.

In the longer term, I would argue that LLVM is better as you have way more control over what you're generating. Once your lowering works then the issue above is irrelevant.

> I'd actually argue the other way. If you emit invalid C code, you'll at least be told by the compiler (though the error might be pretty obscure). LLVM tends to assume that you are emitting valid IR and panics when you don't.

LLVM does have a verifier pass which should catch every invalid IR. However, optimization passes aren't necessarily robust against weird types (I don't expect <13 x i3> to go over very well), and the backends are liable to flip out if the IR isn't in the subset they expect (e.g., weird types).

There is another way to track down LLVM IR errors.

Instead of serializing it, directly write as text and process the file with the LLVM IR assembler.

Granted, it is still harder than just using C, but it is a step closer to LLVM and easier to debug.

I have used CIL [1] to generate C code before. Not exactly its main purpose, but it works nonetheless.

Essentially you generate a C-ast and then pretty print it. It has some limitations though.

[1] https://github.com/cil-project/cil

>However, when emitting C you have to build that yourself and take care to always emit valid code. Not sure how much difference that actually makes though.

I'm kind of surprised that there's not a library for emitting C code, though maybe my Google-fu is failing.

You can implement reasonably C-like languages (e.g. Objective-C and various C-derived DSLs in many Lisp implementations) with a simple text-transforming preprocessor and get reasonable error messages from the C backend (#line is your friend for achieving that).

But for anything more complex you are better off implementing a complete parser with your own error handling and reporting.

Edit: one of the clearest signs of a language implemented in this way is the use of the '@' character in syntax extensions, as it is the only printable ASCII character that has no syntactic meaning in C. (The other such "unused character" is '$', but its meaning is implementation-defined; for example, gcc has an option that causes it to be accepted as part of identifiers, which is quite obviously the default on VMS.)

Sounds somewhat like nim. https://nim-lang.org

Using LLVM is a little bit harder but gives you advantages like better debugging. Long term that pays off.

If you only play around, using C is fine.

I've never had any trouble debugging Nim, which compiles to C. The Nim compiler adds #line directives so you can just use gdb/lldb[1]. Can you elaborate on what debugging advantages LLVM brings?
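
For anyone who hasn't seen the mechanism: a `#line` directive overrides the line number and file name the C compiler attributes to the lines that follow, and that is what ends up in diagnostics and debug line tables. A tiny sketch, where `example.nim` and line 42 are made up and this is not actual Nim compiler output:

```c
/* Pretend the next two lines were generated from lines 42 and 43 of a
   hypothetical source file example.nim. */
#line 42 "example.nim"
int reported_line(void) { return __LINE__; }
const char *reported_file(void) { return __FILE__; }
```

Here `reported_line()` returns 42 and `reported_file()` returns "example.nim", even though the code physically lives in a generated .c file.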

1 - https://nim-lang.org/blog/2017/10/02/documenting-profiling-a...

Does Nim optimize anything on the way to C? That usually destroys useful information.

I can give you an example from C compilers. Consider a simple loop:

    int a[] = ...;
    for (int i=0; i<max; i++) {
        printf("%d\n", a[i]);
    }

Something fishy is printed, so I run it in my debugger, break at some point, and print "i". Gdb can usually do that. Now think about the optimizations a compiler probably did: while the index is incremented by 1, we actually step 4 bytes further through the int array, so the compiler does i+=4 instead to avoid the multiply-by-4 instruction. And while i starts at 0, we can instead use the "a" pointer directly and avoid adding an offset on every loop iteration. The code might actually look like this:

    int *a = ...;
    char *end = (char*)a + max*4;
    for (char *x = (char*)a; x < end; x += 4) {
        printf("%d\n", *(int*)x);
    }

How can gdb print "i"? The trick is DWARF [0]: basically, the compiler inserts the information "if you want to print i here, compute it as i = (x - (char*)a) / 4".

tldr: #line directives are not enough

[0] https://en.wikipedia.org/wiki/DWARF
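
Here is a runnable sketch of that transformation, written out by hand. It only imitates what an optimizer may do (real output depends on the compiler and flags), and `sum_indexed`, `sum_strength_reduced` and `recover_index` are names invented for the illustration.

```c
#include <stddef.h>

/* What the programmer wrote: an index-based loop. */
long sum_indexed(const int *a, int max)
{
    long total = 0;
    for (int i = 0; i < max; i++)
        total += a[i];
    return total;
}

/* Roughly what the optimized code may look like: the variable i no
   longer exists, only a byte pointer x that advances by sizeof(int). */
long sum_strength_reduced(const int *a, int max)
{
    long total = 0;
    if (max <= 0)
        return 0;
    const char *end = (const char *)a + (size_t)max * sizeof(int);
    for (const char *x = (const char *)a; x < end; x += sizeof(int))
        total += *(const int *)x;
    return total;
}

/* The kind of expression the debug info has to carry so that a debugger
   can still show "i" when only x is live. */
ptrdiff_t recover_index(const int *a, const char *x)
{
    return (x - (const char *)a) / (ptrdiff_t)sizeof(int);
}
```

Both loops compute the same sum, but only the first has an `i` to print; a `#line` directive can say where a statement came from, not how to reconstruct a variable that was optimized away.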

Vala seems to be doing ok targeting C. I'd hardly say it's just for toys.

Nim does it as well, but Vala's case is a bit different, as it emits GLib-based code with GObject and friends: basically a whole OO system with containers, an event loop and whatnot. The idea of Vala is that if you're using GLib, Vala will generate the GLib code you would otherwise write by hand, but with much less effort. The rationale for emitting C is much stronger here, as you can mix it with existing code, etc.

There certainly is a grey area. For example, there are esoteric architectures which LLVM does not support but for which a C compiler exists. Emitting C is probably easier than developing a few LLVM backends.

Still, my simple opinion is: If you are serious, use LLVM. It is not that much harder than outputting C and has clear advantages like debug information.

As I mention below, Vala debugging isn't the catastrophe you think: https://wiki.gnome.org/Projects/Vala/Tutorial#Debugging

I believe it's based on #line directives that tell the C compiler (and/or GDB, presumably) which line of which file to blame.

Serious work has been done in Vala. It's not fair to dismiss the language as a quaint toy.

Preprocessor directives are not powerful enough [0], imho.

[0] https://news.ycombinator.com/item?id=16197856

That isn't an attribute of using C as an IR, it's an attribute of optimising compilers.

The solution is to disable optimisation for debug builds, not to avoid C.

From the gnome.org page:

> The -g command line option tells the Vala compiler to include Vala source code line information in the compiled binary

How does Vala handle debugging info and compiler diagnostics? Do the end users have to analyze the emitted C when something goes wrong?

Annoyingly, Vala is extremely shy about the generated C, but I believe it uses #line directives to map the generated C back to the Vala source.

I've only dabbled in Vala, but apparently its "-g" flag lets you debug with GDB. https://wiki.gnome.org/Projects/Vala/Tutorial#Debugging

> let the c compiler catch errors, or must I rewrite a full parser?

In the long run you always want to do it specifically for your language, because otherwise giving usable error messages and debugging information becomes impossible.

I heard that GCC supports more architectures than LLVM. Perhaps Rust having a C backend would make sense? On the other hand, I have no idea how tightly coupled to LLVM it is.

The language itself isn't coupled to LLVM outside of some underspecified details around things like unsafe and the memory model, which will eventually be better-defined.

The compiler has historically been fairly strongly tied to LLVM. This is changing for several reasons as the compiler is refactored, which should enable alternative backends like directly generating C or even plugging directly into GCC.

There is also an LLVM backend for generating C, at various levels of maintenance, that might be made to work.

A very long-term project could be to eventually make the Rust compiler fully Rust-based.

However, that would not make much sense from an engineering point of view: too many years of optimization algorithms, tooling and supported architectures have gone into the gcc and LLVM backends just to throw them away in the name of purity.

There actually is work going on in that direction. The Cretonne code generator is written in Rust, and designed to support WebAssembly. It would also work as a fast rustc backend for debug builds on the few platforms it supports.

Interesting, thanks for pointing it out.

LLVM is used for optimizations in Rust. LLVM is more than just a nice way to target multiple backends, it's a very sophisticated set of building blocks for compilers.

Rust has tried not to couple its design to LLVM.

For a while, LLVM could target C.

There is some discussion of supporting Mozilla's experimental Cretonne compiler (initially designed to be a new Firefox wasm JIT) as an alternative backend for rustc. Cretonne could be handy for developers wanting faster compile times during development with the cost of less code optimization.


One I knew of was Vala. I’ve not used it, but the elementary-os devs seem to like it. https://wiki.gnome.org/Projects/Vala

Vala is interesting because it adds syntactic sugar for the GObject OOP library [1], which is a foundation of GTK+ GUI programming (but can also be used independently). All this was (is?) part of the idea that C++, Qt (with the moc) and finally KDE are bloated and broken at the root. From my understanding, the main argument is the (in many cases unnecessary) complexity of C++ vs. the simplicity of C. In the Gnome desktop world, in my experience, many devs moved on to Python-based GUI programming. But I still think there are purists who prefer C/GLib due to its plain design.

[1] https://en.wikipedia.org/wiki/GObject

Usually in C vs C++ flamewars, even virtual calls are seen by C devs as bloated, which is ironic in a language that needs to cache the calls to strlen().
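
For anyone who hasn't run into the strlen() point: C strings don't store their length, so strlen() has to scan for the terminating NUL on every call. A small sketch of the usual workaround (function names are made up; an optimizing compiler may hoist the call on its own in simple cases like this, but that is not guaranteed):

```c
#include <string.h>

/* strlen() in the loop condition may rescan the whole string on every
   iteration, turning a linear pass into a quadratic one. */
size_t count_spaces_naive(const char *s)
{
    size_t n = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == ' ')
            n++;
    return n;
}

/* Caching the length makes the single scan explicit. */
size_t count_spaces_cached(const char *s)
{
    size_t n = 0;
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] == ' ')
            n++;
    return n;
}
```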

One thing in flamewars is arguing about performance. Another one (which is more related to the OP) is readability and simplicity. If you compare a simple GUI program written in Vala/GTK+ with one written in C++/Qt, without doubt the Vala version is easier to read for the uninitiated. Of course, things may reverse if the problem is complex and C++'s OOP or the moc stands out compared to the GLib equivalents.
