Hacker News
Generating C code that people want to use (protzenko.fr)
225 points by bibyte 21 days ago | 51 comments

We sure expected that generating idiomatic C code was important, and this informed a lot of early design choices in our toolchain. We were surprised, however, by how closely Mozilla reviewed and manually inspected the generated C code.

Yes, yes, yes! Just because one is automatically generating/translating code, that doesn't mean it can't be pretty! When automatically translating code, the matching engine needs to work with the full syntactic expressiveness of the source language, and what is matched and translated needs to be idioms in each language! (As opposed to fine-grained syntactic elements. When the translation is done below the level of idioms, what results is non-idiomatic. It sounds pretty obvious when put like that.)

When compiling one language to another, or to assembly, or just straight to object code, the most important things are that a) you produce interfaces (APIs / ABIs) that are easy to use, b) you generate good code.

No one is going to demand that GHC generate readable assembly. Why should they demand that GHC generate readable C if it were generating C instead of assembly?

> Why should they demand that GHC generate readable C if it were generating C instead of assembly?

Debugging in programs that are mixed C and other code. If you're in the middle of debugging Firefox (or any composed system) and start stepping through unstructured gobbledegook, you'll end up cursing the people who did this to you.

This becomes less of an issue if/when the non-C language becomes so ubiquitous that there is proper debugging support for it (viz Python or C++ support in gdb), but as long as a bug may be triggered, propagated or otherwise interacted with from the non-C side, you want that code easily readable.

I remember a friend's cracked-out rant after a 48-hour all-hands debugging binge to find a bug deep inside auto-generated JavaScript and HTML. They were using Unicode text strings to tie the gobbledegook back to the source inputs and then trying to reason about what the code was supposed to do.

Comparison debugging mixed C source and assembly in a debugger is trivial.

Because the generated C is expected to be integrated into the project as source code: "Along the way, we learned what it takes to deliver quality C code that can be taken seriously and integrated into an actual source tree."

The code they generate is supposed to live on as source code, along with other code in the project. We should expect people to read and try to reason about code that exists in source control.

For code generators whose output you must or are expected to tinker with, I completely agree.

For compilers, however, I do not. For example, one codebase I help maintain is Heimdal, and it has its own ASN.1 compiler that generates C (and also something of an interpreted bytecode, as an option), and its output is not even properly indented -- nobody minds because it works, and on the rare occasion I have to inspect that code, I use tools like VIM or clang to format it.

It isn't just tinkering that means the C must be readable. People need to be able to read it and understand what it does without learning F*.

It doesn't matter if the C code you are generating only gets seen by the C compiler. It does matter if humans need to see the generated C code, understand it and debug it.

This happened to me at work too. A large majority of the codebase is manually written C, but some of that is too tedious/error-prone to be written so they are generated. We invested a great deal of effort to generate extremely readable code, even with comments and all. The reason is that other people need to read this C code—both interface and implementation—as they work.

Speaking of GHC, the Haskell ecosystem actually has great tools to generate readable code. The various Wadler/Leijen libraries like ansi-wl-pprint make it a breeze to generate readable code. Indeed not many other languages have so many good, if a bit idiosyncratic, libraries to choose from just for pretty printing!

Debugging and profiling tools, which understand C. You don’t want a mess of auto-generated function names showing up at the top of your profile. It’s better if the code generator produces something roughly human readable.

Also, crash reports produce stack traces. I’m positive that Microsoft and Mozilla are both heavy users of automated crash reports.

Abstractions are leaky. There are many tools for debugging assembly as well.

I've been playing around with nimlang...it compiles to C. It does create long identifiers, and consecutive ones too. Haven't had to, but running it in a C debugger would be madness.

Identifiers like:



You should check out Nim's gdb support. It will demangle these identifiers for you.

I would prefer generated C be readable because it’s quite likely I will need to step through it with a debugger.

At least in the web development world we have sourcemaps for that, html/css/js can all be inspected and debugged in their original form.

Is there anything like this for C?

Meanwhile I find it troublesome to rely on source maps even in web development. The JavaScript ecosystem doesn't produce nice enough abstractions that you can totally forget about the generated code and work in the original code. Just read the generated JavaScript actually run by browsers. Debugging sometimes really requires cutting through abstractions.

Same here. I did a bit of CoffeeScript development and source maps rarely worked properly. Arguably we just didn't have them set up properly, but it's a condemnation in its own right if our smarter-than-average engineers can't configure the technology properly.

I'm not sure if it's standard c/c++ but there are the #line directives (https://gcc.gnu.org/onlinedocs/cpp/Line-Control.html), bison and flex use them to good effect and I've used them when piping code into gcc. It'll give compiler errors in the right places and make gdb behave when stepping through code, not sure if it runs into limitations or not.

According to the oldest docs I could quickly find (https://gcc.gnu.org/onlinedocs/gcc-4.0.4/cpp/Line-Control.ht...) it goes back to at least gcc 2.95 in 1999, but I wouldn't be surprised if it predated javascript and css themselves.

> Is there anything like this for C?

I've been fixing up some ~15 year old code to compilable state and the flex/bison bugs were showing up with the flex/bison line numbers where the errors originated -- which somewhat helps but it turned out none of the errors were actually from flex/bison but because of how their API changed over the years.

I'd get some WTF compiler errors and track them down to running the output of flex through sed in a perl script to do something that it now does out of the box and didn't like being messed with. Fun times... well, it actually is fun times since I'm just doing this on my own so I can play with the software.

C preprocessor supports the `#line` directive that can specify a custom source file/line combination. The compiler (or the runtime code with `__FILE__` and `__LINE__` macros) can then use it to report the location of the error.
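To make the `#line` mechanism concrete, here is a minimal sketch of what a generator might emit. The file name "parser.y" and the line number 120 are made up for illustration; they stand in for whatever original source the generated C came from.

```c
#include <string.h>

/* Hypothetical: pretend the code below was generated from line 120 of a
   grammar file called "parser.y". Both values are invented for this sketch. */
#line 120 "parser.y"
int reported_line(void) { return __LINE__; }   /* this line now counts as parser.y:120 */
int reports_parser_y(void) { return strcmp(__FILE__, "parser.y") == 0; }
```

Compiler diagnostics and debug line information for these two functions would then point at parser.y rather than at the generated .c file, which is the effect flex and bison rely on.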

I haven't used sourcemaps enough to know for sure that the following problem isn't accommodated for: it's sometimes extremely opaque when you finally hit something arbitrary like an int32/uint32 mismatch, or when you are dealing with a second-class language getting an interface to the primary one and run into some weird quirk of how the interpreter passes data into the FFI.

There is also the other potential issue that the final code is doing something strange because of some #define magic, which is also very hard to trace without being able to walk through code.

Probably the majority of generated C code is where the input language lives in an entirely different domain (e.g. parsers), so sourcemaps wouldn't help.

I don't do web dev but you can debug mixed source and assembly. You can switch back and forth between stepping through source or assembly and inspect registers and variables at the same time. It starts to break down with heavily optimized code.

Part of the reason C still exists is enormous amounts of work went into the tooling.

At that point just look at the disassembly. In many cases you'll have to anyways, so you might as well.

Also, the code generator should include tracing and debugging support. That seems much more important to me than generating idiomatic and readable C.

It probably depends how likely the consumer is to look at the code. In the article, they’re expecting code review of copy-pasted C code produced by their system, so the bar is higher than generated assembly that almost no one will ever look at.

TypeScript is an interesting example, the generated JS is pretty close to the original TS, largely just with types removed, so for someone with lots of JS experience it is easy to get confidence in the compiler as the output is pretty close to what it would have been in the first place.

Not exactly the same thing: TS doesn’t really generate anything. CoffeeScript, or better, ReasonML compiled to JS would be a better comparison. The TS/Flow design goal was specifically that if you replace the type annotations with whitespace, what remains is precisely normal JS. There are no static or dynamic/runtime transforms.

Off-topic but somewhat related: BuckleScript transpiles to very readable and understandable Javascript.

Agreed, if you're going to be editing the output, then it must be readable.

There's a social/business issue here as well.

Suppose you're selling a code generator. The customer using that code generator is going to ask "what happens if you disappear? How do we maintain our code?" This drives them to demand that the code you generate can continue to be maintained on its own, even if the code generator rots and dies. And that means it must be readable.

I've seen a case where a company has been stuck with generated code, and not only did they not have the code generator, they didn't even have the documentation for the code generator (nor anyone who ever used it). The company that sold the code generator had died many years earlier.

This same consideration doesn't apply to compilers, because you can buy compilers from many vendors, as well as get well-supported free ones.

> This same consideration doesn't apply to compilers, because you can buy compilers from many vendors, as well as get well-supported free ones.

This consideration absolutely applies to compilers for various applications (embedded, exotic platforms etc.) and also if you are relying on implementation-defined behaviour and extensions.

> I wrote the pretty-printer for our compiler, KReMLin, looking at the reference table of operator precedence in C, resulting in a minimal amount of parentheses being inserted; I happily thought I did the optimal thing, until it turned out that no one can remember the relative precedence of + and <<, or | and - – I have myself since then forgotten, and I’ve heard that it even differs across languages...

When I was a young(er) and naive(r) C programmer, I had printed out this precedence table for occasional reference while coding. I was rather proud that I knew of such subtle corner cases!

I've since learned and don't need to use that table anymore... because I just use parentheses to avoid ambiguity. Don't rely on subtle behavior in your code, folks. As the Python Mantra says: explicit is better than implicit.
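For the `+` versus `<<` case from the quote, a small sketch of the two readings. The function names are made up for illustration. In C, `a + b << c` parses as `(a + b) << c` because `+` binds tighter than `<<`, which is not what many readers expect.

```c
/* What the compiler does with the unparenthesized a + b << c. */
unsigned add_then_shift(unsigned a, unsigned b, unsigned c)
{
    return (a + b) << c;
}

/* What many readers assume it means. */
unsigned shift_then_add(unsigned a, unsigned b, unsigned c)
{
    return a + (b << c);
}
```

With a = 1, b = 2, c = 3 the first gives 24 and the second gives 17, so emitting the "redundant" parentheses costs nothing and removes the guesswork.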

Same here. I made my C code easier to follow and safer by following a couple of rules. One of them was being explicit and not playing operator tricks to see how many side effects one can cram into one line.

Readable C code generation is the future. Read my previous comments and you'll find I've been advocating this for many years. I mainly advocated Python/C two-language programming, but using F* instead of Python (or Scheme) is a secondary issue (though an important one in terms of static vs dynamic typing).

A future that Eiffel was one of the first languages to embrace.

One of the reasons why Pre-Scheme was so lovely to use is because when it transpiled to C, the C was highly readable. You could follow the output of the Pre-Scheme compiler and know exactly what it was doing, provided you knew idiomatic C.

From a readability standpoint, the C output of Gambit and Chicken is a hot mess in comparison.

Icon's compiler used to generate somewhat readable C output. It was still the result of continuation passing style conversion (CPS) though, so it wasn't exactly easy to follow. Nowadays Icon only has an interpreter :(

If you want something like Icon or a Lisp or Scheme to generate readable C, you're going to have to use Simon Tatham's C co-routine scheme [0] (used by PuTTY) so that you can keep code looking sequential. You'd have to generalize the co-routine thing so it's not so much about co-routines but about closures (the two concepts are remarkably related). If you allocate call frames on the stack, then you get depth-first search and backtracking support, and closures of dynamic extent. If you allocate call frames on the heap then you get breadth-first search and backtracking, closures of indefinite extent, and co-routines.

(PuTTY has all of the SSHv1 and SSHv2 protocol up to the end of authentication coded as one enormous function each, and this reads surprisingly well in spite of being such enormous functions. That works for PuTTY because those protocols are extremely sequential up to the end of authentication. Think of Simon Tatham's co-routines as an await for C.)

[0] https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html
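For readers who haven't seen the trick, here is a minimal sketch in the spirit of the article linked at [0]. The macro and function names are my own, not the ones PuTTY or the article use, and the state lives in a caller-supplied struct rather than in static variables.

```c
/* State that must survive between calls lives here, not in locals. */
struct counter {
    int state;   /* 0 = not started, -1 = finished, else a resume point */
    int i;
    int limit;
};

/* Illustrative macros: a switch on the saved state jumps back to the
   case label planted at the last yield. */
#define CR_BEGIN(ctx)        switch ((ctx)->state) { case 0:
#define CR_YIELD(ctx, value) do { (ctx)->state = __LINE__; return (value); \
                                  case __LINE__:; } while (0)
#define CR_END(ctx)          } (ctx)->state = -1

/* Reads like a plain loop, but returns one value per call. */
int next_value(struct counter *c)
{
    CR_BEGIN(c);
    for (c->i = 0; c->i < c->limit; c->i++) {
        CR_YIELD(c, c->i);
    }
    CR_END(c);
    return -1;   /* exhausted */
}
```

Starting from a zeroed struct with limit set to 3, successive calls return 0, 1, 2 and then -1. The body stays sequential, which is the property that matters for readable generated code.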

I came to the same conclusions with my own compiler from perl to C. A lot of work went into generating perfectly readable and formatted C code, with all the bells and whistles and ifdefs for debugging, tracing, Config and architecture optimizations. The good thing is that it's much easier to do in the compiler than in the resulting C code. We generate a lot of C code now, automatically, for perfect hash tables, optimized unicode tables, generating APIs and exhaustive test cases and much more. It must be pretty and usable; gambit-c is a good example of throwaway code nobody wants to debug or read through.

Indeed. If you want to support full Scheme semantics, with continuations and all the crazy control structures, you really need to contort the C language into a form that's pretty hard to grok. That's the sacrifice Gambit and Chicken made, and it's probably a worthy one given how excellent and complete those compilers are.

Pre-Scheme is intended for a different use case -- when you want the level of fine-grained control that C gives you but don't want to leave Lisp behind. Its semantics are accordingly quite different from Scheme's and it doesn't support the full set of Scheme control structures, nor implicit garbage collection.

Chicken's output is necessarily a bit weird. It uses CPS and generates output in a specific way so it can avoid a stack overflow.

See: https://en.wikipedia.org/wiki/CHICKEN_(Scheme_implementation...


Yes, and Pre-Scheme deliberately avoids implementing all of Scheme in order to emit straightforward C.

Gambit and Chicken are fine programs -- close to best of breed in the Scheme->C transpiler space. Pre-Scheme was intended to map straightforwardly onto C semantics so that applications like the one for which it was designed -- implementing the Scheme48 VM -- could be coded and tested in a Lisp environment before being transpiled to understandable C code. They sacrificed full Scheme semantics for this.

I still remember the first C++ compiler we had that compiled down to C. The code was a total nightmare. I first had to debug the C code and then after finding the problem I had to somehow divine where in the C++ code it came from.

>>> Going to C is what allows people to use our code without having to buy into exotic, strange languages with lambdas.

At this stage is there any software engineer who considers a language with lambdas strange or this is a joke from the author?

Given their field, it is safe to assume this is a joke.

It's funny/depressing that the only bits of C which were written by hand, not converted from another language, had both memory corruption and undefined behavior.

>> Alas, this extra precision was not appreciated by reviewers, who very explicitly requested that variables named uu__123456 be eliminated whenever possible.

It's been a while but I used to use Simulink to generate C code and they had the same naming problems. It was hideous. They also solved the C99 types issue independently and generated their own target dependent header files that defined INT16_T in a target dependent way. I asked a couple guys from that company to please just implement a C99 target which is more portable - best if it actually produced code using the C99 types instead of renaming them.
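What I was asking for looks roughly like this: generated code that uses the standard `<stdint.h>` names directly instead of a per-target header that redefines them. The function below is an invented example, not actual Simulink output.

```c
#include <stdint.h>

/* Hypothetical generated helper: saturating add on 16-bit samples,
   written against the C99 fixed-width types. */
int16_t saturating_add_i16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

Any C99 toolchain that provides `int16_t` compiles this unchanged, so no target-dependent type header is needed.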

So many issues with auto-generated code, but sometimes it's really useful.

I found it a little surprising that they initially thought using recursion would be acceptable for this sort of application. Esp since in the Mozilla codebase many threads run with quite small stacks.
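To illustrate the concern (this is an invented example, not code from the article): a functional-style recursive definition uses one stack frame per element unless the compiler happens to eliminate the tail call, which C does not guarantee, while the loop form uses constant stack.

```c
#include <stddef.h>
#include <stdint.h>

/* Functional-style version: stack depth grows with len unless the
   compiler optimizes the tail call away. */
uint32_t sum_recursive(const uint8_t *buf, size_t len, uint32_t acc)
{
    if (len == 0)
        return acc;
    return sum_recursive(buf + 1, len - 1, acc + buf[0]);
}

/* Loop version: same result, constant stack usage. */
uint32_t sum_loop(const uint8_t *buf, size_t len)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc += buf[i];
    return acc;
}
```

On a thread with a small stack, the first form can overflow on large inputs in an unoptimized build, so a generator targeting such a codebase would want to emit the second.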

Another benefit of generating good output is that someone can observe the input and the resulting constructed idioms and learn from the example

Mozilla adopted this and runs a Docker build command in their own CI, which errors out if someone tries to modify the generated code instead of fixing the F* source file

Ok, so all the talk about generating idiomatic C went for nothing in the end?

It still has to be readable.

Sounds like mission impossible

(I try to avoid low-content comments here, but I'm LMAO at the fullscreen close-up of sir's forehead. Awesome article and project though!)

At least it's responsive :D

Putting the important content on top: TODO.

Derp: I revisited the page today and the graphic is a tidy part of the header now. It's still 2,838px × 2,789px though... I dunno if it was my browser or what; I was seeing the picture scaled to the width of the window which resulted in full-screen forehead. Scrolling up was like a Terry Gilliam cartoon.
