C2Rust Transpiler (github.com/immunant)
119 points by Aissen on Oct 23, 2022 | 63 comments



C translates quite readily to D. I've been able to translate thousands of lines at a time in less than an hour, usually with some global search & replace and then making adjustments after running it through the D compiler. We relied on being able to do this in the D community for quite a while. There have also been three translators built, with more or less effectiveness. It is nice to get the code into D, and then take advantage of D's safety features.

The fundamental problem with translation, followed by some hand tweaking, is that it only works if the C version is to be abandoned. If the C code is maintained by anyone else, as soon as they make changes, the translation gets out of date. Updating the translation turns out to be impractical because of the hand tweaking necessary.

Then there are some frustrating structural limits. The largest is that C doesn't have modules. The preprocessor puts everything into one file, and every C compilation is for one file. Declarations get duplicated across every translation unit. Somehow, these need to get teased apart into modules. This structural redo gets done by hand, and requires pretty good familiarity with the C code's design.

The preprocessor poses another major problem. The preprocessor language and the core C compiler have no knowledge of each other. They are completely separate languages, with their own syntax, keywords, semantics, etc. The preprocessor, aside from trivial use of it, simply does not translate into other languages. I also have yet to find a C programmer who could resist using the preprocessor as a metaprogramming language, which does a great job of obstructing all efforts at converting to another language.
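
To make that concrete, here's a minimal sketch (not from any real codebase) of the classic "X macro" idiom, where a single preprocessor list expands into an enum, a parallel string table, and whatever else. The list only exists at preprocessing time, so a mechanical translator has nothing to map it onto:

    /* One list, expanded twice with different X definitions. */
    #define COLORS(X) \
        X(RED)        \
        X(GREEN)      \
        X(BLUE)

    #define AS_ENUM(name)   COLOR_##name,
    #define AS_STRING(name) #name,

    enum Color { COLORS(AS_ENUM) COLOR_COUNT };

    static const char *color_names[] = { COLORS(AS_STRING) };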

All this stuff raises a lot of friction for D interacting with C code. Programmers don't like friction: they don't want to deal with C code they are unfamiliar with, they don't want to fold maintenance changes to the C code into the translation, etc. They want it to "just work".

The eventual solution I came up with is obvious, but I'd always dismissed it as impractical. Just fix the D compiler to be able to compile C code directly, and internally make the C declarations and constructs available to D code. This turned out to be fairly easy to do, and is ridiculously effective. It sometimes works even better than C++'s ability to #include C code (C++ doesn't support things like _Generic, old style C declarations, etc.). All you have to do is import .c code just like importing any D module, and the D compiler takes care of all the dirty work for you.

It isn't perfect; for example, C compilers have lots of extensions, and dealing with all of them is hopeless. But we just do the common ones, as it turns out most of the others are rarely used.


Zig also takes this approach, and even exposes its C compiler (which if I recall correctly is basically Clang plus diverse sysroots and other customisation out of the box) as a separate `zig cc`.

I do a lot of work in Rust, and cross-compilation can be a pain when you have a lot of C dependencies. Fortunately https://github.com/messense/cargo-zigbuild exists. It sounds crazy, but using Zig's inbuilt C compiler to help build my Rust projects has been the smoothest option I've found.

I can't help but wonder if it would be worth it for Rust to follow D and Zig by shipping its own inbuilt C compiler, even if they still want to also support external C toolchains. It should be roughly the same effort as it was for Zig, given that they both use LLVM.


D can compile and link C programs with:

    dmd hello.c
C and D code can be mixed with:

    dmd mars.d pluto.c
C code can be imported by D code:

    import stdio;  // looks for stdio.d, stdio.h, stdio.c in that order

    void main() { printf("using C printf from D!"); }
It keys off of the file extension.

Amusingly, C code can also import D code:

    ----- D file ----
    int square(int x) { return x * x; }

    ---- C file ----
    __import square;

    int test() { return square(3); }
closing the circle, enabling D libraries to be written and accessed by C.


> Zig also takes this approach

D took the much more fun way, which is to implement a new C parser from scratch and tweak the D lexer and semantics to handle the differences of C. It's not too bad, about 5000 lines of D:

https://github.com/dlang/dmd/blob/master/compiler/src/dmd/cp...


> The preprocessor poses another major problem. The preprocessor language and the core C compiler have no knowledge of each other. They are completely separate languages, with their own syntax, keywords, semantics, etc.

"I wrote my program in C."

No, you wrote your program in a custom language that only you (at most) understand, and you gave the file a .c extension.


Their blogpost about translating Quake 3 was interesting: https://immunant.com/blog/2020/01/quake3/


FWIW, the V folks demonstrate translating DOOM source to V and compiling it:

https://youtu.be/6oXrz3oRoEg
https://github.com/vlang/doom


How much memory does it leak at startup?


It looks better to actually use the language one is speaking about. Vlang went into beta (0.3), so it does not leak memory.

It uses a GC by default. The plan was always for Vlang to have optional memory management: GC, autofree, or manual. And since it is a language still going through its developmental phases, it could be argued that it deserves some leeway.


A potential use case I see is for security auditing. Even if you cannot port an existing C codebase to Rust, you could run this tool to examine the unsafe hotspots. Any place where the translation has to rely upon unsafe is a region of the code more likely to contain the mistakes Rust is designed to prevent. Of course, this presupposes that 90% of the translation does not have to lean on the unsafe annotation.


I suspect that this relies on unsafe pretty much everywhere. Even handling argc and argv in your C main function, in idiomatic ways, is unsafe.

There is no 80/20 rule for C unsafety, other than maybe an inverted one: 80% of the unsafety of a large C program might be spread into 80% or more of the code. :)
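
To illustrate with a hedged sketch: even this perfectly idiomatic argv loop rests entirely on unchecked invariants.

    #include <stdio.h>

    /* Correct, idiomatic -- and "unsafe" in the Rust sense: nothing
       verifies that argv really has argc entries, that argv[argc] is
       NULL, or that each argv[i] is a NUL-terminated string. The
       programmer just has to know. */
    int main(int argc, char *argv[])
    {
        for (int i = 1; i < argc; i++)
            printf("arg %d: %s\n", i, argv[i]);
        return 0;
    }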


> There is no 80/20 rule for C unsafety,

Actually, there is; the problem is that the transpiler can't tell the difference between code that relies on unsafety for its semantics, versus code that would still work if appropriate annotations (potentially causing the function to not be callable in intended contexts), run-time checks (possibly causing code to error out on what were intended to be valid inputs), etc., were added to make it safe.

80% of the code contains 20% of the cases where safety would require deviating from the intended semantics, not just the incidental ones. (A general-purpose transpiler can't (in general) tell the difference between intended semantics and incidental ones, so it has to conservatively assume all semantics are intended, and write everything as the most-general (ie most-unsafe) interpretation.)


Any C code that performs a calculation which would silently be wrong or crash if the values were not correct (even though they are) is inherently unsafe.


That's a weird way to put it. If your function assumes some constraints on the input it gets and you give it data that violates its constraints, it's going to fail in some way. Sure, C makes it worse by making it harder to verify the assumptions and constraints, but by your definition every function that operates on sorted arrays and doesn't verify the input is sorted is inherently unsafe, regardless of the language.


> but by your definition every function that operates on sorted arrays and doesn't verify the input is sorted is inherently unsafe

Are you claiming it’s not? Verifying all input should be the baseline to call something safe, IMHO.


Sure, but that's not a meaningful definition of safety, because according to this definition writing a safe O(log n) binary search is impossible in every single mainstream language in existence.
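
To make that concrete, here is a textbook binary search sketch in C: it never reads out of bounds, so it is memory-safe, yet it silently returns wrong answers if the array isn't actually sorted - and verifying that precondition would cost O(n), destroying the O(log n) bound.

    #include <stddef.h>

    /* Returns the index of key in the sorted array a[0..n), or -1.
       Memory-safe, but garbage-in/garbage-out on unsorted input. */
    ptrdiff_t find(const int *a, size_t n, int key)
    {
        size_t lo = 0, hi = n;                /* search in [lo, hi) */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;  /* avoids overflow of lo + hi */
            if (a[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        return (lo < n && a[lo] == key) ? (ptrdiff_t)lo : -1;
    }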


> Even handling argc and argv in your C main function, in idiomatic ways, is unsafe.

Not knowing C very well, could you clarify what makes it unsafe? Thanks!


Paraphrasing a common meme, how much time do you have?

Just scratching the surface, we have:

- The language doesn't really have vectors and strings as data types; they are pointers to memory sections without any kind of protection

- All the standard library functions deemed safe, added as a mitigation for possible memory corruption, have gotchas in their use; there isn't a single one that is actually safe, especially because all of them expect the developer to never get the buffer size parameters wrong.

- Enumerations are not type-safe: they decay implicitly to integers when used in a numeric context, and any numeric value can be converted into an enumeration, even if there isn't a mapping available

- Implicit numeric conversions everywhere, and since there is no overflow/underflow checking, every single numeric operation can wrap around, or be the source for clever compiler optimizations (see the sketch after this list)

- ISO C documents around 200 cases of UB, where the compiler can take the liberty to optimize the code as it pleases

- Type casts that convert complex data types into others can be a source of surprises when moving across compilers and platforms

- Speaking of which, even if you restrict yourself to ISO C, without any compiler-specific extensions, there are behaviours that are implementation-defined, which can vary across compilers and platforms.

- Variables defined as const aren't really constant, and one can subvert their value

- There is no null checking, so whatever happens depends on the platform.
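
A tiny sketch illustrating a few of the items above - everything here compiles without complaint:

    #include <limits.h>
    #include <stdio.h>

    enum Mode { MODE_READ, MODE_WRITE };

    int main(void)
    {
        enum Mode m = 7;        /* no such enumerator, yet it compiles */
        int x = m + MODE_WRITE; /* the enum silently decays to int */

        unsigned u = -1;        /* implicit conversion: wraps to UINT_MAX */

        int big = INT_MAX;
        /* big + 1 would be signed overflow: undefined behavior, which
           the compiler is free to assume never happens. */

        printf("%d %u %d\n", x, u, big);
        return 0;
    }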

This is just a short overview; open the man page for GCC or clang and go through the list of all the warnings you can enable to try to write safer code, especially those enabled via -Wall and -Wextra.

All the above flaws are also present in Objective-C, Objective-C++ and C++, due to their copy-paste compatibility with C (yes C++ isn't 100% compatible).


IIRC, in C++ at least, mutating an object that is originally const (whether it's a variable declared as such, or a heap object created with "new const ..."), is UB regardless of how you do it - pointer casts etc.


Mutating an object that is originally const is UB regardless of the programming language, anyway.

Compilers have the freedom to place such data into read-only memory segments: in the case of static data; in the heap (marking the pages read-only for safety); and in embedded systems, static const data can even be mapped into ROM chips. So who knows what happens if the compiler has decided to implement const data that way, or what the linker scripts are doing.
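
A minimal sketch of the consequence: the cast below is legal C and compiles cleanly, but the write is UB, and if the compiler placed the object in a read-only segment it will typically die with a segfault.

    static const char msg[] = "hello";

    int main(void)
    {
        char *p = (char *)msg;  /* casting away const is itself legal... */
        p[0] = 'H';             /* ...but modifying a const object is UB;
                                   typically crashes if msg is in .rodata */
        return 0;
    }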


That's still ultimately up to the language - if it does say that constness is always legal to cast away, then compilers can't put const stuff into rodata.

There's also stuff that obviously can't end up in rodata, such as const locals that are initialized from arguments or I/O. Sometimes people assume that this means that casting away constness is legal in such cases.


The word unsafe has a specific meaning in Rust. It doesn't mean every C program that uses argc and argv is unsafe. In this specific case however I don't think it would actually require much unsafe. The only unsafe thing I'd introduce is a way of casting the *argv[] to a type that safely deals with null terminated strings. Maybe such a type is already in Rust's standard library and I wouldn't even need that.

edit: eh sorry, I wasn't thinking straight; you of course need unsafe to cast the argv itself to a type that has a separate argc as well. Assuming such a type is available - if it's not, its implementation would also have unsafe all over the place.

Maybe to answer the underlying question: what makes it unsafe is that in C it is assumed the programmer knows to keep all indexes into argv under argc. In Rust such an assumption must be made explicit by specifying "unsafe". It is idiomatic Rust to have all instances of "unsafe" in libraries whose implementation is vetted by the community, so ideally there are few to no instances of "unsafe" in the application logic itself. Rust's compiler and type system have various tricks that reduce the amount of "unsafe" you would think you'd need for even quite complex problems.


Not every C program that processes argv gets it wrong. They are all unsafe. Unsafe != wrong.


"safe" and "unsafe" in rust are well defined, but in my opinion it's very confusing and I wish they used different terms.

In Rust, unsafe means accessing memory that's already freed or unallocated, or things like that. You can look it up for the full definition.

I think the comment you replied to mistakenly used the term "unsafe" (that's part of the reason I dislike the term; it can mean multiple things). In the Rust context though, it isn't unsafe to index an array out of bounds. I.e. if argc=10 and you access argv[99], that will crash your program but isn't considered "unsafe".


You'd have to access the stack frame above `main` and then treat some of the bytes within that frame as your env. This means forging pointers/bounds based on the inputs. `execve` basically sets that up for you but Rust doesn't know about that.

Then if you wanted to handle dynamically set environment variables you'll need to call into your libc implementation, which crosses an ffi boundary, which means Rust doesn't know what that code is doing and therefore it requires `unsafe`.

edit: Question for others - is main a separate stackframe? I actually don't recall.


For instance, it means that an expression like argv[i], even though correct, could be wrong in a way that won't be diagnosed. Code is "unsafe" to the extent that its predictable behavior depends only on the programmer.


C arrays are just sugar for pointer arithmetic. [] just calculates the sum and dereferences the result.

arr[n] == *(arr + n) == n[arr]

All these forms are valid C and gcc will happily compile them all without complaining.
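
A compilable version of the equivalence, for anyone who hasn't seen the n[arr] form before:

    #include <stdio.h>

    int main(void)
    {
        int arr[] = { 10, 20, 30 };
        /* all three are the same pointer arithmetic; prints "20 20 20" */
        printf("%d %d %d\n", arr[1], *(arr + 1), 1[arr]);
        return 0;
    }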


You’re right, but they’re hoping to improve upon that.


Since 90% (a wild guesstimate) of C code is pointers, I suspect this is hopeless.

I've translated a lot of C code to D, and manually converting `*` to `ref` (D's safe pointers), and converting to slices, cleans up most of the C code nicely and you get buffer overflow checks for free.


This only produces unsafe code. Every translated function has the unsafe keyword. It's up to the programmer to clean it up afterwards.


I think a demo of the transpiler output for a short function would make a great addition to the readme.


You can try it out on the main website https://c2rust.com where they have a web version. Unfortunately it isn't working (HTTP 503 error)


It's back up now, sorry about that!


Classic Hackernews hug of death.


Nah I'm pretty sure it's just broken. I took a look at it a week ish ago and it was down too


Can confirm it is broken. With a little luck, it should be back up and running early next week.


Works for me


Did you press Translate?


No :)


Pressing "Translate" appears to do nothing.


How does this compare to Corrode? The trouble with these things is that the Rust that comes out is usually too awful to maintain. Corrode, too, said that someday they'd generate more reasonable Rust. But that never happened. Converting C into Rust with unsafe raw pointers is not all that useful.

What's needed is some way to provide key information C doesn't have. Mostly about array sizes. Some way to annotate

   int read(int fd, void* buf, size_t len)
to tell the system that buf has size len.

A file of translation hints with such info could guide the translator into producing decent Rust. Most of the things done with pointer arithmetic can be expressed with slices. (Things being done with pointer arithmetic which can't be expressed as slices should be viewed with deep suspicion.) But you need size info to do that.
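
For illustration, a minimal sketch (the names are made up) of what that size information buys: a slice is just a (pointer, length) pair, and once the translator knows buf's length is len, every access can be bounds-checked.

    #include <assert.h>
    #include <stddef.h>

    /* A slice in C terms: the pointer and its length travel together. */
    typedef struct {
        unsigned char *ptr;
        size_t len;
    } byte_slice;

    static unsigned char slice_get(byte_slice s, size_t i)
    {
        assert(i < s.len);  /* a Rust slice would panic here instead */
        return s.ptr[i];
    }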


The short form is that Corrode is effectively deprecated in favor of c2rust. Indeed, Corrode hasn't been updated since 2017, while c2rust still gets active development—last commit as of my writing this was 2 days ago.

It's worth noting that the developer of Corrode was consulted on the early design of c2rust, which means c2rust was able to benefit from hindsight on architectural decisions in Corrode. That ended up leading to a bit of a messy history between the two (c.f. https://jamey.thesharps.us/2018/06/30/c2rust-vs-corrode/ with HN discussion https://news.ycombinator.com/item?id=17436371 —although I believe that after that blog post the c2rust developers did end up acknowledging their inspiration and apologized for not doing so earlier.)


The goal of C2rust is not to produce good maintainable Rust.

It’s to produce buildable rust which exactly matches the original code, which you can then migrate to proper rust.

So your query is really in the “not even wrong” category.


> which you can then migrate to proper rust.

Which means you have to manually work on that awful code that comes out. In the chart at [1], this step is represented by a magic wand.

(I wanted to give some examples, but https://c2rust.com/ seems to not be translating today.)

[1] https://c2rust.com/manual/


> What's needed is some way to provide key information C doesn't have. Mostly about array sizes.

You also want to know which way the data flows (i.e. is buf read from, written to, or both). And then you end up with something like this:

https://learn.microsoft.com/en-us/cpp/code-quality/understan...
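
For the read example upthread, the SAL version would look roughly like this (Windows-specific; my_read is just an illustrative name):

    #include <sal.h>
    #include <stddef.h>

    /* The annotation states both direction and size: buf is written by
       the callee and has room for len bytes, which a static analyzer
       can check at every call site. */
    int my_read(int fd, _Out_writes_bytes_(len) void *buf, size_t len);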


I proposed an extension to C which adds slices:

https://www.digitalmars.com/articles/C-biggest-mistake.html


Me too.[1]

But it would have meant years of work on language politics.

It might be worth looking at this sort of thing again, because machine learning is far enough along that recognizing and converting the usual array idioms is feasible. If the output code with array bounds is run time checked, then errors in translation will result in detected array bounds errors.

[1] http://animats.com/papers/languages/safearraysforc43.pdf


Yes, Windows has been doing that with SAL annotations for years.


I would be interested to see performance numbers for the C version and the transpiled Rust version of some program.


It's like compiling C with clang.


Really cool concept! (Rust is my favorite language for tinkering; I haven't touched C since I was in school.)

What would really help is success stories: Who's used it? What have they used it for? What challenges did they encounter? Then again, maybe this is so new that there aren't a lot of success stories yet. :)


I used it with great success for transpiling libyaml from C to Rust. I even set up Miri to run the upstream library's entire transpiled test suite, and the fact that it passes is validation of the absence of UB in the original C code.

The transpiled library now serves as the YAML backend for the widely used serde_yaml crate. Having serde_yaml be pure-Rust code instead of linking C is advantageous for painless cross-compilation as well as making downstream projects runnable in Miri.

https://github.com/dtolnay/unsafe-libyaml


A few questions, if I may:

Is the intent that this continues to evolve by newer libyaml code getting transpiled, or that it's effectively a fork and might gradually become more idiomatic Rust which does the same job as the C but won't track any changes? Or is this basically "done" and only small changes (to both this code and the C libyaml) are anticipated anyway?

The c2rust documentation cautions that any platform independence isn't preserved, so if libyaml has types which are different on platform A versus platform B, the c2rust transpilation on platform A just gives you Rust types for platform A, losing that independence. Was this an issue for libyaml?


Very cool. Miri is seriously awesome.


From what I remember it was primarily funded by a DoD project of some sort. There probably isn't a lot of info about that. There have been a few conference talks or blogposts about it (I'm trying to remember) that talked about the process of using it.


It’s actually funded by DARPA, so basically everything in this project is publicly available.


Sampling some directories and files in the test suite of this project, I see a problem: testing is done by translating C to Rust, compiling it, and then testing the run-time behavior of the result. I don't see test cases which cover the behavior of the translator directly - e.g., that a certain C language input maps to a certain Rust output.


I'd argue that these tests are more robust to changes in Rust and changes in C2Rust that change the output in trivial ways. I don't see how you could maintain the sort of test suite you're describing in a project like this. If you made a change that changed the Rust output, you'd invalidate huge parts of your test suite and generate lots of noisy failures. It'd make it far too expensive to introduce all but the most critical changes.

We don't care what Rust gets generated; we care that the Rust which is generated has the correct behavior. Testing that is where the value is.


> If you made a change that changed the Rust output, you'd invalidate huge parts of your test suite and generate lots of noisy failures.

If that change was unintended, you'd be thanking yourself for unit tests.

The unit test suite doesn't have to be all that large. It's the behavioral test suite which has to be large in order to generate confidence.

> It'd make it far too expensive to introduce all but the most critical changes.

You can easily have diffs between the expected output of those cases and the new output.

You can review those and merge them, which is time-consuming work, but of great value. You can spot bugs in the review, like whoa, this thing is now being translated in a bad way.

If the project is in a state of flux, the new expected outputs can be more or less blindly merged; still better than nothing, and there is a record that can be revisited. Ah, that test case is actually confirming the wrong thing, which was right previous to this commit when the output changed.


My experience with things like this has been that when you write tests like that, you end up writing many of them (I've found you end up writing more unit tests than end-to-end tests - I know you asserted the opposite, but I'm just not sure why you think that; I'd be interested to discuss further). That isn't necessarily a problem, but the tests are brittle: so many of them break for any given change that it's very difficult to tell which of the breakages are real, and it's quite easy to overlook one and just copy over the new output. If you have 100 breaking tests and only 1 of them is important, it's really easy not to find it. For a project this large and complex, you could easily break that many tests, were the test suite to be brittle.

So let's say we did introduce some unintended code in our output. Is this a problem? It's a problem if it changes the behavior of our program. If it doesn't change the behavior - it isn't a problem, unless proven otherwise (eg, it ends up producing something which is correct but confusing, and we get feedback from our customers).

If it changes the behavior of our program, there exists an end to end test which will catch it. It may or may not be part of our test suite, but that's the case with every test suite; you do your best to anticipate errors, and when you discover you missed something, you add it.

At my last position we had CYA tests which made naive assertions about the output of machine learning models. Every time we shipped a new model it'd break a bunch of tests. I'm not aware of any bugs that were caught this way but we did ship models with bugs several times. I tried to replace them, wherever I could, with a test that took some sort of measurement which was more stable. When those failed, they were worth investigating.

Whether a C program maintains its behavior when compiled to Rust is the perfect thing to measure. It captures the definition of a bug in a way which is invariant across versions.

(Note: I do support writing unit tests generally, this is an exceptional case where there are lots of moving parts, not all of which we can control, and an absolutely massive surface to test; in cases like this, you need to create a defensible position, optimizing for fewest issues impacting customers and not eliminating all issues, because that isn't an attainable goal.)


This tool is there to automate the boring parts of converting codebases. The real work is verifying the parts that rely on undefined behavior.

High fidelity and reliability would be a concern if the transpiler were used to regularly sync between C and Rust versions of the same codebase. It's less of a problem for one-off efforts. Those will usually be heavily tested and inspected before the result is trusted.

Edit: it would of course be nice to see proper quality control applied, but prototypes have to prove quickly that they are worth the time spent refining them.


A source-to-source translator is a textbook example of something that is slam-dunk unit testable. There is almost no valid argument against doing it.

Note: maybe this project has it in there; I haven't exhaustively looked into every subdirectory.

> if the transpiler is used to regularly sync between C and Rust versions of a same codebase

How do you know it won't be used that way? Because the maintainers of every C codebase will stop what they are doing in C, and follow the Rust conversion as soon as they hear about it?

Even if you use this tool to permanently cut some code base over to Rust, it would be nice for there to be some assurance about what it's doing beyond just "the converted code seems to do the same thing". A conversion could be done two or more times even if the C code isn't changing. Say you do the conversion. Then hack on the converted code. A new version of the converter comes out claiming to fix bugs. You might want to re-run it on the original code again, see if anything changed, and merge those changes into the current code stream that already contains modifications.

> The real work is verifying the parts that rely on undefined behavior.

That is neither here nor there. A construct that is confirmed undefined in C can be translated to a call to a Rust function that makes demons fly out of your nose. And there can be a couple of unit tests confirming this translation strategy.

Or else, something else can be done. E.g. let's say the C code relies on wraparound two's complement arithmetic. The translator can oblige and generate code which makes that work (making that translator more helpful than some modern C compilers).
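
A minimal sketch of that last strategy: signed overflow is UB in ISO C, but a translator that knows the code wants two's-complement wraparound can do the arithmetic in unsigned, where wraparound is fully defined.

    /* Wraparound addition made explicit: unsigned arithmetic wraps by
       definition, and converting the result back to int is merely
       implementation-defined (two's complement on mainstream targets)
       rather than undefined. */
    int wrapping_add(int a, int b)
    {
        return (int)((unsigned)a + (unsigned)b);
    }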


Tests could verify that the transpiler correctly maps certain C constructs, but that almost doesn't matter. There is a fair chance that the translation won't work anyways if there's too much C sorcery and undefined behavior involved.

The idea of continuous two-way synchronisation between codebases is migraine-inducing to begin with. Even though Rust can probably be transpiled to C with way less risk of losing fidelity. But I wouldn't be so sure that this always works out, the more `unsafe` blocks the Rust version requires. C compilers are not designed to minimize UB after all, and the results can be very surprising.

Yes, I hope people are sane enough to eventually commit to a somewhat tidied-up Rust version and to only keep the C version around to conduct software archeology. Of course, this will produce a hard fork of the codebase, and the usual political reasons specific to such efforts apply.



