Rust: The New LLVM (willcrichton.net)
207 points by wcrichton on July 23, 2016 | 121 comments

Counterpoints as someone considering writing a language that targets LLVM, and as someone who has used Rust a bunch:

- Your compile times will be a lot slower than they would otherwise be. This is a big one for me.

- If your type system is materially different than Rust's, you don't gain any benefits since you have to write your own type checker anyways.

- LLVM is designed to be easy to machine generate, including having a library API for generating code. Rust doesn't yet.

- You can't necessarily re-use Cargo if your language requires type info to link that wouldn't be in the Rust source files generated.

- If you have semantic differences in memory model you'll need to give up the advantage of Rust's safety by having your code generate unsafe pointers, for example if you want to have shared mutability in a single-threaded context, which Rust doesn't handle nicely.

- If you do re-use Rust's type checker the error messages you get out will be terrible and point into generated Rust code.

- Debug info and debugging get even harder.

- If you want Rust interoperability you have to replicate Rust's enormous type system in your language's FFI, which is hard if you also have your own type checker, see above.

Basically it is a good idea, if your language has semantics and type system very close to Rust.

A much better idea IMO is to target LLVM and then write an LLVM plugin that can link to Golang object files and generate Golang struct metadata. Golang's FFI and type system are much much simpler and more likely to be able to graft onto new languages. Also Go's library ecosystem is larger.

Thanks for the good counterpoints. Some thoughts:

- When you say "materially different," how different are we talking? I'll admit that if you want to implement Haskell (or anything with HKT) that it's a long shot, but if you want to implement Javascript, it shouldn't be that hard, yes?

- Can you give an example of when you need type info to link?

- RE: debug info and error messages, do you disagree with what's written in the post? I agree this is an issue, but it doesn't seem unsolvable.

Generally, I guess I don't see how most languages have semantics terribly different from Rust. You don't need unsafe pointers for shared mutability with a single thread because you can use a RefCell (see: my language linked in the post). In my experience, most type systems for general-purpose languages have roughly the same feature set as Rust.

I buy that implementing a language completely different from Rust in semantics/types makes Rust not a good target, but for most languages we use nowadays, isn't that not the case?
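To make the RefCell point concrete, here is a minimal sketch of single-threaded shared mutability in safe Rust. The `SharedList` alias and both function names are made up for illustration; they are not taken from the post or from any real compiler. The second function shows the trade-off: the aliasing check is done at run time instead of compile time.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical runtime representation of a mutable list in the source language.
type SharedList = Rc<RefCell<Vec<i64>>>;

/// Creates two handles to the same list and mutates it through both.
fn shared_push() -> Vec<i64> {
    let a: SharedList = Rc::new(RefCell::new(Vec::new()));
    let b = Rc::clone(&a); // second alias to the same allocation

    a.borrow_mut().push(1); // the borrow is checked at run time, then released
    b.borrow_mut().push(2);

    let result = a.borrow().clone();
    result
}

/// Shows the dynamic check: a second mutable borrow taken while the first
/// is still alive is refused at run time rather than at compile time.
fn overlapping_borrow_is_rejected() -> bool {
    let cell = RefCell::new(0_i64);
    let _first = cell.borrow_mut();
    let rejected = cell.try_borrow_mut().is_err();
    rejected
}
```

Calling `shared_push()` yields a vector containing 1 and 2, and no `unsafe` is needed. The cost is the borrow flag that `RefCell` checks on every access.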

You need type info if the language has its own type system that carries more information than Rust's.

Imagine you have machine code generated by compiling a C program and now want to figure out what struct some function takes as an argument.

As you go down towards machine code you lose information that you cannot easily recover.

Solving all of these issues is also only practical if the Rust developers themselves cooperate. It seems to me the very purpose of creating a new language is to free yourself of the limits existing ones have. Building a new language on technology that is restrictive doesn't really make sense from that perspective.

Some answers:

- The type system I'm thinking of includes effect inference from Koka (http://www.rise4fun.com/koka/tutorial), is simpler in other places, and has full type inference even of function signatures.

- An example of type info to link is effects. I could do things like serializing my type structures into Rust method names and then decoding them, but that's pretty hacky.

- Error reporting isn't just mapping, it's the messages too. I don't want my language getting error messages like "modification of an immutable borrow" or "cannot pass an Rc<RefCell<MyLangInt>> to a function that takes an Rc<RefCell<Vec<MyLangInt>>>" when it doesn't look like that or have those constructs.

- RefCell has runtime overhead where every other compiled language doesn't.

- I think the type systems of most statically typed languages are different enough from Rust's that there would be an impedance mismatch.

I think this is only remotely better for dynamically typed languages, since that also solves a lot of the issues with errors since those are mostly created by your runtime functions.

How will you implement exceptions?

catch_unwind, probably? It's not supposed to be used for exceptions and if you try to use it for them there are some barriers in the way. But these are not fundamental barriers and can be solved through codegen, which is how the language is planning to work anyway.
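A rough sketch of what that codegen could lower to, assuming the program is built with the default unwinding panic strategy (with `panic = "abort"` nothing can be caught). `LangException`, `throw`, `try_catch` and `checked_div` are hypothetical names invented here, not part of any existing project.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical exception type for the source language.
#[derive(Debug, PartialEq)]
struct LangException {
    message: String,
}

/// What a generated `throw` might lower to: a panic carrying a typed payload.
fn throw(message: &str) -> ! {
    panic::panic_any(LangException { message: message.to_string() })
}

/// What a generated `try { body } catch (e) { ... }` might lower to.
fn try_catch<T>(body: impl FnOnce() -> T) -> Result<T, LangException> {
    match panic::catch_unwind(AssertUnwindSafe(body)) {
        Ok(value) => Ok(value),
        Err(payload) => match payload.downcast::<LangException>() {
            Ok(exception) => Err(*exception),
            // Not one of ours (for example an ordinary Rust panic): keep unwinding.
            Err(other) => panic::resume_unwind(other),
        },
    }
}

/// Example of generated code that can throw.
fn checked_div(a: i64, b: i64) -> i64 {
    if b == 0 {
        throw("division by zero");
    }
    a / b
}
```

`try_catch(|| checked_div(1, 0))` returns the `Err` variant carrying the exception. One of the barriers is visible right away: the default panic hook still prints a message to stderr for every throw unless the generated program replaces it.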

> You can't necessarily re-use Cargo if your language requires type info to link that wouldn't be in the Rust source files generated.

As a supporting point, TypeScript requires definition files (.d.ts) for JavaScript libraries to properly type them. Unless the language has a 1:1 correspondence to Rust types, something similar will probably need to happen.

It is pretty easy to embed LLVM into an app and use it to generate IR code and then JIT and run it. The Kaleidoscope tutorial creates a JITting repl in just a few hundred lines of code.

Embedding the Rust compiler would be pretty difficult, especially if your compiler isn't written in Rust.

I think that using Rust as a compiler target is a completely goofy idea. But perhaps Rust's MIR intermediate representation could work for other source languages.

Just to add another one:

- almost nobody has Rust installed already, and it's a ~100MB download. With a new version every six weeks or so.

Very new to Rust ecosystem, but I wondered, regarding this last point about using an LLVM plugin to generate structs, etc... would this also be a better way to do clean FFI to functions that use C-style unions? I have been struggling to see how the Rust bindings for Vulkan can be improved.

I was nodding along, but I think you lost me with the Golang object file thing. What would be the purpose of linking against Go object files? (As opposed to, say, C?)

A nice big ecosystem with both a simple ABI and type system, but with a nice safe type system that doesn't need manual memory management. Interfacing to C libraries from more modern languages is generally a huge impedance mismatch and necessitates some kind of wrapper. Also C libraries vary wildly in how they are built, if they use macros, how difficult their headers are to parse, etc...

Linking to Go libraries from a certain type of language could be super easy and the libraries would be nice enough to use to not need a wrapper layer.

I talked about that because IMO a really important part of having a usable language is an ecosystem, and if you're going to start a new language today piggybacking off of Go's sounds like a good plan.

> This is part of the inspiration for LLVM, or Low Level Virtual Machine, which is a kind of "abstract assembly" that looks vaguely like machine code but is platform-independent.

LLVM is not platform-independent.

LLVM IR has baked-in constants, ABI details, and other platform-specific things. The "LLVM IR is a compiler IR" post has a good summary,


One result of that is that the following quote is not true:

> a compiler writer can target LLVM and then have his language work across platforms (e.g. on Linux, Mac, and Windows machines)

Actually a compiler must generate different LLVM IR for each platform.

It is true that LLVM does abstract some or even most things away - CPU architecture is (mostly) handled for you, for example, which is what the author mentions - but many important things aren't.
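One small illustration of a decision the frontend has to make per platform: the width of C's `long`. The sketch below picks the LLVM integer type for whatever target this Rust program is itself compiled for. The function names are hypothetical, and a real frontend would consult the target's data layout rather than its own host.

```rust
use std::mem::size_of;
use std::os::raw::c_long;

/// Returns the LLVM integer type a frontend would need for C's `long`
/// on the platform this program is compiled for.
fn llvm_type_for_c_long() -> &'static str {
    match size_of::<c_long>() {
        4 => "i32", // e.g. 64-bit Windows and most 32-bit targets
        8 => "i64", // e.g. 64-bit Linux and macOS
        other => panic!("unexpected size for C long: {other} bytes"),
    }
}

/// Emits an LLVM IR declaration for C's `long labs(long)`.
fn declare_labs() -> String {
    let ty = llvm_type_for_c_long();
    format!("declare {ty} @labs({ty})")
}
```

The same source-level declaration becomes different IR text depending on the target, so the frontend cannot emit one IR file and reuse it everywhere.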

> as well as with other languages that use LLVM

You'll still need to make sure to use the right ABIs for those languages.

In fact, LLVM has somewhat poor abstraction for the C ABI, in that it exists in clang and not in LLVM; as a result, Rust has had to duplicate a bunch of that code. This point actually supports the author's position - targeting Rust would actually be a more portable target than LLVM!

Overall, it might make sense to compile to Rust. But probably to Rust MIR, the IR they are developing. MIR needs to mature and stabilize, but there are already signs of an ecosystem starting to form around it. It's interesting speculation, but it would be exciting to see Rust MIR be "the new LLVM."

Reading the article, I get the impression that the VM part of the LLVM name is causing some confusion.

For all those reading this that may have thought LLVM acts like a low-level version of something like the JVM (as the name suggests, I got caught out with this before too), it's not the case. LLVM is a compiler framework, the main output of which isn't bytecode for a VM but is rather machine code for the platform you're running on. There is no performance overhead from running LLVM-generated code because LLVM is not part of the final runtime.

That's very helpful. Having only read about LLVM without actually using it, your note clarified confusion I had about it.

You're welcome, ternaryoperator.

LLVM is a virtual machine, though, even in the sense that the Java VM or the Lua VM are virtual machines! Virtual machines (a.k.a. abstract machines, though technically virtual machines are a subset of the abstract machines) are simply computers that are implemented in software. When you target the LLVM, you do generate bitcode in much the same way you'd generate bitcode for the JVM (though the bitcode itself is very different).

What makes the LLVM low-level, compared to the JVM or the Lua VM for example, is that it doesn't offer the same level of abstraction that they do. For example, the JVM abstracts not only the machine, but also the operating system, the program linker/loader, and even some devices, etc. On the other hand, the LLVM essentially abstracts the CPU's instruction set, and that's about it. This is why even when generating LLVM bitcode rather than native code, you still need to worry about things like ABI, naming conventions, system calls, etc. -- because those things are separate from the physical machine's ISA.

However, just like with physical machines, I'd wager that it'd be possible to implement those abstractions on top of the LLVM, though it's likely easier and more efficient to just have a retargetable code generator.

Note that the LLVM can also operate as a JIT VM much like Java's HotSpot VM or LuaJIT's VM, and in such a mode the differences are less obvious from the outside. Similarly, both Java and Lua can be compiled to native code (in fact, in order for JIT to work, this must be the case), but going a step further, they can be compiled AOT. In a sort of "middle-ground", they can be compiled to their respective bitcode, and then further compiled into native code -- precisely the way the LLVM tends to operate.

The idea goes way back. There are plenty of examples throughout history of compilers targeting virtual (abstract) machines, and it offers several advantages. Portability is the one everyone talks about, but there can also be advantages for code density and compiler complexity, among others. For example, the ANSI/ISO/IEC standard for the C programming language defines the semantics of C in terms of the operation of an abstract machine. A handy way to specify the operational semantics of a language! [EDIT: In fact, the original purpose of abstract machines was for this kind of formal specification of semantics; it was nearly a decade before people realized the other advantages of AMs when used as an abstraction in programs themselves.]

See also: Landin's SECD machine, BCPL (O-code), UCSD Pascal (P-code), (O)Caml (CAM - categorical abstract machine), the G-machine, and many more.

I think this is a bit too keen to hew to the textbook definition of "virtual machine". In modern industry practice a "virtual machine" tends to imply "we wrote an interpreter, and to run your program you execute this interpreter and feed it your program".

In the interest of being precise, I'd rather call LLVM an "abstract" machine, in the same vein as the C abstract machine defined by the C language specification (https://blogs.msdn.microsoft.com/larryosterman/2007/05/16/th...).

> I think this is a bit too keen to hew to the textbook definition of "virtual machine".

That's interesting to me because I haven't ever read any textbook on virtual machines. If you could recommend one, I'd be very interested (especially if it covers designing an ISA that's a good fit for your language, which seems like a dark art to me; I've seen plenty of examples but no design docs or notes whatsoever).

> In modern industry practice a "virtual machine" tends to imply "we wrote an interpreter, and to run your program you execute this interpreter and feed it your program".

My experience has been exactly the opposite: people have a preference for the word interpreter. When they've written an interpreter, they call it an interpreter. When they've written a virtual machine, it's still often called an interpreter. For example, you rarely hear talk of the Python VM, Lua VM, Scheme VM, Lisp VM, etc. the way you hear talk about the JVM. I normally hear "the Python interpreter" or "the Lua interpreter", etc. I sometimes even hear these sorts of VMs referred to as bitcode or bytecode interpreters.

> In the interest of being precise, I'd rather call LLVM an "abstract" machine, in the same vein as the C abstract machine defined by the C language specification

I'd normally be with you on that one, but I referred to it as a virtual machine here for two reasons: 1) I most often hear the terms "virtual machine" and "abstract machine" used interchangeably, and 2) the LLVM -- Low-Level Virtual Machine -- is called a virtual machine by its creators. All in all, the distinction between the two isn't a large one, and only rarely an important one, so I thought it best in this case to avoid causing any more confusion than I already felt I might :)

(As an aside, I think a big problem with the computation science field -- both in academia and in industry, though perhaps more pronounced in industry -- is the sheer lack of (and sometimes even disdain for!) precision in our jargon. Abstract vs. virtual machines is, of course, a topical example, but many others come to mind: parser, DSL, interface, object, message, native, relational, lightweight, functional, and many more. Of particular annoyance to me is the cloud of meanings surrounding the word "type" and its variations, along with a great deal of modifiers typically used alongside it ("strong", "weak", etc.). It's enough to drive the more formal/rigorous/pedantic of us batshit, myself included.)

LLVM generates bytecode in an intermediary step, not as the final product. The final product is still machine code. You do not run LLVM when you run your compiled program, the bytecode is not part of the final program, it is just used to help with compiler optimisations.

To further clarify this point, LLVM's bytecode format is called LLVM IR, where IR stands for Intermediate Representation. GCC has its own intermediate representation for the same reasons LLVM does.

I'm aware, but it really doesn't make any difference. The BCPL compiler, an early example of a retargetable compiler, itself featured an early example of an IR based on a VM [0]: OCODE [1]. In "BCPL: The Language and Its Compiler", Martin Richards, the creator of BCPL, describes OCODE:

> The intermediate form for BCPL is called OCODE, and follows this pattern. ... It can be regarded as the assembly language of a simple abstract machine for BCPL.

That sounds rather similar to LLVM to me! Richards then goes on to describe a stack machine and its instruction set, giving examples of how snippets of BCPL will look after the front end has finished generating the IR assembly language. He also discusses optimization concerns in the back-end as transformations on the instruction stream, and finally native code generation from the result.

Point is, whether the VM instructions get interpreted or further translated has no bearing on the fact that they represent instructions for some machine. That doesn't interfere with the VM code's ability to serve as an intermediate language. Do note, however, that not all IRs are defined in terms of an abstract machine (last I checked, admittedly quite some time ago, GCC's was not, for example).
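A toy version of that point, as a sketch: one instruction stream for a made-up stack machine, consumed once by an interpreter and once by a translator that turns it into the text of a Rust expression. The instruction set and function names are invented for illustration and have nothing to do with OCODE's actual design.

```rust
/// Instructions for a tiny, hypothetical stack machine.
#[derive(Debug, Clone, Copy)]
enum Instr {
    Push(i64),
    Add,
    Mul,
}

/// Consumer 1: interpret the instruction stream directly.
/// Returns None if the code underflows the stack.
fn interpret(code: &[Instr]) -> Option<i64> {
    let mut stack: Vec<i64> = Vec::new();
    for instr in code {
        match *instr {
            Instr::Push(n) => stack.push(n),
            Instr::Add => {
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                stack.push(lhs + rhs);
            }
            Instr::Mul => {
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                stack.push(lhs * rhs);
            }
        }
    }
    stack.pop()
}

/// Pops two operand strings and pushes the parenthesised binary expression.
fn binary(stack: &mut Vec<String>, op: &str) -> Option<()> {
    let rhs = stack.pop()?;
    let lhs = stack.pop()?;
    stack.push(format!("({lhs} {op} {rhs})"));
    Some(())
}

/// Consumer 2: translate the same stream into Rust expression source text.
fn translate(code: &[Instr]) -> Option<String> {
    let mut stack: Vec<String> = Vec::new();
    for instr in code {
        match *instr {
            Instr::Push(n) => stack.push(n.to_string()),
            Instr::Add => binary(&mut stack, "+")?,
            Instr::Mul => binary(&mut stack, "*")?,
        }
    }
    stack.pop()
}
```

For the stream `Push(2), Push(3), Add, Push(4), Mul`, `interpret` returns 20 and `translate` returns the string `((2 + 3) * 4)`. The instructions describe the same machine either way.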

Even in the case that a VM was intended to function via bytecode interpretation, it is still possible to treat its code as an IR. For example, Google has done this on Android for some time now: when an application is installed from the app store, the system AOT-compiles the Dalvik bytecodes into native code. Of course, some machines are better-suited for native-code translation than others, but translation is always possible regardless.

[0]: I continue to follow common practice here and use "VM" as a synonym for either virtual machine or abstract machine; the distinction between the two is unimportant here. Also, I'm lazy and "VM" is fewer keystrokes. Wait, did I just obviate those savings with this footnote? Eh, who cares :)

[1]: I'd mistakenly written "O-code" in my previous comment, but looking through Martin Richards's book now, I see that he in fact called it "OCODE".

> "I continue to follow common practice here and use "VM" as a synonym for either virtual machine or abstract machine; the distinction between the two is unimportant here."

The distinction between them is important. If you want definitions that separate the two: virtual machines rely on runtime emulation of said machine, while abstract machines are machines that only exist as a specification. You may not like those definitions but they line up with common uses for these terms.

If you want to argue that intermediate representations that compilers use for optimisations can be virtual machines go ahead, but to me that waters down the definition of what a virtual machine is.

You're trying to draw a line between them, but there is no line. Virtual machines are abstract machines, but not necessarily the other way around. The most common distinction I've seen, when a distinction has been made at all, is that a virtual machine is either 1) a software simulation of a physical machine; or 2) a software simulation of an abstract machine (as is the case for the JVM, for example). More often, though, I've witnessed the terms be used interchangeably.

Case in point, in the example of OCODE, the BCPL machine does not exist only as a specification -- OCODE had both an interpreter and a multitude of native-code translators by the time the book was published -- yet Richards refers to it as an abstract machine. Frontends for the Amsterdam Compiler Kit (ACK) by Andrew Tanenbaum (once the native compiler toolchain for MINIX -- I'm not sure if that is still the case) have an IR analogous to OCODE called EM, but the papers describing it refer to it as code for a virtual machine, rather than as code for an abstract machine as Richards described it.

Your definition is not incompatible with the definition I give above. By such a definition, any abstract machine will immediately become a virtual machine as soon as someone implements the specification. In my mind, that makes the distinction even less important, as the only thing standing between the two is someone spending a few hours writing code.

I get a feeling that the two terms likely originate from two independent discoveries of the same concept, but I'm not that well-versed in their history.

> If you want to argue that intermediate representations that compilers use for optimisations can be virtual machines go ahead, but to me that waters down the definition of what a virtual machine is.

First, not every intermediate representation constitutes code for an abstract/virtual machine. There are high-level IRs where it would be a bit of a stretch to specify a machine which directly accepts that code. An example is code where the only translation has been to expand derived forms (for example, macro expansion). This code is technically in IR, but often remains at nearly the same level as the original code. Another example might be the CPS representation commonly used by Scheme compilers (among others), though I'm willing to bet an abstract machine could be described for such code. Typically, the IR intended to be fed into a code generator will be suitable for simulation by an abstract machine.

Second, the IR itself is not a virtual machine. The IR represents code for an abstract/virtual machine. This is necessarily the case because the IR must preserve the meaning of the program it represents, and the execution of the program involves the operation or simulation of some machine [0]. It is then the case that, so long as the IR is too low-level to qualify as a high-level programming language (and in some other cases regardless, see above), the IR represents code for an abstract machine, and vice versa. The LLVM IR qualifies.

[0]: This follows from the Church-Turing thesis coupled with the fact that the programs are intended to eventually be executed by machine. It then follows that any time a representation of the program is lowered into a lower-level representation, its description becomes more suitable for direct execution by some machine.

Totally fair points, it's an overgeneralization to say that LLVM gives you platform-independence for free (look at Rust's battle for Windows for some perspective!). Perhaps you could consider that another +1 for Rust is that it manages those platform differences well.

As for compiling to MIR, that's a solid point--I think the only reason you would compile to Rust instead of MIR is that it's easier as the compiler writer. As mentioned in the post, you can write code generators quickly just using quasiquotations.

> (look at Rust's battle for Windows for some perspective!)

Or in other directions, there are some instances where Windows support is _better_ than other platforms, like stack probes.

MIR will likely never be "stabilized" in the sense of providing a reliable and unchanging public interface. AFAIK, this is what HIR is for (the IR that's one step above MIR), which is how syntax extensions are intended to be stabilized.

As a long-time Rust guy, this article makes little sense to me. It is surely possible for any language to target Rust, but unless the semantics of that language are an unusually close match to Rust's then you're going to have to pull off some gymnastics. LLVM IR is actually designed to be a decent compilation target for a wider variety of languages (though I wish there was more documentation on what is and is not UB), but even LLVM IR is going to start looking like a poor choice of a target the farther your language is from C++.

As part of a recent project I had to:

- build an AST from a large and complicated DSL

- transform the AST into something that could be compiled into a DSO.

I chose to write both parts in Rust. Fwiw the semantics of the language really didn't match Rust at all (though, I suspect you weren't thinking of DSLs when you wrote your paragraph).

In the end I had all the benefits of _safe_ Rust when compiling the transformed code into the DSO.

This DSO is a critical component and can not fail. I can't imagine transforming the AST into C or LLVM - I wouldn't be able to sleep at night. I'm only human; I know I'd fall into a number of traps that humans fall into when coding in C - especially when I am _generating_ C from an AST...

Can you give an example of when language semantics are so different from Rust that it would be difficult to do this kind of thing? When I wrote this article, I was thinking major languages: Javascript, Java, Matlab, etc. and to my mind they all are suitable for compiling to Rust. Granted, I don't know the Java specification or its internals, I'm not claiming right now that everything specified by the JVM can definitely get ported to Rust, but moreso the general ideas behind the languages.

Rust has a unique design that would be rather difficult to map other languages efficiently to, unless you simply used it as a complicated assembly language, in which case why not just use LLVM?

Most obviously you cannot map Javascript or Java methods or variables directly to Rust because these languages don't have anything like the borrow checker. You could come up with a way to work around that, but then why target Rust at all?

I think there's a deeper problem with your proposal though, which is - why? I know you try and answer that in the article but I found it unconvincing. The JVM was explicitly designed to solve the problem you're trying to solve, as you clearly know, so a good starting point would be to much more thoroughly analyse why you aren't satisfied with it.

You say you have to "shoulder the runtime overhead of a garbage collector" but then name lots of languages you want to compile to Rust, all of which assume and require the presence of a garbage collector. Then later you talk about needing a good GC written in Rust. Well ... why? If you're gonna have to provide a GC anyway for practically every modern language, then the runtime overhead of this will be paid no matter what.

Having studied the JVM quite extensively, I don't see much in it that's obviously wrong or could be done wildly better, but I see a lot that's done right. If you re-walk that path, you will probably end up reinventing the wheel.

> Rust has a unique design that would be rather difficult to map other languages efficiently to...

With regards to efficiency, a "unique design" is of little direct consequence; we are always mapping from one very different language to another, since assembly bears ultimately little resemblance to Java or Ruby or Haskell.

A language can be a bad compile target but that has more to do with runtime restrictions. For example, it's hard to build command line tools that start quickly if you target the JVM. Compacting GC can make interop with other languages hard (since objects get moved around). But Rust doesn't come with abstractions or an execution model that are any more heavy weight than C++ (which is in no way a bad language for implementing interpreters or compilers, and has even served as a compile target on occasion); in fact it is a more conservative extension of C than C++ is.

Rust does impose a burden that is kind of like saying, your compiler/interpreter has to pass Valgrind+Coverity; but this is perhaps more of a help than a hindrance -- and it doesn't limit the kind of language you can write.

> For example, it's hard to build command line tools that start quickly if you target the JVM.

Only for those that don't know what they are doing.

There are plenty of JDKs that AOT compile to native code.

Including the freely-available GCJ, a part of GCC that's not often included by default anymore.

Isn't GCJ obsolete and dead?

It is obsolete as most of its devs moved on to other stuff.

But the gcc folks keep it alive, because many of the gcj unit tests stress parts of the toolchain that other languages don't use.

So I imagine that for quick command line scripts that can live with an aged Java 1.4 API it might still work.

Okay, so how do I do this? Because if I write a simple Scala program that prints "Hello World" it can take like three seconds to start.

I think that's pretty much untrue.

That program is 0.x seconds on Scala-JVM and 0.0x seconds on Scala-Native.

Well, Scala-Native should of course be much faster.

The last time I looked at this was a while back; maybe it was 0.3 seconds and I misremembered it.

That sounds more like a Scala issue than anything else. A Java Hello World app runs in about 50 msec.

Javascript has very different semantics from Rust. It's not compiled, for one thing; it's untyped; and it's executed by a dynamic interpreter that, in the case of V8, directly compiles hot code to native assembly.
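To show what "untyped" costs when the target is statically typed, here is a sketch of how generated Rust might represent dynamic values: every value becomes a tagged enum and every operator inspects the tags at run time. The `Value` type and `add` function are assumptions for illustration only; they cover a tiny subset of JavaScript's rules and are not taken from any real JS-to-Rust compiler.

```rust
/// Hypothetical run-time value for a small dynamically typed language.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Undefined,
    Number(f64),
    Str(String),
}

impl Value {
    /// String conversion, loosely modelled on JavaScript's.
    fn to_display(&self) -> String {
        match self {
            Value::Undefined => "undefined".to_string(),
            Value::Number(n) => n.to_string(),
            Value::Str(s) => s.clone(),
        }
    }
}

/// What a generated `a + b` might call: operand types are inspected at run time.
fn add(a: &Value, b: &Value) -> Value {
    match (a, b) {
        (Value::Number(x), Value::Number(y)) => Value::Number(x + y),
        // If either side is a string, concatenate the string forms.
        (Value::Str(_), _) | (_, Value::Str(_)) => {
            Value::Str(a.to_display() + &b.to_display())
        }
        // Remaining combinations involve Undefined and yield NaN.
        _ => Value::Number(f64::NAN),
    }
}
```

Every `+` in the source program turns into a call like this, so the Rust type checker sees only `Value` everywhere and the source language's errors surface at run time.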

Counterpoint: I'm writing a language to do just that [1]. A lot of the basis from this post comes from my experiment in writing Javascript that compiles to Rust!

[1] https://github.com/willcrichton/lia

Look forward to seeing the benchmarks. Good luck!

UB as in undefined behavior?


This guy wants to use Rust as a target language for a compiler. He likes that Rust doesn't have too much run-time machinery, such as a GC or a JVM. OK, fine. Compilers have been written to target C for similar reasons.

Then he wants a "battle hardened GC", a JVM, access to the internal syntax tree of the compiler, and enough dynamism to allow an interpretive environment. What? Those are all features that Rust deliberately doesn't have. Adding them would drastically change the language and add considerable complexity. They would not contribute to Rust's main goal - provide something in which to write reliable production software.

From the feature list this guy is asking for, what he really wants is Microsoft's ".NET" system. That has all the run-time stuff he wants, and multiple compilers compile to it.

The point, which he outlines in the post, is to be able to mix high level programming with lower level languages. This is already a common pattern in the industry where you write your application in Python/Ruby and speed up the tight loops with C/C++/Rust. The lingua franca of the interop is C, which means that the boundary between the two is very stark and unsafe. You can't use advanced type systems or seamlessly share datatypes. I'm working on something similar these days (C++ and Rust though, no high level language). You lose all template/generics info at the C bindings boundary. Bindgen programs let you get it back to some degree, but forcing monomorphization and the like is not an easy problem to solve without making the compilers talk to each other (also not easy to do).
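A small sketch of what gets lost at that boundary. The generic function is ordinary Rust; the exported wrapper has to pick one concrete type and trade the slice and the `Option` for a raw pointer, a length and a sentinel value. Both function names are hypothetical.

```rust
/// Generic on the Rust side: works for any ordered, copyable element type.
fn largest<T: PartialOrd + Copy>(items: &[T]) -> Option<T> {
    let mut iter = items.iter().copied();
    let first = iter.next()?;
    Some(iter.fold(first, |best, x| if x > best { x } else { best }))
}

/// A C ABI has no notion of `T`, slices, or `Option`, so the export is a
/// single concrete instantiation. (A real export would also need a
/// `no_mangle` attribute so that C code can find the symbol.)
///
/// # Safety
/// `ptr` must point to `len` valid, initialised `i32` values.
pub unsafe extern "C" fn largest_i32(ptr: *const i32, len: usize) -> i32 {
    if ptr.is_null() || len == 0 {
        return i32::MIN; // sentinel standing in for `None`
    }
    // The caller promises the pointer/length pair is valid; the compiler
    // can no longer check it.
    let items = unsafe { std::slice::from_raw_parts(ptr, len) };
    largest(items).unwrap_or(i32::MIN)
}
```

Supporting `f64` or any other element type means writing (or generating) another wrapper by hand, which is the forced monomorphization problem in miniature.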

Will's idea seems to be to use Rust as a target for a high level GCd language, which can then (mostly) seamlessly interop with "pure" Rust code, and the interop won't have an unsafe boundary.

You should check how COM and now WinRT (aka 2nd coming of COM+ Runtime) work.

You get an OO ABI, with support for generics and OS level interoperability and reference counting as the GC algorithm.

Yes, using COM from C is a pain, but it was never intended to be used as such, rather C++, VB, Delphi and any other Windows language able to speak COM.

Yeah, I'm aware :) I don't like it as much but that's because I've always ended in situations where I need to bind via a C API, which isn't COM's fault.

You might want to read an article I wrote on a new technique that allows high and low level languages to be mixed (relatively) seamlessly:


Ooh, this is cool and interesting :)

It seems to address a similar need as the one I mentioned above, among others.

The JVM, especially with Graal and Truffle, would also be a good fit: https://medium.com/@octskyward/graal-truffle-134d8f28fb69

See my comment [1] on the recent Graal/Truffle piece for why I'm moving away from those frameworks. It just seems too cumbersome to write a compiler in Java.

[1] https://news.ycombinator.com/item?id=12123657

And yet Rust, with its famously awkward borrow checker, isn't cumbersome at all? I find this set of priorities a little odd.

But if you want less verbosity, you can write your compiler in Kotlin instead. There's no requirement that Truffle languages be written in Java, they just do it that way because Java gives closest control over the bytecode which is generated. But Kotlin generates bytecode that's very similar to what Java creates, so you could use it and get a more modern syntax.

At any rate, I can guarantee you that adjusting an existing language to be less verbose or more to your liking is a lot less effort than building an entire ecosystem of compilers that try to use Rust as their lingua-franca, given that Rust was never designed for this task.

Firstly, I think that .NET is an excellent reference here. Its platform for language integration is a good model for the kind of thing I want to achieve. Honestly, I don't use it more just because I've never used C#/ASP/any CLI language. I can't give an honest comparison between my vision and what .NET offers, but I'd appreciate it if someone more versed could comment.

As for what I want from Rust: when I ask for a GC, this doesn't have to be in the compiler. It can be implemented as a library. Access to ASTs is coming in PRs (see the links in the post). And I don't need Rust to change to get a REPL, I just need to fix the darn JIT. I'm only asking Rust to change to support a few features which greatly improve the experience of using Rust as compile target.

The original idea of .NET was to migrate the whole Windows stack to it, hence why MSIL is much more flexible than JVM bytecode, as it had to be language neutral and even support C++.

Initially Microsoft was going to make it fully native and based on COM. That was the Ext-VOS project from Microsoft Research, and the runtime was called the COM+ Runtime.

Eventually they decided to create the CLR instead and things never fully worked out that way.

However with Windows 8 they have gone back to the original idea and WinRT was born. Which is basically the same COM+ Runtime ideas, but using .NET metadata instead of COM type libraries.

On Windows, .NET follows the same idea mainframes use for language interoperability: the runtime is part of the OS.

The whole problem with GC and JIT implementations as library is that they will never be as good as the ones that work together with the compiler.

Many kinds of optimization need a helping hand from the compiler in how data structures and specific code patterns get generated.

Of course it depends on what the goals are in terms of performance, throughput and latency.

See my comment above, Rust may (probably will) get intrinsics and traits that let you write a gc that works with the compiler. Not as awesome as a GC irrevocably designed into the compiler, but good enough.

I saw it, interesting reply.

But in the end what you get is a kind of compiler plugin and not a general-purpose library, right?

No, everything will be available via intrinsics, traits, and internal plugins. (as in, autoloaded #[derive()] syntax extensions as a part of rustc that the library has no part in defining) The library will look like regular rust code, and will be loaded the regular way (extern crate my_gc). There may be a special codegen flag that rust will ask you to explicitly enable (we haven't decided on this yet) which will be something cargo can do for you. But otherwise, just a library.

In fact, so far it looks like it might be possible to even have multiple GCs active at once. Not always a good idea, but a potentially useful thing to have. Whether or not the gcs should be allowed to mix is a hard question, but it is one that GC implementors can worry about -- the type system lets them forbid this behavior. Forbid is basically the default (not forbidding the behavior is harder to express and more explicit, but doable). It's the difference between implementing a concrete trait and implementing a generic trait with specialization.

The design is ultimately a magic trait that signals what should be considered a root, and an intrinsic that lets you walk these roots. There are some other traits and stuff making this palatable, but it's all something that can be used in a library-esque way.

But that is what I mean, it is a pseudo-library because the code needs to integrate with those compiler hooks you just described.

Just like, for example, Turbo Pascal, Delphi or .NET allow us to replace the memory allocator by initializing the runtime with a user-provided library that obeys a specific ABI.

Good luck with the work.

Are you aware that these guys are using Rust for language research in VM implementation?


They just published "Rust as a Language for High Performance GC Implementation".

What do you mean by compiler hooks here? I just mean intrinsics and magic traits. sizeof() is an intrinsic. It looks like a function, but isn't. Copy is a magic trait. It works like a trait but signals additional meaning to the compiler. So is Add. Implementing Add looks like implementing any other trait, except that it conveys extra meaning to the compiler. The GC support will involve basically the same thing. In fact, that intrinsic will probably be wrapped in a slightly higher level api, so you're not even directly calling an intrinsic.
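To make the "magic trait" point concrete, here is a minimal sketch using only traits and functions that exist in std today. `Vec3` is a made-up type for illustration; nothing here is part of the proposed GC support.

```rust
use std::mem::size_of;
use std::ops::Add;

// `Vec3` is a hypothetical type used only for illustration.
// Deriving `Copy` tells the compiler that values can be duplicated bit-for-bit.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Vec3 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

// Implementing `Add` looks like any other trait impl, but it is what makes
// the `+` operator work for `Vec3`.
impl Add for Vec3 {
    type Output = Vec3;

    fn add(self, rhs: Vec3) -> Vec3 {
        Vec3 {
            x: self.x + rhs.x,
            y: self.y + rhs.y,
            z: self.z + rhs.z,
        }
    }
}

/// Adds a vector to itself. This only compiles because `Vec3` is `Copy`:
/// without it, the first use of `v` would move the value.
pub fn double(v: Vec3) -> Vec3 {
    v + v
}

/// `size_of` is called like a function, but the answer comes from the compiler.
pub fn vec3_size() -> usize {
    size_of::<Vec3>()
}
```

Neither impl runs any library code at compile time; they only tell the compiler something about the type.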

Like I said, the GC will look like a normal library and be loaded the normal way. It doesn't hook into the compiler like a plugin (no library code is run at compile time), it just makes use of magic traits and functions which signal meaning to the compiler. Such traits and functions already exist in rust for non-gc purposes and get used all the time.

Rust also has that replace-allocator support you mentioned, BTW. If anything, allocator crates are more "special" than GC crates -- GC crates look like any other crate, but allocator crates have some special annotation.

I think I read a preprint of that paper, not sure (can't read the paper from here right now). It addresses the other half of the rust GC problem -- one half is designing a safe GC interface (which is solved by rust-gc or these hooks), the other is writing a good and useful GC algorithm.

Magic traits and intrinsics are also a form of compiler hooks; without them the compiler will not have any idea what to do with the library.

In any case looking forward to it.

Right, but this is no different from implementing Add or Copy or FnMut or Placer, or calling sizeof or transmute or black_box, or using UnsafeCell. So it's not "special" in any way. I don't disagree that it's a compiler hook; that's what I called it in the first place :) I disagree that it makes the library in any way "special" or a pseudo library.

I can write a linear algebra library that gives you Vec3d and Matrix4d and the ability to multiply/add such types using regular multiplication operators. I can also write a Gc<T> library that lets you wrap some values in a GC heap allocation, which will have some extra stack maps being generated. It doesn't feel much different to me, both are equally worthy of being called libraries.

Anyway, expect a blog post soon. :)

I was not aware of this. Thanks for the link.

In fact, Rust is working on hooks for GC-as-a-library. I'll be writing a blog post on this very soon. The gist of it is that you'll get some intrinsics for rooting, and some useful traits, on which you can build a GC or write safe bindings to a GCd language. In line with Rust philosophy, this won't cost anything unless you use it.

(rust-gc already exists, but it can be far more efficient with some compiler support. I would recommend waiting for that.)

The main integration points between compilers and GCs are pointer maps and write/read barriers. I don't see a good way to do a GC-as-a-library given the need for the GC to contribute barrier code, which has to be tightly optimised. That's one reason Java/.NET use JIT compilers: there's not yet any one-size-fits-all GC algorithm, and all the best ones need tight compiler integration, so allowing you to pick your GC on the command line and then that impacts how the code is compiled represents a way to square the circle.

Pointer maps (stack maps) are basically what rust will get in its gc support. Rust has a stricter view on mutability, so GCs would be immutable-by-default like Rc is -- mutation would be very explicit and thus barriers are easy to insert. Yes, some optimization opportunities are lost ... but I'm optimistic :)
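A rough sketch of the "mutation is explicit, so barriers are easy to insert" idea, using plain `Rc` and `RefCell` from std. `Barriered` is a hypothetical wrapper invented for this example, and the write counter is only a stand-in for whatever bookkeeping a real write barrier would do; this is not how rust-gc or any planned compiler support works.

```rust
use std::cell::{Cell, RefCell, RefMut};
use std::rc::Rc;

// `Barriered` is a hypothetical wrapper, not part of Rust or rust-gc.
pub struct Barriered<T> {
    value: RefCell<T>,
    writes: Cell<usize>, // stand-in for real write-barrier bookkeeping
}

impl<T> Barriered<T> {
    pub fn new(value: T) -> Self {
        Barriered {
            value: RefCell::new(value),
            writes: Cell::new(0),
        }
    }

    /// Shared reads go through without any bookkeeping.
    pub fn get(&self) -> T
    where
        T: Clone,
    {
        self.value.borrow().clone()
    }

    /// The only way to mutate: the "barrier" runs before access is granted.
    pub fn write(&self) -> RefMut<'_, T> {
        self.writes.set(self.writes.get() + 1);
        self.value.borrow_mut()
    }

    /// Number of times the barrier has fired.
    pub fn barrier_hits(&self) -> usize {
        self.writes.get()
    }
}

/// Shares one cell between two handles and mutates it through both.
pub fn barrier_demo() -> (i32, usize) {
    let a = Rc::new(Barriered::new(1));
    let b = Rc::clone(&a);
    *a.write() += 1;
    *b.write() += 10;
    (a.get(), a.barrier_hits())
}
```

Because every mutation has to go through `write`, there is exactly one place to put the barrier, which is the point being made about immutable-by-default handles.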

Ultimately it will never be as good as what Go (or whatever) can do, but it might just be good enough for the application. For cases where you are interoperating with an existing GC or using gcs for a persistent datastructure or something, the GC isn't pervasive so perf doesn't factor in as much as in, say, Java, where everything lives on the heap and the GC has to optimize a lot of extraneous stuff. Non-pervasive usage of GCs means that the programmer is using the GC only where needed, so consequently less optimization is needed :)

What the post proposes will require a pervasive gc, however. But that GC will be pervasive in the language living on top of rust -- and that compiler/transpiler can attempt to perform optimizations. Can't exactly JIT to rust, though, so it might end up slower. Not sure by how much.

Haskell functions (pure or with side effects) simply cannot be made into anything resembling standard Rust types due to the presence of laziness.

This post oversimplifies the problem of language interoperability. Interop is not limited because of lack of a common target. Every platform/OS combination has a well-defined ABI. Most languages have facilities to use functions that conform to this ABI and most can also export functions which use the ABI. The choice of LLVM has nothing to do with it -- LLVM simply uses the ABI. The fact that many compilers use LLVM is not because a hand-rolled solution couldn't implement this ABI, but rather because LLVM is easy to use.
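As a small illustration of "export functions which use the ABI", here is what that looks like on the Rust side. `add_i32` is a made-up example function. Actually exporting it under a stable symbol name for other languages also needs a no-mangle attribute and a library crate type, which are left out of this sketch.

```rust
/// A function that uses the platform's C calling convention.
/// `add_i32` is a hypothetical name for illustration.
pub extern "C" fn add_i32(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

/// Calls the function through a C-ABI function pointer, which is roughly
/// how a foreign caller sees it: an address plus a calling convention.
pub fn call_through_pointer(x: i32, y: i32) -> i32 {
    let f: extern "C" fn(i32, i32) -> i32 = add_i32;
    f(x, y)
}
```

Note that only plain integers cross the boundary here; nothing about the ABI says how richer types from either language should map to each other.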

I don't believe interop is that simple, though. If you want, for example, PHP to interact with Python, you can't just go to the ABI level. You need code to define the translation from objects of one type in PHP to objects of another type in Python. ABI-level compatibility is good for really low level stuff, but for everything else it just doesn't work.

As for Haskell, I agree that it doesn't fit into this language vision. Both its lazy semantics and type system would be difficult to fit into the model I suggest.

Debug information, anyone? The author ignores the need for passing line numbers and storage symbols through some sort of debug dictionary.

The kindest thing I can say about this article is "simplistic".

Rust Core Team member here. I actually wrote up an RFC [1] that proposes adding JavaScript-like source map support to Rust, in order to improve our plugins-in-stable-code supports. Odds are we are going to reject it for now since I think we found a simpler way to achieve our goals, but we are certainly not opposed to adding support for it if there's demand.

[1]: https://github.com/rust-lang/rfcs/pull/1573

I only had time to give your link a quick skim, but I'm impressed that the Rust team is actually thinking about how to pass debug information through from "Rust-front" tools.

The #line directive in C is limited, at best. Debugging flex/lex/yacc/bison, for instance, is sketchy at best. But of course line numbers are only part of the problem. If someone really wants to create Spiffy New Language as a Rust-front, then SNL structures/objects/collections/whoozits will want to pass symbolic information to the debugger so that users can explore their data structures using meaningful names.

And even so, this assumes that with debug symbols on, you can turn optimization off. Debugging optimized code gets wacky very fast because associating line numbers to source is not always possible any more, and gives non-intuitive results due to code motion and inlining even when it is possible. (You haven't lived until you've seen gdb's source line pointer jump around like a weasel on crack due to code re-arrangement.)

At Rust's current level of maturity, I suspect there are bigger fish to fry than passing debug info from Rust-fronts, so rejecting ambitious debug symbol machinery is probably the right strategic decision for the time being.

> If you compile to Python, well, your program will probably run pretty slowly and you lose any static typing guarantees.

What? The static typing guarantees should be checked (statically!) by the compiler, so the choice of language you emit doesn't need to affect what kind of type system you can design your input language to accept.
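A toy sketch of that separation: the frontend type-checks its own AST, and only then emits text in an untyped target syntax (Python-style expressions here). `Expr`, `Ty`, `check`, `emit` and `compile` are all hypothetical names for a made-up mini language.

```rust
// All names here are hypothetical; this is a toy language for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Int(i64),
    Bool(bool),
    Add(Box<Expr>, Box<Expr>),
    If(Box<Expr>, Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Ty {
    Int,
    Bool,
}

/// Static check, done entirely in the frontend.
pub fn check(e: &Expr) -> Result<Ty, String> {
    match e {
        Expr::Int(_) => Ok(Ty::Int),
        Expr::Bool(_) => Ok(Ty::Bool),
        Expr::Add(a, b) => match (check(a)?, check(b)?) {
            (Ty::Int, Ty::Int) => Ok(Ty::Int),
            _ => Err("`+` needs two ints".to_string()),
        },
        Expr::If(c, t, f) => {
            if check(c)? != Ty::Bool {
                return Err("condition must be a bool".to_string());
            }
            let (tt, ft) = (check(t)?, check(f)?);
            if tt == ft {
                Ok(tt)
            } else {
                Err("branches must have the same type".to_string())
            }
        }
    }
}

/// Emission knows nothing about types; it just prints target syntax.
pub fn emit(e: &Expr) -> String {
    match e {
        Expr::Int(n) => n.to_string(),
        Expr::Bool(true) => "True".to_string(),
        Expr::Bool(false) => "False".to_string(),
        Expr::Add(a, b) => format!("({} + {})", emit(a), emit(b)),
        Expr::If(c, t, f) => format!("({} if {} else {})", emit(t), emit(c), emit(f)),
    }
}

/// Only well-typed programs ever reach the emitter.
pub fn compile(e: &Expr) -> Result<String, String> {
    check(e)?;
    Ok(emit(e))
}
```

The ill-typed program `1 + True` is rejected by `check` before any target code exists, regardless of what the target language would have done with it.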

This sounds like it is written by a person with little experience with compilers. The reason you compile to SSA is because it's easy to perform optimizations on. At some point you compile Rust to LLVM.

Yes, and Rust does the hard work of turning your code into SSA and activating all those optimizations. Why should we have to recreate all the effort of the Rust compiler devs?

And please refrain from ad hominem. It adds little to the conversation.

As others have pointed out, the difference is a sacrifice in compilation time to more quickly implement a language very semantically similar to rust. If it's not super similar to rust, you are simply jumping through more hoops than it is worth. Yes, LLVM IR is a pretty decent compilation target. And honestly, having costly abstractions provided to you would probably encourage you to generate poorer code. But I digress.

Here's a weird bit of C++ code:

    std::vector<int> foo = ...;
    for (iter = foo.begin(); iter != foo.end(); ++iter) {
        *iter += 1;
        foo[0] += *iter;
    }

A naive translation to Rust would run afoul of the borrow checker. If we are to compile C++ to Rust, how are we to address this? Should we simply generate unsafe Rust?

I don't think the point is to translate existing languages to rust, though the author seems to have tried it successfully with js (and maybe others?).

The point is to translate new, generally higher level languages to rust. These won't fall afoul of the borrow checker if handled right. I suspect Rust will not have a major restricting effect on their design either. A lot of the borrow checker issues can be papered over with sufficient codegen. When I was working on rust-gc, I often remarked that if we wrote some codegen plugins that convert all types to `Gc<RefCell<Trait>>`, we're effectively Java ;) (Most of this papering over can be done in a way that avoids runtime cost, though not this specific simplistic example for Java)
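For anyone who hasn't seen the pattern, here is roughly what "everything is a `Gc<RefCell<Trait>>`" looks like, using std's `Rc` as a stand-in for `Gc` so the sketch needs no external crate. `Animal`, `Dog` and `Obj` are hypothetical names; generated code would presumably look similar but with a traced pointer instead of a reference-counted one.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// `Animal`, `Dog` and `Obj` are hypothetical names for illustration.
pub trait Animal {
    fn name(&self) -> String;
    fn rename(&mut self, name: &str);
}

pub struct Dog {
    name: String,
}

impl Animal for Dog {
    fn name(&self) -> String {
        self.name.clone()
    }

    fn rename(&mut self, name: &str) {
        self.name = name.to_string();
    }
}

// Stand-in for `Gc<RefCell<Trait>>`: `Rc` is reference counted, not traced.
pub type Obj = Rc<RefCell<dyn Animal>>;

pub fn new_dog(name: &str) -> Obj {
    Rc::new(RefCell::new(Dog {
        name: name.to_string(),
    }))
}

/// Two handles alias one object, and a mutation through one is visible
/// through the other, much like copying a reference in Java.
pub fn alias_and_rename() -> String {
    let a = new_dog("Rex");
    let b = Rc::clone(&a);
    b.borrow_mut().rename("Fido");
    let name = a.borrow().name();
    name
}
```

The borrow checker is satisfied because aliasing is checked at runtime by `RefCell`, which is exactly the runtime cost mentioned above.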

Though there is this thing called Corrode which is trying to translate C into Rust and iirc they are trying to minimize the unsafe usage in the generated code. That may eventually be able to convert your code into an indexed iteration.

(Also, while the specific code you wrote is safe, if the mutation being performed were more complicated with a non-primitive, it could lead to segfaults)

What do you consider weird about this code? This is fairly straightforward C++, as far as syntax and language use. Do you mean the logic of what it is doing seems weird to you?

Yes, it's just accomplishing something weird. And it's straightforward C++ (as you say) that uses multiple mutable references to a single object. Of course Rust forbids this.

How would one transpile this code to Rust? Or is the idea that every language ought to have Rust-style semantics? If so, it's hard to reconcile that with "Rust is the new LLVM," since LLVM does not demand such strict semantics.

    let mut foo = vec!(1, 2, 3);
    for mut iter in 0..foo.len() {
        iter += 1;
        foo[0] += iter;
    }

No unsafe required, I think.

Points for cleverness - you replaced the iterator with a call to len(). Array indices are sometimes thought of as back-door references, because they let you cheat in this manner.

But you didn't quite get away with it. The C++ code increments the value in the array through the iterator, while the above Rust increments the iterator (err, index) itself. So it doesn't produce the same result.

More generally, std::vector in C++ typically implements iterators with just raw pointers. This is quite unsafe, but doesn't impose borrow-checker restrictions on the user. How can such languages be compiled to Rust?

>But you didn't quite get away with it. The C++ code increments the value in the array through the iterator, while the above Rust increments the iterator (err, index) itself. So it doesn't produce the same result.

I have to admit that I don't understand you here. Can you elaborate on the difference between the two programs?

Sure. The C++:

    for (iter = foo.begin(); iter != foo.end(); ++iter) {
        *iter += 1;
    }

Here `iter` is a pointer. We dereference it and increment the result. This adds 1 to every value in the `foo` vector.

The Rust:

    for mut iter in 0..foo.len() {
        iter += 1;
    }

Here `iter` is an array index, and `iter += 1` increments that index. The contents of the `foo` vector are not modified.

>Here `iter` is a pointer. We dereference it and increment the result. This adds 1 to every value in the `foo` vector.

Huh? How is `iter` related to foo? When you go `iter += 1` aren't you just incrementing the variable behind iter?

Unless... OK, I'm not very familiar with C or C++, totally newb here. Are you doing pointer arithmetics here, when you increment iter?

But wait, if you are doing pointer arithmetic, I thought you are not supposed to deref the pointer. Otherwise you are doing arithmetic on the variable behind the pointer, not the pointer itself.

Also, with `foo[0] += *iter`, you are just adding iter to the first element of foo, again and again. You are not actually iterating on foo.

I'm confused...

Think of `foo.begin()` as returning a pointer to the first element of the array, or something which acts like it (similar to Rust's `Deref`). Think of `foo.end()` as returning a pointer to a "virtual" element after the last element of the array. Then `++iter` increments this pointer, so it points to the next element of `foo`.

Since `iter` points to an element of `foo`, `* iter` is the element itself. `* iter += 1` adds one to that element. `foo[0] += * iter` takes the current element, adds it to the first element of `foo`, and stores the result as the first element of `foo`.

The catch is that, in Rust terms, you have a mutable reference to an arbitrary element of `foo` (the reference being `iter`), and you are getting a mutable reference to another element of `foo` (the reference being `foo[0]`) at the same time. Even worse, in the first iteration, `iter` points to the first element, so `iter` and `foo[0]` are two mutable references to the same location, which exist at the same time! You can't do that in Rust with references (you can with pointers and `unsafe`, and in this case it happens that there is no data race, but...)
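For reference, this is roughly what the "pointers and `unsafe`" route looks like: a near-literal rendering of the C++ loop with raw pointers. `bump_via_raw_pointers` is a made-up name, and this is a sketch of what a transpiler might emit, not a recommendation.

```rust
/// Mirrors the C++ loop: `p` plays the role of `iter`, `base` of `&foo[0]`.
/// `bump_via_raw_pointers` is a hypothetical name for illustration.
pub fn bump_via_raw_pointers(foo: &mut Vec<i32>) {
    let len = foo.len();
    let base = foo.as_mut_ptr();
    for i in 0..len {
        // SAFETY: `i < len`, so `base.add(i)` stays inside the allocation,
        // and the vector is not resized while the pointers are in use.
        unsafe {
            let p = base.add(i);
            *p += 1; // *iter += 1;
            *base += *p; // foo[0] += *iter;
        }
    }
}
```

Raw pointers are allowed to alias, so the first iteration (where `p` and `base` are the same address) is fine here, but the compiler no longer checks any of it.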

The correct safe Rust translation would be:

    let mut foo = vec!(1, 2, 3);
    for index in 0..foo.len() {
        foo[index] += 1;
        foo[0] += foo[index];
    }

But, for more complicated cases, an equivalent translation won't be as obvious.
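To see the difference between the two Rust versions in this subthread, here they are side by side as functions (the function names are made up). Starting from `[1, 2, 3]`, the index-incrementing version gives `[7, 2, 3]`, while the element-incrementing version gives `[11, 3, 4]`, which is what the C++ loop produces.

```rust
/// The earlier attempt: `iter += 1` changes the loop index, not the element.
pub fn increments_index(mut foo: Vec<usize>) -> Vec<usize> {
    for mut iter in 0..foo.len() {
        iter += 1;
        foo[0] += iter;
    }
    foo
}

/// The safe translation: bump each element, then add it to the first one.
pub fn increments_elements(mut foo: Vec<usize>) -> Vec<usize> {
    for index in 0..foo.len() {
        foo[index] += 1;
        foo[0] += foo[index];
    }
    foo
}
```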

What's the advantage of doing pointer arithmetic as opposed to incrementing an index variable, like I did?

The advantage of iterators is the abstraction. Collections like maps and sets don't even have indexes.
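Same idea in Rust terms, as a small sketch: one function written against the iterator abstraction works for a vector, a set and a map's values, even though only the vector can be indexed by position. `total` and `totals` are made-up names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Sums anything that can be turned into an iterator of `i32`.
pub fn total<I>(items: I) -> i32
where
    I: IntoIterator<Item = i32>,
{
    items.into_iter().sum()
}

/// Runs `total` over three different collection types.
pub fn totals() -> (i32, i32, i32) {
    let v = vec![1, 2, 3];

    // A set has no positional index; the duplicate 20 is stored once.
    let s: BTreeSet<i32> = vec![10, 20, 20].into_iter().collect();

    let mut m = BTreeMap::new();
    m.insert("a", 100);
    m.insert("b", 200);

    (total(v), total(s), total(m.into_values()))
}
```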

Such a bad idea, fanboyism is too strong here, Rust is one of the slowest languages to compile, and incremental compilation isn't even available yet. Your new language's compilation will take pretty much forever.

Is your main objection just compilation times? I agree that's a trouble point, although I don't believe it will never get fixed. Incremental compilation is just around the corner!

I assure you that this post is thought out past the point of just "fanboyism." Have some faith :-)

A PR for the first bit of incremental compilation was opened recently.

In fact, as a compilation target, one wants a language with a lax semantics, not a strict one. This is (one of the reasons) why so many languages compile to C, instead of - say - to C++.

Consider this fact for example: many languages use heap-allocated objects in a way that makes it impossible to know statically when they should be disposed of; a GC then takes care of that dynamically. How are they going to give Rust the necessary information to track lifetimes, when the information is not there to start with? Of course, one can just write a GC in Rust, but how is this an improvement with respect to a GC written in C?

Also - what about typed languages - such as Go - that do not have generics? These are not going to map well to the Rust type system. Or what about lazy languages? Or languages that save the stack on the heap to implement call/cc?

In short, whenever the semantics is different enough, you will not be able to map to idiomatic Rust constructs, and in fact if you don't want to pay the cost of interpretation, you will probably map to unsafe constructs most of the time.

Rust most probably will be getting hooks that make it possible to safely integrate efficient GCs. Writing a safe nonconservative GC for rust has already been done, and once these hooks are in place it should be possible to write a good GC. Though for a codegenned language on top of rust you may not need these hooks.

https://github.com/mystor/rust-callcc is callcc for Rust. Technically abuses a feature for a purpose it was explicitly not intended for, but meh :)

callcc and gc are usually not useful operations in rust. That doesn't mean that it's not possible to implement them. The implementation might be annoying to use, but if you are codegenning to rust this doesn't matter.

Yes, some semantics won't copy over. I feel like most would though. Those that don't can often be emulated at the lower level with enough codegen.

It's not that writing a GC is impossible - of course it is doable. The issue is that objects written in this hypothetical language will always make use of the GC because they do not have lifetime information in the first place. So the question arises naturally: why bother compiling to Rust if you have to avoid the borrow checker anyway?

The reason I gave in https://news.ycombinator.com/item?id=12148269 applies here too. Clean interop. The ability to freely use a GCd language and smoothly transition to Rust when necessary.

And Rust-as-a-target doesn't necessarily mean you're doing it for the borrow checker. It could also be the typesystem (with the perf and safety of lower level code as a unique bonus). Yes, other, better, typesystems exist, but nobody's saying that Rust is the only language you can do this with :)

How is Rust's type system helpful? If the codegen produces valid Rust, the Rust type system will not give errors. If it doesn't, I'd say it is a bug in the higher level language - and at best the user is going to see a Rust type error while working in X, which is not the best user experience.

Really, these kinds of checks belong in the frontend.

No, I mean if you want your higher level language to have a typesystem similar to that of Rust. As I mentioned in the other comment, interop between high level and low level languages usually happens through the medium of C, in which all type information is lost. If your higher level language is compiling to Rust, you can easily add Rusty typesystem features with seamless interop.

I'm not saying Rust's compile errors should ever be shown to the user. I don't think that's what the OP is suggesting either. If rustc errors you should emit an internal compiler error, because your higher level compiler should be producing valid rust code.

Minor nit: GHC has a LLVM backend (in addition to an assembly backend and a C backend), but the assembly backend is the most commonly used.

(The LLVM backend is known to produce better-optimized code, but the compile times are slower.)

I'd rather see things compile to typed ASM or something. SPARK + Rust if we're talking intermediate language so we get SPARK's static benefits on some code with Rust's on other code.

Then you'd probably want LLVM IR (or an equivalent). It was literally designed to be a hardware-agnostic strongly-typed SSA assembly.

And/or you could do what Rust and Swift are doing (with MIR and SIL, respectively), and define your own ASM-like language between your parsed AST and LLVM IR. That way you can do high-level optimizations, static analysis, etc. before you drop too far down and lose much context.

LLVM folks told me it has issues that stem from the fact that it was created originally for C/C++ and x86. I'd prefer something more like TALC or, as you said, a strongly-typed SSA that is language and HW neutral. Plenty of research on such stuff though.

How long till someone writes "Javascript: The New LLVM" or even better "Javascript: The New Rust"?

Great talk, Gary is a cool dude.

JavaScript is a target for many programming languages already. Maybe the title could be WebAssembly: The New LLVM.

llvm has alloca. llvm has computed goto. Saying safe-Rust can generate arbitrary "safe" LLVM is a bold statement

As an aside it may be interesting if this could be "Target MIR", but that's a future prospect

Rust doesn't generate arbitrary LLVM, but to my knowledge it does generate LLVM that is memory safe. That's the whole point of the borrow checker.

computed goto, aka indirectbr in the LLVM IR: http://llvm.org/docs/LangRef.html#indirectbr-instruction

> To my knowledge, Lia is the first major language that actually compiles directly to Rust.

You might be interested in checking out Dyon


Really creative ideas in here. Others have raised a lot of good questions, but it seems like an interesting area to explore, anyway.

Once static compilation gets off the ground, I wonder if Julia isn't an easier target for "high level LLVM". The whole language is built to be extensible, and people are already using it in comparable ways.

APL implementation:


As a code generator for the Go assembler:


Rust does offer some type features that could make language interop a lot better while not really imposing too much on the way other languages work. Traits in particular provide a way to model polymorphism that is compatible with subtyping but doesn't require it.
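A minimal sketch of what that means in practice: two unrelated types, one of them a built-in, share behavior through a trait without any inheritance relationship between them. `Describe` and `Celsius` are hypothetical names.

```rust
// `Describe` and `Celsius` are hypothetical names for illustration.
pub trait Describe {
    fn describe(&self) -> String;
}

pub struct Celsius(pub f64);

impl Describe for Celsius {
    fn describe(&self) -> String {
        format!("{:.1} C", self.0)
    }
}

// A trait can be implemented for an existing type after the fact;
// `i32` and `Celsius` share no base type.
impl Describe for i32 {
    fn describe(&self) -> String {
        format!("int {}", self)
    }
}

/// Static dispatch: resolved at compile time for each concrete `T`.
pub fn describe_static<T: Describe>(item: &T) -> String {
    item.describe()
}

/// Dynamic dispatch: a mixed list behind trait objects, which is the part
/// that looks most like subtype polymorphism to another language.
pub fn describe_all(items: &[Box<dyn Describe>]) -> Vec<String> {
    items.iter().map(|item| item.describe()).collect()
}
```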

Can someone explain this:

"Second, LLVM is a simpler language to target than traditional assembly languages—it has nice features like infinite registers and recently support for stack frames."

Aren't stack frames a basic component of any run time? How was support for those only recently added? Maybe I am not fully understanding the context. I looked at the docs but that didn't clear it up for me.

"CPUs are basically assembly interpreters, and that's the most fundamental unit of execution that we can target"

I'm not trying to nitpick here but CPUs are interpreters of binary words. I think that sentence might be misleading to someone.

For all the corner case clarences out there, this is some forward thinking stuff that you should grok before you turn it down.

Web assembly is the new LLVM.

Doesn't web assembly put most responsibility for optimizations on the thing emitting it? That's a big difference from LLVM.

webassembly is just a binary IL spec

I don't know much about it. Is it like Java bytecode, where the runtime is expected to do substantial optimization? Or is it more like traditional assembly where the optimizations are expected to have already been done?

If you're still curious, I dug this up: https://news.ycombinator.com/item?id=9734842

It sounds like WebAssembly is in a weird position, where it can be (relatively) directly translated to machine code, or more heavily optimized to support dynamic languages.

It is just binary IL. Doesn't matter what it is expected to do. You can do anything to it.

I don't think that's at odds with the post. Rust could compile to WebAssembly through LLVM.
