I'm incredibly psyched for MIR. It'll represent the first conceptual "leap" for the language since Rust 1.0, checking off (or at least unblocking) a ton of the improvements that we've had in mind for years now (the list in the OP is hardly comprehensive!).
It's also interesting that Swift, another LLVM-based project, has transitioned to using its own language-specific intermediate representation (SIL). I think it's fascinating that, despite the availability of a robust and mature pluggable language backend, two professional languages independently decided that they needed another layer of abstraction between AST and LLVM IR. Not sure if this should be interpreted as a bug report against LLVM IR, or if it possibly represents a future trend in compiler design.
It's been pretty common for compilers to use 3 different intermediate representations (parse tree, MIR, and LIR). My copy of Advanced Compiler Design and Implementation [1] (over 10 years old at this point) lists the 3-IR architecture on page 8. It cites Sun's SPARC compilers, DEC's Alpha compilers, Intel's x86 compilers, and SGI's MIPS compilers as implementations that use it.
What has changed are the boundaries. Traditionally, semantic analysis (including typechecking) operated on the parse tree, which closely resembled the language as the human wrote it. Optimization operated on the MIR, which was then lowered to an architecture-specific LIR via instruction selection and register allocation, and another round of optimization (instruction scheduling, etc.) was applied there. The purpose of all of this was to support multiple language front-ends on top of a single compiler. In the 80s and 90s, compilers were often written by the hardware vendor, so they would tightly optimize the MIR and LIR for their own architecture, and use the parse-tree => MIR lowering to support multiple surface languages.
LLVM chose a LIR that's closer to what many compilers had been using as a MIR, moved the optimization passes onto that IR, and hid the architecture-specific backends within the LLVM project itself, out of sight of language designers. It could do this because of open source: with a shared body of compiler code owned by everyone and gradual consolidation of the hardware market, it became easier to contribute your backend to LLVM than to maintain your own compiler stack and fight for adoption. (GCC actually had a fairly similar architecture first, but the GCC IR was very difficult to comprehend if you weren't a GCC maintainer, which made it impractical as a compilation target for outside projects; they generated C code instead and let GCC compile it.)

That in turn made it much easier to write a compiler and experiment with language design, since you only had to figure out how to translate your language to LLVM's IR rather than work out the details of scheduling and register allocation. That, in turn, allowed greater complexity in language features: Swift and Rust have language features that go beyond what was cutting-edge research less than 10 years ago, and do so in production languages that you can use now. So it's not surprising that they're now re-introducing a MIR to manage the additional complexity introduced by the new language features.
If you count the intermediate IRs that LLVM has, then Rust actually now has many more than three IRs: AST, HIR, MIR, LLVM IR, SelectionDAG, MachineInstr. Even before MIR, there were 5 IRs.
A slight side note: this is one of the most difficult parts of using Hacker News on a mobile device. I fat-finger upvotes all the time! I wouldn't mind the ability to change a downvote to an upvote and vice versa, even if it were time-limited, so I could correct any incorrectly voted items.
couldn't this be easily fixed by putting the downvote triangle on the other side of the time?
(upvote arrow) name xxx hours ago | parent | flag | (downvote arrow)
would make it less fat-finger prone and still be pretty straightforward to use (plus having downvote close to flag IMHO makes sense as well from a context perspective)
Could be easily fixed by one line of CSS to make the dang buttons larger, but apparently Hacker News can't be bothered to improve their shitty website.
Either that, or a way to hide the downvote on mobile. I hardly ever legitimately downvote, so when I fat-finger it on mobile it just makes me look like a jerk.
Except no one can see you did it, so it doesn't make you look like a jerk. A random downvote here and there shouldn't bother anyone. We're probably better off if we just realise this and never talk about it.
Depends on context. I'm at +149 right now on the top-level comment, so one upvote that turned into a downvote isn't really a big deal. It could be pretty annoying if you've worked hard on a long comment and then it goes from 0 to -1 instead of +1 because of a fat-fingered downvote.
I use it too. The only problem is that the latest update broke the commenting (it only allows you to type one line). But I filed a bug report and it was fixed, so we just have to wait for the update. This is why I love free software. No begging in app comment sections.
Type inference in the presence of OOP-style subtyping was largely considered to be impossible in 2005; there had been a lot of research results with System F<: and System Fω, along with a well-known undecidability result [1]. Attempts to introduce subtyping into research languages like Haskell and OCaml gave us records [2] and objects [3], but neither of these were something that an ordinary mainstream programmer would want to use. Haskell's type system made it frustratingly easy to construct types that would force the typechecker into an infinite loop [4].
And now Swift not only allows these in a way that's familiar to existing mainstream programmers, it interoperates fully with Objective-C and offers additional features like polymorphic literals. And when polymorphic literals (previously allowed only by Haskell, and then not with user-defined types) make typechecking take 12 hours [5], it's widely decried for being poorly implemented.
On the Rust end - the whole concept of the borrow checker was borrowed (sorry) from Cyclone [6][7], a research language of the early 2000s. I remember it occasionally popping up on Lambda the Ultimate [8] around 2004-2005; people seemed intrigued, but few really took notice. Rust's big accomplishment is generalizing this to an "industrial strength" programming language, figuring out all the corner cases around mutability, traits, move semantics, boxes, etc. that you need for this to work in real programming.
> On the Rust end - the whole concept of the borrow checker was borrowed from Cyclone
Not really. That's the lifetime system, which is not the same thing. "Existential Types for Imperative Languages", also by Dan Grossman, explains more of the motivation, but the borrow check is a hybrid of the two solutions he presents.
Borrowing itself is something with a long history, but the whole "unique loan paths" thing makes Rust's solution distinct from a lot of the stuff you find in the research literature.
> Type inference in the presence of OOP-style subtyping was largely considered to be impossible in 2005
Type inference in an OO language is still considered, if not impossible, at least very impractical. Swift, Scala and Rust avoid most of the algorithmic complexity and ambiguity by supporting only type propagation instead of full type inference, and thus require annotations on function parameters.
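A rough Rust illustration of the distinction (the function here is made up): the signature has to be annotated, and everything inside is then just propagated from it.

    // Parameter and return types must be written out...
    fn scale(values: &[f64], factor: f64) -> Vec<f64> {
        // ...but locals are inferred by propagating those types: `scaled` is an
        // iterator over f64, and collect() gets Vec<f64> from the return type.
        let scaled = values.iter().map(|v| v * factor);
        scaled.collect()
    }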
The big innovation there is in realizing that's enough. When I was into language design from 2004-2007, language designers usually assumed people wanted full global type inference. It turns out that local type propagation is an effective 80/20 solution for most practical programming tasks, which opened up a lot of hybrid type systems that would've been far too complex before.
Yes. A blessing in disguise. In hindsight, global type inference seems too unstable. Too much action at a distance; it's hard to evolve a codebase when local changes cascade. Type annotations keep things grounded.
Unfortunately they also had to narrow down the scope of type inference recently. IIRC type annotations are now mandatory for class-level variables (member variables). This hindered the "Ruby-like" feeling, but they had to give up global type inference because it cost too much memory in some cases. I got the impression that global type inference is difficult (or, rather, costly) in general.
> Haskell's type system made it frustratingly easy to construct types that would force the typechecker into an infinite loop [4].
It's more akin to C++ templates, where the type checker will only do a bounded number of reductions before it gives up. Short of writing something that really is infinite and needs trillions of reductions - which wouldn't really be a useful program - it just ends up as a type error, as you'd expect. This also only happens with -XUndecidableInstances, which frankly isn't one of the more common extensions, I'd say.
That said, in practice, most Haskell programmers seem to largely consider subtyping a misfeature anyway, and approximations of it elsewhere in the ecosystem use things like RankNTypes to give a 'feel' for subtyping where you can do things like "use any instance of a more general base type, in place of that type, right here". The infamous lens library does this to great effect.
> previously allowed only by Haskell, and then not with user-defined types
I don't know what you mean here - Haskell has, until recently, never allowed overloaded literals for certain syntactic forms, like lists. For the other cases, though, like numerics, you have always been able to have polymorphic literals, which can be instantiated to a user-defined type. The most general type of `1` by itself with no defaulting is `1 :: Num a => a`, and there's certainly no problem with instantiating `a` to whatever user-defined type you have, provided it can sensibly be constructed from a base-10 decimal value (meaning you have a function `fromInteger :: Integer -> a`).
These days, lists are also syntactically overloadable by a similar convention, through value polymorphism/type classes.
I like Rust, but there is nothing wrong with Go (I think their decision not to have generics does have some merit). They just made the trade-offs they needed. Nothing wrong with that.
I did like Oberon a lot, and I appreciate Oberon-2's influence on Go, but I think there is a lost opportunity in a company like Google sponsoring such a design, when it could have done so much more to help advance the state of the art for mainstream developers.
In that regard, Apple and Microsoft do so much more to help developers improve their skill sets, given the programming languages and tools that they have brought into the market.
However, I do advocate Go for the use cases where developers still make use of C for user-space applications without any real need for C's low-level features.
I also believe that Go can actually be used for systems programming, even if it can't cover every systems-programming scenario, such as 8- and 16-bit embedded processors.
The language is similar enough to Oberon and Limbo for it to be possible; it just needs someone to write a bare-metal runtime.
There is a long tradition of type-directed compilation. GHC, the main Haskell compiler, refers to this stage as the simplifier, or "middle end".
Parsing and type-checking produce an IR known as 'Core'. The middle end optimizes it as a series of source-to-source transformations on Core code.
The back end of the compiler transforms Core code into an internal representation of a C dialect called C--, via yet another intermediate language, STG (short for "Spineless Tagless G-machine").
Finally, the C-- output can either be printed as C code for compilation with GCC, converted directly into native code, or converted to LLVM IR for compilation with LLVM. Whatever the output, it is linked against the Haskell runtime.
The innovation would come in creating the optimizations in the middle-end that would not be possible in the LLVM backend. The proof is in the pudding. Introducing another layer without any meaningful results is how projects bloat (NIH syndrome).
There is an LLVM backend that implements most if not all of the GHC "middle-end" optimizations[0] and should provide you with the details/context you are seeking.
>The innovation would come in creating the optimizations in the middle-end that would not be possible in the LLVM backend
>There is an LLVM backend that implements most if not all of the GHC "middle-end" optimizations
I'm not sure I follow: are you saying that the optimisations are or are not possible in LLVM? I don't think the LLVM optimisation passes can make the guarantees needed for (say) the rewrite rules that rely heavily on purity.
LLVM doesn't ever even bother with rewrite rules. That's done substantially earlier in the compiler, at the Core phase (rewrite rules are 'just' equational-reasoning optimizations that the compiler is aware of, so they work roughly at the level of the program you write, rather than on any more complicated intermediate form).
There is still some inlining/smarts going on at the LLVM level, obviously, but it's nowhere near as aggressive as it is at the Core level in GHC. And obviously rewrite rules are user-definable - so they kick in much earlier. (As a side note, we actually spit out pretty optimized LLVM code from the C-- backend, in such a way that only some of LLVM's heaviest optimizations make a difference).
There are some specific optimizations that are simply not possible for LLVM in the general case and are better handled before conversion (for example, C-- at one point essentially gets CPS-transformed in a way which LLVM cannot 'see through', and by that point you've lost the opportunity to do a number of simpler optimizations like basic constant propagation, even though they're trivial to do earlier). But that's mostly the same in any compiler, and not unique to GHC...
Please read the link I provided and search the page for "Problems with the backend"
...There is one major issue remaining with the LLVM backend that I currently know of, its inability to implement a optimisation used by GHC called 'TABLES_NEXT_TO_CODE' (TNTC)...
The prologue support directly subsumes previous behavior (the `prefix` form) and is more flexible, and with it, GHC doesn't have to rely on shuffling assembly output in order to achieve TNTC.
I can't wait to see improved borrow checking! I'm not sure it'll fix this specific case, but I often find myself needing to introduce a scope just to nudge the borrow checker into doing the right thing. For example:
    fn handle_message(&mut self, find_node_query: messages::FindNodeQuery) {
        let origin = node::UdpNode::deserialize(find_node_query.get_origin());
        let target = Address::from_str(find_node_query.get_target());
        {
            // Borrow self.routing_table first time
            let nodes: Vec<&Box<node::Node>> = self.routing_table.nearest_to(&target);
            let response = Message::Response(
                self.transaction_ids.generate(),
                Response::FindNode(&self.self_node, nodes));
            origin.send(response.serialize());
        } // Done borrowing self.routing_table here

        // Borrow self.routing_table again
        self.routing_table.insert(origin);
    }
I think it's too early to say. The devs have assured me that MIR itself will land within the year, but beyond that point I think that non-lexical borrows might be out-prioritized (though just barely so) by incremental recompilation.
I'm not sure I can say for sure when they'll be ready, as MIR isn't 100% done, and they're built on top, so it's tough. There's also the difference between "lands in nightly" and "lands in stable", which could be some time for a feature so large.
Every time I see rust news, I feel the urge to learn this language. But, thankfully, someone posts some random snippet. :-) Java is a beautiful language compared to this.
My only complaint is the mix between snake_case and PascalCase. I tried removing it (in favor of snake_case, but you could probably get a similar result going full PascalCase), and it looks much better (no guarantee it has the same semantics; I've never worked with Rust):
    fn handle_message(&mut self, find_node_query: messages::find_node_query) {
        let origin = node::udp_node::deserialize(find_node_query.get_origin());
        let target = address::from_str(find_node_query.get_target());
        {
            // Borrow self.routing_table first time
            let nodes: vec<&box<node::node>> = self.routing_table.nearest_to(&target);
            let response = message::response(
                self.transaction_ids.generate(),
                response::find_node(&self.self_node, nodes)
            );
            origin.send(response.serialize());
        } // Done borrowing self.routing_table here

        // Borrow self.routing_table again
        self.routing_table.insert(origin);
    }
Rust used to work that way, but it was changed primarily to visually distinguish enum variant names from variable bindings in patterns: e.g. match foo { none => ... } and match foo { non => ... } having completely different interpretations was confusing.
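For instance (a made-up example under the current convention):

    fn describe(x: Option<u32>) -> String {
        match x {
            // Capitalized `None` matches the variant...
            None => "no value".to_string(),
            // ...while a lowercase name is just a fresh binding that matches
            // anything - exactly the ambiguity that lowercase variants used to cause.
            other => format!("got {:?}", other),
        }
    }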
Exactly. There are a few fundamental principles which have been neglected in order to provide way too many features (kitchen sink syndrome).
For example, the "do not create unnecessary entities" maxim, which, it seems, is central to good language design (ML, Scheme, Haskell, Smalltalk, Erlang, Golang are small languages, and the first three are even standardized). As long as we violate this general rule we end up in a mess like CL, C++, and of course Java as the peak of absurdity.
The meme that "implementation languages must be feature rich" does not hold beyond certain limits, and even the Rust designers have realized that there must be certain restrictions. In fact, C - the oldest and most common implementation language - is still alive and well (despite all the mess other people have made of it since it left Bell Labs).
The best illustration I know is the refusal of the Golang designers to complicate the language by adding generics to the language and runtime (that is, to ruin the balance between simplicity, elegance and expressiveness, which is the most difficult thing to attain - every poet or artist will tell you that; creating an unbalanced pile of features is way easier). For them, interfaces (duck typing) are good enough.
"Explicit is better than implicit" is also a naively wrong maxim. Verbosity is bad. Period. Every good writer I know is non-verbose. Meaning behind words is what makes a good literature, not long words or elaborate passages half page long (ironically, common men think exactly opposite is true). Convention over configuration (evolved defaults instead of explicitly spelling every single rule over and over again) is the same notion from a different perspective.
Here comes Scheme - the greatest mostly-functional small language, with balanced defaults, coherent semantics (almost like in math), and unmatched expressiveness with almost no syntax at all (the expressive power comes from coherent defaults: everything is an expression, everything (including lambdas) is a first-class value referenced by a symbol, lexical scoping, very few standard special forms, and the ability to create new ones) - as a good example. Smalltalk and Erlang are other good examples, for their emphasis on message passing and, in the case of Erlang, purity and isolation - the way vastly complex biological systems have been implemented by evolution.
So, those who have seen highly refined languages like Haskell or Scheme or Smalltalk would never be fooled by "features". The meme that "each project must first agree on which subset of C++ it would restrict itself to" is another good illustration. Rust, like Java, is already suffering from this kitchen sink syndrome, so the comment above is legitimate.
BTW, it seems that many of these features should go from the language to sanitizers and checkers, as is done with clang tooling.
> "Explicit is better than implicit" is also a naively wrong maxim. Verbosity is bad. Period. Every good writer I know is non-verbose. Meaning behind words is what makes a good literature, not long words or elaborate passages half page long
Good writing often allows (or even encourages) a reader to concoct their own inference of exactly what the author means. Witness discussions that last to this day of many ancient classics. Such ambiguity may be fine for writing, but it is obviously disastrous for large engineering projects.
That's fine - not all code is a large engineering project - but most of the people I know who work on very large code bases will take explicitness every time. And those are the kind of projects that Rust is aimed at tackling.
Explicitly passing &self is lame. Haskell's implicit parameters are clever.
Actually, the language feels Ruby-ish - a bit of syntax from here and there. Swift has much more refined and concise syntax, because they paid way more attention to details.
Pattern matching on type constructors could have been taken from Haskell - it wouldn't hurt.
Block syntax with a where clause (why?) is really bad.
The syntax inconsistencies and awkwardness aren't major issues, but they are telling.
You say "explicitly passing self is lame" and then praise Golang for "feature restraint" when it does the exact same thing.
I don't know what "pattern matching on type constructors" means. You use the constructor syntax to pattern match in Rust.
There is no such thing as "block syntax with where clause". Where clauses go on signatures, not blocks. This is also the exact same syntax as Swift, which you praise for "more refined syntax".
Pattern matching is technically almost unchanged since Standard ML, but in Haskell the syntax is refined by perfectionists (in the best possible sense of the word).
Swift, obviously, trying very hard to look familiar to [Objective-]C (but not Java) programmers, and to follow the principle of least astonishment, produced much more concise syntactic choices, and the devil is in the details.
BTW, if one wants to see another example of a carefully crafted language, there is PLOT.
Explicit passing of &self (or &mut self, or self) might be 'lame', but it also gives more clarity as to what the code is doing. I can't say it's ever once given me pause.
Rust is emphatically not a kitchen sink language. If you believe otherwise, please explain what features you would remove without compromising safety or the notion of zero-cost abstractions.
The engineering decisions are not based on beliefs. They are usually trade-offs. Safety in Rust has a very specific meaning clearly outlined in the official book. There are users who prefer it for certain tasks. The answer to your last question is also in the book.
Why not state it here in a few simple sentences, like for a 5-year-old? Clarity is usually evidence of deep understanding. I hope I will be able to get it.
Go is less safe due to the "Off to the Races" problem, but there's a legitimate debate as to whether that safety loss matters in practice.
Common Lisp and Golang use GC, which exacts a performance cost in some scenarios. Go also has highly dynamic semantics (e.g. with defer) which have unavoidable performance overhead, especially in an AOT compilation setting.
" Not sure if this should be interpreted as a bug report against LLVM IR, or if it possibly represents a future trend in compiler design."
Compilers have pretty much been built this way (with the exception of GCC) since about 1975, if not earlier.
:)
LLVM is deliberately low level.
Swift uses a higher-level IR because they want to do static analysis and language-specific optimizations on it.
This makes sense; LLVM's IR is not going to be good at that.
You could transmit metadata, but it's probably not a good architectural solution compared to progressive lowering of IR.
This is a common approach in compilers. It makes things more manageable. Each of the languages is simpler but has more constraints, until you reach the constraints that the machine provides.
I believe that's a bit different, as WebKit entirely replaced LLVM with a new backend, whereas here Rust is still ultimately emitting LLVM IR, just with a new well-defined step in between.
Not necessarily a future trend. When I took a class in compiler construction last semester, the professor considered it industry standard to use multiple (AFAIR he said "up to five") IRs in the same compiler/linker toolchain.
If we're looking at the full toolchain, then LLVM itself does already have either two or three internal IRs, and god only knows how modern linkers work in practice.
I don't think it is a bug report against LLVM: there are basically two ways of doing this. Either you extend LLVM with more intrinsics or concepts that carry your high-level semantics down to the LLVM level (this is what the people from Azul are doing [0]), or you build another intermediate representation, like Swift and Rust. For the Swift case, you can see the talk at last year's LLVM Dev Meeting [1]. There are advantages and drawbacks to each approach.
[0] : "LLVM for a managed language: what we've learned"
[1] : "Swift's High-Level IR: A Case Study of Complementing LLVM IR with Language-Specific Optimization"
Both have slides and video here: http://llvm.org/devmtg/2015-10/
It's not a bug, and the answer is provided in the link:
"Faster execution time. You may have noticed that in the new compiler pipeline, optimization appears twice. That’s no accident: previously, the compiler relied solely on LLVM to perform optimizations, but with MIR, we can do some Rust-specific optimizations before ever hitting LLVM – or, for that matter, before monomorphizing code. Rust’s rich type system should provide fertile ground for going beyond LLVM’s optimizations."
There's still an implicit question there, concerning why one would choose to develop a new IR rather than enhancing LLVM (the latter of which the Rust devs (and especially the Swift devs) have had years of experience doing). Though I'm not trying to suggest that Rust made the wrong decision or is suffering from NIH, because MIR is great, and there's no reason to suspect that shoving absolutely everything down into LLVM is necessarily a tractable solution. I'm just curious which Rust-specific and Swift-specific optimizations these IRs enable above and beyond LLVM IR's capabilities, and what that says about each source language in turn (especially if we regard LLVM IR as a proxy for C, being LLVM's original target).
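(To make the pre-monomorphization angle from the quoted passage concrete, here's a toy, made-up example: each use of a generic function with a distinct type becomes its own copy by the time LLVM sees it, so LLVM-level cleanup is repeated per instantiation, whereas a MIR-level pass could run once over the single generic body.)

    fn twice<T: Clone>(x: T) -> (T, T) {
        (x.clone(), x)
    }

    fn main() {
        let _a = twice(1u8);                // instantiates twice::<u8>
        let _b = twice(String::from("hi")); // instantiates twice::<String>
        let _c = twice(vec![1, 2, 3]);      // instantiates twice::<Vec<i32>>
    }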
It's easiest to do optimizations on an IR that has the appropriate level of granularity. If you're trying to optimize address calculations, it's easier to work with something like LLVM's getelementptr, which represents an address calculation as one IR node, than a C-level IR (where the address calculation would be implicit in an array or structure field access), or an ASM-level IR (where the address calculation might be broken up over several different instructions).
LLVM is pretty low-level. It's a bit lower-level than C and has a very simple type system. It has no idea about unique pointers versus references, has no concept of borrowing, etc. I bet you could encode the necessary information for performing say borrow-checking into LLVM IR, but working with that would not necessarily be pleasant.
A good example of this is ARC optimization in Swift, which gets rid of unnecessary reference counting operations. That optimization operates on SIL, which has special instructions for reference counting. You could represent those instructions as magic intrinsics in LLVM IR, but you have no way of imposing any structural restrictions on how those intrinsics are used. Moreover, any pass which operated on LLVM IR to optimize uses of those intrinsics would have to make deep assumptions about the semantics of those operations. Those assumptions, being Swift-specific, would mean the pass could not be used by other languages, which removes a lot of the benefit of making it an LLVM pass in the first place.
For starters, LLVM has a very simple type system. Very simple. So pretty much any optimization that relies on type knowledge, unless it's a simple optimization (i.e. can be expressed as an attribute on lowered types, ...), needs a higher-level IR.
Things are also legal in LLVM IR that are not legal in any of the languages it supports, for example, getelementptr can be used with negative indices, etc.
Some language rules and optimizations are easy to express in LLVM IR, but there are plenty that are not.
You could add metadata, but you'd probably have to extend how metadata works to support a bunch of things, and even then, the community would want metadata designs that multiple language frontends can use, not just yours.
> Things are also legal in LLVM IR that are not legal in any of the languages it supports, for example, getelementptr can be used with negative indices, etc.
Not sure what you're talking about here; what's wrong with negative indices in C?
    int a[100];
    int *p = &a[50];
    a[0] = 666;
    printf("%d\n", p[-50]);
Remember that x[y] in C just means *((x)+(y)). You can offset pointers with positive and negative offsets so long as the resulting pointer remains within the bounds of an object.
(EDIT: formatting so HN doesn't misinterpret code point U+2A.)
If you implement a stack using an array and you track the top of stack by a pointer just past the last element, then you can index negatively into that pointer to access items on the stack:
    // Create an empty stack.
    int stack[100];
    int* top = stack;  // i.e. &stack[0]

    // Push a couple of values.
    *top++ = 123;
    *top++ = 456;

    // Peek the values.
    printf("%d\n", top[-1]); // 456.
    printf("%d\n", top[-2]); // 123.
You could make the pointer arithmetic more explicit, e.g. *(top - 1) instead of top[-1].
For something like that, I would use the explicit pointer arithmetic. The array syntax implies (to me) "I am doing the normal thing with this normal array".
I'm aware that index notation in C is just sugar for pointer arithmetic, but I'm under the impression that the interpretation of "within the bounds of an object" is less liberal than that demonstrated in your code example. In particular it's not clear to me that a pointer derived from an offset into an existing allocation does not constitute a new object, with its own bounds. Maybe I'm just spoiled by languages with proper slices...
Your impression unfortunately turns out not to be the case. You can move pointers forward and backward within an array using pointer +/- integer arithmetic or (equivalently) the prefix and postfix ++/-- operators, and it's well-defined so long as you don't back up past the beginning of the array or go more than one element past the end.
See section 6.5.6, paragraph 8, of the ISO C standard, if you want the legalese. But this is pretty basic C that dates back to the beginning of the language.
This behavior is entirely necessary for C++-style iterators to work. Rust's slice iterators are implemented in the exact same fashion, precisely because C/C++ does this a lot and LLVM is trained to "get" it.
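Roughly the shape of that pattern in (unsafe) Rust - a minimal sketch, not the actual std implementation:

    fn sum(xs: &[i32]) -> i32 {
        let mut total = 0;
        unsafe {
            let mut p = xs.as_ptr();
            let end = p.add(xs.len()); // one past the last element, as the C standard allows
            while p < end {
                total += *p;
                p = p.add(1);
            }
        }
        total
    }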
There is nothing surprising in this fact. I have always argued that even a very simple language needs not one, not two, but dozens of successive IRs. Why? See the recently open-sourced Chez Scheme.
The LLVM bitcode problems are mostly caused by using LLVM IR as a distribution format. As long as the Rust MIR doesn't leak into the outside world, I don't see how it might turn into an issue.
Most of those problems are non-issues (e.g. backdoor injection--Apple always has the ability to do this) and the remaining issues have nothing to do with Rust.
I suspect that the MIR idea could (should?) become a pretty common concept for a large number of LLVM-based languages. The LLVM IR is very low-level, and in most modern languages there's a big gap between the programmer-friendly features they have and "the bare essentials that still capture the same ideas".
I also suspect that the act of designing a MIR for a language is a fun one, and should be pretty damn enlightening. It really forces you to focus on what's "core" about the language.
In my own, limited, experience with language design and compilers, I've found that doing something like a MIR greatly simplifies other parts. Code written in MIR (in my case, it didn't even have a concrete syntax, just an abstract syntax tree) is full of duplication and is ridiculously explicit, but it's super easy to process in all kinds of ways.
Just plainly guessing, but based on this blog post, if I'd write a Rust interpreter, I'd probably run it on MIR and not on plain Rust or LLVM IR.
You are correct that a Rust interpreter would work great on MIR, which is why Scott Olson (@tsion) has been working on one: https://github.com/tsion/miri (check out the slides and the report).
What I'd want to see in a language is a bunch of syntactic sugar which corresponds to a MIR like Rust's via a well-specified reduction that is likely to be consistent from build to build.
The reason that this could be really cool is that you could hypothetically hash these MIRs and store these hashes in a public database of open-source code. (You would have to include hashes of whatever functions they call/depend on, and you would have to solve foreign-function imports.)
Then you would have this cool language where modules are simply saying "here's a mapping of names to hashes in the public database," with automatic recognition of "hey, these two functions you're importing from these two modules are actually the same function, I don't need to disambiguate them." If your dependency graph isn't a tree and one of the branches can't cope with a dependency-upgrade, that's fine too -- you upgrade the dependency on the one branch only. The point is, all of your dependency-semantics are now reduced to a really straightforward reasoning about unique names.
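A toy sketch of the bookkeeping that idea implies (all the names here are made up, and real content hashes would also have to cover the dependency closure): a module's imports are just a map from local names to content hashes of normalized MIR, so "are these the same function?" reduces to comparing hashes.

    use std::collections::HashMap;

    struct Module {
        imports: HashMap<String, String>, // local name -> content hash
    }

    // Two imported names denote the same function exactly when their hashes match.
    fn same_function(a: &Module, x: &str, b: &Module, y: &str) -> bool {
        match (a.imports.get(x), b.imports.get(y)) {
            (Some(ha), Some(hb)) => ha == hb,
            _ => false,
        }
    }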
> The reason that this could be really cool is that you could hypothetically hash these MIRs and store these hashes in a public database of open-source code.
Right, the point would be for the hash to strip out file info, line numbers, whitespace, comments, any stylistic choices which turn out to be equivalent "under-the-hood" in some straightforward way, and probably even the bound variable names (with de Bruijn indexes).
I'm curious how this new architecture preserves safety invariants. The post says MIR contains operations not available in Rust because they are unsafe. Does this mean every MIR transformation needs to be safety-preserving? Or is safety-checking the last stage that runs directly before LLVM IR generation?
But the MIR definition of "safety" must be more encompassing than LLVM IR, right? LLVM IR has no concept of a borrow. Where in the MIR pipeline does borrow checking happen?
I believe that when it's implemented it will happen first before any optimization, in order to provide the best diagnostics possible and to avoid doing wasted optimization if there are errors. But Niko would really be able to answer this more definitively.
Yes, in theory, you could write a new backend that uses MIR. Miri, linked at the end of the post[1], is an example: it interprets MIR, rather than using LLVM to generate machine code.
Wow! More than the actual concept (which is cool), I found the level of technical writing to be of high calibre .. highly educational. A pleasure to read such articles.
In theory it could allow for sibling call optimization by rewriting those functions into loops. It would not allow full unrestricted TCO--that would require both ABI work and resolving thorny high-level design issues around destructors that may well not be solvable at all.
However, in practice LLVM already has full support for sibling call optimization internally, so implementing such an optimization at the MIR level wouldn't really buy us anything over what we already have.
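For example (a toy case, not something rustc specifically promises), a self tail call like this is exactly the kind of thing LLVM can already turn into the loop form below:

    fn gcd_rec(a: u64, b: u64) -> u64 {
        if b == 0 { a } else { gcd_rec(b, a % b) } // tail call in return position
    }

    fn gcd_loop(mut a: u64, mut b: u64) -> u64 {
        while b != 0 {
            let r = a % b;
            a = b;
            b = r;
        }
        a
    }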
Then maybe rustc ought to have diagnostics that warn if a recursive function won't be optimized fully, and then I can either avoid recursion or rewrite to fit into the optimizer, but having to appease a changing optimizer isn't what I'd call maintainable code, so a function attribute might make more sense.
There's a reserved and unimplemented keyword "become", which is like "return" except it requires that it can optimize the tail call (and is a compile-time error if it can't). See this comment: https://github.com/rust-lang/rfcs/issues/271#issuecomment-18...
How about non-sibling TCO provided there are no drop obligations? I'll gladly write the boilerplate of manually calling `drop` to get the ball rolling here.
Not super familiar with Rust, but will this have any impact on debugging?
I imagine LLVM is generating debug symbols from what it receives (which would be fairly representative of the original program).
Now with this extra step, won't LLVM generate debugging symbols for the MIR, which doesn't really map well to the original code (i.e. LLVM won't know all those gotos were originally a loop)?
Debug information is propagated down from the HIR through the MIR and then down to LLVM IR, just as LLVM propagates debug info down through LLVM IR to SelectionDAG and then to MachineInstrs.
I'm surprised to see no mention in the article or on HN about how this might affect writing crypto code in Rust. Maybe it's a little tangential, but I'm dying for constant time operations.
Avoiding branches, etc. in Rust doesn't mean LLVM won't add some as an optimization, which is frustrating to say the least. It would be awesome to be able to define a block - similar to "unsafe" - that tells the compiler to disable optimizations that could introduce non-constant-time operations. When I started reading the article, I thought maybe this new development would open the door to something like that, but it doesn't appear to be the case.
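For context, the kind of code in question is the usual branchless comparison (the function name here is made up), where the whole worry is that nothing stops the optimizer from reintroducing branches:

    fn ct_eq(a: &[u8], b: &[u8]) -> bool {
        if a.len() != b.len() {
            return false;
        }
        // Accumulate differences with XOR/OR instead of early-exiting on the
        // first mismatch, so the loop's work doesn't depend on the data...
        let mut diff = 0u8;
        for (x, y) in a.iter().zip(b.iter()) {
            diff |= x ^ y;
        }
        // ...but LLVM is still free to rewrite this however it likes.
        diff == 0
    }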
There's some work to do constant-time ops in Rust, but it's very experimental and untrustworthy. :/
I have no experience with LLVM internals, but couldn't you split up the code and disable some of LLVM's optimizations on blocks defined by the programmer?
Or do you have to pass the entire program at once to LLVM? Maybe the ability to disable optimizations only on certain functions is possible, idk, but I think it certainly would be nice.
Awesome. This reminds me of the recent changes to the OCaml compiler (flambda). If it's anything similar, can't wait for the performance improvements from this (which can hopefully continue bringing Rust up to C speed).
Because (a) I resisted it due to (mistakenly) thinking it was not necessary; (b) the language was changing a lot and having a MIR to change would have made that harder. If we had had a MIR from Rust 0.1, it probably would have had to be completely overhauled by this point.
Hopefully the IntelliJ plugin for Rust will iterate faster and support more completions. Actually, it would be awesome to edit Rust the same way as Scala/Java.
I don't think this new work would be useful for that. It's an internal compiler IR for programs in the process of being optimised and lowered, not a general format for representing Rust programs for user applications.
Specifically, based on the thread covering WebAssembly as an output target [0], rewinding the graph back into higher-level control structures like loops may be non-trivial.
It's interesting to compare and contrast this against bytecode from other platforms like CIL or JVM bytecode. They're both essentially a kind of compiler IR, and they allow for incremental compilation, can be interpreted, dynamically manipulated, structured as basic blocks, etc. But MIR is represented using a textual form that still has scoping and still uses Rust syntax. I can see the appeal. But I wonder how much it complicates working with MIR. You'd need a more complex parser, I guess.
It's kind of a moot point, because MIR never leaves its in-memory representation in normal compilation. (However, that may well change with future improvements to incremental compilation, though the on-disk format will likely never be formally defined and stabilized--it's too much work for little gain.)
Surface syntax of the compiler IR isn't particularly important anyway.
Well, actually, we store MIR in crate metadata. However, it's not a special serialization format, just the rustc_serialize infrastructure, i.e. you could also serialize MIR to JSON if you really wanted to.
Inside rustc, MIR is a data structure. The outputs you see here are a human-readable export of that; the compiler does not emit text and then re-parse it.
That's certainly true today, but other ecosystems have benefited tremendously from tools that are able to work directly with the IR (well, bytecode). Usually to pull off features that can't be expressed directly in the language without huge pain, like continuations, or to do things like profiling.
I understand the desire to keep MIR private and avoid needing to stabilise a syntax and feature set for it, but there are benefits to doing so as well.
Yes, but at the time that we actually do stabilize such a thing, we could choose a representation that looks completely different than this one, one that would be easier to parse.
Speaking of comparisons to bytecode, there's currently a student writing a MIR interpreter as part of their research: https://github.com/tsion/miri . There are some that foresee this being added to the Rust compiler itself, to enable more extensive compile-time constant evaluation.
IIUC, MIR has a textual representation, but the compiler works on a control flow graph in memory. I don't think MIR is written to disk and then re-parsed during compilation.
My guess is that in the future, source code -> MIR could be cached (on disk) so that whole-program optimization/linking could happen without compiling all of the source to IR.
I wonder; can someone explain to me how this compares to Scala's TASTY? It seems to be at about the same layer, but I understand too little about compilers to be sure.
One name is in all caps, the other is not, and it's easy to add another word to your website searches to get the one you want (e.g. MIR Rust, Mir Ubuntu).
If you are interested in more details, there was a talk two years ago about the compiler design: https://www.youtube.com/watch?v=osdeT-tWjzk (many type names and implementation details have changed since then, but conceptually it is still very applicable).
EDIT: I should also mention that MIR has been available on play.rust-lang.org for a while, if anyone would like to play with it in their browser: https://play.rust-lang.org/?gist=fee8ccf28bae2c89107d&versio...