This is a very cool exercise. I see on the roadmap in the README that the approach seems to be "horizontal" in writing the compiler's parts (complete parts are written starting from the parser, ending with the native code generator). While this can work if you really know where you're going from the beginning, I would strongly recommend using a more "vertical" approach where you stack language features one on top of another, each completely implemented from parsing to native code generation. So you may start (in the extreme case) with a language that can only understand integers (and that's it, not even basic arithmetic operations) but which at least is an actual language implementation, and can be tested for regressions each time you add new features (which can be built on top of the already implemented, known-to-work features).
Whenever someone applies the "vertical" approach to building a compiler, it looks really awesome to me. It helps to see the big picture right away -- it's much easier to pick up finer details if I understand a topic in general first. Also, seeing a project working -- even if the majority of functionality is missing -- and growing piece by piece works much better at keeping me motivated.
So, when I was starting to work on my own toy compiler, I was totally going to go with the vertical approach... Well, I quickly realized it didn't work for me when writing my first-ever compiler.
You see, when you go vertical you have to learn a little about every layer of a compiler with every language feature you add: a bit of lexing, a bit of parsing, a bit of semantic analysis, a bit of assembly generation. Basically, you have to constantly switch context between the layers. And I personally found it much easier to learn about managing the stack, calling conventions, and translating expressions into assembly in one go, without switching context away from the code generation layer.
In contrast, the layers of a compiler tend to be rather loosely coupled, so you don't have to think about parsing at all while working on semantic analysis.
But if I ever set out to write a second compiler, I'll try to go vertical again. Now that I have a more or less clear idea of how to build each layer, it feels like I can build them slice by slice. What helps is that I can keep a picture of the end result in mind while adding each piece.
This is a valid point. I'm teaching a compilation class to undergrad computer science students, so we actually first learn each "horizontal" step separately and work on them during lab sessions, and then for their project I strongly recommend that they take the "vertical" approach, because it consistently gives the best results.
In a sense, even if my students' project is their first compiler, they have already seen and done exercises on each compilation step individually, a bit like you said.
> In contrast, layers of compilers tend to be rather loosely coupled, so you don't have to think about parsing at all while working on semantic analysis.
I agree; this is an illustrative example of data coupling: very loose, easily observable, and there are even nice visualization techniques for certain steps. Each layer can be understood and tested in isolation.
Also, how vertical is the "vertical" approach, really? In a first iteration you ought to lay the foundation of each component. Later you just add additional branches/structures, so to speak.
This is fine, but I think there is a minimal sub-language that you want to implement in the first iteration: small enough that you don't get lost in the hairier details, but large enough that it presses you to find a 'workable flexibility' for each of the components.
> Also how vertical is the "vertical" approach really?
It can be quite vertical, really. Like starting with a compiler for a language where the only correct programs consist of a single integer. This is what I do with my students to get them started on their compiler project. Of course you may have to rethink and rewrite or adapt some code depending on the order in which you're adding features and how they interact, but that will be true anyway if your language evolves :).
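To make that concrete, here is a rough sketch (hypothetical, and in Rust since the thread's compiler is written in Rust) of such a single-integer language: the entire "front end" parses one integer literal, and the "back end" emits x86-64 assembly that returns it as the process exit code.

```rust
// Minimal vertical slice: the only valid programs are a single integer literal.
// Names and the exact assembly are illustrative, not a real compiler's output.
fn compile(source: &str) -> Result<String, String> {
    // "Lexing" + "parsing": the whole program must be one integer literal.
    let value: i64 = source
        .trim()
        .parse()
        .map_err(|e| format!("expected a single integer literal: {e}"))?;

    // "Code generation": emit assembly whose main() returns the value.
    Ok(format!(
        ".globl main\nmain:\n    mov ${value}, %rax\n    ret\n"
    ))
}

fn main() {
    match compile("42") {
        Ok(asm) => print!("{asm}"),
        Err(err) => eprintln!("compile error: {err}"),
    }
}
```

Every later feature (arithmetic, variables, functions) then grows this same pipeline end to end, and the old test programs keep acting as regression tests.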
Thanks - author of the repo here. This really started off with me reading Pierce's Types and Programming Languages textbook and getting interested in implementing typecheckers and whatnot... and kind of spiraled into a full-blown compiler. So it's ultimately a learning project for me and not meant to be a production-ready compiler (we already have MLton).
After this experience, I do agree that going with the "vertical" approach you described would be significantly easier - I actually considered doing this for just the raw lambda calculus so that I could get CPS transform and native codegen working on a smaller scale.
Unfortunately, I haven't had time to work on the project in a while (I'm finishing my PhD in chemical biology), but I hope to come back to it soon.
I realize this is more about Rust and writing compilers. But if anyone is interested in finding out more about Standard ML, give the Programming Languages course on Coursera[1] a shot. Part A is devoted to Standard ML (SML). To say the instructor is great would be a huge understatement. The code we wrote for assignments used only immutable values, recursion instead of loops, and partial application. It didn't make me a hardcore functional programming proponent, but it seriously broadened my horizons. On top of that, we used pattern matching extensively, which in combination with discriminated unions impressed me the most: using these two to model a lot of problems felt really natural.
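As a small illustration of that modeling style (in Rust rather than SML, to match the thread's compiler; the types here are made up for the example): a discriminated union for a tiny expression language plus a pattern-matching evaluator.

```rust
// A toy expression type modeled as a discriminated union (Rust enum).
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Pattern matching makes the evaluator read like the grammar itself.
fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Add(a, b) => eval(a) + eval(b),
        Expr::Mul(a, b) => eval(a) * eval(b),
    }
}

fn main() {
    // (1 + 2) * 3
    let e = Expr::Mul(
        Box::new(Expr::Add(Box::new(Expr::Num(1)), Box::new(Expr::Num(2)))),
        Box::new(Expr::Num(3)),
    );
    println!("{}", eval(&e)); // prints 9
}
```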
I’m in the early stages of working on a compiler in Rust and haven’t gotten to the IR infrastructure yet. With Rust’s borrow checker, how can I make the IR graph safe, given that its control-flow preds and succs and data-flow inputs mean there isn’t a clear owner for any node? How does rustc do this?
The essential idea is that you can use one of these "GhostCell" structures to assign a lifetime to a related collection of objects (such as the nodes in a graph), which allows the entire graph to be treated as if it had a single owner and a single lifetime rather than a separate owner and lifetime per node.
The paper is relatively recent (2021), and GhostCell is not, so far as I am aware, used in rustc at this point.
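For a feel of the idea, here is a minimal sketch built on the ghost_cell crate's GhostToken/GhostCell API (the graph-building details are omitted; this only shows how one token acts as the single owner for a whole family of cells):

```rust
// Minimal GhostCell sketch: all cells created inside the closure are
// "branded" with the token's lifetime, so mutation is statically checked
// against a single token instead of per-cell ownership.
use ghost_cell::{GhostCell, GhostToken};

fn main() {
    GhostToken::new(|mut token| {
        let a = GhostCell::new(1);
        let b = GhostCell::new(2);

        // All mutation goes through the one token: one writer at a time
        // across the entire branded family, even if the cells were shared
        // between graph nodes via Rc or an arena.
        *a.borrow_mut(&mut token) += 10;
        *b.borrow_mut(&mut token) += 20;

        println!("{} {}", a.borrow(&token), b.borrow(&token));
    });
}
```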
Author of the posted repo here! As another commenter posted, I used typed arenas quite a bit; it's not the most ergonomic way to do things, but it's quite performant and gets around the lifetime issues.
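For anyone unfamiliar with the pattern, here is a rough sketch (not the repo's actual code; the node type is made up) of the typed-arena idiom for graph-shaped IR: every node shares the arena's lifetime, so nodes can point at each other freely, and the whole graph is freed when the arena is dropped.

```rust
// Sketch of arena-allocated IR nodes with interior mutability for edges.
use std::cell::RefCell;
use typed_arena::Arena;

struct Node<'a> {
    name: &'static str,
    succs: RefCell<Vec<&'a Node<'a>>>, // edges can be wired up after allocation
}

fn main() {
    let arena = Arena::new();
    let entry = arena.alloc(Node { name: "entry", succs: RefCell::new(Vec::new()) });
    let exit = arena.alloc(Node { name: "exit", succs: RefCell::new(Vec::new()) });

    // Even a cycle is fine: the arena owns every node, not the nodes themselves.
    entry.succs.borrow_mut().push(exit);
    exit.succs.borrow_mut().push(entry);

    println!("{} -> {}", entry.name, entry.succs.borrow()[0].name);
}
```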
I struggled with this as well, and ended up making the lexer itself have a lifetime which is that of the full String read from the file, so that every token (and AST node) has the lifetime of the parsed file. I keep their indexes as well, so that I can show error messages the way rustc itself does (with code snippets and all), which would be hard to do if I didn't have the full source text available.
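Roughly, the shape is something like this (a simplified sketch, not the commenter's actual code; the "a token is any run of non-whitespace" rule is just a stand-in):

```rust
// Tokens borrow from the source text and carry byte offsets for diagnostics.
struct Token<'src> {
    text: &'src str, // slice into the original source String
    start: usize,    // byte offset, kept for error messages with snippets
}

struct Lexer<'src> {
    src: &'src str,
    pos: usize,
}

impl<'src> Lexer<'src> {
    fn new(src: &'src str) -> Self {
        Lexer { src, pos: 0 }
    }

    // Toy rule: a "token" is any maximal run of non-whitespace bytes.
    fn next_token(&mut self) -> Option<Token<'src>> {
        let rest = &self.src[self.pos..];
        self.pos += rest.len() - rest.trim_start().len(); // skip whitespace
        if self.pos >= self.src.len() {
            return None;
        }
        let start = self.pos;
        let end = self.src[start..]
            .find(char::is_whitespace)
            .map(|i| start + i)
            .unwrap_or(self.src.len());
        self.pos = end;
        Some(Token { text: &self.src[start..end], start })
    }
}

fn main() {
    let source = String::from("val x = 42");
    let mut lexer = Lexer::new(&source);
    while let Some(tok) = lexer.next_token() {
        println!("{:>3}: {}", tok.start, tok.text);
    }
}
```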
This sounds a bit like cheating. What if you want to parse a really big file and delete the tokens as soon as you don't need them?
If Rust claims to solve memory management, then we should not fool ourselves by allocating everything in a huge chunk which is then freed when the program ends.
That's why this was not my first choice... but it was just far too hard to implement otherwise... and then again, once you really think about it, you must keep ALL tokens/AST nodes in memory to be able to analyse and type-check/lint any decent language... so keeping a big String in memory while all nodes just point to slices of it is not nearly as wasteful as it might appear at first glance.
If this is cheating, many real compilers “cheat.” Heck, some of them famously don’t even free it when the program ends, leaving that up to the operating system!
So basically it sidesteps the type system issues in a similar way that an inconvenient car trip instead of taking a flight sidesteps the TSA issues? ;)
Adjacency matrices would make ownership clear and solve the single-writer/multiple-reader issue because nodes wouldn't directly reference each other, but they keep the exact same relationships between nodes, so memory leaks are still possible (e.g. not cleaning up dead code because it is cyclic and still appears to be referenced). If I'm going to defeat the type system, it seems like raw pointers would be easier.
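For reference, the index-based version looks something like this (a tiny sketch with made-up types): the graph owns all node data, edges are entries in a matrix keyed by node index, and no node ever holds a reference to another.

```rust
// Adjacency-matrix CFG sketch: single owner (the Graph), indices instead of references.
struct Graph {
    names: Vec<&'static str>,
    edges: Vec<Vec<bool>>, // edges[from][to]
}

impl Graph {
    fn new() -> Self {
        Graph { names: Vec::new(), edges: Vec::new() }
    }

    fn add_node(&mut self, name: &'static str) -> usize {
        self.names.push(name);
        // Grow the matrix by one column and one row.
        for row in &mut self.edges {
            row.push(false);
        }
        self.edges.push(vec![false; self.names.len()]);
        self.names.len() - 1
    }

    fn add_edge(&mut self, from: usize, to: usize) {
        self.edges[from][to] = true;
    }

    fn succs(&self, from: usize) -> Vec<usize> {
        (0..self.names.len()).filter(|&to| self.edges[from][to]).collect()
    }
}

fn main() {
    let mut g = Graph::new();
    let entry = g.add_node("entry");
    let body = g.add_node("body");
    g.add_edge(entry, body);
    g.add_edge(body, body); // a cycle is just another matrix entry
    println!("succs of {}: {:?}", g.names[entry], g.succs(entry));
}
```

As the comment notes, this only moves the problem: nothing stops you from keeping matrix entries (and node data) around for nodes that are actually dead.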
It's interesting and surprising to me how easy the uncommented compiler code is to read, even for a new Rustacean (compared to my recollection of reading a compiler in C written by another person when I already knew C much better and had written my own).
F# doesn't have functors but SML and OCaml do. F# has Computation Expressions, which (AFAIK) SML does not have; OCaml has PPX extensions. F# has access to the .NET ecosystem, the JS ecosystem, and multi-core. OCaml has easy binding to C, but multi-core hasn't quite landed yet. There are a few competing JS compilers for OCaml.
F# is the most mainstream of the 3. If you can live without the fancier type-system features (trust me, you probably can!) it's the easiest ride.
F# has quite a bit of awkwardness related to dotnet compatibility.
In practice you have to use a lot of C# class based code, which brings an inherent mismatch to the functional model and makes code look more like a multi-paradigm language like Scala.
The tooling and ecosystem are great (though not as good as for C#, obviously...), and the language has some very nice features and design decisions, but you have to live with a somewhat messy reality.
An interesting opinion on OOP in F# from Eirik Tsarpalis, who, curiously, started out programming with pure FP in Haskell [1].
And a talk from Don Syme, the author of F#, on good F# code as they see it [2]. It's a great talk overall and they specifically address OOP's good and bad bits starting from about 39:00.
When I (an utter beginner in it) code in F#, I miss ReSharper a lot. I wish there were an IDE extension with features similar to those of ReSharper, but for F# rather than C#.
Give Rider[1] a try? Its F# support is not as fully featured as its C# support, but it's, in my opinion, the best available F# IDE. Worth noting, many great features of Rider aren't C#-specific, e.g. the incremental search in files works for any language. Keep in mind, it's not free though.
Sometimes the absence of features is itself a feature. SML is a much smaller language than OCaml and F#. Offhand, I can't recall anything it has that OCaml doesn't, although it does some things differently (e.g. structurally typed immutable records).
I'm not sure there is much reason to use SML itself today, given that it has stagnated since the late 90s, but it is an excellent language for teaching programming because you don't have to sweep anything under the rug. If you do want to use it, it actually has well-maintained, pretty good compilers. MLton probably generates faster code than OCaml and F# in many cases. The main problem with SML is that there are not a lot of SML programmers around, and hence very few libraries, particularly for handling modern protocols and formats.
My first programming language when I started to study CS at university was SML using MoscowML (in 2001).
It seemed like both I and a lot of other students had a lot of issues with it, and the university seemingly agreed, as it later moved away from SML.
Basic SML might not be hard to learn, but you quickly have to move on to concepts like recursion, which is not an easy concept for beginners to grasp. SML seems more suited to the theoretical, mathematical (for lack of better words) parts of computer science than to the more practical parts.
The difficult thing isn't the concept of a function being defined in terms of itself, but translating between iterative and recursive versions of the same algorithm. Recursive algorithms can feel "backwards" when you're used to iteration. Real-world procedures are much more likely to be described iteratively than recursively.
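To illustrate the kind of translation that trips people up, here is the same trivial algorithm (summing a list of numbers) written iteratively and recursively, plus the accumulator-passing tail-recursive form that SML courses tend to push toward (in Rust, for consistency with the rest of the thread):

```rust
// Iterative version: mutate an accumulator in a loop.
fn sum_iterative(xs: &[i64]) -> i64 {
    let mut total = 0;
    for &x in xs {
        total += x;
    }
    total
}

// Plain recursion: the sum of a list is its head plus the sum of its tail.
fn sum_recursive(xs: &[i64]) -> i64 {
    match xs {
        [] => 0,
        [first, rest @ ..] => *first + sum_recursive(rest),
    }
}

// Tail-recursive version: the loop's mutable state becomes an explicit argument.
fn sum_tail_recursive(xs: &[i64], acc: i64) -> i64 {
    match xs {
        [] => acc,
        [first, rest @ ..] => sum_tail_recursive(rest, acc + *first),
    }
}

fn main() {
    let xs = [1, 2, 3, 4];
    assert_eq!(sum_iterative(&xs), 10);
    assert_eq!(sum_recursive(&xs), 10);
    assert_eq!(sum_tail_recursive(&xs, 0), 10);
    println!("all three agree");
}
```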