Hacker News new | past | comments | ask | show | jobs | submit login
A Compiler for Standard ML in Rust (github.com/somewhatml)
116 points by eatonphil on Sept 9, 2021 | hide | past | favorite | 49 comments

This is a very cool exercise. I see on the roadmap in the README that the approach seems to be "horizontal" in writing the compiler's parts (complete parts are written starting from the parser and it will end with writing the native code generator). While this can work if you really know where you're going from the beginning, I would strongly recommend to use a more "vertical" approach where you stack language features over language features, each completely implemented from parsing to native code generation. So you may start (in the extreme case) with a language that can only understand integers (and that's it, not even basic arithmetic operations) but which at least is an actual language implementation, and can be tested for regression each time you add new features (which can be built on top of the already implemented, known-to-work, features).

Whenever someone is applying the "vertical" approach to build a compiler, it looks really awesome to me. Helps to see the big picture right away -- it's much easier to pick up finer details if I understand a topic in general first. Also, seeing a project working -- even if the majority of functionality is missing -- and growing piece by piece works much better to keep me motivated.

So, when I was starting to work on my own toy compiler, I was totally going to go with the vertical approach... Well, I quickly realized it didn't work for me for writing my first ever compiler.

You see, when you go vertical you have to learn a little about every layer of a compiler with every language feature you add: a bit of lexing, a bit of parsing, a bit of semantic analysis, a bit of assembly generation. Basically, you have to constantly switch context between the layers. And I personally found it much easier to learn about managing stack, calling conventions and translating expressions into assembly in one go without switching context away from the code generation layer.

In contrast, layers of compilers tend to be rather loosely coupled, so you don't have to think about parsing at all while working on semantic analysis.

But if I ever set out to write a second compiler, I'll try to go vertical again. Now, that I have more or less clear idea of how to build each layer, it feels I can build them adding slice by slice. What helps is I can keep a picture of the end result in mind when adding each piece.

This is a valid point. I'm teaching a compilation class to undergrad computer science student, so we actually first learn each "horizontal" step separately and work on them during lab session, and then for their project I strongly recommend that they take the "vertical" approach, because it is systematically what gives the best results.

In a sense, even if my student's project is their first compiler, they already saw and did exercise on each compiling step individually, a bit like you said.

> In contrast, layers of compilers tend to be rather loosely coupled, so you don't have to think about parsing at all while working on semantic analysis.

I agree, this is a illustrative example of data coupling: Very loose, easily observable and there are even nice visualization techniques for certain steps. Each can be understood and tested in isolation.

Also how vertical is the "vertical" approach really? In a first iteration you ought to lay the foundation of each component. Later you just add additional branches/structures so to speak.

This is fine, but I think there is a minimal sub-language that you want to implement on the first iteration, that is sufficiently small to not to get lost in the more hairy details, but large enough so it presses you to find a 'workable flexibility' for each of the components.

> Also how vertical is the "vertical" approach really?

It can be quite vertical, really. Like starting with a compiler for a language where the only correct programs consists of a single integer. This is what I do with my students to get them started on their compiler project. Of course you may have to rethink and rewrite or adapt some code depending of the order in which you're adding features and how they interact, but that will be true anyway if your language evolve :).

Thanks - author of the repo here. This really started off by me reading Pierce's Types and Programming Languages textbook and getting interested in implementing typecheckers and whatnot... and kind of spiraled into a full blown compiler. So it's ultimately a learning project for me and not meant to be a production-ready compiler (we already have MLton).

After this experience, I do agree that going with the "vertical" approach you described would be significantly easier - I actually considered doing this for just the raw lambda calculus so that I could get CPS transform and native codegen working on a smaller scale.

Unfortunately, I haven't had time to work on the project in a while (I'm finishing my PhD in chemical biology), but I hope to come back to it soon.

Great job!

I realize this is more about Rust and writing compilers. But if anyone is interested in finding out more about Standard ML, give a shot to the Programming Languages course on Coursera[1]. Part A is devoted to Standard ML (SML) . To say the instructor is great would be a huge understatement. The code we wrote for assignments used only immutable values, recursion instead of loops, partial application. It didn’t make me a hardcore functional programming proponent, but seriously broadened my horizons. On top of that, we used pattern matching extensively, which in combination with discriminated unions impressed me the most: using these two to model a lot of problems felt really natural.

1: https://www.coursera.org/learn/programming-languages

And you also learn Scheme! I haven't taken course, but I've used profs online resources from it for self-teaching and they are great.

We used MoscowML and later SML/NJ at the University and I have fond memories of SML. I find SML very beautiful, more so than OCaml and Haskell.

I agree, SML just seems very "clean" and well-designed. I think it hits a sweet spot between simplicity, clarity and power.

SML is pretty much the "Scheme" of the ML family, which is what drew me to it as well.

really nice language for sure

I’m in the early stages of working a compiler in Rust and haven’t gotten to the IR infrastructure yet. With Rust’s borrow checker, how can I make the IR graph safe, with its control flow preds and succs and data flow inputs that make it so there isn’t a clear owner for any node. How does rustc do this?

You may find this paper interesting:

GhostCell: Separating Permissions from Data in Rust <https://plv.mpi-sws.org/rustbelt/ghostcell/paper.pdf>

The essential idea is that you can use one of these "GhostCell" structures to assign a lifetime to a related collection of objects (such as the nodes in a graph), which allows the entire graph to be treated as if it had a single owner and a single lifetime rather than a separate owner and lifetime per node.

The paper is relatively recent (2021), and GhostCell is not, so far as I am aware, used in rustc at this point.

Author of the posted repo here! As another commenter posted, I used typed arenas quite a bit, it's not the most ergonomic way to do stuff, but it's quite performant and gets around the lifetime issues.

It looks like this project is using typed arenas a fair bit - https://github.com/SomewhatML/sml-compiler/blob/master/crate...

I don't know the details of how this is being used in the project, but it might be worth investigating.

I struggled with this as well, and ended up making the lexer itself have a lifetime which is that of the full String read from a file, so that every token (and AST node) has the lifetime of the parsed file. I keep their indexes as well so that I can show error messages like rustc itself does (with code snippets and all) which would be hard to do if I didn't have the full source code text available.

Hope that helps.

This sounds a bit like cheating. What if you want to parse a really big file and delete the tokens as soon as you don't need them?

If Rust claims to solve memory management, then we should not fool ourselves by allocating everything in a huge chunk which is then freed when the program ends.

That's why this was not my first choice... but it was just far too hard to implement it otherwise... and then again, once you really think about it, you must keep ALL tokens/AST nodes in memory to be able to analyse and type-check/lint any decent language... so keeping a big Sring in memory while all nodes just point to slices of it is not nearly as wasteful as it might appear at a first glance.

If this is cheating, many real compilers “cheat.” Heck, some of them famously don’t even free it when the program ends, leaving that up to the operating system!

The solution doesn't scale. The moment you need to deal with very large inputs, you will have to rewrite your entire application from scratch.

I prolonged my own path many times in my life, by making things harder than they need to be. (edit) Aspirations acting as a anti-shortcut somehow.

This sound nice and simple. I moving to arenas but is a extra complexity. Can you show how is done? (ie:code!)


This loses the whitespace and consequently positions... but should be enough to get you started...

I use a UTF-8 crate to split words which keeps all whitespace, so I can keep positions correctly, in case you want to do that:


You could look at Cranelift, which I believe uses integer indices, and several datastructures for control-flow vs dataflow.

There is no SSA IR in rustc itself (MIR, regrettably, only lowers control-flow, but uses variables instead of representing dataflow).

> which I believe uses integer indices

So basically it sidesteps the type system issues in a similar way that an inconvenient car trip instead of taking a flight sidesteps the TSA issues? ;)

Adjacency matrices would make ownership clear and solve the single-writer/multiple-reader issue because nodes wouldn't directly reference each other, but it keeps the exact same relationship between nodes, so any memory leaks are still possible (e.g. not cleaning up dead code because it is cyclic and appears to have references). If I'm going to defeat the type system, it seems like raw pointers would be easier.

Interesting and surprising to me how easy to read the uncommented compiler code is even to the new Rustecan (compared to my recollection of reading a compiler in C written by another person when I already knew C much better and had written my own).

Is StandardML a subset of OCaml and F# or does it have something they don't?

F# doesn't have functors but SML and OCaml do. F# has Computation Expressions which (AFAIK) SML does not have; OCaml has PPX extensions. F# has access to .NET ecosstem, JS ecosystem and multi-core. OCaml has easy binding to C but multi-core hasn't quite landed yet. There are a few competing JS compilers for OCaml.

F# is the most mainstream of the 3. If you can live without the fancier type-system features (trust me, you probably can!) it's the easiest ride.

F# has quite a bit of awkwardness related to dotnet compatibility.

In practice you have to use a lot of C# class based code, which brings an inherent mismatch to the functional model and makes code look more like a multi-paradigm language like Scala.

The tooling and ecosystem is great (though not as good as for C#, obviously...), the language has some very nice features and design decisions, but you have to live with a somewhat messy reality.

In practice sticking to C# is probably wiser.

I think a little bit of OOP-related messiness is easily worth it to be part of a mainstream ecosystem.

An interesting opinion on OOP in F# from Eirik Tsarpalis, who curiously started out their programming with pure FP in Haskell [1].

And a talk from Don Syme, the author of F#, on good F# code as they see it [2]. It's a great talk overall and they specifically address OOP's good and bad bits starting from about 39:00.

1: https://eiriktsarpalis.wordpress.com/2017/03/20/why-oo-matte...

2: https://www.youtube.com/watch?v=1AZA1zoP-II

> The tooling and ecosystem is great

When I (an utter beginner in it) code in F#, I miss ReSharper a lot. I wish there were an IDE extension with features similar that of ReSharper but for F# rather than C#.

Give Rider[1] a try? Its F# support is not as fully-featured as C#, but its, in my opinion, the best available F# IDE. Worth noting, many great features of Rider aren't C#-specific, e.g. the incremental search in files works for any language. Keep in mind, it's not free though.

1: https://www.jetbrains.com/rider/

I can recommend getting an all products pack for personal use, if you can afford it.

Many programmers will tell you that you can live with no type system at all, and they would be correct.

Personally, as an Ocaml coder, I make extensive use of Ocaml's fancier type-system features and would be sad to lose them.

> If you can live without the fancier type-system features

One could argue that not only you can, but you should (for the sake of having more readable programs).

Anther way to put it - F# has a .NET dependency :)

Sometimes the absence of features is itself a feature. SML is a much smaller language than OCaml and F#. Offhand, I can't recall anything it has that OCaml doesn't, although it does some things differently (e.g. structurally typed immutable records).

I'm not sure there is much reason to use SML itself today, given that it has stagnated since the late 90s, but it is an excellent language for teaching programming because you don't have to sweep anything under the rug. But if you want to use it, then it does actually have well-maintained pretty good compilers. MLton probably generates faster code than OCaml and F# in many cases. The main problem with SML is that there are not a lot of SML programmers around, and hence very few libraries, particularly for handling modern protocols and formats.

My first programming language when I started to study CS at university was SML using MoscowML (in 2001).

It seemed like both I and a lot of other students had a lot of issues with it. It seems like the university agreed as they later moved away from it.

Basic SML might not be hard to learn but you quickly have to go to concepts like recursion that is not a very easy concept to grasp for beginners. SML seems more suited for the theoretical mathematical (for lack of better words) parts of computer science rather than the more practical parts.

What is impractical about recursion?

Nothing, maybe I was unclear. My point about recursion was that it was hard for students just starting university to understand.

Why? How can recursion be hard to understand? I don't understand.

The difficult thing isn't the concept of a function being defined in terms of itself, but translating between iterative and recursive versions of the same algorithm. Recursive algorithms can feel "backwards" when you're used to iteration. Real-world procedures are much more likely to be described iteratively than recursively.

No it's not a subset. They are all separate languages. But F# is more influenced and similar to OCaml than Standard ML.

One thing Standard ML has that the others don't is a full formal specification, in terms of typing rules and operational semantics.

Good on you, man! I'm finding it hard to learn the language. This one has to be a great learning experience, if not widely adopted.

Would this offer performance improvements or other advantages that it's compiled in Rust?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact