Why ML/OCaml are good for writing compilers (1998) (yale.edu)
140 points by jasim on Dec 11, 2015 | hide | past | favorite | 79 comments

Every time I see ML on HN I hope it's about ML the language and I'm sad when it turns out to be Machine Learning instead, haha.

It's a shame the ML family of languages isn't very popular. 1ML, for instance, could be a fantastic modern language, but I don't see that happening.

My dream language would be 1ML with a co-effect system as described by Tomas Petricek (pretty big in the F# community, but unfortunately his research isn't being incorporated into F#). But yeah, unless some company goes into full sponsorship mode, ML will always be on the niche side.

Jane Street is a major user/sponsor of Ocaml.

I would call them a major user, but sponsor not so much. Their investments in the OCaml ecosystem have mostly been self-serving: driven by what they need, not what the OCaml community as a whole needs. Contrast this with Sun/Java, Microsoft/DotNET, Google/Go, etc., where development is happening on what developers everywhere are looking for.

Actually, we've funded lots of things that we don't directly use. For example, we funded the development of OPAM, which we don't use internally at all. Moreover, the funding of OCaml Labs was aimed broadly at improving the OCaml ecosystem, rather than just addressing our narrow internal needs. Our work and funding on compiler improvements like Flambda is aimed at things we want, but they're also of broad utility to the users of the language. Merlin is another example of something we've supported that is useful to us internally, but also useful more broadly.

This is of course self serving in the sense that we think the OCaml ecosystem is important to our future, and so we want to help it flourish. But it's a relatively enlightened form of self interest...

That's fair, I wasn't aware that OPAM was funded by JS. It is definitely one of the better package managers that I've worked with too.

I'd like to ask you a question, as it isn't every day that yminsky responds to me. If you knew from the beginning that you would end up creating an entire ecosystem (including libraries, package managers, etc) mostly from scratch, would you have chosen SML instead? I know when you joined JS that OCaml was probably the pragmatic choice. But so many ML enthusiasts (myself included) prefer the syntax and semantics of SML and end up using OCaml/F#/Scala because the SML ecosystem is so non-existent.

I love 1ML, but so far I've failed to see the importance of Tomas' co-effect system. Can you point out why you think it's important? In particular, in what ways is it better than e.g. implicits in Scala?

Have you ever used F#'s type providers? The idea of an external environment (for example, a Web API or a specific SQL database and schema) lifted directly into the type system is pretty powerful. I'm familiar with Slick's typed SQL statements (currently using them in a project), which use a hefty dose of implicits and macros, and it kinda approaches the power of the SQL type provider, but it's still incredibly clunky and trying to debug anything is a huge PITA.

A co-effect system is like a type provider on steroids: any aspect of the environment in which a system executes can be lifted onto the type system. This could be anything from GPU or specialty sensor availability to data security requirements.

It would allow for things like code sharing between scala & scalajs without needing to set up complicated SBT code sharing projects. You would have a scalajs main, and a scala main, and any piece of code whose type is permissible in both scalajs and scala is automatically shareable. It would also allow for cross-executable optimization. There is often a lot of code that could be optimized away from current executables if the compiler only knew more about side effects. You could eliminate dead code that goes across a wire between client and server because knowing exactly what the client/server interactions are, the compiler can know what is actually dead code.

Maybe it is possible to do some of the above with some implicit wizardry that I'm just not familiar with, and maybe it could be done with compiler and build system tricks without modifying the base language, but honestly even if it could I would still prefer co-effects. It took me one read of a blog post to really grok co-effects, whereas I've been using Scala for well over two years now and still have trouble understanding how implicits are working in the code I'm using.

For more info, you could read his blog post or academic papers linked in the blog post: http://tomasp.net/blog/2014/why-coeffects-matter/

Why can't F#'s type providers be implemented more generically as macros? Define a piece of code to run on an AST and generate types or whatever? Rust's macros, AFAIK, are powerful enough to do the same thing.

F# type providers sorta seemed like more of an answer to C#'s codegen utils like sqlmetal and xsd.exe. They're cool, but I just don't understand the limitation if we're already gonna run code at compile time.

If the macro can open a connection to a database, query the schema, and use the results to construct/inject types into the source, all at compile time, then yes. That is basically what Slick's type-checked SQL API is doing in Scala.

It sounds like something that might be a practical disaster if you don't have incremental compilation though.

Thanks, I'll read those. Also, I'm absolutely certain that F#'s type providers do provide an additional feature to the language, one that would be quite hard to replicate using macros or other existing constructs.

I'm still not too sure about coeffects, though. I'll read the blog post you linked in more detail, but from a quick glance, it seems that what they provide could more-or-less be replicated by Scala's implicits.

For example, a function requiring the context of a database or GPS module is just a function that has those two as dependencies. This has long been solved using dependency injection, which can be made implicit (but still type-checked) using Scala's implicits.
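
For contrast, OCaml has no implicits; the closest analogue is making the dependency an explicit module parameter via a functor. A minimal sketch (all names hypothetical):

```ocaml
(* The database dependency made explicit as a module parameter;
   Scala implicits would thread the same thing invisibly. *)
module type DB = sig
  val query : string -> string list
end

module MakeReport (Db : DB) = struct
  let user_names () = Db.query "SELECT name FROM users"
end

(* A stub database, just for illustration. *)
module FakeDb = struct
  let query _ = ["alice"; "bob"]
end

module Report = MakeReport (FakeDb)

let () = assert (Report.user_names () = ["alice"; "bob"])
```

The wiring is visible at every instantiation site, which is exactly the verbosity implicits (or coeffects) aim to remove.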

The security of sensitive information (e.g. a password) is a different matter. I'll have to read the paper, which seems rather technical at first glance (and I couldn't find any additional information about handling of security issues), but I imagine a lot could be done simply by wrapping the password into a monad/object and restricting access to it through the operations the monad allows.

Same here. ML-family languages are my favorite. Haskell seems to be the most popular, though I may be biased since I consider myself a Haskeller.

Between OCaml and Haskell, I would definitely pick Haskell. Even if only because I much prefer Haskell's typeclass system to OCaml's modules.

That said, I use Clojure and would choose Clojure over either Haskell or OCaml, because of the live code-reload abilities Clojure comes with, which makes it really great for rapid development. And because Clojure emphasizes immutability so much, it's trivial to keep only the state around that you want to keep, and refresh the rest. I don't think I would be nearly as fast without that.

My window manager is Xmonad, and restarting it after a configuration change does feel like hot code swapping: since the configuration file is actually Haskell code, it gets recompiled.

The Yi editor has similar capabilities.

I think hot code swapping has less to do with the language, and more to do with whatever surrounds it.

That makes sense in the context of a window manager, because the processes live outside it, meaning it has very little to no state of its own. But when developing a web app, you have to log in to it every time you reload the code, because being signed in is in-memory state.

> meaning it has very little to no state of its own

That's false.

Windows are assigned to workspaces. They are sorted in a particular order (set by the user as they move the windows). Each workspace uses a particular layout. One particular window on each workspace last had focus. There's a "current workspace" pointer. That's a bit more than "very little to no state". All of that is preserved upon reload. And I bet my hat all that state was in-memory.

Then there is Yi, a text editor. There's no avoiding lots of in-memory state in there.

Live code reload in Haskell. It's real.

> ... being signed in is in-memory state.

Can be in-memory state. Everything beyond toy-sized that I've ever worked on has externalised this sort of state to a database of some kind, if only so you can run a second copy for the sake of failover. I don't believe your example is relevant.

Or used a signed assertion that is verified with a secret.

I think there's some room for a next-gen language somewhere between the two.

It's puzzling to me that it isn't more popular honestly. My main exposure is through F#, but I would be interested in a new dialect or revision to ML, especially if it gained reasonable community support (the size of say, Scala or Clojure).

What is 1ML? I've never heard of it, and it's hard to google.

1ML is basically SML with first-class modules. It's very recent research (paper appeared in ICFP 15) so I'm not surprised there isn't much coming up.

EDIT: I'd like to add that the video from ICFP is up: https://www.youtube.com/watch?v=42Wn-mXWcms

Is there a compiler for it yet?

Only a prototype interpreter [1]. This stuff is fresh out of the research lab.

[1] http://www.mpi-sws.org/~rossberg/1ml/

Can't make it build, unfortunately:

    ocamlopt -c lib.ml
    File "lib.ml", line 24, characters 49-58:
    Error: Unbound value List.mapi
TBF, this is probably a good thing; it's not like I have any time...

List.mapi was added in OCaml 4.00.0 [1], so it's likely that you are using an older version of the compiler.

[1] http://caml.inria.fr/pub/docs/manual-ocaml/libref/List.html
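
For illustration, this is all List.mapi does; on a pre-4.00 compiler it can be shimmed locally (a sketch):

```ocaml
(* List.mapi applies f to each element together with its index.
   On OCaml < 4.00 it can be approximated with a small local shim. *)
let mapi f l =
  let rec go i = function
    | [] -> []
    | x :: xs -> f i x :: go (i + 1) xs
  in
  go 0 l

let () =
  (* Same behaviour as the standard List.mapi on 4.00+ *)
  assert (mapi (fun i x -> i + x) [10; 20; 30] = [10; 21; 32])
```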

After typing 1ml into Google, the paper is the first result, and the link header sums it up nicely: "1ML - core and modules united" :P

Ah, see, DDG gives me a lot of people asking Yahoo questions about converting things to milliliters. Completely forgot to check Google.


I think it is doing better than expected. It's the one I'm using now, and it's nice.

Personally, I really like the ideas that OCaml brings, but I feel like it comes with too many compromises in consistency of the language. So I have to ask myself: if I'm going to make compromises, which ones would I make? At the end of the day it usually comes down to choosing between a new language, and one that I already know pretty well, and the benefit of using what I already know is pretty strong.

In my experience, #4 has held up extremely well over time and is the most compelling reason for choosing an ML over another language. Algebraic data types are an absolute joy to work with when dealing with abstract syntax trees. The other points are obviously great to have, but strong language support for ADTs is key, in my opinion.
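
As a minimal illustration (not from the article) of why ADTs fit ASTs so well, here's a toy expression type and evaluator; the compiler checks the match for exhaustiveness:

```ocaml
(* A small expression AST as an algebraic data type. *)
type expr =
  | Int of int
  | Add of expr * expr
  | Mul of expr * expr
  | Neg of expr

(* The compiler warns if any constructor is left unhandled. *)
let rec eval = function
  | Int n -> n
  | Add (a, b) -> eval a + eval b
  | Mul (a, b) -> eval a * eval b
  | Neg a -> - (eval a)

let () = assert (eval (Add (Int 1, Mul (Int 2, Int 3))) = 7)
```

Adding a constructor to `expr` immediately flags every match that needs updating, which is exactly what you want when growing a compiler.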

I'm not certain I agree with #3. It seems to defeat the purpose of a strong type system. Either way, it can be very nice to express meta-level constructs in a matching object-level type. For example, if the language you are writing a compiler for has an int32 datatype, but you use an int64 in the language you're writing the compiler in, you'll need to simulate overflow. It would just be easier and safer to use an int32 in both places.
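
For example, OCaml's Int32 module gives exact 32-bit wrap-around semantics, so a constant-folding pass can match the target language instead of simulating overflow by hand (a sketch; `fold_add` is a hypothetical name):

```ocaml
(* Using Int32 makes overflow behave exactly like the target's int32,
   rather than silently widening in the host language's native int. *)
let fold_add (a : int32) (b : int32) : int32 = Int32.add a b

let () =
  (* max_int32 + 1 wraps to min_int32, as the compiled language expects *)
  assert (fold_add Int32.max_int 1l = Int32.min_int)
```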

These days, I'd recommend Menhir over ocamlyacc unless you have a very specific use case that the former breaks on but the latter works on.

> In my experience, #4 has held up extremely well over time and is the most compelling reason for choosing an ML over another language.

I have a hard time choosing between ADTs and the module system, honestly. If I really had to choose, I'd probably pick ADTs, but it'd be very close. It's really disappointing every time a new language comes out and it has a weak module system. Especially F#, which was primarily influenced by OCaml.

A thousand times this; I've been working on compiler-y things in C++ for the past couple of years, and I am firmly convinced that the visitor pattern is a truly inadequate substitute for a proper destructuring pattern-matching construct in a language.

I love writing compilers in OCaml: I T.A. a compiler class and my own implementation of the project (a compiler for a not-quite subset of Go) is written in OCaml. Compared to my thesis project which is written in Java, OCaml is a breath of fresh air and I feel that the single, most important feature of OCaml is its type system, especially having sum types available. Once you learn to use sum types to design your solutions, it's very hard to go back to other languages that don't support them and where you need to figure out the least painful way to emulate them.

Hey - minor long shot but have you got a link to the course? Or is it on edx/coursera?

Not the original comment author, but I thoroughly enjoyed taking Penn's undergraduate compilers course, which used OCaml to compile a basic OO language with single inheritance and dynamic dispatch. http://www.seas.upenn.edu/~cis341

Hi, no the course isn't on a MOOC platform. The website from last year is available here: http://cs.mcgill.ca/~cs520/2015/

Just in case: Coursera has a course where a bit more than a third is devoted to SML (SML/NJ): Programming Languages by Dan Grossman (University of Washington).

MLs are way better than the average blubs for writing compilers, yet they still have some issues. The worst is the amount of boilerplate code needed to implement even the smallest pass over an AST. Say you want to visit and rewrite all the identifier nodes: you must still implement recursive functions for all the node types. In Haskell, a bit of Scrap Your Boilerplate magic relieves this problem somewhat, but there is still a long way to go to reach the convenience of Nanopass.
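
To make the complaint concrete, here's a hypothetical sketch of such a pass in OCaml: only the identifier case does real work, yet every other constructor still needs an explicit traversal clause:

```ocaml
type expr =
  | Ident of string
  | App of expr * expr
  | Lam of string * expr
  | If of expr * expr * expr

(* Rewrite only Ident nodes, yet App, Lam, and If all still need
   their own hand-written recursion cases. *)
let rec rename f = function
  | Ident x -> Ident (f x)
  | App (a, b) -> App (rename f a, rename f b)
  | Lam (x, b) -> Lam (x, rename f b)
  | If (c, t, e) -> If (rename f c, rename f t, rename f e)
```

With dozens of constructors in a real AST, every tiny pass repeats this traversal skeleton.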

The next issue is related: slightly amended AST types are hard to define; they cannot be derived from existing ones by a simple rewrite.

Also, MLs do not allow an easy way to handle metadata transparently.

What I really want to have is ML type system and Nanopass density combined in one language (working on it, not done yet).

There are many ways to avoid the boilerplate you mention. The deriving ppx extension is one of them. I personally used my own poor man's deriving called ocamltarzan, which I used in pfff to factor out boilerplate via auto-generated visitors. See https://github.com/facebook/pfff/blob/master/lang_php/parsin...

Yes, it looks much better than an ad hoc ML. As I said, tricks like Scrap Your Boilerplate are very useful, although they're still not as dense as Nanopass.

While a straightforward depth-first visitor is relatively easy to infer, more complex strategies (e.g., the one you'd need to resolve lexical scoping) are still a bit messy. And if your visitor is building a new AST out of an old one, things get much trickier.

You can usually generate the visitor function, if your ML has a decent macro language (as does OCaml).

I'll give you the one about amended ASTs though. I once had to write an HTML / CSS processor. It started with the HTML DOM [tree] and performed successive processing stages on it, each time adding a little bit of extra data into the tree and extra node types. I had to tediously write an html type, an html1 type, an html2 type and so on. Never did find out if there's a better way to do this.

I don't know that anyone has really solved that problem:


Honestly, maybe the solution is just to look at the problem a different way, and use an attribute grammar instead, like Ted Kaminski mentioned.

I think I found a relatively sane solution to the expression problem, which employs quite a weird type system. The trick is to use a permissive (nearly "dynamic") first typing pass, using a "weak set" unification (a special kind of a Prolog variable which unifies with any other weak set, merging the values together), and then doing another typing pass to find out if the union types inferred this way are isomorphic to the expected declarations.

This way, tagged unions can be expanded, cut down, rewritten using simple rules - all the stuff you can do in Nanopass, but strictly typed with a Hindley-Milner type inference.

Asking as a complete layperson: are there any other strongly typed languages that don't have this issue when parsing an AST and adding metadata during subsequent passes?

Well, in ML you can easily create an AST type that allows you to add metadata as you go, by using option types. The problem is that what you really want is to have multiple representations of the AST, each only supporting relevant data.
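
A minimal sketch of the option-type approach (hypothetical names): metadata slots start out None and later passes fill them in, at the cost of one type carrying every pass's fields:

```ocaml
(* One AST type shared by all passes; metadata slots start empty. *)
type expr = {
  node : node;
  mutable ty : string option;   (* filled in by a later typing pass *)
}
and node =
  | Var of string
  | App of expr * expr

let mk n = { node = n; ty = None }

let () =
  let e = mk (Var "x") in
  e.ty <- Some "int";           (* a later pass attaches its data *)
  assert (e.ty = Some "int")
```

The downside is that the type system no longer guarantees the typing pass actually ran: every consumer must handle `None`.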

You could potentially solve the problem using macros, to generate different versions of the AST, but it doesn't solve the problem of having multiple ASTs.

I suppose you could parameterize your type with whatever metadata should be on it, which means you only have to create one tree. But then pattern matching on it becomes a little bit messier, and your type is a bit more complicated.
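
The parameterized variant might look like this (a sketch): one tree definition, with the per-pass annotation as a type parameter, so "new ASTs" are just new instantiations:

```ocaml
(* 'a is whatever metadata the current pass attaches to each node. *)
type 'a expr =
  | Var of 'a * string
  | App of 'a * 'a expr * 'a expr

(* Re-annotate a whole tree without redefining the type. *)
let rec map_ann f = function
  | Var (a, x) -> Var (f a, x)
  | App (a, g, x) -> App (f a, map_ann f g, map_ann f x)

(* A freshly parsed tree carries unit; a typed tree carries a type name. *)
let parsed : unit expr = App ((), Var ((), "f"), Var ((), "x"))
let typed : string expr = map_ann (fun () -> "?") parsed
```

Every pattern now destructures the extra annotation component, which is the "messier matching" cost mentioned above.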

So it's not that the problem doesn't have any solutions at all, it's just that none of them are totally satisfactory.

The reason you don't have this problem in untyped languages is that you aren't trying to type them :). The first solution I mentioned, using option types, is effectively how you would handle it in a dynamic language.

Exactly, the best compilation technique involves dozens of small passes, ideally with a new AST after every pass. And rewriting the whole thing so many times is very error-prone and clumsy in general. If you add a new variant to the first AST you may end up amending dozens of the lower ones.

My ideal approach is the one of Nanopass (or MBase, which was developed independently), where you can derive an AST from an existing one using simple rules.

For example, my usual trick for a typing pass is to enumerate all the expression nodes, giving them unique ids (that will become free variables in the type equations). The new AST is usually generated with a couple of lines of code:

  ast typed: simple (simpleexpr -> osimpleexpr) recform {
    simpleexpr = t(ident:ttag, osimpleexpr:e);}
And the rest of the AST can be very complex and elaborate, but we only wanted to rewrite `simpleexpr` nodes here and nothing else.

What about Lisps then? They don't have the type system but you probably win on verbosity. I thought that Racket and other Lisps have some static typing/ADT support too, but I'm not familiar with them.

I first heard the term "nanopass" from this paper [1], which is done in Scheme:


[1] Sarkar, Dipanwita, Oscar Waddell, and R. Kent Dybvig. "A Nanopass Framework for Compiler Education."

> What about Lisps then?

That's what I'm using at the moment, but I'm really missing the strict typing. It's often hard to debug even the simple passes if they're constructing invalid trees. A type system would have picked it up easily.

For this reason I'm working on a hybrid DSL which combines all the power of a Nanopass-like visitors language and a Hindley-Milner type inference with a little twist to overcome the expression problem. Plus all the bells and whistles like an enforced totality, combined with locally allowed imperative things (updating hash maps, accumulating data destructively in lists, etc.).

This article covers setting up Common Lisp with GADTs and exhaustive pattern-matching checks. https://chriskohlhepp.wordpress.com/metacircular-adventures-...

> We note that we get a non-exhaustive match warning at compile time, not run time. This is the behaviour we would expect for example from OCaml. Indeed, in our little game we might change the simple use of keyword symbols to algebraic data types and thus add compiler hosted verification of our pattern matches. To bring our exploration of the meta-circular abstraction to a full circle: Abstract data-types are an ideal medium in which to represent Abstract Syntax Trees or ASTs. Indeed the whole of Lisp syntax is itself fundamentally an Abstract Syntax Tree. So let’s see if we can represent trees using the cl-algebraic-datatype package. That would amount to an Abstract Syntax Tree represented inside an Abstract Syntax Tree. We will close with that little teaser.

Is the DSL embedded in a Lisp or is it an entirely new language? Isn't there a lot of existing work on type systems and ADTs in Lisps? e.g.



I ask because I have run into the typing problem writing language tools in Python, and then I switched to OCaml. I am not that experienced with it, but I see the problem you're talking about. I'm wondering if I should go to a Lisp. Racket looks interesting and this seems like something the community cares about.

Yes, it is an eDSL on top of Lisp (but I also have a proper ML which runs on top of Lisp, so it does not matter much).

The issue with the existing type systems is that they do not solve the expression problem, for this reason I had to build a new one from scratch.

I'd suggest taking a look at Nanopass (there is a Racket port). I doubt it will play well with any external type systems, but at least some of the ideas in it are very valuable.

I proposed integrating the FLINT ML compiler and VLISP in some form for a heightened correctness argument, ML safety, and Lisp power: kind of several copies of the software, corresponding to the strengths of each, catching various problems and assuring correctness of the tool itself.

Never occurred to me to embed ML in Lisp. Looks so obvious in hindsight. You might be onto something there. Integrating yours with mine, if I get back to using those, might mean embedding CakeML or an OCaml subset in VLISP, with it emulated in Racket or a CL knockoff of it, for fast, day-to-day development.

Btw, what's the expression problem? I've either forgotten or not got that deep into PL design yet.

Originally ML was implemented in Lisp. It was the metalanguage for a theorem prover: LCF, the ancestor of HOL.


Older ML (not something like SML) in Lisp is still available in some old version of that prover.

Interesting. So, basically taking it full-circle back where it came from. That is, if one still uses it with HOL4 for verification.

The term was coined by Philip Wadler: https://en.wikipedia.org/wiki/Expression_problem

The older versions of Bigloo used to contain an ML compiler built on top of Scheme (they dropped it later for some reason), so the concept is not new.

Appreciate the reference. Will look into it in the future. The Wikipedia summary made sense, but the recompile part doesn't bother me as much as others. My preference is to make compiles as fast as possible with good incremental compilation and caching. I've even pushed for trying to FPGA-accelerate compiler passes if possible.

My solution allows incremental compilation. It is just a type system hack, and it does not even involve backtracking (my second favourite compilation technique), just a two pass typing instead of a single pass.

As for the compilation speed, the Nanopass-like approach is very parallelisable and there is a lot of space for nice optimisations (like pass fusion).

Curious, do you publish your tools that implement your solution?

Only partially (see the discussion elsewhere in this thread). It's a work in progress. But I hope to be able to roll out a working prototype soon.

The main component of this type system (weak sets unification) is already published.

Cool, cool.

IIRC, you are into both tech like ML and hardware. I know of LISP machines & even a G-machine (Haskell) emulator on a Forth chip. However, given verification and non-Turing push, I was looking for a CPU to run algorithms written in Standard ML. Found one and thought you'd enjoy reading it too:


Thanks! An interesting one. It's unusual to see an SKI-based implementation of an eager language.

One of the main blockers of wider OCaml adoption, I think, is poor Windows support: for example, missing Unicode support [1] or the lack of opam [2] on this platform.

[1] https://github.com/ocaml/ocaml/pull/153

[2] https://github.com/ocaml/opam/issues/2191

> ML has, perhaps, the best gc around; it's so fast that for many real apps it's as fast as the C++ malloc/free, and maybe even a little faster in some cases. You don't have to wring your hands about using ML's gc, as you do with Java's, which is slow.

This is from 1998. I highly doubt this is true today. HotSpot now has an incredibly good generational, concurrent garbage collector.

Other tracing collectors have certainly caught up since 1998, but I'd keep my eye on two things that have also happened since 1998:

1. Work by Damien Doligez on OCaml's current tracing collector (landing in 4.03) that further reduced pause times.

2. The multicore OCaml tracing collector work that's going on right now at Cambridge. If I understand correctly, this work not only partitions memory into multiple thread-local heaps plus a shared heap, but also improves the collector in general.

I haven't tested it but I'm curious if the combination of this work will again put it ahead of other alternatives.

I went through the videos of a coursera class for Programming Languages with Dan Grossman. https://www.coursera.org/course/proglang

He starts teaching programming with ML and then moves to Racket and ends with Ruby.

I had tried to teach myself Haskell several times, but it always fell flat. I ended up loving ML and Racket (especially Racket). The 1ML mentioned here does look very interesting. Racket is pretty amazing to me: I learned a ton and was able to really improve my code in Python and R.

SML is one of a very few languages that have a formal description of the meaning of the language.


This is also true for C [0,1] and other mainstream languages [2] now.

[0] http://robbertkrebbers.nl/thesis.html
[1] https://github.com/kframework/c-semantics
[2] http://www.kframework.org/index.php/Projects

I used SML/NJ for a course I took last Fall. It was my first experience with ML other than looking over some sample code on the web. I'm not sure why, but my thought processes just work well with ML. Beautiful language. Very expressive while light on verbosity.

I heard an interview with Benjamin Pierce in which he extolled the virtues of the OCaml compiler. While he said that many other languages are interesting to him, OCaml is the go-to for getting stuff done.

Given people's comments and the linked post, my project for next Summer is going to be writing a compiler in OCaml for some simple language I'll define and implement.

For interested people, there is a very nice OCaml MOOC for beginners currently: https://www.france-universite-numerique-mooc.fr/courses/pari... !

A (classic?) book on writing compilers with ML:


Despite being almost 20 years old, the only thing about this that sounds anachronistic is extolling the virtues of exceptions failing at runtime.

While #8 is useful, I've been writing a static analysis tool for JS in OCaml, and it's been extremely hard to refactor when type inference is happening on function parameters, mostly because I can't easily check whether I've refactored a function to use my new types.

But on the other hand, I think that's because (I at least get the impression) there's a different strategy for refactoring functional programs effectively. I haven't quite figured out what that is, though.

Actually, in Haskell the usual coding style is to explicitly declare the type of all top-level things, which kind of contradicts his first example of 8. The idea is that every time you change a type, all uses of it that use the previous type are then highlighted as compile-time errors, which is definitely useful for refactoring. I'm not sure about the example that he gives, but if used this way type inference is actually quite useful for refactoring. Plus, all intermediate non-top-level types are adjusted automatically whenever possible, which saves work — I think this is really what he's saying.
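
The same discipline carries over to OCaml (a sketch, with hypothetical names): annotate top-level definitions, let inference handle the locals, and a changed type then surfaces at every stale use site as a compile error:

```ocaml
(* Explicit top-level signature: if the record changes shape, every
   function annotated against it fails to compile, pointing straight
   at the code that needs updating. Internal lets stay inferred. *)
type user = { name : string; age : int }

let describe (u : user) : string =
  let header = "user: " in      (* type inferred locally *)
  header ^ u.name

let () = assert (describe { name = "ada"; age = 36 } = "user: ada")
```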

In the meantime, several newly designed languages do in fact have most or all of these features along with other pragmatic advantages.

Rust and Scala being perhaps the most popular and maybe best ones.

These sound like good reasons for using it for compiler front ends. Is it equally helpful for compiler back ends? I mean optimization and register allocation.
