
Why ML/OCaml are good for writing compilers - groovy2shoes
http://flint.cs.yale.edu/cs421/case-for-ml.html
======
swah
Because you can do stuff like:

    
    
      type tree = Empty
                | Leaf of int
                | Node of tree * tree
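
A hedged sketch of the kind of function such a type invites (`sum` is my own name, not from the comment): the compiler checks that the match covers every constructor.

```ocaml
type tree = Empty
          | Leaf of int
          | Node of tree * tree

(* Sum every leaf in the tree. If a constructor is left unhandled,
   the compiler warns, so the match is exhaustive by construction. *)
let rec sum t =
  match t with
  | Empty -> 0
  | Leaf n -> n
  | Node (l, r) -> sum l + sum r

let () =
  print_int (sum (Node (Leaf 1, Node (Leaf 2, Empty))))  (* prints 3 *)
```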

------
larsberg
I would expand on point #4 about type constructors. Datatypes are so easy to
define that it's easy to create intermediate representations (IRs) that
enforce static properties of your language. In Manticore, we have four IRs we
move through during the compilation process, each of which is easier for
certain types of optimizations or follows a set of constraints:

<http://manticore-wiki.cs.uchicago.edu/index.php/Image:ManticoreCompilerStages.png>

For example, in the AST representation, you can still have anything on the
right hand side of an "=". Once you get to BOM, it has been normalized, so
every subexpression has a unique variable attached to it. So, "val x = 1+2*y"
is now "val x' = 2*y" and "val x = 1+x'". This transformation makes
identifying common subexpressions trivial.
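
As an illustration, here is a minimal sketch (my own toy IR, not Manticore's actual BOM datatype) of why the normalized form makes CSE trivial: once every subexpression is bound to a variable, finding duplicates is just a table lookup on right-hand sides.

```ocaml
(* A toy normalized IR: each binding names exactly one primitive operation. *)
type rhs =
  | Const of int
  | Add of string * string   (* operands are always variables *)
  | Mul of string * string

(* CSE over straight-line bindings: if an identical right-hand side was
   already bound, alias the new variable to the earlier one. *)
let cse bindings =
  let seen = Hashtbl.create 16 in
  let rec go = function
    | [] -> []
    | (x, r) :: rest ->
        let b =
          match Hashtbl.find_opt seen r with
          | Some y -> (x, `Alias y)                 (* duplicate: reuse y *)
          | None -> Hashtbl.add seen r x; (x, `Op r)
        in
        b :: go rest
  in
  go bindings
```

Given `[("x'", Mul ("two", "y")); ("z", Mul ("two", "y"))]`, the second binding comes back as an alias of `x'`.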

In a compiler written in C, creating a new set of IR types involves either
copying header files or horrible template magic that tightly ties all portions
of the compiler together (I've seen people try!), so most people end up just
keeping one IR and doing passes over it. Some variables are valid at some
stages; some are not.

A good example is the very nice V8 javascript compiler. I was playing around
with it a few months ago, and just understanding which invariants held after
which phase was challenging. And that doesn't even cover dealing with an
endless series of conflicting changes: every time somebody changed anything,
it meant changes to core structures.

------
joe_the_user
_ML was conceived as a solution to the main problem faced by mathematicians
using automated theorem provers: that because of the type-free, dangerous,
cavalier nature of the usual language for that domain (Lisp), you could never
be sure that your program was going to work._

I'm a c++ person but I'm curious how folks feel about this statement from the
article.

~~~
anon_d
I've written several (toy) compilers in both Lisp and Standard ML. I
completely agree with the statement. Mostly, algebraic data types are a _huge_
win for representing expression trees.
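
As a hedged illustration (my own toy example, not from either of those compilers): an expression tree and its evaluator fit in a few lines, and the compiler flags any constructor the evaluator forgets.

```ocaml
(* A tiny expression language as an algebraic data type. *)
type expr =
  | Int of int
  | Neg of expr
  | Add of expr * expr

(* Structural recursion mirrors the shape of the type exactly;
   a missing case is a compile-time warning, not a runtime surprise. *)
let rec eval = function
  | Int n -> n
  | Neg e -> -(eval e)
  | Add (a, b) -> eval a + eval b
```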

~~~
calibraxis
He claims, "ML programs can't crash the system; if it compiles, it will run,
and you won't get a segmentation fault."

Does that mean there's static safeguards against out-of-memory errors? (This
is not snarky; I don't know ML or how it handles this situation.)

~~~
tomjen3
Out of memory errors generally don't happen on modern systems. What happens is
that you end up using so much swap that the program becomes so slow that the
user cries himself to sleep or something.

~~~
mhd
> Out of memory errors generally don't happen on modern systems.

Well, unless we're talking about smartphones or virtualized environments (e.g.
a VPS). Neither is very relevant in a compiler context, of course.

~~~
calibraxis
Yesterday, I got a Java error about running out of heap space, when I was
testing a Clojure program on an extreme case (to test its limits). (Which of
course runs in a VM, but I'm not sure you meant the JVM.)

~~~
groovy2shoes
The error doesn't mean that your system is out of memory. The JVM allocates a
glob of memory at startup to allocate objects from, and is configured with a
maximum heap size. If you run out of heap space on the JVM, you can relaunch
the java process with a higher max heap size using the -Xmx option.

------
mhd
Speaking of ML, is anything from that family besides OCaml still actively
developed? Just a few years ago you had several projects (SML/NJ, MLton,
AliceML); nowadays it all seems pretty stagnant.

~~~
larsberg
We are currently working on Manticore (<http://manticore.cs.uchicago.edu> ).
It's not at a state where you would want to use it for anything in production,
but we have best-in-class parallel functional language performance. We
recently picked up another NSF grant, so you'll see at least another four
years of work on it.

SML/NJ is still supported but there is little active development going on. We
had a small infrastructure grant a few years back and cleaned up a bunch of
cobwebs that had grown in the runtime, particularly around Windows support,
and still support its active use in a few classes. But, it is fairly mature
and stable. The only pieces of work I am aware of are some cleanup work
related to the integration of the FLINT code generation backend with the
frontend and some work we've been doing identifying places where the
Definition was vague or the language needs to be extended.

MLton is also actively supported, and at this point it is stable and provides
extremely high performance for sequential programs (competitive with hand-
coded C, though all bets are off if you start cheating and using compiler
intrinsics). The primary maintainer has a large list of potential additional
projects, but I steal a significant portion of his time picking his brain, as
he is a co-PI on Manticore :-)

I don't know anything about AliceML.

------
winsbe01
nice read! I wonder how much of it is still relevant today (e.g. gc speed in
Java, relative superiority of its library, etc.). OCaml is one of those
languages that interests me, but I don't know if I'd ever have a reason to
learn it to do something specific, when the languages I've been using are
already pretty flexible.

Also (unfortunately, IMHO), this talks a lot about speed, which was a big
problem in the Pentium 200 days with minimal RAM, cache, etc. Nowadays, any
old poorly written, garbage-collected program runs speedy as hell. It seems
the golden days of optimization are lost.

~~~
colanderman
Point 4 (algebraic data types) is really what's important; especially tagged
unions. ML's type constructors map very very well to abstract syntax trees.
Java has no equivalent (enums are a distant cousin).

Of course if you are a believer in higher-order abstract syntax (the idea that
one should use constructs such as lambda abstractions as part of your syntax
tree), Scheme is a better fit, as ML's type system doesn't allow for such
wildly typed syntax trees, nor does it allow data-as-code. (I think HOAS is
hogwash, but then I'm a firm believer in well-typed code and against data-as-
code.)

Of course, if your AST is more a graph than a tree, you're better off with a
logic language such as Mercury or Prolog. I've had _very_ good experiences
writing compilers and interpreters in Mercury.

~~~
groovy2shoes
I'm curious as to what a language with a less tree-like, more graph-like AST
might look like. Could you provide any examples of such languages?

~~~
xyzzyz
For instance, this comes up when you do common subexpression elimination -- if
you refer to the same code in two or more places (e.g. call the same function,
compute the same math expression), and it's known not to have (or depend on)
any side effects, you can compute its value only once and refer to that value
in both places. I recall it's covered even in the Dragon Book, which is fairly
old.

~~~
stiff
That sounds rather confused; an Abstract Syntax Tree by definition can't be a
graph. What you are referring to is some kind of intermediate form used by a
compiler. The details may vary greatly, but directed acyclic graphs, for
example, are often used for common subexpression elimination. They are most
often generated from some other intermediate form (straight-line code), and
they have more in common with the code that has to be generated than with the
original source code. Of course one can imagine annotating the AST and
performing this optimization directly on it, but then it is not an AST
anymore.

~~~
xyzzyz
I never used the term "AST" in my comment. I assumed an intermediate DAG is
what the parent poster really meant.

------
mas644
-Good points about OCaml/Standard ML. One big omission, though (at least I did not see it): the lack of emphasis on the concept of "pattern matching". I'm not sure about SML, but with OCaml, everything is implemented using pattern matching. Pattern matching is an important concept in all functional languages to begin with. What makes ML datatypes so powerful is the ability to pattern match on typed expressions. Unlike in dynamically-typed functional languages like Lisp, your expression and all of its subexpressions have a type -- more importantly, a type that can be statically resolved. When you write code, it's guaranteed to be safe, meaning that the code will not corrupt memory or perform operations that do not make sense (e.g. try to take the square root of the string "banana"). Unlike other, clunkier statically typed languages, type inference automatically resolves the types of most values -- so the code ends up looking compact and clean like a dynamic language, but at the same time it's type safe. Type safety is a big deal... ML makes it impossible to violate the type system. No recasting allowed, no way to corrupt the process memory, no way to cause runtime type errors :)
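
A minimal sketch of the inference point (the function and its name are mine, not the poster's): no annotations appear anywhere, yet the compiler statically resolves every type from the patterns alone.

```ocaml
(* No type annotations, but the compiler infers
   describe : int option -> string
   from the patterns, the comparison with 0, and the string results. *)
let describe = function
  | None -> "nothing"
  | Some n when n < 0 -> "negative"
  | Some _ -> "non-negative"
```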

-Where it still fails, though -- and this is a big deal in this day and age -- is kernel-level thread support. User-level threads can solve a lot of problems, but when you're dealing with heavy number crunching and the algorithms of the future (computer vision, AI, highly parallel search, etc.), you need to use all the cores! Intel is talking about having thousands of cores on a single die by the end of the decade. Do we have to run 1000 OCaml processes and do process-level message passing?? A lot of people are now looking to Microsoft's F# (virtually the same as OCaml sans OOP) as it targets .NET and supports true parallelism and a thread-safe garbage collector.

-One thing about the original post, which mentioned the lack of OOP: perhaps it's not in SML, but OOP is well supported in OCaml (hence the O). People haven't given it a chance. It's really nifty... the biggest complaint I have is that type information from the compiler is hard to read due to the notation used for object types. Also, all of the standard libraries and most of the community use only the functional subset. There's good reason to, though: functional programming is very flexible.

-There are a lot of libraries out there for OCaml created by the community. However, they vary in their level of documentation. For the most part, though, I've found 3rd-party OCaml libraries to be of high quality due to the elegance of OCaml. Also, it's pretty easy to take a C or C++ library and write an OCaml binding for it. This is true of most languages, but it's annoying when the thing is already written for C++/Java/Python/etc.

-Oh, one really really cool feature of OCaml that nobody ever talks about -- you can actually read the code for the standard library! Have you ever looked at files like "iostream" for C++ or "stdio.h" in C? There are macros and templates and all sorts of ugly craziness that nobody can read. I was able to open the standard library in OCaml and actually read it. I could see how they implemented standard modules like List, Thread, and Array. What's interesting is that most of the code would be considered inefficient by imperative programmers due to heavy use of recursion. However, simple tail call optimizations by the compiler save the day!
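
To sketch that last point (a hand-written example in the stdlib's style, not the actual stdlib source): the accumulator version below recurses once per element, but because the recursive call is the last thing the function does, the compiler turns it into a loop with constant stack space.

```ocaml
(* List length in accumulator style: `go` calls itself in tail position,
   so OCaml reuses the stack frame -- no overflow even on huge lists. *)
let length lst =
  let rec go acc = function
    | [] -> acc
    | _ :: tl -> go (acc + 1) tl
  in
  go 0 lst
```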

