A Crash Course on ML Modules (2015) (jozefg.bitbucket.io)
146 points by Tomte on Feb 25, 2018 | 53 comments

This is a good rundown of the basics of ML modules, the syntactic aspects, if you will. But it doesn't really give you the important insight:

  Abstraction. is. powerful.

If you consider the IntMap module provided in the article, you notice that the type `t` is not defined. It is kept "abstract". One defining property of ML modules is that you can indeed keep the definition of types hidden and nothing can break abstraction. The rest of the program can only manipulate the type `t` through the functions provided by the module.

In practice, this lets you encode pretty much any property you would like. In the case of maps, you can use a BST and never have to worry about users breaking the invariants: the user can't even see it's a BST! You can provide any form of validation and be sure that the data you manipulate will stay validated (as long as the module is correct, of course).

Functors let you build on this by giving you a way to abstract over a whole module. This gives you excellent guarantees: if two modules behave exactly the same, swapping them is semantics-preserving. Put another way: if I can prove that sets-as-lists and sets-as-BSTs are functionally equivalent, then I can swap one for the other, and the rest of the program will behave the same!
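To make the swapping argument concrete, here is a minimal OCaml sketch. The names `INT_SET`, `ListSet`, `TreeSet` and `Client` are made up for illustration and do not come from the article; the tree is deliberately unbalanced to keep it short.

```ocaml
(* Hypothetical signature: clients only ever see this interface *)
module type INT_SET = sig
  type t                      (* abstract: the representation is hidden *)
  val empty : t
  val add : int -> t -> t
  val mem : int -> t -> bool
end

(* Sets as unordered lists without duplicates *)
module ListSet : INT_SET = struct
  type t = int list
  let empty = []
  let add x s = if List.mem x s then s else x :: s
  let mem = List.mem
end

(* Sets as (unbalanced) binary search trees *)
module TreeSet : INT_SET = struct
  type t = Leaf | Node of t * int * t
  let empty = Leaf
  let rec add x = function
    | Leaf -> Node (Leaf, x, Leaf)
    | Node (l, v, r) as n ->
      if x < v then Node (add x l, v, r)
      else if x > v then Node (l, v, add x r)
      else n
  let rec mem x = function
    | Leaf -> false
    | Node (l, v, r) ->
      if x < v then mem x l else if x > v then mem x r else true
end

(* A functor: client code written against the signature only *)
module Client (S : INT_SET) = struct
  let has_duplicates xs =
    let rec go seen = function
      | [] -> false
      | x :: rest -> S.mem x seen || go (S.add x seen) rest
    in
    go S.empty xs
end

(* Swapping the implementation is a one-word change *)
module A = Client (ListSet)
module B = Client (TreeSet)
```

Because `Client` can only reach the set through `INT_SET`, `A.has_duplicates` and `B.has_duplicates` should agree on every input as long as the two implementations agree on `empty`, `add` and `mem`.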

Refactoring becomes a breeze. Functors (and modules) make all the Javaesque reflection and dependency injection completely redundant: modules are (in OCaml) first-class objects that you can manipulate and apply as you wish.
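As a rough illustration of modules as first-class values, the sketch below picks an implementation at run time and passes it to a function, which is roughly what a dependency-injection container would do. `GREETER`, `Formal`, `Casual`, `choose` and `welcome` are hypothetical names invented for this example.

```ocaml
module type GREETER = sig
  val greet : string -> string
end

module Formal : GREETER = struct
  let greet name = "Good day, " ^ name ^ "."
end

module Casual : GREETER = struct
  let greet name = "Hi " ^ name ^ "!"
end

(* Pack a module into an ordinary value, chosen at run time *)
let choose formal =
  if formal then (module Formal : GREETER) else (module Casual : GREETER)

(* Unpack the module value and use it like any other module *)
let welcome (g : (module GREETER)) name =
  let module G = (val g) in
  G.greet name
```

For example, `welcome (choose true) "Ada"` goes through `Formal.greet`, and no reflection is involved: the type checker still verifies that whatever is packed matches `GREETER`.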

It also gives you separate compilation! A lot of people say that the OCaml compiler is blazing fast: this is precisely because OCaml's module system ensures that each module can be typechecked and compiled incrementally.

It pains me that modules get so little appreciation. Many ML programmers get a sort of selective blindness: just like fish in the sea, they don't realize the power of what they're swimming in. Proper modules with actual abstraction are, in my opinion, the single most essential feature for "large scale" programming.

I agree this is great stuff, but let's not get carried away about types. They don't guarantee compatibility.

Two functions with the same type signature can compute different things, and code can inadvertently rely on this difference. For example, changing a hash function can break code that relied on the iteration order of a hashmap. (And in a large enough codebase with a commonly used, rarely changed hash function, this is probably inevitable.)

Also, even if the values returned by the function are the same, you can still observe differences between different implementations by comparing performance, and code can inadvertently depend on performance differences. This can cause breakage or security holes.

Still, if you can avoid type errors, you're probably doing well.

I didn't say anything about types. :) I said functionally equivalent.

What it means depends on the definition of "functionally equivalent" of course. The first version, which is types, is the most immediate and obvious, but as you pointed out, not that useful.

Abstraction, however, also works at the semantic level: if A behaves like B through the interface S, then you can confidently say that for any F (of type S -> S'), F(A) behaves like F(B). So you can just test that A and B are not distinguishable (property testing is very good at this) and feel confident about your refactoring!
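A hand-rolled sketch of that kind of test, assuming two queue implementations behind a made-up `QUEUE` signature. It uses only the standard library's `Random` rather than a real property-testing library, and every name in it is hypothetical.

```ocaml
module type QUEUE = sig
  type t
  val empty : t
  val push : int -> t -> t
  val pop : t -> (int * t) option
end

(* Implementation A: a single list, pushing at the end *)
module ListQueue : QUEUE = struct
  type t = int list
  let empty = []
  let push x q = q @ [x]
  let pop = function [] -> None | x :: q -> Some (x, q)
end

(* Implementation B: a front list plus a reversed back list *)
module TwoListQueue : QUEUE = struct
  type t = int list * int list
  let empty = ([], [])
  let push x (f, b) = (f, x :: b)
  let pop = function
    | (x :: f, b) -> Some (x, (f, b))
    | ([], b) ->
      (match List.rev b with
       | [] -> None
       | x :: f -> Some (x, (f, [])))
end

type op = Push of int | Pop

(* Run a script of operations and record only what a client can observe *)
module Run (Q : QUEUE) = struct
  let observe ops =
    let step (q, out) = function
      | Push x -> (Q.push x q, out)
      | Pop ->
        (match Q.pop q with
         | None -> (q, None :: out)
         | Some (x, q') -> (q', Some x :: out))
    in
    List.rev (snd (List.fold_left step (Q.empty, []) ops))
end

module RA = Run (ListQueue)
module RB = Run (TwoListQueue)

let random_ops n =
  List.init n (fun _ ->
    if Random.bool () then Push (Random.int 100) else Pop)

(* True if no random script tells the two implementations apart *)
let indistinguishable ~trials =
  List.for_all
    (fun ops -> RA.observe ops = RB.observe ops)
    (List.init trials (fun _ -> random_ops 50))
```

Passing such a test is evidence, not proof, and as discussed above it says nothing about performance: `ListQueue.push` is linear in the queue length while `TwoListQueue.push` is constant time.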

The definition of "behaves like" depends on your language. Usually, this doesn't account for performance, as you point out, but only for the observable semantics. This is known as parametricity [1]. If you really want to get into the deep end, [2] demonstrates all that for a (very rich) ML module system.

[1]: https://en.wikipedia.org/wiki/Parametricity
[2]: http://www.cs.cmu.edu/~crary/papers/2017/mapp.pdf

Good point! But leaving types aside, the same argument applies. "Observable semantics" seems to mean "observable according to our semantic model" which is an agreement to pretend that machine behavior outside the model isn't observable and avoid relying on it.

This agreement is normally a good thing since it's what allows for the same program to run on different machines, or to compile the same program with different optimizations, or swap in a newer version of a library with a better hash function and claim it doesn't break backward compatibility.

Nevertheless, it can be a blind spot as we saw with Meltdown and Spectre, so I wanted to emphasize that this is a useful myth but the world isn't obliged to go along with it. It's often important to observe program behavior that isn't specified by the language.

Ah, yes, this is where safety and security differ! Abstraction is about safety: preventing errors and giving additional guarantees, not preventing attacks. :)

Nothing I said holds when people are being malicious: even in OCaml, you have escape hatches that allow you to break past abstraction boundaries and do whatever you want.

There's work to address that. Look into proof-carrying code and "fully abstract compilation." There have also been proofs done for hardware and assembly against specs in ACL2 and Abstract State Machines.

Agreed, but this is where our own jargon betrays us: in ordinary usage by non-programmers, "safety" includes safety from attacks. Perhaps especially from attacks?

Ordinary usage is often murky, but the usual distinction between security and safety is not whether it is about an attack, but about the direction of the threat.

Safety means the environment is not harmed by the system. Security means the system is not harmed by the environment.

That's an interesting definition. Traditionally, we had several ways of looking at it:

1. Safety contends with accidents like components breaking. Security contends with malice, where input is crafted and failures are set up intelligently to do damage that is unlikely to happen by accident. There's quite a bit of overlap, though.

2. If we're talking about leaks, safety rarely requires confidentiality of data. It's usually integrity or availability that it shares with security. So, a security violation with covert/side channels might not be a safety violation.

Just two off the top of my head to illustrate the difference.

I think of it the same way you'd think of it in a workshop: a "safe" drill or saw is one that can be used in a safe way, not one that's safe from deliberate attacks.

While ML has lots of cool features, that kind of abstraction capability and compilation speed are also possible in other languages with modules.

Even if their modules aren't exactly ML-like, they offer similar features.

For example CLU also explored hidden types and generic interface definitions.

Modula-3 does offer opaque types, generic module interface definitions with multiple implementations, and class inheritance. However, it lacks closures/lambdas, forcing the use of function pointers, which aren't so developer friendly.

Similarly, the way traits work in Scala is quite close to ML modules.

So there are ways of achieving ML module capabilities, even if not 1:1 as they are done in Standard ML.

It seems to me that you can get the same thing in C++ with classes <> modules ; templated classes <> functors. What am I missing?

There is no abstraction in C++: you can always poke around the memory layout of what you are given, and change your behavior depending on that.

For example, let us say you have a module satisfying this signature:

  module M : sig
    type sorted
    val import : int array -> sorted
  end = struct
    type sorted = int array
    (* Sort a copy, so the caller's array does not alias the result *)
    let import a =
      let a = Array.copy a in
      Array.sort compare a;
      a
  end

In OCaml, if you have a value of type `sorted`, you know it's indeed sorted. In C++, as soon as any code external to the module has a handle on it, you don't know! It could have modified the array behind your back, since it can look directly at the definition, or worse, poke in the memory layout.

This is not accurate. C++ has statically enforced private variables that cannot be modified by outside functions without fairly painful workarounds and casts. But if you want to say that the ability to do those casts disqualifies any compile-time guarantees, well, OCaml has an equivalent escape hatch. The Obj module contains a bunch of functions to manually manipulate any object's fields and to cast arbitrarily using the "magic" function.

Yes, you can access raw memory and flip bits to modify a private member of an object (and it might even be defined behavior, I'm not too sure about that though). But I don't find that a valid argument for the statement that you can't do abstraction in C++. That's just not something people do.

The C++ version of your example would be, if I understood your code correctly, to take a std::vector as a constructor parameter, copy it to a private field and sort it.

I don't know all that much about C++'s object system, so I can't give you a concrete example of how it breaks down.

However, you say that it is "not something people do" ... well, maybe not in C++ (I highly doubt that), but it's very common in many languages. In C, it's common to look inside structs directly and change things. JavaScript libraries do it all the time: they inspect their arguments, look at the types and change their behavior depending on it. It's a common programming practice to poke deep into data structures and do things. In Java, they made it an art with reflection and monstrosities such as Spring.

Abstraction is a bit like immutability: Sure, you can try to fake it in languages that don't have it, but then you are just praying that everyone plays by your rules. :)

My comment "not something people do" was about directly accessing memory to circumvent the private data abstraction in C++, and I stand by that.

I admit that I kinda pushed you into it, but you are moving the goalposts. People indeed use public fields in languages like C and JavaScript.

In C it's often done for the sake of performance. Hiding data behind a pointer has a cost.

In JavaScript I would say it's laziness above all. Front-end programs often aren't that big, nor pinnacles of code quality.

But it is possible to define abstract data types in both languages. ML makes it a bit easier and sometimes even more performant, but it doesn't "own" the idea.

Abstractions are quite like immutability. You can enforce both in many languages, some just give you better tools for it.

The goal post, right from the start, is that by having proper abstraction (the kind that doesn't break when you blow lightly on it), you get many benefits.

You say that, even when the language doesn't enforce it, people don't break it ... except when they do. It doesn't really matter why; it simply makes everything else more brittle as a consequence and limits how you can reason about your code.

You seem to trust that programmers will play by the rules, even if the compiler doesn't enforce them. We will simply have to agree to disagree. :)

> You seem to trust that programmers will play by the rules, even if the compiler doesn't enforce them.

I've been trying to say the exact opposite. C, C++, JavaScript: all those languages provide ways to define abstract data types that cannot be circumvented (by "normal" code; even Haskell has unsafePerformIO). My latest argument was that people decide not to use those abstractions not because they are unavailable, but because it is more ergonomic or performant not to. The same happens even in ML: not all data is hidden behind an abstract data type.

So, in C++

  #include <algorithm>
  #include <vector>

  class M {
    // This is private
    std::vector<int> _sorted;
  public:
    // This is your "import"
    M(const std::vector<int>& v) : _sorted(v) {
      std::sort(_sorted.begin(), _sorted.end());
    }
    // Add other ADT operations that preserve the invariant
  };

A more indirect way that more closely mimics the ML thing:

  struct M {
    // Everything private; but as M is a "friend" it can access all members
    class Sorted {
      friend M;
      std::vector<int> value;
      // Private constructor: only M can create a Sorted
      explicit Sorted(std::vector<int> v) : value(v) {}
    };

    static Sorted import(std::vector<int> v) {
      std::sort(v.begin(), v.end());
      return Sorted{v};
    }
  };

  auto s = M::import(std::vector<int>{4, 3, 2, 1});
  ++s.value[0]; // ERROR: value is private

Yes, you can use casts to access the private members. As others have pointed out, unsafe operations exist in OCaml/Haskell/... too.

"In OCaml, if you have a value of type `sorted`, you know it's indeed sorted."

Could you expand on this? I'm pretty sure the OCaml compiler isn't sophisticated enough to make guarantees like this - i.e. I could pass in a function that satisfies the interface constraints, but doesn't actually do any sorting.

Since the function import in the signature is the only way to create a value of the abstract type sorted, and the implementation of import is given as Array.sort, all values of type sorted have indeed been sorted.
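A small sketch of why that holds, assuming `import` sorts a private copy of its argument. The `to_list` and `mem` functions are hypothetical additions, included only so the result can be observed and so that something relies on the invariant.

```ocaml
module M : sig
  type sorted
  val import : int array -> sorted
  val to_list : sorted -> int list   (* hypothetical observer for the demo *)
  val mem : int -> sorted -> bool    (* binary search relies on sortedness *)
end = struct
  type sorted = int array
  let import a =
    let a = Array.copy a in          (* private copy: no aliasing with the caller *)
    Array.sort compare a;
    a
  let to_list = Array.to_list
  let mem x a =
    (* Search the half-open index range [lo, hi) *)
    let rec go lo hi =
      lo < hi &&
      let mid = lo + (hi - lo) / 2 in
      if a.(mid) = x then true
      else if a.(mid) < x then go (mid + 1) hi
      else go lo mid
    in
    go 0 (Array.length a)
end

let input = [| 3; 1; 2 |]
let s = M.import input

(* Mutating the caller's array afterwards does not affect s *)
let () = input.(0) <- 99

(* Writing s.(0) <- 99 here would be a type error: outside M,
   nothing is known about M.sorted except the functions in the signature. *)
```

The guarantee is only as good as the module body: if `import` did not actually sort, the type checker would not notice. What the abstraction ensures is that only the few lines inside `M` need to be audited.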

An interface and safety. In C++ the template's arguments are directly copied into its body. A module can only be accessed via its interface. This in turn allows a typechecker to guarantee the absence of certain errors. It also enables separate compilation, which you do not have for C++ templates. The downside is of course that a C++ compiler can apply more optimizations.

As a side note: the argument "I can do X with Y, so why use Z?" is somewhat misleading when Y and Z are both Turing complete ;).

Type checking, type abstraction. If you're willing to throw those out, you can "get the same thing" with structs and void pointers.

Or with Modula-3 opaque types, for example.

ML should get more love. If PolyML or MLton had received even a fraction of the support behind Go, we'd have a language with all the upsides of Go (simple, concurrent, typed, etc.) and none of the horrible downsides (not functional, almost zero abstraction, empty interface, no generics, pointers, etc.).

The BuckleScript and ReasonML ecosystem is quite interesting right now; it seems to attract at least a bit of attention.

ReasonML syntax just isn't as good as the ML-style syntax. If I understand correctly, they started with SML and then switched to OCaml because there were more libs. I suppose it was a pragmatic decision, but I believe Successor ML would have been a much better long-term target. Not to mention that going with PolyML would have given them a concurrent back-end (PolyML's has had years of refinement, while OCaml's implementation is still not finished).

Can you explain what the problem with that syntax is? Is there something you couldn't express in ReasonML? It is meant to bridge the gap for existing JS developers, if I understand it correctly.

Regarding the ecosystem, I think the main goal is compilation to JS and JS interop, so other backends don't really matter that much, from my understanding. Maybe OCaml just fit the bill best as a base language, given the features they wanted to map to and from JS.

I love JS (which ReasonML tries to emulate), but the traditional ML syntax (elm, haskell, SML, etc) is simply better for me. The extra parens and curly braces everywhere don't really add anything useful to the language while complicating the syntax (well, they help solve some edge cases in Ocaml, but those cases also don't exist in most other MLs). Another result of using Ocaml is that operators aren't overloaded (not bad if you have only a couple types, but quite problematic as the number of primitive types grows). A personal annoyance is the use of JS promises (non-monadic with auto-flatmap).

Most of this applies to Reason as well: http://adam.chlipala.net/mlcomp/ (I'd note that the goal of Successor ML is to add the most useful OCaml features to SML).

As to why they chose OCaml: more libraries make it easier to write tooling. OCaml has built-in tools that make writing the language much easier. Most importantly, other Facebook teams were already doing a ton of stuff in OCaml. There are probably a bunch of other reasons, but the ReasonML guys know way more about that than I do.

Ok, yeah, I guess that's personal preference then. I have only superficial experience with traditional MLs and am currently experimenting with ReasonML, and it just feels more intuitive and productive right from the start, and context switching is easier.

With regard to operator overloading, that's the number one issue I had any time I looked at Haskell, for example. It's just incredibly hard to figure out what some <=$=> operator or whatever means, and also very hard to google, so for newcomers it's really bad language UI.

Operator overloading has pros and cons. In truth, I'm inclined to agree with you that operator overload should be limited in userland, but I'm a bit incredulous at the choice with primitive data types. There are 12 or so common primitive numeric types baked into hardware. If Ocaml ever expanded to include these all, the effect without overloading would be horrendous.

Ok yes, agreed, the separate operators for different numeric types are not exactly very user-friendly either. In the de-facto use case of Reason they shouldn't be such a big deal though (building webapps).

>Ocaml is that operators aren't overloaded

Which is good, since SML's ad-hoc polymorphism is broken. The right way to do ad-hoc polymorphism is to use classes, type classes or modular implicits.

SML overloading of builtins isn't specifically broken -- it just makes implicit assumptions about data types. That said, I'm with you about typeclasses and Ocaml's modular implicits.

If full-blown operator overloading is going to be allowed, modular implicits are definitely the way to go. If only builtin operators can be overloaded (probably a better solution in maintainable software), then a few specific typeclasses (like the current equality one) could be a more pragmatic solution.

In this particular case, I'd hedge my bets and use modular implicits, but limit them to a small list of builtin operators.

You have it partially in Rust, Swift and F#; it's just that Google can't really do language design that well.

I know he said "All the code here should be translatable to OCaml if that’s more your taste" but I have read multiple examples about the power of ML modules and the code always seems to be in SML and not OCaml. Why is that?

Most likely because there is plenty of actual production code using modules written in OCaml, and modules are such a key part of modern idiomatic OCaml that no one seems to find it interesting to write blog posts and examples about them outside of teaching material. OCaml users are generally more interested in writing software than in promoting the language.

You will, however, find plenty of examples in the OCaml manual (the whole of chapter 2 is on the module system [1]) and in beginner-focused books (chapters 4, 9 and 10 of Real World OCaml [2]).

Functors (parametrized modules) are routinely used in production OCaml code. OCamlgraph [3] is a famous example (it's used by Frama-C and pfff) because its authors wrote a paper in 2007 on using functors to build a generic library [4], but functors are literally everywhere: Jane Street Core, Facebook pfff, Facebook Infer, Frama-C.

[1] http://caml.inria.fr/pub/docs/manual-ocaml/moduleexamples.ht...
[2] https://dev.realworldocaml.org/toc.html
[3] http://ocamlgraph.lri.fr/index.en.html
[4] http://www.lri.fr/~filliatr/ftp/publis/ocamlgraph.ps

I think a lot of the people who are enthusiastic about the ML module system are either faculty, students, or alumni from Carnegie Mellon University (including the author of this blogpost), and the same set of CMU people have also done a lot of work on defining and teaching Standard ML.

Perhaps if it were taught to CompSci students elsewhere in the US, there'd be more advocacy?

In Europe, they are taught at quite a few unis.

We had classes about C++, Pascal, x86 and MIPS Assembly, Smalltalk, Caml Light (OCaml had just been released as Objective Caml), Lisp, Prolog, Java (on my last year there).

Those who ventured into the optional compiler design classes, like I did, had even more languages to look at, like Algol, PL/I, Oberon and a few esoteric ones.

So it always feels strange to me when universities don't take this approach to teach what is out there, or where we come from.

Because Ocaml is easily the ugliest of all the ML languages. I will never understand why Ocaml became the popular one. SML is much more elegant in my opinion (and even adding all the good Ocaml stuff to SML still leaves a much more elegant language).

The further reason is that SML is very popular in academia (perhaps that is its downfall). The same holds true for Lisp, where Common Lisp is undoubtedly the most used, but Scheme is usually the example language because it is more elegant (and has fewer warts).

Can you define what you mean by "the ugliest"?

SML is used in academia because the language is very well defined (both its semantics and its typing rules) and covers all the fundamentals of typed functional programming, so it is preferred for teaching. OCaml is a moving target, and the language has no formal specification, afaik.

The fact that the implementation is the spec is a large part of the issue. Someone feels like adding a feature, so they just tack it on somewhere until the language has a million syntactic features littered everywhere. The experimentation aspect is nice, but to me, the final product is an unwieldy Frankenstein (though still better than many more popular languages).


SML is nice enough as is, but the progress with Successor ML to gradually add the best of Ocaml on top is a very appealing option. As a personal gripe, both languages need better Unicode support.

Speaking from a European university's point of view: when I was there everyone was using Caml Light, so when its new version, Objective Caml, came out, it was natural to just upgrade.

Then you have countries like France, INRIA's home, where schools are proud to teach their students about local tech.

INRIA also tries to connect with the industry, instead of being stuck in Academic stuff.

OCaml has preferred OO over modules by convention.

No, OCaml has an object system, but it is an uncommonly used complement to the module system.

That’s exactly what I mean. They find no need for modules when you have classical inheritance.

Using the object system in OCaml is super rare in my experience and using modules to structure code is way more common. They are also two somewhat orthogonal features of the type system: objects don't contain types, they are just values.

Yikes, the OCaml I worked on must be pretty unusual then.

Quite on the contrary: The OO part of OCaml is hardly used at all in the wild!

I guess existing knowledge of ML syntax is a prerequisite, because I couldn't fully follow.

And can someone tell me how it's different from Java interfaces or classes?

One big difference is that Java interfaces can't contain member types. They describe the methods of one particular type. Imagine you could define an interface for an entire Java package.

Another big one is that module types are structural: you don't have to declare all the module types implemented by a module when you define the module.
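Both differences can be seen in a short OCaml sketch. The names `SHOW`, `IntThing`, `BoolThing` and `ShowList` are invented for this example.

```ocaml
(* A signature can bundle a type together with the operations on it *)
module type SHOW = sig
  type t
  val show : t -> string
end

(* Neither module mentions SHOW anywhere in its definition *)
module IntThing = struct
  type t = int
  let show = string_of_int
  let zero = 0                (* extra members are fine *)
end

module BoolThing = struct
  type t = bool
  let show = string_of_bool
end

(* Accepts any module that structurally matches SHOW *)
module ShowList (X : SHOW) = struct
  let show xs = "[" ^ String.concat "; " (List.map X.show xs) ^ "]"
end

module IL = ShowList (IntThing)
module BL = ShowList (BoolThing)
```

In Java terms, `IntThing` never had to say `implements SHOW`: the check happens at the point of use, when it is passed to `ShowList`.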
