
Thoughts on Rust vs. OCaml - pbiggar
https://blog.darklang.com/first-thoughts-on-rust-vs-ocaml/
======
Jweb_Guru
Yeah, when trying to translate code from OCaml, the number one most annoying
thing was the limitations around pattern matching. I think in general you run into this
sort of thing a lot if you try to either treat Rust as a functional
programming language like OCaml, or are writing the sort of thing that
inherently benefits from garbage collection. However, there is actually a
trick that can (if you don't need cleanup) both improve your performance and
give you back pattern matching: allocate your nodes in something like
[https://docs.rs/typed-arena/2.0.1/typed_arena/](https://docs.rs/typed-arena/2.0.1/typed_arena/)
instead of as Rcs. Doing this (especially if you
tend to stick to immutable stuff, but even if you don't if you're careful) can
make using Rust for manipulating ASTs and stuff a lot more pleasant (though it
does have other tradeoffs, obviously, and you usually end up needing both
kinds of representation at different points--which I guess speaks to your
point about there being way too many ways to do things and too many rules
about how to do them).

~~~
pbiggar
Thanks! How does the Arena help you with pattern matching? I'd be very
interested.

~~~
Jweb_Guru
The below is much too long, so yeah, just listen to the sibling comment.
That's the essence of it, the below are mostly technical details.

---

When you allocate nodes in an arena, instead of getting back something like a
Box<T>, or Rc<T>, you get a plain Rust reference--&'a mut T or &'a T.

Obviously, the first benefit is that these work well with pattern matching,
field projection, and other useful language features. They are also "zero
cost" in a way that Box, Rc, and Arc are not--everything in any particular
arena gets freed at the same time, so there's no need for precise ownership
tracking. In fact, &T in particular is Copy, aka freely duplicable, so you can
pretty much treat these &T references exactly like garbage collected pointers
in a functional language!
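
As a rough illustration of the point above, here is a minimal sketch. To stay dependency-free it uses a hypothetical `alloc` helper built on `Box::leak` as a stand-in for an arena: nothing is ever freed and the references are `'static`. With typed_arena you would call `arena.alloc(...)` instead, and the references would live as long as the arena. The `Expr` type and the `eval` and `sample` functions are made up for the example.

```rust
// Hypothetical AST whose children are plain shared references.
#[derive(Debug)]
enum Expr<'a> {
    Int(i64),
    Add(&'a Expr<'a>, &'a Expr<'a>),
    Mul(&'a Expr<'a>, &'a Expr<'a>),
}

// Stand-in for an arena allocation: leaks the value and hands back a
// reference. A real arena would free everything when it is dropped.
fn alloc<T: 'static>(value: T) -> &'static T {
    Box::leak(Box::new(value))
}

// Nested patterns see straight through `&Expr`; no manual derefs needed.
fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Int(n) => *n,
        // Two-level pattern: multiplying by a literal zero short-circuits.
        Expr::Mul(Expr::Int(0), _) | Expr::Mul(_, Expr::Int(0)) => 0,
        Expr::Add(l, r) => eval(l) + eval(r),
        Expr::Mul(l, r) => eval(l) * eval(r),
    }
}

// Builds (2 + 3) * (2 + 3), sharing the `2 + 3` node.
fn sample() -> &'static Expr<'static> {
    let two = alloc(Expr::Int(2));
    let three = alloc(Expr::Int(3));
    let sum = alloc(Expr::Add(two, three));
    // `&Expr` is `Copy`, so `sum` can be used twice without cloning.
    alloc(Expr::Mul(sum, sum))
}

fn main() {
    println!("{}", eval(sample())); // prints 25
}
```

The shared `sum` node is the part that would need `Rc::clone` with reference-counted children; with plain references it is just a copy of a pointer.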

Arenas have some other benefits, too. They have great locality of reference,
are tightly packed (almost as tightly packed as a vector), and in many cases
let you allocate slices in addition to single references (so even if you think
you need a Vec of nodes, you can still often stick with pure arenas; and
slices are, again, Copy, meaning they work well with pattern matching and
other parts of the language).

In the context of writing a compiler, arenas have another extremely useful
property: they make it easy to efficiently create recursive data structures
(particularly immutable ones, but even ones with cycles if you are willing to
use interior mutability like `Cell<T>`). These recursive structures can use
ordinary Rust references in the exact same way you would managed pointers in a
GC'd language. The only restrictions are that whatever nodes you point to
should be allocated in the same arena (or older), and a more complex type-
related rule for safety that in practice doesn't hurt much.
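
A small sketch of the cyclic case, under the same assumptions as any dependency-free example: `alloc` here is a hypothetical stand-in based on `Box::leak`, not a real arena, and `Node`, `make_cycle`, and `walk` are invented names. With an actual arena the lifetime would be tied to the arena instead of `'static`.

```rust
use std::cell::Cell;

// Hypothetical graph node; `next` can be re-pointed after allocation
// because it sits behind a `Cell`.
struct Node<'a> {
    name: &'static str,
    next: Cell<Option<&'a Node<'a>>>,
}

// Stand-in for an arena allocation (leaks instead of freeing in bulk).
fn alloc<T: 'static>(value: T) -> &'static T {
    Box::leak(Box::new(value))
}

// Builds a two-node cycle a -> b -> a and returns `a`.
fn make_cycle() -> &'static Node<'static> {
    let a = alloc(Node { name: "a", next: Cell::new(None) });
    let b = alloc(Node { name: "b", next: Cell::new(None) });
    a.next.set(Some(b));
    b.next.set(Some(a)); // closes the cycle through a shared reference
    a
}

// Follows `next` for `steps` hops and reports the name of the final node.
fn walk(start: &Node, steps: usize) -> &'static str {
    let mut cur = start;
    for _ in 0..steps {
        cur = cur.next.get().expect("every node in the cycle has a successor");
    }
    cur.name
}

fn main() {
    println!("{}", walk(make_cycle(), 3)); // a -> b -> a -> b
}
```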

The exact details are a little complicated, and it involves some subtle areas
of the compiler, so feel free to skip this part (the tl;dr is that these rules
usually apply). Essentially Rust can reason as follows:

* For safety, all references to values within a TypedArena must be accessible for at most as long as the TypedArena itself.

* Before the arena is dropped, its nodes are always accessible, meaning there are no restrictions on how they can be used.

So the problem case only occurs when not only do the node types stored in the
arena include references to other nodes within the arena (more conservatively,
with the same lifetime as the arena), they _also_ have destructors that
actually follow that field. It turns out that this almost never occurs in
practice, so in theory virtually all Rust types should be usable with cycles;
but the language can't reason directly about field usage in that way at the
moment. Instead, Rust provides three mechanisms to make all this work out:

* Many types (all Rust primitives, many zero-sized types, and other "plain old data") simply lack drop glue altogether. Notably, regular Rust references (&mut T and &T) fall into this category, but the "smart" constructors that are annoying to use with pattern matching (Box<T>, Arc<T>, Rc<T>) do not. This is by far the best kind of type to use with arenas because it maximizes their strengths--as there is no per-element drop code to run, Rust can do no per-element processing whatsoever on destruction, allowing effective use of arenas to equal or even surpass the speed of the young generation of a generational GC in the right circumstances.

There are some arenas that even go further, exploiting the fact that with no
drop glue you don't need to know what types are allocated to define an arena
that lets you allocate heterogeneous types into the same structure--the only
restriction is that the types must be `Copy` (a proxy for not having a
destructor, in this case, as any type which has drop glue is prevented by Rust
from also implementing `Copy`).

* Many other types in Rust have drop glue, but their own destructors are automatically generated by Rust. The Rust autogenerated destructor code is very straightforward and just recursively calls drop on all the subfields / variant data, so as long as all of these are known not to use any fields with the lifetime of the arena, we can assume the parent type doesn't, either.

* Finally, there are the cases of Rust types that _do_ have a destructor, _do_ have a field with the lifetime of the arena, but _don't_ try to follow that field within the destructor. While in reality this probably encompasses virtually all remaining types, Rust can't currently deal with lifetimes that are in scope, but not alive (though this is a useful concept in some other scenarios too). And in practice, the most likely scenario for encountering this involves using a type like Vec<T>, Box<T>, Rc<T>, Arc<T>, etc. inside an arena, where _T_ satisfies one of the other two rules, but the outer container doesn't (since it has a destructor).

As Rust can't verify these scenarios itself, instead it provides an unsafe
escape hatch, #[may_dangle], which can be applied to a lifetime that isn't
dereferenced in the destructor, or to a generic type parameter that is only
used in order to run its destructor manually somehow. Using this attribute in
a `Drop` implementation makes the implementation require `unsafe impl` and can
only be done on nightly for new types, but it's implemented for all the types
I just mentioned, and more besides. This means that Rust immediately knows
that Box<T> is safe to use in such a cyclic arena as long as T is, and same
with Arc<T>, Rc<T>, Vec<T>, etc.
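
The "problem case" can be seen without any arena at all. The sketch below is in the spirit of the example in the Rustonomicon's drop check chapter; the type and function names are only illustrative. It shows a value that borrows from itself, which is accepted as long as the borrowing type has no hand-written destructor. It does not exercise `#[may_dangle]` directly.

```rust
// Holds a reference but has no destructor of its own.
struct Inspector<'a>(&'a u8);

struct World<'a> {
    inspector: Option<Inspector<'a>>,
    days: Box<u8>,
}

// Uncommenting this impl is expected to make `observe` fail to compile:
// the destructor could read `self.0` after `days` has been freed, so the
// borrow checker then demands that the borrowed data strictly outlive
// the value that holds the reference.
//
// impl<'a> Drop for Inspector<'a> {
//     fn drop(&mut self) {
//         println!("I was only {} days from retirement!", self.0);
//     }
// }

fn observe() -> u8 {
    let mut world = World { inspector: None, days: Box::new(1) };
    // `world` now borrows from itself. This is accepted because
    // `Inspector` has no destructor that could follow the reference.
    world.inspector = Some(Inspector(&world.days));
    match &world.inspector {
        Some(Inspector(days)) => **days,
        None => 0,
    }
}

fn main() {
    println!("{}", observe()); // prints 1
}
```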

So ultimately what this amounts to is that if you put recursive types in an
arena, nine times out of ten it will "just work" and basically feel like
you've turned Rust's ownership rules completely off, as long as the arena can
manage to stay in scope.

Obviously, there's no free lunch, and you can see that this approach gets a
lot less attractive if you're talking about storing things with unpredictable
or highly irregular lifetimes (in an object lifetime sense), or if you want to
package things up as owned data, or be able to allocate them in a function
that doesn't accept a context (that library ecosystem is indeed great as you
mentioned, but if you start using weird enough patterns it stops working so
well). But my experience, having done this sort of thing for a long time now,
is that the tradeoffs for arenas are _usually_ worth it for at least some
parts of a compiler, both for aesthetic and performance reasons. Their
relative lack of popularity is, I think, more a product of poor marketing than
anything else (as you can see, the arena story _does_ fit into the borrows,
ownership, and lifetime paradigm, but not in a straightforward way).

~~~
pbiggar
What a great writeup, thank you! I think I followed all that. Two things leap
out at me:

- I can see arenas for a compiler, but I'm not so sure for an interpreter.
While most of the memory allocated in Dark will be done within a single HTTP
call (and so an arena might actually be appropriate), there are contexts where
a GC cycle or two to clean up might be appropriate. I'm guessing that arenas
are an all-or-nothing approach?

- I'm using `im`, which I think also gets in the way of my nice pattern
matching, right? Any solutions to that?

~~~
Jweb_Guru
I also realized I should specifically address your "all or nothing" question,
though: arenas absolutely aren't all or nothing :) In particular, a very
common (well, by writing-a-compiler-in-Rust standards) pattern is to have two
representations for your elements: one that uses Rc, that you use for longer-
lived stuff, and another based on references, that you use for arena-scoped
stuff. An extra term is added to the younger version that marks the use of a
wrapper around an Rc type (recall that you can borrow an Rc as a plain reference, so
you do not necessarily have to lose Copy to be able to do this, depending on
your implementation). You can then implement a variant of generational
collection by (prior to dropping your arena, i.e. the young generation),
scanning any live objects using the young term type, and converting whatever
you find to Rc nodes. The Rc nodes then persist as the older generation.
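
One possible shape for that two-representation pattern is sketched below. All names (`Young`, `Old`, `promote`, `demo`) are hypothetical, and ordinary local variables inside a block stand in for the arena, so this only illustrates the promotion step and not a real allocator.

```rust
use std::rc::Rc;

// Young generation: nodes linked by plain references (arena-style).
#[derive(Clone, Copy)]
enum Young<'a> {
    Int(i64),
    Add(&'a Young<'a>, &'a Young<'a>),
    // The extra term: a borrowed handle to a node that is already in the
    // old generation. `&Rc<_>` is `Copy`, so `Young` stays `Copy` too.
    Old(&'a Rc<Old>),
}

// Old generation: the same shape, but owned through `Rc`.
enum Old {
    Int(i64),
    Add(Rc<Old>, Rc<Old>),
}

// Copies a live young node into the old generation. Nodes that already
// live there are shared by bumping the reference count, not copied.
fn promote(node: &Young) -> Rc<Old> {
    match node {
        Young::Int(n) => Rc::new(Old::Int(*n)),
        Young::Add(l, r) => Rc::new(Old::Add(promote(l), promote(r))),
        Young::Old(rc) => Rc::clone(rc),
    }
}

fn eval(node: &Old) -> i64 {
    match node {
        Old::Int(n) => *n,
        Old::Add(l, r) => eval(l) + eval(r),
    }
}

// Returns the value of the promoted tree and the survivor's strong count.
fn demo() -> (i64, usize) {
    let survivor = Rc::new(Old::Int(40)); // promoted in an earlier round
    let promoted = {
        // This block plays the role of the arena's scope.
        let old_ref = Young::Old(&survivor);
        let two = Young::Int(2);
        let sum = Young::Add(&old_ref, &two);
        promote(&sum)
    }; // the young nodes are gone here; `promoted` lives on
    (eval(&promoted), Rc::strong_count(&survivor))
}

fn main() {
    println!("{:?}", demo()); // prints (42, 2)
}
```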

This kind of hybrid strategy is much more effective than using Rc for
everything, and while it's still not as efficient as a proper garbage
collector it still has roughly the kinds of properties you want from one. My
experience (like that of many others) is that almost any workload you want to
use this with will satisfy the generational hypothesis, especially in Rust
where you only use GC for things that really need it, so hopefully you can see
why it's sometimes worth going to the extra trouble to maintain several
versions of the same type.

(Obviously, in OCaml, you can piggyback off the existing garbage collector and
this is a nonissue).

------
jbandela1
> You put everything in a Box::new (regular heap memory) or Rc::new (reference
> counted memory) or Arc::new (reference counted memory suitable to be used
> concurrently in different threads), and then when they go out of scope
> they'll be cleaned up.

Please don’t. This is probably one of the biggest mistakes I see in people
coming from languages with garbage collection to languages like Rust or C++.
They attempt to use smart pointers as a replacement for garbage collection.

The idiomatic use of these languages is that unless you have a good reason not
to, just put the variable on the stack. This will make your program shorter,
easier to reason about, and potentially faster due to memory locality.

The rough guideline I have heard for C++ which likely applies to Rust is that
90% of your variables should be on the stack. Of the remainder 90% should be
unique ownership such as unique_ptr or Box. Shared ownership such as
shared_ptr or Arc/Rc should ideally be no more than 1%.

~~~
pbiggar
> Please don’t. This is probably one of the biggest mistakes I see in people
> coming from languages with garbage collection to languages like Rust or C++.
> They attempt to use smart pointers as a replacement for garbage collection.

Thanks for the feedback. I'm new here, so the advice is appreciated.

I've been using the rule of thumb "does the lifetime of this variable exceed
the lifetime of this function" to decide whether to put it in an Rc. Given I'm
implementing an interpreter for a GCed language, it seems like I'm pretty much
going to have to put everything into an Rc, no? (Another commenter mentioned
an Arena, but either way it's not the stack).

~~~
Jweb_Guru
FWIW, very often the answer in those "returning from a function" situations
(in Rust anyway) is to allocate further up the stack, or clone, rather than
use a smart pointer. In fact, you can nearly always do this, as long as you
control enough of the involved code. Use of Rc is intended not only for data
that outlives the function in which it's currently defined (because in
general, you can define stuff wherever you want), but for data whose lifetime
is _unpredictable_--there either isn't a safe point in your program where
you can definitively drop it (this usually more applies to Arc, if it's
holding data pointed at by two independent threads), or there _is_ a known
safe point but it is too imprecise to be useful to you.

But the latter is pretty subjective; for instance, if you know it's safe to
drop something when the program ends, but it might be safe to drop much
earlier, in some contexts it still might be worth dropping at the end rather
than dealing with the significant overhead of something like Rc or Arc. The
reason I recommended arenas is that I find in compilers, often you have cases
where you just need to define a scope at some point in your program way higher
than when you create the value, and that's what they excel at--letting you
initialize stuff with whatever parent scope you want, not arbitrarily
restricting you to the function boundary.
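
A tiny sketch of "allocate further up the stack", using a made-up tokenizer: the caller owns both the input and the output buffer, so the tokens can outlive the function that produced them without any smart pointer. The function names are hypothetical.

```rust
// The tokens borrow from `source`, and the buffer belongs to the caller,
// so nothing here needs `Rc` even though the tokens outlive this call.
fn tokenize<'src>(source: &'src str, out: &mut Vec<&'src str>) {
    out.extend(source.split_whitespace());
}

fn count_tokens(source: &str) -> usize {
    // The parent scope decides how long the tokens live.
    let mut tokens = Vec::new();
    tokenize(source, &mut tokens);
    tokens.len()
}

fn main() {
    println!("{}", count_tokens("let x = 1")); // prints 4
}
```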

------
thelazydogsback
Seems like a strange pair of languages to compare/contrast given that they are
for such different use-cases. The cognitive overhead of programming Rust need
only be applied if you're programming systems or code that need fine control
over memory. OCaml is a general purpose language with GC and much higher
levels of abstraction that can be quite performant - or F# (for OCaml-like) or
other mainstream ecosystems like .netcore or jvm that have a much wider choice of
libs than Rust. (And these also usually have a path to native or js and/or
webasm.)

I really don't get the whole "rewrite it in Rust" thing at all, especially
when 95% of those applications would be perfectly performant in a language
that lets you focus on your domain, not on making the borrow-checker happy.

IMHO if your code is scoped to the lifetime of a call or transaction of some
sort or has very simple allocation needs, then fine -- but if you want to have
complex inter-relationships between objects and interesting data-structures in
a long-running process then you're in for a world of hurt.

~~~
asplake
But this is what I don't get. Why is OCaml (or some other language of ML
lineage) not a more accessible platform for web development? Or perhaps it is,
and I'm not seeing it? Prove me wrong here – I'd be delighted!

I'm not sure that Rust is there yet either mind you, but give it time, perhaps
it will...

~~~
cemerick
> Why is OCaml (or some other language of ML lineage) not a more accessible
> platform for web development

"Web development" is a grueling land war that requires massive investment in
foundational libraries and fairly sophisticated tooling just to do a modern
"hello world" app. Any language/platform that isn't explicitly dedicated to
"web development" will always be lagging many generations behind those that do
(i.e. rails and _waves hands_ javascript).

I agree with the sentiment that Rust and OCaml aren't really peers. More
generally, the urge to transition to Rust in general application development
is as puzzling to me as it is to your parent here: it has unique benefits for
systems programming, but that absolutely carries costs for a ton of non-
systems contexts where you could just as easily use OCaml, or (modern) Java,
or ...

------
magicmouse
The Eve team had been using some other language, then Jamie Brandon fell in
love with Rust, and it led to a split in the company. Rust is such a low level
language; its primitive types are things like integers from 8 to 128 bits, but
it doesn't have any drawing, event mgt, or layout tools in the language. I can't
imagine trying to build a graphical interactive product in Rust. It's similar
to Modula-2 in terms of level, with one added trick, the memory borrow
checking. I think of Rust as a one trick pony, that helps handle multi-
threading memory management. Something that a browser maker would think ideal.
I could never use it. Far too clumsy for building graphical interactive
software, which is my wheelhouse.

~~~
ssokolow
It sounds like you're making it a requirement that a language should be
tightly coupled to a specific application framework (or at least a specific
event loop and fat standard library).

Is that correct?

(If so, it's a perspective very alien to me because it seems like something
that would limit the language in the same way that PHP is limited by its
historical tight coupling to the task of request-lifecycle'd HTML templating.)

------
dkersten
> Since I've really hit a dead end with OCaml (multiple times even), I'm still
> hoping I'll get Rust Stockholm Syndrome. There doesn't really seem to be
> anywhere else to go.

I think you have to come to terms with the fact that _the perfect programming
language_ does not exist and you will hit frustrations and dead ends with
every language.

------
fishmaster
I've looked into both some time ago, and while both are great languages, the
tools around OCaml are atrocious.

~~~
pbiggar
Yeah, it's really unfortunate. Dune and Esy are moving things forward, but you
sorta have to know the ecosystem quite well to figure that out.

~~~
dm3
Do you have a reference to the up-to-date guide that you'd recommend to get
started in the OCaml ecosystem? The one that would cover editor integration
(VIM/Emacs), build system, standard and essential libraries to use, etc.

~~~
pbiggar
I'm not aware of any, sorry. You can look at the dark repo to see how we do it
though: [https://github.com/darklang/dark](https://github.com/darklang/dark)

------
wpwoodjr
I think this would be more idiomatic for your enum, and also shows how to
match on a Box:

[https://play.rust-lang.org/?version=stable&mode=debug&editio...](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=5365c8f9d02d52b9f4ea523fcacecee6)

    
    
        // AST node type; `Expr` below is the boxed form used for children.
        #[derive(Debug)]
        pub enum Expr_ {
            Let(String, Expr, Expr),
            FnCall(String, Vec<Expr>),
            Lambda(Vec<String>, Expr),
            Variable(String),
            IntLiteral(i32),
        }
        use self::Expr_::*;
        
        pub type Expr = Box<Expr_>;
        
        fn main() {
            let x = Box::new(Variable("test".to_string()));
            // Dereferencing the Box moves the node out so it can be matched by value.
            match *x {
                Variable(n) => dbg!(n),
                _ => unimplemented!(),
            };
        }
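
For comparison, here is a hedged sketch of what a two-level match looks like with boxed children. The cut-down `Expr` type and the `simplify` function are hypothetical. As far as I know, stable Rust has no pattern that looks through a `Box` (box patterns are a nightly feature), so the inner node is matched in a second step.

```rust
// Hypothetical two-variant AST with boxed children.
#[derive(Debug, PartialEq)]
enum Expr {
    Int(i32),
    Neg(Box<Expr>),
}

// Rewrites `--e` to `e`. The nested `match` is needed because the outer
// pattern cannot look inside the `Box`.
fn simplify(e: Expr) -> Expr {
    match e {
        Expr::Neg(inner) => match *inner {
            // Double negation: unwrap both layers and keep simplifying.
            Expr::Neg(innermost) => simplify(*innermost),
            other => Expr::Neg(Box::new(simplify(other))),
        },
        other => other,
    }
}

fn main() {
    let double_neg = Expr::Neg(Box::new(Expr::Neg(Box::new(Expr::Int(3)))));
    println!("{:?}", simplify(double_neg)); // prints Int(3)
}
```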

------
slifin
I think this just reminds me that if I went back to ReasonML from ClojureScript, I
wouldn't be having a good time

~~~
dm3
Would you care to expand on what would be the reasons?

------
dangoor
I wonder if it's related to the size of the codebase you have right now, but
compiler speed is one thing I've noticed from other comparisons. OCaml is said
to have a much faster compiler than Rust.

------
jacknews
What was wrong with Clojure?

I'm learning both Clojure and Rust, for different cases (I have C, Ruby,
JavaScript).

~~~
avasthe
Performance, startup time, JVM dependence, dynamic typing.

And the cult around zero information statements by Rich Hickey.

