Thoughts on Rust vs. OCaml (darklang.com)
84 points by pbiggar 11 months ago | 45 comments



> You put everything in a Box::new (regular heap memory) or Rc::new (reference counted memory) or Arc::new (reference counted memory suitable to be used concurrently in different threads), and then when they go out of scope they'll be cleaned up.

Please don’t. This is probably one of the biggest mistakes I see in people coming from languages with garbage collection to languages like Rust or C++. They attempt to use smart pointers as a replacement for garbage collection.

The idiomatic use of these languages is that unless you have a good reason not to, just put the variable on the stack. This will make your program shorter, easier to reason about, and potentially faster due to memory locality.

The rough guideline I have heard for C++, which likely applies to Rust, is that 90% of your variables should be on the stack. Of the remainder, 90% should be unique ownership such as unique_ptr or Box. Shared ownership such as shared_ptr or Arc/Rc should ideally be no more than 1%.


> Please don’t. This is probably one of the biggest mistakes I see in people coming from languages with garbage collection to languages like Rust or C++. They attempt to use smart pointers as a replacement for garbage collection.

Thanks for the feedback. I'm new here, so the advice is appreciated.

I've been using the rule of thumb "does the lifetime of this variable exceed the lifetime of this function" to decide whether to put it in an Rc. Given I'm implementing an interpreter for a GCed language, it seems like I'm pretty much going to have to put everything into an Rc, no? (Another commenter mentioned an Arena, but either way it's not the stack).


FWIW, very often the answer in those "returning from a function" situations (in Rust anyway) is to allocate further up the stack, or clone, rather than use a smart pointer. In fact, you can nearly always do this, as long as you control enough of the involved code. Use of Rc is intended not only for data that outlives the function in which it's currently defined (because in general, you can define stuff wherever you want), but for data whose lifetime is unpredictable--there either isn't a safe point in your program where you can definitively drop it (this usually more applies to Arc, if it's holding data pointed at by two independent threads), or there is a known safe point but it is too imprecise to be useful to you.

But the latter is pretty subjective; for instance, if you know it's safe to drop something when the program ends, but it might be safe to drop much earlier, in some contexts it still might be worth dropping at the end rather than dealing with the significant overhead of something like Rc or Arc. The reason I recommended arenas is that I find in compilers, often you have cases where you just need to define a scope at some point in your program way higher than when you create the value, and that's what they excel at--letting you initialize stuff with whatever parent scope you want, not arbitrarily restricting you to the function boundary.


You can return stack values assuming you're willing to transfer ownership to the calling function. For example, this is totally valid (and does not heap allocate):

    struct Point {
        x: i32,
        y: i32,
    }

    fn add(p1: &Point, p2: &Point) -> Point {
        Point {
            x: p1.x + p2.x,
            y: p1.y + p2.y,
        }
    }
You'll also find that a lot of common data structures already use the heap under the hood (e.g. Vec and HashMap), so there's no need to Box them.
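
As an aside, a quick sketch of that (the function here is made up just for illustration):

    // Vec already owns its elements on the heap, so returning it by value just
    // moves its small (pointer, length, capacity) handle; no Box needed.
    fn squares(n: i32) -> Vec<i32> {
        (0..n).map(|i| i * i).collect()
    }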

I'm not quite sure what you mean by "the lifetime of a function"; lifetimes are associated with references, but not with functions themselves. In particular, it can be totally valid to return a reference to something that needs to last beyond the current function call; lifetimes are used to enforce that you'll get a compiler error if the memory your reference points to won't survive long enough. For example, it's totally valid to have a function return a reference to a field on a struct as long as the struct was also passed in by reference:

    fn x_val(p: &Point) -> &i32 {
        &p.x
    }

Since `p` is passed in by reference, the compiler knows that the reference to `x` will still be valid after the function returns, so it allows that reference to be returned. The compiler will even catch it if later you drop the Point while the reference to the `x` value is still alive:

    fn main() {
        let p = Point { x: 1, y: -1 };
        let x: &i32 = x_val(&p);
        std::mem::drop(p);

        // Without this line, this will compile, since the compiler can tell that `x` isn't 
        // used after `p` is dropped. However, since `x` is used here afterwards, the 
        // compiler will generate an error stating that `p` does not live long enough.
        println!("{}", x);
    }


You should only put an object in an Rc when its lifetime may potentially be determined by multiple, independent references. That's what it's for. For the simpler case of single ownership, Box<> is fine.
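
A minimal sketch of the distinction (the values are made up for illustration):

    use std::rc::Rc;

    fn main() {
        // Single owner: Box is enough; the buffer is freed when `owned` goes
        // out of scope.
        let owned: Box<[u8; 1024]> = Box::new([0u8; 1024]);

        // Multiple independent owners: Rc; the Vec is freed only when the
        // last clone is dropped.
        let shared = Rc::new(vec![1, 2, 3]);
        let also_shared = Rc::clone(&shared);

        println!("{} {:?} {:?}", owned.len(), shared, also_shared);
    }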


The OP wanted to represent AST nodes - the stack won't do for that: the size of the object isn't known at compile time.


I think that the parent commenter was referring to overuse of Box and reference counting. However, the compiler does know the size, as it is an enum (a sum type), which means it is effectively a discriminated union that the compiler forces you to check the type of.


If the enum is recursive (which will often be the case, since expressions can be nested in most languages), then you'll still need to use `Box` (or some other indirection) for some of the child nodes.
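
A minimal sketch of such a recursive enum (the names are just illustrative):

    enum Expr {
        IntLiteral(i32),
        // Without the Box, Expr would contain itself by value and have no
        // finite size, so the compiler rejects the definition.
        Add(Box<Expr>, Box<Expr>),
    }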


That's a good detail to note, thanks. I wanted to note that the size of the enum itself is indeed known at compile time, otherwise the compiler would complain (such as if it is improperly recursive). The point is moot though, as a tree like this would usually be built on the heap anyway (whether that be an arena or normal allocation).


That is the only way when using libraries like Gtk-rs, unless you completely change the programming paradigm by using relm instead.


Yeah, trying to translate code from OCaml the number one most annoying thing was limitations around pattern matching. I think in general you run into this sort of thing a lot if you try to either treat Rust as a functional programming language like OCaml, or are writing the sort of thing that inherently benefits from garbage collection. However, there is actually a trick that can (if you don't need cleanup) both improve your performance and give you back pattern matching: allocate your nodes in something like https://docs.rs/typed-arena/2.0.1/typed_arena/ instead of as Rcs. Doing this (especially if you tend to stick to immutable stuff, but even if you don't if you're careful) can make using Rust for manipulating ASTs and stuff a lot more pleasant (though it does have other tradeoffs, obviously, and you usually end up needing both kinds of representation at different points--which I guess speaks to your point about there being way too many ways to do things and too many rules about how to do them).


Thanks! How does the Arena help you with pattern matching? I'd be very interested.


To put it very simply, it sort of provides a way to make everything 'static, except it's actually the arena's lifetime, 'a. So you no longer care who outlives whom, because all the objects will die together when the arena dies. And you're dealing with simple references, &, and not some wrappers.


The below is much too long, so yeah, just listen to the sibling comment. That's the essence of it; the below is mostly technical details.

---

When you allocate nodes in an arena, instead of getting back something like a Box<T>, or Rc<T>, you get a plain Rust reference--&'a mut T or &'a T.

Obviously, the first benefit is that these work well with pattern matching, field projection, and other useful language features. They are also "zero cost" in a way that Box, Rc, and Arc are not--everything in any particular arena gets freed at the same time, so there's no need for precise ownership tracking. In fact, &T in particular is Copy, aka freely duplicable, so you can pretty much treat these &T references exactly like garbage collected pointers in a functional language!

Arenas have some other benefits, too. They have great locality of reference, are tightly packed (almost as tightly packed as a vector), and in many cases let you allocate slices in addition to single references (so even if you think you need a Vec of nodes, you can still often stick with pure arenas; and slices are, again, Copy, meaning they work well with pattern matching and other parts of the language).

In the context of writing a compiler, arenas have another extremely useful property: they make it easy to efficiently create recursive data structures (particularly immutable ones, but even ones with cycles if you are willing to use interior mutability like `Cell<T>`). These recursive structures can use ordinary Rust references in the exact same way you would managed pointers in a GC'd language. The only restrictions are that whatever nodes you point to should be allocated in the same arena (or older), and a more complex type-related rule for safety that in practice doesn't hurt much.
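
A rough sketch of that pattern, assuming the typed-arena crate mentioned upthread as a dependency (the Expr type is made up for illustration):

    use typed_arena::Arena;

    #[derive(Debug)]
    enum Expr<'a> {
        Int(i32),
        Add(&'a Expr<'a>, &'a Expr<'a>),
    }

    fn main() {
        let arena = Arena::new();

        // `alloc` hands back plain references into the arena.
        let one: &Expr = arena.alloc(Expr::Int(1));
        let two: &Expr = arena.alloc(Expr::Int(2));
        let sum: &Expr = arena.alloc(Expr::Add(one, two));

        // Ordinary references pattern-match directly, no Rc/Box unwrapping.
        match sum {
            Expr::Add(Expr::Int(a), Expr::Int(b)) => println!("{}", a + b),
            other => println!("{:?}", other),
        }
        // Everything allocated above is freed together when `arena` is dropped.
    }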

The exact details are a little complicated, and it involves some subtle areas of the compiler, so feel free to skip this part (the tl;dr is that these rules usually apply). Essentially Rust can reason as follows:

* For safety, all references to values within a TypedArena must be accessible for at most as long as the TypedArena itself.

* Before the arena is dropped, its nodes are always accessible, meaning there are no restrictions on how they can be used.

So the problem case only occurs when not only do the node types stored in the arena include references to other nodes within the arena (more conservatively, with the same lifetime as the arena), they also have destructors that actually follow that field. It turns out that this almost never occurs in practice, so in theory virtually all Rust types should be usable with cycles; but the language can't reason directly about field usage in that way at the moment. Instead, Rust provides three mechanisms to make all this work out:

* Many types (all Rust primitives, many zero-sized types, and other "plain old data") simply lack drop glue altogether. Notably, regular Rust references (&mut T and &T) fall into this category, but the "smart" constructors that are annoying to use with pattern matching (Box<T>, Arc<T>, Rc<T>) do not. This is by far the best kind of type to use with arenas because it maximizes their strengths--as there is no per-element drop code to run, Rust can do no per-element processing whatsoever on destruction, allowing effective use of arenas to equal or even surpass the speed of the young generation of a generational GC in the right circumstances.

There are some arenas that even go further, exploiting the fact that with no drop glue you don't need to know what types are allocated to define an arena that lets you allocate heterogeneous types into the same structure--the only restriction is that the types must be `Copy` (a proxy for not having a destructor, in this case, as any type which has drop glue is prevented by Rust from also implementing `Copy`).

* Many other types in Rust have drop glue, but their own destructors are automatically generated by Rust. The Rust autogenerated destructor code is very straightforward and just recursively calls drop on all the subfields / variant data, so as long as all of these are known not to use any fields with the lifetime of the arena, we can assume the parent type doesn't, either.

* Finally, there are the cases of Rust types that do have a destructor, do have a field with the lifetime of the arena, but don't try to follow that field within the destructor. While in reality this probably encompasses virtually all remaining types, Rust can't currently deal with lifetimes that are in scope, but not alive (though this is a useful concept in some other scenarios too). And in practice, the most likely scenario for encountering this involves using a type like Vec<T>, Box<T>, Rc<T>, Arc<T>, etc. inside an arena, where T satisfies one of the other two rules, but the outer container doesn't (since it has a destructor).

As Rust can't verify these scenarios itself, instead it provides an unsafe escape hatch, #[may_dangle], which can be applied to a lifetime that isn't dereferenced in the destructor, or to a generic type parameter that is only used in order to run its destructor manually somehow. Using this attribute in a `Drop` implementation makes the implementation require `unsafe impl` and can only be done on nightly for new types, but it's implemented for all the types I just mentioned, and more besides. This means that Rust immediately knows that Box<T> is safe to use in such a cyclic arena as long as T is, and same with Arc<T>, Rc<T>, Vec<T>, etc.

So ultimately what this amounts to is that if you put recursive types in an arena, nine times out of ten it will "just work" and basically feel like you've turned Rust's ownership rules completely off, as long as the arena can manage to stay in scope.

Obviously, there's no free lunch, and you can see that this approach gets a lot less attractive if you're talking about storing things with unpredictable or highly irregular lifetimes (in an object lifetime sense), or if you want to package things up as owned data, or be able to allocate them in a function that doesn't accept a context (that library ecosystem is indeed great as you mentioned, but if you start using weird enough patterns it stops working so well). But my experience, having done this sort of thing for a long time now, is that the tradeoffs for arenas are usually worth it for at least some parts of a compiler, both for aesthetic and performance reasons. Their relative lack of popularity is, I think, more a product of poor marketing than anything else (as you can see, the arena story does fit into the borrows, ownership, and lifetime paradigm, but not in a straightforward way).


What a great writeup, thank you! I think I followed all that. Two things leap out at me:

- I can see arenas for a compiler, but I'm not so sure for an interpreter. While most of the memory allocated in Dark will be done within a single HTTP call (and so an arena might actually be appropriate), there are contexts where a GC cycle or two to clean up might be appropriate. I'm guessing that arenas are an all-or-nothing approach?

- I'm using `im`, which I think also gets in the way of my nice pattern matching, right? Any solutions to that?


I also realized I should specifically address your "all or nothing" question, though: arenas absolutely aren't all or nothing :) In particular, a very common (well, by writing-a-compiler-in-Rust standards) pattern is to have two representations for your elements: one that uses Rc, which you use for longer-lived stuff, and another based on references, which you use for arena-scoped stuff. An extra variant is added to the younger representation that wraps a node of the Rc-based type (recall that you can borrow an Rc as a plain reference, so you don't necessarily have to lose Copy to do this, depending on your implementation). You can then implement a variant of generational collection by scanning any live objects of the young term type (prior to dropping your arena, i.e. the young generation) and converting whatever you find to Rc nodes. The Rc nodes then persist as the older generation.
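
If it helps, here's a rough sketch of what that hybrid might look like (all type names are made up; storing the Rc directly in the young variant costs you Copy, whereas holding a borrowed &OldExpr instead would keep it, as noted above):

    use std::rc::Rc;

    // Older generation: owned, reference-counted nodes that outlive any arena.
    #[derive(Debug)]
    enum OldExpr {
        Int(i32),
        Add(Rc<OldExpr>, Rc<OldExpr>),
    }

    // Younger generation: arena-scoped nodes, plus an extra variant that
    // points into the older generation.
    #[derive(Debug)]
    enum YoungExpr<'a> {
        Int(i32),
        Add(&'a YoungExpr<'a>, &'a YoungExpr<'a>),
        Old(Rc<OldExpr>),
    }

    // "Promotion" pass, run on still-live roots just before the arena (the
    // young generation) is dropped: copy survivors into the Rc representation.
    fn promote(e: &YoungExpr) -> Rc<OldExpr> {
        match e {
            YoungExpr::Int(n) => Rc::new(OldExpr::Int(*n)),
            YoungExpr::Add(a, b) => Rc::new(OldExpr::Add(promote(a), promote(b))),
            YoungExpr::Old(o) => Rc::clone(o),
        }
    }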

This kind of hybrid strategy is much more effective than using Rc for everything, and while it's still not as efficient as a proper garbage collector it still has roughly the kinds of properties you want from one. My experience (like that of many others) is that almost any workload you want to use this with will satisfy the generational hypothesis, especially in Rust where you only use GC for things that really need it, so hopefully you can see why it's sometimes worth going to the extra trouble to maintain several versions of the same type.

(Obviously, in OCaml, you can piggyback off the existing garbage collector and this is a nonissue).


- Well, it depends on how you intend your interpreter to be used, but if it's for long running programs, a garbage collector of some kind is essential. Rc<T> is a primitive garbage collector that happens to be provided by Rust, but it's not very efficient, so if you're concerned with performance you'd probably want to build your own (or try your luck with a community GC--I haven't tried them in a while, so maybe they've improved since I last looked, but there didn't seem to be a lot of promising options for automating this).

- I don't think so, unfortunately. Your best bet would probably be a macro, which I'm sure you've already considered. You might want to talk to the author though, as I'm sure she has thought about this a lot more than I have.


Seems like a strange pair of languages to compare/contrast given that they are for such different use-cases. The cognitive overhead of programming Rust need only be applied if you're programming systems or code that needs fine control over memory. OCaml is a general purpose language with GC and much higher levels of abstraction that can be quite performant - as is F# (for OCaml-like), or other mainstream ecosystems like .NET Core or the JVM, which have a much wider choice of libs than Rust. (And these also usually have a path to native or JS and/or WebAssembly.)

I really don't get the whole "rewrite it in Rust" thing at all, especially when 95% of those applications would be perfectly performant in a language that lets you focus on your domain, not on making the borrow-checker happy.

IMHO if your code is scoped to the lifetime of a call or transaction of some sort or has very simple allocation needs, then fine -- but if you want to have complex inter-relationships between objects and interesting data-structures in a long-running process then you're in for a world of hurt.


Both are systems programming languages (OCaml describes itself as such).

The rewrite in Rust thing, for darklang at least, comes from the 50th time we've had to do something bad and ugly to fit into what is supported by OCaml.


Was there something about F# on .netcore you didn't like? Would make a re-write much easier, I'd think.


I've thought about it a few times, but I'm so far from that ecosystem that I know nothing about it at all.


> the 50th time we've had to do something bad and ugly to fit into what is supported by OCaml

Can you tell us a bit more about those cases? I've been writing front-end stuff using BuckleScript (and the OCaml syntax, which I for once really like!) for 6 months, and I haven't (yet?) stumbled upon those kind of pain points, I'm definitely curious!


Not the OP, but my biggest issues have been around tooling.


> I really don't get the whole "rewrite it in Rust" thing at all, especially when 95% of those applications would be perfectly performant in a language that lets you focus on your domain, not on making the borrow-checker happy.

Generally, people who actually know what they're talking about see RIIR as being less about performance and more about getting stronger compile-time correctness guarantees than are available from languages with nullable types (holes in the type system), exceptions (hidden control flow paths), no sum types (limited ability to "make invalid states unrepresentable"), etc.

In that respect, Rust serves more as a way to get the benefits of academic-originated functional languages with the FFI integrability and large community and ecosystem of more mainstream languages.

(I use it for command-line utilities and for writing compiled Python extensions. I'm still on Python for Django and PyQt.)


This attitude is why I avoid Rust


What attitude?

The "Generally, people who actually know what they're talking about"? I could have phrased it better, but "Rewrite it in Rust" IS a bit of a groaner in the Rust community because of the history of having to do damage control for counter-productively rabid fans who have no idea that Rust isn't magic pixie dust and that rewrites actually take effort.

(I've been lurking in /r/rust/ since before Rust 1.0 came out in 2015, so I've seen a lot of that.)

The "In that respect, Rust serves"? I'm not the only one who sees Rust as taking the best stuff from functional programming and stopping before the return on investment starts to tank. For example:

https://dev.to/yujiri8/why-i-don-t-believe-in-pure-functiona...


"Rewrite it in Rust" is not a mature reaction from the Rust community. It is similar to the "Arch Linux, btw" boys you see in Linux communities.

And GP is right, you Rust folks talk as if sum types are a silver bullet. And you people also talk as if Rust invented sum types (though you don't say it outright).

You folks talk as if cargo is the best package management solution that exists.

You folks talk as if a web of hundreds of dependencies written by enthusiastic youngsters is not a problem for a language priding itself on safety properties (attempts like crev don't get the attention they deserve).

You folks talk like the memory management overhead of Rust is not a problem for creating average CRUD webshit.

You folks talk as if Rust automatically guides you towards correct programming patterns, when that is out of necessity to facilitate static memory management. The business logic is not at fault here, and neither is Rust.

Some folks in the rust community frequently bring virtue signalling politics into programming language related medium.

Rust is a well designed systems language and deserves appreciation for bringing Ada / Cyclone / ATS' good parts as a usable language. But the community is ruining its image.

Plaudits to all involved.


But this is what I don't get. Why is OCaml (or some other language of ML lineage) not a more accessible platform for web development? Or perhaps it is, and I'm not seeing it? Prove me wrong here – I'd be delighted!

I'm not sure that Rust is there yet either mind you, but give it time, perhaps it will...


> Why is OCaml (or some other language of ML lineage) not a more accessible platform for web development

"Web development" is a grueling land war that requires massive investment in foundational libraries and fairly sophisticated tooling just to do a modern "hello world" app. Any language/platform that isn't explicitly dedicated to "web development" will always be lagging many generations behind those that do (i.e. rails and waves hands javascript).

I agree with the sentiment that Rust and OCaml aren't really peers. More generally, the urge to transition to Rust in general application development is as puzzling to me as it is to your parent here: it has unique benefits for systems programming, but that absolutely carries costs for a ton of non-systems contexts where you could just as easily use OCaml, or (modern) Java, or ...


I've used F# for many web projects - server side and transpiled to client-side. For example, see: https://safe-stack.github.io/

Also: https://www.purescript.org/


They didn't have corporate backing.

They didn't have aggressive marketing or zealous community.

They didn't have massive influx from ruby rails or javascript webshit crowds.


The Eve team had been using some other language, then Jamie Brandon fell in love with Rust, and it led to a split in the company. Rust is such a low-level language; its primitive types are things like integers from 8 to 128 bits, but it doesn't have any drawing, event management, or layout tools in the language. I can't imagine trying to build a graphical interactive product in Rust. It's similar to Modula-2 in terms of level, with one added trick, the memory borrow checking. I think of Rust as a one-trick pony that helps handle multi-threading memory management. Something that a browser maker would think ideal. I could never use it. Far too clumsy for building graphical interactive software, which is my wheelhouse.


It sounds like you're making it a requirement that a language should be tightly coupled to a specific application framework (or at least a specific event loop and fat standard library).

Is that correct?

(If so, it's a perspective very alien to me because it seems like something that would limit the language in the same way that PHP is limited by its historical tight coupling to the task of request-lifecycle'd HTML templating.)


> Since I've really hit a dead end with OCaml (multiple times even), I'm still hoping I'll get Rust Stockholm Syndrome. There doesn't really seem to be anywhere else to go.

I think you have to come to terms with the fact that the perfect programming language does not exist and you will hit frustrations and dead ends with every language.


I've looked into both some time ago, and while both are great languages, the tools around OCaml are atrocious.


As a counter-anecdote, I came to OCaml skeptically in late 2018, having previously done significant work in Haskell, Clojure[Script], and Java. I got up and running very quickly, and IMO compared to those other three languages/platforms, OCaml tooling is only worse/more difficult than Java's.


Yeah, it's really unfortunate. Dune and Esy are moving things forward, but you sorta have to know the ecosystem quite well to figure that out.


Do you have a reference to the up-to-date guide that you'd recommend to get started in the OCaml ecosystem? The one that would cover editor integration (VIM/Emacs), build system, standard and essential libraries to use, etc.


I'm not aware of any, sorry. You can look at the dark repo to see how we do it though: https://github.com/darklang/dark


I think this just reminds me that if I went back to ReasonML from ClojureScript I wouldn't be having a good time.


Would you care to expand on what would be the reasons?


I think this would be more idiomatic for your enum, and also shows how to match on a Box:

https://play.rust-lang.org/?version=stable&mode=debug&editio...

    #[derive(Debug)]
    pub enum Expr_ {
      Let(String, Expr, Expr),
      FnCall(String, Vec<Expr>),
      Lambda(Vec<String>, Expr),
      Variable(String),
      IntLiteral(i32),
    }
    use self::Expr_::*;
    
    pub type Expr = Box<Expr_>;
    
    fn main() {
        let x = Box::new(Variable("test".to_string()));
        match *x {
            Variable(n) => dbg!(n),
            _ => unimplemented!()
        };
    }


I wonder if it's related to the size of the codebase you have right now, but compiler speed is one thing I've noticed from other comparisons. OCaml is said to have a much faster compiler than Rust.


What was wrong with clojure?

I'm learning both clojure and rust, for different cases (I have C, ruby, javascript).


Performance, startup time, JVM dependence, dynamic typing.

And the cult around zero information statements by Rich Hickey.



