Immutable.rs: immutable data structures for Rust (immutable.rs)
291 points by tosh 10 months ago | 63 comments

> Most of these data structures support in-place copy-on-write mutation, which means that if you're the sole user of a data structure, you can update it in place with a huge performance benefit (about an order of magnitude faster than immutable operations, almost as fast as std::collections's mutable data structures).
>
> Thanks to Arc's reference counting, we are able to determine whether a node in a data structure is being shared with other data structures, or whether it's safe to mutate it in place. When it's shared, we'll automatically make a copy of the node before modifying it, thus preserving the usual guarantees you get from using an immutable data structure.

That's a really interesting approach. I could see that being really useful in a lot of cases, and it lets you get good single-threaded performance on code that would also continue to work in a multi-threaded case.
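The mechanism can be sketched with plain std types: `Arc::make_mut` does exactly this clone-if-shared check. A minimal illustration using a `Vec` in place of the library's persistent tree nodes:

```rust
use std::sync::Arc;

fn main() {
    let mut a = Arc::new(vec![1, 2, 3]);

    // Sole owner: make_mut hands back &mut to the existing Vec,
    // so this push mutates in place with no copy.
    Arc::make_mut(&mut a).push(4);

    let b = Arc::clone(&a);

    // Now the Arc is shared: make_mut clones the Vec first,
    // so `b` keeps seeing the old value.
    Arc::make_mut(&mut a).push(5);

    assert_eq!(*a, vec![1, 2, 3, 4, 5]);
    assert_eq!(*b, vec![1, 2, 3, 4]);
}
```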

FWIW Clojure already has that concept as "transients" but it requires explicit invocation (creation of the transient then conversion back to persistent). I think Swift also commonly uses CoW but does so for regular collections not persistent ones.

What is the advantage of persistent collections over value-type collections?

Structural sharing. It means you don't have to copy the whole data structure, just the path from the modified node up to the root.

I don't know - that sounds more like an implementation detail available to both. Obviously if you modify the first element of a value-type list, it's not going to copy the rest.

Persistent data structures are really about the interface: functions that would be traditionally mutating instead return new versions. Is there an advantage of that interface over mutable value-type collections? One advantage is that it works better with e.g. folds. But overall I think it's more awkward.

The point is to share intermediate state of your data structure without headache.

For instance, if you write an interpreter, representing variable bindings with a functional map lets you reason about scope super easily: when you enter a scope, you just create a new map with the new variable bindings, and when the scope ends you just go back to the map from the beginning of the scope, which is still valid.

With a stateful hashtable, you need to undo the changes you made in your scope; the value the table had when you opened the scope, the one you want to get back to, is gone forever. This ties your program to a certain execution order. One way to see that: if for some reason you want to collect the environment of each scope into some data structure, you need to make deep copies everywhere.
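A toy sketch of that interpreter pattern, using a hand-rolled persistent binding list with `Rc` rather than any particular library (the names and `i64` values are illustrative):

```rust
use std::rc::Rc;

// A persistent environment: an immutable linked list of bindings.
// "Entering a scope" just prepends; the old list stays valid.
enum Env {
    Empty,
    Bind(String, i64, Rc<Env>),
}

fn bind(env: &Rc<Env>, name: &str, value: i64) -> Rc<Env> {
    Rc::new(Env::Bind(name.to_string(), value, Rc::clone(env)))
}

fn lookup(env: &Env, name: &str) -> Option<i64> {
    match env {
        Env::Empty => None,
        Env::Bind(n, v, rest) => {
            if n == name { Some(*v) } else { lookup(rest, name) }
        }
    }
}

fn main() {
    let outer = bind(&Rc::new(Env::Empty), "x", 1);
    // New scope: shadow x, add y. `outer` is untouched.
    let inner = bind(&bind(&outer, "x", 2), "y", 3);

    assert_eq!(lookup(&inner, "x"), Some(2));
    assert_eq!(lookup(&inner, "y"), Some(3));

    // When the scope ends, just keep using `outer` — no undo needed.
    assert_eq!(lookup(&outer, "x"), Some(1));
    assert_eq!(lookup(&outer, "y"), None);
}
```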

> I don't know - that sounds more like an implementation detail available to both.

It impacts the performance profile of the collection, and thus isn't merely an implementation detail.

It seems like it would be nice to just use the immutable interface and have it be mutable in the case that there's only one owner, but be able to fall back to immutability when you need it.

It could make refactoring and debugging easier. For instance, let's say you're part way through some operation and it fails, and the error recovery logic is buggy and leaves the data structure in an inconsistent state. You could debug the problem, but if you're in a hurry and you just need something that works right now, you can store a reference to the data structure at the beginning of the operation and revert to that if anything goes wrong. You'd pay a performance cost since all the direct in-place edits would turn into copies and allocations, but it would at least work.

Similarly, you could develop a program initially in an all-immutable multi-owner style, and then selectively convert codepaths to single-owner style to get better performance in the parts where performance matters.

This, of course, requires tracking owners. If you're on the JVM, that's not so cheap. If you're already reference counting (like Python) or can do compiler-aided RAII (like C++), then it's relatively simple, unless you also need recursive data structures which "own" themselves.

Lack of defensive copying (widespread e.g. in Java) may alone be worth it, performance-wise.

Also, shared data is much easier to reason about if it's immutable (and not only in a concurrent execution context).

FWIW this is exactly what Swift does with all of its heap-allocated value types (e.g. Array, Dictionary, Set, String, etc).

And what the OPAL functional language [1] has done since the mid-80s. The concept there goes beyond simple reference counting, though: additional static analysis at compile time (making use of the functional semantics of the language) lets the compiler find/prove matching increment/decrement pairs, in which case the reference counting code is eliminated.

[1] https://github.com/TU-Berlin/opal

Arc::make_mut is really nice and is something we use a bunch in Stylo to safely mutate possibly-shared styles. Across threads.

But isn't it an anti-pattern to wrap something in an `Arc` when some of your use cases for it are purely single-threaded, and as such you're wasting a certain (potentially very significant) amount of performance on upholding memory visibility and such even when it's not needed? I have only dabbled in Rust, so I may be misunderstanding the implications of Arc and how it is implemented, and how it behaves in the absence of multiple threads. To me it feels a bit like some of the relatively early Java days when even the Standard Library was using `synchronized` (which is of course way worse than `Arc`) in lots of places like StringBuffer (used for String concatenation), just to be safe in case anyone ever wanted to use that code in a multi-threaded context. Only later folks began to realize that maybe that's not such a great idea and that you need to be able to not pay any synchronization costs when you know you're not going to be sharing a data structure across threads.

These collections are of somewhat limited use outside of concurrent scenarios, and I don't think you can (currently?) be generic over allocation in Rust. Since these structures need some sort of automatic memory management for structural sharing, you need a GC of some sort; having both atomic and non-atomic refcounting would require duplicating the entire thing, once with Rc nodes and once with Arc nodes, for more or less no value.

> I may be misunderstanding the implications of Arc and how it is implemented, and how it behaves in the absence of multiple threads

All Arc means is that refcounting operations use atomic increment/decrement instead of regular ones.

So that means that the synchronization cost would only be paid when making a new copy of the Arc and releasing each of those. Ok, that extends the set of use cases where it's acceptable to use Arc then. Thanks for clarifying.

EDIT: wording

> So that means that the synchronization cost would only be paid when making a new copy of the Arc and releasing each of those.

Yes. An interesting note here is that due to Rust's ubiquitous use of references there will be little extraneous refcounting traffic unlike systems like e.g. Python where `a = b` might trigger a refcount increase. I think it was Armin Ronacher (mitsuhiko) who made that observation a few weeks back, possibly on twitter?

> But isn't it an anti-pattern to wrap something in an `Arc` when some of your use cases for it are purely single-threaded, and as such you're wasting a certain (potentially very significant) amount of performance on upholding memory visibility and such even when it's not needed?

Yes. You have Rc and Rc::make_mut for the single threaded case. Pick what you need. I interpreted the GP comment as talking about the general pattern of make_mut, not Arc::make_mut itself.

Stylo's multithreaded, we pick Arc. Singlethreaded systems can get by with Rc, but there are often legit reasons for picking Arc, and it's not that bad a hit.

> That's a really interesting approach.

... isn't this just basic copy-on-write ? that's something you learn to implement in first year of comp-sci. Old C++ types used to work like this - and some still do, for instance Qt's collection types are all COW.

No, this is a bit different.

This is copy-on-write-unless-we're-alone-then-just-mutate-without-copy. That is, it's possible to write without copying if it has been determined that there are no other readers that would be referring to the old value (thus no copy is needed).

That's exactly what the Qt collections do:

> Implicit sharing automatically detaches the object from a shared block if the object is about to change and the reference count is greater than one. (This is often called copy-on-write or value semantics.) [source: http://doc.qt.io/qt-5/implicit-sharing.html]

The C++ language is sometimes a little cruel, though. It's easy to accidentally call the mutable version of a function when the const version would have been sufficient. That can be a performance problem if it causes unnecessary copies.

It's what Delphi has been doing with strings since 1995. It really is just copy on write, which normally has an assumption that the thing is shared before you copy. It's easily implemented with a refcount. It makes heap allocated types behave a lot like value types.

> copy-on-write-unless-we're-alone-then-just-mutate-without-copy

well, yes, that's generally called copy-on-write. A COW class which did a copy every time you wrote to it would be pretty useless, wouldn't it?

e.g. if you did, in old std::strings:

    std::string s = "foo";
    s[1] = 'x'; // no copy here
    auto s2 = s; // no copy here
    s[1] = 'y'; // s is copied into a new buffer
    s[1] = 'z'; // no copy here
    s2 += "bar"; // no copy here

That is a natural extension of copy on write if you are keeping reference counts. I implemented something similar for a C++ library a couple of years ago.

Rust actually provides a COW type in the stdlib for augmenting any of your types with such behavior: https://doc.rust-lang.org/std/borrow/enum.Cow.html
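`Cow` is a different flavor of copy-on-write: it defers allocation until a change is actually needed, rather than tracking sharing with a refcount. A small sketch (the `strip_spaces` function is a made-up example):

```rust
use std::borrow::Cow;

// Return the input unchanged (borrowed, zero allocation) when
// possible; allocate an owned String only when we must modify it.
fn strip_spaces(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        Cow::Owned(input.replace(' ', ""))
    } else {
        Cow::Borrowed(input)
    }
}

fn main() {
    // No spaces: no allocation, the borrow is passed through.
    assert!(matches!(strip_spaces("nospaces"), Cow::Borrowed(_)));

    // Spaces present: an owned copy is made.
    assert_eq!(strip_spaces("a b c"), "abc");
}
```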

That feels a bit like software transactional memory. I was wondering if someone would try that in Rust.

What kind of use cases do people have for this? Since Rust already pretty tightly controls mutability, my normal reasons for using immutable data structures don't apply there.

Immutability can be nice when writing server logic. Suppose all your data structures are immutable and the only mutable state you have is a pointer to the root of your immutable data structures. To create a new state, you build one by allocating only what you have to, using pointers into the old state for the things that haven't changed, and then you update the root pointer to point at the new state.

Depending on the data structures used, updating state could require a lot of allocations. However, there are a couple really compelling benefits. Error handling is very easy because you can bail out at any point before swapping pointers and the server state remains unaffected. Also, readers can continue to read the old state without being blocked by writers -- basically, all reader threads can just go right in and read the state with no locking at all.
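A minimal sketch of that shape with std types, using `RwLock<Arc<State>>` as the single mutable root pointer (the `State` struct is illustrative):

```rust
use std::sync::{Arc, RwLock};

struct State {
    counter: u64,
}

fn main() {
    // The only mutable state: a pointer to the current immutable snapshot.
    let root = RwLock::new(Arc::new(State { counter: 0 }));

    // Reader: grab a cheap Arc clone and release the lock immediately.
    // It can keep reading this snapshot regardless of later writes.
    let snapshot = Arc::clone(&root.read().unwrap());

    // Writer: build a new state off to the side (bailing out anywhere
    // before the swap leaves the shared state untouched), then swap.
    let next = Arc::new(State { counter: snapshot.counter + 1 });
    *root.write().unwrap() = next;

    assert_eq!(snapshot.counter, 0); // old snapshot unchanged
    assert_eq!(root.read().unwrap().counter, 1); // new state visible
}
```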

I've used this approach before in Haskell and it seemed to work pretty well. I used the lens library to make data structure updates easier (syntactically, they look a lot like the sort of data structure updates you'd make in an imperative language, except that they do a copy-on-write behind the scenes), and also paired it with the acid-state library to make the Haskell data structures persistent across application restarts.

That was one of the interesting side-effects of digging into Rust that I wasn't expecting. I find that I get the same level of "no mutation surprises" I commonly see in Elm/functional code but with the benefit of breaking out mutation when I need performance in a way that I don't normally shoot myself in the foot with.

Having written C++ for quite a while it's really a refreshing feeling. Heck, I wish I had similar constrains available to me in Java for parts of an architecture that I know is going to be gnarly.

One thing I don't see anyone mentioning here: properly implemented persistent data structures, where the increase in space is low and predictable, can be used to efficiently store and query the history of a data structure's state.

One example is querying which polygon from a known set contains a point. If all the query points are known in advance, it can be done using a regular sweep line algorithm. Using a persistent ordered tree instead of a normal one allows storing its state over the course of the sweep, which removes the requirement of knowing all the query points at the start. After performing the sweep with a persistent data structure, a query for any point can be answered in O(log(n)) time. A similar technique can be applied to convert other offline algorithms that rely on sorting all the data into partially online ones (queries can be answered one by one, without knowing all of them in advance).

>Why Immutable Data Structures

>... Rust, being what it is, does a good job of discouraging this kind of behaviour, and keeping it strictly controlled when it's necessary, but the standard library doesn't provide collection data structures which are optimised for immutable operations. This means, for instance, that if you want to add an item to a Vec without modifying it in place, you first need to clone the whole thing before making your change.

>Data structures exist which are designed to be able to make these copies much cheaper, usually by sharing structure between them, which, because this structure is also immutable, is both cheap and safe. ...

Seems pretty straightforward. You still get the benefits of immutable-data coding patterns, because it requires only read access on a value, not write, all the way up to the moment you want to swap values (if ever). Making that faster is a worthwhile goal.

Ok, but I repeat: what is the use case?

Yes, this allows me to do things like add an item to a Vec without modifying it in place, but why would I want to do that? What's the situation where that's helpful?

In other languages I do that, specifically to avoid mutability hell, but that doesn't apply to rust.

EDIT: Thanks to the responders for the great examples!

> Yes, this allows me to do things like add an item to a Vec without modifying it in place, but why would I want to do that? What's the situation where that's helpful?

The context is obviously shared concurrency (hence the library using Arc in the first place): you can cheaply get a shared immutable structure and manipulate it on your side (without sharing it back), and if necessary share/swap it back in at some point. They provide cheap read/copy/update cycles.

With regular collections Rust would preclude misuses but that would still require either deep copies upfront or fairly large exclusive critical sections.

One example: storing a large text buffer in a persistent data structure (e.g. a rope) can allow you to maintain a space-efficient history of "snapshots" simply by keeping pointers to older versions of the buffer.
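As a toy illustration of the snapshot idea, here is a cons-list "buffer" where keeping history is just keeping old pointers; a real rope would be a balanced tree, this only sketches the sharing:

```rust
use std::rc::Rc;

// Each "edit" prepends a node; old versions share the tail.
type Buf = Option<Rc<Node>>;

struct Node {
    ch: char,
    rest: Buf,
}

fn push(buf: &Buf, ch: char) -> Buf {
    Some(Rc::new(Node { ch, rest: buf.clone() }))
}

fn to_string(mut buf: &Buf) -> String {
    let mut out = String::new();
    while let Some(node) = buf {
        out.push(node.ch);
        buf = &node.rest;
    }
    out.chars().rev().collect()
}

fn main() {
    let v0: Buf = None;
    let v1 = push(&v0, 'h');
    let v2 = push(&v1, 'i');

    // Every snapshot remains valid; shared tails are never copied.
    assert_eq!(to_string(&v1), "h");
    assert_eq!(to_string(&v2), "hi");
}
```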

No ability to mutate -> no mutable ownership complications. Ever. And that's probably the biggest pain-point for people using Rust.

Not necessary, but can simplify things, by encouraging(/forcing) simpler use of data by default.

But you get another pain-point in return: having to replace the whole state at the root, which in turn leads you to Redux.

That's a pretty normal way to handle immutable-data systems, yes.

It might be useful for RCU data structures in multithreaded code

Absolutely. This is one very good use case.

Say you have a configuration to read and share with all threads. Each thread can make local changes that will not be visible to the others. The API looks like it's mutating the shared object, but it's actually mutating a copy.
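A sketch of that shape with std types: each thread appears to mutate the shared config, but `Arc::make_mut` quietly gives it a private copy (the config contents here are made up):

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    let config = Arc::new(vec!["debug=false".to_string()]);
    let mut handles = Vec::new();

    for i in 0..2 {
        let mut local = Arc::clone(&config);
        handles.push(thread::spawn(move || {
            // Looks like mutation of the shared config, but since the
            // Arc is shared, make_mut copies the Vec before mutating.
            Arc::make_mut(&mut local).push(format!("thread={}", i));
            local.len()
        }));
    }

    for h in handles {
        assert_eq!(h.join().unwrap(), 2); // each thread saw its own edit
    }
    assert_eq!(config.len(), 1); // the shared original is untouched
}
```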

maintain history/rewind state to any previous stable time, like git/datomic.

Seems misguided, though. Specifically, this has the implicit claim that the shared structures will be faster than the alternative. The whole "is both cheap and safe." However, Rust uses other mechanisms to provide the safety, such that it may be "cheaper," but only compared to copying the full data structure. It will not be cheaper than just modifying in place.

The question, then, is how often is that strategy truly idiomatic and a net win in Rust?

We had a talk at RustConf previously about the Xi editor; it uses an immutable data structure to handle text. It's a video but the author lays out the details here: https://www.youtube.com/watch?v=SKtQgFBRUvQ

They make an interesting argument about how Rust makes it easier and nicer to write these kinds of structures.

It's still an area of exploration in Rust generally though; I wouldn't say immutable data structures are idiomatic, but that's also just because we haven't had a ton of easy ones to use, so nobody uses them. So we'll see!

Thanks for the shout-out. You can also read the docs[0] about the rope data structure (the release is a bit old, if you're really curious git master is better).

I admit this is a fairly specialized use case, for most things xi uses mutability in a controlled way.

[0] https://docs.rs/xi-rope/0.2.0/xi_rope/tree/index.html

I saw downthread a link to someone talking about how you can already treat the standard collections as values. It makes a good argument for keeping the same interface regardless, and a good concession that there are definitely use cases where persistent collections make sense. I'd imagine they are not as common as one might expect. (I grant they are also likely more common than I expect. :) )

With an immutable data structure the only synchronization cost is that of atomic reference count increment/decrement, but it feels like you're mutating shared objects, so the API is very easy to use.

For an example API I'm familiar with, see https://github.com/stedolan/jq/wiki/C-API:-jv

If this is at all like libjq's jv API [0], then it will be great. The C-coded jv API gives you immutable JSON-ish data structures, with the illusion (and sometimes the truth) that they are actually mutable.

For example, say you have

    jv foo(jv a)
    {
        return jv_array_append(a, jv_string("foo"));
    }
Here `jv_array_append()` looks like it must mutate `a`. What actually happens under the covers is that if `a` has just one reference, then indeed, `a` will be mutated, but if it has more than one reference then a new array will be constructed as a copy of `a`, `a`'s reference count will be decremented, the copy will be mutated, and lastly the copy will be returned.

This might make it a bit clearer:

    jv a = jv_array();
    a = jv_array_append(a, jv_true());
    a = jv_array_append(a, jv_false());
    a = jv_array_append(a, jv_null());
Each assignment to `a` here can be a brand new array, though since in this code snippet `a` will always have just one reference, in fact `a` will always be mutated.

Now, look at this:

    jv a = JV_ARRAY(jv_true(), jv_false(), jv_null());
    set_thingy(jv_copy(a));
    a = jv_array_append(a, jv_number(10));
here at the third line `a` might have two references, depending on what `set_thingy()` might have done with its reference. Perhaps `set_thingy()` saved its reference in a global or thread-local variable, or some object reachable from a global or thread-local variable, in which case the call to `jv_array_append()` in this example will produce a brand new array.

See the jq jv wiki[1].

Now, this is all C. But in Rust all of this would be even more powerful.

[0] https://github.com/stedolan/jq/blob/master/src/jv.h

[1] https://github.com/stedolan/jq/wiki/C-API:-jv

Immutable structures allow sharing, and keeping multiple versions around. For instance, modelling game state as an immutable tree so the AI can cheaply try different options.

Also, sometimes it's simply simpler. Exception handling on Windows uses a thread-local immutable linked list of handlers.

So honestly they probably do. This is useful whenever you want to share and optionally modify some data (especially if it's large).

E.g. in a multi-threaded video game, for each frame you probably have some big block of state which some threads will mutate. Instead of making a copy of the data for every thread, you could pass around a big immutable data structure and only make copies when writing. Each thread owns its own data, so there won't be lock contention, and when all the threads are done, a quick merge operation can recombine the modified pieces back into a single block of state for the next frame.

If you are into immutability and C++ you should really have a look at Immer by Arximboldi (who is a user here on HN). Really amazing stuff: https://github.com/arximboldi/immer

The other immer[0] of note is a JS library that gives you immutability with a mutable API, i.e. you write mutating code as normal, but all changes are proxied to a copy, which is then returned as the result of your mutations.

[0]: https://github.com/mweststrate/immer

Sounds a lot like jq's jv API: https://github.com/stedolan/jq/wiki/C-API:-jv

Thanks for mentioning Immer! I am studying Immutable-RS right now and it seems interesting. The code is much more straightforward than Immer's thanks to Rust's expressivity when it comes to sum types, etc. Sadly, on the other hand, some of the tricks that Immer implements, in particular the ones that flatten leaf nodes, cannot be implemented in safe Rust, and my guess is that Immer is at the moment more efficient in both run time and space usage. I would love to see some benchmarks comparing the two, though!

Yeah, I've been using this to make an interpreted Clojure-style Lisp with immutable data structures. Great library.

http://smallcultfollowing.com/babysteps/blog/2018/02/01/in-r... is an interesting look at how Rust's ownership system lets you treat the standard collections as values.

Good read. If you are the author: I'm not sure if Feedly is lying about you having RSS, but despite the site saying you do, I cannot subscribe.

The author doesn’t post to HN that I’m aware of, I’ll ping him. Thanks!

BTW, while it's true the feed URL listed in the <head> tag of that site is incorrect, you can in fact subscribe using http://smallcultfollowing.com/babysteps/atom.xml

The ergonomics of using immutable structures with Rust are really good. You wouldn't know you have an immutable structure in front of you (other than that maybe there are more Arcs than you would expect).

I'm trying to figure out what the basis is for the implementation of the sorted map. For example, the related project https://github.com/orium/rpds explicitly notes that it bases its sorted map on Okasaki's red-black tree.

ordmap.rs notes that it's based on B-trees. Doesn't that help?

This could really use some examples.

I hope this is as easy to use as, say, the jv functions in libjq [0].

[0] https://github.com/stedolan/jq/blob/master/src/jv.h

This looks more like a view library, where you have an iterable and views that let you access it in convenient ways while copying as little as possible, than a data structure library, where data is created on the fly within the data structure as needed.
