Hacker News new | past | comments | ask | show | jobs | submit login
Immer: immutable and persistent data structures for C++ (github.com/arximboldi)
144 points by ingve on Nov 27, 2016 | hide | past | favorite | 45 comments

Author here, open for questions :)

How much does immutability/persistence cost in terms of performance and memory? Are there benchmarks against std::vector? E.g, are we talking about 10x or 100x slow down?

Performance is one of the goals, so I am benchmarking everything. I run benchmarks on Travis, locally on my machine, and sometimes on a Raspberry Pi 3. You can check some of the reports here: https://public.sinusoid.es/misc/immer/reports/ (this is a bit hard to interpret without further comment).

So, the slowdown depends on the operation.

At the moment, comparing `immer::vector` to `std::vector`, example, for reading, using internal iteration (e.f. `for_each`) you get 1.5X, using iterators something like 2.5X, using [] something like 6X iteratively, something like 1.1 in random order.

Updating a single `int` element is like 100X. (This factor decreases with the size of the updated element, though!) But then, for example, `push_back` is only 1.1X std::list, and it does not rely on amortization, making it better for latency. Considering that you get access time similar to std::vector and persistence, it is a good deal. Also, with `flex_vector`, you get logarithmic slicing and concatenation, making these operations faster than for std::vector (when using big-enough vectors). Also, I am working now on _batch updates_ via transients, so if you update 10 elements in a row, you don't pay 10 times 100X (basically, you only pay the 100X once, which is very much ok for most use-cases, think like a 1microsecond tax per mouse click ;)

There are lots of trade offs one can make (in general, one can make updating a bit faster by making access a bit slower). Also, it depends a lot on the memory management strategy, which can also be customized.

Are the benchmarks supposed to look like this, with tooltips all the way at the right?


I'm on Win7 x64 Chrome.

That's how they look for me too. I agree it makes it hard to see which bar is what, but it makes it easy to scan all the results numbers in order.

It all depends on the scenario. Consider this situation: a sending object has an internal collection of items and wants to send it to a receiver. With a mutable collection the sender sends a shallow copy of the collection, because the receiver can't be trusted not to modify the original (which is part of the internal state of the sender).

With an immutable collection the sender can send the original, no copy required. The immutable collection can be shared (lock free) by any number of users.

In this case you have both reduced time and memory use.

So while immutable collections are obviously "slower" for general (mutable) scenarios, the win comes when you need less locking, less copying, and so on.

It is interesting how in half of computing we are lamenting how badly trust can cause issues, and then in another half we are saying you can trust parts of the code to not modify your objects that you are passing.

That is, there is nothing that guarantees foreign code doesn't modify your "immutable" data. You can start building in checks into the system to make sure they were not modified, but you will eventually get to the point where you are basically locking on your data. Or just sending a copy of just enough of the data for the other end to work. (Which, if you are at all distributed, can make some sense anyway.)

I get it, in that if you are sticking to the contract, than immutable leads to this. So, in a very real sense, it helps an individual (or small team) stick to the convention of "initialize and then use" data structures.

None of this is to say that immutable data can not be really useful. It can be. Awesomely so. I just get a little worried at the benefits being touted as absolute.

To me, the much more interesting aspect of immutable structures are the ones that let you run something in an append only way. Or in a method where you can effectively recreate a local history of an object from the data. (Neither of these things are "new", btw. In a simplistic sense, this is refinding that assoc lists can be useful.)

Yeah, thanks for your comment. Indeed, the idea is that by paying a tax on updates, you get a much bigger improvement on the overall system performance, and also, better scalability.

I only know these things from clojure on the jvm. How do you implement these without Garbage Collection, since freeing the wrong memory in one version could spell disaster for another version of the structure due to structural sharing.

The typical way of doing garbage collection in C++ is via reference counting. This is the strategy used in the library by default.

However, this can be customized in the library. There is a brief description here: https://sinusoid.es/immer/memory.html

For example, one may choose between thread safe or thread unsafe reference counting (the later is much faster!). One may also plug in a conservative garbage collector, like `libgc` (https://github.com/ivmai/bdwgc). While reference counting is often considered to be bad for immutable data structures, I found out that this is not the case. They have very interesting interactions with move semantics and _transients_ (this is a feature that is still on the design phase though). I should write more about this somewhere, maybe in a blog or even as a paper.

GPL seems like a poor choice of license given the way the library would normally be used. What about switching to MIT?

Or even LGPL, what copy-left libraries should be using.

Don't get me wrong, the GPL isn't the "worst thing ever" or a "virus", but I feel it's best suited to full applications rather than libraries.

We need more copyleft not less. I'm glad it is GPL and also plan to use it should I need immutable data structures.

Certainly copyleft has a place, but changing the license to LGPLv3 would allow this very useful library to be used much more broadly while continuing to require that improvements to the shared code be made public. If glibc were GPLv3 rather than LGPLv3 (same goes for Boost, etc.) almost no one would use it. IMO this library will either be re-written under a less restrictive license by some other author (wasting time and effort of the community) or migrate to a more open license (LGPL, ASLv2, MPLv2, MIT, etc.).

For something as fundamental as a container library, GPL will basically mean it's not used widely. There's a reason many language ecosystems generally use BSD/MIT licences for almost all libraries (haskell being just one example I know this to be true for). Applications may be another story.

I had no time to read the source yet. I was wondering, did you use any of the work from Chris Okasaki (Purely Functional Data Structures) or it is irrelevant to your project?

I did read the book few years ago. I found his ideas on how amortization vs persistence and exploiting laziness for data structure very interesting.

However, many of the data structures in that book do not translate great outside of the world of Haskell and ML. This library is mostly based on the work of Phil Bagwell, Rich Hickie, Tiark Rompf and others. Phil worked over the last decade on building immutable and/or concurrent data structures that are "pragmatic", ie. cache efficient, and supporting mutable facades. I recommend reading his work on Hash-Array Mapped Tries and Radix Balanced Trees.

Thanks for explaining. I am going to read up on his work.

What do you really mean when you say that the complexity is "effectively" O(1)? Amortized O(1)?

Thanks for the question, I should add a clarification. It is a term I borrowed from some Scala papers... The idea is that it is actually logarithmic, but the base of the logarithm is very big, so in practice, it is just like a constant (kinda log32 with the default options). With these settings, the biggest vector of ints that fits in 4GB of memory has a depth of 6. And you can store 1M ints in just 4 levels. 1K in 2. This constant is much lower than what you are paying anyways for cache thrashing when working with big vectors. So while from a theorical point of view it is `O(log(n))` considering real programs running on real computers, it is more useful to think of this as a constant factor. Probably I should update the docs with some clarifications or some more honest terminology.

Oh, very interesting, thank you!

This is awesome! Definitely looking forward to its final release.

Are you planning to use some C++17 features, or strictly stick with C++14?

There is a tension between me wanting to use the fanciest shiniest features and me wanting it to be usable by the widest audience (even people using Microsoft compilers). I know of people that are still migrating their commercial code bases to C++14 that are interested in using them. It is probably in my best interest to support them, so it will take a while until I start using newer standards.

Anyways, the _implementation_ of the data structures themselves does not really benefit that much from C++17 (well, there might be some interesting stuff that I could do with hazard pointers...). However, people using the new concurrency stuff from the standard might really want to look into using immutable data structures to pass data between tasks.


I plan to change it to a more liberal license for the first stable release (I should add a note to the readme). I still have to choose which one. The GPLv3 was chosen as an interim conservative option.

One way to help me choose one of the most liberal licenses is to get your company to join the sponsorship program ;)

Anyways, I and many others use plenty of GPLv3 software every day, let's agree to disagree on the "useless" part...

Similar library with MIT, if that is your cup of tea: https://github.com/rsms/immutable-cpp

For most people? I highly doubt that. But it might make it unuseable for you.

The vast majority of C++ programmers in the world aren't writing open source software.

You are free to use GPL libraries in proprietary code as long as you don't distribute it.

Any sources? I'm not sure it is that obvious.

According to https://adtmag.com/Blogs/WatersWorks/2014/01/Worldwide-Devel..., in 2014 there were 18 million coders world wide (so few!) of which 11 million were professional and 7.5 million hobbyists. Hobbyists does not have to mean Open Source, but professional does not have to mean proprietary either. At the same time, half of all open source software was written by professionals (http://www.techrepublic.com/article/for-50-percent-of-develo...). I am not aware that there are any complete stats on how many open source developers exist. And one should not forget that also a professional C++ programmer can use a library as this for a FOSS project.

We are getting OT. It is okay to ask with respects to the licence, but it is not okay to act like a GPL project is worthless.

Where exactly in the landscape of postmodernism do you locate your data structures? Are you happy with the outcome and what was the original inspiration to write that outstanding piece of code?

They may be located somewhere in the forest of continental post-structuralism somewhere between Derrida and Deleuze, even though this is a very structuralist assertion in a way, a contradiction that we shall embrace anyways. I wanted to deconstruct the data post-structures from Clojure and Scala following. The outcome is that I wrote code that I can barely understand, something that while burying my engineering career and speeking poorly of my intellectual abilities, might open me doors of the philosophical heavens.

Thank you for your honest and enlightening response. I'm glad that we still have programmers who show the courage to take risks and go beyond what they deem possible and find true knowledge that cannot be understood by the commoners mind. You are an inspiration to me.

Please consider changing the license to LGPL or something even more liberal. GPLv3 is too restrictive even for free software, and may result in problems when linked to GPLv2 only, MPL and similar software, such as NS-3 and many others.

I assume the author chose this licence because it reflected his wish for the user's freedoms to run, study, share and modify the software to be guaranteed. He probably doesn't want a permissive licence.

For what's worth, I urge you to keep GPLv3 :)

> Postmodern immutable and persistent data structures for C++

What is a postmodern data structure?

They probably mean "trying to get on the hype train". I have nothing against immutable data structures, but that wording is just silly.

It was meant as a half-joke picking on how everything is now "modern" on C++.

The half-serious part is that there has been a trend in the last few years in the "modern" C++ community against Copy-On-Write, heap allocation, shared pointers, etc. Immutability kind of brings new life to these concepts...

I thought it was clearly a joke poking fun at all the other modern libraries.

IIRC postmodernism is related to post-structuralism, which would be quite ironic considering this is about data _structures_.

Ah, but remember that post-structuralism is a _critique_ of structuralism....

(and yeah postmodernism and post-structuralism are related, but like everything in this area, It's Complicated)

I'm glad the author joined us for a Q&A sesh.

If anyone is looking for something like this for Java, there are a few of them, including my own FSet: https://github.com/slburson/fset-java

What are some of the others you know of?

Another java library with persistent/immutable datastructures is javaslang.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact