
Efficient Immutable Collections [pdf] - tjalfi
https://michael.steindorfer.name/publications/phd-thesis-efficient-immutable-collections.pdf
======
norswap
This is really cool work by Michael, a collection of configurable data
structures.

Underlying most of them is CHAMP - a compressed hash-array mapped prefix-tree.
Essentially it's a trie over the hash of the objects inserted in the map. It's
compressed using a clever technique that involves bitmaps.
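
To make the bitmap idea concrete, here is a minimal sketch of a single bitmap-compressed trie node. It is not taken from the thesis or from the toy implementation linked here; the names (`Node`, `mask`, `compressed_index`) and the 5-bits-per-level branching factor are assumptions chosen for illustration. The node keeps a 32-bit bitmap that records which of the 32 logical branches are occupied, and a dense array holding only the occupied ones. The position in the dense array is the number of set bits below the branch's bit.

```python
BITS_PER_LEVEL = 5                      # assumed: 32-way branching
MASK = (1 << BITS_PER_LEVEL) - 1


def mask(hash_code: int, shift: int) -> int:
    """Extract the 5-bit chunk of the hash that selects a branch at this level."""
    return (hash_code >> shift) & MASK


def compressed_index(bitmap: int, bitpos: int) -> int:
    """Count the set bits below bitpos; that count is the slot in the dense array."""
    return bin(bitmap & (bitpos - 1)).count("1")


class Node:
    """Hypothetical immutable node: a bitmap plus a dense tuple of occupied slots."""

    def __init__(self, bitmap=0, slots=()):
        self.bitmap = bitmap
        self.slots = tuple(slots)

    def get(self, chunk):
        bitpos = 1 << chunk
        if not self.bitmap & bitpos:
            return None                 # branch is empty, nothing is stored for it
        return self.slots[compressed_index(self.bitmap, bitpos)]

    def with_slot(self, chunk, value):
        """Return a new node with the given branch set; the original is untouched."""
        bitpos = 1 << chunk
        idx = compressed_index(self.bitmap, bitpos)
        if self.bitmap & bitpos:        # overwrite an occupied branch
            slots = self.slots[:idx] + (value,) + self.slots[idx + 1:]
        else:                           # insert, keeping slots ordered by branch number
            slots = self.slots[:idx] + (value,) + self.slots[idx:]
        return Node(self.bitmap | bitpos, slots)
```

A node with three occupied branches therefore stores three slots rather than 32, and a lookup costs one bit test plus one population count.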

I made a toy implementation of it to get a sense of how it works. There are
some accompanying notes that you might find useful:
[https://github.com/norswap/triemap](https://github.com/norswap/triemap)

~~~
wyager
Could you expand a bit on the differences between CHAMPs and HAMTs? Most of
the info I see about CHAMPs makes them seem very similar, but with a slightly
different node structure.

~~~
drawnwren
From the bottom of the Capsule GitHub page [1]: "HAMTs already feature
efficient lookup, insert, and delete operations, however due to their
tree-based nature their memory footprints and the runtime performance of
iteration and equality checking lag behind array-based counterparts.

We introduce CHAMP (Compressed Hash-Array Mapped Prefix-tree), an evolutionary
improvement over HAMTs. The new design increases the overall performance of
immutable sets and maps. Furthermore, its resulting general purpose design
increases cache locality and features a canonical representation."

1 -
[https://github.com/usethesource/capsule](https://github.com/usethesource/capsule)

~~~
wyager
That doesn't answer any of my questions, and they don't have any citation in
that block of text. Real details could be in any of the 8 papers linked, or
none at all.

~~~
drawnwren
They increase cache locality and runtime performance of iteration and equality
checking. If that answer isn't sufficient, try watching the ~30min Clojure
West video on the github page. The speaker seems to be a bit new to public
speaking but his talk on the subject was easy to follow and informative for
me.

~~~
loopingoptimism
The talk at Clojure West wasn't given by me, but I found it worth linking.
It was given by someone who independently picked up my research results and
replicated them in the context of ClojureScript
([https://github.com/bendyworks/lean-map](https://github.com/bendyworks/lean-map)).
The authors independently confirmed the performance improvements of CHAMP
over HAMT that I was observing.

You can also have a look at the JVMLS'16 talk
([https://www.youtube.com/watch?v=pUXeNAeyY34](https://www.youtube.com/watch?v=pUXeNAeyY34))
for a high-level overview of the work that is covered in my thesis.

~~~
fnordsensei
From what I've read, it seems that the performance benefits disappear when
compound data structures/objects are used as keys. Is this true?

Let's say that I were to use a vector of coordinates as the key to something
in a map. Would CHAMP still vastly outperform HAMT?

~~~
loopingoptimism
Why should that be the case? Can you point to sources where you read that? In
my experience, CHAMP clearly has advantages over HAMT in this scenario.

To answer your question about using vectors of coordinates as keys: it
depends on the design and implementation of the vector's hash code,
regardless of whether you use HAMT or CHAMP.

Using collections as keys in other collections is in general a performance
sensitive subject. The available HAMT implementations in Clojure and Scala
fail to deliver here. The case study in Chapter 3.7 nests hash-sets into
hash-sets (i.e., Set.Immutable<Set.Immutable<K>> sets). The CHAMP
implementation yields speedups of at least ~10x over Clojure and Scala due to
the way it calculates and incrementally updates the collection's hash code.
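
As a rough illustration of what incrementally updating a collection's hash code can look like, here is a minimal sketch. It is not the thesis implementation: the class name `HashedSet` is hypothetical, the backing `frozenset` is only a stand-in for the trie (so copying is not efficient here), and the hash is assumed to be the sum of the element hashes truncated to 32 bits, which is the definition `java.util.Set.hashCode` uses. The point is only the bookkeeping: each update adjusts the cached hash in constant time, so hashing a set that is itself used as a key never has to walk its elements.

```python
HASH_MASK = 0xFFFFFFFF                  # keep the running hash within 32 bits


class HashedSet:
    """Hypothetical immutable set that carries its hash along with every update."""

    __slots__ = ("_items", "_hash")

    def __init__(self, items=frozenset(), cached_hash=0):
        self._items = items             # stand-in for the trie structure
        self._hash = cached_hash        # sum of element hashes, mod 2**32

    def add(self, x):
        if x in self._items:
            return self
        # O(1) hash update: add the new element's hash to the running sum.
        return HashedSet(self._items | {x}, (self._hash + hash(x)) & HASH_MASK)

    def remove(self, x):
        if x not in self._items:
            return self
        # O(1) hash update: subtract the removed element's hash.
        return HashedSet(self._items - {x}, (self._hash - hash(x)) & HASH_MASK)

    def __hash__(self):
        return self._hash               # no traversal of the elements

    def __eq__(self, other):
        if not isinstance(other, HashedSet):
            return NotImplemented
        # Unequal cached hashes rule out equality without comparing contents.
        return self._hash == other._hash and self._items == other._items

    def __contains__(self, x):
        return x in self._items

    def __len__(self):
        return len(self._items)
```

Because the sum is commutative, two sets built in different insertion orders end up with the same cached hash, and nesting one `HashedSet` inside another reuses the inner set's cached value.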

------
hellofunk
To anyone wondering why stuff like this matters, it's because the benefits of
functional programming reach new heights when coupled with efficient immutable
structures. In C++, for example, you can do functional programming in the most
basic sense of the word, and it's actually pretty fun. But it can be very
expensive because you are not working on efficient data structures that
support the mangling and idioms that make FP really shine. There have been
interesting efforts to bring structures like this to C++ but nothing mature or
widely known. When people talk about FP, it's about a lot more than what
you can do with functions and expressions; it's about making sure that these
functional manipulations remain very fast without lots of copying, and that's
what is so fascinating about data structure research: how it supports new ways
to write programs.

------
amenghra
Chris Okasaki's work (cited several times in this thesis) is worth reading if
you want to learn more about these kinds of data structures. His thesis is
here:
[https://www.cs.cmu.edu/~rwh/theses/okasaki.pdf](https://www.cs.cmu.edu/~rwh/theses/okasaki.pdf)
and parts of it should be easy to grok by every software engineer. Chris also
wrote a book on the same topic.

~~~
candu
Yup. Purely Functional Data Structures (thesis as linked, book:
[https://www.amazon.com/Purely-Functional-Structures-Chris-
Ok...](https://www.amazon.com/Purely-Functional-Structures-Chris-
Okasaki/dp/0521663504)) is awesome. I especially found some of the tree-based
recursive algorithms eye-opening (once you spend the time to wrap your head
around them).

IMHO required reading if you're doing any heavy FP work.

~~~
myth_drannon
There is also a UChicago course that follows some parts of this book and
implements them in the Elm language:

[https://www.classes.cs.uchicago.edu/archive/2017/spring/2230...](https://www.classes.cs.uchicago.edu/archive/2017/spring/22300-1/)

------
devrandomguy
Clojure devs: 3.5 is the figure you are looking for, and it does look very
good, at first glance.

~~~
emsimot
Did you mean "3.6 Benchmarks: CHAMP versus Clojure’s and Scala’s HAMTs"?

"Speedups Compared to Clojure’s Maps: In every runtime measurement CHAMP is
better than Clojure. CHAMP improves by a median 72 % for Lookup, 24 % for
Insert, and 32 % for Delete. At iteration and equality checking, CHAMP
significantly outperforms Clojure. Iteration (Key) improves by a median 83 %,
and Iteration (Entry) by 73 %. Further, CHAMP improves on Equality (Distinct)
by a median 96 %, and scores several magnitudes better at Equality (Derived).
Speedups Compared to Clojure’s Sets: The speedups of CHAMP for sets are
similar to maps across the board, with exception of insertion and deletion
where it scores even better."

Interesting indeed!

~~~
amelius
Nice results, but in the old days, algorithms improved on other algorithms in
the big-O sense.

~~~
anarazel
Right, which is why nobody ever used qsort. Or, wait. It's been widely used
for a long time...

~~~
amelius
Well, I didn't say "worst case".

If qsort only provided a constant ratio time improvement over bubble sort in
the average case, then it wouldn't have been so popular.

~~~
kcorbitt
Actually, big-O complexity by definition is determined based on the worst
case. [https://stackoverflow.com/questions/3230122/big-oh-vs-big-
th...](https://stackoverflow.com/questions/3230122/big-oh-vs-big-theta)

------
lgierth
Very interesting! We've been using HAMTs in IPFS [1] to make huge directories
more efficient (think NPM or Wikipedia), and the memory profile has been a
pain, so this looks like a welcome improvement.

[1]
[https://github.com/ipfs/specs/issues/32](https://github.com/ipfs/specs/issues/32)

