Optimizing Hash-Array Mapped Tries for Fast Immutable JVM Collections (2015) [pdf]

jcdavis · on Aug 6, 2017

The author presented a talk about this at the JVM Language Summit last year: https://www.youtube.com/watch?v=pUXeNAeyY34&list=PLX8CzqL3Ar...

Github repo for those interested: https://github.com/usethesource/capsule/

agumonkey · on Aug 6, 2017

Interesting talk in a way but god the amount of wasted neurons from java-like OOP.. all this to xi..xn,yi...yn instead of xi,yi...,xn,yn ..

VHRanger · on Aug 6, 2017

What is the real world use case for a HAMT data structure? Tree based associative arrays (like c++'s std::map) tend to have poor real world performance

chongli · on Aug 6, 2017

They are used to implement Clojure's immutable map and set structures. They are absolutely critical to the philosophy of the language.

gshayban · on Aug 6, 2017

std::map and red-black trees are associative & sorted (by key); HAMTs are just plain associative. HAMTs have generally excellent performance: the typical implementations use 32-way branching, which allows for the top levels to be likely in cache.

The best thing about HAMTs is passing immutable ones around, which have free update costs and multi-thread safe. Clojure uses these to represent information that flows through a program. (By free update costs I mean: they are free enough to never rise to the programmer's attention. Obviously ops have costs.)

masklinn · on Aug 6, 2017

More efficient (in both CPU and memory) immutable/persistent collections, which have the advantage that they're intrinsically thread-safe. They can also be used to implement lock-free concurrent maps (Ctrie)

> Tree based associative arrays (like c++'s std::map) tend to have poor real world performance

That's something HAMT significantly improve upon. You're not going to get linear performances out of them but basic operations are generally O(ln N) where N is usually 32.

dmichulke · on Aug 6, 2017

I think it's for an array of size N:

- O(1) - standard array (regardless of N)

- O(log_b(N)) - HAMT (b is 32 in clojure)

Since b is big, log_b(N) is quite small even for big numbers, e.g. log_32(2^20) = 4

mrkgnao · on Aug 6, 2017

Um, isn't that "constant-time in practice"? (Something like log-star.) Am I missing some matter of convention here?

wfunction · on Aug 6, 2017

Is there any research on the tradeoff between this kind of thread safety and the penalties incurred on the memory management side (specifically the fact that now you have far more heap allocations with immutable data and they have to synchronize too)?

joe-user · on Aug 7, 2017

I don't know of any formal research, but in Clojure, where HAMTs are a fundamental part of the language, the philosophy is more oriented towards paying for things (such as thread safety) with memory and CPU first, benchmarking to find bottlenecks second, and if there are any bottlenecks, optimizing accordingly. It was written with this in mind, which is why it requires world-class VMs like the JVM, CLR, and V8/JSC/SpiderMonkey/etc. that can deal with the GC and (in many cases) make runtime optimizations.

Also, I'm fairly certain synchronizing isn't an issue because there's nothing to synchronize on since the data structures are immutable. Am I understanding you correctly?

wfunction · on Aug 7, 2017

> Also, I'm fairly certain synchronizing isn't an issue because there's nothing to synchronize on since the data structures are immutable. Am I understanding you correctly?

No -- I'm referring to synchronization on the heap itself (think new()/delete()). Multiple threads allocating memory need to synchronize in order to avoid trampling on each other. You can't just get rid of synchronization entirely.

jcdavis · on Aug 6, 2017

Immutable maps that have efficient update semantics

FreakLegion · on Aug 6, 2017

Depending on the use case, of course. Sometimes you need a structure that's both associative and ordered, for example, in which case std::map is handy.

masklinn · on Aug 6, 2017

HAMT are not intrinsically ordered or sorted.

FreakLegion · on Aug 6, 2017

The comment obviously refers to std::map, which it mentions explicitly and which guarantees ordering.

masklinn · on Aug 7, 2017

But the entire discussion is about HAMT not std::map, the parent comment only talks about std::map as an example of performance issues in tree-based maps in asking how that relates to HAMT, it doesn't actually care a whit about std::map.

zokier · on Aug 6, 2017

So would this be a good structure to store the recently discussed password hash dump, basically a fixed set of 300 million SHA1-sums where the only interesting operation is checking if the sum is or is not contained in the set?