Hacker News new | past | comments | ask | show | jobs | submit login
Optimizing Hash-Array Mapped Tries for Fast Immutable JVM Collections (2015) [pdf] (steindorfer.name)
98 points by tosh on Aug 5, 2017 | hide | past | favorite | 25 comments



The author presented a talk about this at the JVM Language Summit last year: https://www.youtube.com/watch?v=pUXeNAeyY34&list=PLX8CzqL3Ar...

Github repo for those interested: https://github.com/usethesource/capsule/


Interesting talk in a way but god the amount of wasted neurons from java-like OOP.. all this to xi..xn,yi...yn instead of xi,yi...,xn,yn ..


What is the real world use case for a HAMT data structure? Tree based associative arrays (like c++'s std::map) tend to have poor real world performance


They are used to implement Clojure's immutable map and set structures. They are absolutely critical to the philosophy of the language.


std::map and red-black trees are associative & sorted (by key); HAMTs are just plain associative. HAMTs have generally excellent performance: the typical implementations use 32-way branching, which allows for the top levels to be likely in cache.

The best thing about HAMTs is passing immutable ones around, which have free update costs and multi-thread safe. Clojure uses these to represent information that flows through a program. (By free update costs I mean: they are free enough to never rise to the programmer's attention. Obviously ops have costs.)


More efficient (in both CPU and memory) immutable/persistent collections, which have the advantage that they're intrinsically thread-safe. They can also be used to implement lock-free concurrent maps (Ctrie)

> Tree based associative arrays (like c++'s std::map) tend to have poor real world performance

That's something HAMT significantly improve upon. You're not going to get linear performances out of them but basic operations are generally O(ln N) where N is usually 32.


I think it's for an array of size N:

- O(1) - standard array (regardless of N)

- O(log_b(N)) - HAMT (b is 32 in clojure)

Since b is big, log_b(N) is quite small even for big numbers, e.g. log_32(2^20) = 4


Um, isn't that "constant-time in practice"? (Something like log-star.) Am I missing some matter of convention here?


Is there any research on the tradeoff between this kind of thread safety and the penalties incurred on the memory management side (specifically the fact that now you have far more heap allocations with immutable data and they have to synchronize too)?


I don't know of any formal research, but in Clojure, where HAMTs are a fundamental part of the language, the philosophy is more oriented towards paying for things (such as thread safety) with memory and CPU first, benchmarking to find bottlenecks second, and if there are any bottlenecks, optimizing accordingly. It was written with this in mind, which is why it requires world-class VMs like the JVM, CLR, and V8/JSC/SpiderMonkey/etc. that can deal with the GC and (in many cases) make runtime optimizations.

Also, I'm fairly certain synchronizing isn't an issue because there's nothing to synchronize on since the data structures are immutable. Am I understanding you correctly?


> Also, I'm fairly certain synchronizing isn't an issue because there's nothing to synchronize on since the data structures are immutable. Am I understanding you correctly?

No -- I'm referring to synchronization on the heap itself (think new()/delete()). Multiple threads allocating memory need to synchronize in order to avoid trampling on each other. You can't just get rid of synchronization entirely.


Immutable maps that have efficient update semantics


Depending on the use case, of course. Sometimes you need a structure that's both associative and ordered, for example, in which case std::map is handy.


HAMT are not intrinsically ordered or sorted.


The comment obviously refers to std::map, which it mentions explicitly and which guarantees ordering.


But the entire discussion is about HAMT not std::map, the parent comment only talks about std::map as an example of performance issues in tree-based maps in asking how that relates to HAMT, it doesn't actually care a whit about std::map.


So would this be a good structure to store the recently discussed password hash dump, basically a fixed set of 300 million SHA1-sums where the only interesting operation is checking if the sum is or is not contained in the set?


It would probably be better to use a trie directly, maybe a radix trie or a b-trie.


Does this add something that the Clojure implementation doesn't?


I wish the paper carried a GitHub link to the repo of reference implementation.




Indeed, wasn't easy to track down:

https://github.com/msteindorfer/oopsla15-artifact

Still trying to find out if there's a video of the OOPSLA15 presentation.


Some commentary on a blog, that I encountered along the way: https://blog.acolyer.org/2015/11/27/hamt/


Looks like the "production" version lives at:

https://github.com/usethesource/capsule




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: