Grokking Hash Array Mapped Tries (HAMTs) (photonlines.substack.com)
57 points by photon_lines on Aug 24, 2023 | hide | past | favorite | 6 comments



This is one of those data structures with a misleading name. Throw out the hashing part. It's not part of the data structure, it's just one use case. When you do that, you discover that this structure is something between a trie and a B+ tree. It's a persistent data structure for indexing fixed-size bit patterns to arbitrary values, with node sizes that aim for a compromise between update (copy) speed and lookup speed (number of pointers chased before reaching the value).

It's cool, but don't tell yourself it's only good for hashing.


> don't tell yourself it's only good for hashing.

The hash function[1] takes an arbitrary input and generates (in this case) a 32-bit integer. This is necessary to index keys that are not integers (such as byte arrays or strings). It also doubles as a way to deterministically randomize keys across the keyspace for better distribution.

[1] https://github.com/mkirchner/hamt/blob/main/src/hamt.c#L377


While it's true that the hashing isn't intrinsic to the mapping of key (bits) to value, hashing is what ensures that the tree stays balanced.

You can absolutely use whatever bits you want, but patterns in the keys open you up to potential loss of performance, as the tree will get unnecessarily deep.


Nitpick: there's nothing in the data structure that limits it to fixed-size bit patterns; you could easily have implementations which accept arbitrarily long bit patterns.


This isn't a very good explanation. The wikipedia article isn't great either. I like this description:

https://github.com/mkirchner/hamt#persistent-hash-array-mapp...

The name does tell you quite a bit about what these are:

* Hash - rather than directly using the keys to navigate the structure, the keys are hashed, and the hashes are used for navigation. This turns potentially long, poorly-distributed keys into short, well-distributed keys. However, that does mean you have to compute a hash on every access, and have to deal with hash collisions. The mkirchner implementation above calls collisions "hash exhaustion", and deals with them using some generational hashing scheme. I think I'd fall back to collision lists until that was conclusively proven to be too slow.

* Trie - the tree is navigated by indexing nodes using chunks of the (hash of the) key, rather than comparing the keys in the node

* Array mapped - sparse nodes are compressed, using a bitmap to indicate which logical slots are occupied, and then only storing those. The bitmaps live in the parent node, rather than the node itself, I think? Presumably helps with fetching.

A HAMT contains a lot of small nodes. If every entry is a bitmap plus a pointer, then it's two words, and if we use five-bit chunks, then each node can have up to 32 entries, but I would imagine the majority are small, so a typical node might be 64 bytes. I worry that doing a malloc for each one would end up with a lot of overhead. Are HAMTs often implemented with some more custom memory management? Can you allocate a big block and then carve it up?

Could you do a slightly relaxed HAMT where nodes are not always fully compact, but sized to the smallest suitable power of two entries? That might let you use some sort of buddy allocation scheme. It would also let you insert and delete without having to reallocate the node. Although I suppose you can already do that by mapping a few empty slots.


I wrote about these too: https://vaibhavsagar.com/blog/2018/07/29/hamts-from-scratch/! Unfortunately it's in Haskell, which probably makes it less approachable than it would be otherwise.




