
Hash Array Mapped Tries (HAMT) to the Rescue - adlrocha
https://adlrocha.substack.com/p/adlrocha-hash-array-mapped-tries
======
omginternets
Apart from the Steindorfer and Vinju paper [0], does anybody have any
resources for implementing the CHAMP variant of HAMT tries?

Better yet, has anybody here implemented one before, and might I pick your
brain?

[0]
[https://michael.steindorfer.name/publications/oopsla15.pdf](https://michael.steindorfer.name/publications/oopsla15.pdf)

~~~
joshlemer
I have done more than enough work on the CHAMP encoded hashmap in Scala 2.13
to call myself a co-author.

~~~
omginternets
As I mentioned in another comment, I posted my parent comment
opportunistically, so I have yet to distill my reading into specific, well-
formed questions.

Is there any way I can message you when I get around to doing so? I'd really
appreciate a super-quick whiteboard session, if you have the time.

~~~
joshlemer
Sure, my email is joshlemer [at] gmail [dot] com

~~~
omginternets
Thanks! I've sent you an email.

------
huhnmonster
Can someone explain the reasoning for using a HAMT here? I have played around
with hashtables quite a bit, but although HAMTs are cool and a lot more memory
efficient, they are likely not faster and will at some point suffer from more
severe memory fragmentation than hashtables.

Is the concern speed or memory? Speed would seem strange, since after
changing to the new architecture they should see a strong decrease in
messages being exchanged anyway, right?

Or are Go's maps inefficient? Genuinely curious.

~~~
fsloth
HAMT is excellent for a persistent dictionary. Not persistent as in write-to-
disk, but as in immutable-yet-small.
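
A minimal sketch (mine, not from the thread) of what "persistent" means in this sense, using a naive full copy per update; a real HAMT gives the same semantics but shares structure between versions, so an update costs O(log n) instead of O(n):

```python
# Toy persistent update: every older version stays valid after later
# writes. A HAMT gives the same behaviour without copying the whole map.
def assoc(m, k, v):
    m2 = dict(m)   # full copy: right semantics, wrong complexity
    m2[k] = v
    return m2

m0 = {}
m1 = assoc(m0, "a", 1)
m2 = assoc(m1, "a", 2)
assert m0 == {} and m1 == {"a": 1} and m2 == {"a": 2}
```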

~~~
benibela
Btw, can it be used to build a persistent dictionary that keeps the key in
insertion-order?

~~~
masklinn
Sure. Map the HAMT to an index into a persistent vector or something, e.g.
Scala's VectorMap and TreeSeqMap. You could also reuse the old "linked list
map" idea, but thread it through the keys: whenever you add a value to the
map, add a triple of `(Some(last_key), value, None)` and update the last
key's value so that its third item becomes the key of the new item.
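
That linked-list idea could be sketched roughly like this (my sketch, with a plain dict standing in for the HAMT and copying standing in for persistence; it assumes fresh keys and no deletions):

```python
# Each entry is (prev_key, value, next_key); 'first'/'last' anchor the
# chain so the map can be walked oldest-to-newest.
def insert(entries, first, last, key, value):
    entries = dict(entries)             # copy = persistence stand-in
    entries[key] = (last, value, None)  # new entry points back at old tail
    if last is not None:
        prev, val, _ = entries[last]
        entries[last] = (prev, val, key)  # old tail now points forward
    return entries, (first if first is not None else key), key

def iter_in_order(entries, first):
    key = first
    while key is not None:
        _, value, nxt = entries[key]
        yield key, value
        key = nxt

e, f, l = {}, None, None
for k, v in [("x", 1), ("y", 2), ("z", 3)]:
    e, f, l = insert(e, f, l, k, v)
assert list(iter_in_order(e, f)) == [("x", 1), ("y", 2), ("z", 3)]
```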

~~~
benibela
>Sure. Map the HAMT to an index into a persistent vector or something e.g.
Scala's VectorMap and TreeSeqMap

I do not have a persistent vector, only a HAMT.

Perhaps the HAMT becomes a standard trie if one sets hashcode(i) = i, and then
it can be enumerated in ascending key order?

But the double lookup in two data structures is unnecessarily slow. The first
HAMT could store (Index, Value) pairs and only use the "vector" for
enumeration.

But I was hoping for a single data structure that can do lookups and ordered
enumerations.
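
For what it's worth, the (Index, Value) idea could look like this (my sketch; plain dicts stand in for the HAMT and the "vector", and deletions are ignored). Lookups touch only the main map; the side table is used only when enumerating:

```python
# kv: key -> (insertion_index, value); order: insertion_index -> key.
def insert(kv, order, key, value):
    kv, order = dict(kv), dict(order)   # copy = persistence stand-in
    idx = kv[key][0] if key in kv else len(order)
    kv[key] = (idx, value)
    order[idx] = key
    return kv, order

def lookup(kv, key):
    return kv[key][1]                   # single lookup, no side table

def enumerate_in_order(kv, order):
    for i in range(len(order)):         # contiguous because no deletions
        k = order[i]
        yield k, kv[k][1]

kv, order = {}, {}
for k, v in [("b", 2), ("a", 1), ("c", 3)]:
    kv, order = insert(kv, order, k, v)
assert lookup(kv, "a") == 1
assert list(enumerate_in_order(kv, order)) == [("b", 2), ("a", 1), ("c", 3)]
```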

> You could also reuse the old "linked list map" idea, but go through the
keys every time e.g. whenever you add a value to the map, add a triple of
`(Some(last_key), value, None)` and update the last keys value such that the
third item becomes the key of the new item.

I have been using a linked list as sole storage so far. The lookup was O(n),
which was fast enough with my data, but I wanted to improve it to be more
scalable.

But enumerating the list is in reverse, is it not? Going from the newest key
to the oldest key. I want it to go from the oldest to the newest key.

In any case it should be better to store pointers to HAMT nodes rather than
keys.

~~~
masklinn
> I do not have a persistent vector, only a HAMT.

Why not get a persistent vector?

> But the double-lookup in two data structures is unnecessary slow.

Not necessarily. You'd have 2 × log(n), and because you wouldn't be storing
the data in the HAMT, it could be denser / wider and thus fit better in
cache. At least that's what happened with Raymond Hettinger's
(non-persistent) naturally ordered dict: rather than a sparse array of
buckets, the hashmap is a sparse array of indices plus a dense array of
buckets, leading to much better memory density (even more so with an adaptive
sparse array where the stored indices can be u8, u16 or usize depending on
the number of items: most maps only need a sparse array of u8, giving even
better memory behaviour for the most problematic part).
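
A stripped-down sketch of that layout (mine; fixed table size, no resizing or deletion), to show where the density comes from: the sparse part holds only small integers, and the entries sit densely in insertion order:

```python
# Compact-dict layout: sparse index table + dense entry array.
EMPTY = -1

class CompactDict:
    def __init__(self, size=8):
        self.indices = [EMPTY] * size     # sparse: just small ints
        self.entries = []                 # dense: (hash, key, value)

    def _slot(self, key):
        i = hash(key) % len(self.indices)
        while self.indices[i] != EMPTY:
            j = self.indices[i]
            if self.entries[j][1] == key:
                return i, j               # existing key
            i = (i + 1) % len(self.indices)   # linear probing
        return i, EMPTY

    def put(self, key, value):
        i, j = self._slot(key)
        if j == EMPTY:
            self.indices[i] = len(self.entries)
            self.entries.append((hash(key), key, value))
        else:
            self.entries[j] = (hash(key), key, value)

    def get(self, key):
        _, j = self._slot(key)
        if j == EMPTY:
            raise KeyError(key)
        return self.entries[j][2]

    def items(self):                      # naturally insertion-ordered
        return [(k, v) for _, k, v in self.entries]
```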

> But enumerating the list is in reverse, is it not? Going from the newest key
> to the oldest key. I want it to go from the oldest to the newest key.

Not if you have a doubly linked list, but that increases your costs in
storage and update complexity.

> In any case it should be better to store pointers to HAMT nodes rather than
> keys.

If you're implementing the entire data structure then yes; I was rather
assuming you were on the outside with an existing HAMT.

~~~
benibela
>Why not get a persistent vector?

I have not read about them. Might take a while to implement.

I use Pascal, but it is hard to find Pascal libraries for anything: either
they do not exist at all or they have not been maintained for over a decade.

So, if I want to use a data structure, I usually need to implement it myself.

>Not necessarily. You'd have 2xlog(n) and because you wouldn't be storing the
data in the HAMT it could be denser / wider and thus better fit in cache.

If I reuse the HAMT as persistent vector, it is not so good. There is no
grouping, each node is allocated separately and could be anywhere in memory.
One lookup could be five cache misses.

>At least that's what happened with Raymond Hettinger's (non-persistent)
naturally ordered dict: rather than a sparse array of buckets, the hashmap is
a sparse array of indices and a dense array of buckets leading to much better
memory density (even more so with an adaptive sparse array where the stored
indices can be u8, u16 or usize depending on the number of items: most maps
will only need a sparse array of u8 leading to even better memory behaviour
for the most problematic part).

I use such a hashmap, too, for the insertion order (there are Pascal
libraries with hashmaps, but even then I needed some weeks to customize one).
But all the indices there are 32-bit. Perhaps I will add the adaptiveness to
it, 16-bit at least. 8-bit seems hardly worth it; at most it can save 256
bytes.

>> But enumerating the list is in reverse, is it not? Going from the newest
key to the oldest key. I want it to go from the oldest to the newest key.

>Not if you have a doubly linked list, but that increases your costs in
storage and update complexity

But a doubly linked list would not be persistent.

Arbitrarily many new nodes could point to one old node, but the old node can
only point to one of them.

------
sfink
Maybe I'm just missing it, but I don't see in this description any
explanation of how to handle hash collisions, i.e. when two keys map to the
same hash value. Do you iterate through an infinite series of secondary hash
functions when you run out of distinguishing bits? Or do you bottom out at a
table that you linearly scan?

Overall, it seems like the algorithm is "maintain a trie on the hash values,
using an auxiliary bitmap and popcnt to determine the index within any trie
node." Does that miss anything?
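
The bitmap/popcnt step described above can be sketched like this (a toy, assuming the usual 5-bit hash chunks): each node stores only its present children in a dense array, and the popcount of the bitmap bits below a child's bit gives its position. As far as I know, implementations typically answer the collision question by bottoming out in a small collision node that is scanned linearly once the hash bits are exhausted.

```python
# bitmap: one bit per possible child (32 for 5-bit chunks);
# chunk: the next 5 bits of the key's hash, in 0..31.
def child_index(bitmap, chunk):
    bit = 1 << chunk
    if not (bitmap & bit):
        return None                            # no child for this chunk
    return bin(bitmap & (bit - 1)).count("1")  # popcount of lower bits

# A node with children at chunks 3, 7 and 20:
bitmap = (1 << 3) | (1 << 7) | (1 << 20)
assert child_index(bitmap, 3) == 0
assert child_index(bitmap, 7) == 1
assert child_index(bitmap, 20) == 2
assert child_index(bitmap, 5) is None
```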

