
In-memory hash tree implementation - hit9
https://godoc.org/github.com/hit9/htree
======
songgao
> HTree is better for local locks if you want a safe container.

> Goroutine Safety.

Could somebody elaborate on these two points? Why is it a "safe container" and
how is it more "goroutine safe" than a map?

~~~
astockwell
[Edited for clarity per downstream thread]

Both refer to native maps not being safe for concurrent use (i.e. safe for use
from Go's goroutines -- goroutine-safe)[1]

[1]
[https://golang.org/doc/faq#atomic_maps](https://golang.org/doc/faq#atomic_maps)

~~~
songgao
If I understand correctly, you still need guarding locks for this if reading
and writing at same time, or have multiple writers. Isn't this the same as for
the native map?

~~~
astockwell
It's not clear from the code; the author should probably clarify what
"Goroutine safety" means - for reads only, or for reads and writes? You're
correct about the approach for native maps.

EDIT: The package description punctuation is hard to grasp, but upon a second
read, "Goroutine Safety." might be intended as a headline for the next line,
which starts with "No." Read that way, your understanding sounds correct.

------
ot
Why would this be faster or more space-efficient than a good hash table?

~~~
cmrx64
Depends on the implementation. Some algorithms[1] have a load factor no more
than 0.80 before insert/lookup performance starts to degrade significantly.
Others, like the algorithm used in Rust[2], can achieve load factors in excess
of 0.98 without breaking a sweat.

[1] [http://netjs.blogspot.com.au/2015/05/how-hashmap-internally-works-in-java.html](http://netjs.blogspot.com.au/2015/05/how-hashmap-internally-works-in-java.html)

[2] [http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/](http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/)

~~~
ot
That's why I said "good hash table". Even a simple open-addressing hash table
with quadratic probing can easily go to load factor 0.8 with reasonable
performance. With cuckoo-hashing you can get much higher and still guarantee
worst-case constant-time lookups and amortized constant-time insertions.

------
lisper

        Take 10 consecutive prime numbers:
    
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29
        And they can distinguish all uint32 numbers:
    
        2*3*5*7*11*13*17*19*23*29 > ^uint32(0)
    

Huh? How do I represent 31 (or any prime greater than 29)?

~~~
Zarel
You don't represent the primes; they're used to split the keys up and decide
where in the tree each key goes.

e.g. to represent 31, you take 31%2=1, so you use child 1 of the root node. If
the root node already has child 1, you take 31%3=1 and use child 1 of that
node. And so on.

~~~
lisper
Oh, I see. It's like a B-tree with a larger node size at each level, yes? But
then what's the advantage of basing the node size on primes? Wouldn't you get
the same advantages with simpler code by using successive powers of two at
each level instead of successive primes?

~~~
jacobolus
This is relying on the
[https://en.wikipedia.org/wiki/Chinese_remainder_theorem](https://en.wikipedia.org/wiki/Chinese_remainder_theorem)

The idea is that you will find a place in the tree for your given number at a
relatively shallow depth, because every number in the ring Z/(p1×p2×…×pn)Z has
a unique list of remainders in the separate rings Z/piZ. As long as the
product of the primes is larger than the size of your key space, then you
won’t get any collisions.

I’m not sure using consecutive primes starting from 2 is actually the best
choice though. You could use any primes you like (or for that matter any
collection of factors which are all coprime), so long as their product is more
than the largest possible key. The choice of factors should probably be
tweaked based on benchmarks in real-world use. For instance, the factors could
be 25=5², 29, 31, 32=2⁵, 33=3×11, 37.

[Excuse the lack of subscripts in the notation here; no universally installed
fonts contain a subscript n, as far as I know. Likewise I’d prefer to use ℤ
rather than Z, but I’m afraid it might show up as little boxes for some
readers.]

------
_ak
Assuming that the underlying implementations of the Item interface will use
some form of hash algorithm to determine the return value of the Key method,
how does this deal with hash collisions of two distinct items?

~~~
judofyr
It doesn't. It's not a hash, it's the actual key.

~~~
hit9
Yep.

------
jkot
HTreeMap from mapdb works in a similar way.

~~~
hit9
This tree was implemented as the container indexing record-position
information for a disk-based storage engine built on the bitcask model.

~~~
hit9
And memory is expensive in that case.

------
hit9
And the github project is
[https://github.com/hit9/htree](https://github.com/hit9/htree)

------
jws
I wonder: Why primes? Why not just crack off 4 bits at each level?

~~~
doomrobo
I think if you do it this way you can extend the tree without recalculating
all the entries. You can just add the next prime to the list of primes.

~~~
ball_of_lint
In that case you could just use a binary tree taking from lsb to msb, and then
expand down a level only on collisions.

The reason for primes is that they are more likely to distinguish between
numbers earlier in the tree, so it doesn't get as deep as fast. Where a binary
tree takes eight identical branches before it can tell 0 and 256 apart, this
can tell them apart at the second level: 0 mod 3 = 0, 256 mod 3 = 1

------
amelius
Is it possible to implement this as an immutable data structure? Or does the
duplication required for each insertion make it too inefficient to be
practical?

~~~
jasonkostempski
Would that be a HAMT?
[https://en.m.wikipedia.org/wiki/Hash_array_mapped_trie](https://en.m.wikipedia.org/wiki/Hash_array_mapped_trie)

~~~
brudgers
Bagwell's paper:
[http://infoscience.epfl.ch/record/64398/files/idealhashtrees...](http://infoscience.epfl.ch/record/64398/files/idealhashtrees.pdf)

------
freefrag
What's the advantage of using wider trees at every level? Asymptotically
wouldn't you get the same behaviour with a binary tree?

~~~
hit9
Advantage: a constant number of levels, with better space utilization.

A binary tree searched with the binary-search strategy has time complexity
O(log N), which is higher than htree's.

This htree is mainly for memory-bounded cases.

------
hsnewman
I love golang!

------
sdegutis
ELI5 why this is cool please?

~~~
dkopi
(I'm assuming you're a 5 year old who's at least been through a course on
algorithms and data-structures):

It's vaguely stated in the abstract: "Hash-Tree is a key-value multi-tree with
fast indexing performance and high space utilization."

The use cases of hash trees are somewhat similar to hash tables (a map where
you can quickly look up a key and get its value), but you'd use a hash tree
when you're willing to sacrifice a bit of performance for better memory usage.

"Although hashtable is very fast with O(1) time complexity, but there is
always about ~25% table entries are unused, because the hash-table load factor
is .75. And this htree is suitable for memory-bounded cases."

~~~
jbapple
> "Although hashtable is very fast with O(1) time complexity, but there is
> always about ~25% table entries are unused, because the hash-table load
> factor is .75. And this htree is suitable for memory-bounded cases."

Do you know how htree achieves this space efficiency? I ask because it looks to me
like every node has an array of children, like a standard tree. Doesn't the
cost of storing one additional pointer for each key add up to at least 25%
space wasted?

~~~
hit9
In Go, slices grow automatically, so the children slice allocates more
capacity than its length; exactly how much depends on the runtime's growth
policy.

> "Child nodes are stored in an array orderly, and checked by binary-search
> but not indexed by remainders, array indexing will result redundancy
> entries, this is for less memory usage, with a bit performance loss. "

I just can't find a way (a linked list?) to use exactly the required space
while keeping at least binary-search time complexity.

~~~
jbapple
What is the actual space overhead of this structure, and how did you measure
it?

