
Beating hash tables with trees? The ART-ful radix trie - HenryR
http://the-paper-trail.org/art-index/
======
jamii
My experience has been that the vast majority of papers on data-structures are
at best misleading, and at worst deliberately biased.

For example:

> The hash table used by the authors of ART in their study was a chained hash
> table, but this kind of hash tables can be suboptimal in terms of space and
> performance due to their potentially high use of pointers.

> Our experiments strongly indicate that neither ART nor Judy are competitive
> to the aforementioned hashing schemes in terms of performance, and, in the
> case of ART, sometimes not even in terms of space.

[https://www.victoralvarez.net/papers/A%20Comparison%20of%20A...](https://www.victoralvarez.net/papers/A%20Comparison%20of%20Adaptive%20Radix%20Trees%20and%20Hash%20Tables%20-%20ICDE%202015.pdf)

~~~
leiroigh
It is unsurprising that a structure designed for range queries loses out
against hashes for random point look-ups.

Thanks for posting that addition, and very nice link. It is completely unclear
why authors in this field even attempt to compete with hash tables for random
point queries: trees (whether comparison-based or radix) only make sense for
range queries, or when faced with a query distribution that prefers locality.

Something I'd like to see is benchmarks of batch-query performance: if we can
queue up a couple of queries, then trees should gain a lot from processing them
in order (each query warms the cache for the next; this can probably pay for
approximately sorting the batch).

~~~
bradleyjg
> or when faced with a query distribution that prefers locality

Isn't this a fairly common pattern? How would we go about quantifying the
preferring locality-ness of a query distribution?

~~~
whatshisface
> _How would we go about quantifying the preferring locality-ness of a query
> distribution?_

For each level of the tree, treat the incoming stream of queries as a Markov
process where each state is a query that involves a certain node. So, if I
have 7 nodes on level 2, I can build up a table of transition probabilities
between vertices like "query involved node 3 on level 2" and "query involved
node 7 on level 2." When the transitions between these vertices and themselves
have high probability, the queries prefer locality. You can see which scale
the locality is preferred on by doing this at each level of the tree.
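
A rough sketch of that metric (all names are mine, not from the comment): collapse the per-level transition matrix to its diagonal mass, i.e. the empirical probability that consecutive queries touch the same node at that level:

```c
#include <stddef.h>

/* Estimate the "locality score" of a query stream at one tree level as the
 * empirical probability that consecutive queries touch the same node --
 * the total diagonal mass of the Markov transition matrix described above.
 * node_ids[i] is the node that query i touched at this level. */
double locality_score(const int *node_ids, size_t n_queries)
{
    if (n_queries < 2)
        return 0.0;
    size_t same = 0;
    for (size_t i = 1; i < n_queries; i++)
        if (node_ids[i] == node_ids[i - 1])
            same++;
    return (double)same / (double)(n_queries - 1);
}
```

Running this at every level gives the scale at which locality is preferred, as the comment suggests.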

------
acidx
Nice article and analysis! I'm actually considering replacing the trie used in
my project with something based on this one, with some modifications:

For instance, find_node(c, Node48) could avoid the branch if non-existent
indices pointed to an additional entry in child_ptrs that is always NULL.
Lookup would then be comparable to the Node256 version.
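
A sketch of that branchless lookup, with a guessed Node48-style layout (the field names index/child_ptrs follow the comment above; the exact layout is an assumption, not the article's code):

```c
#include <stdint.h>
#include <stddef.h>

/* Node48-style node: index[c] maps a key byte to a slot in child_ptrs.
 * Instead of an "empty" marker that needs a branch, absent bytes map to
 * slot 48 -- one extra entry that is permanently NULL -- so the lookup is
 * two dependent loads with no test, comparable to the Node256 version. */
typedef struct node48 {
    uint8_t index[256];      /* absent bytes point at slot 48 */
    void   *child_ptrs[49];  /* child_ptrs[48] is always NULL */
} node48;

static inline void *node48_find(const node48 *n, uint8_t c)
{
    return n->child_ptrs[n->index[c]];  /* no existence branch */
}
```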

Another thing that could be done is to scrap the Node48 entirely and implement
two new structs to replace it, Node32 and Node64, using AVX2 and AVX-512
respectively. These can be based on the Node16 version. It remains to be seen
whether these would yield better performance than the branchless Node48 above,
especially since power management (frequency throttling) can kick in when
mixing AVX-512 with older SIMD generations.

The trie in Lwan ([https://lwan.ws](https://lwan.ws)) does an interesting
trick to reduce the amount of memory used in the equivalent of a Node256:
instead of 256 pointers to a node, it has only 8 pointers. Characters are
hashed (MOD 8). The leaf node contains a linked list of key/value pairs, and an
actual string comparison is performed at the end. (Lwan cheats here by
avoiding the string comparison if the linked list contains only 1 element.)
It works pretty well as part of the URL routing mechanism.

One other experiment I've been doing with tries is to take the idea of key
compression and use it in a different way: slice the key every 4 or 8 bytes,
treat those bytes as an arbitrary integer, and add every chunk to a
hashmap<int, some_struct>, building a chain to the next lookup in
some_struct. The prototype I wrote works pretty well.

~~~
jules
Another standard trick to reduce memory is to store a bitfield telling you
which pointers are non-null, and then store an array of only those pointers.
For example, for a Node64 you store a 64-bit bitfield plus only the non-null
pointers, so if that node has only 3 children you store 64 bits plus 3
pointers instead of 64 pointers. You can index that structure with a
shift + popcount: to find the pointer at index n, first check the n-th bit in
the bitfield to see if the pointer is null; if it isn't, count the number of
set bits in bitfield[0..n) to find the index into the pointer array.
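
A minimal sketch of that lookup (my code, assuming the GCC/Clang `__builtin_popcountll` intrinsic):

```c
#include <stdint.h>
#include <stddef.h>

/* Bitmap-compressed node: a 64-bit bitfield marks which of the 64 logical
 * slots are occupied, and only the occupied pointers are stored, densely
 * packed.  The packed position of slot n is the number of set bits below n. */
static inline void *bitmap_node_get(uint64_t bitfield, void *const *packed,
                                    unsigned n)
{
    if (!((bitfield >> n) & 1))
        return NULL;                                  /* slot n is empty */
    uint64_t below = bitfield & ((1ULL << n) - 1);    /* bits strictly below n */
    return packed[__builtin_popcountll(below)];
}
```

With 3 children this stores 8 bytes of bitfield plus 3 pointers instead of 64 pointers, exactly as described above.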

------
faragon
A good point for both RB trees and open-addressing hash tables is that they
can be implemented on top of vectors ([1], [2]), allowing an initial
reservation for N elements, so with a careful implementation you can even
have the data structure with one or zero allocations (e.g. allocate the tree
or the hash table on the stack). For tries you could use multiple memory pools
for the different node sizes, apply path compression, and even a LUT
accelerator for reaching the Nth level, but they can hardly be implemented
using a vector.

[1]
[https://github.com/faragon/libsrt/blob/master/src/saux/stree...](https://github.com/faragon/libsrt/blob/master/src/saux/stree.c)

[2] I'm implementing a key-value hash table, the "srt_hmap" type, that will be
added to the same library as [1], in one contiguous allocation. That makes it
possible to use hash tables allocated both on the heap and on the stack (e.g.
you could use an int32-int32 hash table allocated on the stack for computing
the color frequency of a bitmap image). The hash table's performance is 4 to
5x that of the RB trees, including the cost of rehashing (the rehash
implementation uses techniques that avoid moving all the data; rehashing is
only available in the heap case).

~~~
jstimpfle
You can implement most things with dynamically allocated vectors: just use
indices instead of pointers to link the elements. This can also bring an
advantage in space efficiency if you can get by with 4-byte indices instead
of 8-byte pointers.
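
For illustration (the pool layout and names are hypothetical), linking through 4-byte indices into a backing vector instead of 8-byte pointers:

```c
#include <stdint.h>

#define IDX_NIL UINT32_MAX   /* plays the role of NULL */

/* A list node that lives in a vector ("pool") and links to its successor by
 * index rather than by pointer: 8 bytes per node instead of 16. */
typedef struct {
    int32_t  value;
    uint32_t next;   /* index of the next node in the pool, or IDX_NIL */
} pool_node;

/* Walk a chain starting at `head`, chasing indices instead of pointers. */
static int64_t chain_sum(const pool_node *pool, uint32_t head)
{
    int64_t sum = 0;
    for (uint32_t i = head; i != IDX_NIL; i = pool[i].next)
        sum += pool[i].value;
    return sum;
}
```

A side benefit is that the whole structure can be saved, moved, or reallocated wholesale, since the links stay valid when the base address changes.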

~~~
saagarjha
You can consider the entire virtual memory space to be a big vector, where a
pointer is just an index into it!

~~~
jstimpfle
However, that's a pretty poor vector. It's not homogeneous (you put things of
different shapes inside it), which also implies you need sophisticated memory
management, leading to further overheads. And you cannot meaningfully iterate
over it.

~~~
saagarjha
I was mostly joking…

~~~
jstimpfle
Yes and I'm serious :-)

------
namibj
Just a friendly reminder that B-trees are often faster on modern
microprocessors than RB-trees. See kbtree.h for a simple, yet fast example. I
didn't test it, but I'd assume B-tries would be rather efficient.

~~~
jules
Red-Black trees are isomorphic to B-trees of node size 4. If you take a Red-
Black tree and put each black node together with all its red children into a
single node then you get a B-tree of node size 4. An insertion/deletion
algorithm for Red-Black trees gives a corresponding insertion/deletion
algorithm for B-trees of size 4 and vice versa. Putting those nodes together
in a single node probably improves performance already because you have fewer
pointers, and you can improve performance further by increasing the size of
the B-tree nodes to, say, 32 instead of 4.

~~~
saagarjha
They are isomorphic, but B-Trees are more cache friendly. B-Trees store more
in each node, while red-black trees require pointer chasing for each element.

~~~
jules
That's what I said.

------
marknadal
Radix trees are one of the most underutilized data structures.

They are great and have fantastic performance!

I implemented a custom on-disk storage engine with a Radix format, and on a
low-end MacBook Air 2015 I'm getting about 3K acked writes to disk per second!
It is now the default at
[https://github.com/amark/gun](https://github.com/amark/gun); the code is
pretty short too.

------
repsilat
> _This is superior to binary-search: no branches (except for the test when
> bitfield is 0), and all the comparisons are done in parallel_

Branchless binary search isn't hard to implement if you know (or can bound)
the number of elements statically. You just use the comparison result
arithmetically instead of branching on it, and you unroll the loop.

Obviously a binary search can't do comparisons in parallel, though.
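
A sketch of such an unrolled, branchless search (my code, assuming the element count is statically fixed at 16): each step adds the comparison result arithmetically to a base offset, so compilers typically emit compare + arithmetic with no conditional jumps:

```c
#include <stddef.h>

/* Branchless lower-bound search over a sorted array of exactly 16 ints:
 * returns the index of the first element >= key (16 if none).  Each step
 * uses the comparison result (0 or 1) arithmetically instead of branching,
 * and the loop is fully unrolled into log2(16) = 4 steps plus a final fixup. */
static size_t lower_bound16(const int a[16], int key)
{
    size_t base = 0;
    base += (size_t)(a[base + 8] < key) * 8;
    base += (size_t)(a[base + 4] < key) * 4;
    base += (size_t)(a[base + 2] < key) * 2;
    base += (size_t)(a[base + 1] < key) * 1;
    return base + (size_t)(a[base] < key);
}
```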

~~~
BeeOnRope
> Obviously a binary search can't do comparisons in parallel, though.

Sure you can, although at reduced efficiency the "deeper" you go.

For example, while searching a range of size N, you know your next probe point
will be at N/2. You also know the probe point _after_ that will be at N/4 or
3N/4, and so on. You know up-front all the possible probe points. Of course,
only one of the two probe points N/4 and 3N/4 will actually be useful,
depending on the N/2 result - but that doesn't stop you from comparing all
three in parallel.

You can get a reasonable speedup this way: for a moderate depth, the cost of
the unnecessary probes is more than compensated for by doing the comparisons
in parallel.
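
One way to sketch this (my construction, not necessarily the parent's): do two halving steps per round, loading the midpoint and both candidate quarter points up front so the three comparisons are independent and can overlap in the pipeline; whichever quarter point the midpoint result rules out is simply discarded:

```c
#include <stddef.h>

/* Lower-bound search that probes three points per round: mid, plus the two
 * possible next mids (q1 if we descend left, q3 if we descend right).  The
 * three loads/compares have no data dependence on each other, so the CPU
 * can execute them in parallel; one quarter-point result is always wasted. */
static size_t two_probe_lower_bound(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;   /* search [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        size_t q1  = lo + (mid - lo) / 2;           /* next mid if key <= a[mid] */
        size_t q3  = mid + 1 + (hi - mid - 1) / 2;  /* next mid if key >  a[mid] */
        int cm = a[mid] < key;
        int c1 = a[q1] < key;                       /* q1 <= mid, always in range */
        int c3 = q3 < hi ? a[q3] < key : 0;
        if (cm) {               /* descend right, then consume the q3 probe */
            lo = mid + 1;
            if (lo < hi) { if (c3) lo = q3 + 1; else hi = q3; }
        } else {                /* descend left, then consume the q1 probe */
            hi = mid;
            if (lo < hi) { if (c1) lo = q1 + 1; else hi = q1; }
        }
    }
    return lo;   /* index of the first element >= key */
}
```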

------
laxk
If somebody needs a good, fast, well-optimized Go implementation of ART,
check this out: [https://github.com/plar/go-adaptive-radix-
tree](https://github.com/plar/go-adaptive-radix-tree) (disclosure: I'm the
author)

------
saagarjha
I had to implement a trie for an Aho-Corasick implementation a while back, and
I just used a std::unordered_map<unichar, std::unique_ptr<trie_node>> to store
the children (this was Objective-C++, so I was using UTF-16 characters taken
from an NSString). It worked well enough for the effort I put into it.

~~~
rcfox
So you ended up using a tree to hold your tree nodes. I guess that was good
enough for your purpose, but the article is discussing an optimized
implementation.

~~~
saagarjha
std::unordered_map is generally a hash table, is it not?

~~~
rcfox
Sorry, I switched set and unordered set in my mind. Still, a generic hash
table isn't going to match a tailored data structure.

~~~
saagarjha
I mean, it might. I only had to construct one trie, and I used it hundreds of
millions of times, so O(1) child (hash) lookup ended up being faster than the
O(log(n)) binary search lookup, at best, from the data structure described in
the article. And since the data structure in the article wastes some space as
well, I may have even used a similar amount of storage.

