
To Trie or not to Trie – a comparison of efficient data structures - fogus
http://bhavin.directi.com/to-trie-or-not-to-trie-a-comparison-of-efficient-data-structures/
======
Robin_Message
This is a nice write-up of the available sensible trie implementations -
thanks Bhavin!

One thing I'd add is that if you aren't exploiting the ordered nature of the
keys within a trie - i.e. if your keys have no underlying substructure or you
aren't doing linear scans through the key space - then you don't want a trie
at all and you'd be better off with a hash table.

A hash table will be faster and use less memory if your keys are unstructured
(imagine for a minute choosing random integers for your keys - then mod N is a
good hash function for those keys, so it is obvious a hash table will work
well for them, whereas a trie will waste space trying to find common prefixes
between random keys.)
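To make the random-keys point concrete, here's a toy sketch (the bucket count and key range are arbitrary choices of mine): for uniformly random keys, `key % N` already spreads entries almost perfectly evenly, so a hash table gets near-ideal bucket occupancy for free.

```python
import random

random.seed(1)
N = 1024
keys = [random.getrandbits(32) for _ in range(10_000)]

# Bucket each key by key % N, the "free" hash function for random keys.
buckets = [0] * N
for k in keys:
    buckets[k % N] += 1

avg = len(keys) / N
print(max(buckets), avg)  # max bucket load stays close to the average
```

A trie over the same keys would branch almost immediately at every level, since random keys share essentially no common prefixes.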

If you are accessing completely randomly, then the cache utilisation will be
similar between a good trie and a hash table. If you are accessing some things
more than others, then you should use a move-to-front hash table to be
competitive.
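One way to read "move-to-front hash table" here is a chained table that bumps a found entry to the head of its bucket's chain, so hot keys are found after one or two probes. A toy sketch (the class and method names are mine, not from any particular library):

```python
class MTFHashTable:
    """Chained hash table that moves a hit to the front of its chain."""

    def __init__(self, nbuckets=64):
        self.buckets = [[] for _ in range(nbuckets)]

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.insert(0, (key, value))

    def get(self, key):
        chain = self._chain(key)
        for i, (k, v) in enumerate(chain):
            if k == key:
                # Move-to-front: frequently accessed keys migrate toward
                # the head of the chain, so hot lookups stay short.
                chain.insert(0, chain.pop(i))
                return v
        return None
```

Under a skewed access pattern the popular keys cluster at the chain heads, which is the cache-friendliness argument above.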

~~~
geocar
> so it is obvious a hash table will work well for them, whereas a trie will
> waste space trying to find common prefixes between random keys.)

It is obvious, but it is wrong. I've observed Judy beating out many hash
tables in exactly this situation, and comparing favourably to Google's
sparsehash. Given its much smaller code size and fewer dependencies, it has
proven a huge win at my organization.

Link to my benchmark:
[http://reddit.com/r/programming/comments/bhxwh/hash_table_be...](http://reddit.com/r/programming/comments/bhxwh/hash_table_benchmarks_google_sparsedense_hash_map/c0mw7hm)

~~~
Robin_Message
Benchmarking beats random ideas, no doubt about that! Thanks for pointing out
my mistake with integers. However, I would say Judy doesn't look so hot for
string keys - slower at most things and similar memory requirements.

I'd be interested to see 10 million _64-bit_ integers in Judy and DenseHash
though. Since ten million is 0.2% of 2^32, in some sense lots of the potential
key space is actually in use, so I'd expect a lot of sharing to be possible
and a sparse trie to perform well.

Is there a 64-bit integer key Judy array you could test on?

~~~
geocar
> Is there a 64-bit integer key Judy array you could test on?

Simply using the 64-bit integer as an 8-byte key is what I use.

Treating the 32-bit integer as 4-byte keys performs similarly to using the
integer-API directly.
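The trick being described, serializing a fixed-width integer into a byte-string key for a byte-oriented trie, can be sketched like this (a Python illustration of the idea, not Judy's actual C API): packing big-endian makes byte-wise lexicographic order agree with numeric order, which is what keeps consecutive integers sharing long prefixes.

```python
import struct

def int64_key(x: int) -> bytes:
    # Big-endian packing preserves numeric order under byte-wise
    # comparison, which is the order a byte-oriented trie sees.
    return struct.pack(">Q", x)

# Nearby integers share long prefixes, so the trie compresses well.
print(int64_key(1000).hex(), int64_key(1001).hex())
```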

> I'd be interested to see 10 million 64-bit integers in Judy and DenseHash
> though

It's easy enough to check :)

------
aidenn0
For me the power of a trie over a hash table has always been the ease of
implementation. Hash tables have a lot of corner cases that tries don't have.
I can write an entire trie in the time it would take me to just write the part
of a hash table that deals with collisions. Never mind resizing or rehashing!
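For a sense of scale, a minimal dict-of-dicts trie (a sketch, not a tuned implementation; the `"$"` end-of-key marker is an arbitrary convention of mine) really is only a few lines:

```python
def trie_insert(root, key):
    node = root
    for ch in key:
        # Walk down, creating a child dict per character as needed.
        node = node.setdefault(ch, {})
    node["$"] = True  # marks that a complete key ends here

def trie_contains(root, key):
    node = root
    for ch in key:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node
```

No collision handling, no resizing, no rehashing, which is exactly the point being made.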

This is why Judy arrays seemed a bit odd to me. I feel that if I put as much
effort into making a super-fast hash table as the author did into making a
super-fast trie, it would perform as well or better. That's just a
gut instinct though, and I don't really have any data to back it up.

I suppose an exception could be when the median key length is very short:
tries tend to be better there, since a secure hash function could take longer
than a very small number of memory lookups (especially since some of them
will be cached).

------
tyler
Two useful implementations that this article misses are Array-compacted tries
(aka dual-array tries) and Array-mapped tries. ACTs are nice due to the small
memory footprint and good cache-locality, but have the downside of being
difficult and time-consuming to construct. AMTs have a bit-array for
determining existence of children and a sorted array of child-pointers in each
node and are surprisingly fast at both insertion and search. Both have been
described elsewhere, but I like the explanations by Phil Bagwell in his paper,
"Fast And Space Efficient Trie Searches" (2000).
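The AMT child lookup described above can be sketched as follows (a toy node assuming byte-valued edges; Bagwell's actual layout packs the bitmap and children far more tightly): a popcount of the bitmap bits below a symbol gives that symbol's index into the dense child array.

```python
class AMTNode:
    """Array-mapped trie node: a 256-bit bitmap marks which child
    symbols exist; children are stored densely in symbol order."""

    def __init__(self):
        self.bitmap = 0
        self.children = []

    def _rank(self, symbol):
        # Popcount of bits below `symbol` = index into the child array.
        return bin(self.bitmap & ((1 << symbol) - 1)).count("1")

    def get(self, symbol):
        if not (self.bitmap >> symbol) & 1:
            return None
        return self.children[self._rank(symbol)]

    def put(self, symbol, child):
        idx = self._rank(symbol)
        if (self.bitmap >> symbol) & 1:
            self.children[idx] = child
        else:
            self.children.insert(idx, child)
            self.bitmap |= 1 << symbol
```

Because absent children cost only their bitmap bit, sparse nodes stay small while lookup remains a bit test plus a popcount.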

------
sunir
Perhaps a dumb question, but has anyone considered using a trie to manage
search query auto-suggest, such as Google AutoSuggest?

It seems like it could be more efficient as you could send to the browser the
node ID of the current buffer and then when the next key was pressed, use the
current ID to efficiently look forward just one level.
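The node-ID scheme being proposed might look something like this sketch (the ID protocol and names are mine, invented to illustrate the idea): the server hands the client a stable node ID, and each keystroke advances one level from that node instead of re-walking the whole prefix.

```python
class SuggestTrie:
    """Trie whose nodes have stable integer IDs, so a client can
    resume descent from the node it was last told about."""

    def __init__(self):
        self.children = [{}]  # node id -> {char: child node id}
        self.words = [[]]     # node id -> completions ending here

    def insert(self, word):
        nid = 0
        for ch in word:
            child = self.children[nid].get(ch)
            if child is None:
                child = len(self.children)
                self.children.append({})
                self.words.append([])
                self.children[nid][ch] = child
            nid = child
        self.words[nid].append(word)

    def step(self, nid, ch):
        # One keystroke: advance a single level from the given node ID.
        return self.children[nid].get(ch)

    def suggestions(self, nid, limit=5):
        # Collect completions reachable below this node.
        out, stack = [], [nid]
        while stack and len(out) < limit:
            n = stack.pop()
            out.extend(self.words[n])
            stack.extend(self.children[n].values())
        return out[:limit]
```

Each keystroke then costs one `step` call rather than a fresh O(q) prefix walk.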

~~~
whakojacko
Not a dumb question at all.

I'm working on a project to do this right now, in fact, since it seems all the
open-source autocomplete implementations I can find just come down to doing a
"SELECT bla where bla like bla", which is pretty slow. The one major issue is
that tries are prefix-based, so a match has to start from the beginning of
the query. My sort-of solution around that is to break up the terms you're
matching against into groups of phrases and allow prefix matching on any of
those phrases.

As a question to HN: I was planning on using radix trees, but now I've read
this article it seems like there might be some other good options. Are any of
those structures noticeably better for storing lots of strings?

~~~
sunir
I thought that most of the time since users are typing one character at a time
that would be efficient. Even pasting at the end of the string is easy to
manage as it might only be O(q) where q is the length of the query string
(depending on the structure of the underlying trie).

The worst cases are editing midstring and, worse still, pasting to the start
of the query string. Still, those are O(q) rather than O(N), where N is the
total number of strings in the database. In my case, N will be in the
millions of records while q will rarely exceed 250 characters.

~~~
whakojacko
The problem is not efficiency, it's relevance. Say you have a result "hacker
news". If it's a pure prefix trie, this won't match the query "news", even
though you probably want to return it as a suggestion. My hack around this
would be to add both "hacker news" and "news" to the trie. It's still not
perfect, since it wouldn't match "acker" or "ws" etc.
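The hack described above amounts to indexing every word-boundary suffix of each entry. A sketch of the idea, using a plain dict as a stand-in for the trie (the function names are mine):

```python
def word_suffixes(entry):
    # "hacker news" -> ["hacker news", "news"]: every word-boundary
    # suffix becomes a key, all pointing back at the full entry.
    words = entry.split()
    return [" ".join(words[i:]) for i in range(len(words))]

def index_entry(index, entry):
    for key in word_suffixes(entry):
        index.setdefault(key, []).append(entry)

def prefix_matches(index, query):
    # Stand-in for a trie prefix walk over the indexed keys.
    return sorted({e for k, entries in index.items()
                   if k.startswith(query) for e in entries})
```

As noted, substrings that don't start at a word boundary ("acker", "ws") still miss, since only whole-word suffixes are indexed.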

------
arethuza
That's a nice review - I'm toying with the idea of trying to write a simple
hierarchical OLAP database (like Essbase or Hyperion HFM) and tries seem like
a potentially interesting starting point.

