

Trees, Hash Tables and Tries - skorks
http://benlynn.blogspot.com/2007/04/trees-hash-tables-and-tries.html

======
AlisdairO
In the real world, it's unlikely that a trie of any real size will be
comparably performant to a hash table with a sane hash function: trie
algorithms involve moving from node to node, each of which will require an
unpredictable retrieval from non-sequential memory. Memory latency is pretty
high these days (~200 cycles), so hash tables will almost always be quicker.

This is not to say that radix tries aren't extremely useful - they're compact
and fast - but they're unlikely to match a hash table for raw performance.

~~~
elblanco
I've done a bit of work with tries and hash functions and found that in-memory
tries are almost universally faster than in-memory hashes, no matter how good
the hash function is.

The difference, though, is on average around 20-30% in the trie's favor when both
implementations are well optimized. But tries come with lots of downsides that
need to be managed carefully: they explode in memory very quickly, requiring all
kinds of memory optimizations on top of the naive trie to keep the tree depth
short (usually with a small hash on the leaves). I remember building an in-memory
trie library once in Perl that outperformed a native C hash library by 70%.
The development team's jaws dropped when my code was running, they were
insisting it wasn't actually doing anything at all and I was faking the
output. "There is no way Perl code can outperform our C." It was an
interesting learning experience all around. Suffix trees likewise are insane
for full text searching.

Hashes, on the other hand, have a few benefits: they are more compact in memory,
they have fast lookups on average (though if you constrain the hash table's
growth algorithm too much they can spend a lot of time on both collisions and
table growth operations), and they have many, many more high quality
implementations in most standard libraries, including some very good and very
fast on-disk hashes that only cost another few percentage points (single
digits, usually) in speed.

The problem with tries is not speed; there really isn't anything faster (in
memory) that I'm aware of for lookup and retrieval problems. But tries are like
sports cars: really fast, with lots of downsides. Hashes are like an upper-end,
practical family car.

If you try and put the trie on disk, you end up with all sorts of problems
that you don't really see in modern on-disk hash implementations. Coming off
of disk, hashes tend to be the better option (in fact many of the on-disk trie
interfaces I've seen actually just use hashes for the on-disk portion and
optimize for tree depth in memory to keep things small).

So it depends, if the set of things you want to use for lookups is relatively
small and you can cram it all into memory in a raw trie, it'll be a bit
faster. If the set is big, and you want to use it off of disk, go with a hash.

_Also, if the values you wish to store don't map well to a trie, like
floating point numbers, you pretty much have to go with a hash. Fortunately,
the hash function in that case is really easy ;)_
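
A minimal sketch of that "really easy" hash, assuming IEEE-754 doubles and a
64-bit hash value (the mixing constant is an arbitrary illustrative choice, not
something from this thread):

    #include <stdint.h>
    #include <string.h>

    /* Hash a double by reusing its bit pattern; no per-character loop needed.
       (This sketch ignores NaN payloads and -0.0 vs +0.0.) */
    static uint64_t hash_double(double d) {
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);       /* reinterpret the IEEE-754 bits */
        return bits * 0x9E3779B97F4A7C15ULL;  /* scramble with a mixing constant */
    }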

~~~
AlisdairO
That's really interesting - but it's hard for me to understand. What makes
them faster for lookup than a hash table with a really good, fast hash
function (say, murmurhash)? Does it depend on factors like string length?
Basically, what's the environment?

My assumption is that it's faster because you might end up performing fewer
string comparisons with the trie - if you have strings with common
prefixes/sizes, this could cost you time on a hash with collisions. Although
it seems like unless you have a lot of strings with common prefixes/sizes,
this still ought to be pretty fast. And if you do have a lot of similar
strings, you could store the hash value as well to speed this up.

My work on cache-sensitive data structures is in the integer space, so perhaps
my ignorance when it comes to working with strings is showing :-).

~~~
elblanco
Bingo, string comparisons will eat you alive. Even if you have a very small
percentage of collisions in your hash (even close to 0%), you _still_ have to
check whether you have a collision, to determine whether you need to iterate to
the next slot in the hash, no matter what. If you are hashing strings, that
means comparing the entire string; each character comparison is guaranteed to
be at least one integer sub operation for 1-byte-wide character sets and two
for Unicode.

But basically, tries compute almost as fast as just reading the string, while
hashes have to read the string a few times AND compute the function AND deal
with collisions (AND resize the hash which is stupid expensive).

So to either insert or lookup in a hash you have to:

1) Compute the hash from the string, which naively involves reading the entire
string and computing the hash code. So for each character you have to move at
least one or two bytes from memory into a register, do something with them
(likely some kind of integer op), store the result, then move the next byte or
two in, do an operation with those, and so on. Let's not forget that I also
have to check for end of string, which costs at least one sub operation per
character for C-style strings.

Otherwise you keep a strlen counter in a register somewhere and do sub
operations against that; either way it's an extra op per character.

1b) If you are smart, you restrict the hash computation to some smaller prefix
of the string that will still do a pretty good job of assuring a reasonable
level of uniqueness in the hash, but you can't ever guarantee it (uniqueness
is the principal problem of hashing functions).

2) But still, let's assume I'm dealing with a nice small 16-character string
(n=16). Per character I have to do quite a bit more than just read it: at
least 1 mov, 1 sub (to check for EOS), then a few operations for the hash
itself and probably some copies back into memory, so let's be generous and say
8 ops per character. So if n=16, at _least_ 128 ops simply to compute the hash.

3) Then we have to do some pointer arithmetic or some table lookups or
whatever to get to the memory location and store the string. So in all we're
talking about 500-1000 ops to insert a 16 character string without even
checking for collisions.

4) Factor that in, assume no collision, we might be on the order of 1200
executed ops in the end. If we have a collision, a couple thousand or more.

So for n=16, ops=~1200 minimum.

5) Even when there isn't a collision, I have to do a linear string compare
against the key stored in that slot to assure myself that there wasn't one.
So we have to start moving integers into registers and doing a bunch of subs
again (and checking for EOS on both strings now). This can easily introduce
another 50-100 ops just to confirm there was no collision.

Lookups on the hash are also of a similar scale. Heaven help you if you have
to expand the hash and move all that crap from one place to another.
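
To make the steps above concrete, here is a hedged sketch of a naive chained
string hash in C: the key is read once to compute the hash and read again (the
compare) to rule out a collision, which is where the extra string traffic comes
from. The names (hash_string, table_lookup, struct entry) are illustrative, not
from any particular library:

    #include <stddef.h>
    #include <string.h>

    struct entry {
        const char   *key;      /* stored string, compared on every probe */
        void         *value;
        struct entry *next;     /* chaining for collisions */
    };

    /* Pass 1 over the string: compute the hash (djb2-style). */
    static unsigned long hash_string(const char *s) {
        unsigned long h = 5381;
        for (; *s; s++)                      /* one load + one end-of-string test per char */
            h = h * 33 + (unsigned char)*s;
        return h;
    }

    /* Pass 2 over the string: a full compare to confirm the slot holds our key. */
    static void *table_lookup(struct entry **buckets, size_t nbuckets, const char *key) {
        unsigned long h = hash_string(key);
        for (struct entry *e = buckets[h % nbuckets]; e; e = e->next)
            if (strcmp(e->key, key) == 0)    /* string compare even with no collision */
                return e->value;
        return NULL;
    }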

For a trie, there is no such thing as a collision. In the naive version, you
allocate an array the length of the character set + 1 (which takes a couple of
ops, since you know that length ahead of time), and use the character itself
(I'm assuming 1-byte characters, but the principle is the same for 2-byte) as
an operand in some simple pointer arithmetic. Each array index stores a pointer
to the next array (also character set + 1 long), and you keep following those
pointers until you finish off the string. If you are inserting, you point to
one more array and set that extra +1 slot to something meaningful. If you are
doing a lookup, you check that slot to see if it is set to something meaningful
(or you can just set a marker on the last character's node instead of going one
more level down the tree, whatever).

The point is, for a 16-character string you count, say, two ops for memory
allocation, an op to find the slot in the array, and another op for allocation
or value setting, for each character.

So in all, an insert or lookup happens almost as fast as a direct string
compare, maybe on the order of n=16, ops < 100. Lookups can be even faster,
since you _have_ to short-circuit the lookup (read: stop reading the string)
as soon as you can't find even one character.
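
A hedged sketch of the naive trie just described, assuming 1-byte characters
and using the extra slot as an end-of-string marker (names are illustrative,
error handling omitted):

    #include <stdlib.h>

    #define FANOUT 257                        /* character set (256) + 1 terminal slot */

    struct trie_node {
        struct trie_node *child[FANOUT];      /* calloc'd, so unused slots are NULL */
    };

    /* Insert: one array index per character; no hashing, no string compare. */
    static void trie_insert(struct trie_node *root, const unsigned char *s) {
        struct trie_node *n = root;
        for (; *s; s++) {
            if (!n->child[*s])
                n->child[*s] = calloc(1, sizeof *n);   /* allocate the next level lazily */
            n = n->child[*s];
        }
        n->child[FANOUT - 1] = n;             /* set the extra +1 slot: "a string ends here" */
    }

    /* Lookup short-circuits the moment a character is missing. */
    static int trie_contains(const struct trie_node *root, const unsigned char *s) {
        const struct trie_node *n = root;
        for (; *s; s++) {
            n = n->child[*s];
            if (!n)
                return 0;                     /* stop reading the string immediately */
        }
        return n->child[FANOUT - 1] != NULL;
    }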

I'm wagging a bit, I've never built either structure in asm, but if you
analyze the number of operations each takes, the trie is insanely faster. Even
with memory access lags.

You can do some obvious speedups, like loading several characters into a
register at a time for the integer arithmetic, or loading the hash location's
value chains, or something...

One of the confusing points here is that the big-O estimates I've seen never
bother to account for the string reads and comparisons. With a hash you
basically have to do them a minimum of two times; with a trie, only once (and
sometimes not even that much). Usually you see hashes quoted as O(1) best and
O(n) worst, but really they are O(k) and O(k*n), where k is the string length,
while tries are really O(k).

All that being said, as soon as you try to work with either structure on disk,
hashes easily overtake tries in terms of speed since hard drive random access
times are much worse and tries require lots and lots and lots of random access
(for each character).

<http://www.statemaster.com/encyclopedia/Trie>

~~~
AlisdairO
Fascinating - thanks for the detailed response, and that does make a lot of
sense.

It seems that if string comparisons are the major cost, you could speed up
collision handling on the hash table quite a bit by storing additional
metadata. Storing the size of the string, for example, would allow you to skip
a lot of comparisons - you can do a simple size comparison prior to actually
comparing characters. This would also allow you to do slightly more funky
string comparisons, such as comparing characters from the end first, or even
starting at the middle, to skip common prefixes.

Alternatively, instead of storing the size you could store the computed hash
value for each string, which would allow you to compare hash values rather
than full strings. The likelihood of a collision on a 32 bit hash value (say)
is pretty low, so you'd mostly be avoiding extraneous string comparisons. This
would also speed up growing/shrinking the table quite a bit. Have you tried
these sorts of changes, and what (if any) difference do they make?
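
A hedged sketch of the entry layout being suggested, assuming a chained table:
caching the full hash (and the length) lets most non-matches be rejected with
an integer compare before any characters are read. Field names are illustrative:

    #include <stddef.h>
    #include <string.h>

    struct entry {
        unsigned long  hash;    /* cached full hash value */
        size_t         len;     /* cached strlen(key) */
        const char    *key;
        void          *value;
        struct entry  *next;
    };

    static int entry_matches(const struct entry *e, unsigned long h,
                             size_t len, const char *key) {
        /* Two cheap integer compares usually rule a slot out before the
           character-by-character compare ever runs. */
        return e->hash == h && e->len == len && memcmp(e->key, key, len) == 0;
    }

Resizing benefits too: e->hash % new_nbuckets re-buckets an entry without
rehashing (re-reading) the stored string.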

I can see that these optimisations merely reduce the cost of collisions rather
than the baseline non-collision cost, and it definitely makes sense that tries
require fewer operations. What I wonder is whether, as you scale up the trie
to very large datasets, the superiority of the trie tails off due to the
memory latency? Or are tries basically always superior in memory, due to their
depth never getting all that large?

(To be clear, I don't mean to doubt your work on this subject, since you're
clearly rather more knowledgeable than I am - I'm just interested :-) )

~~~
elblanco
No, not at all.

I think this is an incredibly interesting, and often overlooked, subject. I've
even seen hashes that resolve collisions by sticking tries in the slots
instead of linked lists for chaining. That way the "am I looking at a
collision?" part of the hash operation becomes "give me the memory location I
want, 100% of the time". It essentially eliminates hash collisions in the
classic sense, and the tries in the hash slots usually don't get very big, so
you can delay resizing the hash for quite a while (you just use a different
trigger, like when the first trie holds more than n strings).
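
Roughly what that hybrid looks like, sketched on top of the trie_node and
hash_string snippets above (entirely illustrative, not the exact scheme being
described):

    /* Each bucket is the root of a small trie rather than a chain of entries. */
    struct hybrid_table {
        size_t             nbuckets;
        struct trie_node **buckets;
    };

    static void hybrid_insert(struct hybrid_table *t, const unsigned char *key) {
        unsigned long h = hash_string((const char *)key);
        struct trie_node **slot = &t->buckets[h % t->nbuckets];
        if (!*slot)
            *slot = calloc(1, sizeof **slot);
        /* No stored-key comparison, so there is no classic collision step;
           keys landing in the same bucket just share a (usually shallow) trie. */
        trie_insert(*slot, key);
    }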

Some of the optimizations you mention make a lot of sense; I was generally
describing the more naive variations of the two concepts. One thing to keep in
mind is that tries grow in memory amazingly fast. So fast that the rest of your
system will have to start swapping out just to keep the trie in memory. A list
of a quarter million tokens, with an average length of 6, will probably eat
about 300-400MB before optimizations. If you restrict the tree depth to 4 or
so and stick hashes on the end, you can keep it around ~100MB. But a quarter
million tokens is not really all that much; real life applications will
usually involve millions of tokens, and a naive trie will rapidly outgrow the
memory on 32-bit systems. It's the classic tradeoff, memory for speed.
However, hashes are not really all that bad speed-wise, so the tradeoff isn't
quite as good as one might hope. If you have to go from a hash to a trie to
eke out an extra few percentage points in performance, you probably have
bigger problems.

In practice I've actually only chosen a trie over a hash a couple of times.
Mostly because, in the applications I work on, I never have to grow the hash,
which is the most expensive hash operation. But also because there are already
so many very good hash libraries in most languages, particularly persistent
hashes, which can be _very_ memory efficient. The practical difference in speed
between the two is really not that noticeable for most applications unless you
are doing billions of lookups a day (in which case you are probably looking
for a different setup anyway, involving some kind of database).

~~~
AlisdairO
That is (again) really interesting/informative. Thanks - this is one of those
conversations that really restores my faith in communicating over the web :-).

~~~
elblanco
Actually, your idea of just sticking in the strlen on the hash nodes as a
heuristic to prevent unnecessary collision checks is pretty good.

I too am glad to engage in this topic!

------
liuliu
I will argue that a properly implemented trie is more compact, but not
necessarily faster, than a hash table. The fact is, there are some badly
implemented tries that are even worse than a hash table in terms of memory
consumption. Also, I'd like to point out that an actual trie implementation
may not have a fixed fan-out of k child nodes. I most often use k = 256 or 512
for the first few layers and 4~8 for the last few, in order to get "hash
table"-like performance near the root and "tree"-like performance for deeper
lookups.
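
One plausible way to sketch that mixed fan-out in C (assuming the wide layers
index by a single byte, i.e. k = 256; the k = 512 case would need a wider
label than shown here):

    /* A node is either "wide" (direct 256-way index: fast, memory hungry)
       or "narrow" (a tiny linear scan over at most NARROW_MAX children). */
    #define NARROW_MAX 8

    struct vnode {
        int wide;
        union {
            struct vnode *direct[256];             /* wide: index by the byte itself */
            struct {
                unsigned char  labels[NARROW_MAX]; /* narrow: a few labelled slots */
                struct vnode  *child[NARROW_MAX];
                int            count;
            } narrow;
        } u;
    };

    static struct vnode *vnode_step(struct vnode *n, unsigned char c) {
        if (n->wide)
            return n->u.direct[c];                 /* "hash table"-like step near the root */
        for (int i = 0; i < n->u.narrow.count; i++)
            if (n->u.narrow.labels[i] == c)
                return n->u.narrow.child[i];       /* "tree"-like step deeper down */
        return NULL;
    }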

~~~
mhansen
_I will argue that a properly implemented trie is more compact, but not
necessarily faster, than a hash table._

Where do you argue this?

_The fact is, there are some badly implemented tries that are even worse than
a hash table in terms of memory consumption._

Any data structure can be badly implemented. What use is comparing a badly
implemented data structure to a well-implemented one?

~~~
liuliu
It seems that in the article the author suggests that in certain cases a trie
is faster than a hash table.

Yes, you are right that any data structure can be badly implemented. But some
data structures are easier to get wrong than others. For example, nobody has a
problem with selection sort; it is just hard to get wrong. But for binary
search, even a decent programmer may write (a + b) / 2, which overflows for
large indices. Same with tries: people commonly (intuitively) write things
like node_t* child[26];
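
For the binary search example, the classic bug and its fix (the overflow only
bites once the indices are large enough that their sum exceeds the integer
range), plus the trie declaration in question:

    /* Buggy: lo + hi can overflow int for very large arrays. */
    int mid_buggy(int lo, int hi) { return (lo + hi) / 2; }

    /* Safe: the difference never overflows when 0 <= lo <= hi. */
    int mid_safe(int lo, int hi)  { return lo + (hi - lo) / 2; }

    /* The trie version of "looks right, too naive": hard-coding 26 children
       assumes lowercase ASCII only, and still spends 26 pointers per node. */
    typedef struct node { struct node *child[26]; } node_t;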

------
timtadh
I have also been getting into tries recently (last 6 months) and if you are
interested in learning about them I recommend Sedgewick's Algorithms in C
(parts 1-4) for a good introduction to the subject. Available on Safari if you
have access to that.

------
senderista
In the same way, any naively "constant space" algorithm that indexes into an
array with a fixed number of pointers or indices is technically O(log n) in
space, since you need log n bits to index n objects. So a space factor of log
n can generally be treated as constant in practice.

