
String Interning – Trie vs. Hash Table Deathmatch - ScottWRobinson
http://loup-vaillant.fr/projects/string-interning/benchmark
======
AaronFriel
Tries ain't easy. It's no surprise that a naive trie is bested by a naive
hashtable, especially where the number of entries is dwarfed by the capacity
of the hashtable. The gold standard for a trie is probably the Judy array[1],
which I've heard has some patents on it, but I can't confirm.

Judy arrays obtain their performance by using a wide variety of specialized
leaf and branch node types. The goal is to fit as much of the data used for
decision making and branching as possible into the first cache line that is
read. In the case of very sparse tries this pays off handsomely.

That said, hash tables are really hard to beat. They are susceptible to denial
of service attacks though.

[1]
[http://judy.sourceforge.net/application/shop_interm.pdf](http://judy.sourceforge.net/application/shop_interm.pdf)

[2] [http://loup-vaillant.fr/projects/string-interning/intern_trie.c](http://loup-vaillant.fr/projects/string-interning/intern_trie.c)

~~~
beagle3
Also, critbit trees:
[http://cr.yp.to/critbit.html](http://cr.yp.to/critbit.html) - they are a
binary PATRICIA trie. They might beat both Judy arrays and hash tables on
specific input distributions: e.g., if all strings share a long common prefix
such as:

    
    
        AbstractFactoryFactoryFactoryCreator_x
    

where x is different in each string; a hash or a Judy array would need to scan
through the entire string, but a critbit tree would start by comparing the
'x'. Somewhat similar to how a Boyer-Moore search doesn't need to scan the
whole haystack while looking for needles.
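For illustration, here's a rough sketch (not djb's actual code; the names are
mine) of the core operation: finding the first differing byte and bit between
two keys, which is the "critical bit" an internal node would branch on. A real
critbit tree computes this once at insert time and stores it in the node, so
later lookups only inspect single bits:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// The (byte index, bit mask) a critbit internal node branches on.
struct CritBit { std::size_t byte; std::uint8_t mask; };

// Precondition: a != b (otherwise there is no differing bit).
CritBit critical_bit(const std::string& a, const std::string& b)
{
    // Treat strings as NUL-terminated so a proper prefix differs
    // from the longer string at the terminator position.
    auto at = [](const std::string& s, std::size_t i) -> std::uint8_t {
        return i < s.size() ? static_cast<std::uint8_t>(s[i]) : 0;
    };
    std::size_t i = 0;
    while (at(a, i) == at(b, i)) ++i;          // first differing byte
    std::uint8_t diff = at(a, i) ^ at(b, i);
    std::uint8_t mask = 0x80;
    while (!(diff & mask)) mask >>= 1;          // highest differing bit
    return {i, mask};
}
```

For the AbstractFactory... example above, two keys differing only in the final
character produce a critical bit at that last byte, so the tree separates them
after one comparison per level rather than a full rescan.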

~~~
aidenn0
I hacked in support for critbits from this repository:
[https://github.com/jgehring/critbit89](https://github.com/jgehring/critbit89)

I modified it to use something like the pool allocator that TFA's hash table
implementation used (testing later with and without showed only small
performance differences).

I couldn't replicate the dataset from TFA, since it wasn't provided, so I
downloaded the java corpus and just took the first 3M symbols in it. Relative
performance of the 3 benchmarks from the article were similar.

The result was that it was much more space-efficient than the trie
implementation from TFA (only slightly larger than the hash table), but about
1.8x slower than the trie implementation.

A quick -pg run showed 98% of the CPU time in cb_tree_insert, which isn't
useful for determining why it was so slow, since that's the monolithic
function that does all of the insert work other than allocation.

~~~
beagle3
Thanks, I wanted to do that but didn't have the time.

Probably the usual tradeoff: critbit does log2(fan_out) branches per location
(it branches per _bit_ ), whereas a regular trie does one branch per location
(it branches per _byte_ ). That it is only 1.8x slower is actually good.

I guess it's mostly the less predictable branches of the critbit (and the fact
that there's more of them).

------
bsdetector
> 85,000 unique identifiers (1.4Mb).

Basically all the data fit in the CPU cache, so what was being measured was
mostly the number of steps needed for each algorithm.

For most real uses, the cache is a large part of the performance. If you are
using the data in order, the trie may be faster because the next entry will
likely already be cached. If you are using data in random order or doing other
work in between lookups, the hashtable may be faster because it only has to
fetch 1 or 2 lines into the cache instead of several.

~~~
panic
_Basically all the data fit in the CPU cache_

True, but it's the relatively-slow L3 cache (the L2 cache is at most 1MB for
the processor being used for testing).

I think the biggest problem with this trie implementation is the amount of
pointer-chasing due to the low branching factor. The _step_ function does four
(!) loads for every character, each depending on the result of the previous
load. Real tries have much larger branching factors (and compress each node to
avoid wasting tons of memory).
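To illustrate the branching-factor point (a toy sketch, not the article's
implementation): with a full 256-way node, consuming each input byte costs
exactly one dependent load, at the price of enormous nodes -- which is exactly
the waste that Judy-style node compression exists to eliminate:

```cpp
#include <array>
#include <memory>
#include <string>

// Toy 256-way trie node: one child pointer per possible byte value.
// Each character of the key costs a single dependent load (child[c]),
// versus several chained loads per character in a low-fan-out layout.
struct Node
{
    std::array<std::unique_ptr<Node>, 256> child{};
    int id = -1;  // interned id; -1 means no string ends here
};

int intern(Node& root, const std::string& s, int& next_id)
{
    Node* n = &root;
    for (unsigned char c : s)
    {
        if (!n->child[c]) n->child[c] = std::make_unique<Node>();
        n = n->child[c].get();  // the one load per character
    }
    if (n->id < 0) n->id = next_id++;
    return n->id;
}
```

Of course, at 256 pointers per node this wastes huge amounts of memory on
sparse data, which is why real implementations compress nodes.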

~~~
flgr
> Basically all the data fit in the CPU cache

If Mb is used with its usual meaning here (megabit), that's only 175 kilobytes.

~~~
kryptiskt
Probably not, as the average length of the identifiers would be 2 bytes in
that case.

------
porges
> It seems the terrible performance of the STL can be explained by
> std::string: this thing hits the general purpose allocator every time a new
> string is constructed. In this benchmark, that means every time we insert a
> string, possibly more. Not good for such an inner loop. There are ways to
> speed things up, but that would complicate the code, and defeat the purpose
> of leaning on the standard library.

It's actually reasonably easy to avoid the unnecessary copying.

Something like this would do (use a string as the buffer, pass it by
reference, then use `try_emplace`). Also, it should probably be using the same
hash function as your C code:

    
    
        #include <cstdint>
        #include <fstream>
        #include <string>
        #include <unordered_map>
    
        class Intern_pool
        {
            struct fnv_hash
            {
                // FNV-1a hash -- http://isthe.com/chongo/tech/comp/fnv/
                std::size_t operator()(const std::string& s) const
                {
                    std::size_t hash = 2166136261; // FNV offset basis (32 bits)
                    for (auto c : s)
                    {
                        hash ^= static_cast<unsigned char>(c); // xor the next byte
                        hash *= 16777619;                      // FNV prime (32 bits)
                    }

                    return hash;
                }
            };

            std::unordered_map<std::string, std::uint32_t, fnv_hash> map;
            std::uint32_t next = 0;
    
        public:
            std::uint32_t add(const std::string& s)
            {
                auto r = map.try_emplace(s, next);
                if (r.second)
                {
                    ++next;
                }
    
                return r.first->second;
            }
        };
    
        int main(int argc, const char* argv[])
        {
            for (int i = 1; i < argc; ++i)
            {
                std::ifstream file(argv[i]);
    
                Intern_pool intern_pool;
    
                std::string line;
                while (std::getline(file, line))
                {
                    intern_pool.add(line);
                }
            }
    
            return 0;
        }

------
afsina
I thought one advantage of Tries over Hash tables is fast prefix searching.
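A hash scatters keys, so answering "all strings starting with ca" means
scanning the whole table, while a trie just walks down to the prefix node and
enumerates its subtree. A toy sketch (illustrative names; std::map children
for simplicity, so results come out sorted):

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Node
{
    std::map<char, std::unique_ptr<Node>> child;
    bool terminal = false;  // true if a word ends at this node
};

void insert(Node& root, const std::string& word)
{
    Node* n = &root;
    for (char c : word)
    {
        auto& slot = n->child[c];
        if (!slot) slot = std::make_unique<Node>();
        n = slot.get();
    }
    n->terminal = true;
}

// Depth-first collection of every word below node n.
void collect(const Node& n, std::string& prefix, std::vector<std::string>& out)
{
    if (n.terminal) out.push_back(prefix);
    for (const auto& [c, kid] : n.child)
    {
        prefix.push_back(c);
        collect(*kid, prefix, out);
        prefix.pop_back();
    }
}

std::vector<std::string> with_prefix(const Node& root, std::string prefix)
{
    const Node* n = &root;
    for (char c : prefix)
    {
        auto it = n->child.find(c);
        if (it == n->child.end()) return {};  // no word has this prefix
        n = it->second.get();
    }
    std::vector<std::string> out;
    collect(*n, prefix, out);
    return out;
}
```

The walk to the prefix node costs O(length of prefix), independent of how
many strings are stored.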

~~~
magicmu
I had actually never even heard of Tries until today. Fast prefix searching
makes a lot of sense; are there any other clear use cases for Tries?

~~~
birdsbolt
Large dictionaries (sets of words) can be compactly represented with tries.

For words with similar prefixes and suffixes, a directed acyclic word graph
(DAWG) is a much better option (it reuses suffixes as well as prefixes, not
just the prefix as in a trie); it's a little bit slower to build but fast to
traverse if done right.

Any problem with a lot of suffix/prefix reuse benefits from a proper trie
implementation (or a suffix array/tree as an alternative) - e.g. Lempel-Ziv
compression.

~~~
magicmu
Ahh that makes a ton of sense, thanks!

------
daemin
Is there any reason why you can't use a sorted vector (possibly of pointers
into pooled memory actually storing the strings)?

Would this necessarily have worse performance than a trie or a hash table?

~~~
gdwatson
The sorted vector has O(n) insertion cost. A hashtable has O(1) insertion,
possibly amortized; a trie, if I understand correctly, has an insertion cost
proportional to the length of the string, assuming a fixed alphabet.

So as the vector gets larger the amount of time it takes to insert an item
will grow at the same rate as the vector does, but the cost to insert the item
into a hashtable or trie will remain about the same.
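A quick sketch of what that looks like (illustrative only): lookup is
O(log n) via binary search, but keeping the vector sorted means every insert
shifts the tail of the array:

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct SortedIntern
{
    std::vector<std::string> words;  // kept sorted, no duplicates

    bool contains(const std::string& s) const
    {
        // O(log n) comparisons
        return std::binary_search(words.begin(), words.end(), s);
    }

    void add(const std::string& s)
    {
        auto it = std::lower_bound(words.begin(), words.end(), s);
        if (it == words.end() || *it != s)
            words.insert(it, s);  // shifts everything after it: O(n)
    }
};
```

Another wrinkle for interning specifically: inserts shift the positions of
later elements, so vector indices can't serve as stable interned ids the way
a hash table's stored values can.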

------
mjcohen
How did you validate your program?

