
Algorithms and Data Structures: Trie (Prefix tree) - juliascript
https://medium.com/algorithms/trie-prefix-tree-algorithm-ee7ab3fe3413#.ydm9bvqbh
======
bluefox
If I understand the task correctly, it is to map a word into another word that
belongs to the set of words that have the same letter frequency distribution.

So a possible solution is to define a representation for a letter frequency
distribution, and use an associative data type for the mapping.

Since the problem does not concern itself with prefixes, there is no need for
a prefix tree - in Python, a simple dict would do.
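
A minimal sketch of that idea in Python (names are my own; the
frequency distribution is represented as a tuple of per-letter counts):

```python
from collections import defaultdict

def frequency_key(word):
    # Representation of a letter frequency distribution:
    # a tuple of counts for 'a' through 'z'.
    counts = [0] * 26
    for letter in word.lower():
        counts[ord(letter) - ord("a")] += 1
    return tuple(counts)

# Map each frequency distribution to the words sharing it.
groups = defaultdict(list)
for word in ["listen", "silent", "enlist", "google"]:
    groups[frequency_key(word)].append(word)

print(groups[frequency_key("listen")])  # → ['listen', 'silent', 'enlist']
```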

A few words about style: when you write code, you have to come up with names
for variables, functions, and other things. In this case, it seems you just
went ahead and wrote down the first thing that came to mind, which is
sometimes verbose because you're still in problem exploration mode and things
are not clear. It's OK to do that, but please consider re-naming later on, so
that instead of a first-thing-I-thought-of name, you'll have a name that gives
the reader a sense of clarity. The process of doing that usually leads to
better understanding of the problem and perhaps better code. With experience,
you'll find that the need to re-name is reduced.

As a rule of thumb, the length of the name should correlate with its
scope. Names with indefinite scope tend to be long, while names with limited
scope tend to be shorter.

Finally, I encourage you to write more posts, and hope my critique is helpful.

~~~
jedimastert
>So a possible solution is to define a representation for a letter frequency
distribution, and use an associative data type for the mapping.

out of curiosity, could one such representation just be all of the letters in
the word, sorted? I wonder how the time-complexities would compare. I assume a
trie would still be faster.

~~~
bluefox
Yes. Another simple representation if you assume a maximum number of
occurrences per letter:

      (defvar *alphabet* "abcdefghijklmnopqrstuvwxyz")
      
      (defconstant capacity-in-bits 3)
      
      (defun key (word)
        (loop with key = 0
              with max-occurrences = (1- (ash 1 capacity-in-bits))
              for letter across *alphabet*
              for i upfrom 0
              for occurrences = (count letter word)
              do (if (> occurrences max-occurrences)
                     (error "too many occurrences")
                     (setf (ldb (byte capacity-in-bits (* i capacity-in-bits)) key)
                           occurrences))
              finally (return key)))
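
For comparison, the same bit-packed key sketched in Python (3 bits per
letter, as above; names are mine):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CAPACITY_IN_BITS = 3  # allows at most 7 occurrences per letter

def key(word):
    max_occurrences = (1 << CAPACITY_IN_BITS) - 1
    packed = 0
    for i, letter in enumerate(ALPHABET):
        occurrences = word.count(letter)
        if occurrences > max_occurrences:
            raise ValueError("too many occurrences")
        # Place this letter's count in its own 3-bit field.
        packed |= occurrences << (i * CAPACITY_IN_BITS)
    return packed

# Anagrams share a key; different distributions do not.
assert key("listen") == key("silent")
assert key("listen") != key("google")
```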

------
mbrumlow
" that means 235,887 operations for each string that I want to verify as a
real word."

/usr/share/dict/words is sorted. A binary search would be much more suitable
than iterating over each word, at least for a first run.
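
A sketch of that with Python's bisect module (assuming the word list fits
in memory and is already sorted, as /usr/share/dict/words usually is):

```python
import bisect

def is_word(sorted_words, word):
    # Binary search: O(log k) comparisons instead of a linear scan.
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

words = sorted(["apple", "banana", "cherry", "date"])
assert is_word(words, "cherry")
assert not is_word(words, "grape")
```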

And depending on the constraints of the actual problem you are trying to
solve, a binary search is likely the most well-rounded option.

(This mostly depends on whether memory and disk space constraints are in
place.)

But I would imagine most spell checkers store their dictionary quite
differently from dict/words

(off to take a look at the aspell source)

~~~
rocqua
Ok, so I did my bachelor thesis on Tries (suffix tries to be exact). So imma
run wild

If you get to preprocess, the trie is best. Building the trie takes time and
memory that is O(size of dict = k). After that, lookups are O(length of word
= n). Meanwhile, binary search is O(n log k) (that factor n there is kinda
contentious, as generally you will find a mismatch early in the word).

So if the dictionary is really large, or you're doing a shitton of lookups, the
trie is better than binary search. Sadly, in practice tries don't work very
well with caches. However, there is a very interesting intermediary form based
on something called LCP.

LCP stands for longest common prefix. The idea is to prevent rechecking the
first part of the word by keeping track how well we matched at the left and
right boundaries of our search interval.
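
A rough Python sketch of that LCP-tracking binary search (the simple
variant, without precomputed LCPs; names are mine):

```python
def lcp(a, b, start=0):
    # Length of the longest common prefix, resuming at `start`
    # (the caller guarantees a and b already agree up to `start`).
    i = start
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i

def contains(sorted_words, target):
    lo, hi = 0, len(sorted_words) - 1
    llcp = rlcp = 0  # how far target matched at each boundary
    while lo <= hi:
        mid = (lo + hi) // 2
        # Every word in [lo, hi] agrees with target on the first
        # min(llcp, rlcp) characters, so skip re-checking them.
        m = lcp(target, sorted_words[mid], min(llcp, rlcp))
        if m == len(target) == len(sorted_words[mid]):
            return True
        if target[m:] < sorted_words[mid][m:]:
            hi, rlcp = mid - 1, m
        else:
            lo, llcp = mid + 1, m
    return False

words = sorted(["interleave", "interlock", "internal", "internet"])
assert contains(words, "internet")
assert not contains(words, "inter")
```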

There is another magic trick we can use to get this technique up to the level
of a trie. This depends on the Range Minimum Query (RMQ). A range minimum
query asks: given a range (i, j) and an array A, what is the index of the
minimum of A between A[i] and A[j]? It turns out that, with linear
preprocessing, we can
answer this in _constant time_. Now, we build an array that, for any 2
adjacent words stores their LCP. Any time we have an interval in our search,
we don't go to the middle, but to the RMQ-index of that interval. This way we
split our search space more efficiently. In fact each interval in this search
method corresponds to a branching node in the trie.

~~~
mbrumlow
I agree that binary search might not be the best (depending on constraints);
my initial complaint was that there were faster ways to solve the problem
stated than searching every word every time.

I personally like the binary tree method for systems that don't or can't load
a dictionary in memory. I particularly like it because you can implement a
search that requires little memory and does not require you to read the entire
file. But I imagine that there are better ways to lay out the data that
potentially could get you the best of both worlds.

I can see how a trie would be useful for doing more than just checking
if a word is spelled correctly, e.g. finding the list of words that start
with a prefix. The question asked was mostly about looking up a word. I
posted something on this below before I found your reply. In my post below I
am concerned only with checking whether a word exists. Are my concerns with
regards to the trie valid? Would you agree that a hash table would be faster
than a trie (any kind) for checking if a word exists? Or have I overlooked
something?

~~~
rocqua
Comparing a hash-table to a trie is difficult in practice.

Theoretically, they are equally fast. Both take O(length of word) time. You
could give an advantage to the trie because it allows for early failure
detection, whilst the hash table requires processing the full word.

In practice though, for short words (e.g. < block size of the hash) the hash
might effectively be constant time. Finally, cache behavior is going to really
matter: tries require a pointer dereference for each character (you can
mitigate this somewhat by storing non-branching paths along the trie more
compactly). Such dereferences tend to really wreck a cache.

So in the end, it's better cache behaviour for the hash table vs early
stopping for the trie. Cache misses are so expensive, I'd guess the hash table
wins in most situations, but that is really a guess.

~~~
mbrumlow
I see your point about early detection. While a trie will have precise early
failure detection, a hash will have 2 ways to fail early.

One would be that the hash is simply not found. But at that point you had to
at least look at the entire input word, compute the hash, and check for the
hash. I have not done any actual analysis on this, but just for thought let
us look at the hypothetical case that words fail near halfway through the
testing process. In a hash we would have needed to at least read the full
word, while in a trie a comparison of about half the word against the letters
along the trie path would have been needed. So conceptually (at least in my
head) both the hash and the trie would need to have done _something_ with
about the same number of letters. The question then is whether computing and
checking the hash is more efficient than traversing the trie.

The second way the hash would fail early is when we have a hash match and
must compare against the words the hash entry held. Now if the collision rate
was low then we might only have to check one word; if it is high, then many
words. So I would hope the hash was tuned for a low collision rate. But when
comparing two words there is also early detection (this is the point I was
getting at). The entire word does not need to be compared; it would likely be
compared a word at a time (word as in 4 or 8 bytes at a time).

If I have time this weekend I think I want to implement both of these to
benchmark.

All that being said, the more I think about it, the more I think a trie would
be the _best_ for a spell checker in general. Simply checking whether
something is a word is nice, but what is nicer is the ability to suggest
alternatives to the misspelled word, and a trie would be much nicer to work
with when building that list.

~~~
rocqua
I missed the point (as others have said) that a Trie will take less memory as
it stores common prefixes only once.

As for early mismatch detection, the statistics of that elude me at the
moment. Certainly, early detection decreases as you get more words. However,
as you get more words, you probably get longer words (lest you run out of
space) so that makes early detection more valuable.

------
Fragguccino
Not every day I get to share this! I wrote some code in Go that uses a trie to
quickly find anagrams. My trick is that I load the trie with a dictionary, but
before I insert each word, I sort it alphabetically (ex: sort -> orst). Before
I lookup a word I sort it alphabetically, so when I locate it in the trie, I
also get all the other words that share the same characters. Because of the
nature of the data structure, I can also quickly find all the smaller anagrams
which will be handy when I get around to turning this into a full fledged
scrabble playing AI:
[https://github.com/RyanEdwardHall/anagrambler](https://github.com/RyanEdwardHall/anagrambler)
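
The trick can be sketched in Python with a nested-dict trie standing in
for the Go implementation (the "$" terminal key is my own choice):

```python
def insert(trie, word):
    node = trie
    for ch in sorted(word):  # canonical form: letters sorted
        node = node.setdefault(ch, {})
    node.setdefault("$", []).append(word)  # originals live at the leaf

def find_anagrams(trie, word):
    node = trie
    for ch in sorted(word):
        if ch not in node:
            return []
        node = node[ch]
    return node.get("$", [])

trie = {}
for w in ["sort", "rots", "tors", "stop"]:
    insert(trie, w)
assert set(find_anagrams(trie, "sort")) == {"sort", "rots", "tors"}
```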

------
no_protocol
It is refreshing to see a data structure "explained" without a drawing of some
boxes and arrows. +1 there for the creativity -- I'd be even more pleased to
see a data structure _explained_ with words without relying on images at all.

Also:

      ...to check if a word exists in the text file, it takes
      at most, as many operations as the length of the word
      itself. Much better than the 235,887 operations it
      was going to take before.

But your source for the words (/usr/share/dict/words) is probably already
sorted! I don't think that's a fair upper bound to quote.

~~~
rocqua
Still, trie lookup time is independent from your dictionary size. Sadly, in
practice they are hurt by the many pointer dereferences of traversing the
tree.

If you have a sorted list of strings, you can speed up the search by tracking
how many starting characters on the left and right boundary of your interval
already match the given word. In practice, this turns out to be almost as good
as tries as most steps eliminate quite a few characters to check.

If you like theory, you can add on Range Minimum Query to get the same
theoretical performance as tries. In practice though, this isn't necessary.

------
dugmartin
John Resig wrote up a nice analysis of tries using JavaScript five years ago.
It is well worth the read: [http://ejohn.org/blog/javascript-trie-performance-analysis/](http://ejohn.org/blog/javascript-trie-performance-analysis/)

------
rukuu001
I've implemented a trie exactly once in my 15 years of professional
programming, and I'm so happy I knew about tries when the time finally came :)

Edit: swapped _used_ with _implemented_

------
blt
The coolest use of Tries I've ever seen is the Apriori algorithm. It exploits
the trie properties and the structure of the query sequence in a really clever
way. It's beautiful.

------
aidenn0
One might note that properly implemented hash tables also have O(k) lookup
times. And in fact, much of the time well-tuned hash tables outperform well-
tuned tries.

Tries still have some advantages though:

1) The naive implementation of a trie, using an alphabet-sized vector for
the child nodes, is much closer to the ideal performance than the naive
implementation of a hash table. If I had to implement a dictionary on short
notice, I would use a trie.

2) Tries preserve ordering, while still allowing O(1) [in the number of items
in the dictionary] insert, removal and lookup. If you need to get the key that
is lexicographically next or previous to another key, or you need to find the
two keys that bound a particular value, it's a great structure to use.
Comparison-based trees can also do this, but not with constant-time insert and
remove.
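
The naive layout from point 1 can be sketched in Python (a lowercase-only
alphabet is assumed; names are mine):

```python
class TrieNode:
    def __init__(self):
        # One slot per letter of the alphabet, preallocated.
        self.children = [None] * 26
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        i = ord(ch) - ord("a")
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.is_word = True

def lookup(root, word):
    node = root
    for ch in word:
        node = node.children[ord(ch) - ord("a")]
        if node is None:
            return False
    return node.is_word

root = TrieNode()
for w in ["car", "cart", "care"]:
    insert(root, w)
assert lookup(root, "cart")
assert not lookup(root, "ca")  # prefix, not a stored word
```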

For a really space-efficient trie-like structure, see crit-bit trees[1], which
are basically a space-efficient radix tree over a binary alphabet (so each
internal node has a fanout of exactly 2).

1: [https://cr.yp.to/critbit.html](https://cr.yp.to/critbit.html)

------
dukoid
"I wrote the simplest version of a trie, using nested dictionaries." (⊙_◎)

~~~
nuggien
What's wrong with this? What would be the alternative? A preallocated array of
26 (or more if you account for special chars) for each node?
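
For reference, the nested-dictionary version is only a few lines; a sketch
(the end-of-word sentinel is my own choice):

```python
END = object()  # sentinel marking "a word ends here"

def insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def lookup(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False  # early failure: no such prefix
        node = node[ch]
    return END in node

trie = {}
for w in ["cat", "cats", "dog"]:
    insert(trie, w)
assert lookup(trie, "cats")
assert not lookup(trie, "ca")  # prefix only
```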

~~~
Jach
I recently wrote a basic trie in Nim for fun. I started with a preallocated
array and then rewrote parts to use hash tables; it's simpler since you don't
need to care about the alphabet size, and it can have space savings depending
on how the table grows. It wouldn't surprise me if there was a fancy
bitmapping approach, though, or if you could be more efficient if you had
n-gram statistics. (And as mentioned elsewhere, for the purposes of
storing/searching a dictionary you could just have a flat hashtable indexed
by the words, or a Bloom filter if you only care about definite
non-membership...)

------
kristianp
I'm not familiar with Python, but would a C extension be the best way to
implement a data structure like this, that requires high performance? Or is it
possible to write efficient enough code in Python?

~~~
erubin
When algorithms matter, you do get a lot of performance out of a better
algorithm. But I actually recently wrote an admittedly not-production-ready
trie extension in C, and it was about five lines of SWIG to use it from
Python.

------
alexchamberlain
I wonder if the OP compared the performance of this vs "word in set(words)";
tries are very interesting data structures, mind.

~~~
foobarian
The problem with anagrams is you'd have to test all permutations of a word,
which is n! hash lookups. Alternatively, you could store all prefixes in the
set and prune the permutations that way. Slightly more interesting than word
in set(words).
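
That pruning idea can be sketched like this (the prefix set is built from
the dictionary; names are mine):

```python
def build_sets(dictionary):
    words = set(dictionary)
    # Every non-empty prefix of every dictionary word.
    prefixes = {w[:i] for w in dictionary for i in range(1, len(w) + 1)}
    return words, prefixes

def anagrams(letters, words, prefixes):
    found = set()
    def extend(prefix, remaining):
        if prefix and prefix not in prefixes:
            return  # prune: no dictionary word starts this way
        if not remaining:
            if prefix in words:
                found.add(prefix)
            return
        for i, ch in enumerate(remaining):
            extend(prefix + ch, remaining[:i] + remaining[i + 1:])
    extend("", letters)
    return found

words, prefixes = build_sets(["listen", "silent", "enlist"])
assert anagrams("nlsiet", words, prefixes) == {"listen", "silent", "enlist"}
```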

------
elcct
I wish medium was more developer friendly. Articles about programming on that
platform are unreadable for me.

