
Autocomplete using Tries - mdibaiee
http://dibaiee.ir/autocomplete-predict-trie/
======
nostrademons
So tries are a really cool data structure and they're the natural fit for an
autocomplete, but a while back I had to implement an autocomplete widget (over
stock tickers) and it turns out that for many relatively small, dense
datasets, binary search over a sorted array, with a linear scan to enumerate
the matching suffixes, is a much (10x+) faster approach.

The problem is that a trie is a very cache-unfriendly data structure. Nodes
are linked to each other by pointers and may be scattered all over main
memory. Finding the first node usually involves chasing several pointers;
enumerating each successive one involves chasing several more. Meanwhile, if
you just had a sorted array, all subsequent results would be on the same page
at least, and many would be on the same cache line.

YMMV, based on your data's size and shape. Sorted arrays usually work best
when the full data set is small and dense (eg. stock tickers), while tries may
work better for sparse data (eg. sentences typed in by users). Always
benchmark before committing to an implementation.
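The sorted-array approach described above is short enough to sketch. This is an illustrative Python version (the ticker list is made up, not the original 2006 data):

```python
from bisect import bisect_left

def complete(sorted_words, prefix, k=10):
    """Return up to k completions of `prefix` from a sorted list.

    Binary search locates the first candidate in O(log n); a linear
    scan then collects neighbours, which sit contiguously in memory
    and tend to share cache lines.
    """
    i = bisect_left(sorted_words, prefix)
    out = []
    while (i < len(sorted_words) and len(out) < k
           and sorted_words[i].startswith(prefix)):
        out.append(sorted_words[i])
        i += 1
    return out

tickers = sorted(["AAPL", "AMZN", "GOOG", "GOOGL", "MSFT"])
complete(tickers, "GOOG")  # -> ["GOOG", "GOOGL"]
```

All matches for a prefix are adjacent in the sorted array, which is exactly why the linear scan stays cache-friendly.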

~~~
rch
Your conclusion is likely correct given your data, but did you happen to
evaluate HAT-tries as well?

[http://crpit.com/confpapers/CRPITV62Askitis.pdf](http://crpit.com/confpapers/CRPITV62Askitis.pdf)

~~~
nostrademons
Nope, I didn't - I did this work in 2006, and the HAT-trie didn't come out
until 2007. I did try a number of different variations on the internal
storage of nodes, though - the fastest was what the paper describes as a
burst-trie - as well as other binary search tree structures. The speed
difference was well over 5x (IIRC, it was something like 17x from the best
string trie vs. binary search), so if the paper's conclusions hold it
probably wouldn't have mattered.

~~~
srean
Your comment provoked a quick search on pointer-less encoding of tries. It
turns out one such implementation was patented in 2006

[http://www.google.com/patents/US20060020638](http://www.google.com/patents/US20060020638)

------
ot
When you have more than a handful of strings, you will want to return only a
small number (k) of completions, preferably sorted by some pre-assigned score.
Depending on how you order the trie, you may only be able to get the first k
completions in lexicographic order, which could be very far from the "best"
completions.

A common approach to solve this is to store in each internal node the maximum
score among all its descendants, and sort the children of each node by maximum
score. This way by doing a best-first search with a heap you can efficiently
retrieve the top-k completions by score.
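A minimal Python sketch of the scheme described above, assuming each node stores the maximum score over its subtree and finished words are fed back through the heap so they pop in score order (the class and function names here are mine, not from any paper):

```python
import heapq

class Node:
    def __init__(self):
        self.children = {}          # char -> Node
        self.score = None           # score if a word ends here
        self.best = float("-inf")   # max score in this subtree

def insert(root, word, score):
    node = root
    node.best = max(node.best, score)
    for ch in word:
        node = node.children.setdefault(ch, Node())
        node.best = max(node.best, score)
    node.score = score

def top_k(root, prefix, k):
    # Walk down to the prefix node, then best-first search by `best`.
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    heap = [(-node.best, prefix, node)]  # max-heap via negated scores
    out = []
    while heap and len(out) < k:
        neg, word, node = heapq.heappop(heap)
        if node is None:            # a finished word, popped in score order
            out.append((word, -neg))
            continue
        if node.score is not None:  # re-queue the word itself as a result
            heapq.heappush(heap, (-node.score, word, None))
        for ch, child in node.children.items():
            heapq.heappush(heap, (-child.best, word + ch, child))
    return out

root = Node()
for w, s in [("app", 5), ("apple", 10), ("apply", 7), ("banana", 3)]:
    insert(root, w, s)
top_k(root, "ap", 2)  # -> [("apple", 10), ("apply", 7)]
```

The heap never needs to hold more than the frontier of the search, so only the subtrees that can still beat the current candidates get expanded.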

If you then have millions of strings, it becomes tricky to do this in a space-
efficient and cache-efficient way. I wrote a paper a few years ago with
different space-time tradeoffs:
[http://www.di.unipi.it/~ottavian/files/topk_completion_www13...](http://www.di.unipi.it/~ottavian/files/topk_completion_www13.pdf)

------
gokhan
You may also want to check this spelling correction algo. I'm using it along
with trie. Original is in C# but ports are available.

[http://blog.faroo.com/2012/06/07/improved-edit-distance-base...](http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/)

------
AceJohnny2
Speaking of autocomplete, how would you design a data structure and
declaration syntax to feed to the autocomplete/parsing system of a constrained
CLI?

By "constrained CLI", I mean the kind that controls a router or other piece of
equipment which has a well-defined (um, somewhat) syntax, as opposed to the
somewhat more free-form aspect of the shell.

I ask, because one of my jobs had me occasionally working in the CLI
declarations for a type of router. The cli commands could be in the form
"interface foo enable", where "interface" could be shortened to its shortest
non-ambiguous form ("int" IIRC), and "foo" was a string corresponding to one of
the interfaces on the system. Pressing TAB after "int" would automatically
suggest the list of interfaces that exist on the system. This command had an
opposite in the form "no interface foo enable" (yeah, prefixing "no" instead
of postfixing "disable", though to be fair that may have been possible as
well). A declaration of such a syntax was done with something along the line
of:

    
    
      node(id=interface, type=keyword, keyword=interface, no_prefix=yes, next=int_name);
      node(id=int_name, type=string, next=enable);
      node(id=enable, type=keyword, keyword=enable);
    

(with several extra bells and whistles to account for the int_name string
referencing the existing interfaces on the system and to indicate that the
command was only complete with "enable" at the end).

There were hundreds of declaration files containing lines like that; they
were parsed by a humongous Perl script that output a 7MB source file
containing the data structures fed to libinput. It was easily the slowest
part of our compilation process, and Emacs would choke on that 7MB source
file.

It was terrible, but I never found examples of how to do it better. Anyone
have any experience?
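One direction would be to keep the grammar as plain data and resolve prefix matching and dynamic nodes at completion time, rather than generating a giant source file. A hypothetical Python sketch (the grammar, interface names, and helper functions are all invented for illustration):

```python
# A command grammar as nested dicts. "<name>" is a placeholder key
# that is expanded from live system state at completion time.
GRAMMAR = {
    "interface": {
        "<name>": {
            "enable": {},
        },
    },
}

def live_interfaces():
    # Stand-in for querying the system; these names are made up.
    return ["eth0", "eth1", "lo"]

def expand(node):
    """Replace placeholder keys with the current system state."""
    keys = []
    for k in node:
        keys.extend(live_interfaces() if k == "<name>" else [k])
    return keys

def complete(tokens, grammar=GRAMMAR):
    """Return completions for the last (possibly partial) token,
    resolving earlier tokens by shortest non-ambiguous prefix."""
    *done, partial = tokens
    node = grammar
    for tok in done:
        matches = [k for k in expand(node) if k.startswith(tok)]
        if len(matches) != 1:
            return []   # ambiguous or unknown token
        node = (node["<name>"] if matches[0] in live_interfaces()
                else node[matches[0]])
    return [k for k in expand(node) if k.startswith(partial)]

complete(["int", "eth0", ""])  # -> ["enable"]
```

The same walk that drives completion can drive parsing and validation, so the grammar stays one small data structure instead of a generated 7MB source file.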

------
ocharles
It seems you'd get more mileage out of a trie if you also compared it with a
sort of zipper approach. The idea being that if I type "a", I should just
focus the trie on the "a" subtree. If I next type "l", I then focus on the
"l" child of that new root, so I find all strings starting with "al". It
doesn't change the complexity class of completion, but maybe the small
optimisation is worth it.
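A rough Python sketch of that focusing idea (the class and method names are hypothetical): each keystroke descends one child from the current focus, so completion only ever walks the focused subtree instead of re-searching from the root.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.terminal = False

def build(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root

class Completer:
    """Incrementally refocus on the subtree for the typed prefix."""
    def __init__(self, root):
        self.focus = root
        self.prefix = ""

    def type(self, ch):
        # One dictionary lookup per keystroke; no walk from the root.
        if self.focus is not None:
            self.focus = self.focus.children.get(ch)
        self.prefix += ch
        return self

    def completions(self):
        out = []
        def dfs(node, word):
            if node.terminal:
                out.append(word)
            for ch, child in sorted(node.children.items()):
                dfs(child, word + ch)
        if self.focus is not None:
            dfs(self.focus, self.prefix)
        return out

c = Completer(build(["alice", "alan", "bob"]))
c.type("a").type("l")
c.completions()  # -> ["alan", "alice"]
```

Keeping the focus pointer also makes backspace cheap to support by remembering the path of visited nodes.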

------
msoad
I was watching this lecture last night about tries and suffix arrays. Pretty
cool stuff!

[http://youtu.be/NinWEPPrkDQ](http://youtu.be/NinWEPPrkDQ)

------
mrsmee89
This looks really cool.

Do you have any speed comparisons vs traditional autocomplete methods?

~~~
bpicolo
Tries are pretty much the traditional autocomplete method.

