
The Sorry State of Trie Implementations in Python - kevinwuhoo
http://kevinformatics.com/blog/2013/sorry-state-of-trie-implementations-in-python/
======
jeffreychang
Original author of the Biopython implementation here. Out of the maybe 100,000
lines of code that I have out on the internet somewhere, I never would have
thought that this little module would be what makes the front page of Hacker
News! Unfortunately, it's for a bug, but hey, I guess you can't win em all.

I've submitted the fix back to Biopython. Thanks for reporting the problem,
and please let me know if there are more issues.

~~~
kevinwuhoo
I'm was quite surprised (and happy) at the speed this fix came! I was mucking
around with C for a few hours today and was elated to see your email! I
thought that the original author would be long gone when I saw that the it was
created in 2002.

Everything works as advertised now and I've amended the entry. Using your trie
implementation cut my memory usage down from ~22gb to ~10gb. I just mapped
hundreds of thousands of peptides. Thanks!

------
textminer
Not useful for Kevin here, but for those with fat, frozen Python dictionaries
that're useful to some vital process (e.g., feature lookup), the Python
wrapper for the MARISA trie is wonderful. Tend to save 80-90% of memory
whenever I capture a dict with it, and it works with memory mapping.

Quite efficient even for something silly, like casting a long_int -> short_int
mapping by first converting the keys to strings. Hurrah, prefix trees!

------
njbooher
It's even sorrier if the trie you want to search doesn't fit into memory.
There are compressed suffix trees, but at the time all I could find was a C++
implementation[1] and I had to write my own bindings[2] using Cython. After
figuring all that out, it turned out tries weren't very well suited the
problem I wanted to solve. Oops.

[1] <http://www.cs.helsinki.fi/group/suds/cst>

[2]
[https://github.com/njbooher/boglab_OTSF/blob/master/sstree/s...](https://github.com/njbooher/boglab_OTSF/blob/master/sstree/sstree.pyx)

------
shabble
A while back I wrote a perl wrapper[1] for the CProps[2] Character-Trie, and
found the library interface quite nice to deal with, even for a complete noob
at perl-xs.

They have a bit-trie listed as well (I only ever needed strings) which might
be a reasonable fit for more complicated serialised objects. I have no clue
how easy the python FFI/native library stuff is though. I started out with
Inline::C[3], which is both amazing and horrifying in near-equal measure.

[1] Sadly neglected, and never properly packaged for CPAN, but still around at
<https://github.com/shabble/cprops-perl/>

[2] <http://cprops.sourceforge.net/>

[3] <http://search.cpan.org/~sisyphus/Inline-0.51/C/C.pod>

------
andrewcooke
the first comment there is reasonable - suffix arrays seem to be preferable
anyway (see <http://en.wikipedia.org/wiki/Suffix_array>). but i am not a
biologist so perhaps i am missing something? (what?)

(i am not questioning the wider point about it being nice to have a good
suffix tree library; just curious why it's important here).

[thanks for the replies]

~~~
kevinwuhoo
You're right, generally suffix arrays are more efficient than suffix tries,
but biologically there are two reasons why suffix tries are preferred.

1\. There are a lot of repeated sequences in biology and many of them are
slightly different than each other. For example, we need to insert
"difference" and "different" in the structure. The suffix trie would have
shared nodes while the suffix array needs an entry for each suffix.

2\. Mutations are how evolution takes action. Biologically, in sequences,
mutations are just misspellings. If you're looking for a sequence with
arbitrary misspellings at position i, a tree structure is easier and faster to
traverse.

But in reality, most indexing, especially in DNA, uses BWT [1] for indexing.
However, there are more complex implementations of suffix arrays that may be
suitable [2].

[1]:
[http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transfo...](http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform#BWT_in_bioinformatics)

[2]: [http://www.mi.fu-
berlin.de/wiki/pub/ABI/AdvancedAlgorithms11...](http://www.mi.fu-
berlin.de/wiki/pub/ABI/AdvancedAlgorithms11_Indices/script-09-EnhancedSuffixArrays.pdf)

~~~
attractivechaos
Actually just notice that you use "suffix trie". Note that there is trie
(which is also called prefix tree) and suffix tree. There is no suffix trie.
Suffix array usually uses less memory, but I am not sure in theory it is
faster than suffix tree. We can use FM-index and enhanced suffix array (ESA)
to simulate trie traversal. You can use ESA to achieve almost everything that
you can do with a suffix tree.

Trie and suffix tree are natural structures for multiple strings. The typical
suffix array is for one string. It is possible to encode multiple strings with
suffix array, but that part of theory is not widely used.

Also, ESA was introduced in ~2002. It has been well known in CS. Because ESA
is in some way a super set of FM-index, you can of course implement those
popular mappers with ESA. The key reason that ESA is not widely seen is
because it uses much more memory than FM-index. In 2008 when these mappers
were greatly needed, no so many machines have 16GB or maybe more needed by
ESA.

~~~
justin66
> Actually just notice that you use "suffix trie". Note that there is trie
> (which is also called prefix tree) and suffix tree. There is no suffix trie.

At the risk of being snotty: someone should tell Robert Sedgwick. He uses the
term in his newest book, "An Introduction to the Analysis of Algorithms,
Second Edition," which just came out. Sorry if I haven't been fair to your
post due to a quick reading but Kevin isn't making the term up.

------
styluss
Does anyone know a good implementation in Ruby?

~~~
alxndr
<http://stackoverflow.com/a/9042547/303896>

<https://github.com/dustin/ruby-trie>

