
The Wonders of the Suffix Tree Through the Lens of Ukkonen’s Algorithm - pekalicious
https://humanreadablemag.com/issues/0/articles/the-wonders-of-the-suffix-tree-through-the-lens-of-ukkonen’s-algorithm/
======
lbj
So often lately, Ive come here and left disappointed. Reading about Wasabi and
hipster sweaters. But this post is exactly the kind of content we should be
pushing. Everyone who reads this will become a better hacker for it.

------
mumblemumble
This site is currently running a Kickstarter to launch a new programming
magazine. If you'd like to see more content like this out there, consider
taking a look:

[https://www.kickstarter.com/projects/pekalicious/human-
reada...](https://www.kickstarter.com/projects/pekalicious/human-readable-
magazine-for-the-inquisitive-developer)

------
weitzj
Also interesting to get into are suffix arrays which you could construct from
a tree. Then there is a notion of generalized suffix arrays, where you mix
multiple suffix arrays into a big suffix array. Then you can give fast answers
like: what is the longest common prefix between the several tokens, whereas
the tokens of your individual suffix arrays could be letters from an alphabet
(so you construct a suffix array from words) or maybe words (you construct a
suffix array from a sentence)

~~~
robinhouston
And then you could get into compressed indexes such as the FM-index, which
represent a text in a form that is both (typically) smaller than the original,
and _also_ supports fast substring queries.

Suffix trees are cool, but they’re just the entry point to a world of
wonderful data structures and algorithms.

~~~
del_operator
The Burrows-Wheeler transform was an interesting thing to see! When I used to
teach at an immersive 3 month program for web dev we had a few weeks of
playing with such arrays, trees, tries, dynamic programming, and text from
Wikipedia.

------
ridiculous_fish
In practice one should typically prefer a suffix array to a suffix tree. A
suffix array is conceptually so easy!

1\. Given a set of strings, put all the suffixes into a set. For example
"derp" -> "derp", "erp", "rp", "p".

2\. Sort it

It's cheap! The number of suffixes equals the string length, and suffixes can
be mere references into the string input. So in practice you use maybe ~2x
memory compared to the input - and in big cache friendly chunks (not tries).

Now you're flying. If you want to perform a substring search:

1\. Binary search on the first character. You get a range of suffixes which
begin with that character.

2\. Within that range, perform a binary search on the second character. Now
you have a range of suffixes which begin with the first two characters.

3\. etc.

Powerful and easy.

By the way this technique of going character by character can and should be
extended to the sort itself; look up "three-way radix quicksort" for what you
never learned in school.

~~~
SamReidHughes
More like 4x or 8x the memory (of an ASCII string) if you just use pointers or
size_t’s. Abstractly, you need at minimum (n log n) memory to represent a
permutation, which is what you’re doing.

------
nukke
I took string algorithms course with Esko Ukkonen. It was one of the most
interesting and fun algoritm courses ever and very useful.

Ukkonen algorithm for suffix trees and Kärkkainen-Sanders for suffix array are
very beautiful algorithms.

------
saagarjha
> Introduced to the world in 1973 by Peter Weiner, suffix trees are a
> relatively new concept in computer science, with this newness likely fueling
> its obscurity.

That isn't really all that new; it's just about when red-black trees and
B-trees were invented as well and I wouldn't call those "obscure".

~~~
kadoban
You're right, it is not new. It is slightly obscure though. Definitely comes
up if you study algorithms enough, but I don't think intro courses usually
even mention them, and most working programmers won't have any idea they
exist.

------
th-th-throwaway
Are there some hidden constants/considerations that prevent suffix trees from
being used in the real world?

Every time I expect it to show up (e.g., for pattern matching in databases), I
almost always see something else used instead (e.g., trigrams)

~~~
mehrdadn
A couple of considerations that come to mind for a database: (a) efficiency,
and (b) dynamicity. The tree structure isn't particularly efficient (it has a
lot of space overhead, and I think it's also not as fast), so you'd prefer
something else if possible. So people often use suffix arrays + LCA
preprocessing instead. However, those are hard to make dynamic (your database
is constantly changing), so I'd expect that still makes it tough to use them
in databases. Though I suspect this isn't the whole story.

------
StefanKarpinski
A suffix array is of similar utility to a suffix tree, more compact to store,
and pretty simple to compute: all you need to do is start with an array of
each index into the data from start to end, then sort the indices by
memcmp'ing the suffix string starting at each index. There are more optimized
algorithms, but they're within 2-3x for moderately sized data (megabytes). You
need the fancier sorting algorithms if you're constructing a prefix array of
something huge like a genome.

~~~
kadoban
Just as a side note, that algorithm for suffix array construction has really
bad worst cases (consider a string of all "a"s).

It won't matter for experimenting or most likely for using it on smallish
realistic texts, but anything where you're taking user data for instance you'd
want to step up to a slightly more clever algorithm (there's an O(n lg n)
worst case one that's still very simple, called rank doubling I believe).

------
carapace
(Ironically, for a site called "Human Readable Magazine" I found that really
hard to read.)

------
joe_the_user
(Edit: I been curious/confused by this for years)

Can someone explain how this is fundamentally different from a b-tree of the
substrings of a string? You'd store each substring by start index and each
leaf is where the substring becomes unique.

~~~
mumblemumble
The only thing suffix trees have in common with B-tries that I can see is that
they're both non-binary trees.

A B-tree stores whole keys in the nodes, not fragments of values. And a B-tree
tries to self-balance, so there's a lot of leeway for what keys get pushed up
to the root node. B-trees split and merge nodes based on the number of items
in them, trying to achieve a certain fill factor.

A suffix tree, on the other hand, is storing fragments of the values, and I
believe (but have not bothered to prove) that there is one and only one
minimal and valid way to organize the tree for a given set of data. Suffix
trees add children when the data mandate that they do in order to remain
correct, not just in order to make room.

~~~
joe_the_user
I should phrase it differently; what does a suffix tree get you that a
standard sorting tree doesn't get you. A standard sorting tree would get you
all the neighbors in lexical order. What more do you need?

~~~
mumblemumble
The B-tree just lets you quickly find out if a given string is in a set. The
suffix tree lets you do things like finding the longest common substring very
quickly, or looking for repeating motifs.

~~~
joe_the_user
A B-tree (or any sorting/index tree) allows one to find nearby elements in
where sorting order one choose. If one takes a B-tree of strings sorted in
lexical order, one find the neighbors of a given string X. If you start with a
string X=YZ, where Y is smallest substring in our base string-to-search and Z
is the part that doesn't occur, then the (left and right) neighbors of X will
be other strings containing Y, if such strings exist. Search of this sort may
take O(log(l)) time admitted where a suffix tree might take O(1) time (for all
I know) but still, a lot of specialized search trees are not that different
from B-tree+special-sort-order, as far as I can tell. For example, B-tree +
Z-ordering gives you a spatial sorting tree.

------
alexchamberlain
This might be a stupid question: why is the fact it’s a _suffix_ tree
important? (as opposed to building a similar structure with prefixes)

~~~
j-pb
A suffix tree contains a compact trie like graph representation of all the
suffixes of a long string. So if you want to know if it contains a substring,
you can start at the root and just search like in a prefix tree. Since for the
queried word to be an infix it must be a prefix of a suffix, you will find the
word in O(|query|) time.

So yeah its a prefix tree, but a prefix tree of all the suffixes.

And now to answer your question. If you naively put every suffix into a
regular prefix tree, you would get the same lookup performance. BUT, since the
suffixes will share a lot of suffixes, which the prefix tree can't compress
you have a lot more memory consumption.

The suffix tree avoids this by essentially performing compression on the
suffix as well, so that you simply have pointers to shared suffixes of
suffixes, which point to the root.

In the end it's like a diamond shape I guess, trie in the front, compression
in the back.

~~~
raincom
Why can't one construct a prefix tree of all prefixes, then start process from
the right to left (assume that it is an offline algorithm)?

~~~
j-pb
Because you can walk search trees from their root but not from their leafs.

------
herostratus101
I remember learning about this in a data structures class, and it is a
remarkable data structure. But the author of this article needs to pick up a
copy of Strunk and White. This reads like it was written by a high school
junior. Awful, awful writing.

~~~
pzh
It seems like the style of the article is deliberately long-winded and poetic
to match the type of content that the “human readable magazine” is going for
(similar to New Yorker articles). It’s likely seeking to cater to a specific
niche of reader preferences.

~~~
herostratus101
You've got to be kidding me. The sentences are excruciating to read. New
Yorker prose is in a different universe. The writing is spare and lucid. You
can't evaluate the quality of prose with "wc -w".

