
I don't understand why the trie would take quadratic memory. I would expect it to be high-constant linear. Can you explain?



Because a trie works great for prefix searches, but we want substring searches, so we'd have to modify the trie to be more like a suffix tree: in its simplest form, inserting each possible suffix into the trie and doing a prefix search over that. For each word, adding up the lengths of all its suffixes gives quadratic memory.
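A minimal sketch of that suffix-into-trie idea, in case it helps (a toy I wrote for illustration, not Ferret's actual code; the trie is deliberately bare-bones and byte-keyed):

    package main

    import "fmt"

    // trieNode is a bare-bones trie node keyed by byte.
    type trieNode struct {
        children map[byte]*trieNode
    }

    func newNode() *trieNode {
        return &trieNode{children: map[byte]*trieNode{}}
    }

    // insert adds one string to the trie, one byte per level.
    func (t *trieNode) insert(s string) {
        n := t
        for i := 0; i < len(s); i++ {
            c := s[i]
            if n.children[c] == nil {
                n.children[c] = newNode()
            }
            n = n.children[c]
        }
    }

    // hasPrefix reports whether any inserted string starts with p.
    func (t *trieNode) hasPrefix(p string) bool {
        n := t
        for i := 0; i < len(p); i++ {
            if n = n.children[p[i]]; n == nil {
                return false
            }
        }
        return true
    }

    func main() {
        root := newNode()
        words := []string{"ferret", "trie"}
        // Insert every suffix of every word, so a prefix query
        // on the trie becomes a substring query on the words.
        for _, w := range words {
            for i := range w {
                root.insert(w[i:])
            }
        }
        fmt.Println(root.hasPrefix("rre")) // true: inside "ferret"
        fmt.Println(root.hasPrefix("xyz")) // false
    }

A word of length m contributes m suffixes, which is where the memory cost of this approach comes from.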


That's linear. It's quadratic in the length of each word, but that doesn't matter (this is why I mentioned high-constant). It's linear in the size of the data set, which is the important part. I'd be happy to code it up if you'd like. Even a binary tree on all suffixes (which would be somewhat simpler to write) would have the linear memory / logarithmic search time that you're looking for.
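Concretely, here's one way to read the "all suffixes" idea, sketched as a sorted slice plus binary search rather than an actual tree (my own toy, not any real implementation): in Go, slicing a string shares its bytes, so each suffix costs only a small fixed-size header on top of the original characters.

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // buildSuffixIndex collects every suffix of every word.
    // Each suffix is a string header sharing the word's bytes, so
    // memory is linear in the total character count, with the
    // per-suffix headers as the large constant.
    func buildSuffixIndex(words []string) []string {
        var suffixes []string
        for _, w := range words {
            for i := range w {
                suffixes = append(suffixes, w[i:])
            }
        }
        sort.Strings(suffixes)
        return suffixes
    }

    // containsSubstring reports whether any indexed word contains q,
    // by binary-searching for a suffix that has q as a prefix.
    func containsSubstring(suffixes []string, q string) bool {
        i := sort.SearchStrings(suffixes, q)
        return i < len(suffixes) && strings.HasPrefix(suffixes[i], q)
    }

    func main() {
        idx := buildSuffixIndex([]string{"ferret", "trie"})
        fmt.Println(containsSubstring(idx, "rre")) // true
        fmt.Println(containsSubstring(idx, "xyz")) // false
    }
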

Edit: I can't reply to you yet, so here goes:

Calling it O(n m^2) is, at best, an abuse of big-O notation. Since m is a constant (1), that is large-constant O(n). Which is what I've been saying.

Yes, it increases the size of the data set. That would be the "large-constant" part.

(1) If this is not the case, I would love to hear why.


I did say "requiring quadratic memory from every word." I'm not sure you can really call that linear - it's O(n m^2), where n is the number of words, and m is the average word length. It's not the O(n m) that Ferret achieves.
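Making the counting explicit, under the simplifying assumption that every word has exactly length m and each suffix is stored character by character:

    % characters stored when every suffix of a length-m word is inserted
    \sum_{k=1}^{m} k = \frac{m(m+1)}{2} = O(m^2)
    % summed over n such words:
    n \cdot \frac{m(m+1)}{2} = O(n m^2)
    % versus the O(n m) characters needed to store each word once
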

For even a medium-sized dictionary of a few dozen MB, you'll find yourself quickly running out of memory. The 4MB dictionary running on the demo would jump to 328MB. Both a trie and a binary search tree (both of which I've coded and tested) take significantly more memory, and the binary search tree is somewhat slower (try doing a tree traversal between two points as fast as the same traversal over an array).

EDIT: I'll just reply to your edit here for simplicity and because I really don't want to start another thread as I have to get to sleep. I view m as about as constant as n - adding or subtracting words generally changes them both. Try thinking of m as c/n, where c is the total number of characters in your dictionary. O(n m^2) -> O(c^2 / n), whereas Ferret uses O(c). The 1,000,000 most frequent English words might have an m of 6-7, but the 100,000 longest English words might have an m of 12-15, taking more memory in a trie, but less memory in Ferret.
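Spelling that substitution out:

    % with c total characters across n words, m = c/n, so
    n m^2 \;=\; n \left(\frac{c}{n}\right)^2 \;=\; \frac{c^2}{n}
    % i.e. for a fixed character count c, fewer but longer words cost
    % the suffix-per-word trie more, while an O(c) structure stays put
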

But the point is really irrelevant, tbh. We both know how m and n work, and how much memory the trie and Ferret cost for different dictionaries, which is all that should really matter.



