
I don't understand why the trie would take quadratic memory. I would expect it to be high-constant linear. Can you explain?



Because a trie works great for prefix searches, but we want substring searches, so we'd have to modify the trie to be more like a suffix tree: in its simplest form, inserting each possible suffix into the trie and doing a prefix search over that. For each word, adding up the lengths of all its suffixes gives quadratic memory.
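A minimal sketch of that suffix-into-trie idea, in case it helps (a toy I wrote for illustration, not Ferret's actual code; the trie is deliberately bare-bones and byte-keyed):

    package main

    import "fmt"

    // trieNode is a bare-bones trie node keyed by byte.
    type trieNode struct {
        children map[byte]*trieNode
    }

    func newNode() *trieNode {
        return &trieNode{children: map[byte]*trieNode{}}
    }

    // insert adds one string to the trie, one byte per level.
    func (t *trieNode) insert(s string) {
        n := t
        for i := 0; i < len(s); i++ {
            c := s[i]
            if n.children[c] == nil {
                n.children[c] = newNode()
            }
            n = n.children[c]
        }
    }

    // hasPrefix reports whether any inserted string starts with p.
    func (t *trieNode) hasPrefix(p string) bool {
        n := t
        for i := 0; i < len(p); i++ {
            if n = n.children[p[i]]; n == nil {
                return false
            }
        }
        return true
    }

    func main() {
        root := newNode()
        words := []string{"ferret", "trie"}
        // Insert every suffix of every word, so a prefix query
        // on the trie becomes a substring query on the words.
        for _, w := range words {
            for i := range w {
                root.insert(w[i:])
            }
        }
        fmt.Println(root.hasPrefix("rre")) // true: inside "ferret"
        fmt.Println(root.hasPrefix("xyz")) // false
    }

A word of length m contributes m suffixes, which is where the memory cost of this approach comes from.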


That's linear. It's quadratic in the length of each word, but that doesn't matter (this is why I mentioned high-constant). It's linear in the size of the data set, which is the important part. I'd be happy to code it up if you'd like. Even a binary tree on all suffixes (which would be somewhat simpler to write) would have the linear memory / logarithmic search time that you're looking for.
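Concretely, here's one way to read the "all suffixes" idea, sketched as a sorted slice plus binary search rather than an actual tree (my own toy, not any real implementation): in Go, slicing a string shares its bytes, so each suffix costs only a small fixed-size header on top of the original characters.

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // buildSuffixIndex collects every suffix of every word.
    // Each suffix is a string header sharing the word's bytes, so
    // memory is linear in the total character count, with the
    // per-suffix headers as the large constant.
    func buildSuffixIndex(words []string) []string {
        var suffixes []string
        for _, w := range words {
            for i := range w {
                suffixes = append(suffixes, w[i:])
            }
        }
        sort.Strings(suffixes)
        return suffixes
    }

    // containsSubstring reports whether any indexed word contains q,
    // by binary-searching for a suffix that has q as a prefix.
    func containsSubstring(suffixes []string, q string) bool {
        i := sort.SearchStrings(suffixes, q)
        return i < len(suffixes) && strings.HasPrefix(suffixes[i], q)
    }

    func main() {
        idx := buildSuffixIndex([]string{"ferret", "trie"})
        fmt.Println(containsSubstring(idx, "rre")) // true
        fmt.Println(containsSubstring(idx, "xyz")) // false
    }
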

Edit: I can't reply to you yet, so here goes:

Calling it O(n m^2) is, at best, an abuse of big-O notation. Since m is a constant (1), that is large-constant O(n). Which is what I've been saying.

Yes, it increases the size of the data set. That would be the "large-constant" part.

(1) If this is not the case, I would love to hear why.


I did say "requiring quadratic memory from every word." I'm not sure you can really call that linear - it's O(n m^2), where n is the number of words, and m is the average word length. It's not the O(n m) that Ferret achieves.
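Making the counting explicit, under the simplifying assumption that every word has exactly length m and each suffix is stored character by character:

    % characters stored when every suffix of a length-m word is inserted
    \sum_{k=1}^{m} k = \frac{m(m+1)}{2} = O(m^2)
    % summed over n such words:
    n \cdot \frac{m(m+1)}{2} = O(n m^2)
    % versus the O(n m) characters needed to store each word once
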

For even a medium-sized dictionary of a few dozen MB, you'll find yourself quickly running out of memory. The 4MB dictionary running on the demo would jump to 328MB. Both a trie and a binary search tree (both of which I've coded and tested) take significantly more memory, and the binary search tree is somewhat slower (try doing a tree traversal between two points as fast as the same traversal over an array).

EDIT: I'll just reply to your edit here for simplicity and because I really don't want to start another thread as I have to get to sleep. I view m as about as constant as n - adding or subtracting words generally changes them both. Try thinking of m as c/n, where c is the total number of characters in your dictionary. O(n m^2) -> O(c^2 / n), whereas Ferret uses O(c). The 1,000,000 most frequent English words might have an m of 6-7, but the 100,000 longest English words might have an m of 12-15, taking more memory in a trie, but less memory in Ferret.
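Spelling that substitution out:

    % with c total characters across n words, m = c/n, so
    n m^2 \;=\; n \left(\frac{c}{n}\right)^2 \;=\; \frac{c^2}{n}
    % i.e. for a fixed character count c, fewer but longer words cost
    % the suffix-per-word trie more, while an O(c) structure stays put
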

But the point is really irrelevant, tbh. We both know how m and n work, and how much memory the trie and Ferret cost for different dictionaries, which is all that should really matter.



