
Optimizing trie-based spelling correction algorithms at Constructor.io - AstroChimpHam
http://blog.constructor.io/2017/07/12/optimizing-trie-based-spelling-correction-algorithms-at-constructor-io/
======
deepsun
I'm curious, how do you guys estimate "fitness" of the search results you
found, i.e. whether it was what user searched for or not.

Just looking whether user later clicked on a result, or continued to search?

~~~
gmossessian
That’s definitely one signal, but it can be refined by tracking whether the
search led to click-throughs, conversions, a different search, a bounce,
etc... The conversion is the most important event, but there are a lot of
important features aside from that which we'll use to improve results over
time. If you're interested, our behavioral api
([https://constructor.io/docs/#behavioral-
data](https://constructor.io/docs/#behavioral-data)) goes through some of this
as well. We've done a lot of work to figure out which signals are the most
important, and are continuing to develop this feature all the time as we
acquire more data.

------
lovesdogsnsnow
Does this mean all prefixes have to be regenerated for every addition of a new
item? Or can you invalidate the relevant prefixes?

~~~
gmossessian
This was a problem with the very first working implementation! However, with
some intermediate data structures we were able to take advantage of the fact
that, for an index with millions of items, adding or deleting a few thousand
probably does not change the set of prefixes or their relative rankings very
much. So although the initial precomputation of the prefix mappings from
scratch takes ~10 minutes for our largest customers, we were able to get
subsequent index updates down to mere seconds.

------
danmccorm
Nice post. How do you customize the Damerau-Levenshtein algorithm? Did you
write a new from scratch?

~~~
gmossessian
Hi, and thanks! Well, there's no need to re-invent the wheel, and there
already exist fast implementations of DL distance. However, Damerau-
Levenshtein distance is just one piece of our string comparison evaluation
system that is built in-house and is constantly under development and
improvement.

For example, another important aspect of the comparison metrics is our in-
house phonetics library that we've built to be sensitive to vowel context,
syllabification, diphthongs, stemming and lemmatization, and other language
phenomena, and we are fleshing it out to handle other languages including some
Eastern European and CJK.

------
ak-409
Cool) sounds as interesting approach. It is good to compare it in real life.

