
Fuzzy string matching using cosine similarity - pdq
http://blog.nishtahir.com/2015/09/19/fuzzy-string-matching-using-cosine-similarity/
======
thomasahle
Many people comment that there are nicer string-distance measures.

However, the real advantage of cosine distance is that you can perform
dimensionality reduction (see e.g. page 38 of [1]). This lets you work with
very large documents efficiently while keeping the matching fuzzy. It also
lets you build efficient data structures for finding similar strings, and
much more.

[1]:
[http://web.stanford.edu/class/cs246/slides/04-lsh_theory.pdf](http://web.stanford.edu/class/cs246/slides/04-lsh_theory.pdf)
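
For what it's worth, the random-hyperplane trick from those slides takes only
a few lines of Python. A minimal sketch, with names of my own choosing (not
from the slides): each bit of a signature records which side of a random
hyperplane the vector falls on, and the fraction of agreeing bits estimates
the angle between two vectors.

    import math
    import random

    def random_hyperplanes(dim, n_planes, seed=0):
        # One Gaussian random normal vector per hyperplane.
        rng = random.Random(seed)
        return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

    def signature(vec, planes):
        # Bit i: which side of hyperplane i the vector falls on.
        return [sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes]

    def estimated_cosine(sig_a, sig_b):
        # P(bits agree) = 1 - theta/pi, so the agreement rate recovers
        # an estimate of the angle, and hence of the cosine.
        agree = sum(x == y for x, y in zip(sig_a, sig_b))
        return math.cos(math.pi * (1 - agree / len(sig_a)))

A few hundred bits per document are enough to compare documents of any size,
which is where the efficiency comes from.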

------
felixguendling
That is a simple but very effective method. I am using it in my projects via
this tiny C++ library I wrote earlier this year. It uses cosine similarity
over trigrams and is quite fast.

[https://github.com/felixguendling/guess](https://github.com/felixguendling/guess)
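
Not the library's actual code, but presumably the idea is something like the
following Python sketch: build count vectors over character trigrams instead
of single characters, then take the cosine of those vectors.

    from collections import Counter

    def trigrams(s):
        # Pad the ends so boundary characters form trigrams too.
        s = "  " + s + "  "
        return [s[i:i + 3] for i in range(len(s) - 2)]

    def trigram_cosine(a, b):
        va, vb = Counter(trigrams(a)), Counter(trigrams(b))
        dot = sum(va[t] * vb[t] for t in va)
        norm_sq = sum(x * x for x in va.values()) * sum(x * x for x in vb.values())
        return dot / norm_sq ** 0.5

    print(trigram_cosine("guess", "guass"))  # ~0.57

Unlike single characters, trigrams capture local ordering, so words with the
same letters in a different order no longer look identical.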

~~~
andrewflnr
NLP neophyte here: where do trigrams fit in? Do you base your vectors on
trigrams instead of characters?

------
afandian
This is an experiment that everyone who learned about cosine distance and
vector space information retrieval probably tried, just to see what would
happen.

"Fuzzy string matching" isn't really the right term. Of course string distance
will give a better fuzzy string match. But this finds words with similar
structure rather than words that are similar in the traditional sense. Perhaps
there's no obvious application, but it gives results that are very interesting
in and of themselves.

It might serve to partition the search space for another, more expensive
algorithm.

~~~
bede
> It might serve to partition search space for another more expensive
> algorithm

This is what we're doing. We're working on approaches for performing
nucleotide sequence alignment using approximate time series representations of
vectorised DNA sequences. There are some really elegant lower bounding
similarity search methods for time series which generate no false negative
alignments, allowing use of more expensive alignment algorithms later on in
the search to prune out the false positives.

We've tested a few transformations including Haar wavelets, DFT and PAA and
implemented an indexing structure in C++.

Preprint:
[https://www.academia.edu/12575290/Alignment_by_numbers_seque...](https://www.academia.edu/12575290/Alignment_by_numbers_sequence_assembly_using_compressed_numerical_representations)
(slightly outdated... Email me for more info)
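
For anyone curious, PAA (Piecewise Aggregate Approximation) is the simplest of
those transformations to sketch. A toy version, assuming the series length
divides evenly into the number of segments:

    def paa(series, n_segments):
        # Piecewise Aggregate Approximation: replace each equal-width
        # segment of the series with its mean.
        seg = len(series) // n_segments
        return [sum(series[i * seg:(i + 1) * seg]) / seg
                for i in range(n_segments)]

    print(paa([1, 2, 3, 4, 5, 6, 7, 8], 4))  # [1.5, 3.5, 5.5, 7.5]

Suitably scaled, distances on the reduced representation lower-bound the true
distance, which is what makes no-false-negative pruning possible.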

------
sound_of_basker
I am not thinking this through fully, but since cos is monotonic between 0 and
pi/2, you can avoid the square root by squaring the numerator and still have a
good similarity indicator. It's a small tweak that should eke out some
performance.

~~~
contravariant
It might not be a good idea to just square the numerator. Doing so would give
an advantage to words with a small norm, i.e. a small number of letters.
Unless you actually want that to happen, of course.

You could alternatively square the expression in its entirety. That causes
problems if a.b is negative, but if you're only using positive coefficients
(as in this case) it can't happen; and even if it could, it's easy to check
the sign first.

~~~
sound_of_basker
That is what I meant. I wanted to point out that sqrt is more expensive than
squaring, and that you could just square the whole thing without affecting
the results too much.

cos(theta) and cos^2(theta) are pretty close, and both are monotonic between
0 and pi/2:

[http://www.wolframalpha.com/input/?i=cos%28theta%29+vs+cos%2...](http://www.wolframalpha.com/input/?i=cos%28theta%29+vs+cos%28theta%29*cos%28theta%29)
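
Concretely, the sqrt-free version might look like this (a sketch; it assumes
non-negative coefficients, so the dot product can't go negative and flip the
ordering):

    def cosine_sq(a, b):
        # Compare cos^2 instead of cos: no square root needed, and the
        # ranking is unchanged for angles in [0, pi/2].
        dot = sum(x * y for x, y in zip(a, b))
        return (dot * dot) / (sum(x * x for x in a) * sum(y * y for y in b))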

------
squigs25
Dimensionality reduction is correct: the cosine similarity on a
character/word level really just provides a metric for measuring the
"anagramness" of two words. For example, "Mary" and "Army" would have a
perfect similarity.
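
Easy to verify with a throwaway character-count version (not the article's
code):

    from collections import Counter

    def char_cosine(a, b):
        va, vb = Counter(a.lower()), Counter(b.lower())
        dot = sum(va[c] * vb[c] for c in va)
        norm_sq = sum(x * x for x in va.values()) * sum(x * x for x in vb.values())
        return dot / norm_sq ** 0.5

    print(char_cosine("Mary", "Army"))  # 1.0 -- anagrams are indistinguishable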

------
nl
This is nice, but a better measure of string similarity is an edit-count
based measure. See Levenshtein distance, for example:
[https://en.m.wikipedia.org/wiki/Levenshtein_distance](https://en.m.wikipedia.org/wiki/Levenshtein_distance)
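
For reference, the standard O(nm) dynamic program is short. A plain-Python
sketch:

    def levenshtein(a, b):
        # prev[j] holds the edit distance between the processed prefix of
        # `a` and b[:j]; each row advances the prefix by one character.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # delete ca
                               cur[j - 1] + 1,             # insert cb
                               prev[j - 1] + (ca != cb)))  # substitute
            prev = cur
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3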

~~~
xyzzyz
On the other hand, computing Levenshtein distance is O(n^2), while the cosine
approach is O(n).

~~~
danieldk
This very topic was discussed yesterday. For a given word, you can construct
an automaton that matches that word within a certain edit distance in O(n)
where n is the length of the word. Since the result is a DFSA, you can also
see if a string matches in O(n). (See the work of Schulz & Mihov.)

Moreover, rather than going over each word in a dictionary to find out whether
that word is within a certain distance, you can compute the intersection
language of the Levenshtein automaton and the dictionary automaton (which is
generally faster; intersection is O(mn) in the number of states of the two
automata).

For an implementation of this and more, see:

[https://github.com/danieldk/dictomaton](https://github.com/danieldk/dictomaton)
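
The Schulz & Mihov construction itself is involved, but the underlying
automaton is easy to simulate directly. A sketch (not dictomaton's
implementation): a state (i, e) means i characters of the pattern have been
consumed using e errors.

    def within_distance(pattern, s, k):
        def closure(states):
            # Epsilon moves: skip a pattern character at the cost of one
            # error (a deletion relative to the input string).
            stack, seen = list(states), set(states)
            while stack:
                i, e = stack.pop()
                if i < len(pattern) and e < k and (i + 1, e + 1) not in seen:
                    seen.add((i + 1, e + 1))
                    stack.append((i + 1, e + 1))
            return seen

        states = closure({(0, 0)})
        for ch in s:
            nxt = set()
            for i, e in states:
                if i < len(pattern) and pattern[i] == ch:
                    nxt.add((i + 1, e))          # match
                if e < k:
                    nxt.add((i, e + 1))          # extra character in s
                    if i < len(pattern):
                        nxt.add((i + 1, e + 1))  # substitution
            states = closure(nxt)
            if not states:
                return False                     # no match is possible
        return any(i == len(pattern) for i, e in states)

    print(within_distance("automaton", "automation", 1))  # True

Determinising this NFA is what makes per-string matching O(n), as described
in the parent comment.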

------
jazzido
Postgres users might find Teodor Sigaev's smlar [1] extension useful for this
kind of thing. It implements cosine and other similarity metrics.

[1]
[http://sigaev.ru/git/gitweb.cgi?p=smlar.git;a=summary](http://sigaev.ru/git/gitweb.cgi?p=smlar.git;a=summary)

------
timClicks
Would this be able to find near matches within substrings, e.g. a search for
"house" in "this is my hiuse typo" would bring back a near match result?

~~~
afandian
No. The character distribution of those two strings is massively different.
You'd have to tokenize the long string into words first.

EDIT: If I understood correctly and the vectors are normalised, it would match
'house' to 'hiusehiusehiuse'. Which may or may not be desirable.

~~~
dalke
The vectors are not normalized. That is:

    
    
        from __future__ import print_function
        from collections import Counter

        def cosine_similarity_unweighted(a, b):
            # Each string becomes a 0/1 vector over its distinct characters.
            a = Counter(a)
            b = Counter(b)
            dotProduct = sum((c in a) * (c in b) for c in a)
            # For 0/1 vectors the squared magnitude is the number of
            # distinct characters.
            magnitudeA = sum(1 for c in a)
            magnitudeB = sum(1 for c in b)
            return dotProduct / (magnitudeA * magnitudeB)**0.5

        def cosine_similarity_weighted(a, b):
            # Each string becomes a vector of character counts.
            a = Counter(a)
            b = Counter(b)
            dotProduct = sum(a.get(c, 0) * b.get(c, 0) for c in a)
            # magnitudeA and magnitudeB are squared magnitudes; the square
            # root of their product is taken below.
            magnitudeA = sum(value**2 for value in a.values())
            magnitudeB = sum(value**2 for value in b.values())
            return dotProduct / (magnitudeA * magnitudeB)**0.5

        print("Scores for 'hello' and 'holl'")
        print("unweighted", cosine_similarity_unweighted("hello", "holl"))
        print("weighted", cosine_similarity_weighted("hello", "holl"))
    

gives the output:

    
    
        Scores for 'hello' and 'holl'
        unweighted 0.866025403784
        weighted 0.925820099773
    

The article gives:

    
    
        holl        index: 0.9258200997725514  
    

which means the article used letter weights.

That said, the scores for your example are the same, because each of the
letters in 'house' is used only once (the 3 in the numerator cancels out the 3
in the denominator):

    
    
        Scores for 'house' and 'hiusehiusehiuse'
        unweighted 0.8
        weighted 0.8
    

If the query had instead been 'houss', you would see a difference in the
similarities:

    
    
        Scores for 'houss' and 'hiusehiusehiuse'
        unweighted 0.67082039325
        weighted 0.676123403783

------
sabujp
Much better than difflib's SequenceMatcher.

