Hacker News new | comments | ask | show | jobs | submit login
Fuzzy string matching using cosine similarity (nishtahir.com)
83 points by pdq on Sept 21, 2015 | hide | past | web | favorite | 18 comments

Many people comment that there are nicer string-distance measures.

However the real advantage of cosine distance is that you can perform dimensionality reduction (see e.g. page 38 of [1]). This allows you to work with very large documents efficiently and fuzzy. It also allows you to create efficient data structures for finding similar strings and much more.

[1]: http://web.stanford.edu/class/cs246/slides/04-lsh_theory.pdf

That is a simple but very effective method. I am using it for my projects with this tiny library I've written earlier this year (C++). It uses Cosine Similarity + Trigrams and is quite fast.


NLP neophyte here: where do trigrams fit in? Do you base your vectors on trigrams instead of characters?

This is an experiment that everyone who learned about cosine distance and vector space information retrieval probably tried, just to see what would happen.

"Fuzzy string matching" isn't really the right term. Of course string distance will give a better fuzzy string match. But this finds words with similar structure rather than words that are similar in the traditional sense. Perhaps there's no obvious application, but it gives results that are very interesting in and of themselves.

It might serve to partition search space for another more expensive algorithm.

> It might serve to partition search space for another more expensive algorithm

This is what we're doing. We're working on approaches for performing nucleotide sequence alignment using approximate time series representations of vectorised DNA sequences. There are some really elegant lower bounding similarity search methods for time series which generate no false negative alignments, allowing use of more expensive alignment algorithms later on in the search to prune out the false positives.

We've tested a few transformations including Haar wavelets, DFT and PAA and implemented an indexing structure in C++.

Preprint: https://www.academia.edu/12575290/Alignment_by_numbers_seque... (slightly outdated... Email me for more info)

I am not thinking this through but since cos is monotonous between 0 and pi/2, you can avoid the square root by squaring the numerator and still have a good similarity indicator. This is going to eke out some performance with a short tweak.

It might not be a good idea to just square the numerator. Doing so would give an advantage to words with a small norm, i.e. a small number of letters. Unless you actually want that to happen of course.

You could alternatively square the thing in it's entirety, this causes problems if a.b is negative but if you're only using positive coefficients (as in this case) then that can't happen, but even if it could then it's pretty easy to check if it is positive or not.

That is what I meant. I wanted to point out that sqrt is more expensive compared to squared and you could just square the whole thing without affecting the results too much.

cos(theta) and cos^2(theta) are pretty close and monotonous between 0 and pi/2:


Dimensionality reduciton is correct - the cosine similarity on a character/word level really just provides a metric for measuring the "anagramness" of two words. But "Mary" and "Army" would have a perfect similarity.

This is nice, but a better measure of string similarly is an edit count based measure. See Levenshtein distance for example: https://en.m.wikipedia.org/wiki/Levenshtein_distance

On the other hand, computing Levenshtein distance is O(n^2), while the cosine approach is O(n).

This very topic was discussed yesterday. For a given word, you can construct an automaton that matches that word within a certain edit distance in O(n) where n is the length of the word. Since the result is a DFSA, you can also see if a string matches in O(n). (See the work of Schulz & Mihov.)

Moreover, rather than going over each word in a dictionary to find out whether that word is within a certain distance, you can compute the intersection language of the Levenshtein automaton and the dictionary automaton (which is generally faster, intersection is O(mn) in the number of states of the two automata).

For an implementation of this and more, see:


Hmm true - very good point.

Worth noting that for short strings this is unlikely to matter.

There is [1], titled APPROXIMATING EDIT DISTANCE IN NEAR-LINEAR TIME. Wikipedia also has a reference[2] for better time complexity versions.

I haven't explored these though.

[1] http://www.mit.edu/~andoni/papers/compEdit.pdf

[2] https://en.wikipedia.org/wiki/Edit_distance#Improved_algorit...

Postgres users might find Teodor Sigaev's smlar [1] extension useful for this kind of thing. It implements cosine and other similarity metrics.

[1] http://sigaev.ru/git/gitweb.cgi?p=smlar.git;a=summary

Would this be able to find near matches within substrings, e.g. a search for "house" in "this is my hiuse typo" would bring back a near match result?

No. The character distribution of those two strings is massively different. You'd have to tokenize the long string into words first.

EDIT: If I understood correctly and the vectors are normalised, it would match 'house' to 'hiusehiusehiuse'. Which may or may not be desirable.

The vectors are not normalized. That is:

    from __future__ import print_function
    from collections import Counter

    def cosine_similarity_unweighted(a, b):
        a = Counter(a)
        b = Counter(b)
        dotProduct = sum((c in a) * (c in b) for c in a)
        magnitudeA = sum(1 for c in a)
        magnitudeB = sum(1 for c in b)
        return dotProduct / (magnitudeA * magnitudeB)**0.5

    def cosine_similarity_weighted(a, b):
        a = Counter(a)
        b = Counter(b)
        dotProduct = sum(a.get(c, 0) * b.get(c, 0) for c in a)
        magnitudeA = sum(value**2 for value in a.values())
        magnitudeB = sum(value**2 for value in b.values())
        return dotProduct / (magnitudeA * magnitudeB)**0.5

    print("Scores for 'hello' and 'holl'")
    print("unweighted", cosine_similarity_unweighted("hello", "holl"))
    print("weighted", cosine_similarity_weighted("hello", "holl"))
gives the output:

    Scores for 'hello' and 'holl'
    unweighted 0.866025403784
    weighted 0.925820099773
The article gives:

    holl        index: 0.9258200997725514  
which means the article used letter weights.

That said, the scores for your example are the same, because each of the letters in 'house' is used only once (the 3 in the numerator cancels out the 3 in the denominator):

    Scores for 'house' and 'hiusehiusehiuse'
    unweighted 0.8
    weighted 0.8
If the query had instead been 'houss', you would see a difference in the similarities:

    Scores for 'houss' and 'hiusehiusehiuse'
    unweighted 0.67082039325
    weighted 0.676123403783

much better than difflib sequencematcher

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact