
An Introduction to Approximate String Matching - semanti_ca
https://semanti.ca/blog/?an-introduction-into-approximate-string-matching
======
KenanSulayman
The Damerau-Levenshtein distance[1] also allows for the transposition of two
adjacent characters and is more precise in my experience. That said, these
string distance measures often have bad cases where seemingly similar strings
can still produce a high number of "edits" under Levenshtein-style
algorithms.
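To make the transposition point concrete, here is a minimal pure-Python sketch of the optimal-string-alignment variant of Damerau-Levenshtein (my own illustration, not code from the article): a single swap of adjacent characters costs 1 edit instead of the 2 that plain Levenshtein charges.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # adjacent transposition, e.g. "dc" -> "cd"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

# "abcd" vs "abdc" is one transposition here, but two edits for plain Levenshtein
```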

In more advanced applications, there's also the Kendall tau distance [2],
Jensen-Shannon divergence [3], overlap coefficient [4], Jaccard index [5], and
many more, which may prove significantly more precise (depending on your use
case).

[1]
[https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance](https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance)
[2]
[https://en.wikipedia.org/wiki/Kendall_tau_distance](https://en.wikipedia.org/wiki/Kendall_tau_distance)
[3]
[https://en.wikipedia.org/wiki/Jensen–Shannon_divergence](https://en.wikipedia.org/wiki/Jensen–Shannon_divergence)
[4]
[https://en.wikipedia.org/wiki/Overlap_coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient)
[5]
[https://en.wikipedia.org/wiki/Jaccard_index](https://en.wikipedia.org/wiki/Jaccard_index)
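As a sketch of the set-based measures above, the Jaccard index [5] can be applied to strings by treating each string as its set of character bigrams (the bigram choice is my assumption; any n-gram size works):

```python
def bigrams(s: str) -> set:
    """Set of overlapping 2-character substrings."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard index over character bigrams: |A & B| / |A | B|."""
    A, B = bigrams(a), bigrams(b)
    if not A and not B:
        return 1.0  # two empty/1-char strings: treat as identical
    return len(A & B) / len(A | B)

# "night" vs "nacht" share only the bigram "ht" out of 7 distinct bigrams
```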

------
timClicks
If you're comparing names of things, use the Jaro–Winkler distance. It weights
the early letters in the string as more important.

Excellent implementations of these two and other edit distances are available
as part of the jellyfish package
[https://pypi.org/project/jellyfish/](https://pypi.org/project/jellyfish/)
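To show what the prefix weighting does, here is a pure-Python sketch of Jaro-Winkler (my own illustration, not jellyfish's implementation): the Jaro score gets a bonus proportional to the length of the common prefix, up to 4 characters.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    if la == 0 or lb == 0:
        return 0.0
    window = max(la, lb) // 2 - 1
    a_match, b_match = [False] * la, [False] * lb
    matches = 0
    for i in range(la):
        for j in range(max(0, i - window), min(lb, i + window + 1)):
            if not b_match[j] and a[i] == b[j]:
                a_match[i] = b_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count matched characters that appear in a different order
    t, k = 0, 0
    for i in range(la):
        if a_match[i]:
            while not b_match[k]:
                k += 1
            if a[i] != b[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / la + matches / lb + (matches - t) / matches) / 3

def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Boost the Jaro score by the length of the common prefix (capped at 4)."""
    j = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# classic textbook pair: "MARTHA" vs "MARHTA" scores about 0.961
```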

------
sytelus
Very elementary article; the author should change the title to “Introduction
to Levantine’s distance”. Approximate string matching is a very wide field and
there are many much more efficient approaches than Levantine’s distance.

~~~
ofrzeta
It's "Levenshtein", named after the Soviet mathematician Vladimir Levenshtein.

------
Eridrus
There are all these edit distances that are useful, but when you want to do
fuzzy matching, you usually want to do it against a database. While fuzzy
matching is a great place to start, I think the more general way to think
about this is document retrieval, where you can simply know that CA and
California are two names for the same entity.

~~~
jgalt212
Yes, I agree that a DB should be the first pass, but what do you do when you
encounter a never-before-seen token?

e.g. your September:SEP mapping will fail when encountering the token _Sept_,
but Levenshtein will help you get this one right.
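That fallback can be sketched in a few lines: exact dictionary lookup first, and for an unseen token pick the closest known key by similarity (the `MONTHS` table is a made-up example, and I'm using stdlib `difflib.SequenceMatcher` as the similarity measure for brevity):

```python
from difflib import SequenceMatcher

# hypothetical mapping table for illustration
MONTHS = {"september": "SEP", "october": "OCT", "november": "NOV"}

def month_code(token: str) -> str:
    t = token.lower()
    if t in MONTHS:  # exact dictionary hit
        return MONTHS[t]
    # unseen variant like "Sept": fall back to the closest known key
    best = max(MONTHS, key=lambda k: SequenceMatcher(None, t, k).ratio())
    return MONTHS[best]
```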

~~~
Eridrus
This isn't really an either/or situation; you can do both. Retrieve + Rank is
a very flexible paradigm.
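A minimal sketch of the Retrieve + Rank idea (the `ENTITIES` list and `ALIASES` table are made up, and the first-letter filter is a deliberately crude stand-in for a real retrieval index): a cheap first stage narrows the candidate set, then a similarity measure ranks what's left.

```python
from difflib import SequenceMatcher

# hypothetical knowledge base and alias table for illustration
ENTITIES = ["California", "Colorado", "Carolina", "Canada"]
ALIASES = {"ca": "California"}

def resolve(query: str) -> str:
    q = query.lower()
    if q in ALIASES:  # retrieve: known alias, no fuzzy matching needed
        return ALIASES[q]
    # retrieve: cheap candidate filter (shared first letter, for brevity)
    candidates = [e for e in ENTITIES if e.lower()[0] == q[0]] or ENTITIES
    # rank: score the remaining candidates with a similarity measure
    return max(candidates, key=lambda e: SequenceMatcher(None, q, e.lower()).ratio())
```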

~~~
jgalt212
I agree, but your original comment seemed to portray it as either/or.

------
rubatuga
I was hoping that the article would actually talk about string matching
algorithms...

