d(a, c) <= d(a, b) + d(b, c) (triangle inequality)
d(a, b) = d(b, a) (symmetry)
d(a, a) = 0 (reflexivity)
Unless there're clever compilation algorithms to calculate edit distance really fast, that fall apart without a metric...?
I thought that Levenstein was the "golden standard" for this, but just curious what makes algorithms different. Is it purely a question of performance in terms of how fast it executes or memory consumption? Or is there a performance difference in terms of accuracy?
If Sift4 produces different results, what does it count exactly?
A: Of course, they are different algorithms. This is a fuzzy distance calculator, it doesn't give you the exact value. There are still edge cases. But the idea of Sift is to be fast and relatively accurate, rather than very accurate. You need more accuracy, try to combine Sift with Levenshtein for example, computing Levenshtein only where Sift says the strings are above a certain similarity.
In other words, if there are 5000 words, and you need to pick one that's pretty close (or far away) from a target word, but don't care so much about being exact, but do care about performance, then you can use Sift. Or if you have a list of 1,000,000 words, then you can use Sift to pare that down to the top 100 closest, then use Levenshtein to get the exact one.
Exact Algorithm vs Heuristic
You cannot do that, because, other than performance, Sift gives no guarantees regarding accuracy. The exact one, or rather the closest match, need not be among the 100 closest sift candidates.
Levenshtein (Wiki): 33.11 ± 0.83%
Levenshtein (php.js): 28.01 ± 0.43%
sift4: 1620 ± 1.26%
Levenshtein (fast-levenshtein): 37.58 ± 3.61%