Hacker News new | past | comments | ask | show | jobs | submit login
Sift4: Fast and accurate string distance algorithm (siderite.blogspot.com)
53 points by sytelus on Oct 27, 2015 | hide | past | web | favorite | 15 comments



One great property of the Levenshtein distance is that it is a metric in the mathematical sense. This means that the following (in)equalities hold (with d being the Levenshtein distance):

    d(a, c) <= d(a, b) + d(b, c) (triangle inequality)
    d(a, b) = d(b, a)            (symmetry)
    d(a, a) = 0                  (reflexivity)
Many efficient algorithms operating on distances need the distance measure to be metric in this sense, otherwise they don't work (e.g., [1]). There is also a lot of really useful math that can’t be applied if a distance measure is not a metric. One important question therefore is whether Sift4 is a metric or not. I haven’t looked at the definition of the algorithm but the author acknowledges that the measure is not symmetric, so it can't be a metric. Despite its speed the Sift4 can therefore often not be used as a drop-in replacement for the Levenshtein distance. There is a reason for why we’re are still using the Levenshtein distance despite its somewhat poor scaling properties.

[1] https://en.wikipedia.org/wiki/Metric_tree


Some applications of Levenshtein, such as spell checking and google suggest, seem like game physics in that they are consumed by humans, and gross approximations are good enough - even with occasional glitches. Speed trumps accuracy.

Unless there're clever compilation algorithms to calculate edit distance really fast, that fall apart without a metric...?


Could you provide examples of math application - that you mention - in the case of Levenshtein distance? I'm not interested in general topology, functional analysis etc., but what finds its use in computer science, software engineering.


I wasn't thinking of particular applications but I suppose that the math underlying clustering algorithms assumes that the distance measure is a metric to name just one example.


Could someone explain the differences in simple terms?

I thought that Levenstein was the "golden standard" for this, but just curious what makes algorithms different. Is it purely a question of performance in terms of how fast it executes or memory consumption? Or is there a performance difference in terms of accuracy?


There is no proof that this approximates the Levenshtein distance. Might be interesting to show this is true. (or somehow characterize all the bad cases)


Levenshtein distance is a metric that counts the number of insertions, deletions, and substitutions between strings.

If Sift4 produces different results, what does it count exactly?


Q: I compared Sift4 with another algorithm that is much more exact and there are differences.

A: Of course, they are different algorithms. This is a fuzzy distance calculator, it doesn't give you the exact value. There are still edge cases. But the idea of Sift is to be fast and relatively accurate, rather than very accurate. You need more accuracy, try to combine Sift with Levenshtein for example, computing Levenshtein only where Sift says the strings are above a certain similarity.

In other words, if there are 5000 words, and you need to pick one that's pretty close (or far away) from a target word, but don't care so much about being exact, but do care about performance, then you can use Sift. Or if you have a list of 1,000,000 words, then you can use Sift to pare that down to the top 100 closest, then use Levenshtein to get the exact one.

tl;dr

Exact Algorithm vs Heuristic


> Or if you have a list of 1,000,000 words, then you can use Sift to pare that down to the top 100 closest, then use Levenshtein to get the exact one.

You cannot do that, because, other than performance, Sift gives no guarantees regarding accuracy. The exact one, or rather the closest match, need not be among the 100 closest sift candidates.


Only if you absolutely, positively, need to find the right answer 100% of the time. In a lot of cases, a heuristic that runs significantly faster, can be perfectly serviceable.


Here's a performance test of Levenshtein vs Sift4 using UTF-8 strings for anyone interested:

http://jsperf.com/levenshtein-perf


if only i could see the results.


You click the "run tests" button. Results are in "Ops/sec". I get

Levenshtein (Wiki): 33.11 ± 0.83%

Levenshtein (php.js): 28.01 ± 0.43%

sift4: 1620 ± 1.26%

Levenshtein (fast-levenshtein): 37.58 ± 3.61%


I like that Sift4 seems to give pretty consistent answers to Levenstein which is nice. Some of Sift3's answers are a little suspect (darn right wrong in some cases). I'd like to see more data on Sift4's performance before I consider using this over Levenstein, but overall looks pretty decent.


There was a benchmark elsewhere in the thread[1]. If the results are fair (and I've no idea if they are), Sift4 looks to be 1.5 orders of magnitude faster than Levenstein. So it's worth keeping in mind if you're finding Levenstein to be your bottleneck.

[1]: https://news.ycombinator.com/item?id=10461765




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: