
Sift4: Fast and accurate string distance algorithm - sytelus
http://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html
======
tmalsburg2
One great property of the Levenshtein distance is that it is a metric in the
mathematical sense. This means that the following (in)equalities hold (with d
being the Levenshtein distance):

    
    
        d(a, c) <= d(a, b) + d(b, c) (triangle inequality)
        d(a, b) = d(b, a)            (symmetry)
        d(a, a) = 0                  (reflexivity)
    

Many efficient algorithms operating on distances need the distance measure to
be metric in this sense, otherwise they don't work (e.g., [1]). There is also
a lot of really useful math that can’t be applied if a distance measure is not
a metric. One important question therefore is whether Sift4 is a metric or
not. I haven’t looked at the definition of the algorithm but the author
acknowledges that the measure is not symmetric, so it can't be a metric.
Despite its speed the Sift4 can therefore often not be used as a drop-in
replacement for the Levenshtein distance. There is a reason for why we’re are
still using the Levenshtein distance despite its somewhat poor scaling
properties.

[1]
[https://en.wikipedia.org/wiki/Metric_tree](https://en.wikipedia.org/wiki/Metric_tree)

~~~
gregorias
Could you provide examples of math application - that you mention - in the
case of Levenshtein distance? I'm not interested in general topology,
functional analysis etc., but what finds its use in computer science, software
engineering.

~~~
tmalsburg2
I wasn't thinking of particular applications but I suppose that the math
underlying clustering algorithms assumes that the distance measure is a metric
to name just one example.

------
gingerlime
Could someone explain the differences in simple terms?

I thought that Levenstein was the "golden standard" for this, but just curious
what makes algorithms different. Is it purely a question of performance in
terms of how fast it executes or memory consumption? Or is there a performance
difference in terms of accuracy?

------
chaoxu
There is no proof that this approximates the Levenshtein distance. Might be
interesting to show this is true. (or somehow characterize all the bad cases)

------
jamesrom
Levenshtein distance is a metric that counts the number of insertions,
deletions, and substitutions between strings.

If Sift4 produces different results, what does it count exactly?

~~~
dsp1234
_Q: I compared Sift4 with another algorithm that is much more exact and there
are differences._

 _A: Of course, they are different algorithms. This is a fuzzy distance
calculator, it doesn 't give you the exact value. There are still edge cases.
But the idea of Sift is to be fast and relatively accurate, rather than very
accurate. You need more accuracy, try to combine Sift with Levenshtein for
example, computing Levenshtein only where Sift says the strings are above a
certain similarity._

In other words, if there are 5000 words, and you need to pick one that's
pretty close (or far away) from a target word, but don't care so much about
being exact, but do care about performance, then you can use Sift. Or if you
have a list of 1,000,000 words, then you can use Sift to pare that down to the
top 100 closest, then use Levenshtein to get the exact one.

tl;dr

Exact Algorithm vs Heuristic

~~~
patal
> Or if you have a list of 1,000,000 words, then you can use Sift to pare that
> down to the top 100 closest, then use Levenshtein to get the exact one.

You cannot do that, because, other than performance, Sift gives no guarantees
regarding accuracy. The exact one, or rather the closest match, need not be
among the 100 closest sift candidates.

~~~
lvh
Only if you absolutely, positively, need to find the right answer 100% of the
time. In a lot of cases, a heuristic that runs significantly faster, can be
perfectly serviceable.

------
nhebb
Here's a performance test of Levenshtein vs Sift4 using UTF-8 strings for
anyone interested:

[http://jsperf.com/levenshtein-perf](http://jsperf.com/levenshtein-perf)

~~~
ljw1001
if only i could see the results.

~~~
Veedrac
You click the "run tests" button. Results are in "Ops/sec". I get

Levenshtein (Wiki): 33.11 ± 0.83%

Levenshtein (php.js): 28.01 ± 0.43%

sift4: 1620 ± 1.26%

Levenshtein (fast-levenshtein): 37.58 ± 3.61%

------
Someone1234
I like that Sift4 seems to give pretty consistent answers to Levenstein which
is nice. Some of Sift3's answers are a little suspect (darn right wrong in
some cases). I'd like to see more data on Sift4's performance before I
consider using this over Levenstein, but overall looks pretty decent.

~~~
Veedrac
There was a benchmark elsewhere in the thread[1]. If the results are fair (and
I've no idea if they are), Sift4 looks to be 1.5 orders of magnitude faster
than Levenstein. So it's worth keeping in mind if you're finding Levenstein to
be your bottleneck.

[1]:
[https://news.ycombinator.com/item?id=10461765](https://news.ycombinator.com/item?id=10461765)

