
Levenshtein Distance - tosh
https://en.wikipedia.org/wiki/Levenshtein_distance
======
glangdale
We wrote a generalization of the Levenshtein distance algorithm for regular
expressions in Hyperscan
([https://github.com/intel/hyperscan](https://github.com/intel/hyperscan)),
which allows you to do Levenshtein distance matching over regular expressions.
It is not necessarily a cheap thing to do, given that incautious use of edit
distances and permissive regular expressions can result in automata that match
almost anything.

I believe that the first widely available approximate regex matcher was TRE
([https://laurikari.net/tre/](https://laurikari.net/tre/)). However, while
everybody wants to talk about approximate matching like they have something to
say, most algorithmists act like they forgot about Tre.

~~~
eesmith
agrep is from 1988. I heard about it some 10 years later.

[https://en.wikipedia.org/wiki/Agrep](https://en.wikipedia.org/wiki/Agrep)
says "A more recent agrep is the command-line tool provided with the TRE
regular expression library. TRE agrep is more powerful than Wu-Manber agrep
since it allows weights and total costs to be assigned separately to
individual groups in the pattern."

~~~
glangdale
You're right. I can't believe I forgot agrep (so addled was I with my "forgot
about tre" pun). Our departmental Unixen machines had agrep in '91, I think.

------
jorangreef
Here's a tip in case it comes in handy one day:

If you need to run Levenshtein distance over tens of thousands of strings,
perhaps as part of a fuzzy search, and you only need to match on distances of
0 or 1, not arbitrary distances, then there's an optimization: hard-code the
function just for distances of 0 or 1. It's not difficult to do, and the
payoff might enable new applications of what would otherwise be too costly.
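
For example, a hard-coded check for distance <= 1 might look something like
this (a sketch for plain Levenshtein, no transpositions; not the exact code I
used). One O(n) pass replaces the O(n*m) table:

    def within_distance_one(a: str, b: str) -> bool:
        """True iff the Levenshtein distance between a and b is 0 or 1.

        One O(n) pass and O(1) space, versus O(n*m) for the general DP.
        """
        if abs(len(a) - len(b)) > 1:
            return False
        if len(a) > len(b):
            a, b = b, a  # make a the shorter string
        i = j = edits = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                i += 1
                j += 1
                continue
            edits += 1
            if edits > 1:
                return False
            if len(a) == len(b):
                i += 1  # substitution
            j += 1      # otherwise: deletion from b / insertion into a
        # Any unconsumed tail of b costs one more edit.
        return edits + (len(b) - j) <= 1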

~~~
apendleton
If you're fuzzily looking up members in a fixed corpus, you can also employ
the trick from
[https://github.com/wolfgarbe/SymSpell](https://github.com/wolfgarbe/SymSpell),
which is essentially just to, for each string in your corpus, enumerate all of
the variants of that string with one letter removed, and put both the original
string and each variant in a hashtable as a key, with the original string as
the value. To do a lookup, you do the same enumeration (original query plus
each variant with one letter dropped), and look all of them up in your
hashtable. The values you get out of that are a list of all of the strings in
your corpus at edit distances 0 or 1, and potentially some at edit distance 2;
you can do a subsequent Levenshtein calculation on each to weed out the
distance-2 strings, but you only have to do it on this massively reduced set
rather than on your whole corpus.

So like, to index a string "hello", you'd add all of
{"hello":"hello","ello":"hello","hllo":"hello","helo":"hello","hell":"hello"}
into your table, and for the query "gello", you'd look up all of ["gello",
"ello", "gllo", "gelo", "gell"], and get a match on "ello"->"hello", then do a
Levenshtein calculation dist("gello","hello") to confirm it's within ED=1 (it
is), and be done. (Bonus: the same method works with Damerau-Levenshtein
distance as well.)
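
A rough sketch of the whole scheme, with a classic DP Levenshtein for the
verification step (names illustrative, not SymSpell's actual API):

    from collections import defaultdict

    def variants(word):
        """The word itself plus every one-character-deleted variant."""
        yield word
        for i in range(len(word)):
            yield word[:i] + word[i + 1:]

    def build_index(corpus):
        index = defaultdict(set)
        for word in corpus:
            for v in variants(word):
                index[v].add(word)
        return index

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def lookup(index, query, max_distance=1):
        candidates = set()
        for v in variants(query):
            candidates |= index.get(v, set())
        # The index can also surface distance-2 strings (different letters
        # deleted on each side), so verify on the reduced candidate set.
        return {w for w in candidates if levenshtein(query, w) <= max_distance}

With this, lookup(build_index(["hello"]), "gello") returns {"hello"}: only one
candidate survives the index, and only that one gets the full distance
computation.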

~~~
developer2
In which cases do you wind up with distance=2? I figured it'd jump out at me
in an obvious way, but it hasn't.

~~~
bryondowd
Can't think of a real-world example, but abc would match bcd in the above
algo, since both reduce to bc after one deletion, yet they are distance 2
apart. So: any case where one letter is deleted and another is added at a
different position.

~~~
aripickar
Boat and Oats is a real-world example with Levenshtein distance 2 that would
pass through the algo.

~~~
apendleton
Yeah, that's the crux -- two different real words storing the same variant but
having omitted different letters.

------
fuzzyexplainer
Is there a comprehensive guide that explains the differences between the
different libraries for fuzzy/approximate string matching?

There are SymSpell, agrep, fuzzywuzzy, closestmatch, fuzzysearch, fuzzyset,
SimString, ukkonen, and others on GitHub.

The problem is I don't have a clue which one to use for which case, what the
cases are, or what the differences between the libraries are.

A writeup illustrating several cases with a recommendation of a library and a
practical example for each one would be very helpful.

Btw, if anyone has experience with this type of stuff I would be interested in
talking to you (mail is in profile).

~~~
dmlittle
There's also Metaphone [1] for words that sound similar but are spelled
differently, for cases where Levenshtein might be too sensitive.

For example, someone might have mistyped "berry" as "barry", but you don't
want to match it against "terry" (both are a Levenshtein distance of 1 away).

[1]
[https://en.wikipedia.org/wiki/Metaphone](https://en.wikipedia.org/wiki/Metaphone)

~~~
mjevans
Thanks, I was searching the comments to see if anyone suggested a phonetic
variation. The default Linux spell checker is usually good enough, but
sometimes I only remember how a word sounds, and the close letter variations
are clearly wildly incorrect, while a search engine or phone can figure out
exactly the word I mean nearly all of the time.

------
Timpy
At my first professional programming gig the company I worked for was selling
items on Amazon. The listings were big ticket items that had several sellers
competing on them. Any seller could change some of the details of the listing,
so suddenly all their competition would be shipping the item with wrong specs.
I wrote a program that periodically grabbed the titles of our live Amazon
listings and compared them to our database of what we actually wanted the
titles to read. I used Levenshtein distance as a sort of "severity of change"
metric; the program would sort the changes accordingly and send an email to a
person on the sales team. It was fun implementing Levenshtein distance, but it
wasn't a perfect metric for this use case: some of the most severe changes
would be a single digit changing, e.g. "2gb" vs "8gb".

~~~
LorenPechtel
You can modify the basic algorithm to weight changes differently. I had the
opposite case--differing numbers were more likely to be what the user was
looking for. I made digit changes cost 1 point, one string being longer cost 1
point per character (suffixes were another case where it was likely the
objective) and everything else cost 2.
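
The modification is just replacing the DP's unit costs with a cost function; a
sketch along the lines of the weights described above (exact values
illustrative, not my original code):

    def weighted_distance(a, b):
        """Levenshtein with non-uniform costs: digit-for-digit substitutions
        and length differences cost 1, any other substitution costs 2."""
        INSERT = DELETE = 1

        def sub_cost(x, y):
            if x == y:
                return 0
            if x.isdigit() and y.isdigit():
                return 1  # differing numbers stay cheap to reach
            return 2

        prev = [j * INSERT for j in range(len(b) + 1)]
        for i, ca in enumerate(a, 1):
            curr = [i * DELETE]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + DELETE,
                                curr[j - 1] + INSERT,
                                prev[j - 1] + sub_cost(ca, cb)))
            prev = curr
        return prev[-1]

So weighted_distance("2gb", "8gb") is 1, while a non-digit substitution would
cost 2.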

~~~
nwallin
Someday I want to write a diff utility specifically for code. Whitespace
addition/removal cost like .1, addition/removal of a curly brace .5. There are
a handful of other things default diff does that make it really annoying for
code.

~~~
apjr
I did this for a school project, because students would "cheat" on group
projects by exclusively changing whitespace in diffs, or moving code around,
thereby making it look like they had large commits.

We tracked code "moves", whitespace changes, and "trivial" changes via
Levenshtein distance--which isn't great for capturing most simple renamings--
and then marked them up separately in an "augmented diff".

We briefly looked into doing AST edit-distance, but that turned out to be
infeasible for many reasons.

Overall it helped the prof grade faster, at least in the degenerate cases.

~~~
bonoboTP
Measuring a student's impact on a project by the number of lines of code they
committed can be very misleading.

Sure, in the extremes it will give sensible results (someone wrote only 3
lines, or wrote 90% of the code base).

But minor, questionably motivated refactorings can create huge commits, while
someone may come up with some great design while walking around for hours and
sketching on paper, which all ultimately boils down to a short function.

Dumb metrics like this cannot replace actually talking with the students,
asking questions about their choices and alternatives, having them do
presentations or lab notebooks/logbooks etc.

~~~
Cogito
> Dumb metrics like this cannot replace actually talking with the students,
> asking questions about their choices and alternatives, having them do
> presentations or lab notebooks/logbooks etc.

As said in the parent:

>> Overall it helped the prof grade faster, at least in the degenerate cases.

------
drusepth
I was once asked in an interview how I'd find a row in a relational database
where the value in one column was the shortest levenshtein distance away from
a search key, and I was absolutely stumped when I realized I didn't really
know a sort/search that optimized that way. Of course, by the time I realized
I could email the interviewer for the answer an awkward amount of time had
passed.

So now I ask HN: is there any "good" way to do this short of some kind of
answer like "find a database that natively offers levenshtein distance
search"? Are there actually sorting algorithms for sorting by levenshtein
distance (other than just using L as a sort function in a traditional sort
function)? Are there search algorithms that work well for searching by L
distance?

~~~
spdustin
Depending on the RDBMS and the target text (ideally, you’re looking to compare
an entire column to the search text, and in a known character set, and known
casing, etc.), I’d take one of two approaches:

1) write a UDF—in C, or for SQL Server, in pretty much any managed
language—that returned the Levenshtein distance between a passed param and a
column name, with a “max distance” param to aid in optimizing (e.g., maybe it
doesn’t bother calculating the distance if the length difference exceeds the
max Levenshtein distance)

Then you could write a query that looks something like:

        -- assuming a `products` table and the UDF described above
        SELECT product, description
        FROM products
        WHERE levenshtein('fobar', product, 2) <= 2

-or-

2) if you only care about a distance of 1 or 2, create a set of secondary
tables where you’d pre-calculate the permutations of every word, with a FK
pointing back to the PK of the actual row.

Alternately, a combination of a Levenshtein distance function and an RDBMS
that already supports fuzzy text matching seems (to me) to be ideal.

~~~
sixdimensional
I think sometimes it is easy to forget that SQL is just an abstraction over
algorithmic processing and shaping of data. A UDF is only a way to extend the
built-in algorithms, albeit one which may not be fully integrated with a given
query engine's architecture and therefore cannot be fully optimized without
access to the engine's internals.

I would argue your answer is the best as far as doing an actual Levenshtein
distance calculation goes.

That being said, sort orders are implemented algorithmically by the query
engine. ORDER BY in SQL Server, as I recall, implements up to 7 different
built-in algorithms depending on the case. But those algorithms are optimized
for the engine and have access to the internals.

I can think of two interesting ways to look at this. One is using an open
source database to extend the built-in sort algorithms in the query engine,
perhaps controlled by collation settings, to get a native Levenshtein sort
that has access to internals. Collations define the rules for sort order,
though this would probably apply within a column, not against a value
(although if you're extending the query engine, why not). See [1] for an
example of collation settings and how they control the rules of sort order -
perhaps a SET command to define Levenshtein order plus a native extension to
the Postgres query engine, and now you can sort a column by Levenshtein
distance.

The other thing I was thinking of was word vectors (which serve a similar
purpose to Levenshtein distance), though in a relational database that still
requires something special. Maybe some of the full-text search implementations
in databases could do this out of the box.

[1] [https://docs.microsoft.com/en-us/sql/relational-
databases/co...](https://docs.microsoft.com/en-us/sql/relational-
databases/collations/collation-and-unicode-support)

------
inlined
My first job at Microsoft was to make a system that could catalog the
"contracts" or "extension points" installed in Windows. I'd define a category
of extensions, such as a COM object registration, and my program used
normalized Levenshtein distance to warn about potential typos. After removing
all matching field names from the instance & class, I compared the edit
distance in the remaining cross product. If any pair had a normalized distance
< N it produced a warning. It even helped me find a mistyped GUID in a
prerelease copy of Windows.

------
dgudkov
We use the Damerau-Levenshtein edit distance for fuzzy matching in our data
prep tool. Here is a post about how it works:
[http://blog.easymorph.com/2017/08/fuzzy-matching-and-
merging...](http://blog.easymorph.com/2017/08/fuzzy-matching-and-merging.html)

Damerau-Levenshtein [1] is more practical than plain Levenshtein as it
includes transposition - a common cause of typos.

[1]
[https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_di...](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)
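
The restricted variant ("optimal string alignment") is a small extension of
the plain Levenshtein DP; a sketch, not our tool's actual implementation:

    def osa_distance(a, b):
        """Levenshtein plus adjacent transposition (the "optimal string
        alignment" restriction of Damerau-Levenshtein)."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[-1][-1]

With this, osa_distance("teh", "the") is 1, where plain Levenshtein gives 2.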

------
lpolovets
I love this algorithm -- it's so elegant and succinct.

Slightly off-topic, but one of the algorithms I'm most proud of coming up with
as an engineer was an offshoot of Levenshtein distance where I had to find all
pairs of strings in a set of ~1m strings which were fairly similar. 1m x 1m x
O(n^2) is very slow. There turned out to be a good way to figure out a lower
bound on distances using some precomputation, and that eliminated almost all
of the pairwise O(n^2) Levenshtein computations.

If anyone's interested, here's the description: [https://www.quora.com/What-is-
the-best-programming-algorithm...](https://www.quora.com/What-is-the-best-
programming-algorithm-that-you-have-ever-created/answer/Leo-Polovets)

~~~
tomglynch
Your sample use case in your 'At Google' response is eerie.

> "At Google, say we had 1 million plaintext passwords..."

~~~
lpolovets
That wasn't the actual use case and we didn't store passwords in plaintext.
But I didn't want to give away the actual use case. In hindsight, passwords
are definitely a bad fake example.

------
packetpirate
The only application I've used this for so far is a technique for defining and
recognizing gestures. They're called "flash gestures" (or that's what the page
where I read about them called them): you find the closest of 8 directions for
each part of the gesture, encode the gesture as a sequence of digits 1-8
corresponding to those closest directions, then use the Levenshtein distance
algorithm to find the closest matching pattern.

EDIT: Found the reference I used to implement this.

[https://books.google.com/books?id=_1rSBQAAQBAJ&pg=PA85&lpg=P...](https://books.google.com/books?id=_1rSBQAAQBAJ&pg=PA85&lpg=PA85&dq=gesture+recognition+algorithm+%2B%22levenshtein%22&source=bl&ots=lW4f4MvuV0&sig=ACfU3U2psj9IXvQHlnhakiWYFIF-7ONTLw&hl=en&sa=X&ved=2ahUKEwiP5bDimpfkAhUESN8KHeWkDyo4ChDoATAAegQICBAB)
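
The encoding half is easy to sketch (my paraphrase of the idea, not the book's
code; the digit assignment is arbitrary as long as templates and input use the
same one):

    import math

    def encode(points):
        """Quantize a stroke (a list of (x, y) points) into digits 1-8 by
        snapping each segment's angle to the nearest of 8 directions."""
        digits = []
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            angle = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
            digits.append(str(int(round(angle / (math.pi / 4))) % 8 + 1))
        return "".join(digits)

    def recognize(points, templates):
        """Return the template name whose digit string is closest in edit
        distance (`levenshtein` being the DP sketched elsewhere in this
        thread); templates maps name -> digit string."""
        code = encode(points)
        return min(templates, key=lambda name: levenshtein(code, templates[name]))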

------
curtis
Here's my attempt to implement Levenshtein Distance in Go, with lots of
comments:

[https://github.com/cbartley/diffy/blob/master/src/diffy/diff...](https://github.com/cbartley/diffy/blob/master/src/diffy/diff/diff.go)

~~~
revertts
Very nice! If you're looking for one more variant to add: LD with a limit on
the acceptable distance can be done more efficiently, and is nearly always
what you want in practice.

I wrote a commented version years back that might be a handy reference; it's
since been deprecated because we moved it out of StringUtils, but the original
code is here [https://github.com/apache/commons-
lang/blob/master/src/main/...](https://github.com/apache/commons-
lang/blob/master/src/main/java/org/apache/commons/lang3/StringUtils.java#L2297)

~~~
curtis
> _This implementation only computes the distance if it's less than or equal
> to the threshold value, returning -1 if it's greater. The advantage is
> performance: unbounded distance is O(nm), but a bound of k allows us to
> reduce it to O(km) time by only computing a diagonal stripe of width 2k + 1
> of the cost table. It is also possible to use this to compute the unbounded
> Levenshtein distance by starting the threshold at 1 and doubling each time
> until the distance is found; this is O(dm), where d is the distance._

That's pretty cool, especially the doubling scheme. I'm using a modified form
of Levenshtein Distance for comparing lines when diffing files, and that's
pretty expensive since code files that are thousands of lines long are not
uncommon. Since you are usually comparing one file to another version of
itself, the differences are often small though, so an incremental approach
would really pay off.
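
For reference, here's roughly what that banded computation might look like (a
sketch of the idea described in the quote, not the commons-lang code):

    def bounded_levenshtein(a, b, k):
        """Edit distance if it is <= k, else -1. Only a diagonal stripe of
        width 2k + 1 of the DP table is computed: O(k * len(a)) instead of
        O(len(a) * len(b))."""
        if abs(len(a) - len(b)) > k:
            return -1
        INF = k + 1  # any value > k means "outside the band / too far"
        prev = [j if j <= k else INF for j in range(len(b) + 1)]
        for i in range(1, len(a) + 1):
            curr = [INF] * (len(b) + 1)
            if i <= k:
                curr[0] = i
            for j in range(max(1, i - k), min(len(b), i + k) + 1):
                curr[j] = min(prev[j] + 1,      # deletion
                              curr[j - 1] + 1,  # insertion
                              prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
            prev = curr
        return prev[len(b)] if prev[len(b)] <= k else -1

    def levenshtein_by_doubling(a, b):
        """The doubling scheme from the quote: O(d * m), d = the distance."""
        k = 1
        while True:
            d = bounded_levenshtein(a, b, k)
            if d != -1:
                return d
            k *= 2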

------
dcchambers
We used this when building an app to determine 'meaningful' discrepancies
between preliminary radiology reports (done by residents) and final reports
(done by attending radiologists). It's a fun algorithm to implement that isn't
as simple as it seems. I think there are lots of uses for it.

~~~
dmtroyer
+1

------
mrcactu5
Is Levenshtein distance a "metric" in the sense of "metric spaces"? For
example, like the taxicab metric.

[https://en.wikipedia.org/wiki/Taxicab_geometry](https://en.wikipedia.org/wiki/Taxicab_geometry)

[https://en.wikipedia.org/wiki/Metric_(mathematics)](https://en.wikipedia.org/wiki/Metric_\(mathematics\))

Also, this looks like it could be related to Hamming codes? Error-correcting
codes.

[https://en.wikipedia.org/wiki/Hamming_distance](https://en.wikipedia.org/wiki/Hamming_distance)

~~~
jvanderbot
Yes, it is a metric. The following is BS "proof by restatement of
requirements".

1. Always >= 0 (what is a negative character edit?)

2. It satisfies identity, in that if no edits are required to make them equal,
the metric distance is 0 and they are equal (applies to either ordering of
strings, A->B and B->A)

3. It is symmetric (run the edits backwards to get B->A instead of A->B)

4. It satisfies the triangle inequality. Exercise left to the reader, but it's
intuitive: the edit distance is the "shortest path" through character changes
between two strings A and B, and deviating to visit an intermediate string C
cannot decrease the number of edits, A->C->B vs. the original path A->B
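
Restated in standard notation, the claim is that for all strings a, b, c:

    % d(a, b) = Levenshtein distance between strings a and b
    \begin{align*}
      d(a,b) &\ge 0                  && \text{non-negativity} \\
      d(a,b) &= 0 \iff a = b         && \text{identity of indiscernibles} \\
      d(a,b) &= d(b,a)               && \text{symmetry} \\
      d(a,b) &\le d(a,c) + d(c,b)    && \text{triangle inequality}
    \end{align*}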

I'm sleep deprived due to "offspring induced insomnia" so take this with a
grain of salt.

------
lordleft
I became intimately acquainted with Levenshtein's distance for my summer
internship project, and have implemented it myself a few times for interviews.
It was central to an event de-duplication system I built. Amazing how such a
seemingly trivial idea can actually be a) non-trivial to implement and b) so
powerful.

EDIT: By 'seemingly trivial', I mean that if you explain it to a lay person it
_seems_ easy at first blush - not that it is actually trivial.

~~~
kleiba
The problem with "trivial" is that it's in the eye of the beholder. And
everything can seem "trivial" after you've studied it long enough ;-) For
instance, I'd say the matrix implementation of Levenshtein is quite trivial...

------
pitaj
I've used Levenshtein distance to sort fuzzysearch results by order of
relevance. Worked very well combined with a special-case to sort exact matches
first.

~~~
Filligree
Why would a special case be necessary? The edit distance for two identical
strings is zero.

~~~
pitaj
It's an autocompletion for emoji, so several letters will be missing
regardless.

In that case, a long word like "helicopter" will be placed at a greater
distance than "elipsis" despite being what I want "heli" to actually match, so
I just added a "startsWith" check to override such issues.

~~~
tomjakubowski
Have you experimented with n-grams for approximate matching? I've found it
works very well, especially if you add special characters to the alphabet to
mark "start of string" and "end of string": like "heli" would have trigrams
{"^he", "hel", "eli", "li$"}, which would certainly be a better trigram match
to "helicopter" than to "elipsis" (sic).

[https://en.wikipedia.org/wiki/N-gram#n-grams_for_approximate...](https://en.wikipedia.org/wiki/N-gram#n-grams_for_approximate_matching)
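
Concretely, something like this (Jaccard similarity over the trigram sets;
Postgres's pg_trgm is built on a similar idea):

    def trigrams(s):
        """Trigram set with start/end markers, e.g. "heli" ->
        {"^he", "hel", "eli", "li$"}."""
        padded = "^" + s + "$"
        return {padded[i:i + 3] for i in range(len(padded) - 2)}

    def trigram_similarity(a, b):
        """Jaccard similarity over the two trigram sets."""
        ta, tb = trigrams(a), trigrams(b)
        return len(ta & tb) / len(ta | tb)

"heli" shares three trigrams with "helicopter" but only one ("eli") with
"elipsis", so the ranking comes out right.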

~~~
pitaj
I haven't, mostly because it's fast enough already and I don't really feel
like implementing this stuff myself.

~~~
simcop2387
Postgres has support for trigrams and indexes on them. I used them to help
match addresses at a previous company, worked fantastically for the part we
used them for.

~~~
pitaj
It's a client-side library.

------
olavgg
I use the Levenshtein distance for NLP data cleaning/spell correction. I also
use a bloom filter that can confirm that a word does not exist in my corpus,
so I don't have to run the algorithm for every word.
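
Roughly like this (a from-scratch sketch; in practice you'd likely reach for a
library):

    import hashlib

    class BloomFilter:
        """No false negatives: if might_contain() is False, the word is
        definitely not in the corpus, so only then run spell correction."""

        def __init__(self, size_bits=1 << 20, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, word):
            # Derive the k bit positions from one SHA-256 digest.
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            for i in range(self.num_hashes):
                chunk = digest[i * 4:(i + 1) * 4]
                yield int.from_bytes(chunk, "big") % self.size

        def add(self, word):
            for pos in self._positions(word):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, word):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(word))

Load the corpus with add(); then a True result from might_contain() (word
probably in the corpus, so probably spelled correctly) lets you skip the
expensive edit-distance search for that word.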

------
yeahboats
This link was gray for me because I recently used Levenshtein distance to sort
an arbitrary list of strings by finding a "comparison word" that gave the
ordering that I wanted. Definitely more of a cutesy (and difficult to explain)
way of doing things than a practical one, but CASE statements are so boring.
It was nice that Presto had it built in.

Did some more digging to see if there was a unique comparison word that would
enable me to sort any group of strings but found that it was impossible. Still
a fun afternoon of learning.

------
mrkeen
The names 'edit distance' and 'Levenshtein distance' are a bit unfortunate.

When I was trying to match batches of OCR'd strings to known strings, I knew
some algorithm like this had to exist, but I could not find it. My searches
kept turning up Jaccard similarity and Hamming distance again and again.

(Once I found it and began working on an implementation, I started noticing my
compiler would find mistyped variable names and suggest the correctly spelled
variable for me.)

~~~
jointpdf
There are a lot of unhelpfully named metrics, algorithms, etc. but I actually
think “edit distance” is fairly illustrative as these things go.

Another good one is Manhattan distance (or better yet, city block distance) in
place of Hamming distance. It gives a nice mental picture of the number of
road segments you’d need to travel in a city to get from point A to point B.

~~~
vallismortis
One step further (multi-layer Manhattan distance) will also get you optimal
DNA, RNA and protein sequence alignment (Myers, Gotoh and others, 1977-1981).
The edit path (often discarded) can be repurposed for simulating RNA and
protein folding, as well as calculating DNA melting points (used for DNA-DNA
hybridization studies).

------
fuzzyexplainer
What type of algorithm does Algolia use for its search (HN search
specifically)?

Is it anything like this (typeahead example)?
[https://rawgit.com/jeancroy/FuzzySearch/master/demo/autocomp...](https://rawgit.com/jeancroy/FuzzySearch/master/demo/autocomplete.html)

What is the name for this kind of approximate string matching?

~~~
redox_
Algolia's engine relies on an optimised Damerau-Levenshtein + prefix matching
algorithm.

------
jhpriestley
For some reason the Wikipedia article does not mention the more efficient
Myers diff approach. I have recently implemented it here:
[https://jasonhpriestley.com/array-distance-and-
recursion](https://jasonhpriestley.com/array-distance-and-recursion) (for edit
distance but Levenshtein should work similarly)

------
mromanuk
I'm using Levenshtein distance combined with a Trie/DAWG in my iOS keyboard to
detect typos and offer suggestions.

------
lep
If you consider strings as lists of characters, you can generalize Levenshtein
to work on any two lists. For example, I use this with lists of types to
provide a Hoogle-like search for a programming language's stdlib. Ironically,
I don't know what Hoogle actually uses, but I think they at least considered
such an approach.

------
blt
I'm surprised this page and the page on "string metrics" don't say more about
the mathematical implications of defining a distance metric on strings.
Something interesting must pop up if you apply theorems about metric spaces
from analysis and topology to the set of strings with edit distance.

~~~
dls2016
This metric induces the discrete topology, which isn't very useful. Regarding
its metric structure, you could apply the contraction mapping theorem but I
can't think of any interesting contractions off the top of my head.

I'm just spit-balling here... but when a natural metric induces the discrete
topology then it's probably more useful to look at connectivity graphs induced
by the metric. For instance the graph on all strings with two strings
connected by an edge if their edit distance equals 1. Restrict to words in a
dictionary and you have a basic spelling suggestion algorithm.
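
The distance-1 neighborhood (the edges incident to a word in that graph) can
even be generated directly instead of scanning the dictionary; a sketch in the
style of Norvig's spell corrector, with transpositions omitted since this is
plain Levenshtein:

    import string

    def neighbors(word, dictionary):
        """All dictionary words at Levenshtein distance exactly 1 from word,
        assuming lowercase ASCII words."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {l + r[1:] for l, r in splits if r}
        inserts = {l + c + r for l, r in splits for c in letters}
        replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
        return ((deletes | inserts | replaces) & dictionary) - {word}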

Edit: the analyst in me really wants a translation-invariant abelian group
structure ... then we can do Fourier analysis...

~~~
olooney
> Restrict to words in a dictionary and you have a basic spelling suggestion
> algorithm.

Yep.

[https://norvig.com/spell-correct.html](https://norvig.com/spell-correct.html)

------
ThrottleHead
Had to use this to improve the speed of our admin user search after the table
got fairly large (>1mil). The biggest problem was that it's a general search
that searched display_name as well as email. We ended up cleverly omitting the
email search unless '@' was included in the search string.

------
bxio
This was the first algorithm I ever searched for on wikipedia, back when I
first learned to code. Good memories.

~~~
pvg
And then there's that strange yet oddly persistent certainty that back then it
was called the Levenshtain distance.

------
stanmancan
Ugh. At a past job I was tasked with matching records, based on street
addresses and business names, from various systems. Levenshtein Distance was
pretty instrumental in helping find potential matches once we had cleaned up
the address format and business name as much as possible.

~~~
tongo
I did the same thing. I was merging our billing db with that of a business we
had just bought. Due to the nature of our business, customers could exist in
both.

So I had a routine that normalized the address as best I could; it didn't help
that it was all just one varchar field. Then I implemented Levenshtein
distance, messed with the weighting a bit to fit our particular data, and away
it went. Saved a bunch of headaches. It wasn't perfect, but it was better than
hand-matching a couple thousand accounts.

------
foobaw
Another cool metric that's commonly used to measure the quality difference
between two texts is the BLEU score
([https://en.wikipedia.org/wiki/BLEU](https://en.wikipedia.org/wiki/BLEU))

~~~
gattilorenz
Yup, it's used a lot in machine translation to compare different systems, and
to a lesser extent in natural language generation (where it's completely
inappropriate...)

------
MAXPOOL
For numerical sequences, see dynamic time warping (DTW) algorithm.

[https://en.wikipedia.org/wiki/Dynamic_time_warping](https://en.wikipedia.org/wiki/Dynamic_time_warping)

------
rident
What other languages have it built in?

[https://www.php.net/manual/en/function.levenshtein.php](https://www.php.net/manual/en/function.levenshtein.php)

~~~
humanrebar
D does
[https://dlang.org/library/std/algorithm/comparison/levenshte...](https://dlang.org/library/std/algorithm/comparison/levenshtein_distance.html)

------
xchaotic
We have implemented this and are using it in MarkLogic:

[https://docs.marklogic.com/spell.levenshteinDistance](https://docs.marklogic.com/spell.levenshteinDistance)

------
RosanaAnaDana
I like using Levenshtein distance for building out Bayesian networks. It works
well in conjunction with other distance/similarity metrics that are better
suited for continuous data.

------
wpskidd
I stumbled across the concept of Damerau-Levenshtein about two years ago.
Applying it to a large database of very messy consumer data in Pandas returned
near-miraculous results!

------
bitL
OT: Why are so many mathematical objects named after some dude and not after
what they represent, i.e. in this case edit distance? I've never understood
this obsession.

~~~
Cogito
Historically, at least to my eyes, this seems to derive from how mathematics
has been cited by mathematicians.

Before a thing is called 'Levenshtein Distance' it will be used by someone
saying 'calculating the distance between these two strings the way Levenshtein
did in [0]'. Eventually, when enough people do this, or someone gets
particularly bold, the method will instead be given its name (occasionally the
original author will be so bold, but that seems rare). If I were looking to
procrastinate more than I already am, I would go find some examples of this -
examples that unfortunately don't spring immediately to mind, though I'm sure
I remember their shadows.

As to why they are not named for what they represent: it's only after the
thing has been invented, tested, and named that we realise it is the correct
or best way to do a thing. Often there are multiple ways of doing things (this
is Levenshtein's edit distance), and so they are named to distinguish them
from each other. The survivor is not renamed just because the others are
forgotten.

[0] not a real reference

~~~
pandigital
I believe Turing Machines are an example. Turing originally called them
"automatic machines" or a-machines.

------
willis936
103 comments and not one mention of line coding!

This distance metric is exceptionally useful when picking the states to map to
in line codes with error correction/detection.

------
3xblah
[https://en.wikipedia.org/wiki/agrep](https://en.wikipedia.org/wiki/agrep)

------
iuguy
This is a major part of how [https://safepass.me/](https://safepass.me/)
works.

------
brightball
I used this to implement a spam detection system a few years back. It was
super helpful.

------
fuzzyexplainer
Anyone know of a library that is good for OCR post processing (correcting
errors)?

------
jorblumesea
What's the difference between Levenshtein distance and Hamming distance?

~~~
grzm
[https://en.wikipedia.org/wiki/Edit_distance#Formal_definitio...](https://en.wikipedia.org/wiki/Edit_distance#Formal_definition_and_properties)

> Given two strings a and b on an alphabet Σ (e.g. the set of ASCII
> characters, the set of bytes [0..255], etc.), the edit distance d(a, b) is
> the minimum-weight series of edit operations that transforms a into b. One
> of the simplest sets of edit operations is that defined by Levenshtein in
> 1966:
>
> Insertion of a single symbol. If a = uv, then inserting the symbol x
> produces uxv. This can also be denoted ε→x, using ε to denote the empty
> string.
>
> Deletion of a single symbol changes uxv to uv (x→ε).
>
> Substitution of a single symbol x for a symbol y ≠ x changes uxv to uyv
> (x→y).

...

> Other variants of edit distance are obtained by restricting the set of
> operations. Longest common subsequence (LCS) distance is edit distance with
> insertion and deletion as the only two edit operations, both at unit cost.
> Similarly, by only allowing substitutions (again at unit cost), Hamming
> distance is obtained; this must be restricted to equal-length strings.

------
TheGrassyKnoll
ripgrep [Discussion] Fuzzy Search

    https://github.com/BurntSushi/ripgrep/issues/1053

------
tzs
I've been playing around with the data made available [1] by the states that
are part of the Streamlined Sales Tax system [2].

The boundary file address records include, among other things:

    
    
      address range start
      address range end
      odd/even/both flag
      directional prefix
      street name
      street suffix
      directional postfix
      zip code
    

The address range specifies the range of numerical addresses that the record
applies to, and the odd/even/both flag indicates whether it is for just the
odd addresses, just the even addresses, or both.

The direction prefix and suffix are supposed to be one of {blank, N, S, E, W,
NE, NW, SE, SW}, but some states botch that and I've seen EA, EB, NO, SO, and
WE end up in there.

The suffix field is for things like LANE, AVE, ST, and the like. It is a
4-character field, so a street that residents think of as ELM AVENUE would be
name=ELM, suffix=AVE in the boundary file. Or it would be name="ELM AVENUE"
with no suffix. (There are even some with AVENUE in the name and something
else in the suffix, such as ROOSEVELT AVENUE ALY in Albany, GA. The ALY is the
suffix.) (There are 318 distinct suffixes across all the SST states, for the
curious [3].)

Anyway, my point is that the data is kind of loose.

I want to be able to take what a user puts on an address form:

    
    
      address
      city
      state
      zip code
    

and find a good match in the boundary files. Most people manage to put
something reasonable in the city, state, and zip code fields, and manage to
put their street address at the start of the address field, so I can pretty
reliably get the zip code and the street number. It's the rest of the address
line that is problematic.

At first I was trying to figure out what in there might be prefix, postfix,
and suffix to isolate just the street name. But because of both the
aforementioned looseness of the boundary data, and differences between how the
user might think of their address and how their state thinks of it, this
didn't work too well.

I was considering keeping an index that maps words that occur in street names
to those street names, and then looking for the longest word in the user
supplied address and looking up all streets in the user's zip code that
contain that word and that include the user's street number. If that returned
more than one match, the idea was then to look at the prefix, postfix, and
suffix and see how well those matched with things in the user address. Looking
at a few addresses that had given me trouble, though, it looked like this plan
to resolve them would still be tricky and unreliable.

So then I decided to try a Levenshtein distance approach. Given zip, street
number, street, where street is whatever the user included in the "address"
field after the street number, I pull up all records in my boundary DB that
match on zip code and street number. I don't even include street name in the
query.

Then for each result, I take the prefix, name, suffix, and postfix and join
them separated by space. I then simply take the one with the smallest
Levenshtein distance from what the user supplied. With this, I no longer care
if something is name=ELM, suffix=AVE or name="ELM AVE", suffix="".

This has worked remarkably well. Even if the user throws in stuff that doesn't
belong (some people manage to put the city, state, and/or zip in the address
box), it usually doesn't throw things off. It raises the Levenshtein distance
of the correct match, sometimes by a lot... but it also raises the incorrect
matches' distances too, so the correct one still wins.
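
The core of the matching step is tiny (field names illustrative, and
`levenshtein` being the standard DP sketched elsewhere in the thread):

    def best_match(user_street, rows):
        """rows: boundary records already filtered by zip + street number,
        as (prefix, name, suffix, postfix) tuples; pick the closest one."""
        def joined(row):
            # Join the non-empty street fields with spaces, so name=ELM,
            # suffix=AVE and name="ELM AVE", suffix="" come out identical.
            return " ".join(part for part in row if part)

        return min(rows, key=lambda row: levenshtein(user_street, joined(row)))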

The boundary files also have secondary address fields, for things like
apartments, unit, suites, and the like. These are an abbreviation (APT, e.g.)
and low and high ends of a range, and an odd/even/both flag. Unfortunately the
ranges for these aren't always numeric. E.g., there is 695 E MORRILL AVE in
Columbus, OH, which has secondary fields of "AP", "A", "M", and the
odd/even/both flag set to both. I assume that means Apartments A-M.

So far, I've ignored the secondary address fields. Every time I've seen more
than one result for the lowest Levenshtein distance, the matches have been
close enough to each other to be in the same taxing districts, so would have
the same rate and the tax would be allocated to the same entities, so further
resolution is not necessary.

At some point I'll take my scripts that clean up the raw rate and boundary
files and transform them into a form more suitable for DB use, and that make
sqlite and mysql databases from those, and that do the address matching and
also sales tax lookups by address, and put 'em on Github, but that will
probably not be very soon.

[1] [https://www.streamlinedsalestax.org/Shared-Pages/rate-and-
bo...](https://www.streamlinedsalestax.org/Shared-Pages/rate-and-boundary-
files/rate-and-boundary-file-updates)

[2]
[https://www.streamlinedsalestax.org/](https://www.streamlinedsalestax.org/)

[3] [https://pastebin.com/tQLGTGS9](https://pastebin.com/tQLGTGS9)

