

Getting the closest string match - peter_l_downs
http://stackoverflow.com/questions/5859561/getting-the-closest-string-match#answer-5859823

======
dkersten
If you're using Python, difflib[1] is an excellent tool for this:

    
    
        >>> from difflib import get_close_matches
        >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
        ['apple', 'ape']
        >>> import keyword
        >>> get_close_matches('wheel', keyword.kwlist)
        ['while']
        >>> get_close_matches('apple', keyword.kwlist)
        []
        >>> get_close_matches('accept', keyword.kwlist)
        ['except']
    

Won't help you learn the algorithms, but if you need to get stuff done
quickly, it's an easy-to-use solution. I use it on the 404 page of a website to
suggest which pages the user may have wanted (logging had shown that a lot of
404 and error pages were due to mistyped URLs).

[1] <http://docs.python.org/library/difflib.html>
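For the 404 use case, a minimal sketch (the page list and paths here are made
up for illustration):

```python
from difflib import get_close_matches

# Hypothetical list of a site's valid paths -- substitute your own routes
PAGES = ["/about", "/contact", "/blog", "/projects", "/archive"]

def suggest(bad_path, n=3):
    """Return up to n existing paths that look close to the mistyped one."""
    return get_close_matches(bad_path, PAGES, n=n, cutoff=0.6)

print(suggest("/projcts"))  # -> ['/projects']
```

The `cutoff` ratio (0.0 to 1.0) controls how dissimilar a candidate may be
before it's dropped; 0.6 is difflib's default.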

~~~
TillE
I tried it with the OP's sample input, and it favored Choice B - which seems
entirely reasonable, since the last five words of the string are an exact
match.

~~~
dkersten
I agree, I think choice B is in fact the closest since the only word from the
original string which is completely different is BROWN. FOX and COW have the O
in common and the rest of the words match completely, while Choice C has two
words which do not match at all.

At least as far as Levenshtein distance goes, choice B is indeed the correct
one, even if it's not what's closest logically.
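For anyone wanting to check distances like these by hand: Python's standard
library has no Levenshtein function, but the textbook dynamic-programming
version is short. A minimal sketch:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))           # distances from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]

print(levenshtein("accept", "except"))  # the difflib example above: 2 edits
```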

------
MtotheThird
I've had to do this a few times when building simple input normalization
features. The trouble with scoring based on Levenshtein distance in that case
is that it improperly penalizes phrases that are significantly different in
length but similar in content.

For example, let's say I was searching a database of countries for "North
Korea". In my list I have:

South Africa (LD: 6)
Congo (LD: 9)
Republic of Korea (LD: 11)
...
Democratic People's Republic of Korea (LD: 28)

The actual answer (DPRK) will be pushed far to the bottom based on a naive
ranking that uses LD.

The hack that's worked best for me? Rank based on the number of common two-
character substrings between the source and target. It's simple, easy to build
an index for, and has surprisingly great results. Its ideal use case is if you
don't need to return a single absolute best result and can, say, present the
three best to a human being and let them pick the match.

For the above search I'd get the following results using the two-character
method:

Republic of Korea: 4
DPRK: 4
South Africa: 1
Congo: 0
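The comment gives no code, but a minimal sketch reproduces those counts,
assuming bigrams are taken over the lowercased strings with spaces and
apostrophes stripped:

```python
def bigrams(s):
    # Lowercase and strip spaces/apostrophes before slicing into pairs
    s = s.lower().replace(" ", "").replace("'", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def common_bigrams(a, b):
    """Number of distinct two-character substrings shared by a and b."""
    return len(bigrams(a) & bigrams(b))

query = "North Korea"
countries = ["South Africa", "Congo", "Republic of Korea",
             "Democratic People's Republic of Korea"]
for name in sorted(countries, key=lambda c: -common_bigrams(query, c)):
    print(name, common_bigrams(query, name))
```

Because the score only counts shared pairs, a long target that contains the
query isn't penalized for its extra length, which is exactly the property the
Levenshtein ranking lacked.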

~~~
goldmab
What kind of thought process and/or experimentation led you to that algorithm?

~~~
heyitsnick
The original code links to:

[http://stackoverflow.com/questions/653157/a-better-similarit...](http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings)

where there is more discussion, and links out to different string similarity
metrics.

------
eblume
Fantastically thorough research, the kind of stuff I love to read. However I
can't help but feel the much less impressive other answer "Use Levenshtein
distance, here is the link." (paraphrased) is the better answer.

The long-winded answer goes into a huge amount of detail on really
interesting pattern-matching problems... but not really the specific problem
mentioned by the question asker.

I'm glad for a world with Stack Overflow and other similar resources, because
I find the long answer fascinating but if I were asking the question I'd much
prefer also having access to the shorter, succinct, and also-correct answer
beneath it.

~~~
paulsutter
He did far more than use Levenshtein distance. He is matching phrases not
individual words.

He created different metrics based on Levenshtein distance, combined them
using a weighted formula, and used an optimization algorithm to choose
weights. And he provides great visualizations on how it worked.

I've tried several approaches to fuzzy string matching, and I'm impressed with
his results and approach. I've bookmarked it for future reference.

------
bitops
_"I spent a day researching"_ \- when software engineers are given reasonable
amounts of time, great results can occur.

~~~
Rudism
At my last job, I was always boggled by the architecture team, who would lock
themselves in a meeting for 8 hours and completely design some complex system
with just their notepads and a whiteboard. I used to think they must be über
geniuses who already know everything needed to come up with the optimal design
and algorithms without needing to do any research. In hindsight, with a few
more years under my belt, I now realize that what I mistook for knowledge and
skill was mostly just hubris on the part of most of the team members (which
also manifested itself in strong objections to any outside suggestion that
there may be a better way to do things).

Challenging yourself to come up with better ways to do things is a great way
to keep you learning, keep your work interesting, and improve the quality of
your projects.

~~~
ComputerGuru
At my old job, the other architects and I would spend days researching on our
own, then lock ourselves up for days in a glassed meeting room with nothing
but a notepad, a whiteboard, and lots of coffee as we debated all the various
methods we'd discovered, pointed out the flaws in each other's approaches, and
picked one person's research of choice to see how far we could push it and
when/where/how it would fail, in an attempt to reach the best solution...
then we'd do more research, get some developers working on prototypes, and
repeat the whole thing all over again in two weeks' time.

All I'm saying is, it's not necessarily hubris. Too often you find half-baked
or half-researched ideas that work in theory but not in the real world, or the
opposite (things that work great in practice but could only ever be used in
that particular application). You need to look into what transpired before
and after those 8-hour meetings before coming to that conclusion (not saying
you haven't).

~~~
gruseom
Why wouldn't the architects work on prototypes?

In my experience, class distinction between "architects" and "developers" is a
red flag. Actually, the term "architect" is itself a red flag. Come to think
of it, even the term "developer" is kind of a red flag.

A lot of red flags :)

~~~
ComputerGuru
Oh they did. I was a developer/architect, did more developing than any of the
developers and more architecting than any of the architects.

However, not all the "architects" (mind you, none of us actually used that
term/title) were developers. We had the head of the QA in charge of testing
and real world deployments on the team, as well as the lead interface
designer. Their feedback helped to find a solution that would be both feasible
and user friendly.

That's why I said the developers would work on the prototypes.

~~~
gruseom
_mind you, none of us actually used that term/title_

What terms/titles did you actually use?

By the way, I didn't intend my "red flag" comment to imply that your
particular setup didn't work well. There are lots of good variations. The
sociology of software projects is fascinating.

~~~
ComputerGuru
Team Leaders. Even if a team was really just several smaller teams :)

C/C++ Team Lead, Web (C#/ASP) Team Lead, QA Team Lead, Design Team Lead, and
The Boss (startup CEO).

------
rikf
This article talks about using Levenshtein distance to compare the similarity
of two words/sentences another great algorithm for fuzzy string matching is
<http://en.wikipedia.org/wiki/Metaphone> it basically compacts a word into its
phonetic components and allows you to implement a google type did you mean.
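Metaphone's rule set is fairly long, but the older Soundex scheme, which is
not Metaphone itself but uses the same phonetic-key idea (words that sound
alike map to the same short code), fits in a few lines. A rough sketch:

```python
def soundex(word):
    """Soundex phonetic key: first letter plus three digits."""
    codes = {}
    for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                           ("4", "l"), ("5", "mn"), ("6", "r")]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    key, prev = word[0].upper(), codes.get(word[0], "")
    digits = []
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:            # skip repeated adjacent codes
            digits.append(code)
        if ch not in "hw":                   # 'h'/'w' don't separate duplicates
            prev = code
    return key + "".join(digits)[:3].ljust(3, "0")

print(soundex("Robert"), soundex("Rupert"))  # both R163: they "sound alike"
```

Comparing keys instead of raw strings is what makes "did you mean" tolerant
of phonetic misspellings rather than just typos.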

~~~
spicyj
Your answer would be much more comprehensible if you used punctuation.

------
tnash
Levenshtein distance is actually built into PHP:
<http://php.net/manual/en/function.levenshtein.php>

------
sheraz
PostgreSQL has both Levenshtein and Metaphone functions as part of the
fuzzystrmatch extension. It's nice to have this when trying to match addresses.

------
emmelaich
Someone should mention Norvig's spelling corrector.

<http://norvig.com/spell-correct.html>

Might as well be me! :-)

------
josephcooney
I hope whoever posted that answer doesn't get shitcanned for posting sensitive
information/their company's intellectual property/whatever to a public forum.

------
wqfeng
I think this algorithm was covered by the final exam of CS101 of Udacity. Any
udacitians come to confirm this?

------
warmwaffles
Answers like this are what make Stack Overflow such an awesome resource.

------
rburhum
The Levenshtein distance is at the heart of most open-source geocoders.

------
shimsham
Sounds like a task for Shingling man!

~~~
peter_l_downs
What does this mean? I googled but still have no idea what you're talking
about. Care to enlighten me?

~~~
shimsham
[http://www.google.co.uk/search?q=shingling&ie=UTF-8&...](http://www.google.co.uk/search?q=shingling&ie=UTF-8&oe=UTF-8&hl=en&client=safari)

includes useful/relevant results like:

[http://nlp.stanford.edu/IR-book/html/htmledition/near-duplic...](http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html)

and from Wikipedia for a useful introduction:

<http://en.wikipedia.org/w/index.php?title=W-shingling>

