Hacker Newsnew | comments | show | ask | jobs | submit login
Getting the closest string match (stackoverflow.com)
171 points by peter_l_downs 1219 days ago | 45 comments



If you're using Python, difflib[1] is excellent tool for this:

    >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
    ['apple', 'ape']
    >>> import keyword
    >>> get_close_matches('wheel', keyword.kwlist)
    ['while']
    >>> get_close_matches('apple', keyword.kwlist)
    []
    >>> get_close_matches('accept', keyword.kwlist)
    ['except']
Won't help you learn the algorithms, but if you need to get stuff done quickly, its an easy to use solution. I use it on the 404 page of a website to suggest what pages the user may have wanted (logging had shown that a lot of 404 and error pages were due to mistyped urls).

[1] http://docs.python.org/library/difflib.html

-----


Thanks -- I've used difflib several times but hadn't seen the difflib.get_close_matches() helper function.

-----


I tried it with the OP's sample input, and it favored Choice B - which seems entirely reasonable, since the last five words of the string are an exact match.

-----


I agree, I think choice B is in fact the closest since the only word from the original string which is completely different is BROWN. FOX and COW have the O in common and the rest of the words match completely, while Choice C has two words which do not match at all.

At least as far as levenshtein distance goes, choice B is indeed the correct one, even if its not whats closest logically.

-----


I've had to do this a few times when building simple input normalization features. The trouble with scoring based on Levenshtein distance in that case is that it improperly penalizes phrases that are significantly different in length but similar in content.

For example, let's say I was searching a database of countries for "North Korea". In my list I have:

South Africa (LD: 6) Congo (LD: 9) Republic of Korea (LD: 11) ... Democratic People's Republic of Korea (LD: 28)

The actual answer (DPRK) will be pushed far to the bottom based on a naive ranking that uses LD.

The hack that's worked best for me? Rank based on the number of common two-character substrings between the source and target. It's simple, easy to build an index for, and has surprisingly great results. Its ideal use case is if you don't need to return a single absolute best result and can, say, present the three best to a human being and let them pick the match.

For the above search I'd get the following results using the two-character method:

Republic of Korea: 4 DPRK: 4 South Africa: 1 Congo: 0

-----


The problem is that what you call the actual answer - DPRK - is subjective. By picking DPRK you need a more intelligent algorithm than fuzzy string matching because you are looking for the right answer in terms of semantic knowledge of countries, not blind character comparisons. Fundamentally, the fact that DPRK is the right answer is an accident of History and English, there really is no right answer and no algorithm can capture this insight without being told. By biasing in one direction you are sacrificing performance in some other unknown. The trick is in balancing the bias with general ability. As always.

Knowing which algorithm to use for a problem is most important; sometimes padded hamming will do, other times minhash or bitap or dice's coefficient etc are good. Each algorithm is a kind of balance between a metric and priorities - deciding what is subjectively most important in this problem. Your method for example is not as robust as Damerau Levenshtein when it comes to analyzing DNA sequences. As far as the best default string comparison, I have found dice coefficient to be ideal (your method appears to be based on similar ideas but my hunch is it's less robust).

For your example Dice's Coefficient does as well as can be expected without a concept of countries:

  ("dice", "South Africa", 0.125); 
  ("dice", "Congo", 0.0);
  ("dice", "Republic of Korea", 0.44);
  ("dice", "Democratic People's Republic of Korea", 0.24)]
and by blind luck ("ort Korea"), longest common sub-sequence does even better:

  [("lcs", "South Africa", 6.0); 
   ("lcs", "Congo", 2.0);
   ("lcs", "Republic of Korea", 7.0);
   ("lcs", "Democratic People's Republic of Korea", 9.0)] 
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_dis...

http://en.wikipedia.org/wiki/Dice_coefficient

-----


Oh, I'm under no illusions that it's a universally applicable string matcher. What it seems to be useful for is normalizing human input in situations where someone can pick the best match. It handles typical normalization transforms well (word fragments, transpositions) and is very easy to index.

In my example, DPRK is the objective best answer .. There's pretty clearly a correct match in the value domain, the challenge is to help the user find it as quickly as possible.

-----


Use the Jaccard over a sweeping window or something.

over 5 letter windows:

  TEST:   A 0.4
  TEST:   B 0.540540540541
  TEST:   C 0.692307692308
over words:

  TEST:   A 0.555555555556
  TEST:   B 0.714285714286
  TEST:   C 1.0
Here's the Korea one over a 5 letter window:

  north korea:congo 0.0
  north korea:democratic people's republic of korea 0.0526315789474
  north korea:republic of korea 0.111111111111
  north korea:south africa 0.0

-----


I had to do fuzzy string matching for car brand names, Two character pair matching tended to work best.

Here is a gist of the code in ruby if anybody wants to steal it. I stole it from somewhere else and thus you have the history of programming.

https://gist.github.com/1038659

-----


What kind of thought process and/or experimentation led you to that algorithm?

-----


The original code links to:

http://stackoverflow.com/questions/653157/a-better-similarit...

where there is more discussion, and links out to different string similarity metrics

-----


I struggled to make LD work well, then someone smarter than me suggested the two character method. :)

-----


That's a cool idea.

I wonder how the weights would turn out if your algo were added as a term in his formula and run through the optimization, or even applied to his sample data standalone.

-----


If you're dealing with people's names, there's also the soundex algorithm.

https://en.wikipedia.org/wiki/Soundex

-----


Very interesting. You should put this on the SO question.

-----


Fantastically thorough research, the kind of stuff I love to read. However I can't help but feel the much less impressive other answer "Use Levenshtein distance, here is the link." (paraphrased) is the better answer.

The long-winded answer goes in to a huge amount of details on really interesting pattern matching problems... but not really the specific problem mentioned by the question asker.

I'm glad for a world with Stack Overflow and other similar resources, because I find the long answer fascinating but if I were asking the question I'd much prefer also having access to the shorter, succinct, and also-correct answer beneath it.

-----


He did far more than use Levenshtein distance. He is matching phrases not individual words.

He created different metrics based on Levenshtein distance, combined them using a weighted formula, and used an optimization algorithm to choose weights. And he provides great visualizations on how it worked.

I've tried several approaches to fuzzy string matching, and I'm impressed with his results and approach. I've bookmarked it for future reference.

-----


The main problem is though that he doesn't go into much detail on how specific his set of parameters is to the phrases he's matching against. I suspect that his optimization won't help much when doing a free match against any possible string, in which case the added detail doesn't do much for the person asking (at least not from what we can gather from the question).

-----


This all presupposes that this is what is meant by "closest".

The deservedly highly ranked response at least addresses some interesting ways of defining "close" while skipping the most important, and difficult, case -- semantic content.

Frederick jumps cows

Frederick never jumps cows

Fred leaps bovines

-----


BTW, Levenshtein distance is implemented in Clojure under the Incanter project (incanter.stats).

-----


"I spent a day researching" - when software engineers are given reasonable amounts of time, great results can occur.

-----


At my last job, I was always boggled by the architecture team, who would lock themselves in a meeting for 8 hours and completely design some complex system with just their notepads and a whiteboard. I used to think they must be ├╝ber geniuses who already know everything needed to come up with the optimal design and algorithms without needing to do any research. In hindsight, with a few more years under my belt, I now realize that what I mistook for knowledge and skill was mostly just hubris on the part of most of the team members (which also manifested itself in strong objections to any outside suggestion that there may be a better way to do things).

Challenging yourself to come up with better ways to do things is a great way to keep you learning, keep your work interesting, and improve the quality of your projects.

-----


At my old job, the other architects and I would spend days researching on our own, then lock ourselves up for days in a glassed meeting room with nothing but a notepad and a whiteboard and lots of coffee as we debated all the various methods we'd discovered, pointed out the flaws in each others' approaches, picking one person's research of choice and seeing how far we could push it and when/where/how it would fail in an attempt to reach the best solution.... then do more research, get some developers working on prototypes, and repeat the whole thing all over again in two weeks' time.

All I'm saying is, it's not necessarily hubris. Too often you find half-baked or half-researched ideas that work in theory but not in the real world or the opposite (being things that work great but in practice could only ever be used in that particular application). You need to look into what transpired before and after those 8 hours meetings before coming to that conclusion (not saying you haven't).

-----


Why wouldn't the architects work on prototypes?

In my experience, class distinction between "architects" and "developers" is a red flag. Actually, the term "architect" is itself a red flag. Come to think of it, even the term "developer" is kind of a red flag.

A lot of red flags :)

-----


Oh they did. I was a developer/architect, did more developing than any of the developers and more architecting than any of the architects.

However, not all the "architects" (mind you, none of us actually used that term/title) were developers. We had the head of the QA in charge of testing and real world deployments on the team, as well as the lead interface designer. Their feedback helped to find a solution that would be both feasible and user friendly.

That's why I said the developers would work on the prototypes.

-----


mind you, none of us actually used that term/title

What terms/titles did you actually use?

By the way, I didn't intend my "red flag" comment to imply that your particular setup didn't work well. There are lots of good variations. The sociology of software projects is fascinating.

-----


Team Leaders. Even if a team was more smaller teams :)

C/C++ Team Lead, Web (C#/ASP) Team Lead, QA Team Lead, Design Team Lead, and The Boss (startup CEO).

-----


This article talks about using levenstein distance to compare the similarity of two words/sentences another great algorithm for fuzzy string matching is http://en.wikipedia.org/wiki/Metaphone it basically compacts a word into its phonetic components and allows you to implement a google type did you mean.

-----


Your answer would be much more comprehensible if you used punctuation.

-----


Levenshtein distance is actually built into PHP : http://php.net/manual/en/function.levenshtein.php

-----


Postgresql has both Levenshtein and metaphone functions as part of the fuzzymatch extension. It's nice to have this when trying to match addresses.

-----


Someone should mention Norvig's spelling corrector.

http://norvig.com/spell-correct.html

Might as well be me! :-)

-----


I hope whoever posted that answer doesn't get shitcanned for posting sensitive information/their company's intellectual property/whatever to a public forum.

-----


I think this algorithm was covered by the final exam of CS101 of Udacity. Any udacitians come to confirm this?

-----


Answers like this are what make stackoverflow such an awesome resource

-----


the Levenhstein distance is at the heart of most opensource geocoders.

-----


Sounds like a task for Shingling man!

-----


What does this mean? I googled but still have no idea what you're talking about. Care to enlighten me?

-----


http://www.google.co.uk/search?q=shingling&ie=UTF-8&...

includes useful/relevant results like:

http://nlp.stanford.edu/IR-book/html/htmledition/near-duplic...

and from Wikipedia for a useful introduction:

http://en.wikipedia.org/w/index.php?title=W-shingling

-----


[deleted]

The use of VBA arose purely out of necessity - I was asked to enhance a spreadsheet for underwriters of my reinsurance company. Those guys love their spreadsheets - I can't even describe.

-----


Those guys love their spreadsheets - I can't even describe.

No, describe :)

Seriously, this kind of folklore is valuable (I work on spreadsheet systems and feed on this stuff). What do they do with spreadsheets? How do you know that they love them? What sort of enhancement has to be done in VBA? Does each of them make their own independent spreadsheets? if not, what kind of sharing do they do? What are the most significant limits/problems you've observed in their use of spreadsheets?

-----


I can provide some folklore. I am a professional programmer with a Unix pedigree working as a quant in banking and even I succumbed to the charms of Excel (I still managed to evade VBA though).

I'll start with sharing. The idea of sharing is mostly .xls or .xslx file on mapped network drive and shouting over the floor "Jane, please close quarter results"

Sometimes they are shared on SharePoint which is only marginally better.

So, yeas, sharing is a problem.

But still most of the Excel use (at least with quants) is one off and the end result is a couple of charts that you put into presentation or some insights about numbers and data.

Excel is very good for interactive calculations and what if analysis when you have a complicated chain of formulas and you need results immediately updated when you change inputs

Excel is very good for exploratory data analysis: again charts are immediately updated when you change data and you do not have to remember a lot of API to make then in the first place unlike, say in Matlab or Matplotlib or ggplot.

Old Excel limitation on rows and columns was a problem but not anymore. Lack of column formula (when you would type say =B+C in D's header instead of =B1+D1 and then control-dragging for thousands of rows) (btw Guys from Resolver systems implemented that). Ugly charts.

Also, not a problem with Excel per se, but rather with people who try to use Excel for things it was not designed for: apply it to large datasets and build complex systems on top of it.

-----


I've spent the past week building a system that parses spreadsheets into xml and feeds them into a relational db. Having relational integrity gives insight into relationships in data in ways not possible with a vlookup. Databases also provide concurrency so that many analysts can work in the same data. But because of VBA, you can bind an Excel spreadsheet to a database and get the best of both worlds.

-----


It proves the language choice is irrelevant. It's the solution that counts. Even VBA can be written clearly and works well. It's fun to watch as hipsters flaming echo other for their adopted cool languages though.

-----


"Even VBA can be written clearly and works well."

And VBA has a surprisingly great development environment. I think the negativity towards it arises from it being accessible to non-programmers. The VBA in the SO post is beautiful work, however.

-----




Applications are open for YC Winter 2016

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: