

FuzzyWuzzy: Fuzzy String Matching in Python - chrisvoll
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python

======
mopoke
Maybe I'm being a hypersensitive Brit, but "Fuzzy Wuzzy" is a pretty
offensive term in the UK.

See top entry: <http://www.urbandictionary.com/define.php?term=fuzzy+wuzzy>

Not to take anything away from the tech - that looks awesome and I can already
think of a few uses for it.

~~~
Terretta
I think the children's puzzler "Fuzzy Wuzzy was a bear, but Fuzzy Wuzzy had no
hair. So Fuzzy Wuzzy wasn't fuzzy was he?" is better known, despite the
selection-biased votes on Urban Dictionary.

~~~
baha_man
Not in the UK. I don't think the tongue-twister is well known here, but the
phrase is used in the other sense in the TV show Dad's Army[1], which is still
shown on the BBC. However, it's only used by one character[2], who's supposed
to be 70 at the time World War II starts. So, I think it's a _very_ outdated
expression, and not likely to cause offence when used in the context of a
programming library. I've certainly never heard the expression used other than
in the TV show (where, as far as I know, it's still not censored - it's
understood that the character's views were outdated even at the time the show
was set).

The term is from a Kipling poem[3]:

"A derogatory term for a black person, especially one with fuzzy hair...
From... one of Rudyard Kipling's... poems, written in 1918. The poem is in the
voice of an unsophisticated British soldier and expresses admiration rather
than contempt, although expressed in terms that sound patronizing today."

[1] <http://en.wikipedia.org/wiki/Dads_Army>

[2] <http://en.wikipedia.org/wiki/Lance-Corporal_Jack_Jones>

[3] <http://www.phrases.org.uk/meanings/146100.html>

------
aonic
I did something similar for matching products in a Yahoo! store against
products in an Amazon merchant account.

I had a set of products from Yahoo! that needed their equivalent product in a
set of products from Amazon. I indexed all the Amazon products into Xapian and
let the search functionality do its magic by using the Yahoo product title as
the search keyword. It also had a scoring mechanism and worked flawlessly for
my needs.

------
plainOldText
While reading this article I started laughing in amazement (if that is even
possible). It is delightful to discover something you knew you wanted,
delivered to you for free, courtesy of others.

~~~
acslater00
Well if you like, you can thank us by buying a very expensive ticket on
SeatGeek =P

------
ahi
I heartily recommend "Introduction to Information Retrieval":
<http://www.amazon.com/Introduction-Information-Retrieval-Christopher-Manning/dp/0521865719/ref=ntt_at_ep_dpt_2>

Skim it once to collect vocabulary, then use it as a reference for IR
algorithms.

------
ianl
I can remember the pain of doing this as a first-year intern at a sporting
odds aggregation site. The biggest challenge was dealing with the invalid XML
and the non-standardized naming scheme: Montreal Canadians, The Habs, etc.

Our eventual solution was to use a trained matcher, but obviously it was not
ideal since human intervention was required :(

~~~
acslater00
Yeah completely non-standard names (like nicknames, abbreviations, acronyms)
are a real pain to deal with, and string matching just completely fails on
them. We (seatgeek) handle it the low tech way -- a giant list of name aliases
that we run through during pre-processing. Not exactly worthy of a blog post,
but it does the job well enough.
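
For illustration only, an alias pass of that kind might look roughly like the
sketch below. The table entries and the `canonicalize` helper are hypothetical,
not SeatGeek's actual code; the idea is simply to map known nicknames to one
canonical name before any string matching runs.

```python
# Hypothetical alias table: known nicknames/abbreviations -> canonical name.
ALIASES = {
    "the habs": "montreal canadiens",
    "habs": "montreal canadiens",
    "ny yankees": "new york yankees",
}

def canonicalize(name):
    """Lowercase, collapse whitespace, then swap a known alias for its canonical name."""
    key = " ".join(name.lower().split())
    # Names that are not in the table pass through normalized but otherwise unchanged.
    return ALIASES.get(key, key)
```

Anything not in the table still reaches the fuzzy matcher, so the list only has
to cover the names that string similarity cannot recover.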

------
alexitosrv
I did something similar, in Oracle and PostgreSQL, for a governmental entity.
Its main purpose was to perform data fusion, where a set of not so dissimilar
records represented the same person in several heterogeneous data sources. It
was fun because of the concepts involved, but not so much because of the
syntactic sugar of the SQL involved.

It's great to now have this in Python.

~~~
Terretta
I found Google Refine saves some programming:

<http://google-opensource.blogspot.com/2010/11/announcing-google-refine-20-power-tool.html>

------
skawaii
This looks pretty awesome. I remember when I was thinking about making a quote
website back in college. I had just learned about the Levenshtein distance
algorithm in a class and was excited about finding a real-life (read:
non-contrived) scenario to apply it to.

Anyway, this looks like a really useful library. Glad it's freely available.

------
ecito
Whoa, this looks really useful. Is there a gem for Ruby that does this? I've
just been doing the first 'String Similarity' step using Levenshtein distance.
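
For reference, a minimal pure-Python sketch of Levenshtein distance (the
classic two-row dynamic programming version, with every edit costing 1). The
function name is just illustrative; it is not taken from FuzzyWuzzy.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep, if equal)
            ))
        prev = curr
    return prev[-1]
```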

------
chime
> fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100

From what I can see, this will also give 100 for 'NEW', 'KEES', 'YANK' - all
of which could mean something completely different. How do they deal with
this?
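
To make the concern concrete, here is a rough sliding-window sketch of a
partial ratio built on the standard library's `difflib`. It is an assumption
about the general idea (best score of the shorter string against any
equal-length slice of the longer one), not FuzzyWuzzy's actual implementation.

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    """Best similarity (0-100) between the shorter string and any
    same-length window of the longer string."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    n = len(shorter)
    best = 0.0
    # Slide a window of len(shorter) across the longer string.
    for start in range(len(longer) - n + 1):
        window = longer[start:start + n]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))
```

Under this sketch, any exact substring of the longer string scores 100, which
is exactly why 'NEW' and 'KEES' match as well as 'YANKEES' does.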

~~~
josegonzalez
Context. We also know dates and times - more or less at least, there may be
some conversion to UTC if necessary - as well as other information about the
event - categories, locations etc.

On occasion there are false positives, in which case Our algorithm is the
Borg. They will be assimilated. Their grammatical and syntactical
distinctiveness will be added to our own. Resistance is futile.

------
nsomaru
As a Python programmer, would you guys recommend Google Refine vs FuzzyWuzzy
vs Febrl (<http://sourceforge.net/projects/febrl/>)?

Purpose: Find duplicates in messy data sets with names and physical addresses.

------
saygt
Awesome! I was just about to start searching for something like this. Thank
you HN

------
kragen
Looks pretty useful! I wonder if a simple application of TF/IDF could improve
the results by giving you better token weights. (Then you'd have to be
comparing token sets, of course, rather than strings.)
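
A rough sketch of that idea, assuming whitespace tokenization and a small
corpus of known names to learn weights from. With token sets the TF part
collapses to presence/absence, so only the IDF weight remains; the function
names and the `default` weight for unseen tokens are hypothetical.

```python
import math
from collections import Counter

def idf_weights(corpus):
    """Inverse document frequency per token over a list of strings."""
    n = len(corpus)
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(text.lower().split()))
    # The +1 keeps tokens that appear in every document from getting weight 0.
    return {tok: math.log(n / df) + 1.0 for tok, df in doc_freq.items()}

def weighted_overlap(a, b, idf, default=1.0):
    """IDF-weighted Jaccard similarity of the two token sets, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())

    def weight(tokens):
        # Tokens never seen in the corpus fall back to the assumed default weight.
        return sum(idf.get(t, default) for t in tokens)

    union = weight(ta | tb)
    return weight(ta & tb) / union if union else 0.0
```

With a corpus such as "new york yankees", "new york mets", "new york giants",
rare tokens like "yankees" outweigh common ones like "new", so a match on the
distinctive token scores higher than a match on the generic one.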

------
johnrob
Thanks, this will be useful for many screen scraping tasks!

------
john2x
Thanks for the explanations. Very helpful.

