FuzzyWuzzy: Fuzzy String Matching in Python (seatgeek.com)
150 points by chrisvoll on July 8, 2011 | 24 comments



Maybe I'm being a hypersensitive Brit, but "Fuzzy Wuzzy" is a pretty offensive term in the UK.

See top entry: http://www.urbandictionary.com/define.php?term=fuzzy+wuzzy

Not to take anything away from the tech - that looks awesome and I can already think of a few uses for it.


I think the children's puzzler "Fuzzy Wuzzy was a bear, but Fuzzy Wuzzy had no hair. So Fuzzy Wuzzy wasn't fuzzy, was he?" is better known, despite the selection-biased votes on Urban Dictionary.


Not in the UK. I don't think the tongue-twister is well known here, but the phrase is used in the other sense in the TV show Dad's Army[1], which is still shown on the BBC. However, it's only used by one character[2], who's supposed to be 70 at the time World War II starts. So, I think it's a very outdated expression, and not likely to cause offence when used in the context of a programming library. I've certainly never heard the expression used other than in the TV show (where, as far as I know, it's still not censored - it's understood that the character's views were outdated even at the time the show was set).

The term is from a Kipling poem[3]:

"A derogatory term for a black person, especially one with fuzzy hair... From... one of Rudyard Kipling's... poems, written in 1918. The poem is in the voice of an unsophisticated British soldier and expresses admiration rather than contempt, although expressed in terms that sound patronizing today."

[1] http://en.wikipedia.org/wiki/Dads_Army

[2] http://en.wikipedia.org/wiki/Lance-Corporal_Jack_Jones

[3] http://www.phrases.org.uk/meanings/146100.html


Guess there aren't enough Brits here for that to matter. It'd be a tad different if it started with an N.


As a scouser, I'm always amused by Mizage Software's window management program Divvy.

Seriously people, Google your package names or you might end up looking like a divvy.

http://www.urbandictionary.com/define.php?term=divvy


I did something similar for product matching between a Yahoo! store and an Amazon merchant account.

I had a set of products from Yahoo! that needed their equivalent product in a set of products from Amazon. I indexed all the Amazon products into Xapian and let the search functionality do its magic by using the Yahoo product title as the search keyword. It also had a scoring mechanism and worked flawlessly for my needs.


While reading this article I started laughing in amazement (if that is even possible). It is delightful to discover something you knew you wanted, delivered to you free, courtesy of others.


Well if you like, you can thank us by buying a very expensive ticket on SeatGeek =P


I heartily recommend "Introduction to Information Retrieval": http://www.amazon.com/Introduction-Information-Retrieval-Chr...

Skim it once to collect vocabulary, then use it as a reference for IR algorithms.


I can remember the pain of doing this as a first-year intern at a sporting odds aggregation site. The biggest challenge was dealing with the invalid XML and the non-standardized naming scheme: Montreal Canadians, The Habs, etc.

Our eventual solution was to use a trained matcher, but obviously it was not ideal since human intervention was required :(


Yeah, completely non-standard names (like nicknames, abbreviations, acronyms) are a real pain to deal with, and string matching just completely fails on them. We (SeatGeek) handle it the low-tech way -- a giant list of name aliases that we run through during pre-processing. Not exactly worthy of a blog post, but it does the job well enough.
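For anyone wondering what an alias pass like that might look like, here is a minimal sketch. The `ALIASES` table and the `normalize` helper are hypothetical names made up for illustration, not SeatGeek's actual code; the idea is just to map known nicknames to one canonical name before any fuzzy matching runs.

```python
import re

# Hypothetical alias table; a real one would be much larger and hand-maintained.
ALIASES = {
    "the habs": "montreal canadiens",
    "habs": "montreal canadiens",
    "montreal canadians": "montreal canadiens",
    "ny yankees": "new york yankees",
}

def normalize(name: str) -> str:
    """Lowercase, collapse whitespace, then map known aliases to a canonical name."""
    key = re.sub(r"\s+", " ", name.strip().lower())
    # Names without an alias entry pass through unchanged (apart from cleanup).
    return ALIASES.get(key, key)

print(normalize("The  Habs"))
print(normalize("Boston Bruins"))
```

Anything the table does not know about falls through to the fuzzy matcher untouched, so the list only has to cover the cases where string similarity is hopeless.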


I did something similar, in Oracle and PostgreSQL, for a governmental entity. Its main purpose was to perform data fusion, where a set of not-so-dissimilar records represented the same person in several heterogeneous data sources. It was fun because of the concepts involved, but not so much because of the syntactic sugar of the SQL involved.

It's great to now have this in Python.


I found Google Refine saves some programming:

http://google-opensource.blogspot.com/2010/11/announcing-goo...


This looks pretty awesome. I remember when I was thinking about making a quote website back in college. I had just learned about the Levenshtein distance algorithm in a class and was excited about finding a real-life (read: non-contrived) scenario to apply it to.

Anyway, this looks like a really useful library. Glad it's freely available.


Whoa, this looks really useful. Is there a gem for Ruby that does this? I've just been doing the first 'String Similarity' step using Levenshtein distance.
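For reference, the Levenshtein step mentioned here is small enough to write by hand. This is a plain dynamic-programming sketch in Python (the function name `levenshtein` is just a label chosen for this example, not part of FuzzyWuzzy), and it ports to Ruby almost line for line.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

It keeps only two rows of the table, so memory stays proportional to the shorter string.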


> fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100

From what I can see, this will also give 100 for 'NEW', 'KEES', 'YANK' - all of which could mean something completely different. How do they deal with this?
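To make the concern concrete, here is a rough sketch of a partial-match score built on the standard library's `difflib`. It is an assumption about the general idea (slide the shorter string across the longer one and keep the best window), not the library's actual implementation, and `partial_ratio_sketch` is a made-up name.

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(a: str, b: str) -> int:
    """Best similarity (0-100) of the shorter string against any same-length window of the longer."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    n = len(shorter)
    best = 0.0
    for start in range(len(longer) - n + 1):
        window = longer[start:start + n]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))

# Every exact substring scores 100, whatever it happens to mean.
for query in ("YANKEES", "NEW", "KEES", "YANK"):
    print(query, partial_ratio_sketch(query, "NEW YORK YANKEES"))
```

Under this scheme any exact substring gets a perfect score, so the score alone cannot tell "YANKEES" from "NEW"; something else has to break the tie.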


Context. We also know dates and times - more or less at least, there may be some conversion to UTC if necessary - as well as other information about the event - categories, locations etc.

On occasion there are false positives, in which case Our algorithm is the Borg. They will be assimilated. Their grammatical and syntactical distinctiveness will be added to our own. Resistance is futile.


It's not often people sell tickets to a yanking of new knees? At least that would be my guess, that they also look for keywords.


As a Python programmer, which would you guys recommend: Google Refine, FuzzyWuzzy, or Febrl (http://sourceforge.net/projects/febrl/)?

Purpose: Find duplicates in messy data sets with names and physical addresses.


Awesome! I was just about to start searching for something like this. Thank you, HN.


Looks pretty useful! I wonder if a simple application of TF/IDF could improve the results by giving you better token weights. (Then you'd have to be comparing token sets, of course, rather than strings.)
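A minimal sketch of that idea, assuming a small corpus of known names to compute weights from. The corpus, `idf_weights`, and `weighted_overlap` are all hypothetical, and weighted Jaccard is just one of several ways to compare weighted token sets.

```python
import math
from collections import Counter

def idf_weights(corpus):
    """Inverse document frequency per token: rare tokens get larger weights."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for name in corpus for tok in set(name.lower().split()))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

def weighted_overlap(a, b, idf):
    """IDF-weighted Jaccard similarity over token sets, from 0.0 to 1.0."""
    # Assumption: tokens never seen in the corpus are treated as maximally rare.
    default = max(idf.values(), default=1.0)

    def weight(tokens):
        return sum(idf.get(t, default) for t in tokens)

    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = weight(ta | tb)
    return weight(ta & tb) / union if union else 0.0

# Hypothetical corpus of known event names.
corpus = ["New York Yankees", "New York Mets", "New York Giants", "Boston Red Sox"]
idf = idf_weights(corpus)
print(weighted_overlap("Yankees", "New York Yankees", idf))
print(weighted_overlap("New York", "New York Yankees", idf))
```

With this toy corpus "Yankees" is rarer than "New" or "York", so a match on it counts for more than a match on both common tokens together.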


Thanks, this will be useful for many screen scraping tasks!


Thanks for the explanations. Very helpful.


[deleted]


Well, it expands difflib. It looks a bit like what google-refine does.
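As a rough illustration of what a thin layer over `difflib` can look like, the sketch below compares a plain `SequenceMatcher` score with a token-sorted one. The helper names are invented for this example and the code is not taken from the library.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> int:
    """Plain difflib similarity scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Sort the lowercased tokens of each string first, so word order stops mattering."""
    def key(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return ratio(key(a), key(b))

a = "New York Mets vs Atlanta Braves"
b = "Atlanta Braves vs New York Mets"
print(ratio(a, b))                    # penalized by the swapped word order
print(token_sort_ratio_sketch(a, b))  # 100
```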



