

Ask HN: List deduplicator w/fuzzy matching? - jellicle

Hi. I have a spreadsheet with semi-duplicate entries that I want to merge
(like merging Google Contacts, for example).

Input:

-- Robert Smith, 123 Main St., New York, NY, $foo

-- Bob Smith, 123 Main Street, NY, NY, $bar

Output:

-- Robert Smith, 123 Main Street, New York, NY, $foo, $bar

Googling, I see all sorts of various Windows-only list-management software
that you can buy, or companies that will become my list cleanup provider for a
couple thousand dollars, etc. This is for a one-shot merging. Is there any
free/open-source software I can use? Or a web service that I can pay $9.95 to
and upload my list, download a cleaned/merged version, something like that? I
don't even mind making the final decisions about what is and isn't a duplicate
entry - software doesn't have to be brilliant, just vaguely smart.
======
maxdemarzi
Did you look at Google Refine already?

<http://code.google.com/p/google-refine/>

Another option is to download a copy of SQL Server Developer Edition and use
the fuzzy matching utilities in SSIS. They are pretty easy to use.

~~~
jellicle
No, I hadn't heard of Google Refine. Looks like a great tool, might well solve
my problem. Thanks so much!

------
jrsmith1279
Do the lists contain the same fields (i.e., first name, last name, etc.)?

If you have Excel you can use conditional formatting to highlight the
duplicates (Conditional Formatting -> New Rule -> Format only unique or
duplicate values -> Select a formatting style). If you format it as a table
and then sort by one of the columns then you should see all of the duplicates
listed together and you can remove the duplicate rows manually.

Edit: The Robert/Bob thing does cause an issue with this method, but I think
it's still a viable option.
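
For anyone doing this outside Excel, here is a rough Python sketch of the same
idea that also copes with the Robert/Bob case: normalize each row to a key,
then group rows that share a key. The `NICKNAMES` and `ABBREVIATIONS` tables
and the column order (name first, street second) are assumptions made up for
illustration, not part of any library.

```python
import re
from collections import defaultdict

# Hypothetical lookup tables; extend them to suit your own data.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}

def normalize(field, table):
    """Lowercase, strip punctuation, and expand known short forms word by word."""
    words = re.sub(r"[^\w\s]", "", field.lower()).split()
    return " ".join(table.get(word, word) for word in words)

def duplicate_groups(rows):
    """Group rows whose normalized (name, street) keys are identical."""
    groups = defaultdict(list)
    for row in rows:
        key = (normalize(row[0], NICKNAMES), normalize(row[1], ABBREVIATIONS))
        groups[key].append(row)
    # Only keys shared by more than one row are duplicates.
    return [group for group in groups.values() if len(group) > 1]

rows = [
    ("Robert Smith", "123 Main St.", "New York", "NY"),
    ("Bob Smith", "123 Main Street", "NY", "NY"),
    ("Alice Jones", "9 Elm Rd", "Boston", "MA"),
]
for group in duplicate_groups(rows):
    print(group)
```

This only catches variants the tables know about, so it is a complement to
eyeballing the sorted list, not a replacement for it.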

~~~
jellicle
Well, I can make them have the same fields, and I'm bright enough to merge
exact matches myself; it's the inexact matches that are going to cause me some
difficulties.

------
aonic
FuzzyWuzzy: Fuzzy String Matching in Python

<http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python>

<http://news.ycombinator.com/item?id=2744190>
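
To give a feel for this kind of scoring without installing anything, here is a
minimal sketch using Python's standard-library `difflib` as a stand-in. It is
not FuzzyWuzzy's API; the function names `similarity` and
`token_sort_similarity` are made up for this example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 score for how alike two strings are, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def token_sort_similarity(a, b):
    """Compare after sorting the words, so word order does not matter."""
    def sort_tokens(text):
        return " ".join(sorted(text.lower().split()))
    return similarity(sort_tokens(a), sort_tokens(b))

print(similarity("123 Main St.", "123 Main Street"))
print(similarity("123 Main St.", "9 Elm Road"))
print(token_sort_similarity("Smith Robert", "Robert Smith"))
```

A score near 100 suggests a likely duplicate; where to put the cutoff is a
judgment call that depends on the data.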

------
bockris
I've looked at Febrl before <http://sourceforge.net/projects/febrl/> but in
the end found it easiest to write my own.
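
For what it's worth, a home-grown version can be quite small. The following is
only a sketch of one possible approach, not the code referred to above: score
every pair of rows with the standard-library `difflib`, list the pairs above a
threshold for a human to confirm, and merge a confirmed pair by keeping the
longer value in each column. The 0.6 threshold and the "longer value wins"
rule are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def score(a, b):
    """Similarity between two strings as a float from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_pairs(rows, threshold=0.6):
    """Yield (score, i, j) for row pairs whose joined text looks alike."""
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        s = score(" ".join(a), " ".join(b))
        if s >= threshold:
            yield s, i, j

def merge(a, b):
    """Keep the longer value per column; on a tie keep the value from a."""
    return tuple(x if len(x) >= len(y) else y for x, y in zip(a, b))

rows = [
    ("Robert Smith", "123 Main St.", "New York", "NY"),
    ("Bob Smith", "123 Main Street", "NY", "NY"),
    ("Alice Jones", "9 Elm Rd", "Boston", "MA"),
]
# Show the most similar pairs first so a person can accept or reject each one.
for s, i, j in sorted(candidate_pairs(rows), reverse=True):
    print(round(s, 2), rows[i], rows[j])

print(merge(rows[0], rows[1]))
```

Comparing every pair is quadratic in the number of rows, which is fine for a
one-shot cleanup of a modest spreadsheet but would need some form of blocking
(for example, only comparing rows that share a postal code) on a large list.
Free-form columns such as `$foo` and `$bar` would need to be concatenated
rather than picked by length.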

