

Dealing with Duplicate Person Data (Levenshtein Distance, Nicknames & Multi-core/CPU)  - draegtun
http://proudtouseperl.com/2009/04/dealing-with-duplicate-person-data.html

======
patio11
Please be careful to THOROUGHLY vet any logic like this prior to
internationalizing. (Though, truth be known, I'd be hesitant to employ it in
America these days for anything important.) You might think names are
reasonably unique identifiers. You'd be pretty catastrophically wrong in much
of Asia, for example.

Our Japanese university customers use the following to test identities for
equality:

1) Same first name

2) Same last name

3) Same high school

4) Same birthday

5) Same high school major

6) Same high school GPA

7) Human review

Human review exists because if after all those checks we still merge two
students and then someone fails the entrance examination as a result, someone
at our company and someone at the university will pretty much have to resign
in disgrace.

~~~
aristus
Amen. At Spock we had a cluster of women from Spain that got tagged as
Catholic priests because they had names like "Maria Pilar Pastor". Identity
resolution is one of the trickiest things they are solving.

------
nradov
This issue is usually known as record linkage. Academic research has produced
some more sophisticated techniques.
<http://ije.oxfordjournals.org/cgi/content/full/31/6/1246>
<http://www.fcsm.gov/working-papers/foreword.pdf>
<http://www.census.gov/srd/papers/pdf/rr99-04.pdf>

~~~
bsaunder
See also: <http://en.wikipedia.org/wiki/Record_linkage_problem> and
<http://www.fcsm.gov/working-papers/RLT_1997.html>

------
bsaunder
I've been tackling the same problem recently, in a similar architecture
(though more real-time oriented). Some other suggestions:

    
    
      1.  Hash the names (with something like Text::DoubleMetaphone) and select from this list rather than trawling the whole database.
      2.  Check out POE (great for distributed parallel processing)
      3.  Consider how you will handle null fields (a lack of information is not the same as badly matching information).
      4.  Consider the uniqueness of the data values themselves (Ebenezer Rumpelstilzchen is more unique than John Smith), but you have to watch out for foreign names (which seem unique but aren't).
      5.  Matching on multiple fields seems multiplicative, not additive.
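
Points 3 to 5 above could be sketched in Python roughly like this. It is only an illustration: the function names, the 0.1 disagreement factor, the 0.01 default frequency, and the frequency table are all made-up assumptions, not anything from the article. Each field contributes a factor, the factors are multiplied, a missing value contributes a neutral 1.0, and agreement on a rare value counts for more than agreement on a common one.

```python
from typing import Dict, Iterable, Optional

Record = Dict[str, Optional[str]]


def field_factor(a: Optional[str], b: Optional[str],
                 value_freq: Dict[str, float],
                 disagree_factor: float = 0.1) -> float:
    """Evidence factor for one field (hypothetical scoring scheme)."""
    if not a or not b:
        # Missing information is neutral: it neither supports nor
        # contradicts a match.
        return 1.0
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    if a_norm != b_norm:
        return disagree_factor
    # Agreement on a rare value is stronger evidence than agreement on a
    # common one. 0.01 is an assumed default for values without statistics.
    return 1.0 / value_freq.get(a_norm, 0.01)


def match_score(r1: Record, r2: Record,
                freqs: Dict[str, Dict[str, float]],
                fields: Iterable[str]) -> float:
    """Multiply the per-field factors (multiplicative, not additive)."""
    score = 1.0
    for field in fields:
        score *= field_factor(r1.get(field), r2.get(field),
                              freqs.get(field, {}))
    return score


# Made-up relative frequencies, for illustration only.
FREQS = {
    "first": {"john": 0.05, "ebenezer": 0.0001},
    "last": {"smith": 0.01, "rumpelstilzchen": 0.00001},
}
FIELDS = ("first", "last", "dob")

common = match_score(
    {"first": "John", "last": "Smith", "dob": None},
    {"first": "John", "last": "Smith", "dob": "1970-01-01"},
    FREQS, FIELDS)
rare = match_score(
    {"first": "Ebenezer", "last": "Rumpelstilzchen", "dob": None},
    {"first": "Ebenezer", "last": "Rumpelstilzchen", "dob": "1970-01-01"},
    FREQS, FIELDS)
```

With these assumed numbers, `rare` comes out far larger than `common`, and the missing `dob` changes neither score. The caveat in point 4 still applies: the frequency table has to come from the population the names belong to, or a foreign name will look rarer than it is.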

~~~
jrockway
Numbered lists are not code, so please don't format them that way. They break
the page.

(HN really needs some CSS that restricts the width of code blocks.)

~~~
Xichekolas
There is a GM script that does this (PRE-fix), but with javascript:
<http://userscripts.org/scripts/show/25039>

I tried with CSS alone, but as I recall it was inconsistent across
browsers, so I went with JS.

~~~
jrockway
OK. The CSS I have on my blog seems to handle long code snippets pretty well,
but I can't explain how it works, since I just ported a Typo theme.

Regardless, it seems possible. Someone that actually has a clue about CSS
and/or Javascript should submit a patch to fix HN :) (Unfortunately, I do not
have said clue.)

------
alexitosrv
Recently, I had to solve that same problem in PostgreSQL. I use a multicolumn
index on the soundex of the first name, the soundex of the last name, and the
birthdate. This yields a set of preliminary candidates to match. In a second
step, we prune them according to a hand-crafted distance measure, which relies
on Levenshtein and a bunch of simple calculations on the whole record. Finally,
we order the candidates by distance and take as the match the record with the
smallest distance, provided it is under a threshold.
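
A rough Python sketch of that pipeline, for anyone who wants to play with it outside the database. Assumptions: the in-memory dict stands in for the PostgreSQL multicolumn index, the "hand-crafted" distance is replaced by a plain sum of Levenshtein distances on the two name fields, the threshold of 2 is arbitrary, and all function and field names are hypothetical.

```python
from collections import defaultdict

_CODES = {}
for _digit, _letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                         ("4", "l"), ("5", "mn"), ("6", "r")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """American Soundex: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            out += code
        if c not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (out + "000")[:4]


def levenshtein(a, b):
    """Edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def block_key(rec):
    """Stand-in for the (soundex, soundex, birthdate) index columns."""
    return (soundex(rec["first"]), soundex(rec["last"]), rec["birthdate"])


def distance(a, b):
    """Simplified stand-in for the hand-crafted distance measure."""
    return (levenshtein(a["first"].lower(), b["first"].lower())
            + levenshtein(a["last"].lower(), b["last"].lower()))


def best_match(query, records, threshold=2):
    """Block on the key, score the candidates, keep the closest one."""
    index = defaultdict(list)
    for rec in records:
        index[block_key(rec)].append(rec)
    scored = [(distance(query, c), c) for c in index.get(block_key(query), [])]
    scored = [s for s in scored if s[0] <= threshold]
    if not scored:
        return None
    return min(scored, key=lambda s: s[0])[1]
```

The blocking step is what keeps this cheap: only records sharing the key are ever compared with the expensive distance function. The flip side is that a typo in the birthdate, or a name change that alters the soundex code, means the true match never makes it into the candidate set.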

------
edw519
This is fuzzy logic that usually requires "tricks of the trade". Good
suggestions here.

Some other things we have used over the years to remove dups:

    
    
      SOUNDEX (last name, first name, street name)
      Zip Code (1st 2 digits, 1st 3 digits, all 5 digits)
      Street Name only (discard numerics)
      First letter of First name only (watch out, Joe and Jerry may be reduced to one record)
      For mailing lists, are 2 people with the same last name at the same address really 2 people? (Sometimes yes, sometimes no)
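
A small Python sketch of how several of those keys could be derived and compared. Everything here is an assumption for illustration: the field names (`first`, `last`, `address`, `zip`), the function names, and the idea of returning the list of agreeing keys. SOUNDEX is left out to keep it short.

```python
import re


def street_name_only(address):
    """Drop any token containing a digit, then strip punctuation."""
    words = []
    for token in address.lower().split():
        if any(ch.isdigit() for ch in token):
            continue  # house numbers, unit numbers, "12b", ...
        word = re.sub(r"[^a-z]", "", token)
        if word:
            words.append(word)
    return " ".join(words)


def match_keys(rec):
    """Derive coarse-to-fine comparison keys from one record."""
    zip5 = rec["zip"].strip()[:5]
    return {
        "zip2": zip5[:2],
        "zip3": zip5[:3],
        "zip5": zip5,
        "street": street_name_only(rec["address"]),
        "initial": rec["first"].strip()[:1].lower(),
        "last": rec["last"].strip().lower(),
    }


def agreeing_keys(a, b):
    """Names of the keys on which two records agree (empty keys ignored)."""
    ka, kb = match_keys(a), match_keys(b)
    return sorted(k for k in ka if ka[k] and ka[k] == kb[k])


joe = {"first": "Joe", "last": "Smith",
       "address": "12 Main St", "zip": "90210"}
jerry = {"first": "Jerry", "last": "Smith",
         "address": "12B Main St.", "zip": "90213"}

agreed = agreeing_keys(joe, jerry)
```

Here Joe and Jerry agree on everything except the full five-digit zip, which is exactly the first-initial trap mentioned above. Which combination of agreeing keys is enough to call two records duplicates is a judgment call that depends on the list being cleaned.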

