Hacker News new | past | comments | ask | show | jobs | submit login

I have worked on this problem many times, at many companies. I am working on it again, actually. Usually some combination of scoring and persisting results in CSVs for human review.

(edit: I am at a desktop now and I can say a bit more)

Here is the process in a nutshell:

1. Create a fast hashing algorithm to find rows that might be dups. It needs to be fast because you have lots of rows. This is where SimHash, MinHash, etc. come into play. I've had good luck using simhash(name) and persisting it. Unfortunately you need to measure the hamming distance between simhashes to calculate a similarity score. This can be slow depending on your approach.

2. Create a slower scoring algorithm that measures the similarity between two rows. Think about a weighted average of diffs, where you pick the weights based on your intuition about the fields. In your case you have handy discrete fields, so this won't be too hard. The hardest field is name. Start with something simple and improve it over time. Blank fields can be scored as 0.5, meaning "unknown". Hashing photos can help here too.

3. Use (1) to find things that might be dups, then score them with (2). Dump your potential dups to a CSV for human review. As another poster indicated, I've found human review to be essential. It's easy for a human to see that "Super Mario 2" and "Super Mario 3" are very different.

4. Parse your CSV to resolve the dups as you see fit.

Have fun!




With regards to 1, I wonder: why would calculating the Hamming distance be slow? In python you can easily do it like this:

    hamming_dist = bin(a^b).count("1")
It relies on a string operations, but takes ~1 microsecond on an old i5 7200u to compare 32bit numbers. In python 3.10 we'll get int.bit_count() to get the same result without having to do these kind of things (and a ~6x speedup on the operation, but I suspect the XOR and integer handling of python might already be a large part of the running time for this calculation).

If you need to go faster, you can basically pull hamming distance with just two assembly instructions: XOR and POPCNT. I haven't gone so low level for a long time, but you should be able to get into the nanosecond speed range using those.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: