
Ask HN: Algorithms for the de-duplication of content? - passioncurious
I've been on the hunt for algorithms that can "dedupe" content, but haven't found much.

To give a little more context: given a piece of content and a second (altered) piece of content, can I use an algorithm to determine the likelihood that the second piece of content is a variation of the first?

I essentially want to store one variation of the content in my database, with references to altered variations of the content, but I need to be able to detect the altered data.

Edit:

Imagine having a million resumes (some in doc, some in pdf, some in html) and you want to dedupe them.
======
yehosef
I think a pure diff is going to give you too much noise and will be misled by
simple mutations.

Here's an approach: after you normalize the docs to text, extract
terms/keywords (google "term extraction" - there are libraries that use
part-of-speech tagging to help) and save those as tokens. Then use something
like Elasticsearch's "more like this" query. The idea is that you want to
find the documents that are most similar to a given document.

I haven't tried this approach but if it's effective, it should be able to
handle a million resumes easily.
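
A minimal sketch of that idea, assuming an Elasticsearch node at
localhost:9200, an index named "resumes", and a "body" field holding the
extracted text (all of those names are assumptions for the example), using
the official Python client:

    # Hypothetical sketch: rank stored resumes by similarity to a candidate
    # text with Elasticsearch's "more_like_this" query.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def similar_resumes(text, top_n=10):
        """Return (id, score) for the resumes most similar to `text`."""
        resp = es.search(
            index="resumes",
            query={
                "more_like_this": {
                    "fields": ["body"],
                    "like": text,
                    "min_term_freq": 1,
                    "min_doc_freq": 1,
                }
            },
            size=top_n,
        )
        # A score that stands well above the rest of the corpus suggests
        # the hit is a likely variant of the query document.
        return [(h["_id"], h["_score"]) for h in resp["hits"]["hits"]]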

------
pwg
Your question is asking for a "diff" algorithm. For Unix/Linux systems there
is the classic diff utility itself [1]. For arbitrary binary content you'd be
looking at something like bdiff [2] or xdelta [3].

[1]
[https://en.wikipedia.org/wiki/Diff_utility](https://en.wikipedia.org/wiki/Diff_utility)

[2] [http://bdiff.sourceforge.net/](http://bdiff.sourceforge.net/)

[3] [http://xdelta.org/](http://xdelta.org/)
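
As a rough illustration of storing one canonical copy plus deltas, for text
content only (difflib is in the Python standard library; for arbitrary
binaries you'd use the tools above):

    # Store one canonical text and an ndiff delta per variant; the variant
    # can be reconstructed from the delta on demand.
    import difflib

    original = "John Doe\nPython developer\n5 years experience\n".splitlines(keepends=True)
    variant = "John Doe\nSenior Python developer\n6 years experience\n".splitlines(keepends=True)

    delta = list(difflib.ndiff(original, variant))  # store this, not the variant

    restored = list(difflib.restore(delta, 2))  # 2 = recover the second sequence
    assert restored == variant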

~~~
passioncurious
As noted in another answer, the data is very free-form, so a simple diff may
not help.

Imagine having a million resumes (some in doc, some in pdf, some in html) and
you want to dedupe them.

~~~
pwg
That important detail was also not in the original question. What was there
was:

I essentially want to store one variation of the content in my database, with
references to altered variations of the content, but I need to be able to
detect the altered data.

For which a "diff" style algorithm is a possibility, just storing the
"alterations".

Your edit changed the question into a different problem. For this moved goal
post, you'll want to look into plagiarism-detection algorithms.
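
As a quick sketch of the basic primitive behind many plagiarism detectors:
shingle each document into overlapping word n-grams and compare Jaccard
similarity (pure Python; the threshold is something you'd tune on your own
corpus):

    # Word-level shingling plus Jaccard similarity; real systems add
    # normalization, hashing, and an index on top of this.
    def shingles(text, n=3):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0

    doc_a = "experienced python developer with five years of backend work"
    doc_b = "python developer with five years of backend work and more"

    score = jaccard(shingles(doc_a), shingles(doc_b))
    print(f"similarity: {score:.2f}")  # near 1.0 suggests a variant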

~~~
passioncurious
That's why I updated the question to prevent any more confusion.

Thanks. Plagiarism detection sounds like a good avenue to check out.

------
drdeca
Maybe locality-sensitive hashing or locality-preserving hashing?

What sort of data are you storing?
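
If it's text, a minimal MinHash/LSH sketch, assuming the third-party
datasketch library (the threshold and num_perm values are just examples):

    # Near-duplicate lookup via MinHash + LSH (pip install datasketch).
    from datasketch import MinHash, MinHashLSH

    def minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf8"))
        return m

    lsh = MinHashLSH(threshold=0.8, num_perm=128)

    # Index the canonical documents once...
    lsh.insert("resume-1", minhash("python developer five years backend experience"))

    # ...then query a candidate to find likely variants.
    print(lsh.query(minhash("python developer with five years backend experience")))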

~~~
passioncurious
The content is pretty freeform: one could be in html, another in pdf, etc.

To give a more concrete example, imagine having a million resumes (some in
doc, some in pdf, some in html) and you want to dedupe them.

~~~
yehosef
You'll have to normalize them to text to make it work - the format is not
important for the deduplication (meaning the same text in a Word doc or a
pdf is really the same thing to you, even though at the byte level the files
may be very different).
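
A rough sketch of that normalization step, assuming the third-party
pdfminer.six and beautifulsoup4 packages (doc/docx would need yet another
extractor, omitted here):

    # Reduce each file to normalized plain text so the dedupe logic never
    # sees the container format.
    import re
    from pdfminer.high_level import extract_text  # pdfminer.six
    from bs4 import BeautifulSoup  # beautifulsoup4

    def normalize(text):
        # Collapse whitespace and lowercase so trivial formatting vanishes.
        return re.sub(r"\s+", " ", text).strip().lower()

    def to_text(path):
        if path.endswith(".pdf"):
            raw = extract_text(path)
        elif path.endswith((".html", ".htm")):
            with open(path, encoding="utf-8") as f:
                raw = BeautifulSoup(f.read(), "html.parser").get_text(" ")
        else:
            raise ValueError(f"no extractor for {path}")
        return normalize(raw)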

