Ask HN: Algorithms for de-duplicating content?
4 points by passioncurious 243 days ago | 8 comments
I've been on the hunt for algorithms that can "dedupe" content, but haven't found much.
To give a little more context:
Given a piece of content and a second (altered) piece of content, can I use an algorithm to estimate the likelihood that the second piece is a variation of the first?
I essentially want to store one version of the content in my database, with references to the altered variations of it, but to do that I need to be able to detect the altered copies.
Edit:
Imagine you have a million resumes (some in DOC, some in PDF, some in HTML) and you want to dedupe them.
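
To make the question concrete, here is a minimal sketch of one common way to score "is B a variation of A": word shingles compared with Jaccard similarity, then a greedy pass that keeps one stored copy and points the variants at it. This is only an illustration, not something from the thread: the function names, the shingle size `k=3`, and the `0.5` threshold are all assumptions, and it assumes the documents have already been converted to plain text.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles from lowercased text (punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, k=3):
    """Score in [0, 1]; higher means doc_b looks more like a variation of doc_a."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k))


def dedupe(docs, threshold=0.5):
    """Greedy grouping: map each stored (canonical) index to its variant indices.

    The threshold is a made-up starting point and would need tuning on real data.
    """
    canonical = {}  # canonical index -> its shingle set
    groups = {}     # canonical index -> list of variant indices
    for i, doc in enumerate(docs):
        s = shingles(doc)
        for c, cs in canonical.items():
            if jaccard(s, cs) >= threshold:
                groups[c].append(i)
                break
        else:
            # No existing canonical doc is close enough: store this one.
            canonical[i] = s
            groups[i] = []
    return groups


if __name__ == "__main__":
    resumes = [
        "Senior software engineer with ten years of experience in Python and Go",
        "Senior software engineer with ten years of experience in Python, Go and Rust",
        "Pastry chef specializing in French desserts and wedding cakes",
    ]
    print(dedupe(resumes))
```

The pairwise loop in `dedupe` is quadratic in the number of stored documents, so it would not scale to a million resumes as written. At that size the usual move is to approximate the same Jaccard score with MinHash signatures and locality-sensitive hashing.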
Here's an approach: after you normalize the docs to plain text, extract terms/keywords (search for "term extraction"; there are libraries that use part-of-speech (POS) tagging to help) and save those as tokens. You could then use something like Elasticsearch to run a "more like this" query. The idea is to find the documents that are most similar to a given document.
I haven't tried this approach, but if it's effective, it should be able to handle a million resumes easily.
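
A rough sketch of what that pipeline could look like, assuming the documents are already plain text. The term extraction here is a crude stopword filter standing in for a real POS-based term extractor, and the field name `terms`, the stopword list, and the parameter values are all made up for illustration. The function only builds the body of an Elasticsearch `more_like_this` query as a dict; sending it to a cluster is left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real term extractor would do far better.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "by", "for", "from", "in", "is",
    "of", "on", "or", "the", "to", "with",
}


def extract_terms(text, max_terms=10):
    """Return the most frequent non-stopword tokens in the text."""
    words = re.findall(r"[a-z][a-z0-9+#]*", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(max_terms)]


def more_like_this_query(text, field="terms", max_terms=10):
    """Build a more_like_this query body from the extracted terms.

    `field` is a hypothetical index field holding each document's saved terms.
    """
    terms = extract_terms(text, max_terms)
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": " ".join(terms),
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "max_query_terms": max_terms,
            }
        }
    }


if __name__ == "__main__":
    resume = (
        "Python developer. Python and Elasticsearch experience, "
        "the developer built search."
    )
    print(more_like_this_query(resume))
```

The hits that come back are ranked by similarity, not labeled as duplicates, so you would still need a score cutoff or a second, stricter comparison on the top few results before treating two resumes as the same document.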