

Clever method of near duplicate detection - prakash
http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html

======
wastedbrains
This is interesting. I used to work with SVM document categorization, and
this was a huge problem.

This was years ago, so forgive me for never having found an elegant solution,
but I never thought to use something like Markov chains
(<http://en.wikipedia.org/wiki/Markov_chain>) to detect duplicates or to
filter out layout data. I ended up doing custom layout detection, trying to
ignore headers, navigational menus, and footers. It was a constant battle, and
the result was brittle and hackish. It worked OK for my purposes, but I still
remember checking a whole batch of new entries every time the crawler pulled
down new stories to see whether I should add some obvious ignores to the
filters.
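
To make the Markov-chain idea concrete, here is a rough sketch of one way it
might work (not something I actually built, and not the article's method):
train a token-bigram model on lines known to be boilerplate, then flag new
lines that score as too likely under it. The training snippets, the smoothing,
and the unseen-transition floor are all made up for illustration.

    import math
    from collections import defaultdict

    def train_bigram_model(lines):
        """Count token-bigram transitions over known boilerplate lines."""
        counts = defaultdict(lambda: defaultdict(int))
        for line in lines:
            tokens = ["<s>"] + line.lower().split() + ["</s>"]
            for a, b in zip(tokens, tokens[1:]):
                counts[a][b] += 1
        # Convert counts to log-probabilities, add-one smoothing over
        # the observed successors of each token.
        model = {}
        for a, nexts in counts.items():
            total = sum(nexts.values()) + len(nexts)
            model[a] = {b: math.log((n + 1) / total) for b, n in nexts.items()}
        return model

    def avg_log_likelihood(model, line):
        """Average per-bigram log-likelihood of a line under the model."""
        tokens = ["<s>"] + line.lower().split() + ["</s>"]
        floor = math.log(1e-6)  # penalty for unseen transitions
        scores = [model.get(a, {}).get(b, floor)
                  for a, b in zip(tokens, tokens[1:])]
        return sum(scores) / len(scores)

    # Hypothetical boilerplate scraped from crawled pages.
    boilerplate = ["home about contact",
                   "copyright 2008 all rights reserved",
                   "next page previous page"]
    model = train_bigram_model(boilerplate)

    for line in ["about contact home",
                 "the senator announced a new bill today"]:
        # Lines scoring above a hand-picked threshold look like layout text.
        print(line, "->", avg_log_likelihood(model, line))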

------
jaydub
One of my professors did some work in this area, as well
<http://www.cs.umd.edu/~pugh/google/>

I think it's pretty interesting to look at those Google results from 2000.

------
cameldrv
Interesting, but I'm not sure it's any better than digram or trigram
frequency. The real difficulty, of course, is the k-nearest-neighbors problem
when you go searching for the duplicate.
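
For reference, a minimal sketch of the trigram-frequency baseline: build
character-trigram count profiles and compare them with cosine similarity. The
sample sentences are made up, and the pairwise comparison shown here is
exactly what the k-nearest-neighbors problem makes expensive at scale.

    import math
    from collections import Counter

    def trigram_counts(text):
        """Character-trigram frequency profile of a document."""
        t = " ".join(text.lower().split())  # normalize whitespace
        return Counter(t[i:i + 3] for i in range(len(t) - 2))

    def cosine(a, b):
        """Cosine similarity between two sparse count vectors."""
        dot = sum(a[k] * b[k] for k in a if k in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    doc1 = "Google announced a new near duplicate detection method today."
    doc2 = "Today Google announced a new method for near duplicate detection."
    # High similarity for near duplicates despite the reordering.
    print(cosine(trigram_counts(doc1), trigram_counts(doc2)))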

~~~
gaika
It is more efficient: the distance between duplicates is much shorter in
their case, because effectively they compare only the important part of the
page. That lets you use simple algorithms that are a lot faster. It is still
not foolproof, though: some sites have an "about" header with lots of text,
and some blogs put a blurb before and after every quote.
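
One way to see why the simple algorithms become viable: once only the
important part of the page is compared, many near duplicates collapse into
exact duplicates, so a plain hash-set lookup can replace a nearest-neighbor
search. A minimal sketch, with a made-up strip_boilerplate stand-in for real
content extraction:

    import hashlib

    def strip_boilerplate(page):
        """Stand-in for real content extraction; just keeps long lines."""
        return "\n".join(l for l in page.splitlines()
                         if len(l.split()) > 5)

    seen = set()

    def is_duplicate(page):
        """Exact-hash dedupe over the extracted main content only."""
        digest = hashlib.sha1(
            strip_boilerplate(page).encode("utf-8")).hexdigest()
        if digest in seen:
            return True
        seen.add(digest)
        return False

    pages = ["Home | About | Contact\n"
             "The senator announced a new bill on Tuesday in Washington.",
             "The senator announced a new bill on Tuesday in Washington.\n"
             "Copyright 2008"]
    print([is_duplicate(p) for p in pages])  # [False, True] after stripping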

~~~
stcredzero
You can think of this as just changing the granularity of your tokens. The
set of possible tokens becomes huge while the number of them per page
decreases, so of course looking for approximate matches gets much easier.
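
A toy illustration of the granularity point: going from single words (w=1) to
4-word shingles shrinks the number of tokens per page while the space of
possible tokens explodes, and overlap between near duplicates becomes easy to
spot with plain Jaccard similarity. The sample pages are made up.

    def shingles(text, w):
        """Set of w-word shingles; larger w = coarser, rarer tokens."""
        words = text.lower().split()
        return {" ".join(words[i:i + w])
                for i in range(len(words) - w + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    page1 = "breaking news the senator announced a new bill on tuesday morning"
    page2 = "the senator announced a new bill on tuesday morning reports say"

    for w in (1, 4):
        s1, s2 = shingles(page1, w), shingles(page2, w)
        print(f"w={w}: {len(s1)} tokens per page, "
              f"jaccard={jaccard(s1, s2):.2f}")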

