
Winnowing: Local Algorithms for Document Fingerprinting (2003) [pdf] - dang
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
======
dang
Does anyone have suggestions in this area? We're interested in building better
dupe and blogspam detection for HN, and want to get the lay of the land.
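For reference, the core idea of the linked paper is small enough to sketch: hash every k-gram of the document, slide a window of w consecutive hashes, and record the minimum hash in each window (rightmost occurrence on ties); the selected (hash, position) pairs are the fingerprint. A rough illustrative sketch, with `zlib.crc32` standing in for the rolling hash the paper actually uses:

```python
import zlib

def winnow(text: str, k: int = 5, w: int = 4) -> set:
    """Winnowing fingerprint: hash all k-grams, then keep the minimum
    hash of each window of w consecutive hashes (rightmost on ties).
    Returns a set of (hash, position) pairs."""
    grams = [text[i:i + k] for i in range(len(text) - k + 1)]
    hashes = [zlib.crc32(g.encode()) for g in grams]
    fingerprint = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        m = min(window)
        j = w - 1 - window[::-1].index(m)  # rightmost minimal hash
        fingerprint.add((m, i + j))
    return fingerprint

# Near-duplicate documents share most selected fingerprints:
a = winnow("the quick brown fox jumps over the lazy dog")
b = winnow("the quick brown fox leaps over the lazy dog")
print(len(a & b) / len(a | b))  # high overlap for near-dupes
```

The window guarantees at least one fingerprint per w-hash stretch, so any shared substring longer than w + k - 1 characters is detected while storing only a fraction of all hashes.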

There are also interesting references at
[http://en.wikipedia.org/wiki/Plagiarism_detection#Fingerprin...](http://en.wikipedia.org/wiki/Plagiarism_detection#Fingerprinting).

Edit: Thank you all for the excellent comments.

~~~
ragle
I just finished my undergraduate senior project, which investigated whether
the normalized compression distance[1][2] could be used to build a
featureless clustering and classification system for Thai-language
documents.

An interesting use case for the NCD is authorship attribution and plagiarism
detection[3]. At one point during my project, while collecting documents for
our corpus, I noticed very low NCD scores between some of the documents.

Sure enough, all of the documents with a compression distance between ~0.01
and 0.1 were either complete copies or heavy plagiarizations of each other!
We lost around 15 documents, but at least we knew the NCD functionality was
working! :)

If you think including NCD measurements in your system could be helpful and
you'll be working with larger documents, be sure to check out the findings
in [4].

[1] -
[http://en.wikipedia.org/wiki/Normalized_compression_distance](http://en.wikipedia.org/wiki/Normalized_compression_distance)

[2] -
[http://homepages.cwi.nl/~paulv/papers/cluster.pdf](http://homepages.cwi.nl/~paulv/papers/cluster.pdf)

[3] -
[http://www.inf.ufpr.br/lesoliveira/download/FSI2013.pdf](http://www.inf.ufpr.br/lesoliveira/download/FSI2013.pdf)

[4] -
[http://www.ims.cuhk.edu.hk/~cis/2005.4/01.pdf](http://www.ims.cuhk.edu.hk/~cis/2005.4/01.pdf)

~~~
2510c39011c5
Cool result! Is the basic assumption that close documents also have close
information entropy?

