

Finding Bieber: On removing duplicates from a set of documents - spiffytech
http://stevehanov.ca/blog/index.php?id=144

======
chaosfactor
Convert to TF-IDF space and search for the documents with the highest
correlations (using, e.g., a clustering algorithm).
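
A rough sketch of what that could look like, using only the standard library. Assumptions here, none of them from the linked post: whitespace tokenization, a smoothed IDF, cosine similarity as the "correlation", and a brute-force pairwise comparison with a threshold standing in for a real clustering algorithm. The function names and the 0.8 threshold are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so terms found in every document keep a positive weight.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicates(docs, threshold=0.8):
    """Return (i, j, score) for document pairs scoring at or above threshold.

    Brute-force O(n^2) comparison; a stand-in for clustering, fine for small sets.
    """
    vectors = tfidf_vectors(docs)
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Calling `near_duplicates(docs)` on a list of strings gives back the index pairs that look like duplicates, best match first. For a large collection the all-pairs loop gets expensive, which is where clustering or some other way of narrowing the candidate pairs would come in.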

