

Blog Comment Similarity Detector (Free Code For Disqus) - bravura
http://gilesbowkett.blogspot.com/2010/11/blog-comment-similarity-detector-free.html

======
aaronzinman
I used essentially this method as well for Personas (personas.media.mit.edu),
since so many web results were the same, except that I kept track of the
duplicates and then applied further metrics to select 'the best one.'

Works well.
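
A rough sketch of the idea, for illustration only: group near-duplicate texts,
then pick one per group. This is not the Personas code or the linked post's
code. The word-set Jaccard similarity, the 0.8 threshold, and the "longest
text wins" metric are all assumptions, and every function name here is
hypothetical.

```python
import re


def tokens(text):
    # Naive tokenization (an assumption): lowercase runs of letters,
    # digits, and apostrophes, collected into a set.
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    # Jaccard similarity of two token sets; two empty sets count as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_duplicates(texts, threshold=0.8):
    # Each group is a list of texts; the first member is the one that
    # later texts are compared against.
    groups = []
    for text in texts:
        toks = tokens(text)
        for group in groups:
            if jaccard(toks, tokens(group[0])) >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups


def pick_best(group):
    # Stand-in for the "further metrics": here, simply the longest text.
    return max(group, key=len)


results = [
    "The quick brown fox jumps",
    "the quick brown fox jumps!",
    "Something else entirely",
]
groups = group_duplicates(results)
best = [pick_best(g) for g in groups]
```

In practice pick_best could be swapped for whatever metric fits the data
(source quality, recency, and so on).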

Although the tokenization is very much a poor man's version. I have since
open-sourced my tokenization framework (for Python), which handles real-world
text:
<http://bitbucket.org/azinman/defuse/src/tip/smgtk/tokup/>

Aaron

