

The Science of Crawl, Part 1: Deduplication of Web Content - jisaacso
http://blog.urx.com/urx-blog/2014/9/4/the-science-of-crawl-part-1-deduplication-of-web-content

======
stephen_mcd
Great article.

I went through a similar process about a year ago for
[https://kouio.com](https://kouio.com) (RSS reader). In that case I needed to
coalesce closely matching RSS feeds, purely for storage and performance. After
trialling edit distance and various simhash implementations in Python, we
ended up needing to look no further than the standard library's
difflib.SequenceMatcher. I wish I had documented my findings at the time, but
I recall it was the best in terms of both speed and accuracy.
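
From memory, the core of it was something like this (the 0.9 threshold is
illustrative, not what we actually shipped):

    import difflib

    def is_near_duplicate(a, b, threshold=0.9):
        # ratio() returns a similarity score in [0, 1];
        # values near 1 indicate closely matching feed text.
        return difflib.SequenceMatcher(None, a, b).ratio() >= threshold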

Also, you might not want to rely on str.isalnum for stripping punctuation. I
made the same mistake here:
[https://twitter.com/stephen_mcd/status/506344236531212288](https://twitter.com/stephen_mcd/status/506344236531212288)
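
One example of the kind of surprise you can hit: str.isalnum is
Unicode-aware, so characters you might expect to strip still count as
alphanumeric (Python 3 here):

    >>> '²'.isalnum()   # superscript two counts as a digit
    True
    >>> 'Ⅷ'.isalnum()  # so does a Roman numeral character
    True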

~~~
jisaacso
Thanks for the reference. It looks like SequenceMatcher is "cubic time in the
worst case and quadratic time in the expected case". Did you notice any
performance issues as kouio scaled?

~~~
stephen_mcd
Perhaps it was more a case of accuracy for what we were looking at at the
time, then :-)

It's something we run out of band on a subset of our data, so it's never been
performance critical.

------
thaumaturgy
There's also nilsimsa hashing (there's a Python implementation at
[http://code.google.com/p/py-nilsimsa/](http://code.google.com/p/py-nilsimsa/)).
Unfortunately, nilsimsa hashes can vary in their most significant bits when
used on similar inputs:

    773e2df0a02a319ec34a0b71d54029111da90838cbc20ecd3d2d4e18c25a3025
    47182cf0802a11dec24a3b75d5042d310ca90838c9d20ecc3d610e98560a3645

...so although nilsimsa is somewhat nice for calculating the difference of two
documents, it's a pain in the butt for finding similar documents in a
database.
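
To make that concrete, here's a quick check on the two digests above (plain
Python, nothing nilsimsa-specific):

    a = int("773e2df0a02a319ec34a0b71d54029111da90838cbc20ecd3d2d4e18c25a3025", 16)
    b = int("47182cf0802a11dec24a3b75d5042d310ca90838c9d20ecc3d610e98560a3645", 16)
    diff = a ^ b
    print(bin(diff).count("1"))        # differing bits are scattered across the digest
    print(f"{diff:0256b}".index("1"))  # 2 -- they disagree almost immediately, in the high bits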

The solution described in the writeup is neat, but I really wish there were
an LSH that generated hashes with most-to-least significance in their bits.

Great writeup though!

------
boynamedsue
As an aside: util.clean_html() has been dropped from NLTK 3.0, which has
substantial API changes [1].

The recommendation now is to use BeautifulSoup or something similar, e.g.:
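
A drop-in replacement for the text-extraction part would be something along
these lines (html.parser is the stdlib backend; lxml also works if
installed):

    from bs4 import BeautifulSoup

    def clean_html(html):
        # Extract visible text, dropping all markup.
        return BeautifulSoup(html, "html.parser").get_text()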

[1] [https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0](https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0)

------
shabinesh
Good article. I had the challenge of deduplicating addresses; I just used
cosine similarity, which worked well for the purpose.
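
For anyone curious, a minimal sketch of that approach using character
trigrams (the trigram featurization is illustrative; any vectorization
works):

    from collections import Counter
    import math

    def trigrams(s):
        s = s.lower()
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    def cosine(a, b):
        va, vb = trigrams(a), trigrams(b)
        dot = sum(va[g] * vb[g] for g in va)
        norm = (math.sqrt(sum(v * v for v in va.values()))
                * math.sqrt(sum(v * v for v in vb.values())))
        return dot / norm if norm else 0.0

    # Near-duplicate addresses score close to 1.0:
    print(cosine("12 Main St, Springfield", "12 Main Street, Springfield"))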

