Would it be possible for you to post / send me the original data? I have been very interested in doing more longitudinal analysis using Diffbot data, and this seems like a fun and interesting place to start. I'm happy to open-source / clearly attribute Diffbot's contribution to whatever I find / hack together, and would feel a lot more comfortable about integrating Diffbot into larger projects in the future.
Please email me (my address is in my profile) if this is a possibility. Thanks!
All the pieces and services we used for this, including all the text extraction, topic detection, and crawling, are available to any user.
Have fun with it, keep us informed about whatever cool stuff you build, and of course tell us about any features or capabilities you wish Diffbot could provide.
I should add that their API is fantastic, and far better than using BeautifulSoup/NLTK for extracting textual content from webpages.