

Show HN: Analysis of 2.5 Years of Frontpage Articles - miket
http://diffbot.com/robotlab/hackernews/

======
johncoogan
Huge fan of DiffBot and awesome projects like this. Really cool analysis,
thanks for posting.

Would it be possible for you to post / send me the original data? I have been
very interested in working on more longitudinal analysis using DiffBot data
and this seems like a fun and interesting place to start. I'm happy to open-
source / clearly attribute DiffBot's contribution to whatever I find / hack
together, and would feel a lot more comfortable about integrating DiffBot into
larger projects in the future.

Please email me (in my profile) if this is a possibility. Thanks!

~~~
tswaterman
Great idea! We'd be happy to share/help. If more people are interested, we'll
figure out a good way to distribute the dataset generally. But in fact, you
can extract the same data set, and add whatever other smart things you want
along with it, using the Diffbot APIs. Everything we did to get this
information is explained on our blog at

    http://blog.diffbot.com/diffbots-hackernews-trend-analyzer/

Sounds like you're already using the Diffbot service, but anyone who isn't can
sign up for a free access token on our 'pricing' page at
<http://www.diffbot.com/>. You'd need to analyze a few hundred thousand pages
to get this, which doesn't quite fit under the free plan. You might not want
to analyze as many years' worth of stuff as we did for this demo, though.

All the pieces and services we used for this, including all the text
extraction, topic detection, and crawling, are available to any user.

Have fun with it, and keep us informed about whatever cool stuff you build
with it, and of course tell us about any features or capabilities you wish
Diffbot could provide.

------
mayank
Funny, I just built an HN article catcher that uses Diffbot to collect and
classify submissions from the /new page [1]. I've been a Diffbot fan for quite
a while now (although their entity recognition/tag classifier needs a bit of
work as you can see from the categorization on my catcher page below).

[1] <http://lahiri.me/more>

I should add that their API is fantastic, and far better than using
BeautifulSoup/NLTK for extracting textual content from webpages.

~~~
tswaterman
Cool! How many articles, or what time period, did you use for this? It looks
like you're using only a subset of the topic tags -- did you make a list of
'interesting stuff' to filter against?

~~~
mayank
It's been running for about a week I think, and I'm just taking the top 80 or
so tags by article count. Glad you like it :)
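
For anyone wanting to try the same filter, here is a minimal sketch of "top N tags by article count". It assumes each article is a dict with a `tags` list; that shape and the `top_tags` name are made up for illustration and are not Diffbot's actual response format.

```python
from collections import Counter

def top_tags(articles, n=80):
    """Return the n most frequent tags as (tag, article_count) pairs."""
    counts = Counter()
    for article in articles:
        # Count each tag once per article, even if it is listed twice.
        counts.update(set(article.get("tags", [])))
    return counts.most_common(n)

# Made-up sample data, only to show the call.
sample = [
    {"title": "a", "tags": ["IPhone", "Apple"]},
    {"title": "b", "tags": ["Android", "Apple"]},
    {"title": "c", "tags": ["Apple", "Apple"]},
]
top = top_tags(sample, n=2)
```

Counting via `set(...)` means a tag repeated within one article still counts as one article, which matches "by article count" rather than raw tag occurrences.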

------
tliou
Had to figure out how to use it ... but interesting once you do! Android vs
iPhone on the Hacker News front page shows a spike for iPhone on launch dates,
but mediocre to no activity for Android. Is it because Android is less
interesting and not as innovative? Or not as fun to talk/read about?

[http://diffbot.com/robotlab/hackernews/#type=tags&item=I...](http://diffbot.com/robotlab/hackernews/#type=tags&item=IPhone&item=Android%20\(operating%20system\)&item=)

------
minimax
Neat! Wish I could select by just domain name (i.e. just nytimes.com rather
than the dozen or so whatever.nytimes.com subdomains).
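
A rough sketch of what that grouping could look like if you have the submission URLs yourself. It naively keeps the last two labels of the hostname, which is an assumption that breaks for suffixes like `co.uk`; a real version would consult the Public Suffix List. The `base_domain` helper and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def base_domain(url):
    """Collapse a URL to its last two hostname labels (naive)."""
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    # Naive assumption: the registrable domain is the last two labels.
    return ".".join(labels[-2:])

# Made-up sample URLs, only to show the grouping.
urls = [
    "http://www.nytimes.com/some/story.html",
    "http://bits.blogs.nytimes.com/another/post/",
    "http://blog.diffbot.com/diffbots-hackernews-trend-analyzer/",
]
counts = Counter(base_domain(u) for u in urls)
```

With this, both nytimes.com subdomains land in a single `nytimes.com` bucket.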

