Daily SnapShots Of Hacker News

johnrob · on Sept 29, 2007

Awesome. I had been thinking about building something like this for a while. Here is a tip (I am handing over the rest of my idea to you now): Index the content of the actual articles (follow the links). This would offer a perfect way to search for stuff you once read on ycnews. I usually remember what I read, not what the ycnews title was. It could also be easily extended to index reddit, digg, or any other link driven site.

kirubakaran · on Sept 29, 2007

Thanks! I'll build that.

bootload · on Sept 29, 2007

"... I wrote a simple python program to make daily snap-shots of the front page stories and comments and post ..."

I'm doing 15 minute snapshots in xml format from the news feed /news (meaning score threshold is >=2). You can find the feed here ~ http://goonmail.customer.netspace.net.au/hackerid/xml/hacker...

I capture posts from /news including ...

  <user>iamyoohoo</user>
  <inception>118 days ago</inception>
  <karma>60</karma>
  <item>60759</item>
  <title>
  What nine of the world's largest websites are running on
  </title>
  <url>http://royal.pingdom.com/?p=173</url>
  <points>23</points>
  <comments>9</comments>
  <posted>11 hours</posted>

Q's

- Are you going to sort them by anything date, score, user or just date order?

- Supply a data feed? - identify user

- are you archiving them?

Like yourself I wanted to grab stuff so I wouldn't miss it. Another was to capture metadata and add some useful extra stuff like

- identify users by various metrics

- graphing of data

- capture relationships b/w users, data

You can only do this with raw data. I'm thinking about archiving the data at least at some time.

Good stuff btw. I like this because the more resources we have like this the more fun you can have and more value you can add. If I have one request that would be to add a data feed that anyone can pull and use.

You can read how here ~ http://goonmail.customer.netspace.net.au/hackerid/colophon/

kirubakaran · on Sept 29, 2007

Wow! Your HackerID tool is impressive. (I was wondering how I missed that and realized that I was vacationing in LA when you posted that... the very reason I wrote 'phr0zen news')

At the moment, I am just grabbing the front page RSS and posting it like a blog. I would like to do the following too:

1. Deep Search - Index target webpage, comments, weighted by points gained by post/comment itself and submitter/commenter-karma. For example, I was to be able to search and read everything ever said here about EC2. This will be very useful for me personally, so I have good selfish reasons to do this.

2. Relationship between users (as you mentioned) - a graphical representation... what we uncover might be interesting. ( Users that pg replied to will have a 'knighted' symbol next to them ;-) )

I will supply a feed as soon as I figure that out. Is there a way to grab ALL the data from HackerID? BTW, when I read 'name the best hacker news contributers', I thought it is a contest :-)

bootload · on Sept 30, 2007

"... Is there a way to grab ALL the data from HackerID? ..."

Yes.

I've been saving the individual files since I've run the system as ISO8602 filename ".xml" since it started. I'm going to now write some code to store it in a db beats the 4Mb of text files I have. Also I want to write a quick script to add the archived files to the db.

"... Users that pg replied to will have a 'knighted' symbol next to them ;-) ..."

Nice touch. What I'm doing is checking if users on the leader board top 30 or top10 or number 1 and adding a star with 10, 30 or 1. Im also working on filters for 'pg' meaning he's off the board.

I tell you what would be nice is to synchronise the snapshot captures (by ISO-8602 time stamps set to Zulu time ie: 2007-09-30T00:16:42Z or now). This means you can at some later time import any data into a db and do a query by time.

"... when I read 'name the best hacker news contributers', I thought it is a contest :-) ..."

Yeah good point. Might have to change/tweak this it a bit. Maybe add a question mark.

My objectives at the moment are to

- do deep search on hacker by article in "new/" and extract WHO is making comments and build some sort of heirarchy of user interaction with link to comments

- do some stats on the hacker data

- " a graphical representation." I'll be using grahpviz ~ http://www.graphviz.org/Gallery.php You can see here some initial testing I've done (with static data) ~ http://flickr.com/photos/bootload/1291043939/

On the last one I'm finding a lot of variation b/w karma, comments. points and time. Interesting to see. The real objective is to release this as data rather than just pages so users can use the data (not sure about RT) to look at things on a particular day.

My objectives are different from yours keep up the good work and plz try to make the data free for use by others to use.

kirubakaran · on Sept 29, 2007

I wrote a simple python program to make daily snap-shots of the front page stories and comments and post it in reverse chronological order. I wanted this tool as I didn't want to miss out on anything on the days that I couldn't spend much time here. Regular RSS reader didn't fit. Any suggestions welcome.

tuukkah · on Sept 29, 2007

Nice! Of course, now you have a feed that would be useful for news aggregators if you provided RSS. Here's some quick suggestions:

Front page isn't the ultimate snapshot. If you wanted that, wouldn't you rather take the 20 submissions from the last 24 hours that got the most up-votes? If I understand correctly what you're doing, you get some 1 and 2-point submissions that just happen to be on the front page at the time.

Returning more to your use scenario, an alternative that gave you more control on how much to read would be to take the time you last read HN and use that to filter the list at http://news.ycombinator.com/best

kirubakaran · on Sept 29, 2007

Thanks :-)

I wrote it so that it saves front page because that is how I personally read HN. On the days that I am not able to check HN (like it happened to me last weekend), I don't know what made it to the front page. You are correct that this is certainly not the best way to do it. As bootload mentioned, the 'best' page is limited to 180 links and if the points gained a new post is less than that of the 180th post in 'best', then it won't make it to that list. May be we should collect links from front and front+1 page that are more than a day old and sort them in descending order of points? Your thoughts welcome.

tuukkah · on Sept 30, 2007

Indeed, I wasn't considering such implementation issues as the link limit of 180. Collecting links from some of the /news pages should be a good approximation. For longer times away, the collected links could be stored periodically and upon return sorted by points, descending.

bootload · on Sept 29, 2007

"... take the time you last read HN and use that to filter the list at http://news.ycombinator.com/best ..."

I'm note sure /best would be that useful. For all intensive purposes it's static and changes only when a post beats the points on the list. Do you mean compare against http://news.ycombinator.com/news which has the current highest posts?

"... Front page isn't the ultimate snapshot. If you wanted that, wouldn't you rather take the 20 submissions from the last 24 hours that got the most up-votes? If I understand correctly what you're doing, you get some 1 and 2-point submissions that just happen to be on the front page at the time. ..."

What you describe here is pretty much what /news and the RSS file http://news.ycombinator.com/rss does anyway. So you could just parse the http://news.ycombinator.com/news to do that or the RSS file. But the RSS file has some restrictions.

One further point. An boundary case you might run up against is user 'deleted', which occurs when editors delete a submission. It only occurs only occasionally and you do not see it on HN pages.

tuukkah · on Sept 30, 2007

Yeah, I only meant /best in the abstract, not considering the implementation limit of 180 links. In practice you could collect this information over time yourself.

The difference to what /news does would be that submissions with little points would not show in the results even briefly, and submissions with a lot of points wouldn't fall off even if you're away for a long time.