

Daily SnapShots Of Hacker News - kirubakaran
http://www.kirubakaran.com/phr0zen/
I wrote a simple Python program to make daily snapshots of the front page stories and comments and post them in reverse chronological order. I wanted this tool because I didn't want to miss out on anything on the days when I couldn't spend much time here. A regular RSS reader didn't fit. Any suggestions welcome.
======
johnrob
Awesome. I had been thinking about building something like this for a while.
Here is a tip (I am handing over the rest of my idea to you now): Index the
content of the actual articles (follow the links). This would offer a perfect
way to search for stuff you once read on ycnews. I usually remember what I
read, not what the ycnews title was. It could also be easily extended to index
reddit, digg, or any other link driven site.
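
The article-indexing idea above could start as a simple inverted index. A minimal Python sketch, where the page texts are hypothetical stand-ins for fetched articles (a real version would follow each submission's link and extract the page text):

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index

def search(index, query):
    """Return page ids containing every word of the query (AND search)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = index.get(words[0], set()).copy()
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical fetched article texts keyed by HN item id
pages = {
    "60759": "what nine of the world's largest websites are running on",
    "60800": "amazon ec2 pricing and setup notes",
}
index = build_index(pages)
print(search(index, "ec2"))
```

Extending this to reddit or digg would only change the fetching side; the index itself is agnostic about where the links came from.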

~~~
kirubakaran
Thanks! I'll build that.

------
kirubakaran
I wrote a simple Python program to make daily snapshots of the front page
stories and comments and post them in reverse chronological order. I wanted
this tool because I didn't want to miss out on anything on the days when I
couldn't spend much time here. A regular RSS reader didn't fit. Any
suggestions welcome.
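
This is not the author's actual code, but the daily-snapshot idea can be sketched with the standard library alone, assuming the front page is read as RSS (HN serves one at news.ycombinator.com/rss) and each day's capture goes to a date-stamped file so a listing naturally sorts chronologically:

```python
import datetime
import xml.etree.ElementTree as ET

def parse_front_page(rss_text):
    """Extract (title, link) pairs from an RSS 2.0 feed string."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

def snapshot_filename(day=None):
    """Name the snapshot after the day it was taken; ISO dates sort cleanly."""
    day = day or datetime.date.today()
    return day.strftime("%Y-%m-%d") + ".html"

# Sample feed standing in for a live fetch of the HN front page RSS
sample = """<rss version="2.0"><channel>
<item><title>Example story</title><link>http://example.com</link></item>
</channel></rss>"""
print(parse_front_page(sample))
print(snapshot_filename(datetime.date(2007, 10, 1)))
```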

~~~
tuukkah
Nice! Of course, now you have a feed that would be useful for news aggregators
if you provided RSS. Here are some quick suggestions:

The front page isn't the ultimate snapshot. If you wanted that, wouldn't you
rather take the 20 submissions from the last 24 hours that got the most
upvotes? If I understand correctly what you're doing, you get some 1- and
2-point submissions that just happen to be on the front page at the time.

Returning to your use scenario, an alternative that gave you more control
over how much to read would be to take the time you last read HN and use that
to filter the list at <http://news.ycombinator.com/best>
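
The last-read-time filter suggested above can be sketched in a few lines; the (title, points, posted_epoch) tuple layout for submissions is an assumption made for the example:

```python
import time

def unread_best(submissions, last_read, limit=20):
    """Top `limit` submissions newer than `last_read`, by points descending.
    Each submission is a (title, points, posted_epoch) tuple."""
    fresh = [s for s in submissions if s[2] > last_read]
    return sorted(fresh, key=lambda s: s[1], reverse=True)[:limit]

now = time.time()
subs = [
    ("old hit", 120, now - 3 * 86400),   # before last read: excluded
    ("new hit", 45, now - 3600),
    ("new minor", 2, now - 7200),
]
last_read = now - 86400  # last visited HN a day ago
print(unread_best(subs, last_read, limit=2))
```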

~~~
kirubakaran
Thanks :-)

I wrote it so that it saves the front page because that is how I personally
read HN. On the days that I am not able to check HN (as happened to me last
weekend), I don't know what made it to the front page. You are correct that
this is certainly not the best way to do it. As bootload mentioned, the 'best'
page is limited to 180 links, and if the points gained by a new post are less
than those of the 180th post in 'best', it won't make it to that list. Maybe
we should collect links from the front and front+1 pages that are more than a
day old and sort them in descending order of points? Your thoughts welcome.

~~~
tuukkah
Indeed, I wasn't considering such implementation issues as the link limit of
180. Collecting links from some of the /news pages should be a good
approximation. For longer times away, the collected links could be stored
periodically and, upon return, sorted by points, descending.
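
The collect-then-sort approach described could be sketched as below, assuming each stored snapshot maps item ids to (title, points) and that the highest points seen for an item should win (a post's score changes between captures):

```python
def merge_snapshots(snapshots):
    """Fold periodic snapshots into one dict keyed by item id,
    keeping the highest observed points for each item."""
    best = {}
    for snap in snapshots:
        for item_id, (title, points) in snap.items():
            if item_id not in best or points > best[item_id][1]:
                best[item_id] = (title, points)
    return best

def ranked(best):
    """Items sorted by points, descending - the catch-up reading list."""
    return sorted(best.values(), key=lambda t: t[1], reverse=True)

# Two hypothetical captures taken a few hours apart
snaps = [
    {"1": ("story a", 5), "2": ("story b", 12)},
    {"1": ("story a", 30), "3": ("story c", 8)},
]
print(ranked(merge_snapshots(snaps)))
```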

------
bootload
_"... I wrote a simple python program to make daily snap-shots of the front
page stories and comments and post ..."_

I'm doing 15-minute snapshots in XML format from the news feed /news (meaning
the score threshold is >=2). You can find the feed here ~
[http://goonmail.customer.netspace.net.au/hackerid/xml/hacker...](http://goonmail.customer.netspace.net.au/hackerid/xml/hackerid.xml)

I capture posts from /news including ...

    
    
      <user>iamyoohoo</user>
      <inception>118 days ago</inception>
      <karma>60</karma>
      <item>60759</item>
      <title>
      What nine of the world's largest websites are running on
      </title>
      <url>http://royal.pingdom.com/?p=173</url>
      <points>23</points>
      <comments>9</comments>
      <posted>11 hours</posted>
    
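A quick sketch of consuming one such record in Python, assuming each post is wrapped in a hypothetical `<post>` element (the real feed's enclosing element names may differ):

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element; field names taken from the sample above.
fragment = """<post>
  <user>iamyoohoo</user>
  <item>60759</item>
  <title>What nine of the world's largest websites are running on</title>
  <url>http://royal.pingdom.com/?p=173</url>
  <points>23</points>
  <comments>9</comments>
</post>"""

post = ET.fromstring(fragment)
# Flatten the child elements into a plain dict of tag -> text
record = {child.tag: child.text for child in post}
print(record["title"], record["points"])
```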

Q's

\- Are you going to sort them by anything (date, score, user) or just date
order?

\- Supply a data feed?

\- Identify users?

\- Are you archiving them?

Like yourself, I wanted to grab stuff so I wouldn't miss it. Another goal was
to capture metadata and add some useful extras like

\- identify users by various metrics

\- graphing of data

\- capture relationships b/w users, data

You can only do this with raw data. I'm thinking about archiving the data at
some point at least.

Good stuff btw. I like this because the more resources we have like this, the
more fun you can have and the more value you can add. If I have one request,
it would be to add a data feed that anyone can pull and use.

You can read how here ~
<http://goonmail.customer.netspace.net.au/hackerid/colophon/>

~~~
kirubakaran
Wow! Your HackerID tool is impressive. (I was wondering how I missed it, then
realized I was vacationing in LA when you posted it... the very reason I
wrote 'phr0zen news')

At the moment, I am just grabbing the front page RSS and posting it like a
blog. I would like to do the following too:

1\. Deep Search - Index the target webpage and comments, weighted by the
points gained by the post/comment itself and the submitter/commenter karma.
For example, I want to be able to search and read everything ever said here
about EC2. This will be very useful for me personally, so I have good selfish
reasons to do this.

2\. Relationship between users (as you mentioned) - a graphical
representation... what we uncover might be interesting. ( Users that pg
replied to will have a 'knighted' symbol next to them ;-) )
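
The weighting in point 1 is underspecified in the thread; here is one illustrative Python scoring function where the particular weights are made-up assumptions, not anything the author proposed:

```python
def score(item_points, commenter_karma, text_hits,
          w_points=1.0, w_karma=0.1):
    """Illustrative ranking: raw text matches boosted by the item's points
    and the commenter's karma. Weights are arbitrary assumptions."""
    return text_hits * (1 + w_points * item_points + w_karma * commenter_karma)

# Two hypothetical search hits for the query "ec2"
results = [
    ("ec2 pricing comment", score(23, 60, text_hits=3)),
    ("ec2 passing mention", score(2, 500, text_hits=1)),
]
results.sort(key=lambda r: r[1], reverse=True)
print(results)
```

Tuning `w_points` against `w_karma` decides whether a popular post outranks a high-karma commenter's aside; any real version would want to experiment with both.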

I will supply a feed as soon as I figure that out. Is there a way to grab ALL
the data from HackerID? BTW, when I read 'name the best hacker news
contributors', I thought it was a contest :-)

~~~
bootload
_"... Is there a way to grab ALL the data from HackerID? ..."_

Yes.

I've been saving the individual files with ISO 8601 timestamp filenames
ending in ".xml" since the system started. I'm now going to write some code
to store the data in a db, which beats the 4MB of text files I have. I also
want to write a quick script to add the archived files to the db.
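
The archive-to-db step could be sketched with `sqlite3` from the standard library. The table layout and the `<snapshot>`/`<post>` element names are assumptions for the example; a real script would open a db file and glob the archived `*.xml` snapshots instead of using a literal string:

```python
import sqlite3
import xml.etree.ElementTree as ET

# In-memory db for the sketch; a real run would use a file path.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE posts
              (snapshot TEXT, item INTEGER, user TEXT, points INTEGER)""")

def load_snapshot(db, name, xml_text):
    """Insert every <post> of one archived capture, tagged with its
    snapshot timestamp so queries can slice by capture time."""
    root = ET.fromstring(xml_text)
    for post in root.iter("post"):
        db.execute("INSERT INTO posts VALUES (?, ?, ?, ?)",
                   (name,
                    int(post.findtext("item")),
                    post.findtext("user"),
                    int(post.findtext("points"))))

sample = """<snapshot>
<post><item>60759</item><user>iamyoohoo</user><points>23</points></post>
</snapshot>"""
load_snapshot(db, "2007-09-30T00:16:42Z", sample)
print(db.execute("SELECT user, points FROM posts").fetchall())
```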

_"... Users that pg replied to will have a 'knighted' symbol next to them ;-)
..."_

Nice touch. What I'm doing is checking whether a user is in the leaderboard
top 30, top 10 or at number 1, and adding a star marked 30, 10 or 1. I'm also
working on filters for 'pg', meaning he's off the board.

I tell you what would be nice: synchronising the snapshot captures (by ISO
8601 time stamps set to Zulu time, e.g. 2007-09-30T00:16:42Z). This means you
can at some later time import any data into a db and do a query by time.
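
Aligning the two tools' captures to a shared boundary might look like this: round each UTC capture time down to the 15-minute mark before formatting it as an ISO 8601 Zulu stamp (the 15-minute period matches the snapshot interval mentioned above):

```python
import datetime

def capture_stamp(now=None, period_minutes=15):
    """Round a UTC time down to the shared capture boundary and
    format it as an ISO 8601 Zulu timestamp."""
    now = now or datetime.datetime.utcnow()
    minute = (now.minute // period_minutes) * period_minutes
    aligned = now.replace(minute=minute, second=0, microsecond=0)
    return aligned.strftime("%Y-%m-%dT%H:%M:%SZ")

# A capture at 00:16:42Z lands in the 00:15 bucket
print(capture_stamp(datetime.datetime(2007, 9, 30, 0, 16, 42)))
```

Two programs using the same rounding rule produce identical stamps for the same window, so their rows can later be joined by time in a db.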

 _"... when I read 'name the best hacker news contributers', I thought it is a
contest :-) ..."_

Yeah, good point. Might have to change/tweak it a bit. Maybe add a question
mark.

My objectives at the moment are to

\- do deep search on Hacker News by article in _"new/"_ and extract WHO is
making comments, and build some sort of hierarchy of user interaction with
links to comments

\- do some stats on the hacker data

\- " a graphical representation." I'll be using graphviz ~
<http://www.graphviz.org/Gallery.php> You can see some initial testing I've
done here (with static data) ~ <http://flickr.com/photos/bootload/1291043939/>
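
Since graphviz consumes the plain-text DOT language, the user-interaction graph can be emitted without any extra libraries. A sketch with hypothetical reply pairs (the output string can be piped straight to the `dot` tool):

```python
def reply_graph(replies):
    """Build a Graphviz DOT digraph from (commenter, replied_to) pairs."""
    lines = ["digraph hn {"]
    for src, dst in replies:
        lines.append('  "%s" -> "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical who-replied-to-whom data
dot = reply_graph([("pg", "kirubakaran"), ("bootload", "kirubakaran")])
print(dot)
```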

On the last one I'm finding a lot of variation between karma, comments, points
and time. Interesting to see. The real objective is to release this as data
rather than just pages, so users can use the data (not sure about RT) to look
at things on a particular day.

My objectives are different from yours. Keep up the good work, and please try
to make the data free for others to use.

