
Full Hacker News database for download (posts, comments, points, date, username) - ronnier
http://api.ihackernews.com/?db
======
Silhouette
Am I really the only person who has misgivings about this? I contribute my
words to HN, where they can be seen in context and where they are viewed by
the same community that I am interacting with. I don't contribute them for
arbitrary other uses off the site.

Unless I have missed something, posters who submit their comments here do not
automatically release them into the public domain. In fact, I have seen no
legal statement anywhere about transfer of copyright as a condition of
posting, so it's not clear that posters grant anyone any rights at all,
beyond (probably) an implicit licence for the operators of HN to publish the
comments on the site and for visitors to read them while browsing in the
normal way. That would make downloading and sharing the entire HN database in
this way an obvious infringement of the copyright of every poster here.

Sorry if this seems a bit OTT, but some of us watched many comments we
contributed to the community in the Usenet days being appropriated by
long-term Usenet archives that then republished them out of context, covered
in advertising, with comments/ratings attached that weren't open to the rest
of the Usenet community, and so on. That is basically profiting from others'
work without their knowledge or consent, potentially at the expense of the
community the poster originally wished to support, and I have a problem with
that.

~~~
moe
There's not much point in lamenting how your words might be appropriated in
ways you don't like.

If you are worried about that, your only option is to refrain from
participating in public forums.

Otherwise the golden rule of the internet applies: once it's out on the
internet, it's out on the internet, and there's no taking it back.

~~~
follower
> Otherwise the golden rule of the internet applies: once it's out on the
> internet, it's out on the internet, and there's no taking it back.

That sounds suspiciously like the "The web is considered 'public domain'"
argument:

<http://news.ycombinator.com/item?id=1868736>

~~~
ramidarigaz
The "The web is considered 'public domain'" argument seems to be said by
people who have personal interest in that statement being true.

The "When it's out on the internet..." is more a statement of inevitability.
The person making the statement may not participate in taking the content,
they are just pointing out that it's inevitable that other people will.

------
dejb
Cool. Now in XX years' time, after all my expressed opinions are proven to be
correct, I can fire up an intelligent program to try to track down everyone
who has ever disagreed with me and say 'I told you so'.

------
Smerity
Thanks! This opens up a number of really interesting possibilities when mixed
with people's expertise in search and machine learning / natural language
processing.

A long time ago I got hold of a large chunk of Slashdot's stories and
comments. The text and karma ratings for each post led me to try some fun
experiments: automatically extracting the community's sentiment towards
certain topics, or trying to mine Slashdot memes.
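
One crude way to do that kind of thing (a sketch; the keyword matching and
karma-averaging here are illustrative, not the actual experiments):

    # Crude topic "sentiment" proxy: average karma of comments that mention
    # a keyword. Assumes comments come as (text, karma) pairs; the keywords
    # are illustrative.
    from collections import defaultdict

    def karma_by_topic(comments, keywords):
        totals = defaultdict(lambda: [0, 0])  # keyword -> [karma_sum, count]
        for text, karma in comments:
            lowered = text.lower()
            for kw in keywords:
                if kw in lowered:
                    totals[kw][0] += karma
                    totals[kw][1] += 1
        return {kw: s / n for kw, (s, n) in totals.items() if n}

    comments = [("I love Python", 12), ("Java is verbose", 3)]
    print(karma_by_topic(comments, ["python", "java"]))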

I've wanted to play around with the comments on Hacker News for some time,
given the wealth of knowledge most comments hold, but I felt that crawling
would be a bad idea, as I certainly didn't want to drive up PG's bandwidth
costs and server load.

Think about it: HN is a community full of people like me, and if we all
crawled HN to get that data it would get somewhat ugly, so thanks for sharing
your data ;)

------
il
I can't wait to see what everyone does with this data; there are lots of
interesting insights to be gleaned from it.

For example, a more comprehensive list of top domains, domains with most
upvotes, domains with most unique people submitting, etc.

~~~
Zev
SearchYC has some of these already: <http://top.searchyc.com/>

------
robryan
One of the problems with Hacker News is that while there is great discussion,
whether on a short-lived story or evergreen advice, it pretty much fades into
obscurity a couple of days after it is posted.

There have been curation efforts in the past, and collections like this make
the content more accessible and make it feasible that someone will apply some
good NLP to organise the data in a way that surfaces the older content which
is still relevant.

------
chasingsparks
Thanks!

I have actually been crawling HN for the past three weeks, keeping within
PG's "a couple of pages per minute" limit. This is far more convenient.

~~~
DanielRibeiro
For this simple query, these links usually suffice:

<http://news.ycombinator.com/submitted?id=pg>

<http://news.ycombinator.com/threads?id=pg>

~~~
chasingsparks
I meant PG's restriction of spidering HN at a rate of a few pages per minute,
not posts limited to PG as the user.

------
pak
Now this is neat. I spent a bit of today reverse-engineering the Better HN
Chrome extension to see how it looked up URLs on HN
(<http://json-automatic.searchyc.com/domains/find?url=%s>, for anybody who
wondered; I can't find this API documented anywhere). Now, if this API holds
up, it seems like there might be a more sustainable way of doing what I was
planning.

------
codefisher
It is a pity that the API does not have a way of getting what appears on the
front page. I was planning to create a script that scanned the front page and
then applied pg's algorithm from "A Plan for Spam" to remove links that I
think are off-topic. Actually turning that into a desktop client would be
rather cool, or maybe a Firefox extension, as that is more my kind of thing.
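
A loose sketch of that idea: naive Bayes over title tokens in the spirit of
"A Plan for Spam", not pg's exact algorithm (the training titles are made
up):

    # A loose "Plan for Spam"-style filter over story titles: score tokens
    # by how often they appear in off-topic vs. on-topic training titles,
    # then combine. A simplification, not pg's exact method.
    import re
    from collections import Counter

    def tokens(title):
        return re.findall(r"[a-z0-9']+", title.lower())

    def train(on_topic, off_topic):
        good, bad = Counter(), Counter()
        for t in on_topic:
            good.update(tokens(t))
        for t in off_topic:
            bad.update(tokens(t))
        probs = {}
        for tok in set(good) | set(bad):
            p = bad[tok] / (good[tok] + bad[tok])
            probs[tok] = min(0.99, max(0.01, p))  # clamp, as the essay does
        return probs

    def offtopic_score(title, probs):
        p_bad = p_good = 1.0
        for tok in tokens(title):
            p = probs.get(tok, 0.4)  # unseen tokens lean slightly on-topic
            p_bad *= p
            p_good *= 1.0 - p
        return p_bad / (p_bad + p_good)

    probs = train(on_topic=["Show HN: my compiler"],
                  off_topic=["Celebrity gossip roundup"])
    print(offtopic_score("New compiler released", probs))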

~~~
ronnier
It does: <http://api.ihackernews.com/page>
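
For example (a quick sketch; it assumes the endpoint returns JSON and makes
no claims about the field names):

    # Fetch the front page from the ihackernews API and dump whatever JSON
    # comes back; no field names are assumed here.
    import json
    import urllib.request

    with urllib.request.urlopen("http://api.ihackernews.com/page") as resp:
        data = json.load(resp)
    print(json.dumps(data, indent=2)[:500])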

~~~
codefisher
Reminder to self: don't multitask and half-read stuff before posting
comments.

------
bambax
It looks great, but it appears to be down. Every request of the form

<http://api.ihackernews.com/profile/pg>

answers

{"username":null,"createdAgo":null,"karma":0,"about":null,"version":"1.0","cachedOnUTC":"\/Date(-62135575200000)\/"}

and this page

<http://api.ihackernews.com/page>

says this

Server Error in '/' Application.

~~~
ronnier
My IP addresses have been blocked.

~~~
bambax
Oh, OK, thanks for the info. But did it send a request to HN for every
request it received? It could have answered with the data it had...?

Now everything is a 404; did you take the service down completely?

~~~
ronnier
Yes, I took it all down. I need to rethink this. Clearly some folks aren't
happy with it at all. Not that intentions really matter, but my intentions
were only to help all of the people who are making HN apps. I figured we could
reduce the need to scrape HN if we just distributed the data.

~~~
DanielRibeiro
As an API, it was rather unknown. I guess what really ticked people off was
the redistribution of the complete data. There are several custom private
APIs that scrape HN already (see BackType, for instance:
<http://www.backtype.com/page/paulgraham.com%2Ffounders.html> and
[http://www.backtype.com/url/news.ycombinator.com%252fuser%3F...](http://www.backtype.com/url/news.ycombinator.com%252fuser%3Fid%3Dpg)
and <http://scraperwiki.com/scrapers/y-combinator-scraper/>), and people
don't seem to care in the slightest.

Not to mention there are open-source tools to scrape HN yourself:
<https://github.com/JackDanger/hacker_news>

------
aditya
Direct torrent link:
[http://api.ihackernews.com/torrents/hn_full_11-07-2010.zip.t...](http://api.ihackernews.com/torrents/hn_full_11-07-2010.zip.torrent)

------
mattyb
Thank you.

------
nl
Based on the front-page example, it looks like the database doesn't contain
the comment structure (i.e., which comments go with which article, and which
comments are parents of others).

Am I missing something?

~~~
ronnier
Sorry, top-level posts don't have a parent; when that's null, it's excluded
from the XML. Comments all have a ParentID. I just added ParentID to the
example.
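
Given IDs and ParentIDs, rebuilding the threads is only a few lines (a
sketch; the rows here are hypothetical dicts keyed by the dump's field
names):

    # Rebuild comment trees from ID/ParentID. Rows are assumed to be dicts
    # using the dump's field names; top-level posts simply lack ParentID.
    from collections import defaultdict

    def build_threads(rows):
        roots, children = [], defaultdict(list)
        for row in rows:
            parent = row.get("ParentID")
            if parent is None:
                roots.append(row)
            else:
                children[parent].append(row)
        return roots, children

    rows = [{"ID": 1}, {"ID": 2, "ParentID": 1}, {"ID": 3, "ParentID": 2}]
    roots, children = build_threads(rows)
    print(roots[0]["ID"], [c["ID"] for c in children[1]])  # 1 [2]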

------
jacquesm
Cue dozens of made-for-AdSense sites based on this in 3, 2, 1...

------
Anon84
I was hoping for a dump of the _actual_ database...

~~~
Anon84
I meant the actual YC/PG Hacker News database, not one reverse-engineered
from the HTML output.

~~~
aditya
What's the difference? Other than parsing issues?

~~~
Anon84
Take a look at the example on the page:

    
    
    <HackerNews>
      <row>
        <ID>1</ID>
        <Url>http://ycombinator.com</Url>
        <Title>Y Combinator</Title>
        <Text />
        <Username>pg</Username>
        <Points>39</Points>
        <Type>1</Type>
        <Timestamp>2006-10-09T20:33:12.700</Timestamp>
        <CommentCount>15</CommentCount>
      </row>
    </HackerNews>
    

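Parsing the dump itself is the easy part; a minimal sketch with Python's
ElementTree, assuming a file of flat <row> elements like the sample (the
filename is a guess based on the torrent name):

    # Stream <row> elements out of the dump without loading it all at once.
    # The filename and flat layout are assumptions based on the sample above.
    import xml.etree.ElementTree as ET

    def read_rows(path):
        for _, elem in ET.iterparse(path):
            if elem.tag == "row":
                yield {child.tag: child.text for child in elem}
                elem.clear()  # keep memory use flat on a large dump

    for row in read_rows("hn_full_11-07-2010.xml"):
        print(row["Title"], row["Points"])
        break
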
If you really want to understand the HN community (or online communities in
general), this data doesn't tell you much. There are many interesting
questions this data _cannot_ answer.

When was each of those 39 points gained?

Who upvoted?

How many points did the submission have when each comment was posted?

How does the number of points affect the number of comments?

Are there cascades of votes (several users voting in quick succession)?

For each of the comments, who voted up and who voted down?

~~~
mahmud
I don't think the actual HN "database" has that much info. You would need to
reconstruct it from the complete server logs, along with application logs for
PG's direct updates (assuming he doesn't use the web interface).

~~~
michaelhart
I imagine it does... It has to at least know who upvoted, to prevent revotes.
The timestamp is frequently stored as well.

~~~
catch23
Maybe they use a Bloom filter to store who voted on each comment; that would
be an efficient way to prevent duplicate upvotes without needing to store
much data.
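
A toy version of that idea, purely illustrative (the sizes and hashing
scheme are assumptions, not anything HN actually does):

    # Toy Bloom filter for "has this user voted on this item?": k hashes
    # into an m-bit array. Parameters are illustrative, not HN's.
    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=1000, k_hashes=7):
            self.m, self.k = m_bits, k_hashes
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, key):
            for i in range(self.k):
                digest = hashlib.sha256(b"%d:%s" % (i, key)).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, key):
            return all((self.bits[p // 8] >> (p % 8)) & 1
                       for p in self._positions(key))

    votes = BloomFilter()
    votes.add(b"pg:item42")
    # True False (the second could, rarely, be a false positive)
    print(b"pg:item42" in votes, b"moe:item42" in votes)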

~~~
nkurz
I don't think this would work well. The occasional false-positives would look
like annoying glitches, every now and then making it look like your account
had been hijacked by hiding the buttons as if you had already 'voted'.
Although maybe I'm not getting the math right?
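
For reference, the standard estimate is p = (1 - e^(-kn/m))^k, so the glitch
rate depends on how many bits you spend per voter:

    # Bloom filter false-positive estimate: p = (1 - exp(-k*n/m)) ** k.
    # The numbers (1000 bits, 100 voters, 7 hashes) are illustrative.
    import math

    def fp_rate(m_bits, n_items, k_hashes):
        return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

    print(fp_rate(m_bits=1000, n_items=100, k_hashes=7))  # ~0.008

About 0.8% with those numbers: rare, but exactly the kind of occasional
"already voted" glitch described above.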

Also, the existence of the 'saved stories' link on the user page would seem
to indicate that, for stories at least, a full list is being stored.

------
Sevki
It's taken down.

