
Downloading All of Hacker News Posts and Comments - sytelus
http://shitalshah.com/?p=1424
======
bertrandom
I uploaded it to Internet Archive:
[https://archive.org/details/HackerNewsStoriesAndCommentsDump](https://archive.org/details/HackerNewsStoriesAndCommentsDump)

Not sure two giant JSON files are the best format for this, but I used jp:
[http://www.paulhammond.org/jp/](http://www.paulhammond.org/jp/) to browse
through it.

~~~
sytelus
Thank you! I was trying to do that but couldn't find information on their
policy about accepting and hosting such data. I've added links to this as well
as Torrent links other folks had sent.

------
sytelus
Just wanted to start a thread on some project ideas for this data:

* Discover geek-friendly WordPress themes and plugins by analyzing CSS in stories posted on HN.

* My pet EVIL project: Extract self-identifying statements from comments and create profiles for HN users :).

* Find out the abandonment rate of veteran users.

* Find undiscovered great stories that didn't make it to the front page because of deficiencies in HN's ranking algorithm (for example, get links posted by people with 10K+ karma but without upvotes).

~~~
guernica
Construct phrases like Yoda, shall I. Unmask this Identity, never you shall.

~~~
jacquesm
Oh come on George, stop it.

------
ivan_ah
Awesome, this will be super useful for running ML experiments on the HN
stories.

Previously, Max Woolf worked on this
[http://minimaxir.com/2014/02/hacking-hacker-news/](http://minimaxir.com/2014/02/hacking-hacker-news/) via
[https://news.ycombinator.com/item?id=7291531](https://news.ycombinator.com/item?id=7291531)

~~~
minimaxir
I did, and I also put the Python code on GitHub; it uses the same
implementation as the OP:
[https://github.com/minimaxir/hacker-news-download-all-storie...](https://github.com/minimaxir/hacker-news-download-all-stories)

Fun fact: At 10,000 calls/hour and 1,000 objects per request, you can download
all stories AND comments in less than an hour.
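
A quick back-of-the-envelope check of that claim, as a rough sketch. The rate limit and page size are the figures stated above; the corpus sizes are the approximate counts quoted elsewhere in this thread (about 1.3m stories and 5.8m comments), so treat them as assumptions rather than exact numbers.

```python
import math

# Approximate corpus sizes quoted elsewhere in this thread (assumed, not exact).
STORIES = 1_300_000
COMMENTS = 5_800_000

# Figures taken from the comment above.
CALLS_PER_HOUR = 10_000
OBJECTS_PER_REQUEST = 1_000


def requests_needed(total_objects: int, per_request: int) -> int:
    """Number of paged requests needed to fetch every object."""
    return math.ceil(total_objects / per_request)


def hours_needed(total_objects: int) -> float:
    """Hours required if the full hourly call budget is used."""
    return requests_needed(total_objects, OBJECTS_PER_REQUEST) / CALLS_PER_HOUR


total = STORIES + COMMENTS
print(requests_needed(total, OBJECTS_PER_REQUEST))  # 7100
print(hours_needed(total))  # 0.71
```

Under these assumptions the whole download needs about 7,100 requests, which fits inside a single hour's call budget.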

As an aside: I tried to use ML on Hacker News stories and have had exactly
zero success (i.e. the predictive models are not statistically significantly
better than the NIR, the no-information rate).

~~~
jaredsohn
If you prefer the Python approach, I have forked minimaxir's code to work for
comments:

[https://github.com/jaredsohn/hacker-news-download-all-commen...](https://github.com/jaredsohn/hacker-news-download-all-comments)

I have a few other things in the works related to the hnsearch API (had them
almost ready for release a few months ago but then I got distracted); this
post is persuading me to finish them up soon. :)

------
SergeyHack
Have you considered sharing the data as a torrent? The current FileDropper
speed is quite low (~200KB/s)

P.S. Thanks for the data, you saved a good amount of our time.

~~~
pmalynin
magnet:?xt=urn:btih:A3E2200A9A99906476E2E88CA002477219A3C2C3&dn=HN&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.publicbt.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.ccc.de%3a80%2fannounce

~~~
DanBC
PLEASE add four spaces before very long unbroken strings.

------
danso
Danke...on my 1,000th day (which was quite a few days ago), I wanted to do an
analysis of how my upvoting/submission habits changed...that is, over the 3+
years I've been on HN, did my interests in hacking and languages diversify?
But in terms of what I upvoted, it's hard to tell without the entire post set
whether _my_ interests changed, or the composition of HN's submissions.

Also, it's great to be able to filter through and find all of the
highly-upvoted stories that I've missed out on, and programmatically push
them to Pinboard. Thanks for this.

------
howlett
Your "Stories Download URL" is the same as the "Comments Download URL".

~~~
nekopa
Easy fix, just change comments in the url to stories.

~~~
jaredsohn
I imagine the point of that comment was to suggest that the OP fixes it to
make it easier for others.

------
voltagex_
> Content Blocked (content_filter_denied)

> Content Category: "Suspicious"

Which is an odd catch-all category. It may be a keyword match from the domain.
That sucks. Anyone got the github link?

~~~
SergeyHack
[https://github.com/sytelus/HackerNewsData](https://github.com/sytelus/HackerNewsData)

------
nekopa
It's strange to see 1.3m stories and only 5.8m comments. I look forward to
examining the data to see how many stories have 0 comments.
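
One way such a count could be sketched, assuming each story record is a dict with a `num_comments` field. That field name and the sample records below are hypothetical and are not taken from the actual dump.

```python
from typing import Iterable, Mapping, Tuple


def zero_comment_stats(stories: Iterable[Mapping]) -> Tuple[int, int]:
    """Return (stories_with_zero_comments, total_stories)."""
    zero = 0
    total = 0
    for story in stories:
        total += 1
        # Treat a missing or null count the same as zero comments.
        if not story.get("num_comments"):
            zero += 1
    return zero, total


# Made-up sample records, only to exercise the function.
sample = [
    {"id": 1, "num_comments": 0},
    {"id": 2, "num_comments": 12},
    {"id": 3},
]
zero, total = zero_comment_stats(sample)
print(f"{zero} of {total} stories have no comments")
```

Because the function only iterates once, it should also work on a generator that streams records from a large file instead of loading everything into memory.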

~~~
meritt
Lots and lots of stories never make it anywhere near the front page. Mostly
spam.

~~~
nekopa
But I think the spam is caught and doesn't register through the API...

~~~
minimaxir
Stories which are dead will not show in the API, that's correct. But the
proportion of _submissions_ that make it to the front page / get any comments
is very low. (<10%)

------
agibsonccc
Thanks for the dataset! I've been wanting this corpus for a while. I have a
few ideas I want to run with this.

------
hopeless
What licence is the HN content made available under? I'm trying to find some
T&Cs that I might have agreed to, but I haven't found them yet. I don't think
you can assume that comments made here are free to copy, reuse, manipulate and
republish elsewhere.

------
simonwalton
Great work! Does anyone know if there is anything similar for Slashdot?

~~~
frik
The /. comments with their tags "Interesting", "Funny", "Insightful",
"Informative", "Flamebait" can be very helpful for machine learning purposes.

------
mimighost
Very helpful. I was looking for a dataset like this recently, and now here it
is. Great work.

------
LukeB_UK
Your "Fork me on Github" banner covers your hamburger menu icon.

------
naveen99
Would be nice if there were something similar for Reddit.

