
New HN data dump available with over 14.5m entries - cdman
https://archive.org/details/14566367HackerNewsCommentsAndStoriesArchivedByGreyPanthersHacker
======
minimaxir
If you're interested in playing with Hacker News data and don't want to
download the entire dataset (or don't have the CPU/memory to perform large
JOINs on stories/comments), you can use the Google BigQuery HN dataset, which
is now up-to-date:
[https://cloud.google.com/bigquery/public-data/hacker-news](https://cloud.google.com/bigquery/public-data/hacker-news)
(specifically, the .full table, which combines both stories and comments; the
dedicated tables are not up-to-date)
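
For anyone who wants to try this from code rather than the web console, here is a minimal sketch of a first query. The fully qualified table name and the `type` column are assumptions about the public dataset's schema, and `build_type_count_query` and `run_query` are hypothetical helper names; check the schema in the console first.

```python
# Assumed location of the combined stories+comments table; verify before use.
FULL_TABLE = "bigquery-public-data.hacker_news.full"


def build_type_count_query(table: str = FULL_TABLE) -> str:
    """Return SQL that counts items per type (assumes a `type` column)."""
    return (
        "SELECT type, COUNT(*) AS n\n"
        f"FROM `{table}`\n"
        "GROUP BY type\n"
        "ORDER BY n DESC"
    )


def run_query(sql: str) -> list:
    """Run the SQL and return all rows.

    Needs the google-cloud-bigquery package and a configured cloud project.
    """
    # Imported lazily so the query builder works without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    return list(client.query(sql).result())
```

Calling `run_query(build_type_count_query())` should return one row per item type, provided the assumed schema holds.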

~~~
venning
I see this link mentioned all the time, but every time I try it I can't get it
to work.

Specifically, the "GO TO THE HACKER NEWS DATASET" big blue button on that
page. It kicks me over to a Google Cloud console link, which spins for a few
seconds, and then brings up a "Welcome to BigQuery!" modal. The only thing I
can do then is click "Create a Project", which then kicks me over to the
generic console with a listing of all APIs.

Am I missing something?

~~~
minimaxir
You'll need to create a GCP (Google Cloud Platform) project before you can use
BigQuery (you don't need to provide a credit card if you remain in the free
tier).

------
ers35
See also: A dump of the stories, comments, and users from the Firebase API as
a SQLite database with a full text search index:
[https://archive.org/details/hackernews-2017-05-18.db](https://archive.org/details/hackernews-2017-05-18.db)
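
Since the table layout of that database is not described here, a safe first step after downloading is to ask SQLite itself what the file contains. This is a minimal sketch; `list_schema` is a hypothetical helper name, and no table names are assumed.

```python
import sqlite3


def list_schema(conn: sqlite3.Connection) -> list:
    """Return (type, name) pairs for every user-defined object in the database."""
    rows = conn.execute(
        "SELECT type, name FROM sqlite_master "
        "WHERE name NOT LIKE 'sqlite_%' "  # skip SQLite's internal objects
        "ORDER BY type, name"
    )
    return [(obj_type, name) for obj_type, name in rows]
```

Pass it a connection opened with `sqlite3.connect(path)`, where `path` is wherever you saved the downloaded file. A full text search index normally shows up as a virtual table plus several shadow tables.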

~~~
B1FF_PSUVM
Can you tell us the time period covered, and give an estimate of the % of
comments covered, for both your DB and the dump posted?

Thanks.

~~~
ers35
My DB covers from
[https://hacker-news.firebaseio.com/v0/item/1.json?print=pret...](https://hacker-news.firebaseio.com/v0/item/1.json?print=pretty)
(10/09/2006 at 6:21pm UTC) to
[https://hacker-news.firebaseio.com/v0/item/14372035.json?pri...](https://hacker-news.firebaseio.com/v0/item/14372035.json?print=pretty)
(05/18/2017 at 11:58pm UTC)

The dump posted claims to cover from 1 to 14566367 (06/16/2017 at 3:03am UTC)
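
To check those boundary dates yourself, here is a rough sketch that fetches an item by id from the same API and converts its timestamp. It assumes the item JSON carries a `time` field in Unix seconds; `item_url`, `fetch_item`, and `to_utc` are hypothetical helper names.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen

BASE_URL = "https://hacker-news.firebaseio.com/v0/item"


def item_url(item_id: int) -> str:
    """Build the JSON endpoint URL for one item id."""
    return f"{BASE_URL}/{item_id}.json"


def fetch_item(item_id: int, timeout: float = 10.0):
    """Download and decode one item (may be None if the id does not exist)."""
    with urlopen(item_url(item_id), timeout=timeout) as resp:
        return json.load(resp)


def to_utc(unix_seconds: int) -> datetime:
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
```

With network access, `to_utc(fetch_item(1)["time"])` should land on the 10/09/2006 timestamp quoted above, if the `time` assumption holds.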

~~~
B1FF_PSUVM
Ah, OK, I forgot that comments have item ids too, which ensures 100% comment
coverage.

(I read "the story vote count is inaccurate for certain stories because it is
only scraped once and not updated" and thought some comments might be left out
too.)

So, 14.5M items at a minimum of 10 sec. per comment is about 145M seconds:
at least 40k hours' worth, probably one order of magnitude more. That's just
writing time; reading time is maybe 3 orders of magnitude more.
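
The back-of-the-envelope figure can be reproduced in a few lines. This sketch takes the dump's highest item id (14566367) as the item count and treats every item as a comment, which overstates things slightly since stories are included; `writing_hours` is a hypothetical helper name.

```python
ITEMS = 14_566_367      # highest item id in the posted dump
SECONDS_PER_ITEM = 10   # assumed minimum writing time per comment


def writing_hours(items: int = ITEMS, seconds_per_item: float = SECONDS_PER_ITEM) -> float:
    """Total writing time in hours for the given item count."""
    return items * seconds_per_item / 3600


total_seconds = ITEMS * SECONDS_PER_ITEM
print(f"{total_seconds:,} seconds = {writing_hours():,.0f} hours")
# prints "145,663,670 seconds = 40,462 hours"
```

That is roughly 145 million seconds, which is where the 40k-hour lower bound comes from.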

Modern pyramids, they're impalpable ...

------
natch
Kudos to archive.org for hosting torrents. It would be helpful to know the
size of the download up front. Nice clean web page design; would love to see
that one bit of information added.

~~~
ers35
You can see the size of the download by clicking show all:
[https://archive.org/download/14566367HackerNewsCommentsAndSt...](https://archive.org/download/14566367HackerNewsCommentsAndStoriesArchivedByGreyPanthersHacker)

~~~
fit2rule
The answer: 1.6 gigs.

~~~
sillysaurus3
1.6 gigs is no doubt the compressed size. Anyone know the uncompressed size?

~~~
ers35
7.27 GB.

