New HN data dump available with over 14.5m entries (archive.org)
79 points by cdman 120 days ago | 13 comments

If you're interested in playing with Hacker News data and don't want to download the entire dataset (or don't have the CPU/memory to perform large JOINs on stories/comments), you can use the Google BigQuery HN dataset, which is now up-to-date: https://cloud.google.com/bigquery/public-data/hacker-news. Specifically, use the .full table, which combines both stories and comments; the dedicated tables are not up-to-date.
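As a rough sketch of what querying the .full table might look like from Python: the helper names below are hypothetical, the fully qualified table ID `bigquery-public-data.hacker_news.full` and the `type` column are assumptions about the public dataset's layout, and running the query requires the third-party google-cloud-bigquery package plus a Google Cloud project with credentials configured.

```python
def build_type_count_query(table: str = "bigquery-public-data.hacker_news.full") -> str:
    """Build a query counting rows per item type (story, comment, ...).

    The table ID and the `type` column are assumptions, not taken from the thread.
    """
    return (
        "SELECT type, COUNT(*) AS n\n"
        f"FROM `{table}`\n"
        "GROUP BY type\n"
        "ORDER BY n DESC"
    )


def run_query(sql: str) -> list:
    """Run the query and return (type, count) pairs.

    Imported lazily so the query builder works without the client installed.
    """
    from google.cloud import bigquery  # third-party: google-cloud-bigquery

    client = bigquery.Client()  # picks up project and credentials from the environment
    return [(row["type"], row["n"]) for row in client.query(sql).result()]
```

Calling `run_query(build_type_count_query())` would then return one row per item type. An aggregate like this scans a single column, which helps keep the bytes processed low.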

I see this link mentioned all the time, but every time I try it I can't get it to work.

Specifically, the "GO TO THE HACKER NEWS DATASET" big blue button on that page. It kicks me over to a Google Cloud console link, which spins for a few seconds, and then brings up a "Welcome to BigQuery!" modal. The only thing I can do then is click "Create a Project", which then kicks me over to the generic console with a listing of all APIs.

Am I missing something?

You'll need to create a Google Cloud Platform (GCP) project before you can use BigQuery (you don't need to provide a credit card if you remain in the free tier).

See also: A dump of the stories, comments, and users from the Firebase API as a SQLite database with a full text search index: https://archive.org/details/hackernews-2017-05-18.db

Can you tell us the time period and the (estimated) percentage of comments covered, for both your DB and the posted dump?


My DB covers from https://hacker-news.firebaseio.com/v0/item/1.json?print=pret... (10/09/2006 at 6:21pm UTC) to https://hacker-news.firebaseio.com/v0/item/14372035.json?pri... (05/18/2017 at 11:58pm UTC)

The dump posted claims to cover from 1 to 14566367 (06/16/2017 at 3:03am UTC)
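The timestamps quoted above can be checked against the Firebase API directly. A minimal sketch, using only the standard library: the helper names are hypothetical, the URL omits the truncated `print` query parameter, and the only API detail relied on is that each item carries a `time` field in Unix seconds.

```python
import json
import urllib.request
from datetime import datetime, timezone

BASE = "https://hacker-news.firebaseio.com/v0/item"


def item_url(item_id: int) -> str:
    """URL of a single item (story or comment) by its numeric ID."""
    return f"{BASE}/{item_id}.json"


def format_time(unix_seconds: int) -> str:
    """Render an item's Unix `time` field as a UTC timestamp."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M UTC")


def fetch_item(item_id: int) -> dict:
    """Download and decode one item; needs network access."""
    with urllib.request.urlopen(item_url(item_id), timeout=10) as resp:
        return json.load(resp)


def print_range_endpoints(first_id: int, last_id: int) -> None:
    """Print the creation time of the first and last item in an ID range."""
    for item_id in (first_id, last_id):
        item = fetch_item(item_id)
        print(item_id, format_time(item["time"]))
```

For example, `print_range_endpoints(1, 14372035)` would print the two endpoints of the SQLite dump's range.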

Ah, OK, I forgot that comments have item IDs too, which ensures 100% comment coverage.

(I read "the story vote count is inaccurate for certain stories because it is only scraped once and not updated" and thought some comments might be left out too.)

So, 14.5M items at a minimum of 10 seconds per comment comes to 145M seconds, which is at least 40k hours' worth and probably one order of magnitude more. That is just writing time; reading is maybe 3 orders of magnitude more.
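The arithmetic behind that estimate can be checked in a few lines. This sketch assumes the 14.5 million entries from the submission title and the stated 10-second minimum; treating every entry as a comment is a simplification.

```python
# Back-of-the-envelope check of the writing-time estimate.
ITEMS = 14_500_000        # "over 14.5m entries" from the submission title
SECONDS_PER_ITEM = 10     # the stated minimum writing time per comment

total_seconds = ITEMS * SECONDS_PER_ITEM
total_hours = total_seconds / 3600

print(f"{total_seconds:,} seconds")   # 145,000,000 seconds
print(f"{total_hours:,.0f} hours")    # 40,278 hours
```

So the lower bound of roughly 40k hours holds, and the 145M figure is the total in seconds rather than the item count.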

Modern pyramids, they're impalpable ...

Kudos to archive.org for hosting torrents. It would be helpful to know the size of the download up front. Nice clean web page design; would love to see that one bit of information added.

You can see the size of the download by clicking show all: https://archive.org/download/14566367HackerNewsCommentsAndSt...

Thanks. My point to the web designer stands, though: the information should be on the first page. Before seeing your message, I looked and didn't find it; I had to determine it by loading up the torrent in a client. Also, as sillysaurus3 pointed out, the expanded size is useful too.

The answer: 1.6 gigs.

1.6 gigs is no doubt the compressed size. Anyone know the uncompressed size?

7.27 GB.
