Hacker News Dataset Update October 2016

yeldarb · on Oct 31, 2016

> Caution: That command took just over 30 hours to complete on my MacBook. (It also killed Finder a couple of times, and I had to disable Spotlight on the folder I was saving all the .json files to.)

I had a similar job I needed to do a few months ago and used AWS Lambda to massively parallelize the work.

By sharding the work across a ton of small instances, I was able to bring what I estimated would take my laptop 30 days down to about an hour.

Might be worth a look if you plan on updating this with any regularity.
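For anyone who wants to try this, here is a minimal sketch of the sharding step only. It assumes the work is keyed by a contiguous range of item IDs, which is a guess about how the job might be split, not a description of the setup above. The function name `shard_id_range` and the `start`/`end` payload keys are hypothetical; each payload would be handed to one worker invocation.

```python
def shard_id_range(first_id, last_id, shard_size):
    """Split the inclusive range [first_id, last_id] into shard payloads.

    Each payload is a small dict that one worker (for example, a single
    Lambda invocation) could receive as its event. The key names are
    illustrative assumptions.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")

    shards = []
    start = first_id
    while start <= last_id:
        # The final shard may be shorter than shard_size.
        end = min(start + shard_size - 1, last_id)
        shards.append({"start": start, "end": end})
        start = end + 1
    return shards


if __name__ == "__main__":
    # 10 million IDs in shards of 10,000 gives 1,000 independent work units.
    payloads = shard_id_range(1, 10_000_000, 10_000)
    print(len(payloads), payloads[0], payloads[-1])
```

Because the shards do not overlap and do not depend on each other, they can be dispatched in any order and retried individually if one fails.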

aaronhoffman · on Oct 31, 2016

That's a great idea, and I'll definitely look into that for next time. When I started that command, I had no idea how long it would take :-) Lucky for me, I could continue to work on a different machine.

curiousgal · on Oct 31, 2016

How much did it end up costing you?

yeldarb · on Oct 31, 2016

It looks like I barely went over the free tier. My Lambda bill in September (when I did all this processing) was 4 cents.

aaronhoffman · on Oct 31, 2016

I noticed that the Hacker News dataset published to BigQuery is now a year out of date.

I have created an updated copy and made it available for download.

(This is the last 10MM entries; I can add the rest if people are interested.)

toomuchtodo · on Oct 31, 2016

Thanks for doing this, Aaron! Something to consider for the future: if you use grab-site (https://github.com/ludios/grab-site), which is a close cousin of ArchiveTeam's ArchiveBot, it'll retrieve all of the content from Hacker News, making the requisite HTTP request for each URL and storing both the request and the response in WARC/CDX files, which can be uploaded to the Internet Archive. Those can then be integrated into the Wayback Machine by an IA admin at a later date. Something to keep in mind!

aaronhoffman · on Oct 31, 2016

Thanks - was not aware of grab-site, I'll check it out!

minimaxir · on Oct 31, 2016

Since the Hacker News API (https://github.com/HackerNews/API) used in this scraping is being brought up again, I'll ask a burning question: is development of the API dead?

From the commit notes in that repo, the only changes from the initial release in 2014 are "minor README updates."

garysieling · on Oct 31, 2016

Is that different from the Algolia API? I would assume there is no reason to support two APIs.

https://hn.algolia.com/api

I've been using it for a project (collecting video lectures for https://www.findlectures.com); it works pretty well and seems to stay up to date.

minimaxir · on Oct 31, 2016

It was intended to replace the Algolia API (this came up in the discussion on the original announcement thread: https://news.ycombinator.com/item?id=8422599).

See my comment in another thread on why this did not work.

garysieling · on Oct 31, 2016

Awesome, thanks!

adamnemecek · on Oct 31, 2016

It's also possible that they aren't fixing what's not broken.

minimaxir · on Oct 31, 2016

At minimum, there is no authentication endpoint for HN users, which is the primary reason you haven't seen many HN apps take off in the past 2 years.

A more damning reason is that the official HN API in its current state is worse than the API it replaced! The Algolia API (https://hn.algolia.com/api) is still active, can retrieve data with 1000 entries per page (vs. 1 at a time for the official API), and can also retrieve the comments plus text of a submission thread in a single HTTP request (the official API requires the user to perform an HTTP request to retrieve the text for each comment in a thread).
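To make the request-count difference concrete, here is a rough sketch that walks one thread both ways. It assumes the official API serves items at `https://hacker-news.firebaseio.com/v0/item/<id>.json` with child IDs in a `kids` list, and that Algolia's `/api/v1/items/<id>` returns the whole thread nested under `children`. The function names and the injectable `fetch` parameter are illustrative, not part of either API.

```python
import json
from urllib.request import urlopen

OFFICIAL_ITEM = "https://hacker-news.firebaseio.com/v0/item/{id}.json"
ALGOLIA_ITEM = "https://hn.algolia.com/api/v1/items/{id}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def thread_via_official(item_id, fetch=fetch_json):
    """Collect comment text with the official API: one request per item."""
    texts, requests_made = [], 0
    stack = [item_id]
    while stack:
        item = fetch(OFFICIAL_ITEM.format(id=stack.pop()))
        requests_made += 1
        if not item:  # missing items are assumed to come back as null
            continue
        if item.get("text"):
            texts.append(item["text"])
        stack.extend(item.get("kids") or [])
    return texts, requests_made


def thread_via_algolia(item_id, fetch=fetch_json):
    """Collect comment text with the Algolia API: one request per thread."""
    root = fetch(ALGOLIA_ITEM.format(id=item_id))
    texts = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.get("text"):
            texts.append(node["text"])
        stack.extend(node.get("children") or [])
    return texts, 1  # the nested response needs no follow-up requests
```

For a thread with N comments, the first function makes N + 1 requests and the second makes one. Passing a stub as `fetch` lets you try both without touching the network.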

aaronhoffman · on Oct 31, 2016

This is true. Without OAuth, I was not able to connect to individual user accounts. I wanted to allow users to display their own upvote/post history (see here: https://www.sizzleanalytics.com/reddit/)

I was unaware of the Algolia API; that will help with future tasks, I'm sure. Thanks!