Hacker News Dataset Update October 2016 (aaron-hoffman.blogspot.com)
98 points by aaronhoffman on Oct 31, 2016 | 14 comments


> Caution: That command took just over 30 hours to complete on my macbook. (it also killed Finder a couple times and I had to disable spotlight on the folder I was saving all the .json files to)

I had a similar job I needed to do a few months ago and used AWS lambda to massively parallelize the work.

I was able to cut a job that I estimated would take my laptop 30 days down to about an hour by sharding the work across a ton of small instances.

Might be worth a look if you plan on updating this with any regularity.
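
For anyone curious what the sharding step might look like, here is a minimal sketch. It is not the setup described above (no details of that were given); it only assumes the work is keyed by a contiguous range of item IDs, and the function name and shard size are hypothetical. Each resulting range could then be handed to a separate worker, such as one Lambda invocation.

```python
def shard_id_range(first_id, last_id, shard_size):
    """Split the inclusive range [first_id, last_id] into inclusive (start, end) shards.

    Hypothetical helper for illustration: each shard is an independent unit of
    work that a separate worker could process in parallel.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    shards = []
    start = first_id
    while start <= last_id:
        # The final shard may be shorter than shard_size.
        end = min(start + shard_size - 1, last_id)
        shards.append((start, end))
        start = end + 1
    return shards


if __name__ == "__main__":
    # Example: ten items split into shards of at most four items each.
    for start, end in shard_id_range(1, 10, 4):
        print(start, end)
```

Because the shards do not overlap and do not depend on each other, a failed shard can be retried on its own without redoing the rest.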


That's a great idea, and I'll definitely look into that for next time. When I started that command, I had no idea how long it would take :-) Lucky for me, I could continue to work on a different machine.


How much did it end up costing you?


It looks like I barely went over the free tier. My lambda bill in September (when I did all this processing) was 4 cents.


I noticed that the Hacker News dataset published to BigQuery was now a year out of date.

I have created an updated copy and made it available for download.

(This is the last 10MM entries; I can add the rest if people are interested.)


Thanks for doing this, Aaron! Something to consider for the future: if you use grab-site (https://github.com/ludios/grab-site), which is a close cousin of ArchiveTeam's ArchiveBot, it'll retrieve all of the content from Hacker News (making the requisite HTTP request and receiving the response for each URL) and store it in WARC/CDX files, which can be uploaded to the Internet Archive. Those can then be integrated into the Wayback Machine by an IA admin at a later date. Something to keep in mind!


Thanks - was not aware of grab-site, I'll check it out!


Since the Hacker News API (https://github.com/HackerNews/API) used in this scraping is being brought up again, I'll ask a burning question: is development of the API dead?

From the commit notes in that repo, the only changes from the initial release in 2014 are "minor README updates."


Is that different from the Algolia API? I would assume there is no reason to support two APIs.

https://hn.algolia.com/api

I've been using it for a project (collecting video lectures for https://www.findlectures.com), and it seems to work pretty well and stay up to date.


It was intended to replace the Algolia API (this was mentioned in the discussion on the original announcement thread: https://news.ycombinator.com/item?id=8422599).

See my comment in another thread on why this did not work.


Awesome, thanks!


It's also possible that they aren't fixing what's not broken.


At minimum, there is no authentication endpoint for HN users, which is the primary reason you haven't seen many HN apps take off in the past 2 years.

A more damning reason is that the official HN API in its current state is worse than the API it replaced! The Algolia API (https://hn.algolia.com/api) is still active, can retrieve data with 1000 entries per page (vs. 1 at a time for the official API), and can also retrieve the comments plus text of a submission thread in a single HTTP request (the official API requires a separate HTTP request to retrieve the text of each comment in a thread).
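
To make the request-count difference concrete, here is a rough sketch rather than production code. It assumes the official API's item endpoint (with its `kids` and `text` fields) and the Algolia items endpoint (with nested `children`); the function names and the injectable `fetch` parameter are made up for illustration.

```python
import json
from urllib.request import urlopen

# Endpoint templates as I understand them from each API's public docs.
OFFICIAL_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"
ALGOLIA_ITEM_URL = "https://hn.algolia.com/api/v1/items/{id}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (one HTTP request per call)."""
    with urlopen(url) as resp:
        return json.load(resp)


def thread_texts_official(item_id, fetch=fetch_json):
    """Collect comment texts via the official API: one request per item."""
    texts = []
    stack = [item_id]
    while stack:
        item = fetch(OFFICIAL_ITEM_URL.format(id=stack.pop()))
        if item is None:  # missing items come back as JSON null
            continue
        if item.get("text"):
            texts.append(item["text"])
        # "kids" holds child IDs only, so each child costs another request.
        stack.extend(reversed(item.get("kids", [])))
    return texts


def thread_texts_algolia(item_id, fetch=fetch_json):
    """Collect comment texts via Algolia: the whole tree arrives in one request."""
    root = fetch(ALGOLIA_ITEM_URL.format(id=item_id))
    texts = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.get("text"):
            texts.append(node["text"])
        # "children" holds full nested comment objects, so no further requests.
        stack.extend(reversed(node.get("children", [])))
    return texts
```

For a thread with N items, the first function makes N requests and the second makes one, which is the gap described above. Passing a stub as `fetch` lets you try the traversal without touching the network.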


This is true. Without OAuth, I was not able to connect to individual user accounts. I wanted to allow users to display their own upvote/post history (see here: https://www.sizzleanalytics.com/reddit/)

I was unaware of the Algolia API; that will help with future tasks, I'm sure. Thanks!



