> Caution: That command took just over 30 hours to complete on my MacBook. (It also killed Finder a couple of times, and I had to disable Spotlight on the folder I was saving all the .json files to.)
I had a similar job I needed to do a few months ago and used AWS lambda to massively parallelize the work.
I was able to cut what I estimated would take my laptop 30 days down to about an hour by sharding the work across a ton of small instances.
Might be worth a look if you plan on updating this with any regularity.
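For anyone wondering what the sharding step might look like, here is a minimal sketch. It only shows splitting an item-ID range into independent chunks and fanning them out. The names `shard_ranges`, `run_shards`, and `handler` are made up for illustration, and a local thread pool stands in for whatever actually runs each shard (a Lambda invocation, in the setup described above). It is not the commenter's actual code.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_ranges(first_id, last_id, shard_size):
    """Split the inclusive ID range [first_id, last_id] into (start, end) shards."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [
        (start, min(start + shard_size - 1, last_id))
        for start in range(first_id, last_id + 1, shard_size)
    ]


def run_shards(shards, handler, max_workers=8):
    """Run handler(shard) for every shard in parallel; results keep shard order.

    The thread pool is only a local stand-in (an assumption for this sketch).
    In a real setup each shard would be handed to a separate worker instance.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(handler, shards))


if __name__ == "__main__":
    shards = shard_ranges(1, 10, 4)
    # Placeholder handler: just counts the IDs in each shard instead of fetching them.
    print(run_shards(shards, lambda s: s[1] - s[0] + 1))
```

Because the shards do not overlap and do not depend on each other, a failed shard can be retried on its own without redoing the rest.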
That's a great idea, and I'll definitely look into that for next time. When I started that command, I had no idea how long it would take :-) Lucky for me, I could continue to work on a different machine.
Thanks for doing this, Aaron! Something to consider for the future: if you use grab-site (https://github.com/ludios/grab-site), which is a close cousin of ArchiveTeam's ArchiveBot, it'll retrieve all of the content from Hacker News (making the requisite HTTP request and receiving the response for each URL) and store it in WARC/CDX files, which can be uploaded to the Internet Archive. Those can then be integrated into the Wayback Machine by an IA admin at a later date. Something to keep in mind!
Since the Hacker News API (https://github.com/HackerNews/API) used in this scraping is being brought up again, I'll ask a burning question: is development of the API dead?
From the commit notes in that repo, the only changes from the initial release in 2014 are "minor README updates."
I've been using it for a project (collecting video lectures for https://www.findlectures.com), and it seems to work pretty well and stay up to date.
At minimum, there is no authentication endpoint for HN users, which is the primary reason you haven't seen many HN apps take off in the past 2 years.
A more damning reason is that the official HN API in its current state is worse than the API it replaced! The Algolia API (https://hn.algolia.com/api) is still active, and can retrieve data with 1000 entries per page (vs. 1 at a time for the official API), and can also retrieve the comments plus text of a submission thread in a single HTTP request (the official API requires the user to perform an HTTP request to retrieve the text for each comment in a thread).
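To make the request-count difference concrete, here is a rough sketch of collecting the comment text of one thread both ways. The function names and the injectable `fetch` parameter are made up for illustration. The sketch assumes the official API's items expose child IDs under `kids` and that Algolia's item endpoint returns nested `children`, as their docs describe.

```python
import json
from urllib.request import urlopen

OFFICIAL_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"
ALGOLIA_ITEM_URL = "https://hn.algolia.com/api/v1/items/{id}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def thread_via_official(item_id, fetch=fetch_json):
    """Official API: one request per item, following the 'kids' ID lists."""
    item = fetch(OFFICIAL_ITEM_URL.format(id=item_id))
    if item is None:  # missing items come back as JSON null
        return []
    texts = [item["text"]] if item.get("text") else []
    for kid in item.get("kids", []):
        texts.extend(thread_via_official(kid, fetch))
    return texts


def _collect(node):
    """Depth-first walk over an already-downloaded Algolia comment tree."""
    texts = [node["text"]] if node.get("text") else []
    for child in node.get("children", []):
        texts.extend(_collect(child))
    return texts


def thread_via_algolia(item_id, fetch=fetch_json):
    """Algolia API: a single request returns the whole nested thread."""
    return _collect(fetch(ALGOLIA_ITEM_URL.format(id=item_id)))
```

For a thread with N comments, the first function makes N + 1 requests and the second makes one, which is the gap described above.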
This is true. Without OAuth, I was not able to connect to individual user accounts. I wanted to allow users to display their own upvote/post history (see here: https://www.sizzleanalytics.com/reddit/)
I was unaware of the Algolia API; that will help for future tasks, I'm sure. Thanks!