Show HN: Reddit Archiving Tool
16 points by cookiengineer on June 10, 2023 | 1 comment
Inspired by the ongoing call to action by the Internet Archive team over at /r/DataHoarder [1], I've decided to try to preserve all cybersecurity-related subreddits. [2]

For people who don't know what's going on: the attempt to monetize the Reddit API is likely to lead to a lot of moderators quitting the platform, and a lot of subreddits could be set to private and/or have their threads deleted. At least that's the fear coming out of the ongoing moderator strike.

In my case I learned a LOT from Reddit's discussions about malware, exploits and how they work, and without those I certainly wouldn't be where I am today ... so I'm trying to preserve them.

Since the Archive Warrior only scrapes the HTML directly into the Web Archive, I'm trying to preserve the data itself as JSON files, with the intent to store them on IPFS later (inspired by the-eye-team's effort a couple of days ago to archive RARBG on IPFS).

I just wanted to let people know here about the tool, and in case you want to archive your favorite subreddits, feel free to modify it.

There is one major limitation though: listings (new/hot/top/search) are each capped at 1000 entries, which means that the discovery of old threads is quite limited.

Keyword search increases the discovery of old threads. In my case I'm searching for a lot of keywords (like CVE, RCE, vulnerability etc) in order to discover more threads.
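
A minimal sketch of that discovery idea, not the code from reddit-archivar: it builds the listing URLs plus one search URL per keyword, then merges the responses by thread id so a thread found through several queries is stored once. The base URL, the `restrict_sr`/`sort`/`limit` query parameters, the `r/netsec` example, and the function names `discovery_urls` and `merge_threads` are assumptions made for illustration; fetching and rate limiting are left out.

```python
from urllib.parse import urlencode

# Assumption: the public JSON listings are reachable under this host.
BASE = "https://old.reddit.com"
LISTINGS = ("new", "hot", "top")


def discovery_urls(subreddit, keywords, limit=100):
    """Build the listing URLs plus one search URL per keyword."""
    urls = [
        f"{BASE}/r/{subreddit}/{name}.json?{urlencode({'limit': limit})}"
        for name in LISTINGS
    ]
    for keyword in keywords:
        # restrict_sr=1 keeps the search inside the given subreddit.
        query = urlencode(
            {"q": keyword, "restrict_sr": 1, "sort": "new", "limit": limit}
        )
        urls.append(f"{BASE}/r/{subreddit}/search.json?{query}")
    return urls


def merge_threads(pages):
    """Deduplicate threads from several listing responses by thread id."""
    threads = {}
    for page in pages:
        for child in page.get("data", {}).get("children", []):
            thread = child.get("data", {})
            if "id" in thread:
                # Keep the first copy seen; later duplicates are ignored.
                threads.setdefault(thread["id"], thread)
    return threads


if __name__ == "__main__":
    for url in discovery_urls("netsec", ["CVE", "RCE", "vulnerability"]):
        print(url)
```

Each keyword query is subject to the same 1000-entry cap, so the gain comes from the queries overlapping only partially: the union of their results can reach threads that the plain listings no longer show.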

Would love to hear feedback. Currently it's just a quick-and-dirty prototype, because the threat of my favorite subreddits going dark is quite immediate. I tried to remove as much noise from the schema as possible, and the tool only archives the subreddit threads and comments, with the idea of scraping the websites/blog articles at a later point in time.
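
To illustrate what reducing schema noise could look like, here is a rough sketch that keeps only a handful of fields from a raw thread object and writes one JSON file per thread. The field selection in `KEEP`, the file layout, and the helper names `trim_thread` and `save_thread` are hypothetical and are not taken from the tool's actual schema.

```python
import json
from pathlib import Path

# Hypothetical field selection; the tool's real schema may differ.
KEEP = ("id", "title", "author", "selftext", "url", "permalink", "created_utc")


def trim_thread(raw):
    """Return a copy of the thread with only the fields listed in KEEP."""
    return {key: raw[key] for key in KEEP if key in raw}


def save_thread(thread, root):
    """Write one thread to <root>/<id>.json and return the file path."""
    path = Path(root) / f"{thread['id']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(thread, indent=2, sort_keys=True), encoding="utf-8"
    )
    return path
```

One small file per thread keeps the archive easy to diff and to add to content-addressed storage later, at the cost of many files on disk.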

[1] https://old.reddit.com/r/DataHoarder/comments/142l1i0/archiveteam_has_saved_over_108_billion_reddit/

[2] https://github.com/cookiengineer/reddit-archivar




Thanks for sharing this, I think I'll find this really useful!

Have you gotten any utility from the data hosted on the-eye.eu? https://the-eye.eu/redarcs/



