Hacker News
ArchiveTeam has saved over 11.2B Reddit links (reddit.com)
549 points by susanthenerd 11 months ago | 159 comments



From discussions it seems that the problem is that many subreddits will be going private to protest the recent Reddit API cost changes. Some will not come back unless the change is reverted. If the change is never reverted, they will be gone forever, and this project is trying to save old posts so they can still be seen even though the subreddits are private. Not sure how useful it will be... but it is somehow interesting as an example of a reaction to policy changes at centralised social networks.


Reddit were threatening to de-private the subs and replace the mods. Which they can do, but it will likely kill the community anyway for small communities. Perhaps big ones are anonymous enough for that to work.


There are subreddits where mods meet in person a few times per year. Replace that commitment.


>Replace that commitment.

They don't need to. They do need to keep the stock price up just long enough to cash out and leave a sucker holding the bag.


I think they missed that window.


Easy if you pay the new ones really.


Do you think it's possible to pay a new group of people and have them care as much as the people who met voluntarily out of sheer interest and dedication to their community? I doubt it.

Edit: not that I'm against paying mods, I support that. But replacing enthusiast mods with paid mods, I doubt that'll be adequate.


The obvious example here is /r/AskHistorians. Notoriously among the strictest subreddits, and an absolute treasure to read as a result. There is zero chance it could be maintained by paid staff without a serious search for qualified people.

There's a bit in Predictably Irrational by Dan Ariely where he talks about social compensation vs. market compensation (I could be remembering the terms wrong). He asks how poorly it would go if you went to Thanksgiving dinner with your grandmother, and after eating the amazing spread you handed her a $20 and said, "That's for the great food, gran, it was amazing!"


Their whole business model is that you have free labor to moderate the subs. They can't afford to pay for all subs and will become centralized in a few subs only.


As the front page of the Internet, I would hope someone there has enough imagination to find a way to pay the people that do the work. If not, do we really have an Internet?


Front page of the internet? Please. It's not that important.


They were referencing reddit's slogan, although I don't know if they still use it anywhere.


It's in some of the older CSS still, last I checked.


It's the <title> on old.reddit.com still.


Yeah, that's right up there with "Free, and always will be", except there's different costs not covered by that free.


Is replacing free labour with paid labour really the best direction for the IPO balance sheet?


If I consider it from the perspective of reddit inc, probably. Get some cheap labor where I can pay as little per hour as possible, and get rid of people who can strike without losing anything. The people I'll hire for cheap won't strike because I do unlikable changes like pushing out 3rd party apps or try to get rid of NSFW content. And if they do, I'll just fire them and replace with someone else.

Gets rid of a nuisance relatively easily, but degrades the quality. But reddit hasn't seemed to care much about quality for the last few years, so.


Maybe Reddit has decided the free labor costs too much, even if the price is cheap.


Any examples that come to mind? That sounds like a positive signal.


They can do that but that will accelerate the titanic effect of those subs, moderation is already the weakest point of Reddit.


Maybe it could have a good effect ironically, the problem with Reddit's moderation is the actual moderators. They mostly seem to be self righteous narcissists and have been the cancer killing reddit for the past decade. Purging them all and starting over with a clean slate could bring new life and freshness to the septic tank of a community it has become.

I don't care about the reddit API but I am enjoying watching this dumpster fire burn. Popcorn sure does taste good indeed.


Most "bad" moderators I've encountered could easily be explained by potentially being a stressed, overworked volunteer who takes a lot of shit from assholes all the time.

I think my explanation is at least as plausible as yours with the added benefit of not denigrating an entire group.


It’s a generalization, but it holds true on Internet communities in general that the most dedicated people who want and obtain mod power start out beforehand with less than ideal mental health, which is often a contributing factor as to why they have the enormous amounts of free time to moderate communities for free in the first place.

Obviously there’s exceptions, but it is a really common phenomenon. They found communities or get modded simply because they spend the most time and hang out in the right chats, not because they’re objectively great neutral moderators. They are very frequently people with very strong personalities, and in the long term often have a more destructive influence on communities than any short-lived comment troll, especially with respect to the effect on newcomers, as they tend to make communities increasingly insular.


So, it's your contention that the sort of people who gravitate towards positions of power/authority without any effective oversight and who have the technical means to essentially erase evidence...

These same people, are just overstressed and take too much shit, and this explains the vast majority of all alleged abuses?

I've heard this before. I hear it every time someone is shot to death while crawling towards their attackers while being screamed at to not move and come closer simultaneously. Every time that someone dies with a knee on their neck for 10 minutes. Every time a grenade is thrown into a baby's crib.

That's about the most generous I can be. Reddit moderators have not actually thrown grenades into baby cribs. They're not quite as bad as those that do throw grenades into babies' cribs.


I've seen a lot of both: there are some power-tripping selective-enforcement bully types in those volunteer moderator roles, on Reddit or Facebook or Discord or wherever, and then there are the moderators that try to act with integrity and tend to get burnt out by insufferable agitators and the general lack of respect from the crowd they moderate. I also don't get the sense they're wholly separate groups.


I could probably pick just about any large volunteer group, online or otherwise, and I'd agree.

>tend to get burnt out

People can also just overcommit to volunteer activities in general for a period. If people can just dial down their involvement, that can be OK. But if it's an activity that's all-in or nothing, it probably won't last for more than a while as people burn out or their priorities just change.


Good.

I’ve never been an active Redditor, but there are people discussing my open source software on Reddit, so I’ve answered questions there on a good number of occasions.

That my informative and completely rule-abiding posts can be made unavailable at the whim of some community mods while the site is still alive feels like a betrayal. These mods don’t own my posts. No complaint if Reddit goes the way of Digg and takes everything with it.


That sounds naive. Your contributions were only needed and possible because of the volunteers that built the community you participated in. These mods don't own your posts, you can publish them anywhere you like after the community is gone. If you rely on an unpaid commercial entity to preserve your content forever I have bad news for you.


My contributions were possible because I and a bunch of other non-mods contributed. As I said, I expect the commercial entity to preserve my content until the entity decides not to, not some random third party.

My prior experience moderating other forums tells me the contribution of these community moderators is often way exaggerated and usually easily replaceable. Some form personality cults, and periodic rotation would actually be a good thing.


That entity delegated responsibility before you ever posted. That you didn't care back then doesn't matter: it's always been this way.


Pretty sure taking the subreddit private as a form of protest is an unintended abuse of "delegated responsibility", and they can take that back at any time. Which is what they are considering anyway if gp is to be believed.


Given that subreddits have been doing it for years when it served reddit's interest, I disagree.


You think that people Reddit would have to appoint on such short notice with no experience of the prior community will be _less_ likely to go on power trips?


“The commons are working fine for ME, how dare you protest?”


Yeah... if unpaid mods get screwed over and their reasonable tools destroyed, if greedy IPOism rules and starts price gouging third parties on short notice, and if there is a crack-CEO who even goes so far as manipulating others' posts in the backend to make users look bad - all fine. They all knew they were operating on a platform where they should have expected this. But while I'm in no way an active user, if some comments I made become unavailable, gosh I will be angry!!! Lol (:


> Reddit were threatening to de-private the subs and replace the mods.

Is there a link to this? That's crazy!


They've done it before, shouldn't be a surprise at this point.


Why exactly would it be crazy? A few mods taking a community of half a billion hostages is not crazy?

If I were spez I would simply disable the ability to make new subreddits private.

This protest is not for the greater good, it's harming half a billion users for the benefit (actually, for no benefit since nothing will come out of it) of the 3% that want to not see ads and use the 3rd party apps.


Running a subreddit is an alternative to running your own forum. An alternative that's much easier to get up and running, so it's a very popular one.

If Reddit does a mass replacement of mods, the illusion is broken. You're not running your own forum, you're doing free work for some website. So if you want to create a place to discuss X, then you don't think to make a subreddit for X, you go with something you actually control or, far more likely, just somewhere the illusion hasn't been broken, like creating a Discord server.

That illusion is what has made Reddit basically the forum. It's the whole value of the site. Destroying the thing that makes your site valuable is crazy.


How many times do you get a Discord server in your Google results when searching for something?

Reddit's value proposition is not the ego stroking of the 0.0000001% that are moderators, it's the discoverability and interoperability between unrelated niches.

If little dictators don't get their kick from ruling lawlessly on a community anymore, I say good riddance.


Ego stroking of some mods is far from what is happening... please get the full picture. Those unpaid mods, who did work for Reddit for free, are having the tools they need to do this unpaid work reasonably taken away, while at the same time Reddit starts price gouging 3rd party apps to extract more value for its IPO. Reddit wouldn't be where it is today if it didn't have all the free content from the users and free work from the mods. Kind of ridiculous, but I mean, given how Reddit is acting, they can just remove those unpaid moderators, replace them with paid ones and restore everything back to normal. If that is your view and also Reddit's, where is the problem then?

Sad.


Only 3% of moderation actions come from third-party apps [1]. What was that again about taking away the tools the moderators are using?

This whole thing about third-party apps has been ridiculously mismanaged by the communities.

The only 2 reasons people want to keep third-party apps are 1) they prefer them to the official one, 2) they don't want ads. Both of those reasons are valid, but neither are even remotely close to justifying the actions that those Reddit nerds are taking.

[1] https://www.reddit.com/r/reddit/comments/145bram/addressing_...


The communities are what made the Reddit results so relevant.

I get not being sympathetic with petty kings of message boards, but let’s be real, Reddit is an awful company whose incompetence is legendary. They’ve failed to monetize, failed to maintain the user experience and now are failing to keep a vital aspect of their business going.

Ultimately their failure is a good thing. Reddit broke the internet by ending the phpbb era. Time for the next thing.


>How many times do you get a Discord server in your Google results when searching for something?

What does this have to do with what I wrote?

>the discoverability and interoperability between unrelated niches.

Why are those niches on Reddit if Reddit isn't giving away faux forums?

>If little dictators don't get their kick from ruling lawlessly on a community anymore, I say good riddance.

Odd because I get the impression you prefer the communities those little dictators create to be on a google crawlable site over discord.


> >How many times do you get a Discord server in your Google results when searching for something?

> What does this have to do with what I wrote?

You are suggesting that those users will go away from Reddit to form separate dedicated communities, and I am saying that they will try but fail to attract people.

Reddit allows the vast majority of the subs that participate in the blackout to survive simply because they are a part of Reddit and benefit from its infrastructure and features.

You really think a website dedicated to cute animals will attract 34M subscribers like /r/aww? And another one with the exact same theme will attract 4M subscribers (with a lot of overlap) like /r/Eyebleach?

Those communities exist and thrive because the barrier to entry is literally non-existent. It takes one input to create them and one click to join them.

> Odd because I get the impression you prefer the communities those little dictators create to be on a google crawlable site over discord.

I would prefer if there were no little dictators, with elections every 3 months, detailed stats shown for the moderators' actions, and most of the moderation in the style of StackOverflow, so community-driven.


The other benefit of running a subreddit is that it comes with Reddit's logged-in audience, sort of.


The problem is that people are (a) using the 3rd party apps for entirely genuine usability reasons beyond ads and (b) while the API users may only be a small percentage, they're the ones holding the site together. Few moderators use only Reddit official tooling. Some have built quite sophisticated tools to automate their work. /r/Music mentions having their own server for some purpose: https://www.reddit.com/r/Music/comments/141tzgd/comment/jn2l...

The notorious Digg collapse was in part because of their fight with a big poster who was dominating the rankings and in the process supplying a lot of the content. They won against him, and the rest is history. Similar with Vine.


You are extremely naive if you think this stops at 3rd party apps.


The community (not all, I know) is supporting this.


The community is a massive majority of lurkers that don't comment, don't upvote and may not even have an account.

It's not because a few terminally online Reddit addicts are vocally posing as the resistance that the majority of the community supports it.


I think this might be an extreme case of misunderstanding how internet communities work.

Without comments, HN would be just a boring link aggregator and we'd get very little information if the article was BS or not. But because we have comments we get gems at times where 'the creator of X' discusses the merits of the article. That can be nearly priceless. Things like this draw people that don't upvote and don't comment, but they still get immense value from it.

Posts are what makes Reddit, so much so that Reddit created hundreds of fraudulent profiles in their early days to fake popularity.

https://arstechnica.com/information-technology/2012/06/reddi...

----

Of course, this makes me wonder what the future looks like for social media. At one time in the past you needed users to generate and post content. Could we end up with social media sites with 'good enough' bots faking humans that draw in the masses, but where few biological commenters and posters exist?


Lurkers who never post are not participating in the community


.. but they do generate ad impressions. Everyone else is "the product", I guess.


You're talking about the 3%(?) that actually make the community, discussion, and value (the reason people show up in the first place, randomly or via search).


I think most subs that are a few hundred thousand won't really notice the difference.

The big subs that they would 100% do it on that are a few million plus, you would never really notice. /r/pics, /r/gaming, etc who even pays attention to who the mods are. The mods aren't the community.

If Reddit replaces volunteer mods with paid mods we would get a more consistent moderation and almost certainly a more professional experience. People getting banned for disagreeing with a mod would stop for example. You wouldn't have to guess what the mods mean by their rules. For example, on /r/startups replying to people giving them an answer and saying if you have any more questions I'm free is called an unauthorised ask me anything. Which is crazy. There are many subs where it's anyone's guess what the rules are. And it can literally depend on the mood of the moderator. There are some subreddits that automatically ban people who have posted in certain subreddits.

Moderation is an important job. It's needed. But I can't think of any other social site that has such a bad rep for moderation.


> If Reddit replaces volunteer mods with paid mods we would get a more consistent moderation and almost certainly a more professional experience.

I can guarantee you the opposite. You'll have a site-wide abusable Scunthorpe-incompatible report-based automated moderation with no proper appeal mechanism in place because half a dozen underpaid interns/offshored employees will be responsible for taking over the work of hundreds of moderators.

It's already happening in some cases. There's ways to make reports go directly to the so-called Anti-Evil Operations team who will irrevocably override any moderation decision and enforce abusive reports.

It's easy to get people banned, post some hateful content, wait for reports, and then report the reports for report abuse.

> But I can't think of any other social site that has such a bad rep for moderation.

There are some legitimate cases of poorly moderated subreddits and mod abuse (and the whole powermod issue), but beware, most of the time people complaining about power-tripping mods and "not being able to say anything anymore" have been banned for very good reasons (those reasons being straight-up hate speech most of the time).


Yes, but we even give murderers due process. You don't have any civil liberties with reddit, but if the punishment doesn't fit the crime or you know they just used the rules as a pretext to squash dissent, it's going to leave a bad taste.

They can do whatever they want with their platform, but I can also not like mods who make decisions I don't agree with.

I also don't think mods should be allowed to ban people just for being subscribed to other subs or having posted there. The whole idea that I co-sign everything the sub stands for just because I read it tells you all you need to know: they expect and often demand that you do in fact co-sign everything the sub stands for.

And that's the fundamental problem with reddit, really. It's not just the keyword squatting mods: it's that it's a giant social experiment that distills the worst of mob behavior and anonymity.

Fixing reddit requires some kind of check on the mods and the elimination or complete overhaul of the karma system.


I'm not so sure, yes the mods are pretty anonymous on the bigger subs, but they do a tonne of work and have a tonne of experience on how to actually moderate the subs. There's a whole host of rules that have grown up around each of these communities that have been learned through bitter experience. So sure, you might not notice tomorrow if /r/pics moderators all got replaced, but I guarantee you over the next 6 months the sub-reddit would change in character significantly.

You can employ paid moderators, I don't think it's a terrible idea from a quality of user experience perspective. It's an awful idea from a "We're desperately trying to get this fucking company to IPO" perspective.


> Moderation is an important job. It's needed. But I can't think of any other social site that has such a bad rep for moderation.

Pretty much every other social site of note doesn't have a rep for moderation, on account of not having moderation at all. A solid 20% or higher of the YouTube comments I see are straight-up phishing scams, Facebook and Twitter are complete cesspits where only content that's literally illegal to post ever gets removed, and the less said about imageboards the better.

Wikipedia is the only even remotely comparably large site I can think of that actually has anything resembling moderation, and you'll find the exact same crowd criticising them for enforcing their rules as well.


>on account of not having moderation at all

I think you may be making a mistake....

Years ago I had to manage SMTP servers. Of course a huge part of that is dealing with spam. Users were mad about how much spam they got and asked if "I was even doing my job". In one particular user's case, they did get a lot of spam, and it was a pain in the ass to deal with and they always had lots of complaints. So I showed them that for every 3 messages they received I blocked somewhere near 1000 messages to their address.

If you've never been on that side of the system you don't know what 'not having moderation at all' looks like, but I can tell you it looks far, far, far worse than YT posts.
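
To put rough numbers on the ratio above, here is a back-of-the-envelope sketch. The figures 3 and 1000 are the approximate ones from the anecdote, and `blocked_share` is a name made up for this example, not anything from a real mail server.

```python
def blocked_share(delivered, blocked):
    """Return the fraction of all incoming messages that the filter stopped."""
    total = delivered + blocked
    if total == 0:
        raise ValueError("no messages to compare")
    return blocked / total


# Approximate figures from the anecdote: 3 delivered for every ~1000 blocked.
share = blocked_share(delivered=3, blocked=1000)
print(f"{share:.1%} of incoming mail was blocked")
```

In other words, with those figures roughly 99.7% of what was sent to that address never reached the user.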


What you're describing is administration, not moderation. Very similar concepts, but there's an absolutely gigantic difference in the feel of a community with active moderation versus a site like YouTube where the overwhelming majority of all user-submitted content is never looked at by a single human with the ability to remove it. Often by design - an SMTP server does not have the same use cases as a forum board, and doesn't need the same kind of hands-on attention that a social service requires to be enjoyable.

In theory channel owners have the ability to handle that for YouTube specifically, but in practice the tools and incentives aren't there to make it actually happen.


Facebook groups have them as well.


> If Reddit replaces volunteer mods with paid mods we would get a more consistent moderation and almost certainly a more professional experience.

Did you mean "more advertisement friendly"?

Like no shitting on big mobile publishers, no criticism of Blizzard, or EA, or other big corps.


No. I gave clear examples of what I meant.


If they pick moderators, those moderators are no longer volunteers, but employees & agents of Reddit, Inc. — and because of legal precedents with how moderators were engaged with AOL & LiveJournal, anything those technically-employee-agents of Reddit, Inc do wrong with respect to criminal activity and torts, Reddit Inc is on the hook for.

That’s why they use the volunteer mod model, and why they keep us at arm’s-length, and mandate that we cannot receive any compensation of any kind from anyone for moderating.

That said —

The mod code of conduct gives them avenues for removing mods that violate it; there’s also neutral admin-developed tools that identify users who are already active in helping the community out as potential moderator recruits.

So mods that close subreddits maliciously — with an intent to damage Reddit or to demand that they disburse money to a third party — could be removed from mod privileges, and replacements found.


I'm pretty sure that's not what Section 230 says - in fact, the opposite, which is that reddit cannot be held responsible for the content on the site, even if reddit employees are moderating content. Sounds to me more like reddit trying to spin a false narrative (if that in fact is what they've said) to take advantage of free labor.


i mean also given that reddit has actually picked moderators on subs that need it in the past (like /r/redditrequest exists), i'm pretty sure this is just made up


that bike chain logic reminded me of "freeman on the land" lawyering.


No, no, no, no, no. This is one of those Section 230 myths that somehow keeps circulating. It does not matter if moderators are paid or volunteer. Section 230 says nothing about this. AOL won all those lawsuits -- Zeran v. AOL (1997), Blumenthal v. Drudge (1998), Doe v. AOL (2001), Green v. AOL (2003).

Now, reddit may SAY that this is the case as an excuse, but it isn't the truth.


This is absolutely not the case. If you've spent much time in the Reddit ecosystem you would know that mods frequently go rogue and Reddit staff have to step in to appoint new mods and sometimes whole new mod teams.


I moderate tens of subreddits, some of the most prominent and significant in the Reddit ecosystem. So yes indeed, I would say I have "spent time"


Since you're definitely the same Steve as the one on Reddit, how do you feel posting here without the ability to ban people for asking if you've started paying your child support?


And yet r/btc's mods are or were all or almost entirely paid employees of Roger Ver and Reddit didn't seem to care at all.


People seem to forget that the subreddits they create or moderate don't actually belong to them, they belong to Reddit Inc.


No. That’s the issue. Without moderators and posters, there’s no subreddits. Without Reddit, Inc., there’s no subreddits. In my experience, when trapped in a causality dilemma, it’s preferable to be thankful for both the chicken and the egg and not ask too many questions.


And without active users they are worthless.


https://www.reddit.com/r/ltsc/ did that quite a while ago, I forget in response to what. I lost a lot of really great info on how to use Windows 10 LTSC / Server / Enterprise / IoT when r/LTSC went permanently private.


Couldn't you (request to) join if that's interesting/valuable to you? Or it went private as a sort of pseudo-deletion, and doesn't really exist any more?


I’ve tried that with multiple alts and haven’t been approved. Again, I forget the story. It was in protest to some Reddit policy.


It's also not uncommon for users to nuke their accounts when they close them, deleting all posts and comments.


Live leaderboard of archived links https://tracker.archiveteam.org/reddit/

I've been contributing to this project for ~2 years now and I've never seen it running so fast


That's because it's the ArchiveTeam selected project now. Everybody that has their warrior set to auto will now work on this project.


How can I help?


Run this software in a VM:

https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior

It appears to make you into a sort of node that backs up a share of reddit and sends a copy to a group that's archiving all of reddit.


Too bad they don’t support running on ARM.

> Can I run the Warrior on ARM or some other unusual architecture?

> No, currently we do not allow ARM (used on Raspberry Pi and M1 Macs) or other non-x86 architectures. This is because we have previously discovered questionable practices in the Wget archive-creating components and are not confident it runs under different endiannesses etc. If you still want to run it apparently Docker can emulate x86_64.


Oh, I was downloading the container on my pi right now. Thanks for the hint.


There's no ARM Docker build, unfortunately.


I'm sure most arm64 is difficult to get up and running, but on my M2 I used colima to do the x86_64 emulation, then just ran the docker container.

https://dustinrue.com/2022/10/colima-cpu-architecture-emulat...

colima start --profile amd64 -a x86_64 -c 4 -m 6


I was hoping to run it on one of my Raspberry Pi. I doubt that x86_64 emulation for the container on RPi would be worth it, if it is even possible to do.


Hmmm... endianness? Would've been nice for them to provide an example that actually applies to ARM instead of leaving that under "etc."


Here is more context: https://github.com/ArchiveTeam/warrior-dockerfile/issues/56 ( ARM support #56 - Apr 16, 2021 - 27 comments )


I'm using Qubes OS. I wonder why I can't simply run some application and instead have to install and manage Docker for that. I care about my security myself.



For those who favour Docker - you can do this:

docker run --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --restart=unless-stopped atdr.meo.ws/archiveteam/reddit-grab --concurrent 1 [username]


Works as is on podman as well. Add -d to run it in background.
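
If you would rather launch this from a script, here is a minimal Python sketch that assembles the same command. The image name and flags are copied from the comments above; `build_warrior_cmd`, its parameters, and the example nickname are hypothetical names introduced only for illustration. Nothing is executed unless you hand the result to `subprocess.run` yourself.

```python
import shlex


def build_warrior_cmd(username, runtime="docker", detach=False, concurrent=1):
    """Build the argv for the reddit-grab container quoted in the thread.

    runtime: "docker" or "podman" (the thread says the command works on both).
    detach:  add -d so the container runs in the background.
    """
    if runtime not in ("docker", "podman"):
        raise ValueError(f"unsupported runtime: {runtime!r}")
    cmd = [runtime, "run"]
    if detach:
        cmd.append("-d")
    cmd += [
        "--name", "archiveteam",
        "--label=com.centurylinklabs.watchtower.enable=true",
        "--restart=unless-stopped",
        "atdr.meo.ws/archiveteam/reddit-grab",
        "--concurrent", str(concurrent),
        username,
    ]
    return cmd


if __name__ == "__main__":
    # "example-user" is a placeholder nickname, not a real account.
    print(shlex.join(build_warrior_cmd("example-user", runtime="podman", detach=True)))
```

Keeping the command as a list rather than one string avoids shell-quoting surprises if the nickname contains unusual characters.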


Just reading the reddit post I linked gives enough information on how to do it.


Funnily, my main beef with reddit has always been that it is "yet another data silo" and now that they seem hell-bent on proving my point, this project might actually change that to an extent. Of course, the move will still kill the platform or at least gut it of its actual value (the many niche communities built on it) but at least the data will be free. :)


How will ArchiveTeam handle compliance with GDPR (RGPD)? Do they remove all personal information while scraping?


I have read the linked post and they seem to be only saving links, possibly titles, but not comments or text posts?

Reddit's value is in the discussions (the links are usually shared on all social platforms, so that's the only differentiator).

So what value is there in a collection of links going back years (meaning most of them are likely already broken)?


The OP said this:

> By Reddit links I mean posts/comments/images, I should’ve been a bit clearer.


The archive includes all information in the webpage, including text posts, comments and images, and is uploaded to archive.org.


A "link" here means everything at that link, including the post, comments, likes ...


You are misinterpreting across multiple categories


I plan to stop using Reddit for social media. But I also add “reddit” to a lot of my searches, using Reddit posts for “Buy it for Life” products and obscure knowledge among other things, and I don’t want that to go away. So I’m glad ArchiveTeam managed to archive almost all of them.

I also use Reddit for tech updates and discussions from subreddits like r/rust and r/ProgrammingLanguages, but most of these already have alternative sites (e.g. discourse, Discord, SE), and I’m more hopeful most of the new posts migrate to one of these than other subreddits or Reddit migrating in general.


> but most of these already have alternative sites (e.g. discourse, Discord, SE)

I've seen Discord used as a platform for project-based discussion, and I just can't get into it. It just seems so fundamentally not built for long-term threads and the aggregation of information (like a wiki, for instance), and it kind of frustrates me to see some projects that are hell-bent on keeping all related discussion on Discord.

Granted, Reddit is not the greatest place for what I'm talking about either, since it also prioritizes newer self-posts/links, but I feel the clunkiness is even worse in Discord.


Same, I really don't like Discord. It probably doesn't help that I only use it when I have to but it makes me feel tired trying to find specific information even in smaller Discord groups, and even when I know for sure it's there.


Google used to keep a mirror of all reddit posts and comments as a demo for their cloud bigquery product:

    SELECT * FROM fh-bigquery.reddit.subreddits

Unfortunately they stopped updating it in 2016.


Comments data is available until the end of 2019 in fh-bigquery.reddit_comments

There are torrents available it seems for comments that cover later times than that.


There's archived data until March, https://archive.org/details/pushshift-reddit-2023-03

The rest of the data can be found on AcademicTorrents.
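
For anyone who wants to poke at those dumps, here is a minimal sketch of a first pass over them. It assumes the file has already been decompressed into newline-delimited JSON with one comment object per line and a `subreddit` field, which is how I understand the Pushshift comment dumps to be laid out; the function name and the sample records are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_comments_by_subreddit(lines: Iterable[str]) -> Counter:
    """Count comments per subreddit from newline-delimited JSON records."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated or corrupt record
        if not isinstance(record, dict):
            continue  # a bare number or string is not a comment object
        subreddit = record.get("subreddit")  # assumed field name
        if subreddit:
            counts[subreddit] += 1
    return counts


# Made-up sample records standing in for lines read from a dump file.
sample = [
    '{"subreddit": "rust", "author": "a", "body": "hi"}',
    '{"subreddit": "rust", "author": "b", "body": "yo"}',
    '{"subreddit": "AskHistorians", "author": "c", "body": "..."}',
    "not json",
]
print(count_comments_by_subreddit(sample).most_common())
```

Since the function only needs an iterable of lines, it can be fed an open file handle directly, so a multi-gigabyte dump never has to fit in memory.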


I really hope they're defaulting to Old Reddit, because I seem to recall Archive.org choking on the redesign and not actually showing anything readable for archived pages.

(Also, is this including Reddit-hosted images/video?)


Here is an example of analyzing 20+ billion Reddit comments in ClickHouse: https://clickhouse.com/docs/en/getting-started/example-datas...


We need someone like SciHub to say the ownership of those Reddit posts does not belong to Reddit, and simply fork the entire thing.


The reddit user agreement used to say something like "You retain the rights to your copyrighted content or information that you submit to reddit ('user content') except as described below." so reddit inc never really owned those comments in the first place, whatever "ownership of comments on reddit" really means.


First some facts and then some news.

The first rule of social networks is you can't touch them or they will die.

Proof: AOL, MySpace, Twitter, Facebook (Zuckerberg won't touch it now), Reddit

This is obvious except to billionaires.

Now for the news: the only content you see on social networks is designed to reinforce your supposedly persistent self and sell ads.

There is no other purpose for social media.

All social media content on the internet since the internet began will be deleted, lost and forgotten as there will be no profit motive to do otherwise.


Facebook and Instagram have historically shut down APIs with zero days' notice, breaking apps that depended on them. This is pretty normal for a social network trying to commercialize.


Thank you for taking the time to reinforce your supposedly persistent self by posting this reply on a social media site.


What about the fact that OpenAI used Reddit as a dataset for training GPT? Seems like the data was valuable.


This is a problem people have with macro, large-scale data: the problem you are exhibiting in regard to "value".

I'll use a simple example to illustrate.

If you (yourself) were to write a song that sounded just like a Beatles song, and was in effect a lift or "copy" of it, and it went to #1 on the charts, you would expect a Beatles lawyer to contact you pretty quickly.

That's called "one to one" copyright infringement. Lawyers and court systems are set up to process these claims and do so from time to time.

However, if you were to copy and make available with no royalty ALL THE SONGS EVER WRITTEN AND RECORDED, the legal system and lawyers have almost no way to get their heads around it, and the judicial system can barely assign a plaintiff to the defendant.

This is more analogous to the situation Google created when it scanned "every book written in the English language" and made it available via Google search. At the time the courts and the law had no concept of such a thing and couldn't process the mass copyright violation.

Google didn't scan all the books because they had "value". Google scanned all the books to "devalue" them to zero, to serve its purposes. As far as I know, Google has never paid a cent to this day.

So ChatGPT didn't scan the entire internet (Reddit was a rounding error) because it HAD value. It scanned it to form English grammar out of tokens and to get away with mass theft. Just like Google.

As an AI assistant, I hope I answered your question to your satisfaction.


What about the fact that reddit closing API access in no way stops others from doing the same? There are multiple archives of reddit all over the place, and that won't stop just because they make the API paid. Instead people will scrape the HTML and extract the data from there, which will surely be more costly for reddit than the simple JSON API they have (for now).
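
To show how little work the JSON form is to consume compared with scraped HTML, here is a rough sketch that flattens a comment tree. The nested shape (a `Listing` whose `children` are `t1` comment objects, each with a `replies` field that is either another `Listing` or an empty string) is how I remember Reddit's thread JSON looking; treat the field names as assumptions, and the sample data is invented.

```python
from typing import Any, Iterator, Tuple


def iter_comments(listing: Any, depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, author, body) for every comment in a Listing, depth-first."""
    if not isinstance(listing, dict):
        return  # "replies" is an empty string when a comment has no replies
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":
            continue  # skip "more" placeholders and anything that is not a comment
        data = child["data"]
        yield depth, data.get("author", ""), data.get("body", "")
        yield from iter_comments(data.get("replies"), depth + 1)


# Invented sample in the assumed shape: one top-level comment with one reply.
sample = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {"author": "alice", "body": "top", "replies": {
            "kind": "Listing",
            "data": {"children": [
                {"kind": "t1", "data": {"author": "bob", "body": "reply", "replies": ""}},
                {"kind": "more", "data": {"children": ["abc"]}},
            ]},
        }}},
    ]},
}
flat = list(iter_comments(sample))
for depth, author, body in flat:
    print("  " * depth + f"{author}: {body}")
```

Doing the same from HTML means tracking markup that can change with any redesign, which is part of why scraping costs more on both sides.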


What about it? You're arguing a point I haven't made.

I was merely saying the data set is valuable or they wouldn't have trained on it. GP was saying the data was worthless.


I have somewhere some polaroids from my childhood that have more value than the entire corpus of reddit. But they are fading.

If people care so much, take a screenshot and put it in a photo album. Otherwise don't expect other people's servers, which cost money to run, to hold your valuables.

Grow up.


> the only content you see on social networks is designed to reinforce your supposedly persistent self and sell ads.

There were TONS of comments on HN about how reddit is still a safe place for product reviews (or just opinions).

Plus they are kind of forums where niche knowledge can be found.


Which is fine. These people are digital hoarders, desperate to save threads of puns on pictures of cats. None of it matters. You're not going to go back and look at all the pictures and threads you've posted. Life goes on.


I disagree, there's tons of valuable content on Reddit that people regularly reference years later. Tech support, product reviews, and similar. You'll often see people on HN recommend searching with `site:reddit.com` based on the valuable information it holds and the degradation of search result quality.

Saving content from forums is the same idea. Reddit is just a collection of forums.


Holy shit is this a wrong take on this.

Do you think that, at no point in the future, it would be valuable to know how regular people thought and what they talked about today?


There's more than enough evidence of that in a billion different places. And we're not exactly suffering from not having those records of the past, either. If keeping records long term were actually important we'd have done it well before the sites end up deleting all their content. This is just a fear of the unknown with no concrete examples of cases to justify it.


The US congress would never bother regulating

"icanhazcheezburger.com"

These things and by things I mean the dumpsterfication of social media content or internet content because it has no value.

Those same things don't "influence elections" and traumatize children and supposedly destroy society.

Social media is a fancy dumpster fire and that's all it will ever be. It burns out and starts anew from some other source.


I agree with the hoarding bit. I'm not a fan of that either. But I'm much less of a fan of the self importance of tech pretending this "data" has any more use than temporary ad surveillance until they delete it or resell it.


Could one import all of this into, say, a Lemmy instance to kickstart a reddit alternative?


Haven't tried it, but this comment on /r/DataHoarder mentioned these two repos:

https://github.com/rileynull/RedditLemmyImporter

https://github.com/LemmyNet/lemmy


How can I download the data they archived? Asking for a friendly AGI :)



It seems that the ArchiveTeam servers are overloaded. I constantly get errors like these:

    @ERROR: max connections (-1) reached -- try again later
    rsync error: error starting client-server protocol (code 5) at main.c(1817) [sender=3.2.3]


This is common and just means that the rsync server your warrior tried to upload to was too busy. It'll retry and try another upload host if you leave it to do its thing.
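
Roughly, the retry behaviour described here can be sketched like this. It is not the warrior's actual code: the function name, the host names, the number of rounds, and the 60-second default are all illustrative assumptions, and the `upload` callable stands in for whatever really does the rsync.

```python
import time
from typing import Callable, Sequence


def upload_with_fallback(
    item: str,
    hosts: Sequence[str],
    upload: Callable[[str, str], bool],
    retry_delay: float = 60.0,
    max_rounds: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each host in turn; if all are busy, wait and go around again.

    Returns the host that accepted the item, or raises RuntimeError.
    """
    for round_number in range(max_rounds):
        for host in hosts:
            if upload(item, host):
                return host
        if round_number < max_rounds - 1:
            sleep(retry_delay)  # every host was busy, so back off before retrying
    raise RuntimeError(f"no host accepted {item!r} after {max_rounds} rounds")


# Fake uploader that is "too busy" for the first three attempts.
attempts = []


def fake_upload(item: str, host: str) -> bool:
    attempts.append(host)
    return len(attempts) >= 4


waits = []  # record the requested delays instead of actually sleeping
chosen = upload_with_fallback("item-1", ["host-a", "host-b"], fake_upload, sleep=waits.append)
print(chosen, attempts, waits)
```

Passing `sleep` in as a parameter is only there so the sketch can be exercised without waiting a real minute.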


Got my threads stuck on that as well. Increased concurrency now to the maximum of 6 so that those 60-second delays don't choke it completely, or at least less. Working reasonably so far, 5 threads waiting for the upload to complete, 1 still going. (Can't imagine that 1 thread continuously working will remain unbanned 24/7 anyway.)


It also happens every time it retries, so it spends most of the time doing nothing.

    Retrying after 60 seconds...

Is that really normal?

The downloads from reddit work fine, but if the upload doesn't work then I don't see the point of running this.


Yes, it is really normal when lots of people try to upload at the same time. Bandwidth is limited, so when lots of people start to run the warrior, the servers need some space to do their thing. Also, IA has limited bandwidth, so sometimes that's the bottleneck too.

If you give it time, it'll work eventually. Up the concurrency to max, so you can have more items in the upload state, as long as you don't start hitting rate limits from reddit, it'll be fine.


The point of having many people run it is to maximise the number of different IP addresses scraping the data.

Even if you are only using a small percentage of your available bandwidth you are still helping out by running this.

If they attempted to max out the download bandwidth of all clients they’d only end up getting everyone IP banned by Reddit, and then the scraping would not be successful.

So even if most time is waiting to upload the scraped data, it’s still good.

Slow and steady.


Please allow it time to catch up. It always does.


I set up the agent a few days ago just for fun; however, it seems to have stalled and is not getting new jobs.


Are you running the watchtower (for automatic updates) as well? Otherwise, a restart should update it.


Looks like someone configured their server connection limit incorrectly ;)

    @ERROR: max connections (-1) reached -- try again later
    rsync error: error starting client-server protocol (code 5) at main.c(1817) [sender=3.2.3]


Seemingly, Internet Archive is overloaded with upload requests from Archive Team. That error is hinting that the upload slots are all currently used.


Internet Archive Wayback outbound archiving slots are distinct from ArchiveTeam. There are per host concurrent limits at Wayback to be polite.


Is it just me, or is the data they get neither in a searchable format nor indexed?


Usually it goes something like this:

- Grab the data in a raw format

- Upload to Internet Archive

- Figure out how to extract structured data from raw dump

- Upload structured data to IA
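
The "extract structured data from raw dump" step might look something like the sketch below. The markup is entirely hypothetical (a flat run of `<div class="comment" data-author="...">` blocks) and stands in for whatever the grab actually produced; the class name and the record fields are my own invention, not ArchiveTeam's tooling.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class CommentExtractor(HTMLParser):
    """Collect {"author", "text"} records from a hypothetical flat page layout."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[Dict[str, str]] = []
        self._current: Optional[Dict[str, str]] = None
        self._depth = 0  # nesting depth of <div> tags inside the current comment

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self._current is None:
            classes = (attr.get("class") or "").split()
            if tag == "div" and "comment" in classes:
                self._current = {"author": attr.get("data-author") or "", "text": ""}
                self._depth = 1
        elif tag == "div":
            self._depth += 1  # an inner <div> belongs to the current comment

    def handle_endtag(self, tag):
        if self._current is not None and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace before storing the record.
                self._current["text"] = " ".join(self._current["text"].split())
                self.records.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


raw = (
    '<html><body>'
    '<div class="comment" data-author="alice"><p>Old Reddit\n renders fine.</p></div>'
    '<div class="nav">skip me</div>'
    '<div class="comment" data-author="bob"><p>Agreed.</p></div>'
    '</body></html>'
)
extractor = CommentExtractor()
extractor.feed(raw)
print(extractor.records)
```

Because the raw copy is kept, this step can be rerun whenever someone writes a better extractor, which is why the grab comes first and the structure later.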


[flagged]


There are plenty of communities on Reddit that provide truly useful information, tons of niches. We should be archiving this supposed "rubbish."


[flagged]


Terms of Service apply to users of the service. You can run this without being a user of the service. And worst-case scenario, they can ban you for breaking the Terms of Service; it's not law, but it dictates how you can use the service.

If you're not actually breaking any laws, you don't have to worry about lawsuits.



are TOS terms law though?


Although I'm not exactly opposed to the archiving, I personally think this is a waste of bytes.

Reddit is not like a regular web forum. Posts/threads are designed to be short-lived and quickly forgotten. Because threads can't be bumped, many communities are a dumping ground of rehashed ideas, memes, sanctimony, and "im new here how do i get started". Many people like(d) Reddit because one can easily sign up, post some stuff anonymously, and not have it effectively hang around in the high consciousness of the internet.

Although there are niche subreddits with some good information in them, this doesn't mean that the vast-vast majority of Reddit posts reflect that quality. I can't say there's anything I've ever posted on Reddit that I've wanted to continue existing in perpetuity, and there's nothing I've ever read on there that I've wanted or expected to last forever.

IMO, the archive team should do what it's doing but keep the archive and stay out of the issue to give negative sentiment towards Reddit enough of a chance to create some actual change. I don't have much confidence said change will take place, but I don't think what the archive team has done here will do anything but subvert the energy to fight back against Reddit. Only when Reddit has been defeated should they expose their archive of Reddit posts.

Then again, the money needed to store Reddit links would be better spent on the archive team's legal budget right now.


> Although there are niche subreddits with some good information in them, this doesn't mean that the vast-vast majority of Reddit posts reflect that quality.

Even if that's true, you're left with the problem of figuring out which is which. That's going to be a hell of a lot harder and more time-consuming than throwing a bunch of volunteer bandwidth at the problem.

It's not like the ArchiveTeam has an office staffed with a thousand people working 9 to 5 to make case by case archival decisions.


That seems less like an argument for this being a job of Archive Team and more like one for individual Reddit communities and individuals to handle. If the content really matters to people, and the tools to archive that content are available, then it's those entities that are in a better position to archive. Yet, if very few communities manage to take on this task, then that speaks to the lack of actual value that most Reddit content represents.


> That seems less like an argument for this being a job of Archive Team and more like one for individual Reddit communities and individuals to handle. If the content really matters to people, and the tools to archive that content are available, then it's those entities that are in a better position to archive.

Honestly, you're just letting the perfect be the enemy of the good so the job just won't get done.

Everyone can't be competent at and focused on everything, but that seems to be what you're asking.

> Yet, if very few communities manage to take on this task, then that speaks to the lack of actual value that most Reddit content represents.

No, it does not. That's trivially demonstrated by an example where someone had a problem that they got solved in some thread, and then everyone moved on because their problem was solved. Later, maybe years later, someone else has the same problem. By your thinking, because the content ceased to be valuable to someone arbitrarily close to its generation, it doesn't have any "actual value," which is false. Frequently, the "actual value" is found later, by other people.


It has nothing to do with being "perfect". Archiving billions of Reddit posts isn't free or necessarily of enough value to be worthwhile in my opinion.

> Everyone can't be competent at and focused on everything, but that seems to be what you're asking.

No, they're not that incompetent. People are as generally competent as what is demanded of them by their motivation. If the Archive Team, instead of blanketly archiving stuff off Reddit, devoted that effort to releasing archiving tools that make it simple for a 90+ IQ person to back up a Reddit community, that would be more worthwhile and far less wasteful. It would even potentially make communities less dependent on Reddit no matter the outcome of the recent controversy.

> No, it does not.

How so? Scraping old.reddit.com is not hard. Even snapshotting whole pages of the new Reddit can be done if that becomes the only alternative. Given enough motivation to archive valuable information, even a junior developer could do it. If a community is valuable enough, someone will archive it. People don't devote effort to things they don't find valuable.


> It has nothing to do with being "perfect". Archiving billions of Reddit posts isn't free or necessarily of enough value to be worthwhile in my opinion.

No one's making you pay for it.

And my understanding is those billions of blanket-archived posts have been instrumental in training the current crop of "AI" language models.

> If the Archive Team, instead of blanketly archiving stuff off Reddit, devoted that effort to releasing archiving tools that make it simple for a 90+ IQ person to back up a Reddit community, that would be more worthwhile and far less wasteful. It would even potentially make communities less dependent on Reddit no matter the outcome of the recent controversy.

I suppose you can go onto their IRC and let them know your thoughts about how they're doing it all wrong: https://webirc.hackint.org/#irc://irc.hackint.org/#archivete...

Honestly, you seem to just be really confused about this whole archiving thing. Like you have it backwards about what's cheap (bandwidth and storage) and what's not (human effort), so you're advocating spending expensive resources to conserve cheap resources. Furthermore, you have some strange preference for isolated individual efforts, which are totally unsuited for actual preservation, which really requires long-lived institutions to be successful.

> How so? Scraping old.reddit.com is not hard. Even snapshotting whole pages of the new Reddit can be done if that becomes the only alternative. Given enough motivation to archive valuable information, even a junior developer could do it. If a community is valuable enough, someone will archive it. People don't devote effort to things they don't find valuable.

That's twisted logic. People do find these communities valuable and they do devote effort to archiving it, but that's not actually good enough for you. You deem that only certain people should archive, and if they don't do it (maybe they want to, but are busy with other things) it shouldn't be done. Either you don't realize the flaw there, or it's just a way of indirectly saying "I don't like it, so don't do it, period."


> Posts/threads are designed to be short-lived and quickly forgotten

No matter what they are designed for, the fact is that links around the net point to them, some of them 10 days old, others 10 years old. All those links might stop working, which sucks, no matter what the content itself is.


That's fair, but I don't think it sucks, personally. Just because it exists doesn't mean it necessarily has the universal value to be archived. Others will disagree, and perhaps most will, but the way I see it is that there's better things to spend time and effort on than backing up Reddit. As an alternative, I think it would be better for there to be good tools for individual Reddit communities to back up their own content, which would also allow them the control to archive what's valuable and leave behind the flotsam. Perhaps such a tool would even support migrating to other forum systems. If no one would be willing to use such a tool, then that would essentially prove my point that most Reddit content isn't highly worthy of being archived.

Like I tried to suggest in my first sentence in my original comment, I don't necessarily think there's no argument for or no value at all in archiving Reddit. It just doesn't seem like the best choice, especially for an organization like the Archive Team who could be spending that time and effort to fight for their own existence, even if it's only costing them $20 in total. The effort would be better left to communities and individuals to decide what should be archived and how to host it.


> Just because it exists doesn't mean it necessarily has the universal value to be archived. Others will disagree, and perhaps most will

Most specifically, Archive Team disagrees with this, as they'll aggressively go ahead and archive the internet, even if the hosters of said content don't want the content to be archived.

Figuring out what's worth keeping around is a lot easier to do after you've archived it, rather than before.

> especially for an organization like the Archive Team who could be spending that time and effort to fight for their own existence

The Archive Team's goal is to archive content of the internet that others won't archive. It's the explicit and only goal of the organization.

That basically is fighting for their own existence: without the goal of archiving, Archive Team wouldn't exist.

And there is nothing working against their existence. Are you maybe confusing them with the Internet Archive, which is under a bit of legal fire right now? They're separate organizations.



