ArchiveTeam Warrior (archiveteam.org)
233 points by xnx 10 months ago | 67 comments



These projects are very much needed, but I also believe that we need to go beyond simple "archiving" and have a way to make this data actually available to the public.

For example: I don't think that Reddit ever felt threatened by the fact that a bunch of people are pulling their data and creating a backup that is not easily accessible. But I think that Reddit would be very much afraid of a distributed network of nodes running partial copies of Reddit and making it available for local-first clients and/or Lemmy mirrors.


Most projects output WARC files which are batched, uploaded to archive.org, and go into the Wayback Machine: https://web.archive.org/

(Note that Archive Team is separate from Internet Archive)


Right, but AFAIK there is no way to query a WARC file, is there?

Let me explain where I am coming from. I'm working on an (open source, self-hosted) service to help people migrate from Reddit to Lemmy, called Fediverser [0]. It offers the following:

- A crowdsourced map of "reddit-to-lemmy" alternatives.

- Uses the list of subreddits and some preferences to find a Lemmy instance that is suitable for you.

- Lets people sign up to a "fediversed" Lemmy instance directly via Reddit OAuth. Simplifies the registration process and can let an admin skip the verification process (e.g., reject redditors whose accounts are less than 3 years old).

- Using the crowdsourced data, automatically subscribe the user to the communities that correspond to their favorite subreddits.

- If the admin of the Lemmy instance so chooses, it can also set up mirror bots, which will create "shadow accounts" for each reddit author. This shadow account can then be "taken over" by the real redditor if/when they sign up to the instance.

I believe that these features together would lead to a credible threat to Reddit's dominance. My remaining "problem" to solve is, simply put, that I need more people running this, because it's just too much data for a single node. I set up an instance that was mirroring ~100 subreddits (posts and comments). In three months, my database was already recording ~3 million "shadow" users and ~10 million posts + submissions.

For this to work, I either need to have more instance admins willing to run the Fediverser software, or I need to move the "shadow" users and the mirrored content straight to the client and only bring to the Lemmy server the content from users who actually migrated.

[0]: https://fediverser.io


It's an ISO standard. There are WARC libraries around, e.g. https://github.com/webrecorder/warcio
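
To make that concrete: here's a minimal sketch of reading one with warcio (pip install warcio); the file name and the response-only filter are just illustrative. Iteration is sequential, so for lookups by URL you'd normally build a CDX index on top (pywb ships tooling for that) rather than scan the whole file.

    # Minimal sketch: iterate the records in a WARC and print response URLs.
    # Assumes a local file named example.warc.gz -- adjust as needed.
    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                url = record.rec_headers.get_header('WARC-Target-URI')
                body = record.content_stream().read()
                print(url, len(body), 'bytes')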


That still doesn't solve the discoverability problem. The Wayback Machine is only useful if you happen to know the URL where a piece of content used to live.


That, and the fact that the data is "frozen". There is no way to add a comment to a Reddit thread that has been archived without going to Reddit first. I'd like to have a two-way mirror.


Isn't read-only one of the main points of being "archived"?


Yeah, but I am saying that I don't want it merely "archived". I want it "copied and available outside of Reddit's control".


That's lemmy.ml. See this comment: https://news.ycombinator.com/item?id=41143632

edit: they already did.


Did you link to my own comment? :)


Indeed, what's really needed is a search engine that indexes all the content on the Wayback Machine and makes it accessible in a similar fashion to, e.g., Google.


The Wayback Machine does that, to some extent. It tends to prefer to return results where the query is in the URL, but sometimes it picks up on the word being in the page. I don't know how much indexing it does, but it does some.


> That still doesn't solve the discoverability problem

I have somewhat of a conspiracy theory that, deep down, they don't implement a search feature over all their content on purpose. Essentially, if the Wayback Machine made it easy to discover stuff, they would end up having to deal more and more with all those shenanigans of people requesting information to be removed. By keeping the information there, but somewhat unfindable or at least very hard to find, they essentially preserve the information without having to deal with that problem (I know this happens even nowadays, but if it were easier to find information, it would happen even more).


Raw genius, the site even looks "old" ...

On a related note, an Internet Archive backup would probably cost between 20M and 60M USD. Many EU countries would have an incentive to do this as a public culture preservation project.

The archive is roughly 70 PB. Decentralized storage projects have already reached 7 PB, so attempting a decentralized backup could also work.


> Many EU countries would have incentive to do this as public culture preservation projects

There are multiple archive projects around the world, which often isn't understood when the Internet Archive (the biggest and original) is discussed.

Various countries including some EU already have those as "public culture preservation projects", targeting their own nation's web presence.

In that context, and with funding already scarce, there is not really an incentive to back up a load of irrelevant (in the sense that it's not their country's) archive material.


Targeting a nation's own web presence cannot capture as complete a picture as the IA does. For example, discussions about minor languages that are vanishing are often in English. In addition, the IA does not only archive the internet; it's rather an archive ON the internet that scans, digitizes & organizes tons of material.

Now, notice that the budgets of many of these projects are in the billions:

https://www.ne-mo.org/cooperation-funding/funding-opportunit...

Putting 100 million EUR into a European IA backup would be more cost-effective than any of these projects.

Alternatively or additionally:

https://en.m.wikipedia.org/wiki/Wikipedia:Fundraising_statis...

Wikipedia, being dependent on the IA, could probably also invest 50M into the project. In fact, this would probably do more of what the donations were meant to do than anything else they could do with the ("excess") funds.

Truth be told, the NSA probably has an IA backup. But it still sort of drives me insane to know that political change or natural catastrophes could lead to loss of public access to the IA. No one seems to care about the IA enough except the IA itself.


I would agree that additional funds to initiatives like the IA and local equivalents targeted at resilience would be welcome, but:

Current digital preservation projects likely cost a tiny fraction of that 100 million at the national level and include additional activities like those you ascribe to the IA, carefully attuned to each nation's priorities, while collaborating internationally with each other and with organisations in the USA like the IA.

Importantly, they will also be operating strictly according to national and intra-national legislation (something the IA has gotten into severe trouble over in the USA).

In the context of a complicated international environment with many different local political, cultural, commercial and other factors, it's difficult to see how your proposal to replace local projects with an IA backup would be either more cost effective or legally practical.

The Internet Archive is inspirational and does a terrific job, but canning the many disparate entities that do national equivalents on much more limited budgets in favour of moving those funds to the IA (or any other global corporation or organisation) would risk invoking the classic problems of centralisation with associated detrimental effects on local requirements.


I agree that politics complicate things, though at some point it becomes a question of courage and not legislative-political niceties.

I am not sure if you opened the link I gave. It seems the budget in the EU is particularly high.

I don't believe the optimal solution would be to move anything to the IA: in fact, a separate legal entity would be a much better option due to decreased legal risk. The backup would only need to be updated perhaps once or twice a decade, or even less often.


Some governments (US, ...) require that government sites be archived. There is a service for that.

https://www.archive-it.org/


Or the Archive's intent is long-term preservation. I don't think the intention is to be a free source of access to copyrighted material. There is no perfect solution, I think. Far more digital content disappears each day than most people realize. [1]

[1] https://www.pewresearch.org/data-labs/2024/05/17/when-online...


Reddit was quite unhappy with services that mirrored their data and made it publicly available. So much so that they updated the TOS for the API and forced services to delete their data and shut down.


So much so that they've now blocked all search engines except for Google. (Google paid a lot of money for this.)

Reddit's become low quality now anyway. All the good old high quality information that people search Reddit for is in the dumps - they're not generating much more of it.


Yet that doesn't stop a motivated actor from doing it.

They certainly don't want to have other Big Tech companies exploring "their" data, but I'd argue they would only go after someone who had a clear commercial interest. They are not going to chase a few dozen people on /r/DataHoarder running an Archive Warrior.


I'm not sure if threatening a corporate entity who can change their API at any time is a feature here. See: PushShift, Nitter, Youtube Vanced


There are levels of survival we are prepared to accept...

1) A lot of Reddit's usefulness comes from the bots. If they shut down the API entirely, they would lose a lot of value which would accelerate their demise.

2) A more cynical person would say that without bots, Reddit would lose 30-40% of its "traffic", and that they cannot afford to do that. This is why the current API is still quite generous: it's enough for most bots, but just too expensive for third-party clients.

3) Even if Reddit shut down its API tomorrow, the majority of interesting content has already been copied/archived.

4) Scraping old.reddit is quite easy, and getting rid of it altogether is not something that they are willing to do.

All in all, I'd say that if we ever get to a point where Reddit is cutting down the API, it will be a time to celebrate.


> Scraping old.reddit is quite easy, and getting rid of it altogether is not something that they are willing to do.

Really? Unless there's an ironclad public statement to the contrary, the vibe I get from old.reddit is that it won't be long until the axe falls, especially since they are fine with "new" reddit features that malfunction in old.reddit. And I'd imagine their attitude towards unhappy users once they kill old.reddit will be the same as when they killed the API (which was "go fuck yourselves", to put it plainly).


Maybe it was just my corner of Reddit, but last year, when I was trying to convince people to move away, there were a good number of people who were just saying "I don't care about the API pricing changes because I just use old.reddit. If they do get rid of it, then I'll leave it right away."

My feeling is that Reddit will continue to push and nudge people to the new UI, but will not fully retire the old one. I think that if they ever do, it will be the final straw and mass migration will be inevitable.


New users always end up using the new version, and if they're new to the Internet they don't realize what they're missing. Old users contribute the most value, but it's not captured in shareholder- or management-visible statistics so they don't care and will drive them off to improve the statistics they do have (like ad impressions per page load). Thus old.reddit will die.


A P2P, ad hoc, offline-first internet that deprecates BGP and uses a globally memoized shared content/compute-addressable memory via a DHT/virtual machine. Identity for anything is its composition.

I can dream.


Related:

ArchiveTeam Warrior: archiving as much of imgur as possible - https://news.ycombinator.com/item?id=35983510 - May 2023 (2 comments)

Help preserve the internet with Archiveteam's warrior - https://news.ycombinator.com/item?id=30524842 - March 2022 (51 comments)

ArchiveTeam Warrior backing up Reddit - https://news.ycombinator.com/item?id=29584622 - Dec 2021 (71 comments)


I can run Docker for this, so I looked up the repo and, sweet surprise, it's all there.

https://github.com/ArchiveTeam/warrior-dockerfile/tree/maste...

https://github.com/ArchiveTeam/warrior-dockerfile/blob/maste...
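
For anyone else who wants to skip the VM, the README boils down to roughly this (the registry/image name and the port are from memory of that README, so verify against the repo before copying):

    # Sketch: run the warrior container and expose its web UI on port 8001.
    docker run -d \
      --name archiveteam-warrior \
      --restart unless-stopped \
      -p 8001:8001 \
      atdr.meo.ws/archiveteam/warrior-dockerfile
    # Then open http://localhost:8001 to pick a project and set a nickname.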


Very easy to set up. Glad to contribute back in some small way now that my Internet plan is "unlimited". Was trying to figure out if I could run ArchiveTeam Warrior scripts on an unused Android phone, but that's not directly supported and above my skill set.


Agree, I found it trivial to run in a docker-composed container on an x86 box.

I wish I had spare time to try to figure out how to get it set up inside QEMU on a Raspberry Pi... Seems like publishing such a compose file would unlock it for even more people.


No need to mess with QEMU. You may need to build the container for ARM, though: https://github.com/ArchiveTeam/warrior-dockerfile/tree/maste...

There's a specific "wget-lua.raspberry" file in the repo, so Raspberry Pi seems to be supported almost natively.
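
If someone wants to try the local ARM build, it would presumably look something like this (untested sketch; whether the resulting wget-at build behaves correctly on ARM is a separate question):

    # Hypothetical local build of the warrior image for 64-bit ARM.
    git clone https://github.com/ArchiveTeam/warrior-dockerfile
    cd warrior-dockerfile
    docker buildx build --platform linux/arm64 -t warrior:arm64 --load .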


Unfortunately the Raspberry Pi version isn't supported anymore because their version of Wget isn't well-tested on ARM. I'm surprised it's still there in the repo.

You can get Docker to emulate x86, though.
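
Roughly like this (same image-name assumption as elsewhere in the thread; the binfmt helper shown is the standard one from Docker's multi-arch tooling, but treat the exact invocation as an assumption):

    # One-time qemu/binfmt registration on the ARM host, then force the
    # x86-64 image to run under emulation.
    docker run --rm --privileged tonistiigi/binfmt --install amd64
    docker run -d --platform linux/amd64 -p 8001:8001 \
      atdr.meo.ws/archiveteam/warrior-dockerfile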


I believe that file isn't actually supported because of potential data consistency issues in how wget runs on ARM devices.


A few months ago I designed a system that would allow people to contribute their disk space to a service like archive.org. Basically, you'd say "I want to host 100 GB of archive.org content" and my system would talk to archive.org, figure out which content is currently the rarest, and push it to you.

It also comes with a retrieval function, where you can say "I want to get X content from this network" and it will find it for you.

It's a fairly thin wrapper over torrents, but it's something that doesn't currently exist. Unfortunately, it wasn't met with much interest when I contacted some archivists.
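
To give a flavour of the allocation logic (purely a hypothetical sketch of my own design, not anything archive.org exposes): given per-item seed counts and a volunteer's pledged space, assign the least-replicated items first.

    # Hypothetical sketch of the allocation idea described above; names and
    # the example catalog are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class Item:
        torrent_id: str
        size_gb: float
        seeders: int  # how many volunteers already hold a copy

    def allocate(items: list[Item], pledge_gb: float) -> list[Item]:
        """Pick the least-replicated items that fit in the pledged space."""
        chosen, used = [], 0.0
        for item in sorted(items, key=lambda i: i.seeders):
            if used + item.size_gb <= pledge_gb:
                chosen.append(item)
                used += item.size_gb
        return chosen

    if __name__ == "__main__":
        catalog = [
            Item("ia-item-aaa", 40.0, 1),
            Item("ia-item-bbb", 25.0, 7),
            Item("ia-item-ccc", 55.0, 2),
        ]
        for item in allocate(catalog, pledge_gb=100.0):
            print(item.torrent_id, item.size_gb, item.seeders)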


Wasn't IPFS supposed to fill that role?


No, IPFS only lets you retrieve and store a file you want, not a bunch of files someone upstream wants you to store, and not with a specific space target.


Running an IPFS Cluster follower node gets part of the way there. I'm intending to do this with the data from Anna's Archive. See some existing projects: https://collab.ipfscluster.io/
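
From memory the follower workflow is roughly the two commands below; the exact subcommands and arguments are my recollection, so check the instructions on that page for the cluster you want to join.

    # Rough sketch of joining a collaborative cluster as a follower;
    # <cluster-name> and <template-url> come from the cluster's listing page.
    ipfs-cluster-follow <cluster-name> init <template-url>
    ipfs-cluster-follow <cluster-name> run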


That sounds like a BOINC-like file system.


Archive Team had a defunct project called INTERNETARCHIVE.BAK (they use slightly whimsical project names) which is basically this idea. Note that by using this tool, you're breaking the law (copyright infringement).


Libgen has a tracker like this for books: https://phillm.net/libgen-seeds-needed.php

looks like it's currently broken :(


I wish one of the filecoin-like projects could have solved this: pay a fee to host blocks of data for a fixed duration and a fixed number of downloads.


That is a fantastic idea. I wonder why they didn't take it on; did they say why?


Unfortunately, not really. I just didn't get a response, or I got something like "nobody is interested in this".


Who did you contact?


A few people at the IA, and I also asked in a community of archivists called "The Eye".


Please consider running a warrior appliance or docker container to contribute to digital preservation efforts.


I don't understand the advantage of people downloading sites and then uploading them to the archive set up by Archive Team. What is the advantage over Archive Team directly downloading sites into their archive?


A) It requires fewer resources from Archive Team. B) Probably more importantly, it distributes where the requests are coming from, so they're less likely to be throttled or blocked.


Archive Team isn’t Internet Archive.

They need somewhere to download things before pushing them to archive.org


How does this project prevent me, as an attacker, from inserting random garbage or rewriting the downloaded pages to be wrong in specific ways?

For HTTP this seems impossible. For HTTPS, who does the TLS termination? It could be safe if the warrior were just a TCP proxy and the Archive were the TLS client, but https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior doesn't seem to explain that clearly.


> How does this project prevent me, as an attacker, from inserting random garbage or rewriting the downloaded pages to be wrong in specific ways?

It doesn't, but in practice it's not a problem. The goal is to rescue as much data as possible with minimum turnaround before a site goes down. Getting 1% garbage is preferable to losing 30% to inefficiency in archiving because you're being paranoid.

> For HTTPS, who does the TLS termination?

The Warrior literally just runs wget and saves to WARC files that are uploaded to a tracker server.
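
For illustration only (the projects actually use a patched wget-at driven by project-specific scripts, not stock wget), plain GNU wget can already write WARC output:

    # Crawl a site and write the capture to a WARC file (example.warc.gz)
    # alongside the normal mirrored files.
    wget --mirror --page-requisites --warc-file=example "https://example.org/"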


Would love to contribute using my Raspberry Pi, but it not supporting ARM is a bit of a pain (and there's no easy way around it; I want to download the container, run it, and be done): https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior#C...?


There are some consistency issues when wget-at is compiled for ARM. You can run it under QEMU user emulation, however, and I do that on a couple of hosts.


Do you have a write up or link to a page that describes how to run it under QEMU emulation? I've been wanting to try it / publish a docker-compose file that does it in order to make it easy for Raspberry Pi users to spin up a Warrior...


I just used https://github.com/dbhi/qus to enable it for all Docker containers on the system (alternatively you can do the binfmt config manually, e.g. https://wiki.debian.org/QemuUserEmulation).

However, to have it more tightly integrated, you'd want an ARM image at /, mount the original warrior image at /warrior, and then do something like `qemu-user-amd64 /warrior/bin/chroot /warrior/entrypoint`, updating all paths as relevant.
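
For the compose file the grandparent wished for: once binfmt emulation is registered on the host, something like this is probably all that's needed (the image name and environment variables are my recollection of the warrior-dockerfile README, so double-check them):

    # docker-compose.yml -- sketch, not an official Archive Team file
    services:
      warrior:
        image: atdr.meo.ws/archiveteam/warrior-dockerfile
        platform: linux/amd64  # run the x86-64 image under qemu on the Pi
        ports:
          - "8001:8001"
        environment:
          DOWNLOADER: "your-nickname"  # assumed env vars, check the README
          SELECTED_PROJECT: "auto"
          CONCURRENT_ITEMS: "2"
        restart: unless-stopped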


I decided to read the README and saw the following notice: Version 4 is now available at https://warriorhq.archiveteam.org/downloads/warrior4/

Which then contains a README with a link to the source for the new version: https://github.com/ArchiveTeam/warrior4-vm

This new version is also mentioned on the homepage of the Wiki: "An updated Warrior virtual appliance (v3.2, v4.0) is now available with better support for newer projects that utilize wget-at."

ArchiveTeam should update the appliance download URL.


See also the page on the Archive Team wiki, with more detailed instructions: https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior


This is a much better link and really should be what the headline points to.


How will running Warrior in the background affect the PC? Will it degrade the disks over time? I remember running the Siacoin app degraded my disks a bit. Or was it IPFS?


Is it possible to work on multiple projects at the same time?


You can run multiple warriors, selecting different projects for each one you run (or running both with "auto"). I think it's advisable to only run one warrior per IP, though, if you're doing many warriors for the same project, as otherwise it's a lot easier to get rate-limited by whatever website/service you're helping to archive.

https://github.com/ArchiveTeam/warrior-dockerfile makes it pretty easy to set up.
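
As a sketch, two independent warriors pinned to different projects can live in one compose file (the image name, env vars and project slug are assumptions; check the repo's README and each project's wiki page for the real values):

    # Two warriors on one host, each with its own web UI port and project.
    services:
      warrior-urlteam:
        image: atdr.meo.ws/archiveteam/warrior-dockerfile
        ports: ["8001:8001"]
        environment:
          SELECTED_PROJECT: "urlteam2"  # hypothetical project slug
      warrior-auto:
        image: atdr.meo.ws/archiveteam/warrior-dockerfile
        ports: ["8002:8001"]
        environment:
          SELECTED_PROJECT: "auto"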


Not in a single warrior, but you can run multiple warriors. I do this; I have one running Telegrab and one running URLTeam.

Once in a while, a site will tolerate a large number of connections, and since the Warrior VM only supports a concurrency of 6, it can make sense to run multiple warriors on the same project. But this is almost always a bad idea, and many sites will 429 you with a concurrency of anything more than 2 or 3; always check the project-specific IRC channel for concurrency recommendations.


I created a VM and then just duped the VM. Definitely overkill for this purpose, but nicely segregates them out.

Right now I run three instances. Pretty low resource utilization and they're totally segregated into their own instances so just boot them and they run, shut them down and they stop.

Given that they're running arbitrary external commands, I wanted them kept on their own machines as much as possible.


Not with the Warrior, as far as I'm aware, but you can do it with Docker. It's usually the repos with -grab at the end:

https://github.com/orgs/ArchiveTeam/repositories



