I don't know how anyone manages to use archivebox. I've tried it twice in the last 3 years and its site compatibility is bad, it quietly leaks everything you archive to archive.org by default, and whenever it fails on a download it stops archiving anything even after deleting and resubmitting all the jobs.
These are legitimate gripes that have plagued specific past releases, and I hear your frustration. Please keep in mind this was a solo effort by a single developer, worked on only in my spare time over the last 7 years (up until very recently).
The new v0.8 adds a background queue specifically to deal with the issue of stalling when some sites fail. There was a system to do this in the past, but it was imperfect and mostly optimized for the Docker setup, where a scheduler runs `archivebox update` every few hours to retry failed URLs.
Site compatibility is much improved with the new BETA, but it's a perpetual cat-and-mouse game to fix specific sites, which is why we think the new plugin system is the way forward. It's just not sustainable for a single company (really just me right now) to maintain hundreds of workarounds for each individual site. I'm also discussing with the Webrecorder and Archive.org teams how we can share these site-specific workarounds as cross-compatible plugins (aka "behaviors") between our various software.
> it quietly leaks everything you archive to archive.org by default
It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context: https://news.ycombinator.com/item?id=26866689
I can accept the other issues, but archivebox needs to be private and secure by default.
Sending everything to archive.org is a bad default, and it erodes a certain level of trust in the project. Requiring "several important changes and security considerations" just makes it a non-starter. The default settings should be "safe" for the default user, because as you mentioned in that post, 90% of users are never going to change them. Users should be able to run it locally and archive data without worrying about security issues, unless you only want experts to be able to use your software.
There's also a contradiction between your statement and your blog post: someone saving their photos isn't going to want to worry about whether they configured your tool correctly or whether they're leaking all the group logs or grandma's photos.
> It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context
> Who cares about saving stuff?
> All of us have content that we care about, that we want to see preserved, but privately:
> families might want to preserve their photo albums off Facebook, Flickr, Instagram
> individuals might want to save their bookmarks, social feeds, or chats from Signal/Discord
> companies might want to save their internal documents, old sites, competitor analyses, etc.
I want the project to do well but it really needs to be secure by default.
> The default settings should be "safe" for the default user,
I 100% agree, but because private archiving is doable but NOT 100% safe yet, I can't make that mode the default. The difficult reality currently is that archiving anything non-public is not simple to make safe.
Every capture will contain reflected session cookies, usernames, PII, and other sensitive content. People don't understand that this means if they share a snapshot of one page, they're potentially leaking their login credentials for an entire site.
It is possible to do safely, and we provide ways to achieve that, which I'm constantly working on improving, but until it's easy and straightforward and doesn't require any user education on security implications, I can't make it the default.
The goal is to get it to the point where it CAN be the default, but I'm still at least 6 months away from that point. Check out the archivebox/sessions dir in the source code for a look at the development happening here.
Until then, it requires some user education: setting up a dedicated Chrome profile and cookies, and tweaking config (an intentional barrier to entry for private archiving).
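To illustrate the reflected-credentials problem described above, here is a minimal Python sketch. It is not ArchiveBox code: the `find_reflected_secrets` helper, the cookie name, and all values are made up for illustration. It checks whether known session values appear verbatim in a captured page before it is shared.

```python
def find_reflected_secrets(snapshot_html: str, secrets: dict[str, str]) -> list[str]:
    """Hypothetical helper: return the names of secrets whose values
    appear verbatim in the captured page."""
    return [name for name, value in secrets.items() if value and value in snapshot_html]


# Made-up session values and capture, for illustration only.
session = {"sessionid": "a1b2c3d4e5", "username": "alice_example"}
capture = '<script>window.user = {"name": "alice_example", "sid": "a1b2c3d4e5"}</script>'

leaks = find_reflected_secrets(capture, session)
print(leaks)
```

A check like this only catches values that are reflected verbatim, which is part of why sharing captures of logged-in pages still needs user education.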
I don't think it's possible to remove information about yourself from a webpage before you share it. A website can always be crafted to sneak reflected session information or the ArchiveBox instance's IP address into the main content. This can be a real response:
> And that was this week's newsletter! Congratulation for reading to the bottom, dear 198.51.100.1.
Even if the archivebox instance noted its own IP to do a search-and-replace like `s|198\.51\.100\.1|XXX.XXX.XXX.XXX|` on the snapshot it is about to create, it's possible to craft a response that obscures the presence of the information, such as by encoding the IP like this: `MTk4LjUxLjEwMC4xCg==`. I.e. steganography (https://en.wikipedia.org/wiki/Steganography).
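To make that concrete, here is a minimal Python sketch (not ArchiveBox code; the `redact_ip` helper and the "Tracking token" wrapper text are hypothetical) of the naive search-and-replace, and of how a base64-encoded copy of the same address passes through it untouched.

```python
import base64
import re

INSTANCE_IP = "198.51.100.1"  # the example address used above


def redact_ip(snapshot: str, ip: str = INSTANCE_IP) -> str:
    """Hypothetical helper: literal search-and-replace of the archiver's own IP."""
    return re.sub(re.escape(ip), "XXX.XXX.XXX.XXX", snapshot)


plain = "Congratulation for reading to the bottom, dear 198.51.100.1."
encoded = "Tracking token: " + base64.b64encode(b"198.51.100.1\n").decode()

redacted_plain = redact_ip(plain)      # the literal IP is replaced
redacted_encoded = redact_ip(encoded)  # nothing matches, so nothing changes

# Anyone who knows the encoding can still recover the IP from the "redacted" text.
token = redacted_encoded.split(": ", 1)[1]
recovered = base64.b64decode(token).decode().strip()
print(redacted_plain)
print(redacted_encoded, "->", recovered)
```

Base64 is just one trivially reversible encoding; a site could use any transformation it likes, so pattern-based redaction cannot enumerate them all.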
Being able to anonymize archives before sharing them is something I would find interesting, but I don't think you can beat steganography, so I'm wondering what exactly you mean you plan to do.
I've been very impressed by all of your responses in here, but that one in particular shows empathy, compassion, and a deep deep subject matter expertise.
Given that it's a custom tool built to archive stuff for archive.org, why would you expect it to also do the completely opposite task of saving information privately?
I can see why you would want such a tool, but it seems like a direct divergence from the core goal of the existing codebase.
Gyms are notorious shysters who make it difficult to cancel your membership, even when you have the right. Don't blame the consumers for this bullshit. Do as many chargebacks as you can.
Don't sign an agreement to do something you don't want to do. It's as simple as that.
It's not "blaming the consumers" for expecting people to follow the terms of contracts they sign. I never had a Gold's Gym membership for exactly this reason - their cancellation terms were onerous, I wasn't interested in complying, so I never signed and never gave them any money.
If you say "well, I don't want to do that, but I'm just going to sign this anyway then do a chargeback because that's easier" then yes, you deserve to be blamed, you deserve to be shamed, and you should have to pay the cancellation fees, early termination fees, whatever.
Bullshit rules are bullshit rules. The fact that something is technically legal doesn't make it morally justifiable. The default assumption of any consumer in a high trust society is that they are going to receive a fair service for the price they pay.
This was written in a very confident way, but I can say with at least as much confidence that my house was mass produced in a factory and assembled locally in the middle of nowhere without any regard for local architecture.
My house has strategic overhangs (and trees with summer foliage to the south) leading to drastically different winter/summer insolation. In addition, the dark stonework on the ground floor functions to passively clear light snow in spring and early winter.
It was built in the 20th century, but according to the local vernacular, which likely (we have a few examples surviving from the 13th century) predates both the modern profession of "architect" and metal-framed awnings.
(my friend the architect has plenty of local work, but maybe that's because we live in different countries?)
It's always funny seeing people pretend they live in a world where small issues like land or food don't affect them, even though they only have that privilege because a larger military says they should.
I'm sure it works for some people, but not me.