
Archiving URLs (2019) - tosh
https://www.gwern.net/Archiving-URLs
======
gwern
My current experiment in fighting linkrot is pre-emptive local archiving,
hosting static snapshots of links, serialized using SingleFile:
https://www.reddit.com/r/gwern/comments/f5y8xl/new_gwernnet_feature_local_link_archivesmirrors/
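
A minimal sketch of the snapshot step, assuming the single-file CLI
(single-file-cli) is installed and on PATH; the URL list and archive layout
here are hypothetical:

    # Sketch: snapshot each URL into a self-contained HTML file via
    # SingleFile's CLI. Assumes `single-file` (single-file-cli) is on
    # PATH; the URL list and output layout are made up for illustration.
    import hashlib
    import pathlib
    import subprocess

    ARCHIVE_DIR = pathlib.Path("archive")
    ARCHIVE_DIR.mkdir(exist_ok=True)

    urls = [
        "https://example.com/some-paper",
        "https://example.org/blog/post",
    ]

    for url in urls:
        # Name snapshots by a hash of the URL so re-runs are idempotent.
        out = ARCHIVE_DIR / (hashlib.sha1(url.encode()).hexdigest() + ".html")
        if not out.exists():
            subprocess.run(["single-file", url, str(out)], check=True)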

I am still fine-tuning it: there are a lot of domains which should be
whitelisted because archiving doesn't work or isn't useful, and sometimes
pages break when SingleFile'd (the content is there, but the CSS or JS is
broken, and I muck around manually trying to fix them). The resource
requirements are not that bad so far (21GB) but are probably a deal-breaker
for anyone who wants the usual personal-static-site budget strategy of hosting
on Amazon S3 for <$5/month: even assuming far fewer links than gwern.net,
cloud bandwidth is so expensive that they'd quickly blow their budget. It also
makes server logs messy, as various bots try to fetch broken relative links
pointing to resources that didn't get inlined by SingleFile (not sure what
exactly triggers those).
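
A minimal sketch of such a whitelist, assuming it simply lists domains to
skip when archiving (the domains shown are hypothetical examples):

    # Sketch: decide whether a link should get a local snapshot. The
    # domain list is hypothetical; these would be sites where a snapshot
    # fails or adds nothing over linking directly.
    from urllib.parse import urlparse

    SKIP_DOMAINS = {
        "www.youtube.com",  # snapshots of video pages aren't useful
        "arxiv.org",        # already stable/archival
    }

    def should_archive(url: str) -> bool:
        host = urlparse(url).hostname or ""
        return host not in SKIP_DOMAINS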

However, so far it seems like a viable strategy.

~~~
gildas
Author of SingleFile here. Thanks for your feedback! Feel free to contact me
or open issues on GitHub about these bugs, and I will do my best to fix them.

~~~
gwern
I haven't mentioned it because my logs are full of garbage from crawlers even
without any bugs (there's a shocking number of malformed, malicious, or just
plain crazy requests - I think there must be crawlers which use n-grams or
something to predict possible URLs to speculatively fetch), and I haven't
tracked down whether any of it has to do with SingleFile or is just the crazy
old Internet being crazy. At some point I'll have to dig into it more.

It's _probably_ something simple like "lots of websites encode addresses with
absolute paths, so bots follow those; SingleFile ought to rewrite absolute
paths to point to the original domain."
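
If that hypothesis is right, the fix could be a post-processing pass over
each snapshot; a rough, stdlib-only sketch (regex-based HTML rewriting is
fragile, so this is an illustration rather than a robust implementation):

    # Sketch: rewrite root-relative href/src attributes in a saved
    # snapshot so they point back at the original domain instead of
    # resolving against the mirror host.
    import re
    from urllib.parse import urljoin

    def absolutize(html: str, original_url: str) -> str:
        def fix(match: re.Match) -> str:
            attr, path = match.group(1), match.group(2)
            return f'{attr}="{urljoin(original_url, path)}"'

        # Matches href="/foo" or src="/foo", but not protocol-relative
        # URLs like "//cdn.example.com/x".
        return re.sub(r'\b(href|src)="(/[^/"][^"]*)"', fix, html)

    print(absolutize('<a href="/about">', "https://example.com/page"))
    # -> <a href="https://example.com/about">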

~~~
gildas
Thanks again for your comments. I also own a SaaS related to SEO and I can
confirm what you see. Some bots request URLs that don't exist by trying to
guess them. On the other hand, I also often see errors in links where the
HTML contains, for example, <a href="www.example.com/another-page">, which is
resolved as https://www.example.com/www.example.com/another-page. FYI, note
that SingleFile is supposed to resolve all links, and this should be reliable
since it delegates the resolution to the browser.
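
That misresolution is exactly what standard relative-URL resolution
(RFC 3986) produces, since a schemeless href is treated as a relative path
rather than a host name; a quick demonstration with Python's stdlib:

    # The schemeless href is treated as a relative path, so it resolves
    # underneath the current page's directory, not as a new host.
    from urllib.parse import urljoin

    base = "https://www.example.com/page.html"
    print(urljoin(base, "www.example.com/another-page"))
    # -> https://www.example.com/www.example.com/another-page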

~~~
gildas
I'm also realizing your issue could be related to some <noscript> tags which
are not processed by SingleFile and could contain relative URLs. I should
maybe add an option to remove them.
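
Such an option might amount to stripping <noscript> blocks before
serialization; a minimal sketch (regex-based, so an approximation only):

    # Sketch: drop <noscript> blocks, which may carry relative URLs that
    # never get rewritten during serialization.
    import re

    def strip_noscript(html: str) -> str:
        return re.sub(r"<noscript\b[^>]*>.*?</noscript>", "", html,
                      flags=re.IGNORECASE | re.DOTALL)

    print(strip_noscript('a<noscript><img src="img/x.png"></noscript>b'))
    # -> ab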

------
ezequiel-garzon
Does anybody know the rationale behind the policy the Internet Archive has
about removing past content at the request of the domain owner (or lessee, to
be precise)? I'm sure they have their reasons, but they are not very
intuitive.

To me it's as if I lived at 15 Main St., and for that reason alone I had the
right to block the publication of photo albums of past years of 15 Main St.,
even if the dates are clearly indicated.

~~~
pbhjpbhj
It's a simple way to remove content, so I assume it acts as a way for the IA
to claim carrier protections under the USA's DMCA, and to limit damages in
other jurisdictions.

What they do is certainly tortious infringement in the UK - and probably in
most jurisdictions AFAICT (based on the Berne Convention, say) - but it's so
easy to remove your domains from the IA that courts will be loath to award
much in damages.

AIUI they block access but don't delete the info; Fair Use/archival laws in
the USA probably make that lawful.

------
NewEntryHN
First law of the Internet: you cannot delete something from the Internet.

Second law of the Internet: you cannot keep track of where something is on the
Internet.

------
apacheCamel
With link rot and all the lost pages of the day, I really do wonder how much
information and effort is simply lost to the sands of time over the years.
This truly makes the Internet Archive such an important tool for everything
we do on the internet. I know that is a commonly held belief here, but I
really think many regular users need to come into contact with it and see the
great benefit it has for us all.

------
infogulch
https://archivebox.io/ is another interesting web archive tool.

~~~
nikisweeting
Many others are listed in our wiki too:

https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#other-archivebox-alternatives

------
dreamcompiler
This problem would go away if everybody used URNs instead of URLs and we had
multiple public URN resolvers. URN resolution should be as ubiquitous as DNS
resolution.

But alas, here we are.
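
To make the idea concrete, a toy resolver, mapping stable names to whatever
locations currently hold the content, the way DNS maps names to addresses;
the urn:example namespace (RFC 6963) and the table entries are purely
illustrative:

    # Toy illustration: a URN names *what* a document is; a resolver maps
    # that stable name to wherever copies currently live. All entries
    # here are made up.
    RESOLVER_TABLE = {
        "urn:example:archiving-urls": [
            "https://www.gwern.net/Archiving-URLs",
            "https://mirror.example.org/Archiving-URLs.html",
        ],
    }

    def resolve(urn: str) -> list[str]:
        """Return currently-known locations for a URN, or [] if unknown."""
        return RESOLVER_TABLE.get(urn, [])

    print(resolve("urn:example:archiving-urls"))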

------
abhayhegde
Fixed it for you:
https://web.archive.org/web/20200908171030/https://www.gwern.net/Archiving-URLs

~~~
Smaug123
In fairness, if anyone's content is going to stick around, it's probably
Gwern's.

