
Archiving URLs - gwern
http://www.gwern.net/Archiving%20URLs
======
alextingle
Warning: His mention of the "freedup" program piqued my interest, so I went to
have a look at it. Overall, the description makes it sound like a solid tool,
and the documentation seems relatively complete.

However, when I built it, I noticed that the Makefile attempts to write new
lines into /etc/services, and yes, the program does contain socket/server
code - which is apparently triggered by undocumented options.

Personally, I'm a bit leery of file-system level tools that contain
undocumented server code, so I'll not be using it. (Although I might try to
audit it, or maybe just try one of the other similar tools listed here:
[http://en.wikipedia.org/wiki/Freedup](http://en.wikipedia.org/wiki/Freedup) )

~~~
zokier
While I'm not ready to make any judgements of nefariousness, it is worrisome
from a general security point of view that, if I read the makefile correctly,
it is configuring its network service to run as root.

edit: grepped changelog and todo files, found these:

    
    
        TODO:  1  - graphical web based user interface (full version with 2.0)
        TODO:v1.7 - first working web interface (non-stabil enhancement)
        TODO:done	+ single unmodified web template on GET request
        TODO:	+ non-interactive execution with display to web interface
        TODO:done	+ webpage streaming
        TODO:	+ providing a web interface for the interactive mode
        ChangeLog:  + gui defaults to off, activate and deactivate using "make webon/weboff/state"
        ChangeLog:  + basic web interface offered (reply not accepted yet)
        ChangeLog:  + first helper routines for web-based GUI
    

The networking code does not seem to be activated without the -W flag, so I
wouldn't be too worried about it. Just remove the entries it tries to create
under /etc and you should be fine.
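
For instance, something like this (the "freedup" service name is a guess at
what the Makefile writes - check first):

    # see what the Makefile added, then strip it out again
    grep -n freedup /etc/services
    sudo sed -i '/freedup/d' /etc/services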

~~~
alextingle
Yeah, I'm sure it's not nefarious. It's just very poor judgement to leave
that code in the build without documenting it. So poor that I don't feel able
to trust the rest of the code without auditing it.

------
getdavidhiggins
From: [https://thoughtstreams.io/higgins/permalinking-vs-transience/](https://thoughtstreams.io/higgins/permalinking-vs-transience/)

For me, permalinking is a much deeper thing than a mere pointer to a
resource, because the web is inherently transient and forever shifting. Link
rot is what defines the web. Very few people grok the concepts of "evergreen
domains" and "citable resources".

A domain must be kept renewed for at least ten years for it to be evergreen,
and content must be citable if you want to go down in history for your
efforts. How we go about implementing citable resources is tricky.

If you are a Zen Warrior and quite enjoy Sand Mandalas, then you shouldn't be
reading this. This article is for those who want to be found on the web for
their work in the foreseeable future, and for those who don't want their work
attributed to other people. You want authorship of your resource to be
correct.

Nobody ever really owns a domain. Much like land, it changes hands among
numerous parties before it belongs to any one person, and that person can
then sell it off and buy new land. Everything is borrowed. You own nothing,
really. Imagine these scenarios:

- Domain expires for financial reasons; content cannot be found.

- Webmaster dies, so he/she can't renew it indefinitely.

- Your content is simply sold because you got greedy, and you don't have
authorship anymore.

- Domain stays up for 20 years, but the TCP/IP stack is rewritten and
meshnets now dominate. You fade into darkness. (You are no longer
discoverable.)

- URLs point to the correct resource, but require proprietary software to
view them. Only the intellectual elite can view the content.

- Cached copies exist, but only at the whim of archivists who slurp the web
using Historious / Pinboard / Archive.org / Google Cache / reverse proxies /
many others... You can't rely on these people and companies to keep your
content permanent.

And so I conclude that permalinking is a much deeper concept than a mere URL
that points to a resource. It entails a slew of other topics that all center
around the age-old philosophy of permanence versus transience. It's not just
something bloggers use in WordPress!

------
vitovito
Local archives are great, but they help only you.

Please consider having your archiving scripts/services store their content in
WARC format so you can submit bundles to the Internet Archive for integration
into the Wayback Machine.

That's how Archive Team's downloads can be integrated. The latest versions of
wget support storing their downloads in this fashion.

You could even regularly spider your own site and package it up into a WARC
for submission.
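
For example, something like this both mirrors a site and emits a WARC (a
minimal sketch; the URL and file name are placeholders):

    # wget 1.14+ can write a WARC alongside its normal download tree
    wget --recursive --level=inf --page-requisites --no-parent \
         --warc-file="mysite-$(date +%F)" --warc-cdx \
         "http://www.example.org/"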

~~~
gwern
I don't know how to usefully create WARCs. Wget has a WARC option, but when I
tried it out, it created weirdly named files that littered the www tree, and
it looked like the names would collide and files would be overwritten. Plus,
the IA's live-request feature should be handling getting webpages into the IA.

~~~
Asparagirl
Here, have a Gist for creating proper WARCs, ready for the Internet Archive:

[https://gist.github.com/Asparagirl/6202872](https://gist.github.com/Asparagirl/6202872)

And a Gist for uploading the completed WARCs to the IA using their S3-like
service:

[https://gist.github.com/Asparagirl/6206247](https://gist.github.com/Asparagirl/6206247)

The final step is to e-mail someone at Archive Team with admin rights to move
your IA upload into the proper "Archive Team" bucket instead of "Community
Texts". The awesome Mr. Jason Scott should be able to help you with that.
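
For reference, uploading via the IA's S3-like API looks roughly like this (a
sketch, not taken verbatim from the Gist; the item name, file name, and keys
are placeholders):

    # PUT the WARC into a new item via the IA's S3-like API
    curl --location \
         --header "authorization: LOW $IA_ACCESS_KEY:$IA_SECRET_KEY" \
         --header "x-amz-auto-make-bucket:1" \
         --header "x-archive-meta-mediatype:web" \
         --upload-file mysite-2013-10-01.warc.gz \
         "http://s3.us.archive.org/my-warc-uploads/mysite-2013-10-01.warc.gz"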

------
MWil
This has serious implications for our entire legal system.
[http://www.pogo.org/blog/2013/09/the-supreme-court-has-a-serious-case-of-link-rot.html](http://www.pogo.org/blog/2013/09/the-supreme-court-has-a-serious-case-of-link-rot.html)

~~~
gwern
Great link. I'll add that, although I wonder why the people concerned about it
are setting up their own web archiving systems rather than just asking the
Internet Archive for an Archive-It account or something.

------
p4bl0
See also the work of urlteam [1,2].

[1] [http://urlte.am/](http://urlte.am/)

[2]
[http://archiveteam.org/?title=URLTeam](http://archiveteam.org/?title=URLTeam)

~~~
ddorian43
My VM to help the Archive Team is not working anymore. Do you have a fresh
link/VM to download?

~~~
Mithrandir
Do you mean the ArchiveTeam Warrior?
[http://www.archiveteam.org/index.php?title=Warrior](http://www.archiveteam.org/index.php?title=Warrior)

~~~
ddorian43
Yep, it's not working on two different computers that I've tried (no item
received after 30 seconds).

------
r721
A couple of "snapshot" services which weren't mentioned:

[http://www.peeep.us/](http://www.peeep.us/)

[http://archive.is/](http://archive.is/)

~~~
gwern
Those look interesting and I'll mention them, but I'm not sure whether I can
support them in my archive-bot code. Neither site seems to offer an easily
scripted way to archive URLs; for example, the archive.is blog seems to reject
adding such functionality: [http://blog.archive.is/post/60948358744/given-that-youre-not-happy-with-the-source-as-it](http://blog.archive.is/post/60948358744/given-that-youre-not-happy-with-the-source-as-it)

~~~
r721
Yes, it seems batch downloading won't be implemented, but batch archiving is
possible:

[http://blog.archive.is/post/45031162768/can-you-recommend-the-best-method-script-so-i-may-batch](http://blog.archive.is/post/45031162768/can-you-recommend-the-best-method-script-so-i-may-batch)

~~~
gwern
Sweet. I've added archive.is to my archive bot.
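
(For anyone else wanting to script it: the submission amounts to something
like the following - a sketch which assumes archive.is accepts a plain POST
of a "url" form field to its /submit/ endpoint; the URL is a placeholder.)

    curl -s --data-urlencode "url=http://www.example.org/some-page" \
         "https://archive.is/submit/"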

------
patrickmclaren
Whilst the link to linkchecker on SourceForge still works, they have moved
the project to GitHub.

Perhaps checking for a recent Last-Modified date together with a large
relative change in size would suffice for this situation.
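
Something along these lines, say (a rough sketch only; the URL, local path,
and threshold are made up):

    # flag a link as suspect if the live copy has shrunk below half or
    # grown past double the size of the locally archived copy
    url="http://www.example.org/page"
    local_copy="$HOME/www/www.example.org/page.html"
    live=$(curl -sL "$url" | wc -c)
    old=$(stat -c %s "$local_copy")
    if [ "$live" -lt $((old / 2)) ] || [ "$live" -gt $((old * 2)) ]; then
        echo "possible half-dead or hijacked link: $url"
    fi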

~~~
gwern
I'll update that link. Checking for size changes would be a useful heuristic,
but you'd need to load the equivalent URL off the disk or something like that,
and my experience has been that links tend to go completely dead rather than
half-dead, so I don't sweat it much.

------
ris
I'm quite a heavy user of WebCite and I think here would be a good place to
remind people that they are in desperate need of funding to keep their service
accepting submissions.
[https://fundrazr.com/campaigns/aQMp7](https://fundrazr.com/campaigns/aQMp7)

------
ihenriksen
I actually launched a WebCite-like service called Svonk back in 2009;
unfortunately it got VERY little usage - like just a couple of hundred hits a
month - and it got no external promotion. Here is a tutorial video
[http://www.youtube.com/watch?v=V9b2Xgi-xLM](http://www.youtube.com/watch?v=V9b2Xgi-xLM)
and the old press release if anyone is interested
[http://www.prweb.com/releases/meronymy/svonk/prweb2900644.htm](http://www.prweb.com/releases/meronymy/svonk/prweb2900644.htm).
It had a RESTful interface and everything. I guess I could re-launch it as I
still have the source code, if there's interest in it, that is.

~~~
gwern
Replied in OP.

------
nbody
Tip #1: Don't add spaces in your URLs

~~~
jms18
What is wrong with spaces in URLs?

~~~
eCa
You need (implicit or explicit) URL encoding for it to work. The actual link
to the linked article is [1], which is quite ugly, at least IMHO.

Much easier for everyone involved to just stick a dash in there.

[1]
[http://www.gwern.net/Archiving%20URLs](http://www.gwern.net/Archiving%20URLs)

~~~
thorum
I think somewhere along the way my brain started processing %20 as a space,
because I don't even notice anymore.

~~~
sedev
It might also be that browsers have started _displaying_ it as a space - my
Firefox does, for example (until you select it - copy-pasting it preserves the
%20).

------
RexRollman
I remember reading that Richard Stallman uses a program that grabs webpages
via a remote command and then emails them to him. A side effect of this, I
suppose, would be that he could have an archive of everything he read and
wanted to keep.
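
Something to that effect is easy to approximate - a guess at the general
shape, not RMS's actual setup (the URL, path, and address are placeholders):

    # fetch a page, keep a local copy, and mail it to yourself
    url="http://www.example.org/article"
    out="$HOME/read-archive/$(date +%s).html"
    mkdir -p "$HOME/read-archive"
    wget -q -O "$out" "$url"
    mail -s "saved: $url" you@example.org < "$out"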

~~~
pain
I would love that - to remember, too.

If I had a Save button, my Start button would have a balance.

I could live without "Save as..." and upload-into-proprietary-backup if my
operating system could just remember better.

We need to remember more. If we need to save something forever, we press
Archive / Email / Save.

I need that command.

------
thecabinet
I don't do much writing for the web, but I do a lot of reading. For a while I
was using Evernote to save a snapshot of every (worthwhile) page I read, but
once I had about 10,000 notes in my Evernote account it started to impair my
ability to use it for things other than digital hoarding. Want to look up the
baked tilapia recipe on my phone? Hold on 10 minutes while the headers for new
notes are downloaded...

I like the idea of his archiving system, but is there a cross-platform/device
way to do it? The easiest option would seem to be proxying all your web
traffic through a single Squid proxy configured to archive rather than just
cache.

~~~
vidarh
I've considered the proxy approach. Bear in mind that proxies do not see the
content of SSL connections, though - which may be just what you want, or
absolutely not what you want, depending.
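
If you do go the Squid route, the relevant squid.conf knobs look roughly like
this (directive names are real, values are illustrative; note that stock
Squid is still a cache, not an archive, so objects can be evicted):

    # big on-disk cache, large object limit, and aggressive refresh rules
    cache_dir ufs /var/spool/squid 100000 16 256
    maximum_object_size 200 MB
    refresh_pattern . 525600 100% 525600 override-expire override-lastmod ignore-reload ignore-no-store ignore-private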

The archiving he outlines is _reasonably_ cross-platform - most platforms
have ports of the tools in question, and you "just" need a way to get the
browsing history.

The trickier part is good/proper indexing if you want to be able to do more
than look it up by URL.

------
1angryhacker
The real danger here is that citing webpages creates a significant risk that
the content will be altered in the future.

For example, to re-write a portion of history:

1. Scan Wikipedia for any broken links

2. Purchase the domains

3. Reinstate the links with different content

4. Update the Wikipedia article to reflect the changed content

We need a method of publicly verifying the contents of a link at the time it
was cited.

Could we do something similar to Bitcoin, where we have a publicly available
hash of each site when it is cited?
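
Even the minimal version of this - recording a content hash next to each
citation - is cheap to do today (a sketch; the URL and file name are
placeholders):

    # record a content hash and date alongside each citation so later
    # readers can at least detect silent substitution
    url="http://www.example.org/cited-page"
    echo "$(date +%F) $(curl -sL "$url" | sha256sum | cut -d' ' -f1) $url" >> citations.sha256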

------
fcc3
I used to use [Furl](http://en.wikipedia.org/wiki/Furl). I would periodically
download their zip archive of my URLs, until it became 2GB. I now use
[Zotero](http://www.zotero.org/), which takes a snapshot of the URL I am on
and puts all the files into a folder. I realize that this is proactive work:
taking a snapshot when I find an interesting page, rather than later.

------
zokier
One alternative approach would be to convert/"print" selected web documents
to PDFs for archival. Especially with modern AJAX-heavy (and other
fancy-tech-heavy) sites, I don't know how reliable or easy mirroring is.
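
That approach is scriptable too, e.g. with wkhtmltopdf, which renders pages
with a WebKit engine (a sketch; the URL, delay, and output name are
placeholders):

    # give scripts a few seconds to run, then "print" the page to a PDF
    wkhtmltopdf --javascript-delay 5000 "http://www.example.org/fancy-page" fancy-page.pdf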

~~~
gwern
As I understand it, you're actually much better off not 'printing' to PDF, but
instead, saving to a format like MHT or MAFF, which preserve the DOM more
accurately than a screenshot or PDF.

(I've been using MHT & MAFF a lot over the past few days to archive stuff
relating to Silk Road - 183 files so far! - so it's on my mind.)

~~~
zokier
Yes, there are tradeoffs. But for archival, PDF has one major advantage: it
is an actual, widespread standard that is widely considered suitable for
archival. Fast-forward 50 years: which do you think is more likely, finding a
PDF reader or a MAFF reader?

How do MHT creation tools handle dynamic pages? Another major win for PDFs is
that you are essentially requesting a static version of the content if you
use the "print to PDF" method, so it works nicely for archiving even if some
fidelity is lost.

~~~
gwern
> it is an actual, widespread standard that is widely considered suitable for
> archival.

For archival of webpages? Who recommends that?

> Fast-forward 50 years, which one do you think is more likely: finding a PDF
> reader or a MAFF reader?

If I can't find a MAFF reader, I can unzip the MAFF and deal with the files
directly, as I have already done in automating some edits to some of the SR
MAFFs to remove my username. It was much easier than the last time I wanted
to edit a PDF, when it took me several hours to figure out how to do just one
edit by hand.
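
Since a MAFF is just a ZIP archive, that kind of bulk edit is a short shell
loop - roughly (the username and replacement here are made up):

    # MAFF = ZIP: unpack, rewrite, and repack each archive in place
    for f in *.maff; do
        d=$(mktemp -d)
        unzip -q "$f" -d "$d"
        grep -rlZ 'myusername' "$d" | xargs -0 -r sed -i 's/myusername/anon/g'
        rm "$f"
        (cd "$d" && zip -qr "$OLDPWD/$f" .)
        rm -rf "$d"
    done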

> How do MHT creation tools handle dynamic pages?

They don't, but I haven't seen any 'print to PDF' mechanism which handled
dynamic pages either. So I'm not sure how this is a 'major win' for PDFs
compared to MHT/MAFF.

