Archiving URLs (gwern.net)
191 points by gwern on Oct 6, 2013 | 46 comments

Warning: His mention of the "freedup" program piqued my interest, so I went to have a look at it. Overall, the description makes it sound like a solid tool, and the documentation seems relatively complete.

However, when I built it, I noticed that the Makefile attempts to write new lines into /etc/services, and yes, the program does contain socket/server code, which is apparently triggered by undocumented options.

Personally, I'm a bit leery of file-system level tools that contain undocumented server code, so I'll not be using it. (Although I might try to audit it, or maybe just try one of the other similar tools listed here: http://en.wikipedia.org/wiki/Freedup )

While I'm not ready to make any judgements of nefariousness, it is worrisome from a general security point of view that, if I read the Makefile correctly, it configures its network service to run as root.

edit: grepped changelog and todo files, found these:

    TODO:  1  - graphical web based user interface (full version with 2.0)
    TODO:v1.7 - first working web interface (non-stabil enhancement)
    TODO:done	+ single unmodified web template on GET request
    TODO:	+ non-interactive execution with display to web interface
    TODO:done	+ webpage streaming
    TODO:	+ providing a web interface for the interactive mode
    ChangeLog:  + gui defaults to off, activate and deactivate using "make webon/weboff/state"
    ChangeLog:  + basic web interface offered (reply not accepted yet)
    ChangeLog:  + first helper routines for web-based GUI
The networking code does not seem to be activated without the -W flag, so I wouldn't be too worried about it. Just remove the entries it tries to create under /etc and you should be fine.

Yeah, I'm sure it's not nefarious. It's just very poor judgement to leave that code in the build without documenting it. So poor that I don't feel able to trust the rest of the code without auditing it.

I've emailed the listed contact address for freedup with your comment. Perhaps it was prototype code that never got finished, and the author didn't realize it looks like a potential vulnerability.

From: https://thoughtstreams.io/higgins/permalinking-vs-transience...

For me, permalinking is a much deeper thing than a mere pointer to a resource, because the web is inherently transient, and forever shifting. Link rot is what defines the web. Very few people grok the concept of "evergreen domains", and "citable resources".

A domain must be kept renewed for at least ten years for it to be evergreen, and content must be citable if you want to go down in history for your efforts. How we go about implementing citable resources is tricky.

If you are a Zen Warrior and quite enjoy Sand Mandalas, then you shouldn't be reading this. This article is for those that want to be found on the web for their work in the foreseeable future, and for those who don't want their work attributed to other people. You want authorship of your resource to be correct.

Nobody ever really owns a domain. Much like land, it trades hands with numerous parties before it belongs to any one person; in which case that person can then sell it off and buy new land. Everything is borrowed. You own nothing, really. Imagine these scenarios:

- Domain expires for financial reasons; content cannot be found.

- Webmaster dies; so he/she can't renew it indefinitely.

- Your content is simply sold because you got greedy, and you don't have authorship anymore.

- Domain stays up for 20 years, but the TCP/IP stack is rewritten, and Meshnets now dominate. You fade into darkness. (You are no longer discoverable).

- URLs point to the correct resource, but require proprietary software to view them. Only the intellectual elite can view the content.

- Cached copies exist, but only at the whim of Archivists who slurp the web using Historious / Pinboard / Archive.org / Google Cache / Reverse Proxies / Many Others... You can't rely on these people / companies for keeping your content permanent.

And so I conclude that permalinking is a much deeper concept than a mere URL that points to a resource. It entails a slew of other topics that all center around the age old philosophy of permanence versus transience. It's not just something bloggers use in Wordpress!

Local archives are great, but they help only you.

Please consider having your archiving scripts/services store their content in WARC format so you can submit bundles to the Internet Archive for integration into the Wayback Machine.

That's how Archive Team's downloads are able to be integrated. The latest versions of wget support storing their downloads in this fashion.

You could even regularly spider your own site and package it up into a WARC for submission.
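The spider-and-package step above can be sketched with wget's documented WARC support; the site URL and output name below are placeholders, and actually running the command requires wget >= 1.14 on your PATH:

```python
def warc_wget_argv(url, warc_name):
    """Build a wget command that mirrors `url` and writes a WARC
    alongside the normal download tree (wget >= 1.14)."""
    return [
        "wget",
        "--mirror",                   # recursive spider of the site
        "--page-requisites",          # also grab images/CSS needed to render
        "--warc-file=" + warc_name,   # produces warc_name.warc.gz
        url,
    ]

argv = warc_wget_argv("http://www.gwern.net/", "gwern-net-snapshot")
# import subprocess; subprocess.run(argv, check=True)  # uncomment to spider
```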

I don't know how to usefully create WARCs. Wget has a WARC option, but when I tried it out, it created weirdly-named files that littered the www tree, and it looked like the names would collide and files would be overwritten. Plus, the IA's live-request feature should handle getting webpages into the IA.

Here, have a Gist for creating proper WARCs, ready for the Internet Archive:


And a Gist for uploading the completed WARCs to the IA using their S3-like service:


The final step is to e-mail someone at Archive Team with admin rights to move your IA upload into the proper "Archive Team" bucket, instead of "Community Texts". The awesome Mr. Jason Scott should be able to help you with that.
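The upload step can be sketched against the IA's S3-like API, which accepts an HTTP PUT with a `LOW access:secret` authorization header and `x-archive-*` metadata headers; the item name, file name, and keys below are placeholders:

```python
import urllib.request

def ia_upload_request(item, filename, access_key, secret_key, data):
    """Build (but don't send) a PUT request for archive.org's S3-like API.
    Item/file names and keys are placeholders; check the IA S3 docs
    before relying on the exact header set."""
    return urllib.request.Request(
        url=f"https://s3.us.archive.org/{item}/{filename}",
        data=data,
        method="PUT",
        headers={
            "authorization": f"LOW {access_key}:{secret_key}",
            "x-archive-meta-mediatype": "web",
            "x-archive-auto-make-bucket": "1",   # create the item if missing
        },
    )

req = ia_upload_request("my-warc-item", "site-20131006.warc.gz",
                        "ACCESSKEY", "SECRETKEY", b"...warc bytes...")
# urllib.request.urlopen(req)  # uncomment with real keys to upload
```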

This has serious implications for our entire legal system. http://www.pogo.org/blog/2013/09/the-supreme-court-has-a-ser...

Great link. I'll add that, although I wonder why the people concerned about it are setting up their own web archiving systems rather than just asking the Internet Archive for an Archive-It account or something.

See also the work of urlteam [1,2].

[1] http://urlte.am/

[2] http://archiveteam.org/?title=URLTeam

My VM to help the Archive Team is not working anymore. Do you have a fresh link/VM to download?

Do you mean the ArchiveTeam Warrior? http://www.archiveteam.org/index.php?title=Warrior

Yep, it's not working on 2 different computers that I've tried (no item received after 30 seconds).

I'm not part of the team, so I don't know. I don't know if some of them hang out on HN either, so you probably should contact them directly.

A couple of "snapshot" services which weren't mentioned: archive.is and peeep.us.

Those look interesting and I'll mention them, but I'm not sure whether I can support them in my archive-bot code. Neither site seems to offer an easily scripted way to archive URLs; for example, the archive.is blog seems to reject adding such functionality: http://blog.archive.is/post/60948358744/given-that-youre-not...

Yes, it seems batch downloading won't be implemented, but batch archiving is possible:
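A minimal sketch of such a submission in Python; the /submit/ endpoint and the "url" form field are assumptions based on the site's public submit form, not a documented API:

```python
import urllib.parse
import urllib.request

def archive_is_request(target_url):
    """Build a POST to archive.is's submission endpoint.
    Endpoint and form-field name are assumptions; verify against
    the live submit form before batch use."""
    data = urllib.parse.urlencode({"url": target_url}).encode()
    return urllib.request.Request("https://archive.is/submit/", data=data)

req = archive_is_request("http://www.gwern.net/Archiving%20URLs")
# urllib.request.urlopen(req)  # uncomment to actually submit
```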


Sweet. I've added archive.is to my archive bot.

I've been pretty happy with Pinboard's archiving option, $25/year. It does a better job of saving pages so they render correctly later on than other services I've tried, no complaints yet :) Plus it does in-page searching, which is incomparably superior to just archiving.

I don't think peeep is a viable option:

"How long will Peeep keep my data? Virtually forever. Nevertheless, we retain a right to remove content which has not been accessed for a month."

Yes, I think archive.is is the better of the two. From the developer's blog:

"- How long my snapshot will be stored? I have in mind the case when it has not been accessed for a long time.

- Forever.

Although the snapshot may be deleted if it violates the rules of the hosting provider (for example, if the page contains pornography or used as the landing page for spam campaigns)."

" - For how long will this website and the archives be available, how many people maintain this project? Thank you.

- Forever. Actually, I think, in 3-5-10 years all the content of the archive (it is only ~20Tb) could fit in a mobile phone memory, so anyone will be able to have a synchronized copy of the full archive. So my “forever” is not a joke.

Two persons, currently."


However, peeep.us has existed since 2009, and a couple of my snapshots from 2010 are still alive (I definitely didn't access them once a month; maybe once a year at most).

Whilst the link to linkchecker @ sourceforge still works, they have moved the project to GitHub.

Perhaps checking for a recent last-modified date together with a large relative change in size would suffice for this situation.

I'll update that link. Checking for size changes would be a useful heuristic, but you'd need to load the equivalent URL off the disk or something like that, and my experience has been that links tend to go completely dead rather than half-dead, so I don't sweat it much.
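The size-change heuristic could look something like this; the thresholds are arbitrary placeholders:

```python
def looks_half_dead(old_size, new_size, days_since_modified,
                    max_age_days=30, max_relative_change=0.5):
    """Flag a link whose content changed recently AND whose size moved a
    lot relative to the archived copy -- a hint it may now be a parked
    page or rewritten content rather than the cited resource."""
    if old_size == 0:
        return True  # nothing to compare against; treat as suspect
    relative_change = abs(new_size - old_size) / old_size
    recently_modified = days_since_modified <= max_age_days
    return recently_modified and relative_change > max_relative_change

# A page that recently shrank to a tenth of its archived size is flagged;
# a small drift on a recently-touched page is not.
print(looks_half_dead(50_000, 5_000, days_since_modified=3))   # True
print(looks_half_dead(50_000, 49_000, days_since_modified=3))  # False
```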

I'm quite a heavy user of WebCite and I think here would be a good place to remind people that they are in desperate need of funding to keep their service accepting submissions. https://fundrazr.com/campaigns/aQMp7

I actually launched a WebCite-like service called Svonk back in 2009; unfortunately it got VERY little usage - just a couple of hundred hits a month - and no external promotion. Here is a tutorial video http://www.youtube.com/watch?v=V9b2Xgi-xLM and the old press release, if anyone is interested: http://www.prweb.com/releases/meronymy/svonk/prweb2900644.ht... It had a RESTful interface and everything. I guess I could re-launch it, as I still have the source code, if there's interest.

Replied in OP.

Tip #1: Don't add spaces in your URLs

What is wrong with spaces in URLs?

You need an (implicit or explicit) URL encoding for it to work. The actual link to the linked article is [1], which is quite ugly, at least imho.

Much easier for everyone involved to just stick a dash in there.

[1] http://www.gwern.net/Archiving%20URLs
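For the curious, the percent-encoding in question, in Python:

```python
from urllib.parse import quote, unquote

# A space is not legal in a URL path, so it must be percent-encoded:
path = quote("Archiving URLs")
print(path)           # Archiving%20URLs
print(unquote(path))  # Archiving URLs

# A dash needs no encoding at all, which is why slugs prefer it:
print(quote("Archiving-URLs"))  # Archiving-URLs
```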

I think somewhere along the way my brain started processing %20 as a space, because I don't even notice anymore.

It might also be that browsers have started displaying it as a space - my Firefox does, for example (until you select it - copy-pasting it preserves the %20).

I remember reading that Richard Stallman uses a program that grabs webpages via a remote command, which are then emailed to him. A side effect of this, I suppose, would be that he could have an archive of everything he read and wanted to keep.

He usually mentions that this is for personal reasons that only apply to himself. No tinfoil hat here (at least not openly). http://en.wikiquote.org/wiki/Richard_Stallman#On_web_browsin...

I would love it, to remember too.

If I had a Save button, my Start button would have a balance.

I could live without Save as.. & upload-into-proprietary-backup if my operating system could just remember better.

We need to remember more. If we need to save forever press Archive / Email / Save.

I need that command.

He could, but I've never seen him mention preserving the files (as opposed to downloading to /tmp), and given how much he travels and how he sometimes loses/has his laptop stolen, it may not do him much good long-term anyway.

I don't do much writing for the web, but I do a lot of reading. For a while I was using Evernote to save a snapshot of every (worthwhile) page I read, but once I had about 10,000 notes in my Evernote account, it started to impair my ability to use it for things other than digital hoarding. Want to look up the baked tilapia recipe on my phone? Hold on 10 minutes while the headers for new notes are downloaded...

I like the idea of his archiving system, but is there a cross-platform/device way to do it? The easiest option would seem to be proxying all your web traffic through a single Squid proxy configured to archive rather than just cache.

I've considered the proxy approach. Consider, though, that proxies do not see the content of SSL connections, which may be just what you want, or absolutely not what you want, depending.

The archiving he outlines is reasonably cross platform - most platforms have ports of the tools in question, and you "just" need a way to get the browsing history.

The trickier part is good/proper indexing if you want to be able to do more than look it up by url.
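Getting the browsing history out of, say, Firefox is one SQLite query; the table and column names below follow Firefox's places.sqlite schema, and the copy-first step is needed because a running browser holds a lock on the file:

```python
import shutil
import sqlite3
import tempfile

def firefox_history_urls(places_sqlite_path):
    """Return the distinct URLs recorded in a Firefox places.sqlite.
    Works on a temporary copy, since a running Firefox locks the
    original database file."""
    with tempfile.NamedTemporaryFile(suffix=".sqlite") as tmp:
        shutil.copyfile(places_sqlite_path, tmp.name)
        con = sqlite3.connect(tmp.name)
        try:
            rows = con.execute("SELECT DISTINCT url FROM moz_places")
            return [url for (url,) in rows]
        finally:
            con.close()

# e.g. feed the result into your archiver:
# for url in firefox_history_urls("~/.mozilla/firefox/xyz.default/places.sqlite"):
#     archive(url)
```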

The real danger here is that citing webpages creates a significant risk that the content will be altered in the future.

For example, to re-write a portion of history:

1. Scan Wikipedia for any broken links

2. Purchase domains

3. Reinstate links with different content

4. Update the Wikipedia article to reflect the changed content

We need a method of publicly verifying the contents of a link at the time it was cited.

Could we do something similar to Bitcoin, where we have a publicly available hash of each site at the time it is cited?
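At its simplest, the verification piece is just a cryptographic digest of the page body, published alongside the citation; a sketch:

```python
import hashlib

def citation_digest(page_bytes):
    """SHA-256 digest of a page body at citation time. Publishing the
    hash (in the citing article, or in some public append-only log)
    lets anyone later verify that a mirror matches what was cited."""
    return hashlib.sha256(page_bytes).hexdigest()

snapshot = b"<html><body>The cited content</body></html>"
print(citation_digest(snapshot))  # 64 hex characters; changes if one byte does
```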

I used to use Furl (http://en.wikipedia.org/wiki/Furl). I would periodically download their zip archive of my URLs, until it became 2GB. I now use Zotero (http://www.zotero.org/), which takes a snapshot of the URL I am on and puts all files into a folder. I realize that this is proactive work: taking a snapshot when I find an interesting page, rather than later.

One alternative approach would be to convert/"print" selected web documents to PDF for archival. Especially with modern AJAX-heavy (and other fancy-tech) sites, I don't know how reliable/easy mirroring is.

As I understand it, you're actually much better off not 'printing' to PDF, but instead, saving to a format like MHT or MAFF, which preserve the DOM more accurately than a screenshot or PDF.

(I've been using MHT & MAFF a lot over the past few days to archive stuff relating to Silk Road - 183 files so far! - so it's on my mind.)

Yes, there are tradeoffs. But for archival, PDF has one major advantage: it is an actual, widespread standard widely considered suitable for archival. Fast-forward 50 years: which one do you think is more likely, finding a PDF reader or a MAFF reader?

How do MHT creation tools handle dynamic pages? Another major win for PDFs is that you are essentially requesting a static version of the content if you use the "print to PDF" method, so it works nicely for archiving even if some fidelity is lost.

> it is an actual, widespread standard widely considered suitable for archival.

For archival of webpages? Who recommends that?

> Fast-forward 50 years, which one do you think is more likely: finding a PDF reader or a MAFF reader?

If I can't find a MAFF reader, I can unzip the MAFF and deal with the files directly, as I have already done in automating some edits to some of the SR MAFFs to remove my username. It was much easier than the last time I wanted to edit a PDF, where it took me several hours to figure out how to do just one edit by hand.

> How do MHT creation tools handle dynamic pages?

They don't, but I haven't seen any 'print to PDF' mechanism which handled dynamic pages either. So I'm not sure how this is a 'major win' for PDFs compared to MHT/MAFF.

Doing this can be handy with Creative Commons-licensed items, so that you have proof of the original license. I started doing this when I noticed that some people will change the license to something non-CC.

