Warning: His mention of the "freedup" program piqued my interest, so I went to have a look at it. Overall, the description makes it sound like a solid tool, and the documentation seems relatively complete.
However, when I built it, I noticed that the Makefile attempts to write new lines into /etc/services, and yes, the program does contain socket/server code, which is apparently triggered by undocumented options.
Personally, I'm a bit leery of file-system level tools that contain undocumented server code, so I'll not be using it. (Although I might try to audit it, or maybe just try one of the other similar tools listed here: http://en.wikipedia.org/wiki/Freedup )
While I'm not ready to make any judgements of nefariousness, it is worrisome from a general security point of view that, if I read the Makefile correctly, it is configuring its network service to run as root.
edit: grepped changelog and todo files, found these:
TODO: 1 - graphical web based user interface (full version with 2.0)
TODO:v1.7 - first working web interface (non-stabil enhancement)
TODO:done + single unmodified web template on GET request
TODO: + non-interactive execution with display to web interface
TODO:done + webpage streaming
TODO: + providing a web interface for the interactive mode
ChangeLog: + gui defaults to off, activate and deactivate using "make webon/weboff/state"
ChangeLog: + basic web interface offered (reply not accepted yet)
ChangeLog: + first helper routines for web-based GUI
The networking code does not seem to be activated without the -W flag, so I wouldn't be too worried about it. Just remove the entries it tries to create under /etc and you should be fine.
Yeah, I'm sure it's not nefarious. It's just very poor judgement to leave that code in the build without documenting it. So poor, that I don't feel I'm able to trust the rest of the code without auditing it.
I've emailed the listed contact email for freedup with your comment. Perhaps it was some prototype code which never got finished but the author didn't realize how it looked like a potential vulnerability.
For me, permalinking is a much deeper thing than a mere pointer to a resource, because the web is inherently transient, and forever shifting. Link rot is what defines the web. Very few people grok the concept of "evergreen domains", and "citable resources".
A domain must be kept renewed for, at least, ten years for it to be evergreen, and content must be citable if you want to go down in history for your efforts. How we go about implementing citable resources is tricky.
Unless you are a Zen Warrior and quite enjoy Sand Mandalas, then you shouldn't be reading this. This article is for those that want to be found on the web for their work in the foreseeable future, and for those who don't want their work attributed to other people. You want authorship of your resource to be correct.
Nobody ever really owns a domain. Much like land, it trades hands with numerous parties before it belongs to any one person; in which case that person can then sell it off and buy new land. Everything is borrowed. You own nothing, really. Imagine these scenarios:
- Domain expires for financial reasons; content can not be found.
- Webmaster dies; so he/she can't renew it indefinitely.
- Your content is simply sold because you got greedy, and you don't have authorship anymore.
- Domain stays up for 20 years, but the TCP/IP stack is rewritten, and Meshnets now dominate. You fade into darkness. (You are no longer discoverable).
- URLs point to the correct resource, but require proprietary software to view them. Only the intellectual elite can view the content.
- Cached copies exist, but only at the whim of Archivists who slurp the web using Historious / Pinboard / Archive.org / Google Cache / Reverse Proxies / Many Others... You can't rely on these people / companies for keeping your content permanent.
And so I conclude that permalinking is a much deeper concept than a mere URL that points to a resource. It entails a slew of other topics that all center around the age old philosophy of permanence versus transience. It's not just something bloggers use in Wordpress!
Please consider having your archiving scripts/services store their content in WARC format so you can submit bundles to the Internet Archive for integration into the Wayback Machine.
That's how Archive Team's downloads are able to be integrated. The latest versions of wget support storing their downloads in this fashion.
You could even regularly spider your own site and package it up into a WARC for submission.
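For anyone who wants to try the self-spidering idea, here is a minimal sketch of how the wget call could be assembled from a script. The site URL and the WARC base name are placeholders I made up; `--warc-file` and `--warc-cdx` are the WARC options in recent wget releases (1.14 and later), but check your own wget's man page before relying on this.

```python
import shlex


def wget_warc_command(site_url, warc_name):
    """Build a wget invocation that mirrors site_url into <warc_name>.warc.gz."""
    return [
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch the images/CSS pages need
        "--warc-file=" + warc_name,  # record everything fetched into a WARC
        "--warc-cdx",                # also write a CDX index next to the WARC
        site_url,
    ]


if __name__ == "__main__":
    # Placeholder URL and name, not anything from this thread.
    cmd = wget_warc_command("http://www.example.com/", "example-site")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Running that from cron in a scratch directory would give you a fresh WARC per crawl; uploading it is a separate step.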
I don't know how to usefully create WARCs. Wget has a WARC option, but when I tried it out, it created weirdly named files that littered the www tree and looked like the names would collide and files be overwritten. Plus the IA live request should be handling getting webpages into IA.
Final step is to e-mail someone at Archive Team with admin rights to move your IA upload into the proper "Archive Team" bucket, instead of "Community Texts". The awesome Mr. Jason Scott should be able to help you with that.
Great link. I'll add that, although I wonder why the people concerned about it are setting up their own web archiving systems rather than just asking the Internet Archive for an Archive-It account or something.
Those look interesting and I'll mention them, but I'm not sure whether I can support them in my archive-bot code. Neither site seems to offer an easily scripted way to archive URLs; for example, the archive.is blog seems to reject adding such functionality: http://blog.archive.is/post/60948358744/given-that-youre-not...
I've been pretty happy with Pinboard's archiving option, $25/year. It does a better job of saving pages so they render correctly later on than other services I've tried, no complaints yet :) Plus it does in-page searching, which is incomparably superior to just archiving.
Yes, I think archive.is is the better of the two. From the developer's blog:
"- How long my snapshot will be stored? I have in mind the case when it has not been accessed for a long time.
- Forever.
Although the snapshot may be deleted if it violates the rules of the hosting provider (for example, if the page contains pornography or used as the landing page for spam campaigns)."
" - For how long will this website and the archives be available, how many people maintain this project?Thank you.
- Forever. Actually, I think, in 3-5-10 years all the content of the archive (it is only ~20Tb) could fit in a mobile phone memory, so anyone will be able to have a synchronized copy of the full archive. So my “forever” is not a joke."
However, peeep.us has existed since 2009, and a couple of my snapshots from 2010 are still alive (I definitely didn't access them once a month, maybe once a year at most).
I'll update that link. Checking for size changes would be a useful heuristic, but you'd need to load the equivalent URL off the disk or something like that, and my experience has been that links tend to go completely dead rather than half-dead, so I don't sweat it much.
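For what it's worth, the size-change heuristic could look something like the sketch below. This is not code from my archive-bot; the function name and the 50% tolerance are invented for illustration, and it assumes you already have both byte counts in hand.

```python
def looks_half_dead(archived_size, live_size, tolerance=0.5):
    """Flag a link whose live page size differs sharply from the archived copy.

    Sizes are in bytes. `tolerance` is the largest relative change (as a
    fraction of the archived size) that still counts as "the same page".
    """
    if archived_size <= 0:
        raise ValueError("archived_size must be positive")
    relative_change = abs(live_size - archived_size) / archived_size
    return relative_change > tolerance


if __name__ == "__main__":
    print(looks_half_dead(40000, 39000))  # small drift
    print(looks_half_dead(40000, 1200))   # page shrank to a stub
```

The archived size could come from `os.path.getsize` on the local copy and the live size from the length of a freshly fetched body, which is exactly the extra disk lookup mentioned above.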
I'm quite a heavy user of WebCite and I think here would be a good place to remind people that they are in desperate need of funding to keep their service accepting submissions. https://fundrazr.com/campaigns/aQMp7
I actually launched a WebCite kind of service called Svonk back in 2009. Unfortunately it got VERY little usage, like just a couple of hundred hits a month, and it got no external promotion. Here is a tutorial video http://www.youtube.com/watch?v=V9b2Xgi-xLM and the old press release if anyone is interested http://www.prweb.com/releases/meronymy/svonk/prweb2900644.ht... It had a RESTful interface and everything. I guess I could re-launch it, as I still have the source code, if there's any interest in it.
It might also be that browsers have started displaying it as a space - my Firefox does, for example (until you select it - copy-pasting it preserves the %20).
I remember reading that Richard Stallman uses a program that grabs webpages via a remote command, which are then emailed to him. A side effect of this, I suppose, would be that he could have an archive of everything he read and wanted to keep.
He could, but I've never seen him mention preserving the files (as opposed to downloading to /tmp), and given how much he travels and how he sometimes loses/has his laptop stolen, it may not do him much good long-term anyway.
I don't do much writing for the web, but I do a lot of reading. For a while I was using Evernote to save a snapshot of every (worthwhile) page I read, but once I had about 10,000 notes in my Evernote account it started to impair my ability to use it for things other than digital hoarding. Want to look up the baked tilapia recipe on my phone? Hold on 10 minutes while the headers for new notes are downloaded...
I like the idea of his archiving system, but is there a cross platform/device way to do it? The easiest option would seem to be proxying all your web traffic through a single Squid proxy that was configured to archive rather than just cache?
I've considered the proxy approach. Consider that proxies do not see the content of SSL connections though. Which may be just what you want, or absolutely not what you want, depending.
The archiving he outlines is reasonably cross platform - most platforms have ports of the tools in question, and you "just" need a way to get the browsing history.
The trickier part is good/proper indexing if you want to be able to do more than look it up by url.
I used to use [Furl](http://en.wikipedia.org/wiki/Furl). I would periodically download their zip archive of my urls, until it became 2GB.
I now use [Zotero](http://www.zotero.org/) which takes a snapshot of the url I am on and puts all files into a folder.
I realize that this is proactive work: taking a snapshot when I find an interesting page, rather than later.
One alternative approach would be to convert/"print" selected web documents as PDFs for archival. Especially with modern sites heavy on AJAX (and other fancy tech), I don't know how reliable/easy mirroring is.
As I understand it, you're actually much better off not 'printing' to PDF, but instead, saving to a format like MHT or MAFF, which preserve the DOM more accurately than a screenshot or PDF.
(I've been using MHT & MAFF a lot over the past few days to archive stuff relating to Silk Road - 183 files so far! - so it's on my mind.)
Yes, there are tradeoffs. But for archival PDF has one major advantage: it is an actual, widespread standard, widely considered suitable for archival. Fast-forward 50 years, which one do you think is more likely: finding a PDF reader or a MAFF reader?
How do MHT creation tools handle dynamic pages? Another major win for PDFs is that you are essentially requesting a static version of the content if you use the "print to PDF" method, so it works nicely for archiving even if some fidelity is lost.
> it is an actual, widespread standard, widely considered suitable for archival.
For archival of webpages? Who recommends that?
> Fast-forward 50 years, which one do you think is more likely: finding a PDF reader or a MAFF reader?
If I can't find a MAFF reader, I can unzip the MAFF and deal with the files directly, as I have already done in automating edits to some of the SR MAFFs to remove my username. It was much easier than the last time I wanted to edit a PDF, where it took me several hours to figure out how to do just one edit by hand.
> How do MHT creation tools handle dynamic pages?
They don't, but I haven't seen any 'print to PDF' mechanism which handled dynamic pages either. So I'm not sure how this is a 'major win' for PDFs compared to MHT/MAFF.
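To illustrate the "unzip the MAFF and deal with the files directly" point: since a MAFF is an ordinary zip archive, a rough sketch of that kind of automated edit needs nothing beyond Python's standard `zipfile` module. This is not the script I actually used; the function name, the paths, and the choice to touch only `.html`/`.htm` members are assumptions made for the example.

```python
import zipfile


def scrub_maff(src_path, dst_path, old, new):
    """Copy a MAFF (zip) archive, replacing bytes `old` with `new` in HTML files.

    Non-HTML members (images, CSS, metadata) are copied through unchanged.
    Returns the number of occurrences that were replaced.
    """
    replaced = 0
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.lower().endswith((".html", ".htm")):
                replaced += data.count(old)
                data = data.replace(old, new)
            dst.writestr(info.filename, data)
    return replaced


if __name__ == "__main__":
    # Hypothetical file names and username, purely for illustration.
    n = scrub_maff("page.maff", "page-clean.maff", b"myusername", b"REDACTED")
    print("replaced", n, "occurrences")
```

It writes a new archive rather than editing in place, so the original snapshot stays intact if the replacement goes wrong.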
Doing this can be handy with Creative Commons licensed items, so that you have proof of it. I started doing this when I noticed that some people will change the license to something that is non-CC.