However, when I built it, I noticed that the Makefile attempts to write new lines into /etc/services, and yes, the program does contain socket/server code, which is apparently triggered by undocumented options.
Personally, I'm a bit leery of file-system-level tools that contain undocumented server code, so I'll not be using it. (Although I might try to audit it, or maybe just try one of the other similar tools listed here: http://en.wikipedia.org/wiki/Freedup)
edit: grepped changelog and todo files, found these:
TODO: 1 - graphical web based user interface (full version with 2.0)
TODO:v1.7 - first working web interface (non-stabil enhancement)
TODO:done + single unmodified web template on GET request
TODO: + non-interactive execution with display to web interface
TODO:done + webpage streaming
TODO: + providing a web interface for the interactive mode
ChangeLog: + gui defaults to off, activate and deactivate using "make webon/weboff/state"
ChangeLog: + basic web interface offered (reply not accepted yet)
ChangeLog: + first helper routines for web-based GUI
For me, permalinking is a much deeper thing than a mere pointer to a resource, because the web is inherently transient and forever shifting. Link rot is what defines the web. Very few people grok the concepts of "evergreen domains" and "citable resources".
A domain must be kept renewed for at least ten years for it to be evergreen, and content must be citable if you want to go down in history for your efforts. How we go about implementing citable resources is tricky.
If you are a Zen Warrior and quite enjoy Sand Mandalas, then you shouldn't be reading this. This article is for those who want to be found on the web for their work in the foreseeable future, and for those who don't want their work attributed to other people. You want authorship of your resource to be correct.
Nobody ever really owns a domain. Much like land, it changes hands among numerous parties before it belongs to any one person, who can then sell it off and buy new land. Everything is borrowed. You own nothing, really. Imagine these scenarios:
- Domain expires for financial reasons; content cannot be found.
- Webmaster dies, so he/she can't renew it indefinitely.
- Your content is simply sold because you got greedy, and you no longer have authorship.
- Domain stays up for 20 years, but the TCP/IP stack is rewritten, and Meshnets now dominate. You fade into darkness. (You are no longer discoverable).
- URLs point to the correct resource, but require proprietary software to view them. Only the intellectual elite can view the content.
- Cached copies exist, but only at the whim of archivists who slurp the web using Historious / Pinboard / Archive.org / Google Cache / reverse proxies / many others... You can't rely on these people or companies to keep your content permanent.
And so I conclude that permalinking is a much deeper concept than a mere URL that points to a resource. It entails a slew of other topics that all center around the age-old philosophy of permanence versus transience. It's not just something bloggers use in WordPress!
Please consider having your archiving scripts/services store their content in WARC format so you can submit bundles to the Internet Archive for integration into the Wayback Machine.
That's how Archive Team's downloads get integrated. The latest versions of wget support storing their downloads in this fashion.
You could even regularly spider your own site and package it up into a WARC for submission.
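As an illustration, a minimal sketch of driving that from Python; it assumes wget 1.14 or newer (the version that introduced --warc-file), and the URL and output prefix are placeholders:

```python
import subprocess

def archive_to_warc(url, warc_prefix):
    """Mirror a site with wget, recording the raw HTTP traffic in a WARC.

    Assumes wget >= 1.14 (the release that added --warc-file); the
    archive lands next to the mirror as <warc_prefix>.warc.gz.
    """
    subprocess.run([
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch CSS/JS/images pages need
        "--warc-file", warc_prefix,  # write everything into a WARC as well
        url,
    ], check=True)

# Hypothetical usage: spider your own site for later submission to the IA.
archive_to_warc("http://example.com/", "example.com-snapshot")
```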
And a Gist for uploading the completed WARCs to the IA using their S3-like service:
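As a rough illustration of what such an upload looks like against the IA's S3-like endpoint at s3.us.archive.org (the item identifier, filename, metadata, and keys below are all placeholders, and the header names are my recollection of the ias3 API rather than gospel):

```python
import requests

# Placeholders: real keys come from https://archive.org/account/s3.php
ACCESS_KEY = "YOUR_IA_ACCESS_KEY"
SECRET_KEY = "YOUR_IA_SECRET_KEY"
item = "example-site-warc"                 # hypothetical item identifier
filename = "example.com-snapshot.warc.gz"

with open(filename, "rb") as f:
    resp = requests.put(
        f"http://s3.us.archive.org/{item}/{filename}",
        data=f,                                    # stream the WARC body
        headers={
            "authorization": f"LOW {ACCESS_KEY}:{SECRET_KEY}",
            "x-archive-auto-make-bucket": "1",     # create the item if absent
            "x-archive-meta-mediatype": "web",     # tag it as a web archive
            "x-archive-meta-title": "example.com WARC snapshot",
        },
    )
resp.raise_for_status()
```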
The final step is to e-mail someone at Archive Team with admin rights to move your IA upload into the proper "Archive Team" bucket, instead of "Community Texts". The awesome Mr. Jason Scott should be able to help you with that.
How long will Peeep keep my data?
Virtually forever. Nevertheless, we retain a right to remove content which has not been accessed for a month.
"- How long my snapshot will be stored? I have in mind the case when it has not been accessed for a long time.
Although the snapshot may be deleted if it violates the rules of the hosting provider (for example, if the page contains pornography or used as the landing page for spam campaigns)."
" - For how long will this website and the archives be available, how many people maintain this project?Thank you.
- Forever. Actually, I think, in 3-5-10 years all the content of the archive (it is only ~20Tb) could fit in a mobile phone memory, so anyone will be able to have a synchronized copy of the full archive. So my “forever” is not a joke.
Two persons, currently."
However, peeep.us has existed since 2009, and a couple of my snapshots from 2010 are still alive (I definitely didn't access them once a month, maybe once a year at most).
Perhaps checking for a recent last-modified date together with a large relative change in size would suffice for this situation.
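A sketch of that heuristic, assuming the server exposes honest Last-Modified and Content-Length headers (many don't, so treat it as a cheap first pass; the thresholds are arbitrary):

```python
import email.utils
import time
import requests

def looks_rewritten(url, cited_size, max_age_days=90, min_change=0.5):
    """Flag a cited link whose target was modified recently AND whose
    size changed by a large relative amount since citation.

    cited_size is the byte count recorded when the link was cited; the
    thresholds are arbitrary placeholders, and both headers can be
    missing or dishonest, so this is only a first-pass filter.
    """
    head = requests.head(url, allow_redirects=True, timeout=30)
    modified = head.headers.get("Last-Modified")
    length = head.headers.get("Content-Length")
    if not modified or not length:
        return False         # not enough metadata; fall back to a full fetch
    parsed = email.utils.parsedate_tz(modified)
    if parsed is None:
        return False
    age_days = (time.time() - email.utils.mktime_tz(parsed)) / 86400
    relative_change = abs(int(length) - cited_size) / max(cited_size, 1)
    return age_days < max_age_days and relative_change > min_change
```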
Much easier for everyone involved to just stick a dash in there.
If I had a Save button, my Start button would have a counterbalance.
I could live without "Save as..." & upload-into-proprietary-backup if my operating system could just remember better.
We need to remember more. If we need to save forever, press Archive / Email / Save.
I need that command.
I like the idea of his archiving system, but is there a cross-platform/device way to do it? The easiest option would seem to be proxying all your web traffic through a single Squid proxy configured to archive rather than just cache.
The archiving he outlines is reasonably cross-platform: most platforms have ports of the tools in question, and you "just" need a way to get at the browsing history.
The trickier part is good/proper indexing if you want to be able to do more than look a page up by URL.
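As a sketch of the history-gathering half for one browser, assuming Firefox's places.sqlite schema (other browsers store history differently, which is exactly the cross-platform pain):

```python
import sqlite3

def recent_history(places_db):
    """Pull recently visited URLs out of Firefox's places.sqlite.

    The moz_places table with url/last_visit_date columns is Firefox's
    schema; every other browser needs its own reader.
    """
    con = sqlite3.connect(places_db)
    try:
        rows = con.execute(
            "SELECT url FROM moz_places "
            "WHERE last_visit_date IS NOT NULL "
            "ORDER BY last_visit_date DESC LIMIT 1000")
        return [url for (url,) in rows]
    finally:
        con.close()

# Each URL could then be handed off to wget or the proxy for archiving.
```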
For example, to rewrite a portion of history:
1. Scan Wikipedia for any broken links
2. Purchase the expired domains
3. Reinstate the links with different content
4. Update the Wikipedia article to reflect the changed content
We need a method of publicly verifying the contents of a link at the time it was cited.
Could we do something similar to Bitcoin, where a publicly available hash of each site is recorded when it is cited?
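A minimal sketch of what the per-citation fingerprint could look like; SHA-256 is my choice here, not something any existing scheme prescribes:

```python
import hashlib
import requests

def citation_hash(url):
    """Fetch a page and return a SHA-256 digest of its body.

    Publishing the digest in an append-only, timestamped place (a
    Bitcoin transaction, or any public log) lets anyone later re-fetch,
    re-hash, and compare to detect tampering.
    """
    body = requests.get(url, timeout=30).content
    return hashlib.sha256(body).hexdigest()
```

The awkward part is dynamic pages: two honest fetches of the same URL can return different bytes, so some canonicalization of the content would be needed before hashing.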
(I've been using MHT & MAFF a lot over the past few days to archive stuff relating to Silk Road - 183 files so far! - so it's on my mind.)
How do MHT creation tools handle dynamic pages? Another major win for PDFs is that you are essentially requesting a static version of the content if you use the "print to PDF" method, so it works nicely for archiving even if some fidelity is lost.
For archival of webpages? Who recommends that?
> Fast-forward 50 years, which one do you think is more likely: finding a PDF reader or a MAFF reader?
If I can't find a MAFF reader, I can unzip the MAFF and deal with the files directly, as I have already done in automating edits to some of the SR MAFFs to remove my username. It was much easier than the last time I wanted to edit a PDF, where it took me several hours to figure out how to do just one edit by hand.
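Since a MAFF file is an ordinary ZIP archive with one folder per saved page, that fallback needs nothing beyond a standard library; a sketch (the filename is a placeholder):

```python
import zipfile

# A MAFF file is an ordinary ZIP archive holding one folder per saved
# page, so the standard library suffices if MAFF-aware software ever
# disappears.
with zipfile.ZipFile("silkroad-page.maff") as maff:
    print(maff.namelist())             # inspect the saved page's structure
    maff.extractall("silkroad-page/")  # index.html etc. end up on disk
```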
> How do MHT creation tools handle dynamic pages?
They don't, but I haven't seen any 'print to PDF' mechanism which handled dynamic pages either. So I'm not sure how this is a 'major win' for PDFs compared to MHT/MAFF.