

How to save the web - aristus
http://carlos.bueno.org/2008/08/save-web.html

======
MicahWedemeyer
Why focus solely on the text Web? The article mentions archaeological value of
the data, and how trash dumps are more important than great books. On the
Internet, I think that translates to archiving things like World of Warcraft
and Punch the Monkey. It's not highbrow, but it gives a good idea of how we're
spending our time.

~~~
aristus
That's a very good point, but you have to start somewhere. Text is the easiest
thing to preserve and packs a lot of information into a small number of bytes.

------
sh1mmer
I think this is part of the job of Archive.org.

I would also say that it shouldn't be too hard to capture revisions of major
blogs in particular. Most blogs 'ping' Technorati, etc. in order to get updated
in the search index. It would be easy to capture a revision with each ping.
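
To make the ping idea concrete, here is a rough sketch, not anything Technorati or Archive.org actually runs. It assumes some ping receiver hands us a URL; the names `RevisionStore`, `on_ping` and `http_fetch` are made up for illustration. A revision is stored only when the page content has changed since the last ping.

```python
import hashlib
import time
import urllib.request
from typing import Callable, Dict, List, Tuple

# One stored revision: (capture time, SHA-256 of the body, raw body).
Revision = Tuple[float, str, bytes]


def http_fetch(url: str) -> bytes:
    """Plain HTTP GET; swap in anything else that returns the page bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


class RevisionStore:
    """Hypothetical in-memory store keeping each distinct version of a page."""

    def __init__(self, fetch: Callable[[str], bytes] = http_fetch):
        self.fetch = fetch
        self.revisions: Dict[str, List[Revision]] = {}

    def on_ping(self, url: str) -> bool:
        """Call this when a blog pings. Returns True if a new revision was kept."""
        body = self.fetch(url)
        digest = hashlib.sha256(body).hexdigest()
        history = self.revisions.setdefault(url, [])
        if history and history[-1][1] == digest:
            return False  # Same content as the last capture; nothing to store.
        history.append((time.time(), digest, body))
        return True
```

A real service would write to disk and cope with pings for pages that only differ in ads or timestamps, but the core loop is about this small.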

~~~
aristus
It is archive.org's mission, but archiving is a case where you'd want to have
more than one... right? :)

~~~
sh1mmer
I get that, but I'm not sure I see the author's point. Replicating
Archive.org isn't hard; it's pretty standard web spider stuff. It's just
massively resource intensive.
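
By "standard web spider stuff" I mean something along these lines. This is a toy breadth-first crawler using only the Python standard library, not how Archive.org's crawlers work; `crawl`, `LinkParser` and the `fetch` callable are names I made up, and it ignores robots.txt, politeness delays and everything else a real crawler needs.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start: str, fetch: Callable[[str], str], limit: int = 100) -> Dict[str, str]:
    """Breadth-first crawl from `start`, returning {url: html} for fetched pages."""
    seen = {start}
    queue = deque([start])
    archive: Dict[str, str] = {}
    while queue and len(archive) < limit:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # Skip pages that fail to load.
        archive[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute = urldefrag(urljoin(url, href)).url
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive
```

The logic fits on one screen; the expensive part is the bandwidth and storage to run it against the whole web.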

You can try to distribute that, but I suspect it would just end up like most
BitTorrent trackers, where the head of the long tail has lots of support and
the tail has little to no support.

------
gojomo
FYI, I work on web archiving at archive.org.

Our tools for crawling and creating your own Wayback machine, including search
at smaller scales, are open source. See the projects 'Heritrix', 'Wayback',
and 'NutchWAX'. (Though, the bulk of our public archive still comes via
Alexa's closed-source crawling.)

There's a company called Iterasi now offering personal archiving. A company
called HanzoWeb at one point offered a del.icio.us-like bookmarking-plus-
archiving service, and that might return.

CS professor Frank McCown, while a PhD student at Old Dominion, built a tool
called 'Warrick' for reconstructing recently-disappeared websites from a
combination of public sources.

So many options for collection and access are out there -- it's a matter of
organization: building redundant stores and collecting the right materials at
the right time.

~~~
aristus
This is good to hear! The more the merrier, and thank you for your work on
archive.org. I agree that the organization and redundant stores need work, but
the collection and access need help, too.

If something is not in Google Cache or the Wayback, it's effectively gone.
There may be a copy in a LOCKSS server or in a nearby browser cache, but if I
can't know that or am not allowed to read it, it's useless.

