The team at the Internet Archive, responsible for the Wayback Machine, ArchiveTeam, the TV News Archive, and many other projects, are true gems. I had the chance to meet Brewster and many others whilst last in San Francisco, and their passion is infectious.
If anyone is interested, they have an open lunch on Fridays[1] where you get to see the church, the tech, and meet the team. Each team member and guest gives a few sentences about who they are and what they do, which really gives you a feel for how much work is going on under the covers.
I feel humanity will look back on this period -- lost diskettes, CDs, DVDs, game consoles, Betamax, VHS -- and lament the comparative black hole of information. Myspace history deleted with no warning, Justin.tv deleting history with only 8 days' warning[2], ... and a million more examples. The Internet Archive are fighting that one bit at a time.
A lot of people here are confusing the Archive Team with the Internet Archive. They are not the same. IA is more polite: they always respect robots.txt and will sometimes remove data if you ask politely. AT are self-described "rogue archivists" and their motto is "we are going to rescue your shit".
I love the Wayback Machine. I wish they'd archive all pages, though, even those that don't wish to be archived, keeping them away from public view until the copyrights expire someday.
I don't know about this. To me, archiving everything seems like a gross inefficiency. Most of the internet is spam and advertising, and of the rest, less than 5% is actually useful information or knowledge.
Archiving books, scientific journals and the like would seem much more useful, but obviously you'd run into copyright issues.
Agree that the highest priority should go to the "serious" stuff. However, the most interesting part of a really old magazine or newspaper, for me, is the advertising. For example, an early 80s computer ad, or a 50s railroad or airline ad. I find that stuff really fascinating, and it gives more of the flavor of the era. It might have a surprising amount of value to a historian or anthropologist.
The trick is, of course, that it's nearly impossible to predict what will be useful to someone ahead of time. While you can probably sort out some of the spam, a comprehensive archiving project should probably avoid false positives when throwing things away.
Seems like a hard problem to solve. The low-hanging fruit would probably be detecting duplicates and combining them, which loses redundancy but handles all of those identical landing pages.
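A minimal sketch of that duplicate-detection idea, assuming exact byte-for-byte duplicates and an in-memory store. The `dedupe_pages` helper and the sample URLs are hypothetical, and this says nothing about how the Internet Archive actually stores pages:

```python
import hashlib
from collections import defaultdict


def dedupe_pages(pages):
    """Store each distinct page body once and remember every URL that served it.

    `pages` is an iterable of (url, body_bytes) pairs (hypothetical input shape).
    """
    bodies = {}                          # digest -> body, kept only once
    urls_by_digest = defaultdict(list)   # digest -> all URLs with that body
    for url, body in pages:
        digest = hashlib.sha256(body).hexdigest()
        bodies.setdefault(digest, body)
        urls_by_digest[digest].append(url)
    return bodies, dict(urls_by_digest)


# Two identical parked-domain landing pages and one real page.
sample = [
    ("http://parked-one.example/", b"<html>This domain is for sale</html>"),
    ("http://parked-two.example/", b"<html>This domain is for sale</html>"),
    ("http://blog.example/post", b"<html>An actual article</html>"),
]
bodies, urls_by_digest = dedupe_pages(sample)
```

Keeping the URL list per digest means nothing about where a page appeared is thrown away, only the repeated bytes. Near-duplicates (same template, different timestamp) would need something fuzzier than an exact hash.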
Quite coincidentally, I was just now reading an interview with Brewster Kahle, from New Scientist (23 November 2002) - back when the Wayback Machine had only 100 terabytes archived.
He said: "I guarantee that in the future researchers will curse us for having missed something absolutely critical. But only people using the archive can tell us about mistakes in what we collect. There is a cheaper alternative concept, called 'dark archiving', which means that we should not give people access to them. But preservation without access is dangerous - there's no way of reviewing what's in there."
But later on, he mentioned that: "AltaVista was the first Internet search engine that tried to be a complete index of all the pages. But what really got me was that they threw away the original pages. That grated, no end."
Aside: Kahle was one of the founders, with Danny Hillis, of Thinking Machines - the company that created the fabulous 'Connection Machine'.
The Wayback Machine is so essential to the nature of the internet that I wish it could be made into some sort of automatic, decentralized service that's part of the web itself. Imagine how much linked knowledge would permanently, irreparably disappear if this one company went out of business!
I love what the Archive Team does. I used their VM when the Posterous backup effort was happening last year, and today I sent a link to my friend's now defunct Posterous blog.
But I know that the owner of said Posterous page had no intention of keeping the page a going concern, and was happy to see it disposed of. In light of the recent Google "right to be forgotten" ruling, will there come a day when the right to be forgotten will extend to archive.org?
>will there come a day when the right to be forgotten will extend to archive.org?
Sites can at any time opt out of being archived via a robots.txt exclusion (IA still keeps their previous archives privately). However, for public blogging sites operated by a third party, that's another matter.
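For what it's worth, here's roughly how a crawler that respects robots.txt would check such an exclusion, using Python's standard `urllib.robotparser`. The `ia_archiver` user-agent token, the `may_archive` helper and the example.com URL are assumptions for illustration; check IA's own documentation for what their crawler actually honors.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that excludes one crawler (assumed token) from the whole site.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def may_archive(robots_txt, agent, url):
    """Return True if `agent` is allowed to fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


excluded = may_archive(ROBOTS_TXT, "ia_archiver", "http://example.com/post")
other = may_archive(ROBOTS_TXT, "otherbot", "http://example.com/post")
```

Note this only covers the site owner's side. An individual author on a hosted blogging platform usually can't edit the platform's robots.txt, which is the "another matter" case.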
The "right to be forgotten" ruling was new because it talked about _linking to_ stuff. But Archive Team and the Internet Archive both make actual copies, so they are covered by copyright. If your friend wants his Posterous posts gone, he can serve a DMCA notice on them.
All this great content, and their website is designed in a way that discourages people from perusing it. They really need a re-design and "relaunch" of their brand to flaunt the great things that they're doing.
Unlike most of the web 2.0 world, there is substantially more value in their content than in their design wizardry. It's a team with limited resources that can barely keep up with the information they archive and is doing impressive work, not at all devalued by the absence of some precious yetanothercrap.js.
It's true their site could be more navigable but they're doing the lord's work, so I let them off the hook.
Contacted the Archive Team about a few sites that would otherwise be lost, and they archived them; the archives will eventually be uploaded to IA progressively. Great folk there as well.
For instance, this broken piece of shit was my attempt to do something about the LMA interface -- http://www.archive-ui.org/#/. (select a show, reload the page, then it'll work. got busy and lost interest).
The interface is as bad as you say, but at least they give us the ability to do something about it ourselves.
Among many, many other differences, assuming your angle isn't to denigrate the Archive team, one of the important differences is that TPB makes no guarantees about content availability. Information that you can find through TPB (they host magnet links, not content; magnet links can be used to find others who host content) is only available as long as those who are interested in the content are interested in hosting it. Conversely, the Archive people seek to ensure that content remains available even after everyone else seemingly loses interest in it.
It's something like the difference between your local used book store, and the Library of Congress. Or maybe the difference between the display cases of your local natural history museum, and the basement of the Smithsonian.
That and they primarily archive public domain material and abandonware (apart from their web archiving project). They really couldn't be more different.
I think simply disregarding the web archiving is a bit of a cop-out. It's interesting though that for the most part, nobody minds them redistributing loads of copyrighted material. Here are some reasons that come to mind:
The web material was distributed for free in the first place. They're redistributing ad-supported material, not stuff behind a paywall. (The same can be said of some TV shows, and indeed I think TV show piracy is often met with a comparatively cavalier attitude.)
It's used as a measure of last resort. If I want to read an article from Wired, I'm going to try to find it on Wired -- or more likely, I'm going to Google it and get a link to Wired, and not the archive. It's only when it's unavailable from the original publisher or when I have specific historic interest that I end up using the web archive. The result is that publishers aren't denied their ad revenues as long as they host their material. Your abandonware argument translates neatly to the web archiving efforts.
They're archiving. This gives them a touch of academia and altruism that casts them in a totally different light.
None of these things in isolation necessarily makes what the IA does entirely legit under current copyright law; they effectively operate in something of a legal grey area. But add it all together and not many people are going to get upset--especially given that they'll remove material if asked to do so.
[1]: https://news.ycombinator.com/item?id=7826313
[2]: https://twitter.com/internetarchive/status/31565557291982848...