The Internet Archive Telethon (archive.org)
177 points by empressplay on Dec 20, 2015 | 35 comments



Please consider donating. IMHO, the Internet Archive is one of the crown jewels of the Internet. It's absolutely incredible what they've achieved, how much information and knowledge they've made accessible, and how important their work is...and it's all free for anybody to enjoy: millions upon millions of hours of content, from books to magazines to movies to music, classic radio, video games, and more than I could even guess at. They're just asking for a little help, and they give so much back.


I agree with you but I wonder why I never get archive.org results when I search Google...


1) Because they block it in their robots.txt file: https://web.archive.org/robots.txt

2) They are creating a search engine for this very purpose.


I do get those for some of their stuff, like books and some movies. If you mean results from the Wayback Machine, those likely aren't indexed due to copyright issues.


When they stop memory-holing content due to robots.txt, maybe. That's all I use them for: old and/or inaccessible webpages.


It should be socially acceptable for Internet Archive to ignore robots.txt.

They have to respect it because we, collectively, say so. Obeying robots.txt is the minimum acceptable behavior for any robot, short of the Asimov laws.

But archiving is different. I've been running into "Site was not archived due to robots.txt" more and more frequently. Often these are articles from ~2011 and earlier which the author no doubt would have wanted to be archived.

Trouble is, robots.txt is also the only thing that people really bother to set up. Maybe there's a way right now to indicate "Sure, archive my site please, and ignore my robots.txt." But if there is, it's not really common knowledge, and it's kind of unreasonable to expect every single website on the internet to opt-in to that.

On the flipside, it seems entirely reasonable that if someone really wants to opt out of archiving, they explicitly go and tell the Internet Archive. Circa 2016, the Internet Archive is the only archive site that seems likely to persist to 2116. It's a shared time capsule, a ship that we all get a free ticket to board. If someone wants off, they can say so.

But right now, large swaths of the internet simply aren't being archived due to rules that don't entirely seem to make sense. There are excellent reasons for robots.txt, but opting out of "Make this content available to my children's children's children's children" seems perhaps beyond the scope of the original spec.

Would you feel ok with the Archive ignoring your robots.txt, or would you feel annoyed? If annoyed, then this is a bad idea and should be rejected.

But if nobody really cares, then here's a proposal: Internet Archive stops checking /robots.txt, and checks for /archive.txt instead. If archive.txt exists, then it's parsed and obeyed as if it were a robots.txt file.

That way, every site can easily opt-out. But everyone opts-in by default. Sites can also exercise control over which portions they want archived, and how often.
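To make that concrete, a hypothetical /archive.txt under this proposal could reuse robots.txt syntax as-is (the file name and its semantics are the proposal's, not anything the Archive actually supports today), something like:

    # /archive.txt -- hypothetical; read only by the Internet Archive.
    # Same syntax as robots.txt. No file at all means "archive everything."
    User-agent: *
    Disallow: /drafts/
    Crawl-delay: 60

Disallow lines would control which portions get archived, and Crawl-delay (a common non-standard robots.txt extension) how often the crawler comes back.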


If example.com allowed indexing in 1999, a new owner of example.com can hide/delete the 1999-2015 content by changing the robots.txt in 2015.

It would be better if archive.org adhered to the robots.txt of the requested date/year (i.e., still show example.com's content from 1999-2014).
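As a purely illustrative sketch of that policy on the display side (how the historical robots.txt gets looked up is assumed, and "ia_archiver" is just a stand-in for the Archive's user-agent token):

    from urllib.robotparser import RobotFileParser

    def may_display(snapshot_url, robots_txt_at_crawl_time):
        # robots_txt_at_crawl_time: the site's robots.txt as it existed
        # when the snapshot was taken (the lookup itself isn't shown).
        parser = RobotFileParser()
        parser.parse(robots_txt_at_crawl_time.splitlines())
        return parser.can_fetch("ia_archiver", snapshot_url)

A later change of ownership (and of robots.txt) then couldn't retroactively hide snapshots that were permitted when they were made.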


The fact that all popular URLs which fall out of registration are now picked up by squatter-spambots is also troubling. An Archive.org entry should not cease to exist when the registration lapses if the squatter-spambots decide to robots.txt everything. That would defeat its purpose completely.


I think the archive.org crawler should respect robots.txt as it looked at the time of the crawl. As a well-behaved robot, archive.org's crawler should fetch and respect robots.txt each time it crawls. However, archive.org should not retroactively delete old content when the current site puts up a robots.txt.

(To answer your other question, the robots.txt standard already allows giving different instructions to different crawlers.)


The situation is a bit more nuanced than that. I had a website on shared hosting, and it was being indexed by archive.org. But years ago (maybe a decade?), their robot was doing something crazy that was overwhelming sites, and the server admin blocked the Internet Archive robots. Even worse, archive.org interpreted the block retroactively and deleted all the archives.

I would have loved for my site to be archived, but I also need my site to perform well. I'm savvy enough to use robots.txt but not to monitor my site's CPU - and I imagine a lot of people with Wordpress or Squarespace sites don't even know about robots.txt. We need to find easy ways for people to control how their sites are archived. (And I don't know how any of this would fit with EU laws like the Right To Be Forgotten.)


The Archive doesn't delete anything; depending on the current robots.txt, they may not show pages from past crawls.

Update the robots.txt and you should be good to go.


Very well said, and I strongly agree. What's worse is that highly legitimate sites that existed for years get domain-parked after shutting down and suddenly become inaccessible. Maybe for sites like that they could make the pre-switchover archives available, but it would probably be too costly staff-wise to look at each one case by case.


robots.txt already lets you specify per-robot behaviour. You can trivially opt-out of crawling, but opt-in to archiving by explicitly allowing archive.org's bot and disallowing all other user agents.
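For example, a robots.txt along these lines would do it (assuming "ia_archiver" is the user-agent token the Archive's crawler honours; check their documentation for the current one):

    # Allow the Internet Archive's crawler, block every other robot.
    User-agent: ia_archiver
    Disallow:

    User-agent: *
    Disallow: /

An empty Disallow means "nothing is disallowed," so the Archive may crawl everything while all other user agents are shut out.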


Sorry a little too drunk to scan your post, but have considered this before.

I think Archive dot org as they said on Science Friday podcast are not legal archive or otherwise final word, just trying to help out with archiving humanity. If I want to delete some old posts for whatever unsupported reason (or if takeover of domain new robots.txt) then that's how it should go.

IMO.


Read your post tomorrow. I guarantee you will laugh. I've been there.


I'm honored to work at the Internet Archive and hope you will consider chipping in to help us continue to champion the import of, and make progress towards, our mission: universal access to all knowledge. If you can't part with a little $$ now, how about uploading some digital media you think are important? The Archive will endeavor to make them available to everyone, forever, for free.


Thank you for your invaluable work! I will donate to support the Internet Archive, but when will it have a proper web archive search engine? That would be a huge thing!


They accept bitcoin, without any hassles or having to provide any information about yourself. Just a simple QR code for the bitcoin address:

http://archive.org/donate/bitcoin.php

If you already have a wallet, it may be the easiest way to donate.


My Firefox bookmark manager contains about 1300 links. For quite a while I've been really unhappy with this solution. Link rot is really bad, and it's sad when even the Internet Archive does not contain the lost site.

The only solution to this problem I've found is storing the pages locally. I'm now in the process of importing everything into OneNote (the OneNote Clipper is a huge help). A big plus is that the content is indexed and fully searchable.

I probably would not do this if the Internet Archive were more reliable. I'm OK with this solution, but it's a bit strange that Firefox/Chrome/IE haven't made the process of storing sites locally easier.


I was toying with the idea of a browser archiving mode for Firefox that would allow something like "save local copies of everything in my history" or "just save local copies of all my bookmarks". I sketched out a few use cases: https://wiki.mozilla.org/Permafrost

Social and technological hurdles for development were big enough that I never pursued it past that. One issue is that the Firefox product inside Mozilla was starting on a trajectory towards becoming increasingly inward-focused, disfavoring ideas that didn't come from within, and following a more corporatized product development approach. So what one would need to do (and what everyone working on new, discrete features ideally should be doing anyway) is start out by building it as an extension, as a proving ground.

The problem there is that, in order for that to work in any sort of reasonable way, changes to Firefox itself would need to be made (APIs reworked) so that the surface exposed to an extension isn't orthogonal to what you need. So in order to develop the thing, you'd need cooperation to get those non-trivial core changes upstream. GOTO 10

AFAIK (I used to be on top of a huge part of what came out of the Mozilla firehose, but I don't do any of that anymore), the whole situation is no better today (and really, from what I understand, much worse). Chromium might be better there, but I really have no idea.


Another option is to use the Wayback Machine's "save page now" functionality.

And for local storage, check out https://webrecorder.io/ for an example implementation.
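If you'd rather script the "save page now" part, requesting web.archive.org/save/<url> asks the Wayback Machine to capture a page; a minimal sketch (the endpoint's exact behaviour isn't documented here and may change):

    import urllib.request

    def save_to_wayback(url):
        # Ask the Wayback Machine to capture the page right now.
        req = urllib.request.Request("https://web.archive.org/save/" + url)
        with urllib.request.urlopen(req) as resp:
            # After redirects, the final URL typically points at the new snapshot.
            return resp.geturl()

    print(save_to_wayback("http://example.com/"))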


In Firefox I recommend this bookmarklet (it saves the current page to Archive.is, the Wayback Machine/Internet Archive, and WebCite): http://that1archive.neocities.org/tools/bookmarklets.html


Google don't want you to use bookmarks, which is why Google Chrome bookmarking sucks.

There's definitely a niche in the market for a bookmarking site that does some form of bulk importing of your bookmarks; has really easy organisation (drag and drop, and an API for power users); has some kind of thumbnail; has some kind of link to the Internet Archive (to read old sites); and maybe a link to IA to store sites. The Internet Archive stuff could be a paid option, with some money going to IA to help fund them.


Do you mean something like Pocket (https://getpocket.com)?


> Google don't want you to use bookmarks, which is why Google Chrome bookmarking sucks.

Why?


Google wants you to search, so they can show you ads in the search results.


Same thing for the history. Chrome's history forces you to view it page by page just to find something from a while back, while Firefox will gladly let you view the history per month. Chrome also makes removing items a huge pain because you have to individually check each item. (I'm aware of the "clear recent history" feature, but that's not what I'm looking for.)


Have you tried Pinboard? Their archiving option might suit your needs.


I donated due to the number of times I've seen the wayback machine links come up in discussions on HN.

A great resource that we can't afford to be without.


I am interested in learning more about their system architecture. Anyone know if any such writeup is available? How do they scale, and what amount of data are they dealing with daily? What about disaster recovery, considering they archive history?


Supported. They are an invaluable resource.


It has been awesome. Thank you so much, Internet Archive, for your hard work and dedication to openness and knowledge.

This is a must watch, incredibly entertaining =)


best telethon ever


IPFS is an attempt to fix link rot by addressing content by its contents, not its location. Very cool project.
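The core idea, stripped of IPFS's actual multihash/CID machinery, is just that the address is derived from the bytes themselves, so the same content has the same address no matter who hosts it. A toy sketch:

    import hashlib

    def content_address(data: bytes) -> str:
        # Toy content addressing: the "address" is a hash of the bytes.
        # (IPFS really uses multihash-encoded CIDs, not bare SHA-256 hex.)
        return hashlib.sha256(data).hexdigest()

    page = b"<html>my old homepage</html>"
    addr = content_address(page)
    # Anyone who still holds bytes hashing to `addr` can serve the page,
    # so the link keeps working regardless of where the content now lives.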


Any other HNs here?




