2) They are creating a search engine for this very purpose.
They have to respect it because we, collectively, say so. Obeying robots.txt is the minimum acceptable behavior for any robot, short of Asimov's laws.
But archiving is different. I've been running into "Site was not archived due to robots.txt" more and more frequently. Often these are articles from ~2011 and earlier which the author no doubt would have wanted to be archived.
Trouble is, robots.txt is also the only thing that people really bother to set up. Maybe there's a way right now to indicate "Sure, archive my site please, and ignore my robots.txt." But if there is, it's not really common knowledge, and it's kind of unreasonable to expect every single website on the internet to opt-in to that.
On the flipside, it seems entirely reasonable that if someone really wants to opt out of archiving, they should explicitly go and tell the Internet Archive. Circa 2016, the Internet Archive is the only archive site that seems likely to persist to 2116. It's a shared time capsule, a ship that we all get a free ticket to board. If someone wants off, they can say so.
But right now, large swaths of the internet simply aren't being archived due to rules that don't entirely seem to make sense. There are excellent reasons for robots.txt, but opting out of "Make this content available to my children's children's children's children" seems perhaps beyond the scope of the original spec.
Would you feel ok with the Archive ignoring your robots.txt, or would you feel annoyed? If annoyed, then this is a bad idea and should be rejected.
But if nobody really cares, then here's a proposal: Internet Archive stops checking /robots.txt, and checks for /archive.txt instead. If archive.txt exists, then it's parsed and obeyed as if it were a robots.txt file.
That way, every site can easily opt-out. But everyone opts-in by default. Sites can also exercise control over which portions they want archived, and how often.
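Roughly, the crawler-side check could look like the sketch below. This is a hypothetical illustration, not how the Archive actually works: archive_allowed and the user-agent string are made up, and it assumes archive.txt would reuse robots.txt syntax so Python's standard-library parser can read it.

    # Sketch of the proposed opt-out check: /archive.txt instead of /robots.txt.
    # Hypothetical; assumes archive.txt reuses robots.txt syntax so the
    # standard-library parser can handle it.
    from urllib import request, error
    from urllib.robotparser import RobotFileParser

    ARCHIVE_BOT = "hypothetical-archive-bot"  # placeholder user-agent token

    def archive_allowed(site, page_url):
        try:
            with request.urlopen(site.rstrip("/") + "/archive.txt", timeout=10) as resp:
                lines = resp.read().decode("utf-8", errors="replace").splitlines()
        except error.HTTPError as e:
            if e.code == 404:
                return True   # no archive.txt: everyone opts in by default
            return False      # be conservative on other errors
        except error.URLError:
            return False

        parser = RobotFileParser()
        parser.parse(lines)   # same directives as robots.txt, applied to archiving
        return parser.can_fetch(ARCHIVE_BOT, page_url)

    # Example: archive_allowed("https://example.com", "https://example.com/post/42")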
It would be better if archive.org adhered to the robots.txt that was in effect at the requested date/year (e.g., showing the content of example.com as it stood from 1999-2014).
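You can already look up what a site's robots.txt said at a given time through the Wayback Machine's public availability API; a minimal sketch, assuming its current JSON response shape:

    # Sketch: find the robots.txt snapshot closest to a given date via the
    # Wayback Machine's availability API (assuming its current JSON shape).
    import json
    from urllib import request

    def robots_snapshot(domain, timestamp):
        """timestamp is YYYYMMDD; returns the archived robots.txt URL, if any."""
        api = ("https://archive.org/wayback/available"
               "?url=%s/robots.txt&timestamp=%s" % (domain, timestamp))
        with request.urlopen(api, timeout=10) as resp:
            data = json.load(resp)
        closest = data.get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest and closest.get("available") else None

    # Example: robots_snapshot("example.com", "20060101")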
(To answer your other question, the robots.txt standard already allows giving different instructions to different crawlers.)
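For example, something like this in robots.txt; ia_archiver is the token the Wayback Machine's crawler has historically used, but treat the exact name as an assumption and check before relying on it:

    # Let the archive crawler in while keeping other bots out of /drafts/
    User-agent: ia_archiver
    Disallow:

    User-agent: *
    Disallow: /drafts/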
I would have loved for my site to be archived, but I also need my site to perform well. I'm savvy enough to use robots.txt but not to monitor my site's CPU - and I imagine a lot of people with Wordpress or Squarespace sites don't even know about robots.txt. We need to find easy ways for people to control how their sites are archived. (And I don't know how any of this would fit with EU laws like the Right To Be Forgotten.)
Update the robots.txt and you should be good to go.
As they said on the Science Friday podcast, I think Archive.org is not a legal archive or otherwise the final word; they're just trying to help out with archiving humanity. If I want to delete some old posts for whatever unsupported reason (or if a new robots.txt comes with a takeover of the domain), then that's how it should go.
If you already have a wallet, it may be the easiest way to donate.
The only solution I've found to this problem is storing the links locally. I'm now in the process of importing everything into OneNote (the OneNote clipper is a huge help). A big plus is that the content is indexed and fully searchable.
I probably would not do this if the Internet Archive were more reliable. I'm OK with this solution, but it's a bit strange that Firefox/Chrome/IE haven't made this process of storing sites locally easier.
The social and technological hurdles for development were big enough that I never pursued it past that. One issue is that the Firefox product inside Mozilla was starting on a trajectory towards becoming increasingly inward-focused, disfavoring ideas that didn't come from within, and following a more corporatized product-development approach. So what one would need to do (and ideally everyone working on new, discrete features should be doing anyway) is start out by building it as an extension, to work as a proving ground.
The problem there is, in order for that to work in any sort of reasonable way, changes to Firefox itself would need to be made (APIs reworked) so that the surface exposed to you there is not one that's running orthogonal to what you need. So in order to develop the thing, you'd need to get some cooperation to get these non-trivial core changes upstream. GOTO 10
AFAIK (I used to be on top of a huge part of what came out of the Mozilla firehose, but I don't do any of that anymore), the whole situation is no better today (and really, from what I understand, much worse). Chromium might be better there, but I really have no idea.
And for local storage, check out https://webrecorder.io/ for an example implementation.
There's definitely a niche in the market for a bookmarking site that does some form of bulk import of your bookmarks; has really easy organisation (drag and drop, and an API for power users); has some kind of thumbnail; has some kind of link to the Internet Archive (to read old sites); and maybe a link to IA to store sites. The Internet Archive stuff could be a paid option, with some of the money going to IA to help fund them.
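The IA side of that could be quite thin. A rough sketch, assuming the public "Save Page Now" endpoint at web.archive.org/save/ still accepts a plain GET for the URL you want captured (a real service would need rate limiting and proper error handling):

    # Sketch: ask the Wayback Machine to capture a bookmarked page.
    # Assumes the public "Save Page Now" endpoint accepts a simple GET.
    from urllib import request

    def save_to_wayback(url):
        req = request.Request("https://web.archive.org/save/" + url,
                              headers={"User-Agent": "bookmark-sync-sketch"})
        with request.urlopen(req, timeout=60) as resp:
            return resp.geturl()  # final URL after the capture redirect lands

    # Example: save_to_wayback("https://example.com/article")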
A great resource that we can't afford to be without.
This is a must-watch, incredibly entertaining =)