That just seems wrong.
Fair use is not the only exception to copyright. US copyright law has a separate section on exceptions for libraries and archives.
I believe 17 U.S.C. § 108 is relevant here. It starts:
it is not an infringement of copyright for a library
or archives, or any of its employees acting within the
scope of their employment, to reproduce no more than
one copy or phonorecord of a work[...]
Nice point about the lack of implied permission in copyright. It makes me think robots.txt probably doesn't have any meaning one way or the other legally, but is just a community thing.
It's more than a theoretical point - that each "serving" of a file is a copy is well established legally. In fact, even loading a program to RAM was considered a copy, per MAI Systems Corp. v. Peak Computer, until Congress made an explicit exception.
It absolutely could be/would be argued. Otherwise an arbitrary library or archive--oh, let's give it a name like Google Books--would have the right to make digital copies of physical books available to the public. Obviously Google tried to do this and (although the case was/is complicated) they weren't allowed to do this unconditionally.
ADDED: Or, heck, any site could declare themselves an archive and offer up ripped CDs to the public.
The ACLU and IA are two different entities; donating to one does nothing to help the other.
> I believe 17 U.S.C. § 108 is relevant here.
Yes, it is.
> There's obviously more to it that I haven't done research on
Glad we got that out of the way.
> but that's a pretty good start and I wouldn't worry too much about lawsuits.
Well, since you're not operating the archive, it isn't you who should be worried. And given that 'there is more to it that you haven't done research on', it is probably fair to say that such a lack of worry is a bit premature.
> In fact, if they were at risk of lawsuits, I don't see why respecting robots.txt would stop them–there's no "but you didn't tell me not to" excuse in copyright.
Because it shows effort on their side to not collect when copyright holders make a minimum effort to warn outside parties not to collect their data.
In the eyes of a judge - or a half decent lawyer - that will go a long way towards establishing that the archive made an effort to stay on the right side of the line.
Law is interpreted, the fact that there is no such provision in copyright law doesn't mean that a judge isn't able to look past the letter and establish intent. If you are clearly in violation and refuse to do even the minimum in order to avoid such violations then judges tend to be pretty strict, in other words, they'll throw the book at you. But if you can demonstrate that you did what you could and that the plaintiff did not make even a minimum effort to warn others that archival storage or crawling is not desired then their case suddenly is a lot weaker.
See also: DMCA and various lawsuits in lots of different locations, the internet is far larger than just the USA and there are a number of interesting cases around this subject in other countries, some of those had outcomes that were quite surprising (at least, to non-lawyers).
I've copied Geocities.com when it went down and have had quite a bit of discussion with IP lawyers on the subject. So far I've been able to avoid being sued by responding timely to requests by rights holders. But that doesn't mean they would not have standing to sue me and if they do I might even lose.
This is not at all a settled area of the law and if you feel that the Internet Archive is in the right here no matter what then you could of course offer to indemnify them from any damage claims.
Couldn't they move operations to a non-sketchy country? IIRC they anticipated the need for such a move due to Trump and now have a backup ready in a different country.
Example: two months before the movie "The Social Network" was released to theaters in 2010, Facebook decided to add a robots.txt to Facebook.com. Archive.org immediately deleted/disabled access to its archive of how the Facebook start page looked in 2004-2010.
BTW, the correct approach would be to re-enable archive access to Facebook.com for the 2004-2010 time frame. The book "The Accidental Billionaires: The Founding of Facebook" and the film "The Social Network" based on it of course relied partly on Archive.org, alongside various other research methods, to get the facts.
For future domain-owners this is likely far too much control, but maybe that could be mitigated if IA tracks DNS/whois/registration info too
Too bad they already lost loads of internet content that way.
They won't be furious when they're dead.
I think the main value of the Internet Archive is not so much in the near term, but in the long term. I hope in the future they enact some policy that ignores any robots.txt for scrapes older than, say, 50 years.
I think I would be tempted to say that the data can't be removed to avoid abuse from future domain owners (or current ones) but I'm not sure if there would be any legal consequences of this attitude.
The issue of curators' views biasing the contents of collections seems to be underappreciated in general in the digital age, for some reason.
Archive Team (not a part of Internet Archive) actually archives piles and piles of web-based material, sometimes in response to current events, sometimes because of known shutting down of services, and sometimes because of speculative worry about longevity. (For an example of the last one, we've been archiving all current FTP sites left.)
Meanwhile, Internet Archive's crawlers are bringing in millions (really millions) of URLs every day, just constantly grabbing websites, files, video, you name it.
There's certainly a "bias" to the current administration in terms of 1. They're in power 2. They keep removing things new and old. But think of it as us having a few lights shined in specific directions while thousands of other floodlights go literally everywhere.
In general, the archive spiders the web and ingests information so that there is a certain mean frequency of visits and a certain likelihood of any particular revision of a web page being captured.
There would be instances in which data was entered into the archive more certainly and more frequently, on the basis of the nature of that data, than otherwise would have occurred.
What one means by bias when one says that this biases the contents of the collection needs to be understood with some care. It would be interesting to hear some historians discuss the matter. I do not think that it is a type of bias that is likely to lead them very far astray.
If it mollifies your concerns any, the last time I checked, anyone could manually archive any web page they liked. However, I would recommend writing to The Archive to express your concern.
I have an entirely partisan appreciation of the ability of The Archive to prevent redactions from the historical record of material that might later be disavowed. However, I share your more general view that there is no reason that the online history of any single major U.S. political party should be documented any less carefully than any other.
All I can say is f*ck that. It's a free and open internet. If you put content up on a public site, anyone has the right to go and look at it. Stop complaining when someone saves it.
And sure, some people complain that scrapers slow down their site and that's why they use robots.txt, but really? Really? It's 2017 and your site is affected by that? I think you have bigger things to worry about.
That someone wants to use a robot to completely scrape an entire dynamic website is their goal. A site is not responsible to make that possible. One bot causes _way_ more traffic and CPU usage than just a normal visitor or 1000s of visitors.
Saying '2017' or anything else: meh.
Various network operators are pretty helpful. Sending abuse complaints regarding misbehaving bots has resulted in actions before. I've seen action being taken from universities, ISPs, etc. Though normally the bots are auto-blocked (on IP address or ranges; quite easy to script).
robots.txt is an established / de facto standard. Ignore it, and be prepared to explain why. IMO pretty much any country has computer hacking laws which are vague enough that consciously ignoring such a standard can be seen as "invading".
A "not my problem" approach: I think you should really think a little bit more.
Also thanks for spreading bad information.
You're not wrong about robots.txt, you're wrong in a much more broad way. There is in fact an extremely dangerous law that could easily ensnare what you're talking about:
I don't think that browsing a web page and saving its content is the same as scamming people with a fake online site. That kind of scam is growing in our country and the local police don't have any rights.
If it's a global problem we need global rules. We can't have China not respecting authors' rights while, on the other hand, only local people get blamed; that's stupid.
Especially when it's non-technical people who make the rules: they don't know tech and therefore should not have a say in it.
EDIT: You can be mad at me and downvote, but what I say is true and relevant. The US is not the only country in the world, especially when there are other ways to protect your site than a robots.txt.
Obviously some countries have a more lax enforcement than others, but don't be surprised if the US starts squeezing and one day you suddenly get a knock on the door.
Using simple conditional tests in haproxy, I stop most of the bots from crawling anything more than my root page, robots.txt and humans.txt. Anything else gets silently dropped and the bots will retry for a while then go away. I don't see anything in the logs beyond the root page and robots/humans.txt any more.
archive.org fucked up by making robots retroactive, if they used the archived robots.txt as a filter for a site at the relevant date, they'd have had the best of both worlds - respecting how sites work without losing how sites appeared at a date.
What the archive can do after that point is a different issue, but they clearly can keep a copy. Further, if someone else is now using the domain, they don't necessarily have anything to do with the archived data.
Google and others have enhanced robots.txt to enable permission for crawling (allow, sitemap), meta tags can deny archiving and various means allow permission to be explicitly denied for caching.
To use your analogy of raising a sign: if you don't put up a 'no trespassing' sign then it doesn't make trespassing legal.
FWIW I disapprove of this state of affairs and consider copyright to be hugely defective in these respects.
>but they clearly can keep a copy //
It's nuanced but permission to access a page =/= permission to keep a copy. Just as you have explicit permission to access a video on YouTube but in most jurisdictions will not have permission to download it for later (commercial) use.
Right. And it's actually not a bad analogy as analogies go. Not having a sign doesn't make trespassing legal but if someone sometimes walks over a corner of your property and you go to the police to try to get him arrested, the first thing they'll probably ask you is if you have your property posted and/or if you bothered to ask him to stop. If the answer is no, they'll probably tell you to go away and do so and only come back if he ignores the sign.
The requirement to post a sign to make trespassing an actionable offence is a USA thing AIUI, it's not a UK thing at least, but copyright is almost universal and doesn't require even adding a (c) mark, it's automatic at the point of creation under the Berne Convention. Or in other words you've pushed your analogy too far and hit a marked legal difference between USA physical property law enforcement and international intellectual property law.
The more general principle is that if no harm is done and the individual/organization will just stop the action if you asked, the courts are often reluctant to get involved. There are exceptions of course, especially in the vein of making an example of someone to discourage others.
Also, it's meaningless for a bot to get permission without having permission to make a copy. There are arguments around the number of copies, but the clear implication is that at least all the routers can make a copy.
Not really. There are often default robots.txt files that the system just puts there in the course of building a default website.
The legally "right" way to do things is only archiving sites that give explicit permission to do so. But then, for all intents and purposes, you can't have a web archive. So we have the current ask forgiveness rather than permission system which works fine most of the time for organizations like the IA and AT. But it does mean that someone like the IA is inclined to err on the side of removing content if someone objects.
Further, setting up a physical device connected to a public IP is never default behavior, so you are putting up the sign in either case. So, at best, your argument is that someone authorised to do something put up a sign by mistake saying something that was not intended, but your intent has little relevance at that point.
Worse, your argument is based on the assumption that nobody knew what was going on so even simple coursework mentioning robots.txt would demonstrate knowledge and thus intent through willful inaction.
There's a minor technical problem with that USC too, it seems. It allows archives to "reproduce no more than one copy of a work". But to compare a website you make a second [admittedly transient] copy to decide whether to re-archive. That's technically not within the scope of that 17USC108 accommodation AFAICT. This may have been solved in US law; I've a feeling there was a modification of EU law to allow transient/cache copies?
But they're not a target that you're going to collect big from, as a non-profit archive/library they're sympathetic whether or not that gives them any special legal standing, and they'll basically take down your content past and present if you ask them to.
So it will almost certainly cost you money to sue them, you won't collect much in the best case, and you can get your content taken down in about as much time as it would take you to pick a lawyer out of the phone book.
On Google? Really?
The author runs CelebrityNetWorth.com, which BusinessInsider cites, but the snippets cite BusinessInsider. So the user doesn't see the proper attribution.
A poorly written scraper may really slow down your site, especially if it wasn't intended to be scraped repeatedly. There should be something to be said about the frequency scrapers should follow (specified by the website owner via a robots.txt-like spec).
But website owners cannot demand unreasonable frequencies (such as once a year!), and what constitutes unreasonable is up for debate.
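For what it's worth, some crawlers already honor a nonstandard `Crawl-delay` line in robots.txt, and Python's `urllib.robotparser` can read it. Here is a minimal sketch of a crawler that follows the requested delay but caps it at what it considers reasonable. The function name `polite_delay`, the `ExampleBot` agent string, and the default and cap values are all assumptions for illustration, not anything a real crawler is known to use.

```python
from urllib.robotparser import RobotFileParser


def polite_delay(robots_txt, agent="ExampleBot", default=1.0, cap=60.0):
    """Return the number of seconds to wait between requests.

    Uses the site's Crawl-delay if one is given, but never waits longer
    than `cap` (a hypothetical upper bound chosen by the crawler).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    requested = parser.crawl_delay(agent)  # None when no Crawl-delay applies
    if requested is None:
        return default
    return min(float(requested), cap)
```

With this, a site asking for a 10 second gap gets it, while a site asking for one visit a year is treated as asking for the cap. Where the cap should sit is exactly the part that is up for debate.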
Nope, if a website wants such a restriction, it must enforce it. Robots.txt is a request. It's worthless.
> Stop complaining when someone saves it.
What you don't say is that it is fine to recreate and publish that content against the owner's wishes, especially when said content is copyrighted in one way or another. You're failing to see the whole picture from the content owner's point of view.
For instance, is it OK to crawl a blog with explicit copyright, save that data, then publish it elsewhere?
That's inviting lawsuits they can't win and expecting people to pay for the bandwidth too.
Jason Scott (an employee of the Internet Archive) mentioned that the Archive doesn't ever delete anything. He stated that items may be removed from public access because of changes to "robots.txt" but they're not actually deleted. (That's a little comforting, at least.)
Archive Team's take on this
The problem: when robots.txt for a website is found to have been made more restrictive, the IA retrospectively applies its new restrictions to already-archived pages and hides them from view. This can also cause entire domains to vanish into the deep-archive. No-one outside IA thinks this is sensible.
Their solution: ignore robots.txt altogether. What? That will just annoy many website operators.
My proposed solution: keep parsing robots.txt on each crawl and obey it progressively, without applying the changes to existing archived material. This is actually less work than what they currently do. If the new robots.txt says to ignore about_iphone.html you just do that and ignore it. Older versions aren't affected.
Basically they're switching from being excessively obedient to completely ignoring robots.txt in order to fix a self-made problem. I can only see that antagonising operators.
What needs to be fixed first is just the really common case mentioned in the blog post, where a domain changes ownership and a restrictive robots.txt is applied to the parking page.
- Respect robots.txt at the time you crawl it.
- If robots.txt appears later, stop archiving from that date forwards.
- Preserve access to old archived copies of the site by default.
- Offer a mechanism that allows a proven site owner to explicitly request retrospective access removal.
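A minimal sketch of the first three rules, assuming the archive has stored each robots.txt it observed together with the date it was first seen. The `robots_history` list of `(date, text)` pairs and both function names are hypothetical; nothing here reflects how archive.org actually stores or evaluates its data.

```python
from datetime import date
from urllib.robotparser import RobotFileParser


def rules_in_force(robots_history, capture_date):
    """Return the robots.txt text that was live on capture_date, or None."""
    current = None
    for seen_on, text in sorted(robots_history):
        if seen_on <= capture_date:
            current = text
    return current


def capture_visible(robots_history, capture_date, url, agent="ia_archiver"):
    """Judge a capture by the robots.txt of its own date, not today's."""
    text = rules_in_force(robots_history, capture_date)
    if text is None:
        return True  # no robots.txt had been observed yet at that date
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical history: a blanket Disallow first seen in August 2010.
history = [(date(2010, 8, 1), "User-agent: *\nDisallow: /\n")]
print(capture_visible(history, date(2006, 1, 1), "http://example.com/"))
print(capture_visible(history, date(2011, 1, 1), "http://example.com/"))
```

Captures from before the robots.txt appeared stay visible, and captures from after it are hidden. The fourth rule (removal on request by a proven owner) would need a separate override list, which is left out here.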
If archive.org have recorded the date that they first observed a robots.txt on the sites currently unavailable, they could even consider applying the above logic today retrospectively. Perhaps after a couple of warning emails to the current Administrative Contact for the domain.
It should be "a proven content owner"; just buying a site shouldn't allow someone to remove it from the archive.
The IP address changing is a pretty solid indicator that control of that content has moved to a new organisation. Note this does not always coincide with the domain name owner changing.
A scenario that I can imagine becoming litigious: a company owns a domain for promoting some product and uses robots.txt to prevent copies. The product reaches end of life and the domain is allowed to expire. Someone else buys the domain and starts hosting content with no robots restriction. Archive.org starts to display pages from the old company. The company then sues archive.org for copyright violation.
It looks like Facebook banned ia_archiver (recently? I recall it worked a few weeks ago):
How about an IETF RFC to clarify?
Libraries operate under a lot of unwritten social conventions, perhaps even more than most other institutions. (robots.txt, even if largely ignored, is a popular convention.) Aggressive or confrontational wording, regardless of whether they are "right", doesn't seem to be in libraries' interests.
After I graduated from college I lost access to my website which was hosted on the Computer Science department's web servers.
I wish I hadn't used that robots.txt file. I would love to find the pages I made that compared interfold vs. exterfold staple strength, or the site I made with a ranch theme with a cowboy that had humorous advice....I don't have any content in archive.org because it honored the robots.txt file.
...sigh...wish I had backed up my stuff.
For example I've got a link to do delegated login like /login-with/github. When people click it an oauth flow will start. But it is useless for robots to follow so I disallow it in robots.txt. If they still follow nothing breaks and it's not a security issue but if I can avoid starting unnecessary oauth logins it's an additional benefit.
However I'm not claiming security is the only reason people use (misuse?) robots.txt. For example in your case you could mitigate your need for a robots.txt with a nofollow attribute. Sure, bad bots could still crawl your site and find the authentication URL without probing robots.txt, so the security implications there are pretty much non-existent. But you've already got a thoughtful design (the other point I raised) that mitigates the need for robots.txt anyway, so adding something like "nofollow" may be enough to remove the robots.txt requirement altogether.
According to your logic, newspapers are a "failed experiment because they rely on trust rather than security or thoughtful design". I published an article with my treasure map and told people not to go there, but they stole it.
I said following proper security and design practices renders obsolete all the edge cases that people might use robots.txt for. I'm saying if you design your site properly then you shouldn't really need a robots.txt. That applies to all the examples that HN commentators have raised in terms of their robots.txt usage thus far.
I would rewrite my OP to make my point clearer but sadly I no longer have the option to edit it.
But how? For example, if you don't want a page to be indexed by Google, you add this information to robots.txt. Nofollow doesn't work for every case, because any external website can link to it, and Google will discover it.
<meta name="robots" content="noindex">
Interestingly in that article, there is the following disclaimer about not using robots.txt for your example:
"Important! For the noindex meta tag to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex tag, and the page can still appear in search results, for example if other pages link to it."
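To make that interaction concrete, here is a rough sketch of a well-behaved crawler's view of a page: it only reads the `noindex` tag if robots.txt lets it fetch the page in the first place. The function `crawler_sees_noindex`, the class `RobotsMetaFinder`, and the `ExampleBot` agent string are made up for illustration; this is not how any particular search engine is implemented.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            if "noindex" in (attributes.get("content") or "").lower():
                self.noindex = True


def crawler_sees_noindex(robots_txt, path, html, agent="ExampleBot"):
    """True only if the crawler may fetch the page AND finds noindex in it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, path):
        return False  # page is never fetched, so the tag is never read
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex
```

So a page that is both disallowed in robots.txt and tagged `noindex` ends up in the "tag never seen" branch, which is the situation the disclaimer warns about.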
I must admit even I hadn't realised that could happen and I was critical of the use of robots.txt to begin with.
Nofollow is a good suggestion if you control links to the resource, robots.txt if you don't.
Using robots.txt to secure your server from bots is the equivalent of attempting to secure your house from robbery by planting a sign that says "please, don't rob my house". Surprisingly it may work from time to time, but if you're into attempting security by wishful thinking, maybe don't be too surprised when it fails about as often as security by chance.
If you need to add security (logins) to protect content you don't need to protect, you inconvenience users.
Your point about sitemaps helps illustrate that point of mine because having a decent sitemap mitigates the need for Allow lines in robots.txt. It's another feature of the web that robots.txt isn't well equipped to handle, and thus other, better tools have been built to highlight pages of interest to search engines.
robots.txt was proposed after a badly behaved bot DoSed a web server 20+ years ago, those were different times.
With the robots.txt standard, those who want to play nice can now do so without asking anything; for the badly behaved ones it's still up to the admin to put forward the appropriate measures.
I do get what you're saying but if you have to implement "appropriate measures" anyway then the robots.txt file becomes completely redundant.
It should be non-negotiable if you don't want your personal contents indexed by scrapers and archivers, and it should be enforced by design. It's a broken system.
I think they should honor robots.txt, and the meta tag version on specific pages or links -- given the site publisher went out of their way to give instructions to crawlers it seems reasonable to honor those requests.
Here's my shameless plug: https://github.com/yeukhon/robots-txt-scanner
I still remember writing most of this on Caltrain one morning heading to SF visiting someone I dearly loved.....
I have a very big problem with them disregarding robots directives. Sure some crawlers ignore them: Hostile net actors up to no good. This decision means they are a hostile net actor. I'll have to take extreme measures such as determining all the ip address ranges they use and totally blocking access. This inconveniences me, which means they are now my enemy.
edit- For those interested: Deny from 22.214.171.124/22
Because believe me, we do...good luck banning every AWS and DO IP range.
Good luck playing whack-a-mole against the crawlers. I admit to being very curious what you're openly hosting online that you really don't want to get saved for posterity?
I have considered putting up a single file that is only accessible via no-follow links and perma-banning any IP that accesses the file, as a way to punish bad robots.
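A minimal sketch of that trap, assuming the ban list lives in memory and the trap sits at a made-up path. The class name `HoneypotBan` and the path `/trap.html` are hypothetical, and the IPs in the test are documentation-range addresses; a real setup would hook this into the web server or firewall and persist the list.

```python
class HoneypotBan:
    """Bans any client IP that requests the trap path."""

    def __init__(self, trap_path="/trap.html"):
        self.trap_path = trap_path
        self.banned = set()

    def allow(self, ip, path):
        """Return True if the request should be served."""
        if ip in self.banned:
            return False
        if path == self.trap_path:
            # Only a bot ignoring nofollow (or robots.txt) should get here.
            self.banned.add(ip)
            return False
        return True
```

Banning by IP has the usual caveat that several clients can share one address, so a permanent ban may catch more than the misbehaving bot.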
Not so long ago, changing your user-agent to that of a search engine bot was a simple workaround for some paywalled pages that appeared in search results.
It's also part of the techniques used to give extra privacy and messing with fingerprinting. For example random agent spoofer: https://github.com/dillbyrne/random-agent-spoofer
If a site doesn't want to be scanned it can adopt a lot of countermeasures, and robots.txt will not save it from abuse.
It reminds me of the old days when my website didn't work from the US because I just faked that the site was down, since there was no reason for somebody in the US to visit my site (I know it's kind of stupid, but when all your content is in French and you are a kid... :) )
The specifics here matter a great deal, the versions so far are regularly abused by the wealthy and don't apply to any of the data warehouses that the powerful and well connected have access to.
Where did this "right" come from? What's the legal and ethical basis for it? It is analogous to censorship or book burning at the basic level, destroying information to hide it from the public. It requires a consistent and strong justification as well as justified limited scope because of that, and it better be obviously beneficial to society even accounting for the inevitable misuse by those in power.
It's German, but basically it says: "This is not a blank check".