
Yes, I know, I'm a member of Archive Team, and I use "wget -e robots=off --mirror …" quite a bit, and then I upload those WARCs to the IA. But major content providers like the Washington Post that explicitly choose to block their entire website and its history should be named and shamed.

Authors don't get the right to go around removing their novels from public libraries just because they would rather the books be available only for pay in bookstores.



It's not really shameworthy to want to regulate access to your own sites, and physical metaphors work about as well here as "you wouldn't steal a car" does for piracy.

The Internet Archive does wonderful work, but just because somebody doesn't want you folks crawling their content doesn't make them worthy of "naming and shaming".


Why can a physical library collect and display physical newspapers, but a digital library cannot collect and display digital newspapers?


As I pointed out elsewhere in the thread, making analogies to physical media is just as flawed as the "you wouldn't steal a car" anti-piracy campaigns.

A physical library is either getting their newspapers by asking/paying the newspaper company to deliver them, asking citizens to donate them, or collecting them from already delivered newspapers. If the IA was just piggybacking on user activity (by caching and storing things from a user's browser cache after they visit a page) then I'd have far less of a concern with them. If we're so attached to physical metaphors, this would be equivalent to the librarian running around outside the newspaper's printing room and snatching newspapers from the bundles as the company's employees loaded them onto trucks.


I was going to reply pointing out that whether or not to name and shame someone is a subjective decision which you and I do not see eye to eye on, and which generally requires quite a few people to agree with you before it becomes a problem for the shamee. But then I remembered the poor way that IA handles changes in domain ownership with respect to robots.txt.

When IA stops wiping out historical content because a domain has changed ownership, I will have more support (and USE) for them.


How is IA supposed to distinguish a new website from a sincere wish to delete old stuff? A change in domain registration data means nothing; I have a domain that I registered for an association in my name, and which I then sold to them (for a symbolic price), but it was only an administrative issue - the site was the same.

IA is on iffy territory w.r.t. copyright as it is; if they stop respecting robots.txt, they could get into a world of hurt.


Your last sentence is key. As I understand it, there's no real legal precedent for IA, which basically copies everything out there on an opt-out basis. I personally am glad they do, but one of the ways they get away with it is by treading as lightly as possible, including respecting robots.txt even retroactively.

They're also non-commercial, broad in scope, arguably serve a valuable scholarly function and have other characteristics that have kept them mostly out of legal hot water. But it's unclear to what degree they're legally different from a site that decided to create an archive of all comics, commercial and otherwise, and slap advertising up.


Internet Archive isn't wiping out historical content. It's just unavailable/hidden for the time being (as long as a restrictive robots.txt is in place).
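To make the "hidden, not deleted" point concrete, here is a rough sketch of how a robots.txt check can gate visibility. This is not IA's actual code; the function name, the example URL, and the "ia_archiver" user-agent string are assumptions for illustration only. It uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def is_viewable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper: an archive could run a check like this against the
    site's *current* robots.txt before showing an already-stored snapshot.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A restrictive file: every crawler is barred from the whole site.
restrictive = "User-agent: *\nDisallow: /\n"
# A permissive file: an empty Disallow means nothing is barred.
permissive = "User-agent: *\nDisallow:\n"

url = "https://example.com/2015/story.html"  # placeholder URL
agent = "ia_archiver"  # assumed crawler name, for illustration

print(is_viewable(restrictive, agent, url))
print(is_viewable(permissive, agent, url))
```

The stored snapshot is the same in both cases; only the answer to "may I show it?" changes when the site's robots.txt changes.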


Unfair to name & shame a private entity that doesn't want its content to be archived.


privately owned but publicly acting!



