Hacker News

Nice that they're there. At the same time, it's amazingly easy for content to be removed from there, whether someone objects or even when things are murky.

For example, all content from the old ezboard site has been removed based on the configuration of the current URL owner's robots.txt, and the current URL owner is just a domain parker. Ezboard hosted a lot of content back in the day.

https://archive.org/post/560730/ezboard-is-there-any-hope




This is an old problem. I could have sworn there were promises they were going to change their procedures.

The question I have is how fast the content is removed after the domain name registration changes. That is, is there a window of time between the appearance of a new robots.txt and the next scheduled crawl? If so, would it be possible to "rescue" the content during that window, as ArchiveTeam would do, before it disappears?

If this is possible, there could be a service for monitoring changes to domain name registrations for sites that have large amounts of historical content. I would happily volunteer to set up such a service.
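The monitoring idea above could be sketched in a few lines. Everything here is hypothetical (the watch list, the rescue hook); it just shows the core loop of fingerprinting robots.txt so a change can be detected between scheduled crawls:

```python
# Sketch of a scheduled robots.txt monitor for domains with large
# archived histories. The domain list and rescue hook are hypothetical.
import hashlib
import urllib.request

WATCHED = ["example.com"]  # hypothetical watch list

def fingerprint(body: bytes) -> str:
    """Stable hash of a robots.txt body, comparable across runs."""
    return hashlib.sha256(body).hexdigest()

def robots_fingerprint(domain: str) -> str:
    """Fetch the live robots.txt and fingerprint it."""
    url = f"https://{domain}/robots.txt"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return fingerprint(resp.read())

# Each run: compare against the fingerprint stored last time. A change
# right after a registration transfer is the signal to start a rescue
# crawl before the next scheduled pass honors the new robots.txt.
```

Pairing this with a feed of registration changes (e.g. WHOIS expiry dates) would narrow the polling to domains actually at risk.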


Well, "complain on hn" has been a way to get stuff from Google. Maybe someone at archive will notice this thread.


I'm curious. What changes has Google made due to complaints on HN?


Hopefully it is just hidden and not deleted. But this is the main reason alternative archive sites exist which ignore the original poster's requests. They are frequently used to archive posts from public figures that are suspected of being scrubbed later.


> Hopefully it is just hidden and not deleted.

Hidden. Even when you ask them to remove stuff.

Had a domain, stuff got archived, asked them to remove it, added robots.txt. Domain lapsed. Someone else picked it up. Their robots.txt is now permissive, and the old stuff I asked them to remove is now visible.


That's insidious. Does it even make sense to revive data that has been removed, based on the current configuration?

Even if the owner is the same, allowing the site to be archived going forward isn't the same thing as permitting it retroactively.


Wonder how you'd overcome that flaw, is there a history of domain name ownership?


Maybe we need a whois.archive.org.


Just delete the things when requested. No need to make it more complicated than that.


When requested by who? The current owner of the domain? Do they own what was on that domain 20 years ago?

What if you lost your domain, but owned it in the past? Can you delete stuff from that era?


I’ve requested old domains of mine to get removed from archive.org and at least as of a few years ago it was an all or nothing thing. There was no surgical way to say “only remove content from between 1995 and 2005 when I owned the domain”. Maybe things have changed since then, though.


As far as I've been able to tell, it's just hidden and not deleted.

I have to keep an old domain indefinitely to host a robots.txt just to keep sensitive personal data hidden that little me foolishly published on the open internet.

But I'm not complaining. The internet archive is a great gift. Using it with a bookmarklet really feels like a super power.


It sounds a lot like this would need some kind of delegation mechanism where you could point to a different URL ahead of time, before abandoning the place. Or maybe some kind of sealing using a cryptographic function that lets you prove you are the owner of the current/previous content, and also prove you are not the owner of the newer content, so that proof could be used to lift the ban if ever needed.
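The sealing idea could be built on an ordinary hash commitment. This is purely illustrative (no such Internet Archive mechanism exists, and all names are mine): while you still control the domain, you publish a digest; after losing the domain, revealing the preimage proves you were the owner at commitment time.

```python
# Illustrative hash-commitment sketch for proving former domain
# ownership. Not an existing Internet Archive feature.
import hashlib
import secrets

def make_commitment(statement: bytes) -> tuple[bytes, str]:
    """While still owning the domain, publish the digest (e.g. in a
    well-known file); keep the nonce private."""
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha256(nonce + statement).hexdigest()
    return nonce, digest

def verify(nonce: bytes, statement: bytes, digest: str) -> bool:
    """Later, revealing (nonce, statement) proves authorship of the
    earlier commitment, even after the domain changed hands."""
    return hashlib.sha256(nonce + statement).hexdigest() == digest
```

A new owner cannot forge the reveal without the nonce, which is what makes the "I owned this until date X" claim checkable.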


Got any examples of these alternative archives?


This is the main one https://archive.is/

From the FAQ, they do not respect robots.txt since they only archive on request by a user and they do not remove archives unless they contain illegal content.


HN commenters like to use archive.is but I always wonder if people are aware that archive.is is (a) blocked in some countries^1 and (b) may block access to itself in a country if it feels threatened.^2

There is also the issue of EDNS subnet.^3 archive.is tries to require it; it wants to know what location a request is coming from. In addition to EDNS, archive.is inserts the IP address and geolocation of the incoming request into the HTML of the returned page as a tracking pixel.^4

Thus archive.is does some things archive.org does not, beyond just ignoring robots.txt.

One of the things archive.org does that archive.is does not do is that archive.org inserts an HTTP response header intended to disable Chrome FLoC.^5 I add this header for all sites in a local proxy; however I do not see many sites adding it as a courtesy. Thanks archive.org for doing that.

1. https://en.wikipedia.org/wiki/Archive.is

2. https://www.reddit.com/r/KotakuInAction/comments/3e29vm/arch...

3. https://webapps.stackexchange.com/questions/135222/why-does-...

4. https://news.ycombinator.com/item?id=27498902

5. permissions-policy: interest-cohort=()
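Adding that header yourself, as the local-proxy approach above does, is a one-liner per response. A minimal WSGI middleware sketch (the function name is mine):

```python
# Minimal WSGI middleware that appends the FLoC opt-out header
# (footnote 5) to every response, mimicking a local proxy's behavior.
def add_floc_optout(app):
    def middleware(environ, start_response):
        def start(status, headers, exc_info=None):
            # Append the opt-out header alongside whatever the
            # wrapped app already set.
            headers = list(headers) + [
                ("Permissions-Policy", "interest-cohort=()"),
            ]
            return start_response(status, headers, exc_info)
        return app(environ, start)
    return middleware
```

Wrapping any WSGI app with `add_floc_optout` makes every page it serves opt out of cohort tracking, which is the same courtesy archive.org extends.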



There's archive.is, but I get the sense that the major use case for that is getting around paywalls as opposed to permanently archiving a page. Indeed, since they host content that the site owner probably doesn't want them to, it stands to reason the service is not likely to stand the test of time. But I could be wrong.


They actually posted about this in 2017: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea... . At the time it sounded like they might change their robots.txt policy. I guess they never followed up on it.

(I checked and ezboard is still excluded.)



