Hacker News

Nice that they're there. At the same time, it's amazingly easy for content to be removed from there, whether someone objects or even when things are murky.

For example, all content from the old ezboard site has been removed based on the configuration of the current URL owner's robots.txt, and the current URL owner is just a domain parker. Ezboard hosted a lot of content back in the day.

https://archive.org/post/560730/ezboard-is-there-any-hope




This is an old problem. I could have sworn there were promises they were going to change their procedures.

The question I have is how fast the content is removed after the domain name registration changes. That is, is there a window of time between the appearance of a new robots.txt and the next scheduled crawl? If so, would it be possible to "rescue" the content during that window, as ArchiveTeam would do, before it disappears?

If this is possible, there could be a service for monitoring changes to domain name registrations for sites that have large amounts of historical content. I would happily volunteer to set up such a service.
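The monitoring idea above could be sketched in a few lines. Everything here is hypothetical (the watch list, the rescue hook); it just shows the core loop of fingerprinting robots.txt so a change can be detected between scheduled crawls:

```python
# Sketch of a scheduled robots.txt monitor for domains with large
# archived histories. The domain list and rescue hook are hypothetical.
import hashlib
import urllib.request

WATCHED = ["example.com"]  # hypothetical watch list

def fingerprint(body: bytes) -> str:
    """Stable hash of a robots.txt body, comparable across runs."""
    return hashlib.sha256(body).hexdigest()

def robots_fingerprint(domain: str) -> str:
    """Fetch the live robots.txt and fingerprint it."""
    url = f"https://{domain}/robots.txt"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return fingerprint(resp.read())

# Each run: compare against the fingerprint stored last time. A change
# right after a registration transfer is the signal to start a rescue
# crawl before the next scheduled pass honors the new robots.txt.
```

Pairing this with a feed of registration changes (e.g. WHOIS expiry dates) would narrow the polling to domains actually at risk.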


Well, "complain on hn" has been a way to get stuff from Google. Maybe someone at archive will notice this thread.


I'm curious. What changes has Google made due to complaints on HN?


Hopefully it is just hidden and not deleted. But this is the main reason alternative archive sites exist which ignore the original poster's requests. They are frequently used to archive posts from public figures that are suspected of being scrubbed later.


> Hopefully it is just hidden and not deleted.

Hidden. Even when you ask them to remove stuff.

Had a domain, stuff got archived, asked them to remove it, added robots.txt. Domain lapsed. Someone else picked it up. Their robots.txt is now permissive, and the old stuff I asked them to remove is now visible.


That's insidious. Does it even make sense to revive data that has been removed, based on the current configuration?

Even if the owner is the same, allowing the site to be archived going forward isn't the same thing as permitting it retroactively.


Wonder how you'd overcome that flaw, is there a history of domain name ownership?


Maybe we need a whois.archive.org.


Just delete the things when requested. No need to make it more complicated than that.


When requested by who? The current owner of the domain? Do they own what was on that domain 20 years ago?

What if you lost your domain, but owned it in the past? Can you delete stuff from that era?


I’ve requested old domains of mine to get removed from archive.org and at least as of a few years ago it was an all or nothing thing. There was no surgical way to say “only remove content from between 1995 and 2005 when I owned the domain”. Maybe things have changed since then, though.


As far as I've been able to tell, it's just hidden and not deleted.

I have to keep an old domain indefinitely to host a robots.txt just to keep sensitive personal data hidden that little me foolishly published on the open internet.

But I'm not complaining. The internet archive is a great gift. Using it with a bookmarklet really feels like a super power.


It sounds a lot like this would need some kind of delegation mechanism where you could point to a different URL ahead of time, before abandoning the place. Or maybe some kind of sealing using a cryptographic function that lets you prove you are the owner of the current/previous content, and also prove you are not the owner of the newer content, so that proof could be used to lift the ban if ever needed.
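The sealing idea could be built on an ordinary hash commitment. This is purely illustrative (no such Internet Archive mechanism exists, and all names are mine): while you still control the domain, you publish a digest; after losing the domain, revealing the preimage proves you were the owner at commitment time.

```python
# Illustrative hash-commitment sketch for proving former domain
# ownership. Not an existing Internet Archive feature.
import hashlib
import secrets

def make_commitment(statement: bytes) -> tuple[bytes, str]:
    """While still owning the domain, publish the digest (e.g. in a
    well-known file); keep the nonce private."""
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha256(nonce + statement).hexdigest()
    return nonce, digest

def verify(nonce: bytes, statement: bytes, digest: str) -> bool:
    """Later, revealing (nonce, statement) proves authorship of the
    earlier commitment, even after the domain changed hands."""
    return hashlib.sha256(nonce + statement).hexdigest() == digest
```

A new owner cannot forge the reveal without the nonce, which is what makes the "I owned this until date X" claim checkable.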


Got any examples of these alternative archives?


This is the main one https://archive.is/

From the FAQ, they do not respect robots.txt since they only archive on request by a user and they do not remove archives unless they contain illegal content.


HN commenters like to use archive.is but I always wonder if people are aware that archive.is is (a) blocked in some countries^1 and (b) may block access to itself in a country if it feels threatened.^2

There is also the issue of EDNS subnet.^3 archive.is tries to require it; it wants to know what location a request is coming from. In addition to EDNS, archive.is inserts the IP address and geolocation of the incoming request into the HTML of the returned page as a tracking pixel.^4

Thus archive.is does some things archive.org does not, beyond just ignoring robots.txt.

One of the things archive.org does that archive.is does not do is that archive.org inserts an HTTP response header intended to disable Chrome FLoC.^5 I add this header for all sites in a local proxy; however I do not see many sites adding it as a courtesy. Thanks archive.org for doing that.

1. https://en.wikipedia.org/wiki/Archive.is

2. https://www.reddit.com/r/KotakuInAction/comments/3e29vm/arch...

3. https://webapps.stackexchange.com/questions/135222/why-does-...

4. https://news.ycombinator.com/item?id=27498902

5. permissions-policy: interest-cohort=()
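Adding that header yourself, as the local-proxy approach above does, is a one-liner per response. A minimal WSGI middleware sketch (the function name is mine):

```python
# Minimal WSGI middleware that appends the FLoC opt-out header
# (footnote 5) to every response, mimicking a local proxy's behavior.
def add_floc_optout(app):
    def middleware(environ, start_response):
        def start(status, headers, exc_info=None):
            # Append the opt-out header alongside whatever the
            # wrapped app already set.
            headers = list(headers) + [
                ("Permissions-Policy", "interest-cohort=()"),
            ]
            return start_response(status, headers, exc_info)
        return app(environ, start)
    return middleware
```

Wrapping any WSGI app with `add_floc_optout` makes every page it serves opt out of cohort tracking, which is the same courtesy archive.org extends.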



There's archive.is, but I get the sense that the major use case for that is getting around paywalls as opposed to permanently archiving a page. Indeed, since they host content that the site owner probably doesn't want them to, it stands to reason the service is not likely to stand the test of time. But I could be wrong.


They actually posted about this in 2017: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea... . At the time it sounded like they might change their robots.txt policy. I guess they never followed up on it.

(I checked and ezboard is still excluded.)



