
I noticed a month or two ago that certain sites seem to be excluded even from internet archives like the Wayback Machine. The first example I stumbled on was Snopes. I only noticed because I saw a Snopes link on Reddit where the thumbnail was clearly different from the live version; the thumbnail appeared to show a human-like baby in the womb, whereas the live version had that edited out. Example: https://web.archive.org/web/*/https://www.snopes.com/fact-ch...


I get "Sorry. This URL has been excluded from the Wayback Machine."
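You can check this programmatically too. Below is a minimal sketch, assuming the publicly documented Wayback Machine availability API at archive.org/wayback/available; an excluded (or never-archived) URL comes back with no snapshots. The Snopes URL is just a placeholder example.

    # Sketch: ask the Wayback Machine availability API whether any snapshot
    # of a URL is currently being served.
    import json
    import urllib.parse
    import urllib.request

    def wayback_snapshot(url):
        """Return the closest served snapshot for `url`, or None if nothing is served."""
        query = urllib.parse.urlencode({"url": url})
        with urllib.request.urlopen(f"https://archive.org/wayback/available?{query}") as resp:
            data = json.load(resp)
        # Excluded or never-archived URLs return an empty "archived_snapshots" object.
        return data.get("archived_snapshots", {}).get("closest")

    print(wayback_snapshot("https://www.snopes.com/"))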

To my understanding, websites can choose not to be included in the Wayback Machine.


Not only that: if a site is archived at one point and the owner later decides to be excluded, then all of the archives for that domain are deleted retroactively. This means a valuable site can be archived for years, but if the domain lapses and gets bought up by somebody who decides to exclude the Wayback Machine, all of that precious content is destroyed, even though the destroyer never owned it!

I feel like we’re in the Internet equivalent of the early years of Hollywood. We’ve lost so many precious early movies because there were no backups of the film.


I don't think your interpretation of the latter case is quite right. In my understanding, the Archive is willing to disable display of the site content retroactively due to a new robots.txt, but they won't delete that content. It will still be present in their files.


What you’re describing may have been a recent change that I wasn’t aware of. If so, it’s at least some consolation. We could still do better.


I believe that has been the case for the entire existence of the Wayback Machine, not just recently.


This policy was actually changed in 2017 to completely ignore robots.txt.

[0] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


Deleted or not made available?


If the site goes down and its robots.txt becomes unavailable, the content becomes viewable again on the Internet Archive.


This is how it works, yes.

----> Why isn't the site I'm looking for in the archive?

Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl. It's also possible that some sites were not archived because they were password protected, blocked by robots.txt, or otherwise inaccessible to our automated systems. Site owners might have also requested that their sites be excluded from the Wayback Machine.

----> How can I exclude or remove my site's pages from the Wayback Machine?

You can send an email request for us to review to info@archive.org with the URL (web address) in the text of your message.

https://help.archive.org/hc/en-us/articles/360004651732-Usin...


They've recently made the process intentionally more annoying. You used to be able to exclude a site you control via a robots.txt file. That wouldn't actually delete archived content and would merely hide it from viewing. They no longer honor that approach. You now have to send them an email and specifically request content removal.
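For reference, the old opt-out looked roughly like the sketch below, assuming the crawler still identified itself as "ia_archiver" (its historically documented user agent); rules like these used to hide a site from playback, whereas today they are no longer honored and removal goes through email instead.

    # Sketch of the former robots.txt-based opt-out, checked with the
    # standard-library parser. The "ia_archiver" user agent is the
    # historically documented one and is an assumption here.
    import urllib.robotparser

    ROBOTS_TXT = """\
    User-agent: ia_archiver
    Disallow: /
    """

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())

    # With the rule above, the archive crawler would have been denied everything.
    print(parser.can_fetch("ia_archiver", "https://example.com/any-page"))  # False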


I consider this to be acceptable.


Most rational people would.


That approach allowed domain spammers to (unintentionally?) disable access to content that predates their ownership of the domain, so it was a garbage approach. In the current situation, a content owner can still achieve removal, but a subsequent owner of the domain cannot. That's how it should be. If I buy domain x, I should not be able to request the removal of content that was present on domain x before I owned it.


Snopes kept quietly making politically-motivated changes, and they kept getting caught. Now they can get away with it.



