
I noticed a month or two ago that certain sites seem to be excluded even from internet archives like the Wayback Machine. The first example I stumbled on was Snopes. I only noticed because I saw a Snopes link on Reddit where the thumbnail was clearly different from the live version; the thumbnail appeared to show a human-like baby in the womb, whereas the live version had that edited out. Example: https://web.archive.org/web/*/https://www.snopes.com/fact-ch...


I get "Sorry. This URL has been excluded from the Wayback Machine."
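You can check this programmatically too. Below is a minimal sketch, assuming the publicly documented Wayback Machine availability API at archive.org/wayback/available; an excluded (or never-archived) URL comes back with no snapshots. The Snopes URL is just a placeholder example.

    # Sketch: ask the Wayback Machine availability API whether any snapshot
    # of a URL is currently being served.
    import json
    import urllib.parse
    import urllib.request

    def wayback_snapshot(url):
        """Return the closest served snapshot for `url`, or None if nothing is served."""
        query = urllib.parse.urlencode({"url": url})
        with urllib.request.urlopen(f"https://archive.org/wayback/available?{query}") as resp:
            data = json.load(resp)
        # Excluded or never-archived URLs return an empty "archived_snapshots" object.
        return data.get("archived_snapshots", {}).get("closest")

    print(wayback_snapshot("https://www.snopes.com/"))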

To my understanding, websites can choose not to be included in the Wayback Machine.


Not only that: if a site is archived at one point and the owner later decides to be excluded, then all of the archives for that domain are deleted retroactively. This means a valuable site can be archived for years, but if the domain lapses and gets bought up by somebody who decides to exclude the Wayback Machine, all of that precious content is destroyed, even though the destroyer never owned it!

I feel like we’re in the Internet equivalent of the early years of Hollywood. We’ve lost so many precious early movies because there were no backups of the film.


I don't think your interpretation of the latter case is quite right. In my understanding, the Archive is willing to disable display of the site content retroactively due to a new robots.txt, but they won't delete that content. It will still be present in their files.


What you’re describing may have been a recent change that I wasn’t aware of. If so, it’s at least some consolation. We could still do better.


I believe that has been the case for the entire existence of the Wayback Machine, not just recently.


This policy was actually changed in 2017 to completely ignore robots.txt.

[0] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


Deleted or not made available?


If the site goes down and its robots.txt becomes unavailable, the content becomes viewable again on the Internet Archive.


This is how it works, yes.

----> Why isn't the site I'm looking for in the archive?

Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl. It's also possible that some sites were not archived because they were password protected, blocked by robots.txt, or otherwise inaccessible to our automated systems. Site owners might have also requested that their sites be excluded from the Wayback Machine.

----> How can I exclude or remove my site's pages from the Wayback Machine?

You can send an email request for us to review to info@archive.org with the URL (web address) in the text of your message.

https://help.archive.org/hc/en-us/articles/360004651732-Usin...


They've recently made the process intentionally more annoying. You used to be able to exclude a site you control via a robots.txt file. That wouldn't actually delete archived content and would merely hide it from viewing. They no longer honor that approach. You now have to send them an email and specifically request content removal.
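For reference, the old opt-out looked roughly like the sketch below, assuming the crawler still identified itself as "ia_archiver" (its historically documented user agent); rules like these used to hide a site from playback, whereas today they are no longer honored and removal goes through email instead.

    # Sketch of the former robots.txt-based opt-out, checked with the
    # standard-library parser. The "ia_archiver" user agent is the
    # historically documented one and is an assumption here.
    import urllib.robotparser

    ROBOTS_TXT = """\
    User-agent: ia_archiver
    Disallow: /
    """

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())

    # With the rule above, the archive crawler would have been denied everything.
    print(parser.can_fetch("ia_archiver", "https://example.com/any-page"))  # False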


I consider this to be acceptable.


Most rational people would.


That approach allowed domain spammers to (unintentionally?) disable access to content that predates their ownership of the domain, so it was a garbage approach. In the current situation, a content owner can still achieve removal, but a subsequent owner of the domain cannot. That's how it should be. If I buy domain x, I should not be able to request the removal of content that was present on domain x before I owned it.


Snopes kept quietly making politically-motivated changes, and they kept getting caught. Now they can get away with it.



