I noticed a month or two ago that certain sites seem to be excluded even from internet archives like the Wayback Machine. The first example I stumbled on was Snopes. I only noticed because I saw a Snopes link on Reddit where the thumbnail was clearly different from the live version; the thumbnail appeared to show a human-like baby in the womb, whereas the live version had that edited out. Example: https://web.archive.org/web/*/https://www.snopes.com/fact-ch...
Not only that: if a site is archived at one point and its owner later opts to be excluded, then all of the archives for that domain are deleted retroactively. This means a valuable site can be archived for years, but if the domain lapses and gets bought up by somebody who decides to exclude the Wayback Machine, all of that precious content is destroyed even though the destroyer never owned it!
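For context on what that web/*/ listing is showing: the Wayback Machine exposes the same capture index through its public CDX API. Here is a minimal sketch, assuming Python 3 (stdlib only) and the documented cdx/search/cdx endpoint with output=json; an empty result can mean the URL was never crawled, or that it has since been excluded.

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def list_captures(url, limit=20):
        # Query the Wayback Machine's CDX index for captures of a URL.
        query = urlencode({"url": url, "output": "json", "limit": limit})
        with urlopen(f"https://web.archive.org/cdx/search/cdx?{query}", timeout=30) as resp:
            rows = json.load(resp)
        if not rows:
            return []  # no captures returned (never crawled, or excluded)
        # First row is the column header; remaining rows are captures.
        header, entries = rows[0], rows[1:]
        return [dict(zip(header, row)) for row in entries]

    # Example: list the first few snapshots recorded for a domain.
    for capture in list_captures("snopes.com"):
        print(capture["timestamp"], capture["statuscode"], capture["original"])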
I feel like we’re in the Internet equivalent of the early years of Hollywood. We’ve lost so many precious early movies because there were no backups of the film.
I don't think your interpretation of the latter case is quite right. In my understanding, the Archive is willing to disable display of the site content retroactively due to a new robots.txt, but they won't delete that content. It will still be present in their files.
----> Why isn't the site I'm looking for in the archive?
Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl. It's also possible that some sites were not archived because they were password protected, blocked by robots.txt, or otherwise inaccessible to our automated systems. Site owners might have also requested that their sites be excluded from the Wayback Machine.
----> How can I exclude or remove my site's pages from the Wayback Machine?
You can send an email request for us to review to info@archive.org with the URL (web address) in the text of your message.
They've recently made the process intentionally more annoying. You used to be able to exclude a site you control via a robots.txt file. That wouldn't actually delete archived content and would merely hide it from viewing. They no longer honor that approach. You now have to send them an email and specifically request content removal.
That approach allowed domain spammers to (unintentionally?) disable access to content that predates their ownership of the domain, so it was a garbage approach. In the current situation, a content owner can still achieve removal, but a subsequent owner of the domain cannot. That's how it should be. If I buy domain x, I should not be able to request the removal of content that was present on domain x before I owned it.
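For reference, the old robots.txt mechanism both of those comments describe worked roughly like this: the Archive's crawler, which is widely reported to identify itself as ia_archiver, would respect a site-wide Disallow for that user-agent, and the Wayback Machine would then stop displaying the domain's captures. A minimal sketch of checking such a rule with Python's standard-library robot parser (the ia_archiver name and the example.com URLs are assumptions for illustration, not a statement of the Archive's current behavior):

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt.
    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    # True if a crawler identifying as "ia_archiver" may fetch the page;
    # a site-wide "User-agent: ia_archiver" / "Disallow: /" makes this False.
    print(rp.can_fetch("ia_archiver", "https://example.com/some-page"))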