I think there are a few problems with that kind of setup. Quick thoughts:
- Content creators have less discretion over whom to allow or block from crawling when there's a middleman index (probably doesn't matter so much now, given the flagrant use of content for AI)
- Content recency. The data sizes can get quite huge, and certain pages require updates more often than others, so who gets to decide? (One user of the index may be interested in a different set of pages than another.)
- Centralised content on the likes of Reddit, which is already aggressively blocking most bots from crawling its content. You'd have to crawl many pages per day (and would quite likely end up getting blocked), as generally only a handful of bots get favourable treatment and can crawl sites more aggressively.
> - Content recency. The data sizes can get quite huge, and certain pages require updates more often than others
I have always imagined that an open crawl corpus aligns closely with the goals of the Internet Archive, to which one can already, strictly speaking, submit updates with second-level precision based on their URL slugs. The bad news is that any such common corpus would actually worsen their bandwidth bill, since I strongly suspect the corpus would be read from much more than it ingests (e.g. snapshot https://news.ycombinator.com/item?id=43318384 once, but then every downstream corpus consumer reads it from IA n times).
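For context, the "second-level precision" bit refers to the Wayback Machine's timestamped URL scheme, and captures can be requested through its public Save Page Now endpoint. A minimal sketch in Python; the bot identity is made up, and unauthenticated Save Page Now use is rate-limited:

```python
import urllib.request

# Wayback Machine snapshots are addressed by a YYYYMMDDhhmmss path segment,
# i.e. second-level precision in the URL slug.
def wayback_snapshot_url(target_url: str, timestamp: str) -> str:
    """Build a URL for a specific capture second, e.g. timestamp='20250301120000'."""
    return f"https://web.archive.org/web/{timestamp}/{target_url}"

# Save Page Now: fetching this URL asks IA to take a fresh capture.
def request_snapshot(target_url: str) -> int:
    req = urllib.request.Request(
        f"https://web.archive.org/save/{target_url}",
        headers={"User-Agent": "OpenCorpusBot/0.1 (contact: bot@example.org)"},  # made-up identity
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

print(wayback_snapshot_url("https://news.ycombinator.com/item?id=43318384",
                           "20250301120000"))
```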
> so who gets to decide? (One user of the index may be interested in a different set of pages than another.)
Surely that's a solved problem, in that a common corpus would ingest updates to all frontier pages it knows about and exponentially back off as it finds fewer and fewer updates. I don't think the CommonCrawl.org cited by the sibling comment is selective about updates, and IA unquestionably is not: they accept snapshot requests from anyone for what I presume is any URL.
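A minimal sketch of that back-off idea, purely illustrative (the intervals and the hash-based change check are my assumptions, not anything Common Crawl or IA actually do): double the recrawl interval when a page hasn't changed, shrink it when it has.

```python
from dataclasses import dataclass
import hashlib

MIN_INTERVAL = 3600          # recrawl hot pages hourly (assumed floor)
MAX_INTERVAL = 30 * 86400    # never wait more than ~a month (assumed ceiling)

@dataclass
class PageState:
    url: str
    content_hash: str = ""
    interval: int = MIN_INTERVAL   # seconds until the next recrawl

def update_schedule(state: PageState, new_body: bytes) -> PageState:
    """Exponential back-off: unchanged pages get recrawled less and less often."""
    new_hash = hashlib.sha256(new_body).hexdigest()
    if new_hash == state.content_hash:
        state.interval = min(state.interval * 2, MAX_INTERVAL)   # no change: back off
    else:
        state.interval = max(state.interval // 2, MIN_INTERVAL)  # changed: speed up
        state.content_hash = new_hash
    return state
```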
I believe it's not just an issue of detecting actual changes in content; there are also pages that change very quickly, and there can be a lot of them, e.g. social media posts/comments, Reddit, news pages. "Hit them all very often" would be one answer.
Measuring how much a document has changed is, I think, fairly well solved and there are working solutions (though there's a semi-related issue of determining who the original author is when multiple copies exist).
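For illustration, one common approach to "how much did this change" is shingle overlap; a minimal sketch, with a made-up threshold:

```python
def shingles(text: str, k: int = 5) -> set:
    """k-word shingles: overlapping word windows used for near-duplicate detection."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(old_text: str, new_text: str) -> float:
    """Jaccard similarity of the two shingle sets (1.0 = identical, 0.0 = disjoint)."""
    a, b = shingles(old_text), shingles(new_text)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# e.g. treat anything above ~0.9 as "unchanged" for recrawl-scheduling purposes
print(similarity("the quick brown fox jumps over the lazy dog",
                 "the quick brown fox leaps over the lazy dog"))
```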
A similar issue is near-identical content and canonical URLs, e.g. who gets to decide whether a page is indexed or not due to its similarity with another document, which URLs are indexed and crawled, etc. People may have different interpretations of this.
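On the canonical URL side, a shared index would have to standardise on some normalisation rules. Here's a minimal sketch of one possible set; the tracking-parameter list and the rules are assumptions, and different operators would make different choices, which is exactly the disagreement being described:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters assumed to be pure tracking noise for this sketch.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def canonicalise(url: str) -> str:
    """One possible canonical form: lowercase scheme/host, drop the fragment and
    tracking parameters, sort the remaining query parameters, strip a trailing slash."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    params = sorted((k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS)
    path = path.rstrip("/") or "/"
    return urlunsplit((scheme.lower(), netloc.lower(), path, urlencode(params), ""))

# Two URLs that different crawlers might otherwise treat as distinct pages:
assert canonicalise("https://Example.com/post/?utm_source=x&b=2&a=1") == \
       canonicalise("https://example.com/post?a=1&b=2")
```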
There are other issues for crawling too, e.g. Facebook and other major sites take a whitelist approach, presuming any such crawler would respect robots.txt and use a readily identifiable user agent.
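For the robots.txt part, Python's standard library already covers the mechanics; a minimal sketch, with a made-up bot identity:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "OpenCorpusBot/0.1 (+https://example.org/bot)"  # hypothetical identity

def allowed_to_fetch(url: str) -> bool:
    """Check the site's robots.txt rules for our user agent before crawling."""
    parts = urlsplit(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses robots.txt
    return rp.can_fetch(USER_AGENT, url)

# A whitelist-style robots.txt disallows everything for any bot it doesn't
# explicitly name, so this returns False for an unknown crawler.
print(allowed_to_fetch("https://www.facebook.com/somepage"))
```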