I think there are a few problems with that kind of setup. Quick thoughts:
- Content creators have less discretion over whom to allow or block from crawling when there's a middleman index (probably doesn't matter so much now, given the flagrant use of content for AI)
- Content recency. The data sizes can get quite huge, and certain pages require updates more often than others, so who gets to decide? (One user of the index may be interested in a different set of pages than another.)
- Centralised content on the likes of Reddit, which is already aggressively blocking most bots from crawling its content. You'd have to crawl many pages per day (and would quite likely end up getting blocked), as generally only a handful of bots get favourable treatment and can crawl sites more aggressively.
> - Content recency. The data sizes can get quite huge, and certain pages require updates more often than others
I have always imagined that an open crawl corpus aligns closely with the goals of the Internet Archive, to which one can already, strictly speaking, submit updates with second-level precision based on their URL slugs. The bad news is that any such common corpus would actually worsen their bandwidth bill, since I strongly suspect the corpus would be read from much more than it ingests (e.g. snapshot https://news.ycombinator.com/item?id=43318384 once, but then every downstream corpus consumer reads it from IA n times).
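For context, the "second-level precision" bit refers to the Wayback Machine's timestamped URL scheme, and captures can be requested through its public Save Page Now endpoint. A minimal sketch in Python; the bot identity is made up, and unauthenticated Save Page Now use is rate-limited:

```python
import urllib.request

# Wayback Machine snapshots are addressed by a YYYYMMDDhhmmss path segment,
# i.e. second-level precision in the URL slug.
def wayback_snapshot_url(target_url: str, timestamp: str) -> str:
    """Build a URL for a specific capture second, e.g. timestamp='20250301120000'."""
    return f"https://web.archive.org/web/{timestamp}/{target_url}"

# Save Page Now: fetching this URL asks IA to take a fresh capture.
def request_snapshot(target_url: str) -> int:
    req = urllib.request.Request(
        f"https://web.archive.org/save/{target_url}",
        headers={"User-Agent": "OpenCorpusBot/0.1 (contact: bot@example.org)"},  # made-up identity
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

print(wayback_snapshot_url("https://news.ycombinator.com/item?id=43318384",
                           "20250301120000"))
```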
> so who gets to decide? (One user of the index may be interested in a different set of pages than another.)
Surely that's a solved problem, in that a common corpus would ingest updates to all frontier pages it knows about and exponentially back off as it finds fewer and fewer updates. I don't think the CommonCrawl.org cited by the sibling comment is selective about updates, and IA unquestionably is not: they accept snapshot requests from anyone for what I presume is any URL.
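A minimal sketch of that back-off idea, purely illustrative (the intervals and the hash-based change check are my assumptions, not anything Common Crawl or IA actually do): double the recrawl interval when a page hasn't changed, shrink it when it has.

```python
from dataclasses import dataclass
import hashlib

MIN_INTERVAL = 3600          # recrawl hot pages hourly (assumed floor)
MAX_INTERVAL = 30 * 86400    # never wait more than ~a month (assumed ceiling)

@dataclass
class PageState:
    url: str
    content_hash: str = ""
    interval: int = MIN_INTERVAL   # seconds until the next recrawl

def update_schedule(state: PageState, new_body: bytes) -> PageState:
    """Exponential back-off: unchanged pages get recrawled less and less often."""
    new_hash = hashlib.sha256(new_body).hexdigest()
    if new_hash == state.content_hash:
        state.interval = min(state.interval * 2, MAX_INTERVAL)   # no change: back off
    else:
        state.interval = max(state.interval // 2, MIN_INTERVAL)  # changed: speed up
        state.content_hash = new_hash
    return state
```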
I believe it's not just an issue of detecting actual changes in content; there are also pages that change very quickly, and there can be a lot of them, e.g. social media posts/comments, Reddit, news pages. "Hit them all very often" would be one answer.
Measuring how much a document has changed is, I think, fairly well solved and there are working solutions (though there's a semi-related issue of determining who the original author is when multiple copies exist).
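For illustration, one common approach to "how much did this change" is shingle overlap; a minimal sketch, with a made-up threshold:

```python
def shingles(text: str, k: int = 5) -> set:
    """k-word shingles: overlapping word windows used for near-duplicate detection."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(old_text: str, new_text: str) -> float:
    """Jaccard similarity of the two shingle sets (1.0 = identical, 0.0 = disjoint)."""
    a, b = shingles(old_text), shingles(new_text)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# e.g. treat anything above ~0.9 as "unchanged" for recrawl-scheduling purposes
print(similarity("the quick brown fox jumps over the lazy dog",
                 "the quick brown fox leaps over the lazy dog"))
```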
A similar issue is near-identical content and canonical URLs, e.g. who gets to decide whether a page is indexed or not due to its similarity with another document, which URLs are indexed and crawled, etc. People may have different interpretations of this.
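On the canonical URL side, a shared index would have to standardise on some normalisation rules. Here's a minimal sketch of one possible set; the tracking-parameter list and the rules are assumptions, and different operators would make different choices, which is exactly the disagreement being described:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters assumed to be pure tracking noise for this sketch.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def canonicalise(url: str) -> str:
    """One possible canonical form: lowercase scheme/host, drop the fragment and
    tracking parameters, sort the remaining query parameters, strip a trailing slash."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    params = sorted((k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS)
    path = path.rstrip("/") or "/"
    return urlunsplit((scheme.lower(), netloc.lower(), path, urlencode(params), ""))

# Two URLs that different crawlers might otherwise treat as distinct pages:
assert canonicalise("https://Example.com/post/?utm_source=x&b=2&a=1") == \
       canonicalise("https://example.com/post?a=1&b=2")
```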
There are other issues for crawling too, e.g. Facebook and other major sites take a whitelist approach, presuming any such crawler would respect robots.txt and use a readily identifiable user agent.
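For the robots.txt part, Python's standard library already covers the mechanics; a minimal sketch, with a made-up bot identity:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "OpenCorpusBot/0.1 (+https://example.org/bot)"  # hypothetical identity

def allowed_to_fetch(url: str) -> bool:
    """Check the site's robots.txt rules for our user agent before crawling."""
    parts = urlsplit(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses robots.txt
    return rp.can_fetch(USER_AGENT, url)

# A whitelist-style robots.txt disallows everything for any bot it doesn't
# explicitly name, so this returns False for an unknown crawler.
print(allowed_to_fetch("https://www.facebook.com/somepage"))
```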