I believe it's not just a matter of detecting actual changes in content; there are also pages that change very quickly, and there can be a lot of them, e.g. social media posts/comments, Reddit, news pages. 'Hit them all very often' would be one answer.
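To make that concrete, one common way to avoid "hit them all very often" is to adapt each page's revisit interval to how often it actually changes. A minimal sketch, assuming a simple halve-on-change / double-on-no-change policy (the intervals and bounds are arbitrary, not anything standard):

```python
import heapq
import time

class RevisitScheduler:
    """Toy priority queue that revisits URLs more often the more they change."""

    def __init__(self):
        self._heap = []       # (next_visit_time, url)
        self._interval = {}   # url -> current revisit interval in seconds

    def add(self, url, initial_interval=3600):
        self._interval[url] = initial_interval
        heapq.heappush(self._heap, (time.time() + initial_interval, url))

    def pop_due(self, now=None):
        """Return the URLs whose revisit time has passed."""
        now = time.time() if now is None else now
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, url = heapq.heappop(self._heap)
            due.append(url)
        return due

    def report(self, url, changed):
        """After a fetch, shrink the interval if the page changed, grow it if not."""
        interval = self._interval[url]
        interval = max(60, interval / 2) if changed else min(7 * 86400, interval * 2)
        self._interval[url] = interval
        heapq.heappush(self._heap, (time.time() + interval, url))
```

Fast-churning pages (comment threads, front pages) drift toward the short end of the interval range and static pages toward the long end, which is the usual compromise between freshness and crawl budget.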
Measuring how much a document has changed is, I think, fairly well solved, and there are working solutions (though there's a semi-related issue of working out who the original author is when multiple copies exist).
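By "fairly well solved" I mean techniques along the lines of shingling or SimHash. A minimal illustration using word shingles and Jaccard similarity (the shingle width and any threshold you'd apply are arbitrary choices):

```python
def shingles(text, w=4):
    """Set of w-word shingles from whitespace-tokenised text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Similarity near 1.0 suggests the page hasn't meaningfully changed;
# near 0.0 suggests a substantial rewrite.
old = shingles("the quick brown fox jumps over the lazy dog")
new = shingles("the quick brown fox leaps over the lazy dog")
print(jaccard(old, new))
```

The same machinery is what makes the authorship question awkward: two near-identical copies score as duplicates, but the similarity number says nothing about which one came first.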
A similar issue is near-identical content and canonical URLs: who gets to decide whether a page is indexed or not because of its similarity to another document, which URLs are indexed and crawled, and so on. People may have different interpretations of this.
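The mechanical side of URL canonicalisation is the easy part; the policy question of who decides is the hard part. A rough sketch of the mechanical part, with a made-up example list of tracking parameters to strip:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Which query parameters count as "tracking" is a policy decision;
# this list is only an example, not an authoritative one.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalise_url(url):
    """Rough canonicalisation: lowercase host, drop fragment and tracking params."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(query),
        "",  # drop the fragment
    ))

print(normalise_url("https://Example.com/article?id=7&utm_source=feed#comments"))
# -> https://example.com/article?id=7
```

Even with rules like this, two crawlers can reasonably disagree about whether a print view, an AMP page, or a mirrored copy deserves its own index entry.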
There are other issues for crawling too, e.g. Facebook and other major sites that take a whitelist approach, presuming any such crawler would respect robots.txt and use a readily identifiable user agent.
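The "respect robots.txt and identify yourself" part is the easy bit; Python's standard library covers it. A minimal sketch (the bot name and info URL are made up):

```python
import urllib.parse
import urllib.request
import urllib.robotparser

# Hypothetical crawler identity; the point is that the User-Agent is
# identifiable and the crawl honours robots.txt.
USER_AGENT = "ExampleIndexBot/0.1 (+https://example.org/bot-info)"

def polite_fetch(url):
    """Fetch a URL only if the site's robots.txt allows our user agent."""
    parts = urllib.parse.urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    if not rp.can_fetch(USER_AGENT, url):
        return None  # disallowed; skip the page

    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Of course, on a whitelist-only site this politeness just gets you a blanket disallow or a login wall, which is exactly the problem for any new crawler.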