Is it possible companies will plant honeypots using non-existent unique words to detect whether LLM crawlers are bypassing their do-not-crawl policies?
I think that depends on how advanced plagiarism-detection software has become and how unique your data is. Big-data LLMs blend massive datasets, so your data would have to be distinctive enough to still stand out in the mix. A more general practice would be to play the cat-and-mouse arms race of evolving protection against bots, though most sites fail at that game. Even if the big players were stopped by bot protection and legal agreements, nothing would stop them from buying your data from unscrupulous scrapers who claim they obtained it legitimately.
Say we put the string UEGVHBEWCOUB on my website, and the website is about making pancakes. Do you expect that when we ask the model for a pancake recipe it will return this string? It won't, because it also scanned 100k other pancake websites.
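For what it's worth, here is a minimal sketch of what such a canary test might look like. Everything here is illustrative: the token length, the hidden-span embedding, and the final check all assume you already have the model's output as a plain string from wherever you queried it.

```python
import secrets
import string

def make_canary(length: int = 16) -> str:
    """Generate a random token that is vanishingly unlikely to exist anywhere else."""
    alphabet = string.ascii_uppercase
    return "".join(secrets.choice(alphabet) for _ in range(length))

def embed_in_page(canary: str) -> str:
    """Hide the canary in page markup so human readers never see it."""
    return f'<span style="display:none">{canary}</span>'

def model_leaked_canary(model_output: str, canary: str) -> bool:
    """Crude membership test: did the model reproduce the exact token?"""
    return canary in model_output

# Example with a made-up model response:
canary = make_canary()
html_snippet = embed_in_page(canary)
fake_response = "Mix flour, eggs, and milk, then fry until golden."
print(model_leaked_canary(fake_response, canary))  # almost certainly False:
# a single token on a single page rarely gets memorized against 100k other sites
```

The catch is exactly the dilution problem above: a positive result proves the crawler visited, but a negative result proves nothing, since one rare string on one page almost never survives training.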