IANAL, but lately I've had this quixotic daydream of a combination accept-cookies / agree-to-TOS page that comes up, where the Terms of Service say that by proceeding, visitors agree to give the site owner a perpetual, irrevocable, and royalty-free license to use and re-license any future content they create using any generative AI that was trained on the website's contents.
Then you carefully log which LLM user-agents/IPs go past that agreement, along with some very distinctive, secretly crawlable pages whose contents can be recognizably reproduced back out of the model if needed.
Then whenever SomeShittyLLM posts "articles", everybody whose site has that TOS and was crawled gets to duplicate them without ads for free. :P
This idea is reminiscent of the opening scene of Accelerando by Charlie Stross:
"Are you saying you taught yourself the language just so you could talk to me?"
"Da, was easy: Spawn billion-node neural network, and download Teletubbies and Sesame Street at maximum speed. Pardon excuse entropy overlay of bad grammar: Am afraid of digital fingerprints steganographically masked into my-our tutorials."
…
"Uh, I'm not sure I got that. Let me get this straight, you claim to be some kind of AI, working for KGB dot RU, and you're afraid of a copyright infringement lawsuit over your translator semiotics?"
"Am have been badly burned by viral end-user license agreements. Have no desire to experiment with patent shell companies held by Chechen infoterrorists. You are human, you must not worry cereal company repossess your small intestine because digest unlicensed food with it, right?"
More inspired by the GPL, I think, although the sketch above doesn't force the writer to put things into the public domain.
I'm imagining a separate declaration of: "Content I can sublicense from ShittyNewsLLM--which is everything made by their model--is now public-domain through me until further notice", without any need to identify specific items or rehost it myself.
I suppose the counterstrike would be for them to try to transform their own work and argue what they finally released contains some human spark that wasn't covered by the ToS, in which case there may need to be some "and any derivative work" kinda clause.
I wonder if some organization (similar to the Open Software Foundation) could get some lawyers and web-designers together to craft legally-sound site-design rules and terms-of-service, which anyone could use to protect their own blogs or web-forums.
I love this. I did something like that with made-up Italian-sounding words a while ago (you used to be able to find my site if you searched for FANTACHIAVE).