
I genuinely don’t know the answer but I can see it being more complicated than “OpenAI purposefully acquired and trained on NYT articles”.

If Stack Overflow collects a bunch of questions and comments and exposes them as a big dataset licensed as Creative Commons, but that dataset actually contains quite a bit of copyrighted content, whose responsibility is it to vet that data for copyright violations? If I use something licensed as CC in good faith and it turns out the provider or seller of that content had no right to relicense it, am I culpable? Is this just a new lawsuit where I can seek damages for the lawsuit I just lost?




Discussed 20 years ago: https://ansuz.sooke.bc.ca/entry/23

> I think Colour is what the designers of Monolith are trying to challenge, although I'm afraid I think their understanding of the issues is superficial on both the legal and computer-science sides. The idea of Monolith is that it will mathematically combine two files with the exclusive-or operation. You take a file to which someone claims copyright, mix it up with a public file, and then the result, which is mixed-up garbage supposedly containing no information, is supposedly free of copyright claims even though someone else can later undo the mixing operation and produce a copy of the copyright-encumbered file you started with. Oh, happy day! The lawyers will just have to all go away now, because we've demonstrated the absurdity of intellectual property!

> The fallacy of Monolith is that it's playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient. When you have a copyrighted file at the start, that file clearly has the "covered by copyright" Colour, and you're not cleared for it, Citizen. When it's scrambled by Monolith, the claim is that the resulting file has no Colour - how could it have the copyright Colour? It's just random bits! Then when it's descrambled, it still can't have the copyright Colour because it came from public inputs. The problem is that there are two conflicting sets of rules there. Under the lawyer's rules, Colour is not a mathematical function of the bits that you can determine by examining the bits. It matters where the bits came from.
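
For context, the "mixing" Monolith describes really is just a byte-wise XOR, which is trivially reversible with the same public file. A minimal sketch in Python (the file names are hypothetical, and the real tool handles padding and formats differently):

    import itertools

    def xor_mix(data: bytes, basis: bytes) -> bytes:
        # XOR each byte of `data` against the public "basis" file,
        # cycling the basis if it is shorter. Applying the same
        # function again with the same basis undoes the mixing.
        return bytes(a ^ b for a, b in zip(data, itertools.cycle(basis)))

    copyrighted = open("nyt_article.txt", "rb").read()   # hypothetical copyrighted input
    basis = open("public_basis.bin", "rb").read()         # hypothetical public file

    mixed = xor_mix(copyrighted, basis)     # looks like "mixed-up garbage"
    restored = xor_mix(mixed, basis)        # same operation restores the original
    assert restored == copyrighted

The point of the Colour argument is that none of this changes the legal status: the mixed bytes may look information-free, but copyright follows where they came from, not what they look like.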


I don't think that's what I was driving at. Monolith users in this scenario would knowingly be taking copyrighted content with the clear intent to "de-copyright" it for distribution by mixing it into a new output via a reversible process. That probably still violates copyright, because the intent is to distribute a copyrighted work even if the process makes programmatic detection difficult in transit. It may operate within the wording of the law, but it is clearly done in bad faith with respect to the spirit of the law (it amounts to standard file encryption of a copyrighted work where you also publicly distribute the decryption key... and transmitting a copyrighted work over TLS today doesn't absolve anyone of liability). You seem to be suggesting that this is what OpenAI has done via the transformer model training process, i.e. that they acted in bad faith. That is certainly possible, but it won't be proven unless their court case reveals it. I'm asking about the opposite: what if they acted in good faith?

What I'm getting at is that it's plausible an LLM was trained purely on things that were available and licensed as Creative Commons, but that the data contains copyrighted content because someone who contributed to it lied about having the rights to provide that content under a Creative Commons license. That is, StackOverflow user UnicornWitness24 is the actual perpetrator of the copyright violation, having copied a NYT article into a reply to bypass a paywall for other users, and has now poisoned the dataset. And I'm asking: what is the civil liability for copyright infringement if the defendant was the one who was actually defrauded or deceived, and was acting in good faith and within the bounds of the law at the time?



