> I can use something like GPT-4 to label data and then use that as a train set for my own LLM, right?
Yes, almost all improved Llama models are tuned exactly that way (trained on examples of questions and answers from, say, GPT-4). If OpenAI stole copyrighted works to train their models, it is morally fair game to do the same to them regardless of their TOS. It's not like they can prove it anyway.
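As a sketch of that distillation pipeline: you collect prompts, get the stronger "teacher" model to answer them, and store the pairs as instruction-tuning examples. The helper names and the `instruction`/`output` record fields below are illustrative assumptions (a common convention for Llama-style fine-tuning datasets, not a fixed standard), and the teacher answer is a placeholder where an API call to GPT-4 or similar would go.

```python
import json

def build_distillation_record(question, teacher_answer):
    """Pack one teacher-model Q&A pair into an instruction-tuning record.

    The "instruction"/"output" field names are a common convention for
    Llama-style fine-tuning datasets, not any fixed standard.
    """
    return {"instruction": question.strip(), "output": teacher_answer.strip()}

def write_jsonl(records, path):
    """Write records as JSON Lines, the format most fine-tuning stacks ingest."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# In practice each answer would come from an API call to the teacher model
# (e.g. GPT-4); here it is a hard-coded placeholder string.
pairs = [
    ("What is distillation?",
     "Training a smaller model on a larger model's outputs."),
]
records = [build_distillation_record(q, a) for q, a in pairs]
write_jsonl(records, "train.jsonl")
```

The resulting `train.jsonl` can then be fed to whatever fine-tuning framework you use; the legal and TOS questions discussed below apply regardless of the tooling.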
Plus there's the other point where they also say that everything generated by their models is public domain, so which one is it eh?
Use of copyrighted material in such a way that it’s aggregated into statistical properties is almost certainly fair use. Use of the model to produce reproductions of copyrighted material then consuming or distributing it is almost certainly violating the copyright. But it was the facsimile of the material that’s the violation, not the abstract use of it to generate an aggregate model.
You understand these things have a very very wide interpretation scope here that has yet to be tested in court. I wouldn’t make these statements so confidently as courts tend to reinterpret the law significantly for the balance of societal factors when serious technology changes occur.
This is true - afaik there have been no specific rulings on whether training models on copyrighted material is a violation. But to my mind it harkens back to stuff like Xerox, where the tool itself isn't the violating thing; it's the use of the tool. Likewise, derivative works are often largely reproductions with minor variations and are protected under fair use. A model that takes enormous amounts of data and distills it into a tiny vector representation, way below the information-theoretic levels for any meaningful fidelity, and mixes and overlaps data in a way that the original data isn't plausibly stored in the model… I'm definitely not going to wager my life that's fair use, but I would wager my company on it.
In the history of media law I've seen judges lean into whatever interpretation balances the ecosystem more than what is "literally the law". The law is meant to serve people, not the other way around. I hope judges will understand that the combination of contribution and theft can't just be "haha fuck humanity, love, OpenAI".
I want to train my own LLM on public but copyrighted data. I think this is serving humanity (and fucking OpenAI). I also think it is ethical because there's a big difference between "learning from" and "copying".
Your proposed reading of the law means only big tech will be able to afford the license fees to train on large amounts of data.
How do YOU plan on compensating those whose labor helped you? I bet you don't. It's the same thing; you're just imagining that being David rather than Goliath makes it OK for you.
It's not always necessary to compensate those whose labor helped you. I haven't compensated many of the open source projects I use, for example, even those who clearly want me to (with nagging pop-ups). If the use of copyrightable material to train a model is legal, and it does not legally require compensation, it might be difficult to argue that the use of such material should be compensated or else. It would depend IMO on whether there are norms in place for this kind of thing, and I don't necessarily see wide agreement.
Ok, what about the open source and research models? I wouldn't wager much on OpenAI keeping a lead indefinitely. Certainly not enough to establish case law on what's a pretty new technology (at least in its current use).
Yes, laws are about politics and dispute resolution more than reasoning or correctness. Focusing on the pure logic is a trap for the computationally inclined.
Nonetheless, I can observe and predict that non-consensual "open sourcing" of these models would likely end up being the best and safest way to do all of this stuff.
We were complacent while it happened because OpenAI wasn't a business, and it wasn't seen as unethical to use community work to contribute to community research. Now they're entrenched and have pulled the rug out from under the community, whilst also trying to shut the door on anyone else.
Just a really disappointing series of events, the money and profit were never the big issue.
Is it legal for one of their computer systems to access mine without my consent, even if publicly routable via the internet?
If I found an open port on a government computer, it would still be illegal for me to access it, wouldn't it? Is the difference that this is port 80/443 and happens to serve HTTP requests something that has been described in law or court?
I have to log in to OpenAI to generate conversations, but I can post those conversations on my own logged-out blog. It's the same thing OpenAI would probably say if they got sued because GPT spits out copyrighted content it found on a logged-out webpage. They can't reasonably expect people not to use them for training.
Show me the ToS where it says that, and I still won't care, because it would absolutely be legal under the same principle OpenAI is using for the training data as a transformative work.
FYI: here are the relevant parts from the TOS:
(iii) use output from the Services to develop models that compete with OpenAI; (iv) except as permitted through the API
Because by training it they created something new.
I don't mind, just making a point.
But I don't think they mind. I don't believe this type of model training can produce bleeding-edge models, which should guarantee that OpenAI has enough motivation to continue development and face healthy competition.