Hacker News

> I can use something like GPT-4 to label data and then use that as a training set for my own LLM, right?

Yes, almost all of the improved LLaMA models are tuned exactly that way: trained on examples of questions and answers from, say, GPT-4. If OpenAI stole copyrighted works to train their models, it is morally fair game to do the same to them, regardless of their TOS. It's not like they can prove it anyway.
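For context, the distillation step being described usually amounts to collecting teacher Q&A pairs and writing them out in an instruction-tuning format. A minimal sketch in Python (the `teacher_pairs` data and the prompt template here are hypothetical placeholders, not any particular project's format):

```python
import json

# Hypothetical teacher outputs; in practice these would be collected
# from a stronger model such as GPT-4.
teacher_pairs = [
    {"question": "What is an LLM?", "answer": "A large language model."},
    {"question": "What does fine-tuning do?", "answer": "It adapts a pretrained model to a task."},
]

def to_instruction_record(pair):
    """Format one Q&A pair as a prompt/completion record for fine-tuning."""
    return {
        "prompt": f"### Instruction:\n{pair['question']}\n\n### Response:\n",
        "completion": pair["answer"],
    }

def write_jsonl(pairs, path):
    """Write records as JSON Lines, a common fine-tuning input format."""
    with open(path, "w") as f:
        for pair in pairs:
            f.write(json.dumps(to_instruction_record(pair)) + "\n")

write_jsonl(teacher_pairs, "train.jsonl")
```

The resulting JSONL file is what a fine-tuning pipeline for a smaller model would then consume; the exact prompt template varies by project.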

Plus there's the other point: they also say that everything generated by their models is public domain. So which one is it, eh?




Use of copyrighted material in such a way that it's aggregated into statistical properties is almost certainly fair use. Use of the model to produce reproductions of copyrighted material, then consuming or distributing them, is almost certainly a copyright violation. But it is the facsimile of the material that's the violation, not the abstract use of it to generate an aggregate model.


You understand these things have a very wide scope of interpretation that has yet to be tested in court. I wouldn't make these statements so confidently; courts tend to reinterpret the law significantly to balance societal factors when serious technological changes occur.


AI-generated work is not copyrightable. I guess the courts could later disagree, though.

https://www.copyright.gov/ai/



What if the AI generates a new Eric Clapton album, with a similar voice and guitar-playing style?


Your example doesn't have to be AI-generated. Human cover bands play Song X in the style of Y all the time.


This is true - afaik there have been no specific rulings on whether training models on copyrighted material is a violation. But to my mind it harks back to things like Xerox, where the tool itself isn't the violating thing; it's the use of the tool. Likewise, derivative works are often largely reproductions with minor variations and are protected under fair use. A model that takes enormous amounts of data and distills it into a tiny vector representation, way below the information-theoretic threshold for any meaningful fidelity, and mixes and overlaps data such that the original data isn't plausibly stored in the model... I'm definitely not going to wager my life that's fair use, but I would wager my company on it.


In the history of media law, I've seen judges lean into whatever interpretation balances the ecosystem more than what is "literally the law". The law is meant to serve people, not the other way around. I hope judges will understand that the contribution and the theft can't just be "haha, fuck humanity. Love, OpenAI".


I want to train my own LLM on public but copyrighted data. I think this is serving humanity (and fucking OpenAI). I also think it is ethical because there's a big difference between "learning from" and "copying".

Your proposed reading of the law means only big tech will be able to afford the license fees to train on large amounts of data.


How do YOU plan on compensating those whose labor helped you? I bet you don't. It's the same thing; you just imagine that being David rather than Goliath makes it OK.


It's not always necessary to compensate those whose labor helped you. I haven't compensated many of the open-source projects I use, for example, even those that clearly want me to (with nagging pop-ups). If the use of copyrightable material to train a model is legal, and it does not legally require compensation, it might be difficult to argue that such use should nevertheless be compensated. It would depend, IMO, on whether there are norms in place for this kind of thing, and I don't necessarily see wide agreement.


OK, what about the open-source and research models? I wouldn't wager much on OpenAI keeping a lead indefinitely - certainly not enough to establish case law on what's a pretty new technology (at least in its current use).


Yes, laws are about politics and dispute resolution more than reasoning or correctness. Focusing on the pure logic is a trap for the computationally inclined.


I'm a lawyer, so of course one should never break the law.

Nonetheless, I can observe and predict that non-consensual "open sourcing" of these models would likely end up being the best and safest way to do all of this.


This ... but we all know business is corrupt.

OpenAI's current attempts to spur on regulation are moat building.


We were complacent while it happened because OpenAI wasn't a business; it wasn't seen as unethical to use community work to contribute to community research. Now they're entrenched and have pulled the rug out from under the community, while also trying to shut the door on anyone else.

Just a really disappointing series of events; the money and profit were never the big issue.


It's against the terms of service to do the generation, but the generated text is not copyrighted. Those are different things.


GPT-4 is trained on a large number of web pages, some of which will have had their own terms of service.


Not only websites; full books from Scribd and other sources too.


Is it legal for one of their computer systems to access mine without my consent, even if publicly routable via the internet?

If I found an open port on a government computer, it would still be illegal for me to access it, wouldn't it? Is the difference that this is port 80/443 and happens to serve HTTP requests something that has been addressed in law or by a court?


See hiQ v. LinkedIn (which hiQ won), covering use of logged-out web pages.


I have to log in to OpenAI to generate conversations, but I can post those conversations on my own logged-out blog. It's the same thing OpenAI would probably say if they got sued because GPT spits out copyrighted content it found on a logged-out webpage. They can't reasonably expect people not to use them for training.


Show me the ToS where it says that, and I still won't care, because it would absolutely be legal under the same principle OpenAI is using for its training data: as a transformative work.

FYI: here are the relevant parts from the TOS:

(iii) use output from the Services to develop models that compete with OpenAI; (iv) except as permitted through the API

Sounds like you are allowed to, as long as it's from the API, as this "imaginary" restriction isn't in https://openai.com/policies/api-data-usage-policies or https://openai.com/policies/usage-policies.


Because by training it they created something new.

I don't mind; just making a point.

But I don't think they mind. I don't believe this type of model training can produce bleeding-edge models, which should guarantee that OpenAI has enough motivation to continue development, and keeps the competition healthy.



