Hacker News

> I can use something like GPT-4 to label data and then use that as a training set for my own LLM, right?

Yes, almost all of the improved LLaMA models are tuned exactly that way: trained on examples of questions and answers from, say, GPT-4. If OpenAI stole copyrighted works to train their models, it is morally fair game to do the same to them, regardless of their TOS. It's not like they can prove it anyway.
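For context, the distillation step being described usually amounts to collecting teacher Q&A pairs and writing them out in an instruction-tuning format. A minimal sketch in Python (the `teacher_pairs` data and the prompt template here are hypothetical placeholders, not any particular project's format):

```python
import json

# Hypothetical teacher outputs; in practice these would be collected
# from a stronger model such as GPT-4.
teacher_pairs = [
    {"question": "What is an LLM?", "answer": "A large language model."},
    {"question": "What does fine-tuning do?", "answer": "It adapts a pretrained model to a task."},
]

def to_instruction_record(pair):
    """Format one Q&A pair as a prompt/completion record for fine-tuning."""
    return {
        "prompt": f"### Instruction:\n{pair['question']}\n\n### Response:\n",
        "completion": pair["answer"],
    }

def write_jsonl(pairs, path):
    """Write records as JSON Lines, a common fine-tuning input format."""
    with open(path, "w") as f:
        for pair in pairs:
            f.write(json.dumps(to_instruction_record(pair)) + "\n")

write_jsonl(teacher_pairs, "train.jsonl")
```

The resulting JSONL file is what a fine-tuning pipeline for a smaller model would then consume; the exact prompt template varies by project.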

Plus there's the other point: they also say that everything generated by their models is public domain. So which one is it, eh?




Use of copyrighted material in such a way that it's aggregated into statistical properties is almost certainly fair use. Use of the model to produce reproductions of copyrighted material, then consuming or distributing them, is almost certainly a copyright violation. But it is the facsimile of the material that's the violation, not the abstract use of it to generate an aggregate model.


You understand these things have a very wide scope of interpretation that has yet to be tested in court. I wouldn't make these statements so confidently; courts tend to reinterpret the law significantly to balance societal factors when serious technological changes occur.


AI-generated work is not copyrightable. I guess the courts could later disagree, though.

https://www.copyright.gov/ai/



What if the AI generates a new Eric Clapton album, with a similar voice and guitar-playing style?


Your example doesn't have to be AI-generated. Human cover bands play Song X in the style of Y all the time.


This is true - afaik there have been no specific rulings on whether training models on copyrighted material is a violation. But to my mind it harks back to things like Xerox, where the tool itself isn't the violating thing; it's the use of the tool. Likewise, derivative works are often largely reproductions with minor variations and are protected under fair use. A model that takes enormous amounts of data and distills it into a tiny vector representation, way below the information-theoretic threshold for any meaningful fidelity, and mixes and overlaps data such that the original data isn't plausibly stored in the model... I'm definitely not going to wager my life that's fair use, but I would wager my company on it.


In the history of media law, I've seen judges lean into whatever interpretation balances the ecosystem more than what is "literally the law". The law is meant to serve people, not the other way around. I hope judges will understand that the contribution and the theft can't just be "haha, fuck humanity. Love, OpenAI".


I want to train my own LLM on public but copyrighted data. I think this is serving humanity (and fucking OpenAI). I also think it is ethical because there's a big difference between "learning from" and "copying".

Your proposed reading of the law means only big tech will be able to afford the license fees to train on large amounts of data.


How do YOU plan on compensating those whose labor helped you? I bet you don't. It's the same thing; you just imagine that being David rather than Goliath makes it OK.


It's not always necessary to compensate those whose labor helped you. I haven't compensated many of the open-source projects I use, for example, even those that clearly want me to (with nagging pop-ups). If the use of copyrightable material to train a model is legal, and it does not legally require compensation, it might be difficult to argue that such use should nevertheless be compensated. It would depend, IMO, on whether there are norms in place for this kind of thing, and I don't necessarily see wide agreement.


OK, what about the open-source and research models? I wouldn't wager much on OpenAI keeping a lead indefinitely - certainly not enough to establish case law on what's a pretty new technology (at least in its current use).


Yes, laws are about politics and dispute resolution more than reasoning or correctness. Focusing on the pure logic is a trap for the computationally inclined.


I'm a lawyer, so of course one should never break the law.

Nonetheless, I can observe and predict that non-consensual "open sourcing" of these models would likely end up being the best and safest way to do all of this.


This ... but we all know business is corrupt.

OpenAI's current attempts to spur on regulation are moat building.


We were complacent while it happened because OpenAI wasn't a business; it wasn't seen as unethical to use community work to contribute to community research. Now they're entrenched and have pulled the rug out from under the community, while also trying to shut the door on anyone else.

Just a really disappointing series of events; the money and profit were never the big issue.


It's against the terms of service to do the generation, but the generated text is not copyrighted. Those are different things.


GPT-4 is trained on a large number of web pages, some of which will have had their own terms of service.


Not only websites; full books from Scribd and other sources too.


Is it legal for one of their computer systems to access mine without my consent, even if publicly routable via the internet?

If I found an open port on a government computer, it would still be illegal for me to access it, wouldn't it? Is the difference that this is port 80/443 and happens to serve HTTP requests something that has been addressed in law or by a court?


See hiQ v. LinkedIn (which hiQ won), covering use of logged-out web pages.


I have to log in to OpenAI to generate conversations, but I can post those conversations on my own logged-out blog. It's the same thing OpenAI would probably say if they got sued because GPT spits out copyrighted content it found on a logged-out webpage. They can't reasonably expect people not to use them for training.


Show me the ToS where it says that, and I still won't care, because it would absolutely be legal under the same principle OpenAI is using for its training data: as a transformative work.

FYI: here are the relevant parts from the TOS:

(iii) use output from the Services to develop models that compete with OpenAI; (iv) except as permitted through the API

Sounds like you are allowed to, as long as it's from the API, as this "imaginary" restriction isn't in https://openai.com/policies/api-data-usage-policies or https://openai.com/policies/usage-policies.


Because by training it they created something new.

I don't mind; just making a point.

But I don't think they mind. I don't believe this type of model training can produce bleeding-edge models, which should guarantee that OpenAI has enough motivation to continue development, and keeps the competition healthy.



