> and how that relates to how people work or “learn” is besides the point
It is important (for the training and generation stages) to distinguish between whether the model copies the original works or merely infers information from them - as copyright does not protect against the latter.
> The first part, collation of source material for training is emphatically not unexplored legal or moral territory.
Similar to as in Authors Guild v. Google, Inc. where Google internally made entire copies of millions of in-copyright books:
> > While Google makes an unauthorized digital copy of the entire book, it does not reveal that digital copy to the public. The copy is made to enable the search functions to reveal limited, important information about the books. With respect to the search function, Google satisfies the third factor test
Or in the ongoing Thomson Reuters v. Ross Intelligence case where the latter used the former's legal headnotes for training a language model:
> > verbatim intermediate copying has consistently been upheld as fair use if the copy is "not reveal[ed] to the public."
That it's an internal transient copy is not inherently a free pass, but it is something the courts take into consideration, as mentioned more explicitly in Sega v. Accolade:
> > Accolade, a commercial competitor of Sega, engaged in wholesale copying of Sega's copyrighted code as a preliminary step in the development of a competing product [yet] where the ultimate (as opposed to direct) use is as limited as it was here, the factor is of very little weight
And, given training a machine learning model is a considerably different purpose than what the images were originally intended for, it's likely to be considered transformative; as in Campbell v. Acuff-Rose Music:
> > The more transformative the new work, the less will be the significance of other factors
Listen, most website and book-authors want to be indexed by google. It brings potential audience their way, so most don’t make use of their _right_ to be de-listed. For these models, there is no plausible benefit to the original creators, and so one has to argue they have _no_ such right to be “de-listed” in order to get any training data currently under copyright.
> It brings potential audience their way, so most don’t make use of their _right_ to be de-listed.
The Authors Guild lawsuit against Google Books ended in a 2015 ruling that Google Books is fair use and as such they don't have a right to be de-listed. It's not the case that they have a right to be de-listed but choose not to make use of it.
The same would apply if collation of data for machine learning datasets is found to be fair use.
> one has to argue they have _no_ such right to be “de-listed” in order to get any training data currently under copyright.
Datasets I'm aware of already have respected machine-readable opt-outs, so if that were to be legally enforced (as it is by the EU's DSM Directive for commercial data mining) I don't think it'd be the end of the world.
There's a lot of power in a default; the set of "everything minus opted-out content" will be significantly bigger than "nothing plus opted-in content" even with the same opinions.
With the caveat that I was exactly wrong about the books de-listing, I feel you are making my point for me and retreating to a more pragmatic position about defaults.
The (quite entertaining) saga of Nightshade tells a story about what is going to be content creators “default position” going forward and everyone else will follow. You would be a fool not to, the AI companies are trying to end run you, using your own content, and make a profit without compensating you and leave you with no recourse.
> I feel you are making my point for me and retreating to a more pragmatic position about defaults
I'm unclear on what stance I've supposedly retreated from. My position is that an opt-out is not necessary under current US law, but that it wouldn't be the worst-case outcome if new regulation were introduced to mandate it.
> The (quite entertaining) saga of Nightshade tells a story about what is going to be content creators “default position” going forward and everyone else will follow
By "default" I refer not to the most common choice, but to the outcome that results from inaction. There's a bias towards this default even if the majority of rightsholders do opt to use Nightshade (which I think is unlikely).
It is important (for the training and generation stages) to distinguish between whether the model copies the original works or merely infers information from them - as copyright does not protect against the latter.
> The first part, collation of source material for training is emphatically not unexplored legal or moral territory.
Similar to as in Authors Guild v. Google, Inc. where Google internally made entire copies of millions of in-copyright books:
> > While Google makes an unauthorized digital copy of the entire book, it does not reveal that digital copy to the public. The copy is made to enable the search functions to reveal limited, important information about the books. With respect to the search function, Google satisfies the third factor test
Or in the ongoing Thomson Reuters v. Ross Intelligence case where the latter used the former's legal headnotes for training a language model:
> > verbatim intermediate copying has consistently been upheld as fair use if the copy is "not reveal[ed] to the public."
That it's an internal transient copy is not inherently a free pass, but it is something the courts take into consideration, as mentioned more explicitly in Sega v. Accolade:
> > Accolade, a commercial competitor of Sega, engaged in wholesale copying of Sega's copyrighted code as a preliminary step in the development of a competing product [yet] where the ultimate (as opposed to direct) use is as limited as it was here, the factor is of very little weight
And, given training a machine learning model is a considerably different purpose than what the images were originally intended for, it's likely to be considered transformative; as in Campbell v. Acuff-Rose Music:
> > The more transformative the new work, the less will be the significance of other factors