
> Assuming that the input phase is fine and the datasets used are legitimate, then most infringement lawsuits may end up taking place in the output phase. And it is here that I do not think that there will be substantive reproduction to warrant copyright infringement.

This might be true, but perhaps the most practical answer to this whole question is to focus on the inputs and to establish ways for artists to control whether their art is used as input. Don't let legally questionable content into training, and you automatically prevent legally questionable output. Why would we do anything else, really? It's a large assumption that the current datasets are legitimate under copyright law: pulling images off the internet just because they're accessible isn't valid, and humans aren't allowed to do that under existing law either. But it might be comforting to artists, and practical for both artists and AI researchers, if artists could specify a copyright term or a license that permits ML training provided their work is not recognizably reproduced. Platforms like Flickr, with their Creative Commons search tools, already allow a wide variety of public use that is entirely legal. What reason do we have not to establish something similar for machine learning?



