True, but make even one classification mistake (and people upload stuff they don't own, with the wrong license, all the time) and you have to retrain your whole system for each mistake, as people trickle in and want their (wrongly classified as CC or public domain) work removed from your dataset.
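To make that cost concrete, here is a minimal sketch (hypothetical; train_from_scratch, the dataset shape, and the id field are assumptions, not anyone's actual pipeline): without some machine-unlearning scheme, each round of takedown requests means filtering the corpus and paying the full training bill again.

    # Hypothetical sketch: honoring takedowns without machine unlearning
    # means filtering the corpus and retraining the model from scratch.
    def handle_takedowns(dataset, removed_ids, train_from_scratch):
        cleaned = [ex for ex in dataset if ex["id"] not in removed_ids]
        # The full training cost is paid again for every round of removals.
        return train_from_scratch(cleaned)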
If I use Photoshop to recreate a copyrighted work, Adobe doesn't have to redistribute Photoshop or change it in any way. The originals are not being shipped in the models, but the models are capable of recreating copyrighted work. These are tools, just like Photoshop.
Neural networks can and do encode data from their training sets in the model itself. That's the reason you can make some models reproduce things like the Getty watermark in the images they produce.
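That memorization is easy to demonstrate at toy scale. Below is a minimal sketch (hypothetical PyTorch, nothing to do with any production model): overfit a small autoencoder on a handful of images and the reconstruction error drops to roughly zero, meaning the training examples are recoverable from the weights alone.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)

    # Five toy 8x8 "images" (flattened); think of one as a repeated watermark.
    data = torch.rand(5, 64)

    # Small autoencoder with far more parameters than training pixels.
    model = nn.Sequential(
        nn.Linear(64, 256), nn.ReLU(),
        nn.Linear(256, 16), nn.ReLU(),   # bottleneck
        nn.Linear(16, 256), nn.ReLU(),
        nn.Linear(256, 64),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(5000):
        opt.zero_grad()
        loss = F.mse_loss(model(data), data)
        loss.backward()
        opt.step()

    # Reconstruction error ends up near zero: the training examples are
    # effectively stored in, and recoverable from, the weights.
    print(F.mse_loss(model(data), data).item())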
Again, not directly though, and that is all that matters - I can reproduce the Getty watermark in Photoshop, but that doesn't make Adobe liable. The fact that a tool is capable of copyright infringement does not shift the legal burden anywhere - it is totally beside the point. Technically, Photoshop's 'content-aware fill' could fill in missing regions with copyrighted content purely by chance, but the burden is still on me if I publish that content, not on Adobe. Legally speaking, these are tools just like any other algorithm or machine out there; their sophistication and particular method is not particularly relevant (again, legally speaking).
It would chill the whole ML space significantly for decades, IMO, as the only truly safe data would be synthetic or licensed. This can work for some applications (e.g. Microsoft used synthetic data for facial landmark recognition[1]), but it would kill DALL-E 2 et al.
[1] https://microsoft.github.io/DenseLandmarks/