Wouldn't this kind of ruling effectively put a halt to ChatGPT and other AIs training on publicly accessible data? What's the difference between Copilot creating output based on code on GitHub, and ChatGPT giving answers based on a NYT article (without attribution)?
IMO it should be treated like a human. Is your output 99% similar to this <code/article>? Copyright infringement; you should have mixed your own thoughts and reasoning into your output. Humans can plagiarize just as easily as ChatGPT/Copilot can generate verbatim text from their training sets.
At what point can one say that the source is unique enough to qualify for protection? Otherwise, I can't use `print("Hello world")` because I didn't mix my own thoughts and reasoning into the output.
It's no different from how current copyright works for us humans. Something is only copyright protected if it's a "sufficiently original" work and "possesses at least a minimal degree of creativity" https://www.copyright.gov/comp3/chap300/ch300-copyrightable-...
So (just thinking out loud), if Copilot suggests something seen in only one codebase, the code owners have a decent copyright case. But if Copilot suggests something that's common across many codebases, there's really no case to be made.
That's at least one of the rules that GH is trying to enforce on Copilot, but legally I imagine that even repeating code that appears multiple times on the internet could be considered copyright infringement (i.e., if multiple people copied that code from one person).
The problem here ends up being that code, especially in popular languages, will always look similar when you're doing something like finding the best implementation of an algorithm. So if you invoke Copilot for a common problem, chances are it can pull the exact code it needs from its dataset, but it also could have generated that same code snippet had the solution not existed in its training data. And when you start solving a problem and then ask it to continue writing more code, it just assumes you're solving the exact same problem that the original source code was solving.
This could probably be remedied if Copilot emitted a "this is N% similar to <x> source code from the internet" notice, so you'd know just how unique Copilot is being. Legally, copyright is just a mess: it was ready for neither the scale of the internet nor the advances in ML, now that there are machines with a 50% chance of infringing on someone's copyright and a 50% chance of creating something new.
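As a rough sketch of what that kind of similarity notice could look like, here's a minimal Python example using the standard library's difflib. The `flag_near_copies` helper, the 0.9 threshold, and the tiny corpus are made-up assumptions for illustration, not anything Copilot actually does (a real system would need fuzzier, token-level matching at much larger scale).

```python
import difflib

def similarity(a: str, b: str) -> float:
    # Ratio of matching characters between two snippets, from 0.0 to 1.0.
    return difflib.SequenceMatcher(None, a, b).ratio()

def flag_near_copies(generated: str, corpus: dict[str, str], threshold: float = 0.9) -> None:
    # `corpus` maps a source identifier (e.g. a repo URL) to its code.
    # Both the corpus and the 0.9 threshold are illustrative assumptions.
    for source, original in corpus.items():
        ratio = similarity(generated, original)
        if ratio >= threshold:
            print(f"warning: output is {ratio:.0%} similar to {source}")

# Hypothetical usage:
corpus = {"github.com/example/repo": "def add(a, b):\n    return a + b\n"}
flag_near_copies("def add(a, b):\n    return a + b\n", corpus)
# warning: output is 100% similar to github.com/example/repo
```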
Good question. I think that's exactly what we need to decide as a society: both whether these tools are violating existing copyright laws, and whether the laws should be changed to specifically handle this new situation. I don't know where I land, myself. My gut instinct sides with the creators: that these AI tools are illegally infringing on the creators' work. But I think there are reasonable arguments on both sides.
There is a difference between products and research in this case. Research is frequently allowed by law as an exception, whereas building a product requires some more extensive agreement.
Unfortunately, this is frequently abused: researchers build a model under the exemptions, and then others use that model commercially, even though they wouldn't be allowed to build it directly themselves.
Anyway, scientific progress would continue, but products would halt until product developers reached some kind of agreement with content creators (e.g., maybe people would start adopting a new kind of open-ish license).