So... I can see that this ML model is generating some code that is exactly the same as in the original dataset, which is definitely a problem. A defective model, sure.
Besides that, I cannot understand why the overall idea of using open-source projects to train an ML model that generates code would ever be a problem. We human beings learn the same way the model does: we read other people's code, books, articles, design patterns... and it becomes part of us. The same goes for private code: you join a company, you read their codebase and methodology, and it becomes something of yours. Copyright generally does not allow you to "copy" the original, but you can still synthesize your own code by cutting, combining, and creating based on whatever you have learnt.
The way an ML model works differs from the human brain, for sure, but I cannot see why that would be a problem, or why an organic being should count as superior, such that what they do is creation while what an ML model does is scraping your code. What is the difference here?
And recently we also saw GPT generating articles, waifulabs generating... waifus... To be honest I cannot perceive the difference, since all of them are "learning" (in a mechanical way) from human-created knowledge.
The difference is that it's a judgement call when to include attribution, whom to attribute and how much, and overall whether something is too close to the original to count as a copyright or other license violation. Intelligent humans sometimes, or even often, have a hard time making this judgement call. An artificial intelligence would too, and a somewhat simple ML model (no offense) certainly does.
I'm really waiting for this to blow up from the open-source license angle. Freely combining code under different licenses is a hellish undertaking on its own. But even just re-using some GPL code, say, while staying under the same license but without proper attribution, is Forbidden with a capital F.
> Besides that, I cannot understand why the overall idea of using open-source projects to train an ML model that generates code would ever be a problem. We human beings learn the same way the model does: we read other people's code, books, articles, design patterns... and it becomes part of us.
It's an interesting question.
1) When a human being reads code or a CS textbook, we think of them as extracting general principles from the code, so that they don't have to repeat that particular code again. In contrast, what GPT-3 and Copilot seem to do is extract sequences of little snippets, something that apparently requires them to regurgitate the text they've been trained on (see the toy sketch after these two points). That seems rather permanently dependent on the training corpus.
2) Human beings have a natural urge, a natural ethos, to help people learn. It's understandable. The thing is, when suddenly you're not talking about people but machines, the reason for this urge easily vanishes. Even if github were extracting knowledge from the code, I wouldn't have a reason to help them do so, since that knowledge would be entirely their private property. They expect to charge people whatever they judge the going rate to be - why should anyone help them without similar compensation? That this is being done by "OpenAI", a company which went from open non-profit to closed for-profit in a matter of a few years, should accent this point. We're nowhere near a system that could digest all the knowledge of humankind. But if we got there, one might argue the result should belong to humankind rather than to one genius entrepreneur. And having the result belong to one genius entrepreneur has some clear downsides.
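To make point 1) a bit more concrete, here is a deliberately crude Python sketch: a word-level trigram model trained on a tiny made-up corpus (the corpus text, function names, and the `generate` helper are invented for illustration, and real models are of course far more sophisticated). Because all it stores is which snippet follows which, sampling it mostly replays verbatim runs of the training text rather than producing genuinely new code:

```python
# Toy illustration only (a word-level trigram model, not how GPT/Copilot
# actually work internally): the corpus and names below are made up.
import random
from collections import defaultdict

corpus = """
def parse_config(path):
    with open(path) as f:
        return json.load(f)

def save_config(cfg, path):
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
""".split()

# Record, for every pair of consecutive tokens, which tokens followed it.
successors = defaultdict(list)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    successors[(a, b)].append(c)

def generate(seed, length=20):
    """Sample a continuation of `seed` by repeatedly picking a token that
    followed the last two tokens somewhere in the training corpus."""
    out = list(seed)
    for _ in range(length):
        options = successors.get((out[-2], out[-1]))
        if not options:
            break
        out.append(random.choice(options))
    return " ".join(out)

# The "generated" code is almost entirely a verbatim replay of the training
# text: the model has no notion of "don't copy", only of "what comes next".
print(generate(("def", "parse_config(path):")))
```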
> I cannot see why that would be a problem, or why an organic being should count as superior, such that what they do is creation while what an ML model does is scraping your code. What is the difference here?
TL;DR: The AI doesn't know that it can't just copy-paste (from perfect memory), and as such it has learned to sometimes just copy-paste things.
The GPT model doesn't "learn to understand the code and reproduce code based on that knowledge".
What it learns involves a bit of understanding, but it is more like recombining and tweaking verbatim text snippets it has seen before, without really understanding them or the concept of "not just copy/pasting code" (while still knowing which patterns "fit together").
This means that the model will, "if it fits", potentially copy/paste code "from memory" instead of writing new code that just happens to be the same or similar. It's like a person with perfect memory sometimes copy-pasting code they have seen before while pretending they wrote it based on their "knowledge". Except worse, as it will also copy semantically irrelevant comments or sensitive information (if that isn't filtered out before training).
I.e. there is a difference between "having a different kind of understanding" and "largely lacking understanding but compensating for it by copying remembered code snippets from memory".
Theoretically it might be possible to create a GPT model which is forced to (somewhat) understand programming without memorizing text snippets, but practically I think we are still far away from that, as it's really hard to tell whether a model has memorized copyright-protected code.
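For what it's worth, the crude checks you can do today look roughly like the sketch below: scan a model completion for long token runs that appear verbatim in a corpus you suspect was part of the training data. The paths, the 12-token threshold, and the `longest_verbatim_run` helper are all assumptions for illustration, and the absence of a hit proves nothing about what the model actually memorized:

```python
# Rough sketch of a verbatim-overlap check, assuming you have the model's
# output and a local copy of code it might have been trained on.
from pathlib import Path

def longest_verbatim_run(generated: str, corpus_text: str,
                         min_tokens: int = 12) -> str:
    """Longest run of >= min_tokens consecutive whitespace-delimited tokens
    from `generated` that also appears verbatim in `corpus_text`
    (both sides compared with whitespace normalised)."""
    tokens = generated.split()
    corpus_norm = " ".join(corpus_text.split())
    best = ""
    for i in range(len(tokens) - min_tokens + 1):
        j = i + min_tokens
        if " ".join(tokens[i:j]) not in corpus_norm:
            continue
        # Extend the match as long as it stays verbatim.
        while j < len(tokens) and " ".join(tokens[i:j + 1]) in corpus_norm:
            j += 1
        chunk = " ".join(tokens[i:j])
        if len(chunk) > len(best):
            best = chunk
    return best

# Hypothetical usage: compare one completion against a checkout of the
# repositories suspected to be in the training set.
corpus = "\n".join(p.read_text(errors="ignore")
                   for p in Path("training_repos").rglob("*.py"))
completion = Path("model_output.py").read_text()
hit = longest_verbatim_run(completion, corpus)
if hit:
    print("Possible verbatim copy:", hit[:80], "...")
```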
> And recently we also saw GPT generating articles, waifulabs generating... waifus... To be honest I cannot perceive the difference, since all of them are "learning" (in a mechanical way) from human-created knowledge.