Hacker News

Is it safe to assume that GPT3 has been trained on GPLed code? And in that case, should the GPT3 source code be freely available?



This is flawed for a few reasons:

- Training isn’t the same thing as actually using the code

- The training process doesn’t change the code of GPT-3, it changes the parameters, which are input data to the code and not a part of it

- They aren’t distributing any binaries for GPT3, though for AGPL that wouldn’t matter
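The second point above can be illustrated with a minimal sketch (the class here is hypothetical, not anything like OpenAI's actual code): training mutates a parameter array, which is data the program operates on, while the program text itself never changes.

```python
import numpy as np

# A toy "model": the code below is fixed; only `weights` (data) changes.
class TinyModel:
    def __init__(self, n_inputs):
        # Parameters are just an array of numbers, loaded and saved like any data file.
        self.weights = np.zeros(n_inputs)

    def predict(self, x):
        return float(self.weights @ x)

    def train_step(self, x, target, lr=0.1):
        # Gradient step for squared error: mutates self.weights, not this source file.
        error = self.predict(x) - target
        self.weights -= lr * error * x

model = TinyModel(2)
for _ in range(100):
    model.train_step(np.array([1.0, 2.0]), target=5.0)

# After training, the parameters have changed,
# but the program is byte-for-byte the same code it was before training.
print(model.weights)
```

This is why "the model was trained on GPLed code" and "the training program is derived from GPLed code" are separate questions: the training code never incorporates the inputs, only the weights reflect them.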


We'll find out when this issue eventually ends up in court. Copilot is the most notable example of this mess.

If a judge rules that AI will regurgitate existing code or that the network stores this code, these models will very quickly infringe the GPL and all manner of licenses at the same time.

If this is the case, you'll also need to figure out what GPL compliance looks like. The code that trains the network likely doesn't contain any code that must be shared; only the resulting model does. Does that mean that providing the entire data set complies with the GPL, leaving it up to the user to spend millions on training their own network? Or does the model file itself need to be distributed as-is?

The AI people argue that AI learns concepts from code and does not store or replicate the input code directly, though I very much doubt that, given that AI will spit out code it has been trained on verbatim. If that's the case then the code may have been processed by the algorithm, but the license impact would be similar to processing it in any other way: just because you compress a bunch of code doesn't mean you need to open source your compression tool.
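The compression analogy can be made concrete (a sketch using Python's standard `zlib`; the snippet being compressed is a made-up stand-in): the compressed blob reproduces the input verbatim on decompression, so distributing the blob effectively distributes the code and carries its license obligations, yet nobody argues the compressor itself becomes a derivative work of what it compressed.

```python
import zlib

# Hypothetical stand-in for a GPL-licensed source file.
gpl_code = b"int main(void) { return 0; }  /* imagine this were GPL-licensed */"

blob = zlib.compress(gpl_code)       # "processing" the code
restored = zlib.decompress(blob)     # the code comes back verbatim

# Shipping `blob` is effectively shipping the code (the license applies to it);
# zlib itself owes nothing to the code's authors.
print(restored == gpl_code)
```

The open question for models is which side of this line they fall on: a tool that merely processed the code, or a container that stores and redistributes it.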

We need clear legislation for this because the court cases are going to be a clusterfuck.


I've also read GPL code, and that doesn't make anything I write GPL. What matters is whether the code was substantially copied or not.

So I think I would apply the same rules to AI. In general, not all code produced infringes on the copyright of all authors of the training data. However, there have been some clear cases of copying (the GPL license text and a matrix multiplication routine, for example) that do appear to be copyright violations.


Imagine a model. It reads source code and learns by heart how to execute it. Now you use that model to execute GPLed code, plus some changes you instruct your model to also take into account. You don't need to adhere to the GPL license because your machine is merely "learning", not executing – tadaa!

The truth is that the insight that there is no difference between data and program code applies here in particular. If the model can act on the code, it can be said to execute it.
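The code-as-data point is easy to demonstrate (a minimal sketch in Python): to an interpreter, source code is just an input string it could count, hash, or "learn from", yet acting on that string is indistinguishable from executing it.

```python
# To this program, `source` is plain data, no different from any other string.
source = "result = sum(range(10))"

namespace = {}
exec(source, namespace)  # ...but acting on that data *is* executing it

print(namespace["result"])  # 45
```

Whether a trained model that can reproduce a program's behavior is "merely storing data" or "running the program" is exactly the distinction the comment above is poking at.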


Could you not say the same about a human? Also, what you described is probably fine under the GPL, because the input is source code which the users could view and edit (and if it's not running locally the GPL doesn't apply, only the AGPL).


The healthy approach would be that AI models can 'study' code as much as they want and aggregate what they learn into a model. The 'right to read' is a thing. But the end user of the model should take care that the output they get is not infringing on anyone's copyright.



