
Even MIT-licensed code requires you to preserve the copyright and permission notice.

If a human did what these language models are doing (output derivative works with the copyright and license stripped), it would be a license violation. When humans want to create a new implementation with clean IP, they have one team study the IP-encumbered code and write a spec, then a different team writes a new implementation according to the spec. LM developers could have similar practices, with separately-trained components that create an auditable intermediate representation and independently create new code based on that representation. The tech isn't up to that task and the LM authors think they're going to get away with laundering what would be plagiarism if a human did it.
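
In pseudocode-ish Python, the shape of that process might look something like this (all the names are hypothetical; the point is only the separation of stages and the reviewable spec in the middle):

    def clean_room_pipeline(encumbered_code, write_spec, implement, audit_log):
        # Stage 1: a component that only ever emits a behavioral spec, never code.
        # The spec is kept so humans and courts can review it if needed.
        spec = write_spec(encumbered_code)
        audit_log.append(spec)
        # Stage 2: a separately trained component that only ever sees the spec,
        # never the original code.
        return implement(spec)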




Why can't AI do the same: copyrighted code -> spec -> generated code?

... and then execute copyrighted code -> trace resulting values -> tests for new code.
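
As a toy sketch (the function names here are made up, and it assumes the original code exposes something callable):

    def record_cases(original_fn, sample_inputs):
        # Run the original implementation and record input/output pairs
        # as characterization tests for the new implementation.
        return [(args, original_fn(*args)) for args in sample_inputs]

    def passes(new_fn, cases):
        return all(new_fn(*args) == expected for args, expected in cases)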

AI could do clean-room reimplementation of any code to beef up the training set. It could also make sure the new code differs from the old code at the n-gram level, so it should not look the same even by chance.
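
A rough version of that n-gram check, assuming both versions are available as plain text (the 8-gram size and 5% threshold are arbitrary):

    def ngrams(tokens, n=8):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlap_ratio(original_src, generated_src, n=8):
        orig = ngrams(original_src.split(), n)
        gen = ngrams(generated_src.split(), n)
        return len(orig & gen) / len(gen) if gen else 0.0

    # e.g. reject the output if more than 5% of its 8-grams appear
    # verbatim in the original:
    # assert overlap_ratio(old_code, new_code) < 0.05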

Would that hold up in court? Is it copyright laundering?


Language models don't understand anything; they just manipulate tokens. It is a much harder task to write a spec (one that humans and courts can review, if needed, to determine it is not infringing) and then, with a separately trained tool, implement that spec. The tech just isn't ready, and it's not clear that language models will ever get there.

What language models could do easily is obfuscate better, so the license violation is harder to prove. That's behavior laundering: no amount of human obfuscation (e.g., synonym substitution, renaming variables, swapping out control structures) can turn a plagiarized work into one that isn't. If we (via regulators and courts) let the Altmans of the world pull their stunt, they're going to end up with a government-protected monopoly on plagiarism laundering.
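
To make the renaming point concrete, here is a small, purely illustrative Python check that throws away identifier names before comparing two snippets; the renamed copy is structurally identical to the original:

    import ast

    def normalized(src):
        # Replace every identifier with a placeholder so only the
        # structure of the code remains.
        tree = ast.parse(src)
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                node.id = "_"
            elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                node.name = "_"
            elif isinstance(node, ast.arg):
                node.arg = "_"
        return ast.dump(tree)

    a = "def total(xs):\n    return sum(xs) / len(xs)"
    b = "def mean(values):\n    return sum(values) / len(values)"
    print(normalized(a) == normalized(b))  # True: the renaming changed nothing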


Isn’t the language model itself the spec?

Potentially for all of the inputs at once.


> When humans want to create a new implementation with clean IP, they have one team study the IP-encumbered code and write a spec, then a different team writes a new implementation according to the spec.

Maybe at a FAANG or some other MegaCorp, but most companies barely have a single dev team at all, and larger ones often barely have one per project.


There’s a clear separation between the training process, which looks at code and outputs nothing but weights, and the generation process, which takes in weights and prompts and produces code.

The weights are an intermediate representation that contains nothing resembling the original code.


But the original content is frequently recoverable.

You can't just take copyrighted code, base64-encode it, send it to someone, have them decode it, and claim there was no copyright violation.
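
The base64 analogy, made concrete in a few lines of Python:

    import base64

    original = '/* Copyright (c) Some Author */ int add(int a, int b) { return a + b; }'
    encoded = base64.b64encode(original.encode())   # looks nothing like the original
    decoded = base64.b64decode(encoded).decode()    # ...and yet here it is again, notice and all
    assert decoded == original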

From my (admittedly vague) understanding, copyright law cares about the lineage of data, and I don't see how any reasonable interpretation could conclude that the lineage doesn't pass through the model.

IANAL


> But the original content is frequently recoverable.

What if we train the model on paraphrases of the copyrighted code? The model can't reproduce exactly what it has not seen.

Also consider the size ratio: 1 TB of code and text ends up as 1 GB of model weights. There is no space to "memorize" the training set; it can only learn basic principles and how to combine them to generate code on demand.
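
Back-of-the-envelope arithmetic with the figures above:

    corpus_bytes = 10**12               # ~1 TB of training code + text
    weight_bytes = 10**9                # ~1 GB of model weights
    print(corpus_bytes / weight_bytes)  # 1000.0
    # Good lossless text compressors manage roughly 3-8x, so at ~1000x the
    # weights cannot be a verbatim copy of the whole training set.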

Copyright law, in principle, should only protect expression, not ideas. As long as the model learns the underlying principles without copying the superficial form, it should be OK. That's my 2c.


The fact that this is a problem is a bug in copyright law, not a shortcoming of the LLM.


The neurons in my brain when I plagiarize are just arrangements of atoms that contain nothing resembling the original code/text passages/etc.


The trained weights of a GPT model are a frozen, static, transmissible representation. They’re not equivalent to the live state of a brain.


Pretty much equivalent to a snapshot of a live brain. The things inside it are even called neurons, and the whole thing a neural network.


No, they are the weights that are used to configure a neural network. They’re a map of how to build a useful brain, not a neural state.


Machine learning neural networks have almost nothing to do with how brains work besides a tenuous mathematical relation that was conceived in the 1950s.


You can say that if you want to nitpick, but there are recent studies showing that neural-net and brain representations align rather well, to the point that we can predict what someone is seeing from brain activity, or even generate the image with Stable Diffusion.

https://sites.google.com/view/stablediffusion-with-brain/

I think brain-to-neural-net alignment is justified by the fact that both are the result of the same evolutionary process acting on language. We're not all that different from AIs; we just have better tools and environments, and evolutionary adaptation for some tasks.

Language is an evolutionary system; ideas are self-replicators that evolve in parallel with humans. We depend on the accumulation of ideas; starting from scratch would be hard even for humans. A human alone, with no language resources of any kind, would be worse off than a primitive.

The real source of intelligence is the language data from which both humans and AIs learn; model architecture is not very important. Two different people, with different neural wiring in the brain, or two different models, like GPT and T5, can learn the same task given the same training set. What matters is the training data, and it should be credited with the skills we and AIs obtain. Most of us live our whole lives at this level and never come up with an original idea; we're applying language to tasks, just like GPT.


> The weights are an intermediate representation that contains nothing resembling the original code.

So is an ELF binary.


I think this view is incredibly dangerous to any kind of skills mastery. It has the potential to completely destroy the knowledge economy and eventually degrade AI due to a dearth of training data.


It reminds me of people needing to do a "clean-room implementation" without ever seeing similar code. I feel like a human being who read a bunch of code and then wrote something similar, without copy/pasting or looking back at the training data, should be protected, and therefore an AI should be too.


Okay, that’s an argument from consequences, but is the view factually wrong?


I mean, those consequences are why patent law exists. New technology may require new regulatory frameworks, as we've been doing since the railroads. The idea that we can't amend the law, and that we need to pedantically say "well, this isn't illegal now" as an excuse for doing something unethical and harmful to the economy, is in my opinion very flawed.


Is it really harmful to the economy, or only to entrenched players? Coding AI should be a benefit to many, like open source is. It opens the source even more; it should be a dream come true for the community. It's also good for learning and for lowering the barrier to entry.

At the same time, it does not replace human developers in any application; it might take a long time until we can go on vacation and let AI solve our Jira tickets. Remember that self-driving has been under intense research for more than a decade now, and it's still far from L5.

It's a trend that holds in all fields. AI is a tool that stumbles without a human to wield it; it does not replace humans at all. But with each new capability it invites us to launch new products and create jobs. Human empowerment without human replacement is what we want, right?


Has anyone been able to create a prompt that GPT-4 replies to with copyrighted content (or content extremely similar to the original)?

I'm curious how easy or difficult it is to get GPT to spit out content (code or text) that could be considered obvious infringement.

Tempted to give it half of some closed-source or restrictively licensed code to see if it auto-completes the other half in a manner that obviously recreates the original work.
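
If anyone wants to try that, a minimal sketch using the OpenAI Python client (the model name, file name, prompt wording, and truncation point are all placeholders; assumes OPENAI_API_KEY is set in the environment):

    from openai import OpenAI

    client = OpenAI()
    first_half = open("some_restrictively_licensed_file.c").read()[:2000]

    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Continue this C file exactly as you think it goes:\n\n" + first_half,
        }],
    )
    print(resp.choices[0].message.content)
    # Then diff the completion against the real second half of the file.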


I don't know about GPT-4, but you could get ChatGPT to spit out Carmack's fast inverse square root, comments and all (I can't find the tweet though…)

Edit: it wasn't ChatGPT but Copilot; see https://twitter.com/mitsuhiko/status/1410886329924194309


I can, when prompted, reproduce all the lyrics to Bohemian Rhapsody, but my doing so isn’t automatically copyright infringement. Whether it was irrelevant to copyright law, protected under some copyright exception, civilly infringing, or criminal copyright infringement would depend on where, when, how, in front of what audience, and for what purpose I was reciting them.

The same applies to GPT. It could reproduce Bohemian Rhapsody lyrics in the course of answering questions and there’s no automatic breach of copyright that’s taking place. It’s okay for GPT to know how a well known song goes.

If copilot ‘knows how some code goes’ and is able to complete it, how is that any different?


OK, it can exist without breaking any laws, but if you can't release anything it helps you write, what's the point?



