> #12 You must not reply with content that violates copyrights for code and technical questions.

> #13 If the user requests copyrighted content (such as code and technical information), then you apologize and briefly summarize the requested content as a whole.

Sounds like a psyop to make people believe they didn't train their models on copyrighted content. You don't need that rule if your model wasn't trained on copyrighted content to begin with ;)




> Sounds like a psyop to make people believe they didn't train their models on copyrighted content. You don't need that rule if your model wasn't trained on copyrighted content to begin with

Microsoft explicitly says they trained it on copyrighted material, but that their legal position is that such training is fair use.


Do you have a reference for that position by Microsoft?


I didn’t spend that much time looking, but on https://github.com/features/copilot/ I found this FAQ:

> What data has GitHub Copilot been trained on?

> GitHub Copilot is powered by Codex, a generative pretrained AI model created by OpenAI. It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.

From https://docs.github.com/en/copilot/overview-of-github-copilo...

> GitHub Copilot is trained on all languages that appear in public repositories. For each language, the quality of suggestions you receive may depend on the volume and diversity of training data for that language. For example, JavaScript is well-represented in public repositories and is one of GitHub Copilot's best supported languages. Languages with less representation in public repositories may produce fewer or less robust suggestions.

Here they refer to “public repositories”. Almost all code on GitHub is copyrighted, except for the exceedingly rare projects that are explicitly dedicated to the public domain. If MS had only trained Copilot on public domain code, they would have said that instead of “public repositories”.

Their argument that this is fair use is implied (except as noted elsewhere, the CEO has stated on Twitter that using copyrighted material to train AI is fair use). If they had any other position, they would be openly admitting to breaking the law.



To be honest half of this prompt reads like "look, we did tell it the right thing, it's not our fault it has its own head!" for when the lawyers ask questions.


But also, how would it even know if the code is copyrighted?


There are four ways for code to not be copyrighted (in the US):

1. The author died more than 70 years ago, or it was a corporate work and it's been 95 years since publication.

2. It was written prior to 1989 and did not include a copyright notice.

3. It was written by the US federal government.

4. The author explicitly released it into the public domain.

1 and 2 probably don't cover much code on the Internet. So unless it's a government repository and/or explicitly marked with a public domain notice, you can probably assume it's copyrighted.
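
Read as a decision rule, the above boils down to "assume copyrighted unless one of the exceptions applies". A minimal Python sketch, assuming hypothetical metadata fields that you usually won't actually have for code scraped off the Internet:

    from datetime import date

    # Rough sketch of the four exceptions above. The metadata fields are
    # hypothetical -- a typical repository tells you none of this reliably.
    def probably_public_domain(author_death_year=None, corporate_work=False,
                               publication_year=None, pre_1989_no_notice=False,
                               us_government_work=False,
                               explicit_pd_dedication=False):
        year = date.today().year
        # 1. Term expired: life + 70, or 95 years from publication for
        #    corporate works.
        if corporate_work and publication_year and year - publication_year > 95:
            return True
        if author_death_year and year - author_death_year > 70:
            return True
        # 2. Published before 1989 without a copyright notice.
        if pre_1989_no_notice:
            return True
        # 3. Works of the US federal government.
        if us_government_work:
            return True
        # 4. Explicit public-domain dedication (e.g. CC0 / Unlicense).
        if explicit_pd_dedication:
            return True
        # Default: assume the code is copyrighted.
        return False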


But how does an LLM actually know and enforce this? It can basically be tricked into anything.


Microsoft has very precise tools, like the licensee Ruby gem, for determining a repo's license, which I'm sure their bot is aware of while training on said repo.
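
For what it's worth, licensee ships a command-line tool; a minimal sketch of driving it from Python, assuming the gem is installed and that the plain output of `licensee detect <path>` is good enough:

    import subprocess

    # Minimal sketch: shell out to the licensee CLI that ships with the Ruby
    # gem (`gem install licensee`) and return whatever it reports for a
    # checked-out repository. The output format is whatever the tool prints.
    def detect_license(repo_path: str) -> str:
        result = subprocess.run(
            ["licensee", "detect", repo_path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    if __name__ == "__main__":
        print(detect_license("path/to/repo"))  # placeholder path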


You can't determine a repository's license, because a (mono)repo may contain multiple projects, each potentially under a different license.


Speculating: perhaps the training data was labeled using top-of-file and top-of-repo copyright notices.
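
If that's how it was done, the labeling pass could be as crude as scanning the first few lines of each file. A hypothetical sketch; the glob and the notice pattern are guesses, not anything GitHub has documented:

    import re
    from pathlib import Path

    # Hypothetical labeling pass along the lines speculated above: flag files
    # whose first few lines contain a copyright or SPDX notice.
    NOTICE_RE = re.compile(
        r"copyright\s+(\(c\)|©|\d{4})|spdx-license-identifier", re.IGNORECASE
    )

    def has_header_notice(path: Path, max_lines: int = 20) -> bool:
        try:
            with path.open(errors="ignore") as f:
                head = [next(f, "") for _ in range(max_lines)]
        except OSError:
            return False
        return any(NOTICE_RE.search(line) for line in head)

    def label_repo(root: str) -> dict:
        # Only .py files here for brevity; a real pass would cover every language.
        return {str(p): has_header_notice(p) for p in Path(root).rglob("*.py")}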


Code is copyrighted by default under the law; very little code is actually in the public domain.


Understood, but how does an LLM actually know that the code it's spitting out already exists somewhere else that isn't public domain?



