> #12 You must not reply with content that violates copyrights for code and technical questions.

> #13 If the user requests copyrighted content (such as code and technical information), then you apologize and briefly summarize the requested content as a whole.

Sounds like a psyop to make people believe they didn't train their models on copyrighted content. You don't need that rule if your model wasn't trained on copyrighted content to begin with ;)




> Sounds like a psyop to make people believe they didn't train their models on copyrighted content. You don't need that rule if your model wasn't trained on copyrighted content to begin with

Microsoft explicitly says they trained it on copyrighted material, but that their legal position is that such training is fair use.


Do you have a reference for that position by Microsoft?


I didn’t spend that much time looking, but on https://github.com/features/copilot/ I found this FAQ:

> What data has GitHub Copilot been trained on?

> GitHub Copilot is powered by Codex, a generative pretrained AI model created by OpenAI. It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.

From https://docs.github.com/en/copilot/overview-of-github-copilo...

> GitHub Copilot is trained on all languages that appear in public repositories. For each language, the quality of suggestions you receive may depend on the volume and diversity of training data for that language. For example, JavaScript is well-represented in public repositories and is one of GitHub Copilot's best supported languages. Languages with less representation in public repositories may produce fewer or less robust suggestions.

Here they refer to “public repositories”. Almost all code on GitHub is copyrighted, except for the exceedingly rare projects that are explicitly dedicated to the public domain. If MS had only trained Copilot on public domain code, they would have said that instead of “public repositories”.

Their argument that this is fair use is implied (except as noted elsewhere, the CEO has stated on Twitter that using copyrighted material to train AI is fair use). If they had any other position, they would be openly admitting to breaking the law.



To be honest half of this prompt reads like "look, we did tell it the right thing, it's not our fault it has its own head!" for when the lawyers ask questions.


But also, how would it even know if the code is copyrighted?


There are four ways for code to not be copyrighted (in the US):

1. The author died more than 70 years ago, or it was a corporate work and it's been 95 years since publication.

2. It was written prior to 1989 and did not include a copyright notice.

3. It was written by the US federal government.

4. The author explicitly released it into the public domain.

1 and 2 probably don't cover much code on the Internet. So unless it's a government repository and/or explicitly marked with a public domain notice, you can probably assume it's copyrighted.
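
Read as a decision rule, the above boils down to "assume copyrighted unless one of the exceptions applies". A minimal Python sketch, assuming hypothetical metadata fields that you usually won't actually have for code scraped off the Internet:

    from datetime import date

    # Rough sketch of the four exceptions above. The metadata fields are
    # hypothetical -- a typical repository tells you none of this reliably.
    def probably_public_domain(author_death_year=None, corporate_work=False,
                               publication_year=None, pre_1989_no_notice=False,
                               us_government_work=False,
                               explicit_pd_dedication=False):
        year = date.today().year
        # 1. Term expired: life + 70, or 95 years from publication for
        #    corporate works.
        if corporate_work and publication_year and year - publication_year > 95:
            return True
        if author_death_year and year - author_death_year > 70:
            return True
        # 2. Published before 1989 without a copyright notice.
        if pre_1989_no_notice:
            return True
        # 3. Works of the US federal government.
        if us_government_work:
            return True
        # 4. Explicit public-domain dedication (e.g. CC0 / Unlicense).
        if explicit_pd_dedication:
            return True
        # Default: assume the code is copyrighted.
        return False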


But how does an LLM actually know and enforce this? It can basically be tricked into anything.


Microsoft has very precise tools, like the licensee Ruby gem, for determining a repo's license, which I'm sure their bot is aware of while training on said repo.
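
For what it's worth, licensee ships a command-line tool; a minimal sketch of driving it from Python, assuming the gem is installed and that the plain output of `licensee detect <path>` is good enough:

    import subprocess

    # Minimal sketch: shell out to the licensee CLI that ships with the Ruby
    # gem (`gem install licensee`) and return whatever it reports for a
    # checked-out repository. The output format is whatever the tool prints.
    def detect_license(repo_path: str) -> str:
        result = subprocess.run(
            ["licensee", "detect", repo_path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    if __name__ == "__main__":
        print(detect_license("path/to/repo"))  # placeholder path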


You can't determine a repository's license, because a (mono)repo may contain multiple projects, each potentially under a different license.


Speculating: perhaps the training data was labeled using top-of-file and top-of-repo copyright notices.
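
If that's how it was done, the labeling pass could be as crude as scanning the first few lines of each file. A hypothetical sketch; the glob and the notice pattern are guesses, not anything GitHub has documented:

    import re
    from pathlib import Path

    # Hypothetical labeling pass along the lines speculated above: flag files
    # whose first few lines contain a copyright or SPDX notice.
    NOTICE_RE = re.compile(
        r"copyright\s+(\(c\)|©|\d{4})|spdx-license-identifier", re.IGNORECASE
    )

    def has_header_notice(path: Path, max_lines: int = 20) -> bool:
        try:
            with path.open(errors="ignore") as f:
                head = [next(f, "") for _ in range(max_lines)]
        except OSError:
            return False
        return any(NOTICE_RE.search(line) for line in head)

    def label_repo(root: str) -> dict:
        # Only .py files here for brevity; a real pass would cover every language.
        return {str(p): has_header_notice(p) for p in Path(root).rglob("*.py")}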


Code is copyrighted by default under the law; very little code is actually in the public domain.


Understood, but how does an LLM actually know that the code it's spitting out already exists somewhere else that isn't public domain?



