> #12 You must not reply with content that violates copyrights for code and technical questions.
> #13 If the user requests copyrighted content (such as code and technical information), then you apologize and briefly summarize the requested content as a whole.
Sounds like a psyop to make people believe they didn't train their models on copyrighted content. You don't need that rule if your model wasn't trained on copyrighted content to begin with ;)
> Sounds like a psyop to make people believe they didn't train their models on copyrighted content. You don't need that rule if your model wasn't trained on copyrighted content to begin with
Microsoft explicitly says they trained it on copyrighted material, but that their legal position is that such training is fair use.
> GitHub Copilot is powered by Codex, a generative pretrained AI model created by OpenAI. It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub.
> GitHub Copilot is trained on all languages that appear in public repositories. For each language, the quality of suggestions you receive may depend on the volume and diversity of training data for that language. For example, JavaScript is well-represented in public repositories and is one of GitHub Copilot's best supported languages. Languages with less representation in public repositories may produce fewer or less robust suggestions.
Here they refer to “public repositories”. Almost all code on GitHub is copyrighted, except for the exceedingly rare projects that are explicitly dedicated to the public domain. If MS had only trained Copilot on public domain code, they would have said that instead of “public repositories”.
Their fair-use argument is mostly implied, although, as noted elsewhere, the CEO has stated on Twitter that using copyrighted material to train AI is fair use. If they held any other position, they would be openly admitting to breaking the law.
To be honest, half of this prompt reads like "look, we did tell it the right thing, it's not our fault it has a mind of its own!" for when the lawyers ask questions.
There are four ways for code to not be copyrighted (in the US):
1. The author died more than 70 years ago, or it was owned by a corporation and it's been 95 years since publication.
2. It was published prior to 1989 and did not include a copyright notice.
3. It was written by the US federal government.
4. The author explicitly released it into the public domain.
1 and 2 probably don't cover much code on the Internet. So unless it's a government repository and/or explicitly marked with a public domain notice, you can probably assume it's copyrighted.
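To make the four cases concrete, here is a rough Python sketch of them as a checklist. The `Work` record and every field name in it are made up for illustration, the thresholds are simply the ones quoted in the list, and it glosses over plenty of real-world detail, so treat it as a reading aid rather than legal advice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Work:
    """Hypothetical record describing a piece of code (all fields are assumptions)."""
    publication_year: int
    author_death_year: Optional[int] = None  # None if the author is alive or unknown
    corporate_owner: bool = False
    has_copyright_notice: bool = True
    government_work: bool = False             # written by the (US) government
    public_domain_dedication: bool = False    # explicit release by the author


def probably_public_domain(work: Work, current_year: int) -> bool:
    """Rough encoding of the four cases from the list above."""
    # Case 1: the term has run out (95 years from publication for
    # corporate works, otherwise more than 70 years after the author's death).
    if work.corporate_owner:
        if current_year - work.publication_year >= 95:
            return True
    elif work.author_death_year is not None:
        if current_year - work.author_death_year > 70:
            return True
    # Case 2: published before 1989 without a copyright notice.
    if work.publication_year < 1989 and not work.has_copyright_notice:
        return True
    # Case 3: government work.
    if work.government_work:
        return True
    # Case 4: explicit public domain dedication.
    return work.public_domain_dedication


# A typical modern repository with no special status falls through every case.
print(probably_public_domain(Work(publication_year=2015), current_year=2024))
```

With the defaults above, nearly any recent repository comes out as "still copyrighted", which is the point of the comment.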
Microsoft has very precise tools, like the Licensee Ruby gem, for determining a repo's license, and I'm sure their pipeline has access to that information while training on said repo.
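As a sketch of what filtering on that information could look like: the Python below assumes the detected license arrives as a dict shaped like the payload of GitHub's repository license endpoint (a `license` object carrying an `spdx_id`). The function names and the set of identifiers are hypothetical, and nothing here is meant to describe how Copilot's training data was actually selected.

```python
from typing import Any, Mapping

# SPDX identifiers for public domain dedications.
# An illustrative, deliberately incomplete set (assumption).
PUBLIC_DOMAIN_SPDX_IDS = {"CC0-1.0", "Unlicense"}


def spdx_id(license_payload: Mapping[str, Any]) -> str:
    """Extract the SPDX identifier from a license-lookup payload.

    Assumes a shape like {"license": {"spdx_id": "MIT"}} and falls back
    to "NOASSERTION" when no license was detected.
    """
    detected = license_payload.get("license") or {}
    return detected.get("spdx_id") or "NOASSERTION"


def is_public_domain_dedication(license_payload: Mapping[str, Any]) -> bool:
    """True only when the detected license is a known public domain dedication."""
    return spdx_id(license_payload) in PUBLIC_DOMAIN_SPDX_IDS


# Hypothetical payloads for three repositories.
for payload in ({"license": {"spdx_id": "CC0-1.0"}},
                {"license": {"spdx_id": "GPL-3.0"}},
                {"license": None}):
    print(spdx_id(payload), is_public_domain_dedication(payload))
```

Anything that is not an explicit dedication, including repos with no detectable license at all, is treated as copyrighted, which matches the default assumption argued for above.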