I think you're misunderstanding the point here. The point is that Copilot is not "fair use". Copilot cannot exist without ingesting entire codebases to encode their aggregate statistics, which Microsoft then uses to recommend auto-completions. So what you're arguing is untenable: Copilot does not just ingest the header files for some C library and then somehow magically provide the function implementations; it copies and statistically obfuscates the entire codebase (headers and all). This is obviously not "fair use".
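To make the "aggregate statistics" point concrete, here is a deliberately tiny sketch (my own toy example, nothing like Copilot's actual neural architecture) of what "ingest a corpus, then suggest completions from its statistics" looks like in its simplest possible form:

    # Toy bigram "completion" model built from ingested source files.
    # Copilot is a large neural network, not a lookup table like this,
    # but both suggest tokens as a function of the training corpus's statistics.
    from collections import Counter, defaultdict
    from pathlib import Path

    def build_model(repo_dir):
        """Count which token tends to follow which token across a codebase."""
        follows = defaultdict(Counter)
        for path in Path(repo_dir).rglob("*.py"):   # hypothetical corpus location
            tokens = path.read_text(errors="ignore").split()
            for prev, nxt in zip(tokens, tokens[1:]):
                follows[prev][nxt] += 1
        return follows

    def suggest(follows, prev_token):
        """'Autocomplete' by replaying the most common continuation seen in training."""
        candidates = follows.get(prev_token)
        return candidates.most_common(1)[0][0] if candidates else None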
The equivalent argument here is homomorphically encrypting an entire codebase, searching through it for matching snippets, and never attributing credit to the original code or source. This is essentially what Copilot is doing, with AI hype as a cover.
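For what it's worth, here is a stand-in for that analogy (using plain hashing rather than real homomorphic encryption, purely to illustrate the shape of the argument): the stored corpus is obscured, matches can still be found, and attribution is simply never carried along.

    # Illustration only: obscure the stored code (hashing here, standing in for
    # the homomorphic-encryption analogy above) and answer "does this snippet
    # appear in the corpus?" without ever surfacing who wrote it.
    import hashlib

    def fingerprint(line):
        # Normalize whitespace so trivially reformatted copies still match.
        return hashlib.sha256(" ".join(line.split()).encode()).hexdigest()

    def build_index(files):
        """files: {path: source text}; note that authorship/licensing is discarded."""
        index = set()
        for text in files.values():
            for line in text.splitlines():
                if line.strip():
                    index.add(fingerprint(line))
        return index

    def seen_before(index, candidate_line):
        """True if the line matches the ingested corpus; no credit to the source."""
        return fingerprint(candidate_line) in index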
It's not obvious to a court, at all. Two (edit: three) significant arguments against it being obvious:
1. What do our human brains do? How is Copilot's process different from how a human brain functions? Our brains take in large amounts of information, from all sorts of places, and boil it down, just as Copilot tries to do.
2. According to courts, website scraping is completely legal and "fair use," even if you scrape 200,000 profiles. How is this anything other than website scraping, but with code? I expect that argument to also come up. And even if it isn't quite the same, consider the legal arguments that allowed website scraping to pass as "fair use" - they will likely apply.
Edit: 3. Currently, ingesting large amounts of material for the purpose of creating an AI-generated result is considered legal, even if there aren't many cases. This is also how Stability AI, DALL-E, and others exist. Courts are always loath to shut down fledgling industries because it makes them look political. Odds of success in overturning that principle outright? I'd give it less than 10%.
In a nutshell, legally, it is not obvious.
Edit @belorn (stupid posting too fast limit):
There is actually a difference. YouTube uses a "rolling key" scheme, which actually qualifies as a DMCA "technical protection measure" in some courts. In that case, barring a narrow exemption granted by the Library of Congress every 3 years (say, for news reporting), your dataset would be illegally obtained - for bypassing DRM. A really, really weak DRM, but courts don't care about strength when determining these things. Which is why it isn't OK. However, if YouTube didn't do that, you actually would, under current law, be OK training a model against YouTube whether YouTube liked it or not. Code on GitHub does not have any TPMs on it, so it is legal to obtain, and thus legal for training.
I now see you're confused about how Copilot actually works. I recommend learning about statistical machine learning and not getting wrapped up in AI hype. Brains don't work like Copilot, and that argument won't work in court. Copilot is a statistical encryption scheme, nothing more. Carrying your argument to its natural conclusion would mean that homomorphic encryption makes all data fair use for profit, with no need to respect the licensing terms of the original data set.
1. Human brains aren't above the law here either. When you produce code that is substantially the same (especially if it's a character-for-character match of large swathes of an existing work), it's presumed that you copied it unless you can prove one of a few exceptions, such as that the code itself wasn't copyrightable (a moot point in context) or that you likely had never seen the original (which Copilot explicitly has).
2. Scraping by itself is probably mostly legal. What you do with the scraped information is still subject to copyright and other laws.
Or, that the work you have reproduced is sufficiently "transformative," which is part of the "fair use" analysis. How different is the copy from the original? If I draw Elsa but with a green dress and a different hairstyle, I'm not infringing Disney's copyright even if my inspiration is clear.
Edit: Also, the degree to which it must be transformative is, I would argue, lower for code. There are only so many ways to implement an algorithm - meaning that if it is slightly different, it may be transformative enough, whereas a more creative, less utility-driven medium like painting may require greater changes.
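As a concrete (and entirely my own) illustration of that point: independent authors implementing a small, well-specified algorithm tend to converge on nearly identical code, simply because there are so few sensible ways to write it.

    # Textbook binary search - a toy example of code where independent
    # implementations naturally end up looking almost the same.
    def binary_search(items, target):
        lo, hi = 0, len(items) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if items[mid] == target:
                return mid
            if items[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1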
It depends on the task. If we edit a movie, we cut and paste different parts from a large amount of raw material until something new is created. If we do this to work we don't own, we call it a remix.
Law is always about context. Web scraping is OK. Scraping YouTube is not. There is no technical distinction between the two.
I would very much like to see an AI that uses YouTube videos as its training set. It could co-sing with people, co-edit movies and music videos, and even generate video game assets just from scraping videos of people playing video games. All very unlikely to ever occur.
> 1. What do our human brains do? How is Copilot's process different from how a human brain functions? Our brains take in large amounts of information, from all sorts of places, and boil it down, just as Copilot tries to do.
I've posted this elsewhere, but it doesn't matter what our brains do and how similar that is to how AI operates. Humans have rights that machines do not. For example I can watch a movie and not be sued for infringement because I made a copy of the movie in my head.
> I've posted this elsewhere, but it doesn't matter what our brains do and how similar that is to how AI operates. Humans have rights that machines do not. For example I can watch a movie and not be sued for infringement because I made a copy of the movie in my head.
No AI model works that way either, though. Think Stability AI: it doesn't keep a copy of every image it was trained on, but has "distilled" the patterns out of those images. It no longer contains a copy of any specific image, nor is there a way to extract the training data from it.
In which case, it does not have a copy of the movie in its head either - but it does, for example, recognize the Disney look.
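A rough analogy (a toy linear fit, not how diffusion models actually work) for what "distilled" means here: training boils many examples down into a handful of shared parameters, and the examples themselves are not stored anywhere in the result.

    # Toy analogy: "training" keeps 2 fitted parameters, not the 10,000 points.
    import random

    xs = [random.uniform(0, 10) for _ in range(10_000)]
    ys = [3.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]   # synthetic data

    # Closed-form least-squares fit of y = a*x + b.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    a = num / den
    b = mean_y - a * mean_x
    print(f"the model keeps only a={a:.2f} and b={b:.2f}; the training points are gone")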
Right now, GitHub Copilot's argument is that training AI models on copyrighted material is legal. This is also the position almost every AI startup takes, and it is rooted in "fair use" taking the "transformative" qualities of a work into consideration. There is no doubt that AI-generated suggestions generally are "transformative"; the question is only whether they are transformative enough.
> No AI model works that way either, though. Think Stability AI: it doesn't keep a copy of every image it was trained on, but has "distilled" the patterns out of those images. It no longer contains a copy of any specific image, nor is there a way to extract the training data from it.
My point was that it doesn't matter whether AI works exactly like a human brain, humans have additional rights that an AI does not. This comment made my point better than I could: https://news.ycombinator.com/item?id=33273621
> There is no doubt that AI-generated suggestions generally are "transformative"; the question is only whether they are transformative enough.
I don't think even Microsoft/GitHub actually believes this to be the case, because they chose to train on public GitHub repos and did not include their own proprietary codebase.
> I don't think even Microsoft/GitHub actually believes this to be the case, because they chose to train on public GitHub repos and did not include their own proprietary codebase.
There are good reasons they would not include proprietary codebases aside from this, so I don't see this as a smoking gun. Large codebases often involve elements that don't make sense outside of their immediate environment. Projects the size of Linux have this same issue, but large open source projects tend to have significant cross-pollination with the broader community so it's less of an issue.
For example, one large corporate codebase I worked in had a library of ~20 strangely named, short utility functions that were called from many thousands of places. People in the broader community would not find such ubiquitous use of these functions idiomatic at all, but it formed a "dialect" within the company that was often useful given that everyone knew it. These functions caused a lot of signatures/code to be structured in ways that assumed their existence - they were painful to extract when we open sourced internal things. There also tends to be a lot of business logic in corporate code (e.g., `max_space = 100 if has_excel_license else 10`) that would make zero sense in other codebases.
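Something like this, with every name and detail invented for illustration (not the actual company's code):

    # Hypothetical in-house "dialect": short, oddly named helpers that every
    # internal file assumes, plus business logic that means nothing elsewhere.
    def qget(d, path, default=None):
        """Company-wide shorthand for nested dict lookup, used at thousands of call sites."""
        for key in path.split("."):
            if not isinstance(d, dict) or key not in d:
                return default
            d = d[key]
        return d

    def mkrow(**kw):
        """House convention: every record goes through this one constructor."""
        row = {"_v": 1, "_src": "internal"}
        row.update(kw)
        return row

    def max_space(user):
        # Only meaningful inside this one company's licensing model.
        return 100 if qget(user, "licenses.excel") else 10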
A real example in this case... I'd bet that many millions of lines of Microsoft's internal code still use Hungarian notation. Few users would want Copilot generating such names. I could see interest in a version of Copilot that augments the standard Copilot model with your own codebase if you have millions of lines of code in your GitHub Enterprise account.
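For readers who haven't run into it, Systems Hungarian encodes the type into the variable name; the contrast below (invented variables) is why suggestions in that style would feel foreign to most users:

    # Systems Hungarian style, common in older Microsoft codebases:
    szUserName = "ada"              # sz  -> zero-terminated string
    cchUserName = len(szUserName)   # cch -> count of characters
    fIsAdmin = False                # f   -> boolean flag

    # The same variables as most open source Python would name them:
    user_name = "ada"
    user_name_len = len(user_name)
    is_admin = False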