> lines here that should not have been crossed without the author's explicit permission, regardless of the utility of the tool they built.
Fyi... Google Books (scanned and OCR'd books) eventually won against the authors who filed copyright infringement lawsuits. So there is some precedent that courts do look at the "utility" or "sufficiently transformative" aspect when weighing copyright infringement.
A number of points in Google's favor: they are not passing off Google Books content as their own, and they limit your access to a small fraction of the offering.
The thing that surprised me about that ruling is that it was deemed final without a chance of an appeal.
Google also used all of this to improve their OCR algorithms, almost certainly used in Google Cloud Vision[0], but I doubt this was a consideration when deciding if it was transformative/fair use.
> Yet they did not build and market a service to authors that would write novels for them based on their OCR-ed catalog.
I find this to be a very apt analogy. If Google had done such a thing, they would be facing the same kinds of lawsuits that Microsoft is facing now. And despite Microsoft's money, I don't see how they can wiggle their way out of this one. They basically ignored the license terms and attribution requirements of the authors - something Microsoft would never stand for if the shoe were on the other foot.
Indeed; that would be an excellent topic for litigation, and they would fight it with every lawyer they have b/c it could invalidate their efforts to zero out human labor costs in all possible areas.
Not really. Google won because Google Books was not actually a new concept; someone else had already built a book search engine the same way Google did, also got sued by the Authors Guild, and also prevailed. The only thing different about Google Books was that it would give you two pages' worth of excerpt from the book. So it was very easy for a court to extend the fair use logic that had already been woven into the law.
I still think "training is fair use" has a leg to stand on, though. But it doesn't save GitHub Copilot because they're not merely training a model; they're selling access to its outputs and telling people they have "full commercial rights" to its outputs (i.e. sublicensing). Fair use is not transitive; if I make 100 Google Books searches to get all the pages out of a book, I don't suddenly own the book. There is no "copyright laundry" here.
> I still think "training is fair use" has a leg to stand on, though
If that's the case, we need to seriously reconsider how we reward Open Source as a society (I think that would be fantastic anyway!) -- we have people producing knowledge and others profiting directly from this material, producing new content and new code that's incompatible with the original license.
You make GPL code, I make an AI that learns from GPL code; shouldn't its output be GPL licensed as well?
Meanwhile, I think the most reasonable solution is that an AI should always produce content compatible with the licenses of its training material. So if you want to use GPL training sets, you can only use them to create GPL-compatible code. If you use public domain (or e.g. 0BSD?) training sets, you can produce any code, I guess.
> You make GPL code, I make an AI that learns from GPL code; shouldn't its output be GPL licensed as well?
If the output (not just the model) can be determined to be a derivative work of the input, or the model is overfit and regurgitating training set data, then yes. It should. And a court would make the same demands, because fair use is intransitive - you cannot reach through a fair use to make an unfair use. So each model invocation creates a new question of "did I just copy GPL code or not".
It would be an essential feature, imo, to have this 'near-verbatim check' for copyleft code.
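Purely as an illustration of what such a check could look like (this is a hypothetical sketch, not anything Copilot actually implements; all names below are made up), a crude near-verbatim detector could compare token n-grams of a generated suggestion against an index built from copyleft sources:

    # Hypothetical sketch of a "near-verbatim check": flag generated code whose
    # token n-grams overlap heavily with an index built from copyleft sources.
    import re
    from typing import Iterable

    def tokenize(code: str) -> list:
        # Very rough tokenizer: identifiers, numbers, then any other non-space char.
        return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

    def ngrams(toks: list, n: int = 8) -> set:
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def build_index(corpus: Iterable, n: int = 8) -> set:
        # 'corpus' would be the text of files from copyleft repositories.
        index = set()
        for source in corpus:
            index |= ngrams(tokenize(source), n)
        return index

    def overlap_ratio(generated: str, index: set, n: int = 8) -> float:
        grams = ngrams(tokenize(generated), n)
        if not grams:
            return 0.0
        return len(grams & index) / len(grams)

    # Usage idea: flag a suggestion if, say, over 20% of its 8-grams match
    # the copyleft index, and surface the matching sources for attribution.

A real system would need something far more robust (identifier normalization, minhash/winnowing for scale), but even this crude version would catch the verbatim-regurgitation cases people keep finding.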
Overall it feels like a bit too much specialized learning from GPL/copyleft code to be fair. It's not like a human who reads some source code and gets an idea of how it works; it's really learning to code from scratch on copyleft code, without which it would likely perform much worse and fail to generate a number of its examples. It's not just copy-paste, but it sits closer on the spectrum to copy-paste than to the kind of super-abstract inspiration that would feel fair.
As others have said, I don't think it would be considered fine (especially from big companies' point of view) to decompile proprietary code (or just grab publicly available but illegal-to-reproduce code) and have AIs learn from it, in a way that differs in scope and ability from human research and reverse engineering.
I think we need a good tradeoff that isn't Luddism (which would reject a benefit for us all), but that still promotes and maintains open source software. In this case a public good is being seized and commercialized, and that doesn't seem quite right: make Copilot public, or use only permitted code (or share the revenue with developers -- though that seems more complicated, and up to each copyright holder to re-license for this usage). I remember not long ago MS declaring Open Source was a kind of "cancer"; now they're relying on it to sell their programming AIs. I personally think Open Source is quite the opposite of a cancer - it is usually an unmitigated social good.
Much of the same could be said for artists and generative AI art.
And this isn't even starting on how we move forward as a society that has heavily automated most jobs and needs to distribute resources and wealth in a way that enables the greatest wellbeing for all beings.
Depends if you think the GPL means "copyright is great!" vs "let's use their biggest weapon against them..."
It's a surprisingly subtle distinction.
EDIT - if I squint hard enough in exactly the right way, there's a sense in which CoPilot etc aligns perfectly with the goals of the free software movement. A world in which you can use it as a code copyright laundry might be a world where code is actually free.
Is that any weirder than bizarre legal contortions such as the Google/Oracle "9 lines of code"? Or the whole dance around reverse engineering: "It's OK if you never saw the actual code but you're allowed to read comprehensive notes from someone who did"..?
There's a ton of examples like this. Tell me with a straight face that there's a clear moral line in either copyright or patent law as it relates to software.
IP is a mess and it's not clear who benefits. Is a world where code isn't subject to copyright so bad?
If Copilot was released as FOSS with trained model weights, I don't think the Free Software movement would have "shot first" in the resulting copyright fight.
It is specifically the idea of using copyright to eat itself that is harmed by AI training. In the world we currently live in, only source code can be trained on. If I want to train an AI on, say, the NT kernel, I have to decompile it first, and even then it's not going to be good training data because there are no comments or variable names to guide the AI. The whole point of the GPL was to force other companies not to lock down programs and withhold source code, after all.
Keep in mind too that AI is basically proprietary software's final form. Not even the creator of an AI program has anything that resembles "source code"; and a good chunk of AI safety research boils down to "here's a program you can't comprehend except through gradient descent, how do we design it to have an incentive to not do bad things".
If you like copyright licensing and just view the GPL as an exception sales vehicle, then AI is less of a threat, because it's just another thing to sell licenses for.
> You write closed-source code, then I make an AI that learns from that code; shouldn't its output be licensed as well?
> Ask the above, and suddenly Microsoft will agree.
Does Microsoft actually agree? Many people have posted leaked/stolen Microsoft code (such as Windows, MS-DOS 6) to GitHub. Microsoft doesn't seem to make a very serious effort to stop it – sometimes they DMCA repos hosting it, but others have stayed up for ages. They could easily build some system to automatically detect and take down leaks of their own code, but they haven't. Given this reality, if they trained GitHub Copilot on all public GitHub repos, its training likely included leaked Microsoft source code. If true, that means Microsoft doesn't actually have a problem with people using the outputs of an AI trained on their own closed-source code.
> If that's the case, we need to seriously reconsider how we reward Open Source as a society (I think that would be fantastic anyway!) -- we have people producing knowledge and others profiting directly from this material, producing new content and new code that's incompatible with the original license.
Is that new? If I include some excerpt from copyrighted material in my own work and it's deemed to be fair use, that doesn't limit my right to profit from the work, sell the copyright to someone else, and so on, does it?
If open source code authors (and other content creators) don't want their IP to be used in AI training data sets then they can simply change the license terms to prohibit that use. And if they really want to control how their IP is used then they shouldn't host it on GitHub in the first place. Of course Microsoft is going to look for ways to monetize that data.
> they can simply change the license terms to prohibit that use
GitHub's argument* is not that they're following the license but that the license does not apply to their use. So they would continue to ignore any provision that says they can't use the material for training.
Moving off GitHub is a better step at a practical level. But again they claim the license doesn't matter, so even if it's hosted publicly elsewhere they would (presumably) maintain that they can still scoop it up. It just becomes more work, for them, to do so.
*Which is completely wrong in my opinion, for the record
> But it doesn't save GitHub Copilot because they're not merely training a model; they're selling access to its outputs and telling people they have "full commercial rights" to its outputs (i.e. sublicensing).
But if you read the source code of 100 different projects to learn how they worked and then someone hired you to write a program that uses this knowledge, that should be legit. I'm not sure if the law currently makes a distinction between learning vs. remixing, and if Copilot would qualify as learning.
That kind of legal ass-covering is expedient when you are going to explicitly reproduce someone else’s source-available work. It’s cheaper in that case to go through the whole clean room hassle than to risk getting into an intractable argument in court about how your code that does exactly the same thing as someone else’s code came to resemble the other people’s code so much.
But, for the general case, the argument still stands. I have looked at GPL code before. I might have even learned something from it. Is my brain infected? Am I required by law to license everything I ever make as GPL for the remainder of my days?
Yes, it will sometimes depend on the unique qualities of the code. For instance, if you learned a new sorting algorithm from a C repo, and then wrote a comparable imperative solution in OCaml, that might be a derivative work. But if you wrote a purely functional equivalent of that algorithm, I don't think that could be considered a derivative work.
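To make that contrast concrete (sketched in Python rather than C or OCaml, purely as an illustration): the first version is a close structural translation of a typical imperative implementation, while the second re-expresses the same algorithm without any mutation.

    # Illustration only: the same sorting idea expressed two ways.
    from functools import reduce

    def insertion_sort_imperative(xs):
        # Index-and-shift style, mirroring a typical imperative C version.
        xs = list(xs)
        for i in range(1, len(xs)):
            key = xs[i]
            j = i - 1
            while j >= 0 and xs[j] > key:
                xs[j + 1] = xs[j]
                j -= 1
            xs[j + 1] = key
        return xs

    def insertion_sort_functional(xs):
        # Purely functional re-expression: no in-place mutation, each element
        # is recursively inserted into an already-sorted list.
        def insert(x, sorted_xs):
            if not sorted_xs or x <= sorted_xs[0]:
                return [x] + sorted_xs
            return [sorted_xs[0]] + insert(x, sorted_xs[1:])
        return reduce(lambda acc, x: insert(x, acc), xs, [])

    assert insertion_sort_imperative([3, 1, 2]) == insertion_sort_functional([3, 1, 2]) == [1, 2, 3]

Whether a court would actually treat one as derivative and the other as not is anyone's guess; the point is only that the "same algorithm" can be expressed at very different distances from the original.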
And Google kept the copyright notices and attributions; probably not super relevant, but it's a difference between the two cases.
I mean, in essence GitHub is a library; they did have a license, to a point, to do with the code as they pleased, but they then started to create a derivative work in the form of an AI, without correctly crediting the source materials.
I think they made a gamble on it; as far as I'm aware, AI training sets had not yet been challenged in a court of law, so this was not fully legally defined yet. These lawsuits - and the ones (if any) aimed at the image generators, using CC artwork from e.g. artstation - will lay the legal groundwork for future AI / ML development.
Libraries are really not very special. They mostly exist on the basis of first-sale doctrine and have to subscribe to electronic services like everyone else.
Entities like the Internet Archive skate by (at least before their book lending stunt during COVID) by being non-profit and bending over backwards to respect even retrospective robots.txt instructions, meaning that it's not really worth suing them given they'll mostly do what you ask anyway.
But I guarantee you that if I set up a best comic strips of all time library I'll probably be in court.
Copilot also isn't retaining the actual content of the source code repositories and then deriving works from that. If I wrote a giant table of token frequencies and associative keywords by analyzing a bunch of source, and sold that to people as a "github code analysis" book, I'm pretty sure that's perfectly fine because it's not a derivative work. And I'm not sure that the fact that a program can then take that associative data and generate new code suddenly makes it not OK.
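As a toy picture of what such a "table of token frequencies and associative keywords" could look like (this has nothing to do with how Copilot is actually built; it's just to make the thought experiment concrete), here is a bigram table over source files that can also be sampled to emit new token sequences:

    # Toy illustration: a bigram frequency table built from source files.
    # The table is a statistical summary of the corpus, not a copy of any
    # single file, yet it can be sampled to generate new token sequences.
    import random
    import re
    from collections import Counter, defaultdict

    def tokenize(code):
        return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

    def build_bigram_table(files):
        table = defaultdict(Counter)
        for text in files:
            toks = tokenize(text)
            for a, b in zip(toks, toks[1:]):
                table[a][b] += 1
        return table

    def generate(table, start, length=20):
        out, cur = [start], start
        for _ in range(length):
            if cur not in table:
                break
            followers = table[cur]
            cur = random.choices(list(followers), weights=list(followers.values()))[0]
            out.append(cur)
        return " ".join(out)

    # Hypothetical usage:
    # table = build_bigram_table(open(p).read() for p in some_source_files)
    # print(generate(table, "def"))

Whether the generated output is then "new code" or a recognizable fragment of someone's repository depends entirely on how sharp the statistics are - which is exactly the question the lawsuit is circling.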
> If I wrote a giant table of token frequencies and associative keywords by analyzing a bunch of source, and sold that to people as a "github code analysis" book, I'm pretty sure that's perfectly fine because it's not a derivative work.
That sounds to me somewhat close to "if I take an FFT of each of those copyrighted images, glue them together, and sell this as a picture, is that a derivative work?" - I'd say yes, or perhaps even a different encoding of the original work, since you can reverse the frequency domain representation and get the original spatial representation - the original images - back.
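For what it's worth, that FFT intuition is easy to demonstrate: the frequency-domain representation of a single image is fully invertible, i.e. just the same work in a different encoding (a minimal numpy sketch, with a random array standing in for a copyrighted image):

    # Sketch: a 2D FFT of an image is a lossless re-encoding; the inverse
    # transform recovers the original pixels (up to floating-point error).
    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.random((64, 64))          # stand-in for a copyrighted image

    spectrum = np.fft.fft2(image)         # the "frequency domain" table
    recovered = np.fft.ifft2(spectrum).real

    assert np.allclose(image, recovered)  # the original work is fully recoverable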
Sure, ROT13 encoding is a derivative work because the entire original work is still there, encoded. Ditto for FFT. Large language models are not that.
Sometimes parts of the original works are still encoded, which we've seen when some code is reproduced verbatim, and I'm sure that happens to people as well, ie. they see some algorithm and down the road have to write something similar and end up reproducing the exact same thing.
Once they iron out those wrinkles, it's not clear to me that a large language model is a directly reversible function of the original works. At least, not any more than a human learning from reading a bunch of code and then going on to have a career selling his skills at writing code.
Edit: by which I mean, LLMs are lossy encodings, not lossless encodings.
> Ditto for FFT. Large language models are not that.
They're not, but the "giant table of token frequencies and associative keywords" reminded me of doing FFT on images, and I wanted to communicate the idea that transformations like this can actually retain the original information, and reproduce it back through inverse transform.
> by which I mean, LLMs are lossy encodings, not lossless encodings
Exactly. And while I doubt most training data is recoverable, "lossy encoding" is still a spectrum. As you move away from lossless, it's not obvious when, or whether at all, the result becomes clear of the original inputs' authors' copyright. Compare e.g. with JPEG, which employs a less sophisticated lossy encoding - no matter how hard you compress a source image, the result would still likely retain the copyright of the source image's author, as provenance matters.
> And while I doubt most training data is recoverable, "lossy encoding" is still a spectrum. [...] Compare e.g. with JPEG
I'll just finally note that LLMs are not lossy encodings in the same sense as JPEG. LLMs are closer to human-like learning, where learning from data enables us to create entirely new expressions of the same concepts contained in that data, rather than acting as pure functions of the source data. That's why this will be interesting to see play out in the courts.
My belief is that there is no fundamental difference here. That is, learning is a form of compression; learning concepts is just a more complex way of achieving much greater (if lossy) compression. If the courts see it the same way, things will get truly interesting.
Yes learning concepts is a form of compression, but I'm not sure that implies there's no "fundamental" difference. I see it as akin to a programming language having only first-order functions vs. having higher-order functions. Higher-order functions give you more expressive power but not any more computational power.
You could say a higher order program can "just" be transformed into a first-order program via defunctionalization, but I think the expressive difference is in and of itself meaningful. I hope the courts can tease that out in the end, and we'll see if LLMs cross that line, or if we need something even more general to qualify.
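For readers who haven't run into the term: defunctionalization is the transformation mentioned above, where function values become plain data plus a single dispatch function. A minimal Python sketch (the names here are invented for illustration):

    # Higher-order version: the operation is passed around as a function value.
    def apply_twice(f, x):
        return f(f(x))

    print(apply_twice(lambda n: n + 3, 10))   # 16

    # Defunctionalized, first-order version: each function value becomes a
    # data tag, and one 'apply' function dispatches on that tag.
    ADD3 = ("add", 3)
    DOUBLE = ("mul", 2)

    def apply(tag, x):
        kind, arg = tag
        if kind == "add":
            return x + arg
        if kind == "mul":
            return x * arg
        raise ValueError(kind)

    def apply_twice_defun(tag, x):
        return apply(tag, apply(tag, x))

    print(apply_twice_defun(ADD3, 10))        # 16 - same behavior, no function values passed around

Same computational power, noticeably less expressive convenience - which is roughly the distinction being drawn between "mere compression" and "learning concepts".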
> I see it as akin to a programming language having only first-order functions vs. having higher-order functions.
Interesting analogy, and I think there are a couple different "levels" of looking at it. E.g. fundamentally, they're the same thing under Turing equivalence, and in practice one can be transformed into the other - but then, I agree there is a meaningful difference for humans having to read or think in those languages. Additionally, if those are typical programming languages, you can't really have the code in the "weaker" language self-upgrade to the point the upgraded language has the same expressive power as the "stronger" one. If the "weaker" one is Lisp though, you can lift it like this.
In this sense I see traditional compression algorithms - like the ones we use for archiving, images and sound - to be like those typical weaker languages. There's a fixed set of features they exploit in their compression. But human learning vs. neural network models (or sophisticated enough non-DNN ML) is to me like Lisp vs. that stronger programming language, or even Lisp vs. a better Lisp - both can arbitrarily raise their conceptual levels as needed. But it's still fundamentally compression / programming Turing machines.
> that happens to people as well, ie. they see some algorithm and down the road have to write something similar and end up reproducing the exact same thing.
And if such an algorithm is copyrighted, that would be infringing! It doesn't matter whether you copy on purpose or by chance.
If you overlap a hundred different FFTs, then the result is likely fine copyright-wise.
These networks are not [supposed to] contain much of the original data. Like the trivia point that Stable Diffusion has less than two bytes per source image, on average.
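The arithmetic behind that trivia point, using rough public figures (the parameter count and dataset size below are approximations I'm assuming, not exact numbers):

    # Back-of-the-envelope for "less than two bytes per source image".
    # Assumed rough figures: ~1e9 parameters stored as fp32, trained on
    # roughly 2.3 billion LAION images.
    params = 1.0e9
    bytes_per_param = 4
    training_images = 2.3e9

    print(params * bytes_per_param / training_images)   # ~1.7 bytes per image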
Stitch them side by side. Yes, this is not how those DNNs work, but the example was more about highlighting that "a giant table of token frequencies" by itself is probably reversible back to original data, or at least something resembling it.
> Stable Diffusion has less than two bytes per source image, on average.
I'm not convinced by this trivia point, though. Stable Diffusion is, effectively, a lossy compression of the training data. Nothing says lossy compression algorithms can't exploit some higher-level conceptual structures in the inputs[0], and applying lossy compression to some work doesn't automatically erase the copyrights of the original input's author.
--
[0] - SD isn't compressing arbitrary byte sequences, it's compressing images - which is a small subset of all possible byte sequences as large as the largest image used in training. "Less than two bytes per source image, on average" doesn't sound to me like something implausible for a lossy compressor that is focused on such small subset of possible inputs, and gets to exploit high-level patterns in such data.
> the example was more about highlighting that "a giant table of token frequencies" by itself is probably reversible back to original data, or at least something resembling it
That depends entirely on how many frequencies you're keeping.
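To make that concrete: keep all of them and the table is a lossless re-encoding; throw most of them away and it becomes a lossy one whose fidelity tracks how many coefficients survive. A quick numpy sketch of that spectrum (again with a random stand-in for an image):

    # Sketch: truncating the frequency table turns a lossless re-encoding
    # into a lossy one; reconstruction error grows as fewer coefficients are kept.
    import numpy as np

    rng = np.random.default_rng(1)
    image = rng.random((64, 64))                       # 64*64 = 4096 coefficients total

    spectrum = np.fft.fft2(image)
    magnitudes = np.abs(spectrum)

    for keep in (4096, 1024, 64):                      # coefficients retained
        threshold = np.sort(magnitudes, axis=None)[-keep]
        truncated = np.where(magnitudes >= threshold, spectrum, 0)
        recovered = np.fft.ifft2(truncated).real
        error = float(np.mean((image - recovered) ** 2))
        print(keep, round(error, 4))                   # error drops as 'keep' grows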
> high-level patterns in such data
High level patterns across thousands of images are generally not copyrightable.
I might even describe the purpose of stable diffusion as extracting just the patterns and zero specifics.
>"Less than two bytes per source image, on average" doesn't sound to me like something implausible for a lossy compressor that is focused on such small subset of possible inputs, and gets to exploit high-level patterns in such data.
Two bytes would only let you uniquely identify ~65k images though, which to me doesn't sound plausible for a lossy compressor.
> So there is some precedent that courts do look at the "utility" or "sufficiently transformative" aspect when weighing copyright infringement.
Curiously, from the article, copyright infringement is not alleged:
> As a final note, the complaint alleges a violation under the Digital Millennium Copyright Act for removal of copyright notices, attribution, and license terms, but conspicuously does not allege copyright infringement.
Perhaps the plaintiffs are trying to avoid exactly this prior law?
https://www.google.com/search?q=google+books+%22is+transform...
But courts in Europe may judge things differently.