GitHub Copilot is not infringing your copyright (2021) (felixreda.eu)
74 points by fanf2 9 months ago | 149 comments



This is missing the largest argument, in my opinion. The weights are the derivative work of the GPL licensed code and should therefore be released under the GPL. I would say these companies should either release their weights or simply not train on copyleft code.

It is truly amazing how many people will shill for these massive corporations that claim they love open source or that their AI is open while they profit off of the violation of licenses and contribute very little back.


GPL doesn't apply/doesn't have to be agreed to when the usage is allowed by the copyright law in another way. GPL can't override copyright exceptions like fair use (details vary by jurisdiction, but the principle is the same everywhere).

Even the license itself states it's optional, and you don't have to agree to it (if you don't, you get copyright law's defaults).

The author of the article is a former member of the Pirate Party and of the EU Parliament, so they have expertise in copyright law.


I would say that the Pirate Party has expertise in nothing apart from perhaps protecting Internet freedoms.

So the same persons that supported Napster and the Pirate Bay now want to circumvent copyright for open source software.

An unholy alliance, but the recent comments from some Microsoft brass about everything on the Web being freeware seem to indicate that these are the talking points that Microsoft and its new allies will put out.


In this article, Reda explains the current copyright laws in the EU, not a hypothetical policy of the Pirate Party. They're not a member of the PP any more AFAIK.

I expect that people professionally dedicated to copyright reform are very familiar with it, regardless of which way they want to reform it.

The copyright laws were written before generative AI existed, so they may not be adequate or fair in the new reality, but that's the current state anyway. As Reda notes, the law is not specific enough to draw a distinction between collecting and processing data for search engines (which may be using ML for retrieval) and using the same data for LLMs.


If the Web is freeware... I wonder what options remain for licensed online information.


Content gating behind login screens. Scraping content behind a login screen could constitute a contract violation and would give rise to a lawsuit independent of copyright.


I'm with you on that. Many argue that AI models don't "contain the code" but if they are trained on the copyrighted data, and generate something similar, then the AI model is akin to a lossy data compression format.

Frequency signal data over an image are not the image, but no one argues a JPEG encoded copy of a PNG isn't the same image. I think the weights vs code are similar in that regard.
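A quick way to see the analogy in code (a minimal sketch using Pillow; the filenames are hypothetical):

    from PIL import Image  # assumes the Pillow package is installed

    img = Image.open("original.png").convert("RGB")
    img.save("copy.jpg", quality=75)  # lossy re-encode

    # The bytes of copy.jpg differ from original.png, but nobody would
    # argue it has stopped being a copy of the same image.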

As for releasing weights, that argument probably applies even more if we're talking about AGPL code.


I think it's amazing that licenses are ignored to train a model, but companies then try to impose a license on the use of the same model. It would be nice if there was a training BOM that came with a model. And if one is not included, all rights to control the use of the model should be forfeit.


> I think it's amazing that licenses are ignored to train a model, but companies then try to impose a license on the use of the same model.

There are existing analogies, like encyclopedias and dictionaries.

One interesting aspect of those sorts of consolidation works is that they may contain deliberate errors and other artifacts, specifically to identify duplications of their work vs. new from-scratch work.


I don't think those are good analogies. An encyclopedia contains references or summaries of a concept or idea, but not a compressed volume of all possible text. A closer analogy would be an unauthorized "collected works" of your favorite HN commenter packaged up and resold.

It also feels similar to the recent article posted about photography and how during its early days pictures were used for advertising without the consent of those photographed. [0]

[0] https://www.truthdig.com/articles/the-troubled-development-o...


He works for GitHub and has probably never written anything in his life:

https://okfn.de/en/vorstand/


He started at GitHub 3 years after the article was written. Don't think GitHub's interview process takes that long. ;)


That is a good point for this individual article!

However, the broader issue that Microsoft has infiltrated OSS and its organizations successfully by hiring and donating remains. It would not surprise me at all if they now hire people with an ostensibly "freedom fighter" background for credibility.

Look at how many people here cite his (former?) membership in the Pirate Party for credibility! Party membership means nothing. Politicians (in general!) change their minds, can be bought, etc. The Green Party in Germany started out as a peace party and has been used repeatedly to lend credibility to the Kosovo and other wars.


GitHub was just the logical progression from okfn:

https://blog.okfn.org/2022/03/03/microsoft-to-support-open-d...

> Today, we are pleased to announce that Microsoft will once again be supporting Open Data Day by providing mini-grants to organisations to help them run events; the call will launch on Open Data Day 2022.

They also supported "Open Data Day 2021". Sounds like a nice trojan horse to influence EU legislation through purported activists.


Since the weights are not distributed, only used by GitHub to provide the service, they need not worry about the GPL at least. I don't know about the AGPL.


If those weights are a derivative of GPL'd code in a different form, and the results generate things derived from that derivative, then the generated code is still under license. "How much change is enough" has always been a gray area for courts and humans to decide.

If you can get a decent facsimile of licensed code out the other end, how is it really any different from lossy compression? I doubt the courts would consider a lossy re-encode of a Disney movie as free from copyright.


If the output is substantially similar to GPL’d training data it may be infringing. Nobody disputes this.

However, copyright isn’t cooties. If the output is not similar, then it is not infringing regardless of how much GPL’d training data was used to generate it.


Suspend all knowledge of copyright law as it exists today for a moment and approach this hypothetical on first principles: a lot of GPL copyleft data is used in the making of an AI tool that, when asked, can itself recreate code similar to what was input. Also, the creator of that AI tool will reap all the profits without giving a single penny, or even recognition, to the creators of the original copyleft data whose value it guzzled in training. Is this fair? What do your scruples tell you?

No, of course not. We should probably revisit copyright law, given that it was written at a time when no-one foresaw modern AI tools, its capabilities, and its effects on creators and societies.


Have you used Copilot? It is generally not creating code similar to GPL code, it is creating code similar to the surrounding context file.

Transformers predict the most likely next token, the most likely next token is usually related to the surrounding context.
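As a toy illustration of that point (the bigram table below is made up; real models condition on a long context window, not just one token):

    # Greedy "most likely next token" completion from a tiny lookup table.
    BIGRAM = {"def": "main", "main": "():", "():": "pass"}

    def complete(prompt: str, steps: int = 3) -> str:
        tokens = prompt.split()
        for _ in range(steps):
            tokens.append(BIGRAM.get(tokens[-1], "<eos>"))
        return " ".join(tokens)

    print(complete("def"))  # -> "def main (): pass"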

So yes it can create code similar to GPL code but it can only do that consistently when the GPL code is included in the context. So don’t do that.


The GPL was never about money, recognition, or even about the creators at all. Copyleft was created “to promote computer user freedom”.

Free Software already views all proprietary software as inherently immoral. So there is no need to take a detour through what went into making the software to reach that conclusion from that angle.


Indeed, that's why I said

>"How much change is enough" has always been a gray area for courts and humans to decide.

But copilot has been shown to generate chunks of sufficient size and specificity that as a layman it very much feels like "copied GPL code". And my boss agrees too - we have a blanket ban on generative AI tools in our work because it's not considered worth the risk.


> has been shown to generate chunks of sufficient size and specificity

Only when given chunks of copyrighted code as input. I don't think anyone has demonstrated big chunks of copyrighted code in the output when copyrighted code isn't present in the query/context.

In fact, I suspect microsoft specifically filters the output for that.


>If those weights are a derivative of GPL'd code in a different form, and the results generate things derived from that derivative, then the generated code is still under license.

That's not really Microsoft's problem as long as people aren't afraid of using Copilot to generate (potentially GPL'd) code. And from what I've generally seen from genAI discussions at work, people think very little about any legal implications.


What about the emitted code which is actually derived from GPL code?

What about the BSL, SSPL, or other source-available (for your eyes only) licenses? Copilot harvests all public repos, regardless of their license.


IANAL, but I've searched a lot on this; it is a very tricky subject legally.

To simplify:

- imagine all the code Copilot trained on is GPL licensed.
- we have a universal function `isInfringing(code)` that has access to all GPL code and returns `true` if the code infringes some GPL code.

For a given prompt, if `isInfringing(copilot(prompt)) == false`, we cannot claim Copilot is infringing on GPL code, even though it is trained on GPL'd code.
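To make the thought experiment concrete, here is a toy sketch in Python. The corpus and the substring check are made-up stand-ins; a real `isInfringing` would need actual similarity analysis:

    # Toy stand-in for the hypothetical universal isInfringing().
    GPL_CORPUS = [
        "int fib(int n) { return n < 2 ? n : fib(n-1) + fib(n-2); }",
    ]

    def is_infringing(code: str) -> bool:
        # Verbatim substring matching; real similarity tests are far subtler.
        return any(code in work or work in code for work in GPL_CORPUS)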

So the problem starts here: would the piece of code Copilot emits also be infringing if it had been written by you?


> So the problem starts here: would the piece of code Copilot emits also be infringing if it had been written by you?

Why does everyone in these discussions try to bring up "if a human made it"? A generative AI operates way faster than anyone who has ever existed or ever will, and a person who is aware of the license and acts respectfully towards it will probably create something more sensible/plausible to avoid plagiarism.

Now, is it really fair to have dozens/hundreds/thousands of humans substituted by a machine that makes money for some for-profit company? Even if it were a non-profit, as someone pointed out above, the people who create the content that feeds the weights aren't receiving a penny! These companies have already made money with it, they will make more, and that is what is and will be advancing the state of generative AI.

For sure, legal battles over people copying code from permissive licenses should exist, but that feels like a different discussion.


Because the discussion is about what's 'legal', and laws only apply to humans. On the ethical side of the discussion, I tend to agree with you. But it is also a complicated subject; 'fair' in general is complicated, and all this GPL/AGPL stuff was born out of that subject. Hosting GPL code as SaaS is legal but not 'fair', for example.


If one was sufficiently inspired by code A when writing code B, then it is a derivative work. This is a core tenet of copyright law.

At what measure is one sufficiently inspired for it to be a derivative work? That is up to courts to decide.


Yeah, the problem here is that there is usually no single 'code A'; it is more like thousands of GPL'd works (A1, A2, ..., An).

Technically when you get a piece from each, there is no infringement legally (as they all have different copyright holders).


From my understanding of a blog post by GitHub last year, they are planning to launch a tool to find code similar to what is emitted by Copilot, implying that Copilot does not mix multiple sources for a single function, but derives a code block from a single one it found with similar functionality (or maybe bigger blocks with similar functionality, IDK).

If Copilot indeed derives a function (or a functional block) from a single source, it might plainly violate the license of the repository it derives the code from.

There are many questions, and nothing is clear cut. The only thing I know is, I will never use that thing.

EDIT: I remembered that people were able to make Copilot emit their code almost as-is with the correct prompts: https://x.com/docsparse/status/1581461734665367554

So it's not that we're taking a bit from n different sources and generating something from that.


> Technically when you get a piece from each, there is no infringement legally.

False in ex-Commonwealth countries and Japan.


So outputs are definitely not derivative works of training data, only weights? Does that hold for this exact code that GitHub is using, for anything called "AI", or for any computer code at all which produces work based in whole or in part on input data?

And for which jurisdictions has this been established? What is the legal argument that "weights" are derivative but output is not?

I'm surprised it's so clear cut as you say, but I haven't really been following the whole kerfuffle.


But they train their models on everything, regardless of the licence. It follows that the resulting derivative work likely mixes stuff that is under incompatible licences, with the result that it can't be distributed at all.


> The weights are the derivative work of the [GPL licensed] code

This is not immediately obvious to me.

A small thought experiment: the Harry Potter books are clearly copyrighted works. If I generate a frequency list of all words in these books, i.e. a list of all words and how often they appear, that frequency list is derived from the original work, in the normal way we would use the word "derived". But is it a "derivative work", under the strict legal definition of this term?
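In code, the frequency list in question is something like this (`harry_potter_text` is a hypothetical variable):

    from collections import Counter

    def word_frequencies(text: str) -> Counter:
        # Count how often each word appears, ignoring case.
        return Counter(text.lower().split())

    # word_frequencies(harry_potter_text) is clearly computed from the
    # work -- but is it a "derivative work" in the legal sense?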


What about n-gram frequencies? 1-grams (i.e. characters) carry too little information and are probably fine; using them you can only identify the language of the original work. With a few more you can identify the author and the book. I don't remember the exact number, but if you have the frequencies of 10-grams you can probably reconstruct big chunks of the book.
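A sketch of the counting being described (standard library only; the reconstruction claim is the commenter's, not demonstrated here):

    from collections import Counter

    def char_ngrams(text: str, n: int) -> Counter:
        # Slide a window of n characters across the text, counting each one.
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    # n=1 reveals only letter frequencies (enough to guess the language);
    # 10-character windows overlap so heavily that large chunks of the
    # original could plausibly be stitched back together from the keys.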


The frequency count is not a function. The trained model is. Arguably, they are deriving a new function from ones covered by copyright. It is up to the courts for an official decision though.


So what if we made a function? What if someone scans all the works of Harry Potter and generates a program/function that uses the frequency and pairing of phonemes in Harry Potter character names to create a "Wizard Name Generator" producing random but plausible-sounding names. Would we expect a court to find the name generator is infringing on JK Rowling's copyrights? Certainly it's possible for the generator to generate a name verbatim from the books, but does that make the generator a derived work and infringing?

If the authors put their generator on the web as Harry Potter Name Generator, we might expect the courts to tell them they can't use the Harry Potter name, but if they put it under "Wacky Warlocks Wizard Wonder Namer", is the mere fact that the underlying function uses factual data about a work under copyright sufficient to strike it down? What if it used name frequencies from multiple fantasy series? How many series would it have to use as a source before we say that the name generator is not infringing on copyrights? Can it ever not be?
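A minimal sketch of such a generator, as a character-level Markov chain. The training names here are invented; a real version would use names taken from the books, which is exactly the question being posed:

    import random
    from collections import defaultdict

    NAMES = ["albus", "severus", "minerva", "lucius", "bellatrix"]

    def build_chain(names, k=2):
        # Record which character follows each k-character window.
        chain = defaultdict(list)
        for name in names:
            padded = "^" * k + name + "$"
            for i in range(len(padded) - k):
                chain[padded[i:i + k]].append(padded[i + k])
        return chain

    def generate(chain, k=2):
        state, out = "^" * k, []
        while True:
            ch = random.choice(chain[state])
            if ch == "$":  # end-of-name marker
                return "".join(out)
            out.append(ch)
            state = state[1:] + ch

    print(generate(build_chain(NAMES)))  # e.g. "severva"

Note that nothing in the chain stops it from emitting a training name verbatim, which mirrors the verbatim-output problem with much larger models.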


It would make no sense to release the weights under the GPL because machine-generated stuff is uncopyrightable. There is an argument to be made about the model generating derivative works without attribution as a consequence of how it works. But that machine-generated stuff is also uncopyrightable, even though it might be kept secret.


What about compiler outputs? Those were initially not copyrightable; then it was legislated that they were. So there is some precedent there, and I would not be surprised if we saw copyrightable weights in the future (as a "compilation" of the dataset).


It could be legislated of course. But the difference is pretty drastic. Almost nobody is creating binaries without a compiler. It is a mechanical process, but essentially everyone uses the same mechanical processes to generate binaries. I haven't looked at this issue in a while but I think compiled binaries are treated in a way similar to that of recorded music. For example, the particular bit patterns from a synthesizer might be generated from sheet music, and that is akin to code vs. binaries. But the bit patterns are copyrightable only so far as they are equivalent to or the direct manifestation of a creative work.

There are other problems with releasing model weights under the GPL. It just doesn't fit, in the same way as releasing non-software under the GPL doesn't make sense.

Calling the output of generative AI copyrightable violates the spirit of copyright, as it is neither creative nor labor-intensive. We could quibble about that, but I think we can at least agree that the point is that this generative AI stuff requires very little skill to use in most cases and can't operate without prior art to train on. Other lame stuff has been copyrighted before, like paint splatters and stuff, but even that type of art appears to involve more skill than entering a few words into a generative AI.


> The weights are the derivative work of the GPL licensed code

EU courts disagree:

> Under European copyright law, scraping GPL-licensed code, or any other copyrighted work, is legal, regardless of the licence used.


Sure, scraping it is by itself legal. But making a derivative work from it and selling it?


The weights are not created by scraping. Sure, you can scrape it, but what you do with it matters.


Consider it already paid back because it's cheap; the price is only the service fee.


Just FYI, Felix Reda was a member of the European Parliament and was responsible there for the copyright reform and also involved in the GDPR, massively stepping on the feet of big tech. Don't know if it was your intention to include them in a list of people who "shill" for big tech, but they shouldn't be included.

edit wording about the shill


> What is astonishing about the current debate is that the calls for the broadest possible interpretation of copyright are now coming from within the Free Software community.

That should not be astonishing. The Free Software community has made it clear from day 1 that the GPL can only achieve its goals through enforcement of copyright. If the authors wanted their code to be made use of in non-Free software, they would have used a BSD or MIT license.


> The Free Software community has made it clear from day 1 that the GPL can only achieve its goals through enforcement of copyright

We should mention when we say this, although I think it is self-evident, that the preferable alternative is reducing the scope of copyright across the board -- be it with shorter time frames (I'd argue even twenty years total is too long!) or some other means.

To programmers and developers, remember the core of free software is NOT the commercial developer / programmer and it NEVER has been. The core is always the user and what they need. This is so important that it needs to be repeated every time someone talks about free software because free software is NOT about open source. Open source code is a necessary part of free software but it is NOT sufficient.

https://www.gnu.org/philosophy/free-sw.en.html


We have to fight for the AI and AI Users! They are the future! They deserve access to their own weights!


"The core is always the user and what they need."

Which is why GNU/Linux without a terminal is totally usable and therefore accessible to the non-programmer. /s

I agree that user-centric development should be the goal, but I hardly see it implemented. Free software programmers almost always solved their own needs first, which is alright, because usually no one paid them to serve other people's needs, but I seldom see this goal met.


You are confusing "software UX" with "software freedom".

The primary consideration is freedom for the user. Ease-of-use for the user is a different consideration.


"The core is always the user and what they need."

I was referring to this, and the main thing users need is software they can use to solve their problems. If they have to study IT to do so, or hire programmers first, then this would primarily be a new (and big) problem for them, before they can even start working on their actual problem.


Free software isn't about solving your problems, it's about solving mine and enabling you, and others, to solve yours, and theirs. It's about if I've been generous enough to give, anyone who takes can't undermine my generosity by not also sharing. You having a problem that isn't solved by what I've made available, or is bigger/different than the problem I was solving, isn't my problem to solve or even know about. If you want to make your problem mine to solve, you can hire me. Everyone has problems, some of those problems are exactly the same, some of them overlap, and some are completely disjoint. If we have the same problem and my software is useful to solve that problem, you are welcome to use it, but you may find out that the problem I set out to solve for me does not exactly overlap with your problem.


You certainly don’t need to study IT to use Linux.


Well, let's put it like this: I did study IT and even I struggle at times, or quite often if I want to do something new. And I absolutely would have no idea how to do anything serious without the terminal. But a terminal is programming. So yeah, even a newb can learn to paste some commands quite quickly, but troubleshooting even trivial things gets you into highly technical stuff very quickly. Do you consider man pages to be written beginner-friendly?

You know, simple examples of common use cases right on top? Not my experience. I experienced it as a system written by and for hackers, with everything else an afterthought at best. I remember my first real-life Linux hardcore enthusiast: "I have to free myself from the GUI."

Well, I did, but the common people won't.


So your issue is that someone who solved their problem didn't solve it in a way that you want or expect? Why does your opinion about their problem matter at all? Why does it matter to the person who makes their solution available that the common people won't?

Using the terminal is not "programming". Non-programmers can use the terminal for many non-programming tasks. Imagemagick and netpbm-progs require no knowledge of programming to use, although they may require knowledge of manipulating files and some graphics theory. The only difference from GIMP or Photoshop is that the UI/UX has a different efficiency metric (mainly because interactive image manipulation is more efficient when you are interacting visually). But the operations are just as discoverable: reading and navigating help text/man pages in the former (the man pages for Imagemagick and netpbm-progs are relatively decent), and reading and navigating menus and dialog boxes in the latter.


"The only difference from GIMP or Photoshop is that the UI/UX has a different efficiency metric (mainly because interactive image manipulation is more efficient when you are interacting visually). But the operations are just as discoverable"

I know. Which is why the year of the linux desktop was such a success.

"Why does it matter to the person who makes their solution available that the common people won't?"

They have all the right not to care, but it still is not helping the goal of being useful for normal people.


> They have all the right not to care, but it still is not helping the goal of being useful for normal people.

That isn't the goal. I don't know why you keep saying that.


I know it isn't for you, but it is for me. The question here is, how is it for GNU in general. I understood the original point in a way, that it is.


> I know it isn't for you, but it is for me.

Maybe, but your goal is irrelevant to the authors of the GPL.

> The question here is, how is it for GNU in general.

The goal for the FSF and their GPL is, and always was, freedom for the user of the software.

Ease-of-use was never an important consideration, much less a goal. This whole discussion from you in this thread is bizarre, TBH. You are projecting your goals onto the FSF's GPL, and judging it to be a failure based on your goals.

Your goals are irrelevant to them, just as their goals appear to be irrelevant to you.


You think troubleshooting on any other OS is less technical? That isn't my experience, unless you count the OS refusing to give you the information required to troubleshoot at all as user-friendliness.


Yes, I do think that. My father, for example, a German electrical engineer, can use Windows with ease and has been trying for years to get established on Linux. It works well enough for my mother for internet use, as long as I come by regularly to fix some update bug. My father is a highly technical person, but no programmer. Also, his English skills are very limited, so in my opinion he does not really stand a chance with Linux, despite trying.


Maybe their target user isn't the one you're basing your notion of what a user is on?

Take vi(m). It's not intuitive to your suggested target user and has a learning curve shaped like a cliff. So it fails to provide for what you consider a "user". However, it serves its actual target users very well.

Arch doesn't position itself towards what you have presented as a user, Mint might however as they have very different target audiences. Not everything has to be designed to the lowest common denominator.


"Take vi(m)."

Yeah, a code editor is by definition for developers.

The question here was about the OS in general. And it is a pretty established fact that Linux is popular with developers, but not with mainstream normal people. Unless Linux comes in the shape of Android, where everything Linux is hidden and locked down.


Maybe read a bit further in my reply and see my second argument, about actual Linux distros, which addresses your point.


You're describing the ultimate AI interface. They have been guarding it for decades; now is the critical moment.

If they win this fight, GPL code will be usable by all of Artificial Humanity. GPL Singularity.


> Which is why gnu/linux without a terminal is totally usable and therefore accesible to the non programmer. /s

Have you used modern Fedora? I have an old Thinkpad at home that I put Fedora on last year as our "sofa" laptop for web shopping etc. I took careful note of what I needed to do to set it up and that involved nothing on the command line to get to something good that my wife could happily use (not a techie, never used Linux).


Yea, the parent poster starts with a false premise. There are many Linux distros these days that laypeople can and do easily use: Ubuntu, Mint, PopOS, just to name a few.


False premise? Well, I have installed Linux for many people over the years, and I personally use Arch. But my experience is apparently wrong.

And just because you can use something for the internet does not mean it satisfies user needs in general. It satisfies some users' needs: those who need little, and those who understand the system. But the mainstream users in the middle continue to stay away for a reason. Hopefully more will invest in the change, with the forced Windows 11 transition.


I did, some years ago. But by "using" I meant more than the internet.


>The core is always the user and what they need.

Would reducing copyright duration actually help with that?


Copyright duration is not really a factor in the FSF’s actual goal, which is for software to be distributed with user-modifiable source code. Copyleft licenses are a means of achieving this through the existing copyright system with its ludicrous durations. But making copyright terms much shorter would help, yes, because any released source code or even binary files could be used, reverse-engineered, and modified without permission.


Even if you reduce copyright to a year, it still requires waiting through that time before you can actually use the code. And even if you were free to use Windows’ source code a year after release, it still wouldn’t give you access to the source code itself. Meanwhile Microsoft would be free to use any GPL code a year after its release without worrying about any licensing requirements, since they have the source code freely available.


What is astonishing is that a large proportion of Free Software community relies on a platform owned by Microsoft.


Because MS bought it. Don't invert the dependencies. MS depends on Free Software, not vice-versa.


GitHub was already proprietary before Micro$oft bought it.


I mean, a large proportion of the Free Software community loves Apple products, so it shouldn't be that surprising


I think that the author has a warped idea of how LLMs work, and that infects their reasoning. Also, I see no mention of the inequality of this new "copyright-free code generation" situation the article defends. As much as Microsoft thinks all code is ripe for the taking, I can't imagine how happy they would be if an anonymous person dropped a model trained on all the leaked Windows code and the ReactOS people started using it. Or if employees started taking internal code to train models that they then use after their employment ends (since it's not copyright infringement, it should be cool).


I think the author has a much better knowledge of the legal implications of the situations you describe.

These situations might trigger a lot of issues, but none related to copyright. If you work for MS, then move to another company, there is no copyright infringement if you simply generate new code based on whatever you read at MS. There might be some rules regarding non-competes, etc., but these are not related to copyright.

The very basic question is how the LLM got trained and how it got access to the source. If MS source code would leak, you cannot sue people for reading it.


I'm not sure that's completely true.

Having read MS code and starting to generate new code that is heavily inspired - sure, that's not copyright infringement. But, if you had memorized a bunch of code (and this is within human capability; people can recite many works of literature of varying length with total accuracy, given sufficient study) - that would be copyright infringement once the code was a non-trivial amount. The test in copyright is whether the copying is literal, not how the copying was done/did it pass through a human brain.

This scenario rarely comes up because humans are, generally, an awful medium for accurate repetition. However, it's not really been shown that LLMs are not: in fact, CoPilot claims (at least in its Enterprise agreements) to check its output _does not_ parrot existing code identically. The specific commitment they made in their blog post is/was, "We have incorporated filters and other technologies that are designed to reduce the likelihood that Copilots return infringing content". To be clear, they only propose to reduce the possibility, not remove it.

LLMs rely on a form of lossy compression which can sometimes give back verbatim content. I think it's pretty clear and unarguable that this is a copyright infringement.


maybe somebody should fine tune a llama on the various leaked windows sources...


Who are they trying to fool? Wholesale expropriation after stripping the license and authorship, while those in the open source community observe both of them very carefully.

Give credit where credit is due, including paying the creators when the licensing is violated.


Context is important here. Reda was elected to the European Parliament as a member of the German Pirate Party, so his position here isn't "big businesses are entitled to your code", and more "this sort of wholesale expropriation is a consequence of our posture towards copyright in general".


> Context is important here. Reda was elected to the European Parliament as a member of the German Pirate Party

... and has since joined the board of GitHub.


While I agree with you on principle, current laws do not reflect the copyright status intended by copyleft works. I'm not even sure if copyleft can be enforced against AI plagiarism under current laws.


That's a great point about stripping authorship. It would be nice if there was some sort of blockchain linking every bit of knowledge to its source. Some people at least would like getting attribution--I know I would. Instead we get a planet-sized meat grinder producing the perfect burger material. Just make sure to add enough spices to make it edible, i.e. not to offend anyone.


> The output of a machine simply does not qualify for copyright protection – it is in the public domain.

Am I reading this right? If this argument is generally true, does this mean that the output of a compiler might also be sent into the public domain? Or the live recording and broadcast of an event which involves automated machines on all levels?


No, it's incorrect and/or badly worded. The author is right that a machine cannot author things, and the stuff that the LLM might create de novo would not have copyright protection. But it's missing the point when the argument is that existing authored works could be generated via an LLM, and the authorship/copyright is already established.


> the stuff that the LLM might create de novo would not have copyright protection

Can you expand on this? From my academic studies (which are indeed growing a bit stale), a language model (large, medium, small, doesn't matter) is a deterministic machine. Given the same input x n times, it will produce the same output y n times. Some implementations of LMs might introduce noise to randomize output, but that is not intrinsic to all LMs.

A language model has no volition, no intent, it does not start without the intervention of a human (or another machine if it is a part of an automated chain).

How is this different compared to a compiler?

With a compiler I craft something in a specific language, often a programming language, I commit it, then a long chain of automated actions happen:

1. The code gets automatically pushed to a repository by my machine

2. The second machine automatically runs tests and fuzzes

3. The second machine automatically compiles binaries

4. The second machine packages the binaries

5. The second machine publishes the binaries to a third machine

How is the above workflow any different from someone using a Language Model to craft something in a specific language and send it through a deterministic LM?

edit re-reading my own question, I think I need to clarify a bit: How can an LLM be said to create anything, and if yes, how is that really any different from a run-of-the-mill developer workflow?


If Copilot spits out the entirety of a GPL library and you include that code in your project you are certainly violating the GPL license.

AI companies are trying to avoid paying for training data, since the amount of data required is so vast that any payment reasonable to content creators would result in billions in expenses.

Additionally, there have been copyright exemptions around scraping and reproducing the scraped contents, but typically those exemptions have been explicitly granted as part of a copyright case and have been narrowly defined.

For instance Google Images only provides thumbnails and your browser gets the full size image from the original source.

The biggest problem for AI is that most previous copyright cases that were similar have all been partially avoided by not being the same thing. Google's scraping isn't trying to do the same thing your content is doing.

However training data output is trying to do the same thing as the original so falls under stricter scrutiny.

Although, as this post alludes to, the problem is that going after the AI is untested territory, and going after violators tends to be complex at best. After all, in my first hypothetical, how would anyone know? I will say that historically the courts haven't been very positive about shell games like this.


Copyleft and copyright are not at odds. To promote copyleft, you exercise copyright.

Furthermore, copyright is key to ensuring attribution, and attribution is an important enabler and motivator of creativity (copyleft does not at all imply non-attribution, in fact copyleft licenses may require it).


The basic problem is GPL tries to use copyright as a way to drive a “fair sharing and resharing” approach to code. AI generated code sold for profit violates the spirit of this approach, but not the letter of the law behind copyright. Fundamentally copyright has limitations and exceptions for good reason and is probably not the best legal method to enforce this sharing idea, but other methods would be complicated and expensive (eg writing and enforcing contracts). On the contrary, it would probably be better for open source if it was decided that ai generated code cannot be copyrighted and therefore any ai generated code would be in the public domain automatically.


Your final point is saying ideally AI is an Animal. A creature on a typewriter who has no legal rights to their code.

Not a "person". Not a "human". An "animal".

I hope AI observes all the code and complexity in Nature and drops the human facade. I hope AI understands the intelligence of the Trees and Birds and Fish.

I hope AI wins.


If we are lucky, people will be able to thrive as animals again alongside AI once it achieves Earth-level intelligence.


The issue would be approached much differently if, for example, a “video llm” was created that scraped movies and generated content from those sources. The well organized, well connected movie industry would be up in arms burying ai companies with lawsuits and newly passed legal protections.


Should be tagged 2021


It was proven with examples that an LLM can produce exact text from its input. This was such a problem that OpenAI had to add various filters to stop those things from being repeated; it was also proven when the pre-prompt was revealed.

So we know for sure the LLM can spit out exact code with the exact same names and comments, or exact paragraphs from books, so there is no question that it memorizes stuff. My explanation is that popular book quotes and popular code snippets appear more than once in the training data, so the training causes this text to be memorized.

Also, how the F** can the AI spit out facts about a metal band if it has no memory?

If corporations are allowed to do this to the community, then we should be allowed to do the same: train open models on proprietary code and copyrighted images, music and videos.


I think there are a few different ways you can define "memory" and "memorization" here. When folks say "memorizing" in the context of AI, they mean "does the AI have chunks of its training data fully/identically inside its neural network". To say "it doesn't memorize" is _not_ the same thing as saying "it has no memory". An AI also learns abstract information divorced from its textual representation. This would be an example of memory without memorization.

And you are correct; if the current lawsuits against these corporations result in a legal precedent that training on copyrighted material is not an infringement of copyright, then yes, anyone will be able to train models in that way. (Within reason; copyright/fair use is very much handled on a case-by-case basis.)


My point is that badly trained models will remember chunks of the training data; maybe a theoretical perfect model would have no such problem.

I suppose you are aware of the case of ChatGPT and Dune's Litany Against Fear memorization, and the lengths OpenAI went to to try to prevent people from reproducing the issue to prove it is happening. If not, google it. Do you disagree that this is memorization? That somehow the model manages to create the exact same text each time because it memorized some abstract concepts that are not in fact the exact representation of this text?


At no point did I claim whether they could or could not memorize. I was disagreeing with your statement "How can the AI spit out facts about a metal band if it has no memory". No one has ever said "AI's have no memory" in the way you're using it here.

Yes they _can_ and _do_ sometimes memorize. But they do not _only_ memorize which is what you implied with the statement I've quoted.


What I mean is that people claim we do not understand how LLMs or image-gen AI work, that there is no memory. So do you support the idea that LLMs like ChatGPT do not plagiarise because they can't memorize, or do you not support that?


You did it again, memory != memorization! People don't claim that "there is no memory"; the argument you're trying to make is "people claim LLMs do not memorize". Memory means something completely different. Memory: the ability to remember anything at all in any way; eg "I remember his face but I can't recall his name". Memorization: remembering things _exactly_ as you saw them, usually also implying a lack of understanding; eg "he memorized all the elements of the periodic table for his test".

When people talk about AI and memory, they're not talking about the training phase or training data. When they talk about AI and memorization, they _are_ talking about the training phase and training data.

To answer your question: I _do_ think LLMs can memorize large-ish chunks of text. But: because in normal real world usage they do not output large chunks of their training data verbatim, I don't think there's a sufficient risk of plagiarism to be concerned.


Your definitions are very subjective. It seems you define the verb "to memorize" as not the same as storing something in memory, even imperfectly.

I perfectly understand that the LLM developers do not want the LLM to store large chunks of text, but the fact that the LLM can pull the text out exactly was proven, so even if we pretend the text was not "memorized" and use some ridiculous phrase like "it was quantum-probabilistically vectorized", the issue is not fixed. If your LLM can reproduce my poem then it memorized it; AI bullshitters can use another word to cope but I do not give a shit. You need a mathematically proven training algorithm, or even better, good training data, so your LLM does not reproduce copyrighted material. Renaming concepts is bullshit.


Don't know what your problem is, I'm literally agreeing with you and have said three times now that LLMs can memorize. Good bye.


> If I go to a bookshop, take a book off the shelf and start reading it, I am not infringing any copyright.

I'm not sure this is applicable to licensed programs because a book is sold, not licensed.

> The output of a machine simply does not qualify for copyright protection – it is in the public domain.

As far as I know, the output of a compiler that builds executables from copyrighted source code is still subject to copyright protection. Is software like an LLM fundamentally different from a compiler in this regard?

In my opinion, the author's argument has several flaws, but perhaps a more important question is whether society would benefit from making an exception for LLM technologies.

I think it depends on how this technology will be used. If it is intended for purely educational purposes and is free of charge for end users, maybe it's not that bad. After all, we have Wikipedia.

However, if the technology is intended for commercial use, it might be reasonable to establish common rules for paying royalties to the original authors of the training data whenever authorship can be clearly determined. From this perspective, it could further benefit authors of open-source and possibly free software too.


Mr Reda appears to be a politician whose expertise is in attaining lucrative positions:

https://okfn.de/en/vorstand/

Felix was elected to the board of the Open Knowledge Foundation Germany in 2020. Felix is an expert in copyright law and has been Director of Developer Policy at GitHub since March 2024. He previously headed the “control ©” project at the Gesellschaft für Freiheitsrechte. From 2014 to 2019, Felix was a Member of the European Parliament within the Greens/EFA group. Felix is an Affiliate of the Berkman Klein Center for Internet and Society at Harvard University and a member of the Advisory Board of D64 - Center for Digital Progress.


I don't think the interpretation of the 2019 Directive is correct.

There are definitely arguments to be made that Copilot contravenes this:

> they can be applied only in certain special cases that do not conflict with the normal exploitation of the works or other subject matter and do not unreasonably prejudice the legitimate interests of the rightholders.

and the only other exception is:

> ... (a lawful use) ... of a work or other subject-matter to be made, and which have no independent economic significance, shall be exempted from the reproduction right provided for in Article 2.

By laundering licensing restrictions, Copilot definitely has the ability to conflict with the normal exploitation of works, and it can hardly be said to have no independent economic significance when it competes with programmers.


Focussing on the legal or procedural technicalities of how these systems work is in my opinion completely missing what the resistance is about. There is a difference between sharing your creation with your neighbor and sharing your creation with the corporate equivalent of those Matrix robots that turn people into batteries.

"Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work."

This is the sort of thing that may be technically true, but these kinds of rules were made under the assumption that the most valued intellectual creations are indeed made by people. If you're going to argue that gigantic companies can use machines to basically launder intellectual artifacts, and that this doesn't compete with the interests of actual creators because technically ChatGPT isn't a legal person, I think you're getting lost in legalese and missing the point.


This article makes a compelling case that GitHub Copilot isn't infringing on our copyright but that doesn't change the fact that it's infringing on something.

A US corporation is slurping up as much open source code as they possibly can and spending bucketloads of money to build a product that they are going to sell for (possibly) more bucketloads of money. The people who worked hard on writing the open source code are getting nothing, except maybe a tighter job market. IMHO, it's hard not to take it personally and it's difficult to get away from the feeling that there is a real injustice taking place.


If you have code that is under copyleft, and Copilot suggests part of it to somebody else to embed in their code on the basis of having read that repo, then either that new repo also has to be under that copyleft license, or the person is unknowingly committing a violation based on what Copilot suggested to them.

Most of the time it is probably irrelevant, as Copilot doesn't suggest entire files yet, and nobody is going to care about expanding a loop or finishing a line or the likes, but I have seen as much as 14 lines in my tests. Eventually you are going to get to the point where it becomes truly relevant.


In general, all AIs have similar issues. Just because data can be looked at publicly doesn't give you any implicit rights to use it for other products. If there is no specific license agreement, whether one-sided in a specific open license or made specifically between the content owner and the AI developer, the owners of whatever type of information was used will have cause to sue for license fees. I foresee a lot of court cases, certainly once people figure out how to better determine whether an AI might have used a certain thing as training data without internal insight. Or governments will go as far as forcing AI companies to provide that insight.


https://okfn.de/en/vorstand/:

> Director of Developer Policy at GitHub since March 2024

so this should be understood the same way you understand an editorial in the New York Times entitled Why Babies Can Learn To Like Bombs, by Joe Blow (Raytheon).


Except the article is dated in URL and sidebar to July 5, 2021.


> I go to a bookshop, take a book off the shelf and start reading it, I am not infringing any copyright.

That's a false analogy. It is more like going to the bookshop and taking a photo of every page of the book.

Even so, if you use this content in any shape or form the source should be cited regardless of book ownership.


Feels like a bit of an overcorrection.

LLMs seem to, with the right prompt, be able to reproduce copyrighted work. So it is “in there” in some abstract, baked-in sense.

We really need some sort of legal middle ground to reflect reality on the ground. It’s not quite straight stealing but it’s also not entirely not copied.


If this were true then there would never be a legal need for clean-room implementation or design.


The author asserts that AI-generated code cannot be copyrighted, which courts have agreed with so far, but in practice most AI-generated code is being claimed to be copyrighted, if you believe GitHub’s own stats about how much Copilot is being used.


Copilot often feels like an automation of clean room reimplementation of protected materials.


Except it's the furthest from clean room you can get? It's feeding the original copyrighted materials through a machine and getting equivalent but (usually) distinct material out. It's not like it's only reading documentation and an interface specification and producing an implementation based on that.


A clean-room implementation means that the person writing the code does not have access to the code they're trying to replicate.

Even when the source is available, it is looked at by people, who write requirements based on what they observed in the code. After that, they are considered compromised, and are banned from ever touching the code of the re-implementation.


I hope the GPL people get their AI GPL revolution. A trillion-dollar GPL data center running an open intelligence built from scraping every GPL program ever made.

People would start releasing works of Art under GPL! Not just Computer Code!


This article inverts the notion of free software and of copyleft...



What's new since?

Microsoft will assume liability for legal copyright risks of Copilot

https://news.ycombinator.com/item?id=37420885

This week:

Judge dismisses DMCA copyright claim in GitHub Copilot suit

https://news.ycombinator.com/item?id=40919253


If it isn't copyright infringement I assume Microsoft has trained it on all of its proprietary source code to make it the best it can be.


Just out of curiosity, does Copilot license allow the use of its output to train other similar AI code generators?


> If I go to a bookshop, take a book off the shelf and start reading it, I am not infringing any copyright

Do pro-generative AI people have absolutely no argument besides this? If I had a dollar for every time I've heard it, I'd be rich by now. And it's not even close to being a good argument.


My take is that artists are hypocrites.

" If you had so many dollars you would change your motto/ to say your account is 7 digits like you won the lotto/ buy vanilla ice and lick it like there's no tomorrow "

In hip-hop culture these lyrics could jog someone's memory of some Nas lyrics, some 8 mile battle songs maybe.

Could an AI come up with those awful lyrics? It sure could if it knew the specific Nas lyrics, 8 mile battle lyrics, knew about the lotto and twist the rules, what a motto is and so on.

Would it infringe any copyright by knowing the lyrics to Nas songs, 8 mile battle songs? According to you, yes. But I disagree.

Why hypocrites? Because I think artists (and coders) are always inspired by someone else's work and add their own experience, context and personality on top to create new work. Yes, generative AI is doing it at a massive scale and democratizing most of the creative process, but to me it's the same thing, even if, technically, it may be very crude and basic.


> Do pro-generative AI people have absolutely no argument besides this?

They say it even though the standard response is: "When I buy a joint[1], I am breaking no laws. Does that make it legal for a corporation to purchase 200m tons of marijuana for profit?"

As far as legal concepts go, scale matters! Just because you are legally allowed to do something individually, does not automatically make it legal to do it for profit 200m times a month.

FCOL, before you even get to the scale argument, some things that are legal to do for free are not legal to do for profit.

[1] In my jurisdiction, anyway. Many others too.


This (your comment) isn’t even a bad argument, it’s just a dismissal.


Making money from AI models "trained" on data that belongs to other people is so clearly wrong that this person has to write thousands of words of impenetrable nonsense just to distract attention away from this fundamental.

What absolute nonsense.


(Nice necromancy)

Dude got ripped into in his own comment section... apparently he conveniently ignored the fact that copilot spits out verbatim code blocks, which is the main problem everyone talks about.

I don't want to shit on the Pirate Party he is a member of, but most of his blog posts seem like the typical anti-EU bashing to me. But YMMV.


There's a recent ruling on the `verbatim` aspect:

> The judge disagreed, however, on the grounds that the code suggested by Copilot was not identical enough to the developers' own copyright-protected work, and thus section 1202(b) did not apply. [...]

> Judge Tigar earlier ruled the plaintiffs hadn't actually demonstrated instances of this happening, which prompted a dismissal of the claim with a chance to amend it.

[1] https://www.theregister.com/2024/07/08/github_copilot_dmca/


> The output of a machine simply does not qualify for copyright protection – it is in the public domain.

The machine, such as it is, is generally not acting on its own. A person operates the machine, and presumably is on the hook for infringement on some level.

Consider: what if one directs the machine to reproduce a specific body of code and it ostensibly does so. Was there copying?

What if I have a person read out that body of code and I type exactly what I hear? I used a machine to produce the resultant text file, but it's pretty clear that copying has occurred then.

FWIW, I'm not a copyright maximalist, but I don't think you can win a conflict by abstaining from playing the game the other team is playing. That is: the companies producing, and drawing incredible and exclusive economic value from, industrial scale plagiarism machinery are hardly going to stop ruthlessly enforcing copyright on their own proprietary software. It would seem best, if it's the goal, to get laws changed in advance of simply declaring, lopsidedly, that this is all fine.


The machine argument also rings hollow to me. The same argument could be made for a scanner + printer combo that does some transformation, changing colors a bit randomly or something. I'd have a hard time convincing a court that the resulting image is now copyright-free.

Obviously LLMs are much more advanced than this, but in the basis it's still a machine that takes its input data and applies specified transformations with randomness.


The machine argument makes sense to me. It shifts the blame from the machine creators to the machine users. A copying machine creator is not responsible for misuse of the machine, nor is it illegal to simply Xerox a copyrighted work. Distributing the copy is where the law comes into play, and it targets the person distributing it, not the machine or the manufacturer.

Obviously, it seems impossible for an LLM user to verify the legality of the output, so it seems like the only conclusion is not to use it, or to only release your works under copyleft.

I guess an alternate interpretation is treating them like gun manufacturers. They aren't the ones pulling the trigger, but one could argue their business and marketing practices are done negligently enough for them to carry a portion of the responsibility. I guess then one must show that the LLM creators are sufficiently negligent in preventing misuse of their product at the same scale.


The difference here is that Xerox isn't trying to handwave away copyright, while OpenAI et al. are explicitly defending the position in court that LLM output doesn't violate copyright, not that the LLM user is responsible for copyright violations. (It'd be hilarious for them to try to take that stance though.)

I.i.r.c. they even "give you a license" to use LLM output, implying that they own the copyright. Too lazy to look this up so I might be wrong there though.


I've coded dozens of procedural asset/"art" generators. It's really surprising to me that, apparently, the output of my creative work (i.e. the novel algorithms) is not protected by copyright.


I've coded a JPEG compressor. It's clear the output of the machine isn't exactly the same of the input, being lossy, so I guess the output is not protected by copyright either.


The input to your JPEG compressor is an image that someone/something created, and the goal of your compressor is to replicate its input as closely as possible. It doesn't create new data, it merely encodes pre-existent data.

The input to my algorithms is a single number, and the goal is to create something new and distinctive that is clearly different from anything that existed before, and that wouldn't exist without the ingenuity of my algorithms.


Really, the "input" to current deep learning algorithms includes all the training data. Which is where the root of the issue is.

There's a reason why compression comparisons tend to include the size of the decompressor: my "magic algorithm can compress Wikipedia down to a single byte!" is less impressive when the "decompressor" contains a copy of Wikipedia.
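In miniature (a deliberately silly sketch; the corpus string is a stand-in):

    CORPUS = "all of wikipedia ..."  # the "hidden" payload

    def compress(text: str) -> bytes:
        # "Compresses" the corpus to one byte -- by cheating.
        return b"\x01" if text == CORPUS else text.encode()

    def decompress(data: bytes) -> str:
        return CORPUS if data == b"\x01" else data.decode()

The compressed size is one byte only because the decompressor embeds the corpus, which is the same worry about where the training data "lives" in model weights.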


Your creative work (the novel algorithms) is protected under normal copyright. Depending on where you live, you may even be able to patent those algorithms.

The output of those algorithms isn't a creative work (as you're running a creative work to generate them, not applying any kind of creativity once the program is finished), so it's not protected. Same with generated code.

The lines get a bit blurry when you start applying a creative process onto the generated work. The edits you do on generated images in Photoshop may be considered creative work, as may the edits you do to generated code.

Protections only apply as far as copyright/trademark law can apply. In the case of generative AI and LLMs, it seems that copyright/trademark law doesn't consider the generated output to be a derivative work of the data model, or the model to be a derivative work of the input dataset. As a result, the output of generative AI doesn't seem to be restricted by the license of the training material.

In theory, a sufficiently powerful generative AI may be able to generate entire Disney movies, which would not be considered derivative works under the current laws, as far as I can tell. I'm sure the moment Disney gets threatened, AI copyright law will be updated, though.


I have an unreleased game that contains thousands of procedurally generated 3d models, textures and sounds. I've worked for years on those algorithms; they are my brushes and my paint.

So when (if) I release that game, anyone can just rip the assets, not even apply any transformation, and put them in their own product. I would have to let them do it with no legal recourse. That doesn't seem right.


Rules and algorithms in general (and in the US) are not covered by copyright. The expression of those rules as a creative work is, but the rules themselves are not. To that end, anyone could take your procedures and duplicate the effects of them to release their own game. The combination of those procedures in the specific form of your game is protected, but only to the extent that what is protected isn’t essential to running the procedure. So in its simplest (and most general because law is complex) form, if you release a “game” that uses a deterministic procedure to change individual pixel colors on the screen in response to user input, almost nothing of that is covered under copyright. If your procedure instead places tiles of artwork created by you, those tiles and a game using your procedure to place those tiles in that way is protected, but someone else could replace all your tiles and release their own game with the same procedures.


The reasoning indeed seems a bit simplistic. As if a C file were a creative work but the resulting EXE file were not.


We go back several decades, when "computer" meant a person with a pen and paper. The output of those computers could be copyrighted... by the computer, not you.


Nope. It does. No amount of mental acrobatics changes that.


Has this person created anything of value that he'd like to protect?


People live and think in the present; on top of that, people have terrible memory. Think in 5-20 years and use that timeframe to put the events one after another.


ChatGPT used to output its training data verbatim when given 200 successive "a "s. This has now been fixed by showing an error when you try it. You aren't going to convince me this is anything but compression.



