Judge dismisses DMCA copyright claim in GitHub Copilot suit (theregister.com)
381 points by samspenc 15 days ago | 445 comments

> Indeed, last year GitHub was said to have tuned its programming assistant to generate slight variations of ingested training code to prevent its output from being accused of being an exact copy of licensed software.

If I, a human, were to:

1. Carefully read and memorize some copyrighted code.

2. Produce new code that is textually identical to that. But in the process of typing it up, I randomly mechanically tweak a few identifiers or something to produce code that has the exact same semantics but isn't character-wise identical.

3. Claim that as new original code without the original copyright.

I assume that I would get my ass kicked legally speaking. That reads to me exactly like deliberate copyright infringement with willful obfuscation of my infringement.

How is it any different when a machine does the same thing?
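For concreteness, the tweak in step 2 can be done purely mechanically by walking a syntax tree and renaming identifiers. A minimal Python sketch (the function and names are hypothetical examples, not anything from the case):

```python
import ast

# "Memorized" original code (hypothetical example).
source = "def add(x, y):\n    return x + y\n"

# Purely mechanical identifier tweaks; the semantics are untouched.
RENAMES = {"add": "combine", "x": "a", "y": "b"}

class TweakNames(ast.NodeTransformer):
    """Rename functions, arguments, and variables per RENAMES."""

    def visit_FunctionDef(self, node):
        node.name = RENAMES.get(node.name, node.name)
        for arg in node.args.args:
            arg.arg = RENAMES.get(arg.arg, arg.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = RENAMES.get(node.id, node.id)
        return node

tweaked = ast.unparse(TweakNames().visit(ast.parse(source)))
print(tweaked)  # def combine(a, b): ... same behavior, different text
```

The output is character-wise different from the input but behaviorally identical, which is exactly the scenario described above.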

You might not get your ass kicked. Copyright doesn't protect function, to the point where the court will assess the degree to which the style of the code can be separated from its function. In the event that they aren't separable, the code is not copyrightable.



Software like Blackduck or Scanoss is designed to identify exactly that type of behaviour. It is used very often to scan closed source software and to check whether it contains snippets that are copied from open source with incompatible licenses (e.g. GPL).

To do so, these tools build a syntax tree of your code snippet and compare its structure with similar trees in open source software, without being fooled by variable names. To speed up the search, they also compute a signature for each tree so that the signature can be more easily looked up in their database of open source code.
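The core trick can be sketched in a few lines. This is only an illustration of the general idea, not Blackduck's or Scanoss's actual algorithm: hash the node types of the syntax tree, so identifier renames don't change the signature.

```python
import ast
import hashlib

def structural_signature(source: str) -> str:
    """Hash the shape of the syntax tree, ignoring identifier names."""
    tree = ast.parse(source)
    # Record only node types, not names or literal values.
    shape = "/".join(type(node).__name__ for node in ast.walk(tree))
    return hashlib.sha256(shape.encode()).hexdigest()

# Renamed variables produce the identical signature.
sig_a = structural_signature("def add(x, y):\n    return x + y\n")
sig_b = structural_signature("def combine(p, q):\n    return p + q\n")
print(sig_a == sig_b)  # True
```

Real scanners are far more robust (they normalize literals, tolerate reordering, and match partial subtrees), but the renaming-invariance shown here is the essential property.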

And that's all well and good, but that code that asserts to be protected by GPL still has to stand the abstraction-filtration-comparison test.

The plain fact is that you can claim copyright on plenty of stuff that isn't copyrightable.

Consider AI model weights: they're the result of an automatic process and contain no human expression; almost by definition, model weights shouldn't be copyrightable, but people are still releasing "open source" models with supposed licenses.

But there has to be a threshold. If a GPL project contains a function which takes two variables and returns x+y, and I have functionally identical code in a project I made with an incompatible license, it is obviously absurd to sue me.

You are right but there is no legally defined threshold so it's subjective.

As a matter of fact, the Eclipse Foundation requires every contributor to declare that every piece of code is their own original creation and is not a copy/paste from other projects, with the exception possibly of other Eclipse Foundation or Apache Foundation projects because their respective licenses allow that. Even code snippets from StackOverflow are formally forbidden.

If I am not mistaken, in the Oracle-Google trial over Java on Android, Google's re-implementation of the Java API on Android was in the end considered fair use, because Google kept the original "signatures" of the Java SDK API and rewrote most of the implementation, copying only "0.4% of the total Java source code" [1]. However, the trial came to this conclusion after several iterations in court.

[1] https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....

You're right, there is. The threshold is whatever a court decides is "substantial similarity" in that particular case. But there's no way to know that ahead of time as the interpretation/decision is subjective.

The simple version is that code is copyrightable as an expression, and the underlying algorithm is patentable.

The legal term you're looking for here is the "Abstraction-Filtration-Comparison" test: what remains if you subtract all the non-copyrightable elements from a given piece of code.

Algorithms have become patentable only very recently in the history of patents, without a rationale being ever provided for this change, and in some countries they have never become patentable.

Even in the countries other than USA where algorithms have become patentable, that happened only due to USA blackmailing those countries into changing their laws "to protect (American) IP".

It is true however that there exist some quite old patents which in fact have patented algorithms, but those were disguised as patents for some machines executing those algorithms, in order to satisfy the existing laws.

Doesn't really matter, the point is that they're patentable. They clearly shouldn't be IMO, but they are.

US copyright does protect for "substantial similarity" [0]. And at the other end of the spectrum, this has been abused in absurd ways to argue that substantially different code has infringed.

In Zenimax vs Oculus they basically argued that a bunch of really abstract yet entirely generic parts of the code were shared (we are talking some nested for loops and certain combinations of if statements), and due to the courtroom's lack of a qualitative understanding of code, syntax, common patterns, and what might actually qualify as substantively novel code, this was accepted as infringing. [1]

Point is, the legal system is highly selective when it comes to corporate interests.

[0] https://en.wikipedia.org/wiki/Substantial_similarity

[1] https://arstechnica.com/gaming/2017/02/doom-co-creator-defen...

> US copyright does protect for "substantial similarity"

Substantial similarity refers to three different legal analyses for comparing works. In each case what the analysis is attempting to achieve is different, but in no case does it operate to prohibit similarity, per se.

The Wikipedia page points out two meanings. The first is a rule for establishing provenance. Copyright protects originality, not novelty. The difference is that if two people coincidentally create identical works, one after another, the second-in-time creator has not violated any right of the first. (Contrast with patents, which do protect novelty.) In this context, substantial similarity is a way to help establish a rebuttable presumption that the latter work is not original, but inspired by the former; it's a form of circumstantial evidence. Normally a defendant wouldn't admit outright they were knowingly inspired by another work, though they might admit this if their defense focuses on the second meaning, below. The plaintiff would also need to provide evidence of access or exposure to the earlier work to establish provenance; similarity alone isn't sufficient.

The second meaning relates to the fact that a work is composed of multiple forms and layers of expression. Not all are copyrightable, and the aggregate of copyrightable elements needs to surpass a minimum threshold of content. Substantial similarity here means a plaintiff needs to establish that there are enough copyrightable elements in common. Two works might be near identical, but not be substantially similar if they look identical merely because they're primarily composed of the same non-copyrightable expressions, regardless of provenance.

There's a third meaning, IIRC, referring to a standard for showing similarity at the pleadings stage. This often involves a superficial analysis of apparent similarity between works, but it's just a procedural rule for shutting down spurious claims as quickly as possible.

> Point is, the legal system is highly selective when it comes to corporate interests.

I don't even think it's that. In recent cases like Oracle v. Google and Corellium v. Apple, Fair Use prevailed with all sorts of conflicting corporate interests at play. The Zenimax v. Oculus case very much revolved around NDAs that Carmack had signed and not the propagation of trade secrets. Where IP is strictly the only thing at issue, the literal interpretation of Fair Use does still seem to hold.

Or for a plainer example, Authors Guild v. Google, where Google defended its indexing of thousands of copyrighted books as Fair Use.

In fact, I'd go so far as to argue your example of Authors Guild v. Google is a good indication that most cases will probably go an AI platform's way. It's a pretty parallel case to a number of the arguments. Indexing required ingesting whole copyrighted works verbatim. It utilized that ingested data to produce a new commercial work consisting of output derived from that data. If I remember the case correctly, Google even displayed snippets when matching a search so the searcher could see the match in context, reproducing the works verbatim for those snippets, and one could presume (though I don't recall if it was coded against) that with sufficiently clever search prompts, someone could get the index search to reproduce a substantial portion of a work.

Arguably, the AI platforms have an even stronger case as their nominal goal is not to have their systems reproduce any part of the works verbatim.

> In fact, I'd go so far as to argue your example of Authors Guild v. Google is a good indication that most cases will probably go an AI platform's way.

The more recent Warhol decision argues quite strongly in the opposite direction. It fronts market impact as the central factor in fair use analysis, explicitly saying that whether or not a use is transformative depends in decent part on the degree to which it replaces the original. So if you're writing a generative AI tool that generates stock photos by scraping stock photo databases... I mean, the fair use analysis need consist of nothing more than that sentence to conclude that the use is totally not fair; none of the factors weigh in favor of it.

I think that decision is much narrower than "market impact". It's specifically about substitution, and to that end, I don't see a good argument that Co-Pilot substitutes for any of the works it was trained on. No one is buying a license to co-pilot to replace buying a license to Photoshop, or GIMP, or Linux, or Tux Racer. Nor is Github selling co-pilot for that use.

To the extent that a user of co-pilot could induce it to produce enough of a copyrighted work to both infringe on the content (remember that algorithms are not protected by copyright) and substitute for the original by licensing in lieu of, I would expect the courts to examine that in the ways it currently views a xerox machine being used to create copies of a book. While the machine might have enabled the infringement, it is the person using the machine to produce and then distribute copies that is doing the infringing not the xerox machine itself nor Xerox the company.

Specifically in the opinion the court says:

> If an original work and a secondary use share the same or highly similar purposes, and the secondary use is of a commercial nature, the first factor is likely to weigh against fair use, absent some other justification for copying.

I find it difficult to come up with a good case that any given work used to train co-pilot and co-pilot itself share "the same or highly similar purposes". Even in the case of say someone having a code generator that was used in training of co-pilot, I think the courts would also be looking at the degree to which co-pilot is dependent on that program. I don't know off hand if there are any court cases challenging the use of copyright works in a large collage of work (like say a portrait of a person made from Time Magazine covers of portraits), but again my expectation here is that the court would find that while the entire work (that is the magazine cover) was used and reproduced, that reproduction is a tiny fraction of the secondary work and not substantial to its purpose.

Similarly we have this line:

> Whether the purpose and character of a use weighs in favor of fair use is, instead, an objective inquiry into what use was made, i.e., what the user does with the original work.

Which I think supports my comparison to the xerox machine. If the plaintiffs against Co-Pilot could have shown that a substantial majority of users and uses of Co-Pilot were producing infringing works, or producing works that substitute for the training material, they might prevail in an argument that Co-Pilot is infringing regardless of the intent of GitHub. But I suspect even that hurdle would be pretty hard to clear.

Of the various recent uses of generative AI, Copilot is probably the one most likely to be found fair use and image generation the least likely.

But in any case, Authors Guild is not the final word on the subject, and anyone trying to argue for (or against) fair use for generative AI who ignores Warhol is going to have a bad day in court. The way I see it, Authors Guild says that if you are thoughtful about how you design your product, and talk to your lawyers early and continuously about how to ensure your use is fair and will be seen as fair in the courts, you can indeed do a lot of copying and still be fair use.

I agree. Nothing is going to be the final word until more of these cases are heard. But I still don't think Warhol is as strong even against other uses of generative AI, and in fact I think in some ways argues in their favor. The court in Warhol specifically rejects the idea that the AWF usage is sufficiently transformed by the nature of the secondary work being recognizably a Warhol. I think that would work the other way around too, that a work being significantly in a given style is not sufficient for infringement. While certainly someone might buy a license to say, Stable Diffusion and attempt to generate a Warhol style image, someone might also buy some paints and a book of Warhol images to study and produce the same thing. Provided the produced images are not actually infringements or transformations of identifiably original Warhol works, even if they are in his style, I think there's a good argument to be made that the use and the tool are non-infringing.

Or put differently, if the Warhol image had used Goldsmith's image as a reference for a silk screen portrait of Steven Tyler, I'm not sure the case would have gone the same way. Warhol's image is obviously and directly derived from Goldsmith's image and was found infringing when licensed to magazines, yet if Warhol had instead gone out and taken black and white portraits of Prince, even in Goldsmith's style after having seen it, would it have been infringing? I think the closest case we have to that would have been the suit between Huey Lewis and Ray Parker Jr. over "I Want a New Drug"/"Ghostbusters", but that was settled without a judgement.

I do agree that Warhol is a stronger argument against artistic AI models, but it would very much have to depend on the specifics of the case. The AWF usage here was found to be infringing, with no judgement made of the creation and usage of the work in general, but specifically with regard to licensing the work to the magazine. They point out the opposite case that his Campbell paintings are well established as non-infringing in general, but that the use of them licensed as logos for soup makers might well be. So as is the issue with most lawsuits (and why I think AI models in general will win the day), the devil is in the details.

A key finding by the judge in the Authors Guild v. Google case was that the authors benefited from the tool that Google created. A search tool is not a replacement for a book, and is much more likely to generate awareness of the book, which in turn should increase sales for the author.

AI platforms that replace and directly compete with authors cannot use the same argument. If anything, those suing AI platforms are more likely to bring up Authors Guild v. Google as a guiding case for determining when to apply fair use.

Copyright is abused often. Our modern version of copyright is BS and only benefits large corps who buy a lot of IP.

Yep. Now it is a legal cudgel wielded most effectively by corporate giants. It has mutated to become completely philosophically opposed to what it was expressly created to protect.

If I were to license a cover of a song for a music video, I'd have to license both the original song and the cover itself.

I'd say this is extremely relevant in this case.

if that is the case why do people ever license covers?

to clarify - I thought you just had to negotiate with the cover artist about rights and pay a nominal fee for usage of the song for cover purposes - that is to say you do not negotiate with the original artist, you negotiate with a cover artist and the whole process is cheaper?

You're maybe thinking about this in a way that's not helping you to understand the system and why it works the way it does. It's very clear when you think of a specific case.

Say you want to make a recording of "Valerie" by the Zutons. You need permission (a license) from the songwriters (the Zutons presumably) to do this. You usually get this permission by paying a fee. Having done that, you can do your recording. Whenever that recording is played (or used) you will get a performance royalty and they will get a songwriting royalty.

Say you want to use a cover of "Valerie" by the Zutons in your film or whatever. Say the Mark Ronson version featuring Amy Winehouse. You need permission (a license) from the person who produced that version (Mark Ronson or his company) and will need to pay them a fee, some of which goes to the songwriter as part of their deal with Mark Ronson which gave him the license to produce his cover in the first place.

The Zutons don't have the right to sell you a license to Mark Ronson's version so if that's the version you want you have to negotiate with him. Likewise he doesn't have the right to sell you a license like the license he has (ie a license to do a recording/performance) so if you want that you have to negotiate with them.

OK it seems exactly what I thought and described, and the opposite of what the parent poster described. The parent poster said that if you want to use the cover of the song you need to negotiate with both the people who did the cover and the original rights owner.

The closest I could get to a situation like that would be if I told Band B do a cover of Song A for my movie and I paid the licensing costs as part of my deal with Band B, but still not the same as the parent poster's description.

Cover songs have a special and explicit law covering them. Not relevant.

While correct, the example given is that they COPY the code, then make adjustments to hide the fact. I suspect this is still a copyright violation. It’s interesting that a judge sees it differently when it’s just run through a programme. I’m not a legal expert so I’m guessing it’s a bit more complex than the headline?

Ok I read the article and it looks like the issue is the DMCA specifically, which requires the code to be more identical than what was presented. I'm guessing separate claims could still come from other copyright laws?

No copy-paste was explicitly used. They compressed it into a latent space and recreated from memory, perhaps with a dash of "creativity" for flavor. Hypothetically, of course.

The distinction is pedantic but important, IMHO. AI doesn't explicitly copy either.

But isn’t that the same as memorising it and rewriting the implementation from memory? I’m sure “it wasn’t an exact reproduction” is not much of a defence.

I sure think so. I also think that (to first order) this is exactly what modern AI products do. Is a lossy copy still a copy?

I would have thought so but I’m not a lawyer. The article suggests DMCA is intended for direct copies so that’s why it failed here. Maybe more general copyright laws would apply for lossy copies.

You have a much smaller lobbying budget than the AI industry, and you didn't flagrantly rush to copy billions of copyrighted works as quickly as possible and then push a narrative acting like that's the immutable status quo that must continue to be permitted lest the now-massive industry built atop copyright violation be destroyed.

Violate one or two copyrights, get sued or DMCAed out of existence. Violate billions, on the other hand, and you magically become immune to the rules everyone else has to follow.

> Violate one or two copyrights, get sued or DMCAed out of existence. Violate billions, on the other hand, and you magically become immune to the rules everyone else has to follow.

Sounds like the same concept as commonly said of "murderer vs conqueror".

Could probably be applied to many other fields for disruption too. Not the murderer bit (!), more the "break one or two laws -> scaled up massively to a potential new paradigm".

"If you owe the bank $100 that's your problem. If you owe the bank $100 million, that's the bank's problem."

Pretty sure there's a bunch of pre-existing laws around that though, so not really ripe for disrupting by scaling up the problem. ;)

There's a strong geopolitical angle as well. If you force American companies to license all training data for LLMs, that is such a gargantuan undertaking it would effectively set US companies back by years relative to Chinese competitors, who are under no such restrictions.

Bottom line, if you're doing something considered relevant to the national interest then that buys you a lot of leeway.

You will need to first demonstrate that actual copying took place, and that whatever copying did take place was actually illegal or infringing.

As we're seeing in court, that's a very interesting question. It turns out that the answers are very counter-intuitive to many.

What about copyright's purpose of furthering the arts and sciences?

You want to look at the Supreme Court case "Eldred v. Ashcroft." Eldred challenged Congress's retroactive extension of existing copyrights, arguing that extending protection on already-existing works could not possibly further the arts and sciences. They also argued that if Congress had the power to continually extend existing copyrights by N years every N years, the Constitutional power of "for a limited time" had no meaning.

The Supreme Court's decision was a bunch of bullshit around "well, y'know, people live longer these days, and some creators are still alive who expected these to last their whole lives, and golly, coincidentally this really helps giant corporations."

Copyright has utterly failed to serve that purpose for a long time, and has been actively counterproductive.

But if you want to argue that copyright is counterproductive, I completely agree. That's an argument for reducing or eliminating it across the board, fairly, for everyone; it's not an argument for giving a free pass to AI training while still enforcing it on everyone else.

Could these "free passes" for AI training serve as a legal wedge to increase the scope of fair use in other cases? Pro-business selective enforcement sucks, but so long as model weights are being released and the public is benefiting then stubbornly insisting that overzealous copyright laws be enforced seems self-defeating.

Without copyright, entire industries would've been dead a long time ago, including many movies, games, books, tv, music, etc.

Just because their lobbies tend to push the boundary of copyright into the absurd doesn't mean these industries aren't worth saving. There should be genuinely conscientious lawmakers who seek a balance between public and commercial interests.

> Without copyright, entire industries would've been dead a long time ago, including many movies, games, books, tv, music, etc.

Citation needed. There are many ways to make money from producing content other than restricting how copies of it can be distributed. The owner should be able to choose copyright as a means of control, but that doesn't mean nobody would create any content at all without copyright as a means of control.

There's nothing preventing people from producing works and releasing them without copyright restriction. If that were a more sustainable model, it would be happening far more often.

As it is now, especially in the creative fields (which I am most knowledgeable about), the current system has allowed for an incredible flourishing of creation, which you'd have to be pretty daft to deny.

> If that were a more sustainable model, it would be happening far more often.

that's not the argument. The fact that there currently are restrictions on producing derivative works is the problem. You cannot produce a Star Wars story without getting consent from Disney. You cannot write a Harry Potter story without consent from Rowling.

That's not actually true. There's nothing stopping you from producing derivative works. Publishing and/or profiting from other people's work does have some restrictions though.

There's actually a huge and thriving community of people publishing derivative works, in a not-for-profit basis, on Archive of Our Own. (Among other places.)

> There's actually a huge and thriving community of people publishing derivative works, in a not-for-profit basis, on Archive of Our Own. (Among other places.)

Yes, and none of those people are making a living at creating things. That's why they are allowed by the copyright owners to do what they're doing--because it's not commercial. Try to actually sell a derivative work of something you don't own the copyright for and see how fast the big media companies come after you. You acknowledge that when you say there are "restrictions" (an understatement if I ever saw one) on profiting from other people's work (where "other people" here means the media companies, not the people who actually created the work).

It is true that without our current copyright regime, the "industries" that produce Star Wars, Disney, etc. products would not exist in their current form. But does that mean works like those would not have been created? Does it mean we would have less of them? I strongly doubt it. What it would mean is that more of the profits from those works would go to the actual creative people instead of middlemen.

> Yes, and none of those people are making a living at creating things.

Again, not true. One of the most famous examples is likely Naomi Novik, who is a bestselling author, in addition to a prolific producer of derivative works published on AO3. Many other commercially successful authors publish derivative works on this platform as well.

> It is true that without our current copyright regime, the "industries" that produce Star Wars, Disney, etc. products would not exist in their current form. But does that mean works like those would not have been created? Does it mean we would have less of them? I strongly doubt it. What it would mean is that more of the profits from those works would go to the actual creative people instead of middlemen.

Speculate all you want about an alternative system, but you really don't know what would have happened, or what would happen moving forward.

> not true

Sorry, I meant they're not making a living at creating derivative works of copyrighted content. They can't, for the reasons you give. Nor can other people make a living creating derivative works of their commercially published work. That is an obvious barrier to creation.

> the current system has allowed for a incredible flourishing of creation

No, the current system has allowed for an incredible flourishing of middlemen who don't create anything themselves but coerce creative people into agreements that give the middlemen virtually all the profits.

People do not put out their stuff. People get lured into contracts selling their IP to a shitty company that then publishes the stuff, of course WITH copyright, so the company can make money while the artist doesn't.

Given that copyrighting is automatic at the instant of creation, that is, um, debatable.

Slapping 3 lines in LICENSE.TXT doesn’t override the Berne convention.

Are you claiming that an author cannot place their work in the public domain?

Yes, they can't, because there is no legally reliable way to do it (briefly, because the law really doesn't like the idea of property that doesn't have an owner, so if you try to place a work of yours in the public domain, what you're actually doing is making it abandoned property so anyone who wants to can claim they own it and restrict everyone else, including you, from using it). The best an author can do is to give a license that basically lets anyone do what they want with the work. Creative Commons has licenses that do that.

In most of the world no, they can't.

Copyright laws prevent piracy. It is interesting to live in a country with no enforced copyrights and EVERYTHING is pirated. I think it is easy to not know about that context and just see the stick side of copyright vis-a-vis big money corporations

Technically speaking, copyright laws create piracy: without them we would still have our free speech rights to share whatever we want without approval from third parties, and so-called piracy, aka copyright infringement, would not be a thing. Laws also hardly prevent sharing of copyrighted content; they only make it illegal.

> we would still have our free speech rights to share whatever we want

This is a false dichotomy. It's not "free speech" to copy someone else's video game and then sell it for your own profit. By "copy", in the old days that was literally copying the distribution CDs and providing a cracked keycode (it was not even a question of trademarks being close or what not. It's literally people taking the stuff, duplicating it, and selling it for their own profit. Eastern European mafia were greatly financed by this and ran this type of operation at industrial scale).

> Laws also hardly prevent sharing of copyrighted content, they only make it illegal.

Yeah, that's the point. Without that, everything is bootlegged. Imagine video games - they get bootlegged. DVDs, all bootlegged. Clothing bootlegged. Whatever your business is - bootlegged. Zero copyright is not a utopia of free speech, it is people ripping everyone else off. Per lived experience, I'm just saying the other extreme is not a utopia.

So true! Copyrights that last 20 years would be completely reasonable. Maybe with exponentially increasing fees for successive renewals, for super valuable properties like Disney movies.
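As a toy illustration of that fee schedule (the base fee, doubling factor, and term lengths here are made up for the example, not a policy proposal):

```python
# Hypothetical schedule: a 20-year base term, then renewals whose
# fee doubles with each renewal. All numbers are illustrative.
BASE_FEE = 1_000  # dollars for the first renewal

def renewal_fee(n: int) -> int:
    """Fee for the n-th 20-year renewal (n >= 1), doubling each time."""
    return BASE_FEE * 2 ** (n - 1)

# A century of protection (base term + 4 renewals) would cost:
total = sum(renewal_fee(n) for n in range(1, 5))
print(total)  # 1000 + 2000 + 4000 + 8000 = 15000
```

The point of the exponential growth is that only properties still generating real revenue (the Disney movies of the world) would justify paying for long terms, while everything else lapses into the public domain.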

Nobody cares anymore. We're sick of their rent seeking, of their perpetual monopolies on culture. Balance? Compromise? We don't want to hear it.

Nearly two hundred years ago one man warned everyone this would happen. Nobody listened. These are the consequences.

"At present the holder of copyright has the public feeling on his side. Those who invade copyright are regarded as knaves who take the bread out of the mouths of deserving men. Everybody is well pleased to see them restrained by the law, and compelled to refund their ill-gotten gains. No tradesman of good repute will have anything to do with such disgraceful transactions. Pass this law: and that feeling is at an end. Men very different from the present race of piratical booksellers will soon infringe this intolerable monopoly. Great masses of capital will be constantly employed in the violation of the law. Every art will be employed to evade legal pursuit; and the whole nation will be in the plot. On which side indeed should the public sympathy be when the question is whether some book as popular as “Robinson Crusoe” or the “Pilgrim’s Progress” shall be in every cottage, or whether it shall be confined to the libraries of the rich for the advantage of the great-grandson of a bookseller who, a hundred years before, drove a hard bargain for the copyright with the author when in great distress? Remember too that, when once it ceases to be considered as wrong and discreditable to invade literary property, no person can say where the invasion will stop. The public seldom makes nice distinctions. The wholesome copyright which now exists will share in the disgrace and danger of the new copyright which you are about to create. And you will find that, in attempting to impose unreasonable restraints on the reprinting of the works of the dead, you have, to a great extent, annulled those restraints which now prevent men from pillaging and defrauding the living."


Books, music, and games are a lot older than copyright.

Have you looked at who created these things by and large? For the most part, you have:

- aristocrats wealthy enough that they didn't need to "work" to survive and put food on the table

- craftspeople supported through the patronage of a rich person (or religious order) who deigned to support their art

- (in the kinda modern world) national governments that want to support their national art, often out of fear that larger nations' cultural influences will dwarf their own

Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?

How, in a world where digital copies are effectively free to copy ad infinitum, would a creator reap any benefits from that network effect?

A modern equivalent would be famous YouTubers whose entire job is to "watch" other people's hard-earned videos. The super lazy ones don't direct people to the original, don't provide meaningful commentary, just consume the video as 'content' to feed their own audience, and provide no value to the original creator. Killing copyright entirely would amplify this "just bypass the original source" behavior and drive the value of the original creator to zero.

> Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?

Do you think the vast "amount of content we produce" is actually propped up by copyright? Have you ever heard of someone who started their career on YouTube due to copyright? On the contrary, how often have you heard of people stopping their YouTube career due to copyright, or explicitly limiting the content they create? I have only heard of cases of the latter. In fact, the latter partially happened to me.

> How, in a world where digital copies are effectively free to copy ad infinitum, would a creator reap any benefits from that network effect?

You are making an assumption that people should reap (monetary) benefits for creating things. What you are ignoring is that the world where digital copies are effectively free is also the world where original works are insanely cheap as well. In this world, people create regardless of monetary gain.

To make this point: how much money did you make from this comment that you posted? It's covered by copyright, so surely you would not have created it if not for your own benefit.

Spending six minutes of my life engaging in political discourse is a far cry from hundreds of individuals producing a movie that cost millions of dollars to make. Both are just as easily digitally repeatable, but the expensive content is likely far more beneficial to society as a whole. I choose to engage in this hobby because I have the means to provide this content recreationally. I fail to see this scaling to anything of real quality outside of some isolated instances. For instance, some video game enthusiasts are using the work of Bethesda to make a new game called Fallout: London. It's a spinoff Fallout game using the engine Bethesda built for their commercial games. It's exceptional in that it could actually reach roughly the level of a commercial product, as long as you ignore that it leverages an engine and story developed by commercial interests. Meanwhile, tens to hundreds of thousands of people are employed every year to produce video games for commercial reasons. Would they all stop making games if copyright were dead? No, but the vast majority would.

> Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?

Yes, and better quality content too as it doesn't need to be compromised as much to allow for commercial exploitation in the current model.

But these are also not the only ways to fund content. Patronage in particular does not need to be restricted to singular rich patrons but can be extended to any group of people who decide to come together to make something exist. This does already happen to some extent (e.g. Kickstarter) but is actually hobbled by copyright, where the norm is that the creator retains all rights while individual contributors to the funding are restricted in how they are allowed to share the creation they helped realize.

> How, in a world where digital copies are effectively free to copy ad infinitum, would a creator reap any benefits from that network effect?

By having fans willing to pay him to create new content.

For that matter, if you think China ripping everyone else off is bad now… well, just wait until every company can do that.

If everyone could do it, it wouldn't be as big a deal - small western businesses would be on a more level playing field, since they would be almost as immune from being sued by big businesses as Chinese businesses are. As it is, small businesses aren't protected by patents (because a patent is a $10k+ ticket to a $100k+ lawsuit against a competitor with a $1M+ budget for lawyers) while still being bound by the restrictions of big business's patents. It's lose/lose.

Trademark isn't copyright, so no.

Yeah many industries like:

- Big Corps that buy IP

- Patent Trolls

- Companies that fuck over artists

Why would anyone make video games if they couldn't make money from selling them?

Video games would actually be better off if the profit incentive was removed. Modern high-budget video games have become indistinguishable from slot machines that are optimized by literal psychologists to get you to waste as much of your money (and time) as possible without providing any meaningful experience. I'd rather see far fewer games created if what remains are games focused on having artistic and/or educational value rather than investment opportunities for Wall Street.

This is just your own sanctimony, go to a gamestop and ask people if they think we should have an IP regime where there is no gta or football games. What a ridiculous response.

Out of passion for the art. See also: free (libre) software video games released and distributed for free (gratis).

Of course, money is a huge motivator, but so is self-expression.

Well they certainly aren't on level with each other in terms of motivation, so I don't think it's fair for you to say they are both huge motivators.


This is a specious argument. It is impossible for us to gesture at the works of art that do not exist because of draconian copyright. Humans have been remixing each others' works for millions of years, and the artificial restriction on derivative work is actively destroying our collective culture. There should be thousands of professional works (books, movies, etc.) based on Lord Of The Rings by now, many of which would surpass the originals in quality given enough time, and we have been robbed of them. And Lord Of The Rings is an outlier in that it still remains culturally relevant despite its age; most works will remain copyrighted for far longer than their original audience was even alive, meaning that those millions of flowers never get their chance to bloom.

> It is impossible for us to gesture at the works of art that do not exist because of draconian copyright.

We can gesture at the tiniest tip of the iceberg by observing things that are regularly created in violation of copyright but not typically attacked and taken down until they get popular:

- Game modding, romhacks, fangames, remakes, and similar.

- Memes (often based on copyrighted content)

- Stage play adaptations of movies (without authorization)

- Unofficial translations

- Machinima

- Speedruns, Let's Play videos, and streams (very often taken down)

- Music remixes and sampling

- Video mashups

- Fan edits/cuts, "Abridged" series

- Archiving and preservation of content that would otherwise be lost

- Fan films

- Fanfiction

- Fanart

- Homebrew content for tabletop games

> "- Speedruns, Let's Play videos, and streams (very often taken down)"

Very often taken down, only by nintendo.

There are several other publishers who regularly go after gameplay footage of people playing their games. It's not as visible, because it's hard to notice the absence of a thing.

This is all true, and in a vacuum I agree with it. There's a pretty core problem with these kinds of assertions, though: people have to make rent. Never have I seen a substantive, pass-the-sniff-test argument for how to make this system practical when your authors and your artists need to eat in a system of modern capital.

So I'm asking genuinely: what's your plan? What's the A to B if you could pass a law tomorrow?

> What's the A to B if you could pass a law tomorrow?

Top priority: UBI, together with a world in which there's so much surplus productivity that things can survive and thrive without having "how does this make huge amounts of money" as its top priority to optimize for.

Apart from that: Conventions/concerts/festivals (tickets to a unique live event with a crowd of other fans), merchandise (pay for a physical object), patronage (pay for the ongoing creation of a thing), crowdfunding/Kickstarter (pay for a thing to come into existence that doesn't exist yet), brand/quality preference (many people prefer to support the original even if copies can be made), commissions (pay for unique work to be created for you), something akin to "venture funding", and the general premise that if a work spawns ten thousand spinoffs and a couple of them are incredible hits they're likely to direct some portion of their success back towards the work they build upon if that's generally looked upon favorably.

People have an incredible desire both to create and to enjoy the creations of others, and that's not going to stop. It is very likely that the concept of the $1B movie would disappear, and in trade we'd get the creation of far far more works.

> UBI, together with a world in which there's so much surplus productivity that things can survive and thrive without having "how does this make huge amounts of money" as its top priority to optimize for.

The poster didn't posit it as "how does this make huge amounts of money," they asked how authors are supposed to pay their rent in your scenario. Your solution, of course, has nothing to do with copyright policy.

Yeah, this is what I was expecting. I have no love for Disney et al but I think that this is dire (aside from UBI, which would be great but is fictional without a large-scale shift in American culture).

"Everybody else gets paid for the work they do; you get paid for things around the work you do, if you're lucky" is a way to expect creatives to live that, to put a point on it, always ends up being "for thee, but not for me". It's bad enough today--I think you described something worse.

The current model is "most people get paid for the work they do, but you get paid for people copying work you've already done", which already seems asymmetric. This would change the model to "people get paid for the work they do, and not paid again for copying work they've already done".

We converged on a system that protects the commercialization of copies because, in practice, "the first copy costs $X0,000" is not a viable way to pay your rent.

If we want art to be the province of the willfully destitute or the idle rich (and I do mean rich, the destruction of a functional middle class has compacted the available free time of huge swaths of society!), this is a good way to do it. I would rather other voices be included.

We converged on a system that makes copying illegal because that system was invented in an era when the only people who could copy were those with specialized equipment (e.g. printing presses). In that world, those who might do the copying were often larger than those whose works were being copied, and copyright had more potential to be "protective".

That system hasn't been updated for a world in which everyone can make perfect-fidelity copies or modifications at the touch of a key; on the contrary, it's been made stricter. And worse, per the story we're commenting on here, the much larger players who are mass-copying works largely by individuals or smaller entities have become effectively exempt from copyright, while copyright continues to restrict individuals and smaller entities, and the systems designed by those large players and trained on all those copied works are crowding individuals out of art and other creative endeavors.

I don't think the current system deserves valorizing, nor can it be credited as being intentionally designed to bring about most of the effects it currently serves.

I'm not suggesting that deleting copyright overnight will produce a perfect system, nor am I suggesting that it has zero positive effects. I'm suggesting that it's doing substantial harm and needs a massive overhaul, not minor tweaks.

> the much larger players who are mass-copying works largely by individuals or smaller entities have become effectively exempt from copyright

That's not true. I'm a copyright attorney and I spend my day extracting money from the largest players on behalf of individuals.

I was referring to AI training here.

We'll see, but hopefully they will not.

They don't have to copy work, they can make their own work!

Many of the funding models Josh listed are direct payment for creative work being done. If anything, in the current model creative work is often not paid directly (unless done as work for hire, where the creative doesn't get to own their creation) but instead is a gamble that you can later on profit from the "intellectual property".

Not the person you responded to, but:

>So I'm asking genuinely: what's your plan? What's the A to B if you could pass a law tomorrow?

Patreon (or liberapay etc). Take a look at youtube: so many creators are actively saying "youtube doesn't pay the bills, if you like us then please support us on Patreon". Patreon works. Some of the time, at least - just like copyright. Also crowdsourcing (e.g. Kickstarter), which worked out well for games like FTL and Kingdom Come: Deliverance.

Although, I personally don't believe copyright should be abolished - it just needs some amendments. It needs a duration amendment - not a flat duration (fast fashion doesn't need even 5 years of copyright, but aerospace software regularly needs several decades just to become profitable), but either some duration mechanism or a simple discrimination by industry.

Also, I think any sort of functional copyright (e.g. software copyright) ought to have an incentive or requirement to publish the functional bits - for instance, router firmware ought to require the source code in escrow (to be published once copyright duration expires) for any legal protections against reverse-engineering to be mounted. Unpublished source code is a trade secret, and should be treated as such.

Also, these discussions don't seem to mention fanfiction, which demonstrates plenty of people write good works without being professionally paid and without the protection of copyright.

How many subscribers on Patreon are there because the creator provides pay-walled extra content? How many would remain if that pay-walled content were mirrored directly on YouTube?

Crowdsourcing might work better, but how many would donate to a game when, instead of getting it cheaper as a Kickstarter supporter, they could get it for free after it is released?

I completely forgot about Patreon's paywalled content. Plenty of channels don't have any, though, so I don't think it's that important.

Copyright is not optimized for making sure artists and authors get enough to eat. It's optimized for people with a lot of money to make even more money by exploiting artists and authors.

I doubt there's a simple answer (I certainly don't have one), but the current system is not exactly a creators' utopia.

My own business model is to create Things That Don't Exist Yet. This (typically bespoke work) is actually the majority of work in any era I think. For me, copyright doesn't do much, it mostly gets in the way.

If you pass the law tomorrow -all else being equal- my profits would stay equal or go up somewhat.

Fashion is traditionally not copyrightable [1], and the fashion industry is doing rather well.

Similarly our IT infrastructure is now built mostly on [a set of patches to the copyright system][2] called F/L/OSS that provided more freedom to authors and users, and lead to more innovation and proliferation of solutions.

So even just in the modern west, we can see thriving ecosystems where copyright is absent or adjusted; and where the outcomes are immediately visible on the street.

[1] Though a quick search shows that lawyers are making inroads.

[2] One way of describing it at least, YMMV.

That ship sailed long ago. While copyright can and is used at times to protect the "little guy", the law is written as it is in order to protect and further corporate interests.

The current manifestation of copyright is about rent-seeking, not promoting innovation and creativity. That it may also do so is entirely coincidental.

Also, if it wasn't about rent-seeking and preventing access to works, copyright wouldn't have to last for decades, many multiples of a work's useful commercial life. The fact that it does last this long shows that it's not about promoting innovation and creativity.

Copyright was invented by a cartel of noblemen, the British Stationers' Company, who, due to liberal reform, were going to lose their publishing monopoly. The implementation of copyright law as they helped pen it allowed them to mostly continue their position while portraying it as "protecting the little guy".

Funny how both the rhetoric and intentions are the same after three hundred years.

Copyright’s purpose is a cudgel to be wielded to enrich the holder for, ideally, eternity. If “eternity” is threatened, you use proceeds from copyright to change copyright law to protect future proceeds.

works the same for banks and owing them money

Violate billions or millions is what they used to nail warez folks with. So there is that.

> acting like that's the immutable status quo

It is immutable.

What are you going to do about it? Confiscate everyone's home gamer PCs?

Even in the most extreme hypothetical where lawsuits shutdown OpenAI, that doesn't delete the stable diffusion models that I have on my external hard drives.

The tech is out there. It's too late.

Somehow this argument does not seem to hold for copyright enforcement of works that have been shared over BitTorrent and it's predecessors for decades.

I can start downloading any major/popular piece of media, starting right now, in under 60 seconds through bittorrent.

I cannot think of a better example of how futile copyright enforcement has been than the example that you just brought up.

That's a significant oversimplification of how it works, though, to the point of almost not being a useful analogy.

If your analogy were that you were a human who memorized every variation of a problem (and every other known problem), and there was a tiny percentage of a chance that you reproduced the exact variation of one you memorized, but then added an after-the-fact filter so you don't directly reproduce it...

It's more like musicians who absorb a bunch of music patterns or chord progressions, then notice their final output sounds too similar to another song (which happens often IRL), and change it to be more original before releasing it to the public.

> If your analogy were that you were a human who memorized every variation of a problem (and every other known problem)

This is mere assumption. AI is supposed to work like that, but that's a goal, and not the result of current implementations. Research shows that they do memorize solutions as well, and quite regularly so. (This is an unavoidable flaw in current LLMs; They must be capable of memorizing input verbatim in order to learn specific facts.)

> and there was a tiny percentage of a chance that you reproduced the exact variation of one you memorized

This is copyright infringement. Actionable copyright infringement. The big music publishers go after this kind of accidental partial reproduction.

> but then added an after-the-fact filter so you don't directly reproduce it...

"Legally distinct" is a gimmick that only works where the copyright is on specific identifiable parts of a work.

Changing a variable name does not make a code snippet "legally distinct", it's still copyright infringement.
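To make that concrete, here is a toy sketch (hypothetical helper names, not any real scanner's API) of why renaming variables is so easy to see through: canonicalize every identifier to a positional token and compare the results. Real tools compare full syntax trees, but even this naive version catches a renamed copy.

```javascript
// Toy fingerprint: replace each distinct identifier with ID0, ID1, ...
// in order of first appearance. Two snippets that differ only in
// identifier names then canonicalize to the same string.
function canonicalize(code) {
  const names = new Map();
  return code.replace(/\b[A-Za-z_$][\w$]*\b/g, (name) => {
    // Keep language keywords as-is (tiny, incomplete list for the sketch).
    if (["function", "return", "const", "let", "for", "of", "if"].includes(name)) {
      return name;
    }
    if (!names.has(name)) names.set(name, `ID${names.size}`);
    return names.get(name);
  });
}

const original = "function sum(xs) { let t = 0; for (const x of xs) t += x; return t; }";
const renamed  = "function total(values) { let acc = 0; for (const v of values) acc += v; return acc; }";

console.log(canonicalize(original) === canonicalize(renamed)); // true
```

A real scanner obviously has to handle comments, strings, reordered statements, and so on, which is why they work on syntax trees rather than raw text, but the principle is the same.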

Meh, I still see that as a big oversimplification. Context matters, even if the copyright courts often ignore that for wealthy entities. Someone reproducing a song using AI and publishing it as their own is copyright infringement. A person querying an AI engine that sucked up billions of lines of information, which generates what you ask it to with a small probability that it will reproduce a small subset of a larger commercial project and sends it to someone in a chat box, is not exactly the same, IMO.

This is GitHub Copilot, after all. I use it daily and it autocompletes lines of code or generates functions you can find on Stack Overflow. It's not handing you the source code to Twitter in full and letting you put it on the internet as a business under another name.

We are currently seeing the music industry reacting to AI learning a bunch of music patterns and chord progressions and outputting works that sound very similar to existing music and artists. They are not liking it.

To see just how much they dislike it: YouTube's copyright strike system is basically a trained AI that detects music patterns, identifies audio with slight variations of copyrighted songs, and takes videos down. Generating slight variations was one of the early methods videos used to bypass the takedown system.

From the article:

> The most recently dismissed claims were fairly important, with one pertaining to infringement under the Digital Millennium Copyright Act (DMCA), section 1202(b), which basically says you shouldn't remove without permission crucial "copyright management" information, such as in this context who wrote the code and the terms of use, as licenses tend to dictate.

> It was argued in the class-action suit that Copilot was stripping that info out when offering code snippets from people's projects, which in their view would break 1202(b).

> The judge disagreed, however, on the grounds that the code suggested by Copilot was not identical enough to the developers' own copyright-protected work, and thus section 1202(b) did not apply. Indeed, last year GitHub was said to have tuned its programming assistant to generate slight variations of ingested training code to prevent its output from being accused of being an exact copy of licensed software.

So (not a lawyer!) this reads like the point about GitHub tuning their model is not a generic defense against any and all claims of copyright infringement, but a response to a specific claim that this violates a provision of the DMCA.

I don't know whether this is a reasonable defense or not, but your intuitions or mine about whether there is a general copyright violation or what's fair are not necessarily relevant to how the judge construes that very specific bit of legal code.

What I got from this is, you can copy someone's copyrighted work provided you tweak a few things here and there. I wonder how this holds up in court if you don't have billions at your disposal.

Weird Al should be in the clear then, he changes probably 85% of all the song lyrics in his covers.

Weird Al explicitly seeks out permission from copyright holders and won't do a cover if he doesn't get their go-ahead [1].

Pretty much the exact opposite of all these AI companies :p


I'm implying that he doesn't seem to have to.

Just to set the stage and not entirely specific to this complaint... It really depends on what is and isn't subject to copyright for software.

Broadly, there is the distinction between expressive and functional code. [1]

And then there are the specific tests that have been developed by the courts to separate the expressive and functional aspects of software. [2] [3]

In practice it is very expensive for a plaintiff to do such analysis. For the most part the damages related to copyright are not worth the time and money. Plaintiffs tend to go for trade secret related damages as they are not restricted by the above tests.

There are also arguments to be made of de minimis infringements that are not worth the time of the court.

Most importantly the plaintiff fundamentally has the burden of proof and cannot just say that copying must have taken place. They need concrete evidence.

[1] https://en.wikipedia.org/wiki/Idea–expression_distinction

[2] https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...

[3] https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

The guy who owns the machine is really rich, while you are more or less (all due respect of course) not worth suing.

That’s why I think the opposite of what you claim is true: if you were to do this, absolutely nothing would happen. When they do it, they will get sued over and over until the law changes and they can’t be sued, or they enter some mutually-beneficial relationship with the parties who keep suing.

> if you were to do this, absolutely nothing would happen

Read up on the DMCA and the impact it has on e.g. nintendo emulators and the developers thereof

Those emulators are very popular, though, to the point of potentially impacting another business's bottom line, whereas an individual putting out a small block of code isn't exactly going to attract expensive lawyers.

I'm skeptical Github Copilot reproducing a couple functions potentially used by some random Github project is going to be a threat to another party's livelihood.

When AI gets good enough to make full duplicates of apps I'd be more concerned about the source. Thousands of smaller pieces drawn from a million sources and being combined in novel ways is less worrying though.

There is no impact to a company's bottom line when you are emulating a product they do not sell.

Yuzu, the emulator that was sued by Nintendo, was emulating the Nintendo Switch, which is a product Nintendo does sell.

Yuzu is not the only emulator taken down by Nintendo and Nintendo is not the only company that has gone after emulators.

In that case, could you clarify what instances of this you're referring to?

The death of Citra wasn't really a deliberate action on the part of Nintendo, it was collateral damage. Citra was started by Yuzu developers and as part of the settlement they were not able to continue working on it. Citra's development had long been for the most part taken over by different developers, but the Yuzu people were still hosting the online infrastructure and had ownership of the GitHub repository, so they took all of it down. Some of the people who were maintaining Citra before the lawsuit opened up a new repository, but development has slowed down considerably because the taking down of the original repository has caused an unfortunate splintering of the community into many different forks.

There is some speculation Nintendo was involved with the death of the Nintendo 64 emulator UltraHLE a long time back, but this was never confirmed. If indeed they did go after UltraHLE, then this would, just like Yuzu, be a case of them taking down an emulator for a console they were still profiting from, as UltraHLE was released in 1999.

The most famous example of companies going after emulators is Sony, which went after Connectix Virtual Game Station and Bleem!. Both were PS1 emulators released in 1999, a period during which Sony was still very much profiting from PS1 sales. Sony lost both lawsuits and hasn't gone after emulators since.

In 2017, Atlus tried to take down the Patreon page for RPCS3, a PS3 emulator. However, Atlus only went after the Patreon page, not the emulator itself, which they did because of their use of Persona 5 screenshots on said page. The screenshots were simply taken down and the Patreon page was otherwise left alone. Of note is that Atlus is a game developer, so they were never profiting from PS3 sales. However, they were certainly still profiting from Persona 5 sales, which had only released in 2016.

These are the only examples I can remember. Did I miss anything?

Emulators for many Nintendo consoles have been developed and released while the console was still being sold, and have been left alone as long as they had no direct links to piracy; recent events are a bit of a change.

> There is some speculation Nintendo was involved with the death of the Nintendo 64 emulator UltraHLE a long time back, but this was never confirmed.

IIRC it got a C&D but a case was never filed in court; the source code turned up eventually anyway.

the bnetd emulator, that let Diablo and StarCraft players not have to pay Blizzard for the privilege of buying the game, though that's a bit different.

Yes there is. If I can emulate Super Mario Odyssey on my PC, I don't need to buy a Nintendo Switch. If it wasn't available there, I'd have to buy a Nintendo Switch to play it. That's a lost sale for Nintendo. You could argue that I wasn't going to buy a switch anyway, but then we're getting too into hypotheticals.

This is the same reasoning the music and movie industries use when they go after people downloading music. And contrary to popular opinion, I think it is wrong: if people want to pay, they will pay. Same for movies: if people really wanted to pay for a movie, they would go to a cinema, or stream it after a week or two. But there are also people who would rather jump through hoops than pay for music or movies. And that is not a lost sale, because there was never an intention to buy something in the first place.

Music isn't video games or movies, and is experienced differently, so while there are similarities, it's not the same because they aren't the same thing.

Locks keep people honest. Unfortunately, software lockpicks are as easily distributable as the software itself.

I enjoy how you removed the “I think” qualifier which suggested that it’s very possible that you’re right.

I’m quite well read on the DMCA but admit you probably know far more about how Nintendo wields it.

Still, I suggest that it’s a lot more likely that GitHub is going to get sued than you or GP.

Finally, I believe using the legal system to bully independent software developers is, in legal terms, super lame. We are probably in the same side here.

DMCA (at least the takedown-request part) is not really suing someone and not really about making money. It's about getting certain works off the internet.

You are probably more likely to be on the wrong end of a DMCA takedown request as a poor person, since you don't have the resources to fight it, and it's not about recovering damages, just censorship.

We are really losing the plot of what this thread is about here, but: DMCA takedown requests that are ignored, or where the site does not comply with the process, are subject to private civil action. Obviously, a takedown request is distinct from suing someone. And the way the rights holder forces the site to remove the content is under threat of monetary penalties.

> How is it any different when a machine does the same thing?

I think the argument is that the machine is not doing that, or at least there isn't evidence that it is doing that.

Specifically, no evidence that GitHub is doing both 1 and 2 at the same time. There might be cases where it makes trivial changes to code (point 2), but to code that does not meet the threshold of originality. Similarly, there might be cases with copyrighted code where the idea is taken but expressed in such a different way that it is not a straightforward derivative of the expression (keeping in mind you cannot copyright an idea, only its expression; using a similar approach or algorithm is not copyright infringement).

And finally, someone has to demonstrate it is actually happening, not just that it could in theory. Generally courts don't punish people for future crimes they haven't committed yet (sometimes you can get in trouble for being reckless even if nothing bad happens, but I don't think that applies to copyright infringement).

No clue.

But what if the generative AI were used to create music instead of code would the court have ruled differently?


In 2015, a federal judge ordered Thicke & Pharrell to pay 50% of the proceeds of "Blurred Lines" to the Marvin Gaye estate for it being "too similar" to the song "Got to Give It Up".

Comparison and commentary: https://youtu.be/7_UiQueteN4?si=SkClbyBMOcucigRm

Comparison of both songs: https://youtu.be/ziz9HW2ZmmY?si=3_VZzfoLT-NrozoK

Regardless of the details here, it's become quite clear that the judicial system is for corporations. It doesn't matter whether they win, lose, or settle, as they win regardless, since the monetary benefits of what got them in court in the first place far outweigh any punishment or settlement cost.

You probably do this all the time. Forget memorizing but undoubtedly you've read code, learned from it, and then likely reproduced similar code. Probably nothing terribly important, just a function here or there. Maybe even reproduced something you did for a previous employer.

arr.sort((a, b) => a - b);

comes to mind. I bet most js devs have written this verbatim.

The machine alone doesn't do anything. The user and machine together constitute a larger system, and with autocomplete, the user is in charge. What's the user's intent?

I suspect that a lot of copyright violations are enabled by cut-and-paste and screenshot-taking functionality, and maybe we need to be careful with autocomplete, too? It's the user's responsibility to avoid this. We should be careful using our tools. Do users take enough care in this case? Is it possible to take enough care while still using CoPilot?

I've switched from CoPilot to Cody, but I use them the same way, to write my code. There's no particular reason to use CoPilot's output verbatim and lots of good reasons not to. By the time I've adapted it to my code base and code style and refactored it to hell and back, it's an expression of how I want to solve a problem, and I'm pretty confident claiming ownership.

Is that confidence misplaced? Are other people more careless?

> The machine alone doesn't do anything.

By the same token, the machine alone can't download pirated movies. Yet the sites hosting those movies are targeted as the infringers.

There's a point at which foisting this responsibility on the users is simply socializing losses. Ultimately Copilot is the one serving the code up - regardless of the user's request. If the user then goes on to republish that work as their own it becomes two mistakes. It'll be interesting to see if any lawyers are capable of articulating that well enough in any of these lawsuits.

> Is that confidence misplaced? Are other people more careless?

I would say yes, for two reasons. One is that using code of unknown provenance means you're opening yourself to unknown legal risks. The second is if you're rewriting it fully (so as not to run afoul of easily spotted copyright) that's not actually "clean room" and you're still open to problems. I'd also wonder what the point of using a code writing LLM is anyways if you're doing all the authorship yourself. It seems like doing double the work.

It is a lot of work to do a lot of rewrites, but it’s noncommercial and I’m not in a hurry. And autocomplete is still pretty useful.

>I assume that I would get my ass kicked legally speaking.

Why? This is no different than copy pasting and modifying a bit of code from some documentation/other project/tutorial/SO. Surely if that were a basis for copyright infringement most semi-large software projects would be infringing on copyright.

I don't think anyone here should be willing to open the can of worms that is copy-pasting small snippets of code and modifying them.

The judge seems to argue that the non-identical copies are at issue here and that they only happen under contrived circumstances. My moral opinion is that this is irrelevant and that even the defendant is the wrong party. Even verbatim copies of code snippets shouldn't be copyright infringement, and suing the company providing the AI is wrong to begin with, as the AI or its provider cannot possibly be the one to infringe.

I don't think it works that way. During the course of your professional career as a developer you change jobs. And let's say that at every job you create APIs. Besides the particular functions those APIs provide, the API code itself (how you interact with clients, databases etc.) will be pretty much the same as whatever you did at previous jobs. Does this constitute copyright infringement, or is it just experience?

My analogy is that if Copilot doesn't provide 100% code from another repository it is OK to be used by other people trained with code available on GitHub.

It would. And this is where some legislation "in the spirit of" would have helped. So Microsoft's huge legal arm can't just wiggle their way out on technicalities. Clearly, the law is not prepared to face the challenge of copyright violations on the scale created by the LLMs.

I also think it's not just copyright. It's simply not right to create a product on top of the collective work of all open source developers, monetize it on the absurd scale Microsoft operates at, and never ever credit the original creators.

Why stop there? Extrapolate that thought, keep generating more variants of the code, claim copyright, and seek rent from other people doing the same thing. To extrapolate full circle, there would be a business opportunity to generate as many variants as possible for the original author, to prevent all this from happening.

As long as we're not required to register copyright there's no reason to think the above will play out. International copyright agreements are not limited to verbatim copies only.

> Why stop there? Extrapolate that thought, keep generating more variants of the code, claim copyright, and seek rent from other people doing the same thing. To extrapolate full circle, there would be a business opportunity to generate as many variants as possible for the original author, to prevent all this from happening.

This has already been done[1] in music, though in their case they released them to the public domain. Admittedly I think that was more of a protest than anything.

[1]: https://www.vice.com/en/article/wxepzw/musicians-algorithmic...

You are taking the plaintiff's statement as is, which is wrong. You can blame the media that didn't make it clear that it was a statement from the plaintiff.

> I assume that I would get my ass kicked legally speaking. That reads to me exactly like deliberate copyright infringement with willful obfuscation of my infringement.

It looks like wilful obfuscation because the obfuscation is so simplistic. But as the obfuscation gets increasingly sophisticated, it becomes ever harder to distinguish wilful obfuscation from genuine originality.

> But sufficiently complex obfuscation of infringement is very hard to distinguish from genuine originality.

for the purposes of copyright, originality is not required, just different expressions. It's ideas (aka, patent) that require originality.

The 'sufficiently complex obfuscation' is exactly what people's brains go through when they learn and reproduce what they learnt in a different context.

I argue that AI-training can be considered to be doing the same.

Some different scenarios:

(1) You leave your employer, don’t take any code with you, start your own company, reimplement your ex-employer’s product from scratch, but you do it in a very different way (different language, different design choices, different tech stack, different architecture)

(2) You leave your employer, take their code with you, start your own company, make some superficial changes to their code to obscure your theft but the copying is obvious to anyone who scratches the surface

(3) You leave your employer, take their code with you, start your own company, start very heavily manually refactoring their code, within a few months it looks completely different, very difficult to distinguish from (1) unless you have evidence of the process of its creation

(4) You leave your employer, take their code with you, start your own company, download some “infringement obfuscation AI agent” from the Internet and give it your employer’s codebase, within a few hours it has transformed it into something difficult to distinguish from (1) if you didn’t know the history

(1) is unlikely to be held to be infringing. (2) is rather obviously going to be held to be infringing. But what about (3)? IANAL, but I suspect if you admitted that is how you did it, a judge would be unlikely to be very sympathetic. Your best hope would be to insist you actually did (1) instead. And then the outcome of the case might come down to whether the judge/jury believes your claim you actually did (1), or the plaintiff/prosecution’s claim you did (3).

And (4) is basically just (3) with AI to make it a lot faster and quicker. Such an agent likely doesn’t exist yet, but it could happen.

Timing is obviously a factor. If you leave your employer and launch a clone of their app the next week, everyone is going to think either you stole their code, or you were moonlighting on writing it (in which case they may legally own it anyway). If it takes you 12 months, it becomes more believable you wrote it from scratch. But if someone uses AI to launder code theft, maybe they can build the “clone” in a few days or weeks, and then spend a few months relaxing and recharging before going public with it

Numbers 2, 3, & 4 are all illegal because they start with an illegal action.

If I find a dollar on the sidewalk and put it in my wallet, is that stealing? If I punch a man getting change at a hotdog stand, and a dollar falls on the sidewalk, and then I put that in my wallet, is that stealing?

It doesn't matter what the scenario is after you stole code from your former employer, all actions are poisoned after.

Although the question is - obviously the ex-employee is likely to be found guilty of copyright infringement (civilly or criminally or both). But what is the copyright status of the resulting work? Does its infringing origins condemn it to always be infringing? Or at some point if it is refactored/rewritten enough it ceases to so be?

Imagine the ex-employee open sources it, and I’m an innocent third party using that code base, ignorant of its unlawful origins. Am I infringing their ex-employers copyright (even if unintentionally)? For (2), obviously “yes”. But what about (3) or (4)?

I agree. I don’t see the difference.

That’s the entire reason “clean room reverse engineering” is done.

Using nothing but the binary itself, work out how things are done. Making sure that the reverse engineers don’t even have access to any material that could look like it came from the other organization in question. And that it is provable.

How is it any different? You have no money, and Microsoft has plenty. The problem with this is that it will give huge leverage to rich companies over poor ones, because the rich can steal (memorize with AI) anything, including music.

It seems the total disregard that the tech community showed toward copyright when it was artists losing out has come back to bite. Face-eating leopards, etc.

The actual answer here, regardless of a court ruling, is that you'd go broke if anyone big enough tried to go after you for it.

Legal protections for source code are still pretty fuzzy, understandably so given how comparatively new the industry is. That doesn't stop lawyers from racking up huge fees though, it actually helps because they need so much more prep time to debate a case that is so unclear and/or lacking precedent.

> How is it any different when a machine does the same thing?

Because intent matters in the law. If you intended to reproduce copyrighted code verbatim but tried to hide your activity with a few tweaks, that's a very different thing from using a tool which occasionally reproduces copyrighted code by accident but clearly was not designed for that purpose, and much more often than not outputs transformative works.

> clearly was not designed for that purpose,

I'm not aware of evidence that supports that claim. If I ask ChatGPT "Give me a recipe for squirrel lemon stew" and it so happens that one person did write a recipe for that exact thing on the Internet, then I would expect that the most accurate, truthful response would be that exact recipe. Anything else would essentially be hallucination.

Recipes are not copyrightable for that exact reason.

Substitute recipe for literally any other piece of unique information.

Copyright doesn't apply to unique pieces of information. Copyright applies to unique expressions. You can't copyright a fact.

I think you are misconceiving, then, how LLMs work / what they are.

You can certainly try to hit a nail with a screw driver, but that doesn't make the screw driver a hammer.

As I understand it, LLMs are intended to answer questions as "truthfully" as they can. Their understanding of truth comes from the corpus they are trained on. If you ask a question where the corpus happens to have something very close to that question and its answer, I would expect the LLM to burp up that answer. Anything less would be hallucination.

Of course, if I ask a question that isn't as well served by the corpus, it has to do its best to interpolate an answer from what it knows.

But ultimately its job is to extract information from a corpus and serve it up with as much semantic fidelity to the original corpus as possible. If I ask how many moons Earth has, it should say "one". If I ask it what the third line of Poe's "The Raven" is, it should say "While I nodded, nearly napping, suddenly there came a tapping,". Anything else is wrong.

If you ask it a specific enough question where only a tiny corner of its corpus is relevant, I would expect it to end up either reproducing the possibly copyright piece of that corpus or, perhaps worse, cough up some bullshit because it's trying to avoid overfitting.

(I'm ignoring for the moment LLM use cases like image synthesis where you want it to hallucinate to be "creative".)

I get that's what you and a lot of people want it to be, but it isn't what they are. They are quite literally probabilistic text generation engines. Let's emphasise that: the output is produced randomly by sampling from distributions, or in simple terms, like rolling dice. In a concrete sense it is non-deterministic. Even if an exact answer is in the corpus, its output is not going to be that answer, but the most probable answer from all the text in the corpus. If the one answer that exactly matches contradicts the weight of other, less exact answers, you won't see it.

And you probably wouldn't want to - if I ask if donuts are radioactive and one person explicitly said that on the internet, you probably aren't going to tell me you want it to spit out that answer just because it exactly matches what you asked. You want it to learn from the overwhelming corpus of related knowledge that says donuts are food, people routinely eat them, etc., and tell you they aren't radioactive.
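To make the dice-rolling concrete, here is a toy sketch of temperature sampling from a next-token distribution (plain JS; the function name and logits are invented for illustration, this is not any real model's code):

```javascript
// Toy illustration of the sampling step described above. A model emits
// "logits" (unnormalized scores) for each candidate token; sampling
// converts them to probabilities and then rolls the dice.
function sampleToken(logits, temperature = 1.0) {
  const scaled = logits.map(l => l / temperature);
  const max = Math.max(...scaled);              // subtract max for numeric stability
  const exps = scaled.map(l => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map(e => e / sum);         // softmax over tokens
  let r = Math.random();                        // the dice roll
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;
}
```

At low temperature the highest-scoring token nearly always wins; at higher temperatures less likely tokens get sampled too. Either way, one exactly matching training example is just one contribution to those scores, which is the point above.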

They are all hallucinations. Calling lies hallucinations and truths normal output is nonsense.

Perfect analogy.

Not in copyright. The work speaks for itself, and the function of code is not a copyrightable aspect.

The intent of the work can matter when determining if de minimis applies as well as fair use.

Part of my point is that fair use doesn't apply.

Training a model doesn't involve reproducing a copyrighted work, preparing a derivative work, distributing that work, or performing that work.

Fair use isn't required because none of the exclusive rights afforded by copyright apply.

It's equally plausible to say you don't intend to reproduce copyrighted code verbatim but occasionally do so given either a sufficiently specific prompt or because the reproduced code is so generic that it probably gets rewritten a hundred times a day because that's how people learned to do basic things from books or documentation or their education.

Um, the entire intent of these "AI" systems is explicitly to reproduce copyrighted work with mechanical changes to make it not appear to be a verbatim copy.

That is the whole purpose and mechanism by which they operate.

Also the intent does not matter under law - not intending to break the law is not a defense if you break the law. Not intending to take someone's property doesn't mean it becomes your property. You might get less penalties and/or charges, due to intent (the obvious examples being murder vs manslaughter, etc).

But here we have an entire ecosystem where the model is "scan copyrighted material" followed by "regurgitate that material with mechanical changes to fit the surrounding context and to appear to be 'new' content".

Moreover, given that this 'new' code is just a regurgitation of existing code with mutations to make it appear to fit the context and not be directly identical to the existing code, that 'new' code cannot be subject to copyright (you can't claim copyright to something you did not create, copyright does not protect the output of mechanical or automatic transformations of other copyrighted content, and copyright does not protect the result of "natural processes", e.g. 'I asked a statistical model to give me a statistically plausible sequence of tokens and it did').

So in the best case scenario - the one where the copyright-laundering-as-a-service tool is not treated as just that - any code it produces is not protectable by copyright, and anyone can just copy "your work" without the license. And (because you've said that if you weren't intending to violate copyright it's OK) they can say they could not distinguish the non-copyright-protected work from the protected work and assumed that therefore none of it was subject to copyright. To be super sure they weren't violating any of your copyrights, they then ran an "AI tool" to make the names better and better suit your style.

I am so sick of these arguments where people spout nonsense about "AI" systems magically "understanding" or "knowing" anything - they are very expensive statistical models that produce statistically plausible strings of text, by a combination of copying the text of others wholesale and filling the remaining space with bullshit that for basic tasks is often correct enough, and for anything else is wrong - because again, they're just producing plausible sequences of tokens and have no understanding of anything beyond that.

To be very very very clear: if an AI system "understood" anything it was doing, it would not need to ingest essentially all the text that anyone has ever written just to produce content that is at best only locally coherent, and that is frequently incorrect in more or less every domain to which it is applied.

Take code completion (as in this case): Developers can write code without essentially reading all the code that has ever existed just so that they can write basic code, because developers understand code. Developers don't intermingle random unrelated and non-present variables or functions in their code as they write, because they understand what variables are and therefore they can't use non-existent ones.

"AI" on the other hand required more power than many countries to "learn" by reading as much as possible of all code ever written, and then produces nonsense output for anything complex because it's still just generating a string of tokens that is plausible according to its statistical model. The result of these AIs is essentially binary: either it has in effect been asked to produce code that does something that was in its training corpus and can be copied essentially verbatim, with a transformation pass to make it fit, or it's not in the training corpus and you get random and generally incorrect code - hopefully wrong enough that it fails to build, because they're also good at generating code that looks plausible but only fails at runtime, because 'plausible sequence of tokens' often overlaps with 'things a compiler will accept'.

I actually once tracked this claim down in the case of stable diffusion.

I concluded that it was just completely impossible for a properly trained stable diffusion model to reproduce the works it was trained on.

The SD model easily fits on a typical USB stick, and comfortably in the memory of a modern consumer GPU.

The training corpus for SD is a pretty large chunk of image data on the internet. That absolutely does not fit in GPU memory - by several orders of magnitude.

No form of compression known to man would be able to get it that small. People smarter than me say it's mathematically not even possible.

Now for closed models, you might be able to argue something else is going on and they're sneakily not training neural nets or something. But the open models we can inspect? Definitely not.

Modern ML/AI models are doing Something Else. We can argue what that Something Else is, but it's not (normally) holding copies of all the things used to train them.

I think this argument starts to break down for the (gigantic) GPTs where the model size is a lot closer to the size of the training corpus.

Thinking in terms of compression, the compression in generative AI models is lossy. The mathematical bounds on compression only apply to lossless compression. Keeping in mind that a small fraction of the training corpus is presented to the training algorithm multiple times, it's not absurd to suggest that these works exist inside the algorithm in a recallable form. Hence the NYT's lawyers being able to write prompts that recall large chunks of NYT articles verbatim.

Well, certainly up to GPT-3 that would seem a little odd. Models of somewhat similar capability are not THAT big, really. Eg:

  $ ollama list                            
  NAME                    ID              SIZE    MODIFIED     
  yi:34b                  ff94bc7c1b7a    19 GB   7 days ago  
  mistral:latest          61e88e884507    4.1 GB  2 months ago
  mixtral:8x22b           bf88270436ed    79 GB   2 months ago
  llama3:70b              be39eb53a197    39 GB   2 months ago
  phi3:latest             a2c89ceaed85    2.3 GB  2 months ago
  dolphin-mistral:latest  5dc8c5a2be65    4.1 GB  2 months ago
  yarn-mistral:7b-128k    6511b83c33d5    4.1 GB  2 months ago
  yarn-mistral:latest     8e9c368a0ae4    4.1 GB  2 months ago
  llama3:latest           a6990ed6be41    4.7 GB  2 months ago

For comparison, here's some stable diffusion checkpoints.

  ComfyUI/models/checkpoints $ du -h .
  6.5G    breakdomainxl_v03d.safetensors
  6.5G    dreamshaperXL10_alpha2Xl10.safetensors 
  6.5G    sd_xl_base_1.0.safetensors
  5.7G    sd_xl_refiner_1.0.safetensors

And I seem to recall there are some theoretical lower bounds on even lossy compression. Some quick back of the envelope fermi estimation gets me a hard lower bound of 5TB for "all the images on the internet"; but I'm not quite confident enough in my math to quite back that up right here and now.
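For what it's worth, the arithmetic behind the argument can be sketched directly (all figures here are rough public numbers I'm assuming, not established in this thread):

```javascript
// Back-of-envelope: bytes of model weights per training image.
// Assumed figures: an SDXL-class checkpoint of ~6.5 GB (as in the
// listing above), trained on a LAION-scale set of ~2 billion images.
const modelBytes = 6.5e9;      // ~6.5 GB checkpoint
const trainingImages = 2e9;    // ~2 billion images (assumption)
const bytesPerImage = modelBytes / trainingImages;

console.log(bytesPerImage);    // on the order of 3 bytes of weights per image
```

A few bytes per image plainly cannot store the images themselves, so whatever the model retains, it is not a per-image archive - though individual overfit examples are still possible.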

> And I seem to recall there are some theoretical lower bounds on even lossy compression.

I'm not sure where your math is coming from, and it seems trivially wrong. A single black pixel is a very lossy compression of every image on the internet. A picture of the Facebook logo is a slightly-less-lossy compression of every picture on the internet (the Facebook logo shows up on a lot of websites). I would believe that you can get a bound on lossy compression of a given quality (whatever quality means) only if you assume that there is some balance of the images in the compressed representation. There are a lot of assumptions there, and we know for a fact that the text fed to the GPTs to train them was presented in an unbalanced way.

In fact, if you look at the paper "textbooks are all you need" (https://arxiv.org/pdf/2306.11644) you can see that presenting a very limited set of information to an LLM gets a decent result. The remaining 6 trillion tokens in the training set are sort of icing on the cake.

Ok, that's a really low lower bound.

I think you'll agree that it would be a bit absurd to threaten legal action against someone for storing a single black pixel.

OTOH Someone might be tempted to start a lawsuit if they believe their image is somehow actually stored in a particular data file.

For this to be a viable class action lawsuit to pursue, I think you'd have to subscribe to the belief that it's a form of compression where if you store n images, you're also able to get n images back. Else very few people would have actual standing to sue.

I think that when you speak in terms of images, for a viable lawsuit, you need to have a form of compression that can recall n (n >= 1) images from compressing m (m >= n) images. Presumably n is very large for LLMs or image models, even though m is orders of magnitude larger. I do not think that your form of compression needs to be able to get all m images back. By forcing m = n in your argument, you are forcing some idea of uniformity of treatment in the compression, which we know is not the case.

The black pixel won't get you sued, but the Facebook logo example I used could get you sued. Specifically by Facebook. There is an image (n = 1) that is substantially similar to the output of your compression algorithm.

That is sort of what Getty's lawsuit alleges. Not that every picture is recallable from an LLM, but that several images that are substantially similar to Getty's images are recallable. The same goes with the NYT's lawsuit and OpenAI.

Thank you for talking with me!

I do realize the benefits of the 'compression' model of ML. Sometimes you can even use compression directly, like here: https://arxiv.org/abs/cs/0312044 .

I suppose you're right that you only need a few substantively similar outputs to potentially get sued already. (depending on who's scrutinizing you).

While talking with you, it occurred to me that so far we've ignored the output set o, which is the set of all images output by -say- stable diffusion. n can then be defined as n = m ∩ o .

And we know m is much larger than n, and o is theoretically practically infinite [1] (you can generate as many unique images as you like) , so o >> m >> n . [2]

Really already at this point I think calling SD a compression algorithm might be just a little odd. It doesn't look like the goal is compression at all. Especially when the authors seem to treat n like a bug ('overfit'), and keep trying to shrink it.

That's before looking back at the "compression ratio" and "loss ratio" of this algorithm, so maybe in future I can save myself some maths. It's an interesting approach to the argument I might try more in future. (Thank you for helping me to think in this direction)

* I think in the case of the Getty lawsuit they might have a bit of a point, if the model might have been overfitted on some of their images. Though I wonder if in some cases the model merely added Getty watermarks to novel images. I'm pretty sure that will have had something to do with setting Getty off.

* I am deeply suspicious of the NYT case. There's a large chunk of examples where they used ChatGPT to browse their own website. This makes me wonder if the rest of the examples are only slightly more subtle. IIRC I couldn't replicate them trivially. (YMMV, we can revisit if you're really interested)

[1] However, in practice there appear to be limits to floating point precision.

[2] I'm using >> as "much greater than"

> Also the intent does not matter under law - not intending to break the law is not a defense if you break the law

Intent frequently matters a great deal when applying laws.

In the specific area of copyright law, it doesn't itself make the use non infringing, but it can absolutely impact the damages or a fair use argument.

Great point, I wonder how the court will look at OpenAI's internal discussions around utilizing copyrighted materials.

If you tell a programmer to implement a function foo(a, b) then there are actually only a tiny number of ways to do that, semantically speaking, for any given foo. The number of options narrows quickly as the programmer implementing it gets more competent.

Choosing function signatures is an art form but after that "copying" is hard to judge.
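A hypothetical illustration of that narrowing (the function and its two bodies are invented here): two implementations of the same small, well-specified function that share no identifiers and differ in control flow, yet are semantically identical:

```javascript
// Two textually distinct implementations of "clamp x into [lo, hi]".
// Once the signature and spec are fixed, the room for independent
// "expression" in a function this small is tiny.
const clampA = (x, lo, hi) => Math.min(Math.max(x, lo), hi);

function clampB(value, lower, upper) {
  if (value < lower) return lower;
  if (value > upper) return upper;
  return value;
}
```

Judging whether one of these was "copied" from the other is exactly the kind of question copyright struggles with once function dictates form.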

> a function foo(a, b) then there are actually only a tiny number of ways to do that

I'd argue there are infinite ways to implement any function, just almost all of them are extremely bad.

You would not get your ass kicked legally speaking. Copyright is not that broad. It's not a patent.

> How is it any different when a machine does the same thing?

Literally the bank account behind the action...

It depends on how much tax you are paying, really. If you pay billions in taxes annually, they might see past it. If the company you copied from pays billions in taxes annually, you will go to jail. If this isn't painfully obvious by now...

Doesn't seem that taxes are the deciding factor, seeing as how little the government cares about tax dodging done by corporations and the rich.

Adding to the sibling comments:

First: every human is per se doing that already. We have – to handwave – a "reasonable person" bar to separate violations versus results of learning and new innovation.

Second: You can be a holder of copyright and your creations result in copyrightable artifacts. Anything generated by the program has been held as uncopyrightable.

who gets to copyright claim the various array sorting algorithms then?

Days like this, I wonder what Borges would have made of such questions.

"Pierre Menard, author of redis"

I know from experience that parents are aggressively pushing their children into STEM to maximize their chances of being economically secure, but, I really feel that we need a generation of philosophers and humanists to sift through the issues that our technology is raising. What does it mean to know something? What does authorship mean? Is a translated work the same as the original? Borges, Steiner, and the rest have as much to contribute as Ellison, Zuckerberg, and Altman.

Rules for thee but not for me (rich companies). Think of the shareholders!

> I assume that I would get my ass kicked legally speaking.

Maybe, maybe not. It's not as simple as you made it out to be. If you write a book with lots of stuff and you got inspiration from other books, and even put in phrases wholesale, but modified to use your own character names instead, I'm not convinced you would lose.

The court would look at the work as a whole, not single pieces of it.

They would also check if you are just copying things verbatim, or if you memorize a pattern and emit the same pattern - for example look at lawsuits about copying music, where they'll claim this part of the music is the same as that part.

It's really not as cut and dry as you make it out to be.

> The anonymous programmers have repeatedly insisted Copilot could, and would, generate code identical to what they had written themselves, which is a key pillar of their lawsuit since there is an identicality requirement for their DMCA claim. However, Judge Tigar earlier ruled the plaintiffs hadn't actually demonstrated instances of this happening, which prompted a dismissal of the claim with a chance to amend it.

It sounds fair from how the article describes it

Huh. There have definitely been well publicized examples of this happening, like the quake inverse square root

You can't copyright a mathematical operation. Only a particular implementation of it, and even then it may not be copyrightable if it's a straightforward and obvious implementation.

That said the implementation doesn't appear to be totally trivial and copilot apparently even copies the comments which are almost certainly copyrightable in themselves.

https://x.com/StefanKarpinski/status/1410971061181681674 https://github.com/id-Software/Quake-III-Arena/blob/dbe4ddb1...
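For readers unfamiliar with the snippet being argued over: the technique approximates 1/sqrt(x) with a bit-level reinterpretation of the float plus one Newton-Raphson step. A rough Python translation (my own sketch of the well-known algorithm, deliberately without the famous comments):

```python
import struct

def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) via the bit hack popularized by Quake III."""
    # Reinterpret the float's 32 bits as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The "magic constant" minus half the bits gives a first guess,
    # roughly halving and negating the exponent.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration sharpens the estimate.
    return y * (1.5 - 0.5 * x * y * y)
```

After the single Newton step, `fast_inverse_sqrt(4.0)` comes out near 0.499, within the roughly 0.2% relative error the trick is known for.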

However, a Twitter post on its own isn't evidence a court will accept. You would need the original poster to testify that what is seen in the post is actually what he got from Copilot and not just a meme or joke that he made.

Also, the plaintiffs in this case don't include id Software, and there is some evidence that id Software actually stole the fast inverse sqrt code from 3dfx, so they might not want to bring a claim here anyway.

Not sure where you thought I said you could copyright a mathematical operation; I was clearly referring to the implementation, given the mention of "quake".

When it was reported, I was able to reproduce it myself.

Weren't people getting it to spit out valid windows keys also?

GPT-4 regurgitated almost full NYT articles verbatim. It's strange that this lawsuit seems so amateurish that they failed to properly demonstrate the reproduction. Though of course it might hinge on legal technicalities that we naively assume are trivial but are not.

I read that case.

Absolutely there were a few outliers where a judge might want to look more closely. I'd be surprised if, under scrutiny, there turned out to be no issues whatsoever that OpenAI overlooked.

However, it seemed to me that over half of the NYT complaints were examples of using the then rather new ChatGPT web browsing feature to browse their own website, and then claiming surprise when it did just what you'd expect a web browsing feature to do.

> You can't copyright a mathematical operation.

I agree from a philosophical POV, but this is clearly not the case in law.


The second step is to remove from consideration aspects of the program which are not legally protectable by copyright. The analysis is done at each level of abstraction identified in the previous step. The court identifies three factors to consider during this step: elements dictated by efficiency, elements dictated by external factors, and elements taken from the public domain.


It's even simpler: id Software is owned by ZeniMax, and ZeniMax is owned by Microsoft. Who would they even sue?

That's not how that works.

All the plaintiffs would need to do is provide evidence that copyrighted code was produced verbatim. This includes showing the copyrighted code on GitHub, showing Copilot reproducing the code (including how you manipulated Copilot to do it), showing that they match, and showing that the setting to turn off reproduction of public code is set.

It makes no difference who owns the copyrighted code, it need only be shown that copilot is violating copyright. Microsoft can't say "uhh that doesn't count" or whatever simply because they own a company that owns a company that owns copyright on the code.

"Trust no one... even yourself"

Algorithms can and are definitely patented in utility patents in the US.

It reads like the judge required them to show it happened to their code, not to any code in general. That's a much higher bar. There are thousands of instances of fast inverse square root in the training data but only one copy of your random GitHub repositories. Getting the model to reproduce your code verbatim might be possible for all we know, but it isn't trivial.

>It reads like the judge required them to show it happened to their code, not to any code in general.

Rightly so, you have to show some sort of damage to sue someone, not just theoretical damages.

of course for standing. but it seems like with the right plaintiffs this could have gone forward

But that's like saying my lawsuit alleging Taylor Swift copied my song could have gone forward with a plaintiff who had, years ago, written a song similar to what Ms. Swift recorded recently. That's true, but perhaps the lesson here is that damages that hinge on statistically rare victims should not be extrapolated out to provide windfalls for people who have not been harmed.

I think that is a weak analogy, and also unnecessary because it is already clear what I am saying.

If it only copies code that has been widely stolen already then that's a lot weaker of a case and is something they can do a lot to prevent on a technical level.

Code that has been copied widely != code that has been widely stolen.

Open source licenses allow sharing under certain conditions.

It could be forced, of course. I can republish my copyrighted code millions of times all over the internet. Next time they retrain there is a good chance my code will end up in their corpus, maybe many many times, reinforcing it statistically.

The article mentions that GitHub Copilot has been trained to avoid directly copying specific cases it knows, and that although you can get it to spit out copyrighted code by prefixing the copyrighted code as a starting point, in normal use cases it's quite rare.

yes, but you need to show that it happened _in your case_, not that it can happen in general.

Fast inverse square root is now part of the public domain.

Also, even if this weren’t the case you can’t sue for damages to other people (they’d need to bring their own suit)

Is the particular implementation that the model spits out 70+ years old?


But copilot distributed it (allegedly) without complying with the GPL license (which requires any distribution to be accompanied by the license) so it still would be an instance of copyright infringement. https://x.com/StefanKarpinski/status/1410971061181681674

Has it really already been 70 years since John Carmack died?

Ah, you're right. I was wrong to say "public domain".

It would be more correct to say Quake III Arena was released to the public as free software under the GPLv2 license.

There is a large gap between public domain and GPL. For starters if Copilot is emitting GPL code for closed source projects... that's copyright infringement.

That would be license infringement, not copyright infringement.

Copyright infringement is emitting the code. The license gives you permission to emit the code, under certain conditions. If you don't meet the conditions, it's still copyright infringement like before.


Copyright infringement could be emitting the code in a manner that exceeds fair use.

The license gives you permission to utilize the code in certain ways. If Copilot gives you GPLed code that you then put into your closed source project, it is you who has infringed, not Copilot.

> If you don't meet the conditions, it's still copyright infringement like before.

Licensing and copyright are two separate things. Neither has anything to do with the other. You can be in compliance with copyright, but out of license compliance, you can be the reverse. But nothing about copyright infringement here is tied to licensing.

To be clear: I am a person who trashed his Reddit account when they said they were going to license that text for training (trashed in the sense of "ran a script that scrubbed each of my comments first with nonsense edits, then deleted them"). I am a photographer who has significant concerns with training other models on people's creative output. I have similar concerns about Copilot.

But confusing licensing and copyright here only muddies waters.

Without adhering to the conditions of the GPL you have no license to redistribute the code and are therefore infringing the copyright of the author.

Apparently, the court disagrees with you, and doesn't find "emitting" the code a copyright infringement.

It'd be a long bow to draw to say that what is akin to a search result of a snippet of code is "redistributing a software package".

Where it gets ethically dubious is that:

1. The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.

2. LLMs are prone to paraphrasing. Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it. The copyright filter is only a legal protection, not a practical protection against the issue of copyright infringement.
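To make the limitation concrete: a verbatim filter of the sort being described can be as simple as matching output against hashed token windows ("shingles") of the training set. A minimal sketch (the tokenizer and window size here are my own assumptions, not how Copilot's filter actually works):

```python
import hashlib

def shingles(code: str, n: int = 8):
    """Yield a hash of every n-token window of the code."""
    tokens = code.split()  # crude whitespace tokenizer, for illustration only
    for i in range(max(len(tokens) - n + 1, 0)):
        window = " ".join(tokens[i:i + n])
        yield hashlib.sha1(window.encode()).hexdigest()

def build_index(training_snippets, n: int = 8) -> set:
    """Index every shingle seen anywhere in the training data."""
    index = set()
    for snippet in training_snippets:
        index.update(shingles(snippet, n))
    return index

def looks_verbatim(output: str, index: set, n: int = 8) -> bool:
    """Flag output that shares any n-token run with the training set."""
    return any(h in index for h in shingles(output, n))
```

Note what this catches and what it doesn't: an exact token run is flagged, but rename one identifier in each window and the output sails straight through, which is exactly the paraphrasing gap being described.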

Everyone who knows how these systems work understands this. The Copilot FAQ to this day claims that you should run copyright scanning tools on your codebase because your developers might "copy code from an online source or library".

GitHub has its own research from 2021 showing that these tools do indeed copy their training data occasionally: https://github.blog/2021-06-30-github-copilot-research-recit...

They clearly know the problem is real. Their own research agreed; their FAQs and legal documents are carefully phrased to avoid admitting it. But rather than owning up to the problem, it's "Ner ner ner ner ner, you can't prove it to a boomer judge".

> The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.

More than that: they claimed it wasn't possible before adding the filter, a filter that exists to filter out the thing they said wasn't possible. This doesn't help me trust anything else they might say or have already said.

My take on that was always: if it isn't possible, then why are MS not training the AIs on their internal code (like that for Office, in the case of MS with their Copilot product) as well as public code? There must be good examples for it to learn from in there, unless of course they think public code is massively better than their internal works.

How do you know they aren’t training it on their internal code?

Since you really need to work hard to make the AI spit out anything verbatim, and you have no knowledge of their internal code, how could you ever prove or deny it?

> How do you know they aren’t training it on their internal code?

Because if they were, they would have said.

It would be an excellent answer to the concerns being discussed here: “we are so sure that there is nothing to worry about in this regard, that we are using our own code as well as the stuff we've schlepped from github and other public sources”.

> Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it.

Actually, it does. The production of the output is what matters here.

If you copy someone else's copyrighted work and then rearrange a few lines and rename a few things, you're probably still infringing.

For a book or a song, for sure, although that isn't really punished. Search the drama surrounding a popular YA author in the '10s, Cassandra Claire. For code, since you can only copy the form and not the function, that might actually be enough.

People do clean room implementations because of paranoia, not because it's actually a necessary requirement.

Moving a few things around means your internal process already involved copyright infringement.

Probably not. Copyright infringement in the manner we're talking about presumes you already have license to access the code (like how Github does). What you don't have license to do is distribute the code -- entirely or not without meeting certain conditions. You're perfectly free to do whatever naughty things you want with the code, sans run it, in private.

The literal act of making modifications isn't infringement until you distribute those modifications -- and we're talking about a situation where you've changed the code enough that it isn't considered a derivative work anymore (apparently) so that's kosher.

First the case would be dismissed if Copilot had permission to make copies. Clearly they didn’t. Copyright cares about copies, for profit distribution just makes this worse.

> you already have license to access the code

This isn't access; that occurs before the AI is trained. It's access > make a copy for training > the AI does lossy compression > a request unzips that compression, making a new copy > the process fuzzes the copy so it's not so obvious > a derivative work is sent to users.

Clearly Copilot had permission to make (unmodified) copies, the same way Github's webserver had permission to make (unmodified) copies. The lawsuit is about making partial copies without attribution.

GitHub's terms of service (TOS), in my non-lawyerly opinion, clearly states the license for uploaded works granted to them by users doesn't cover using the data to train an LLM or any kind of model beyond those used to improve the hosting service:

>You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time

>This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.


I think the important questions are (1) whether "the Service" includes Copilot, and (2) whether GitHub is selling users' content with Copilot.

For (1), I'm unhappy to admit Copilot probably does fall under "the Service," which is nebulously defined as "applications, software, products, and services provided by GitHub." But I'll still say that users could not have agreed to this use while GitHub was training the Copilot model but hadn't yet announced it. At that time, a reasonable user would've believed GitHub's services only covered repository hosting, user accounts, and the extra features attached to those (issue trackers, organizations, etc).

GitHub could defend themselves on point (2) by saying they aren't selling the code, instead selling a product that used the code as input. But does that differ much from selling an online service that relies on running user code? The code is input for their servers, and it doesn't need to be distributed as part of that questionable service. But it's a clear break from the TOS.

GitHub’s web server is not the same thing as Copilot and needs separate permission.

GitHub didn't just copy open source code; they copied everything without respect to license. As such, attribution, which might have allowed some copying, isn't generally relevant.

Really, a public repo on GitHub doesn't even mean the person uploading it owns the code; if they needed to verify ownership before training, they couldn't have started. Thus, by necessity, they must take the stance that copyright is irrelevant.

If you've copied three lines and rearranged and reworded them, there's little infringement left.

If you copy a whole book and do the same, there’s still lines-3 infringement left.

> 1.

Isn't that akin to destruction of evidence?

Legally? No.

In spirit? ... Probably?

Unlike most LLMs, Github copilot can trivially solve their copyright problem by just using only code they have the right to reproduce.

They have a giant corpus of code tagged with license: SELECT BY license MIT/equivalent and you're done; problem solved, because those licenses explicitly grant permission for this kind of reuse.
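That selection step itself is easy to sketch; the hard part is whether the declared license tag is true, which the replies get into. A minimal sketch, assuming repos carry a declared SPDX-style license tag (field names here are hypothetical, not GitHub's real schema):

```python
# Hypothetical corpus filter: keep only repos whose declared license is
# permissive, and accumulate the attribution notices MIT-style licenses
# still require be preserved alongside the code.
PERMISSIVE = {"mit", "isc", "bsd-2-clause", "bsd-3-clause", "apache-2.0", "unlicense"}

def select_training_repos(repos):
    """Split a repo list into training candidates plus their notices."""
    training, attributions = [], []
    for repo in repos:
        spdx = (repo.get("license") or "").lower()
        if spdx in PERMISSIVE:
            training.append(repo["name"])
            # MIT et al. require keeping the copyright notice with the code.
            attributions.append(f'{repo["name"]} ({spdx}): {repo.get("notice", "")}')
    return training, attributions
```

Accumulating those notices across millions of repos is where the multi-gigabyte attribution file would come from.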

(It's still not very cash money to take open source work for commercial gain without paying the original authors, and there's a humorous question of whether an MIT-only Copilot would need to come with a multi-gigabyte attribution file, but everyone widely agrees it's legal and permitted.)

The only reason you'd hack a filter on top rather than doing the above is if you'd want to hide the copyright problem. It's an objectively worse solution.

> Unlike most LLMs, Github copilot can trivially solve their copyright problem by just using only code they have the right to reproduce.

Absolutely not trivial; in fact, completely impossible by computer alone. You can't determine whether you have the right to reproduce a piece of code just by looking at the code and its tags. *Taps the color-of-your-bits sign.*

* I can fork a GPL project on Github and replace the license file with MIT. Okay to reproduce?

* If I license my project as MIT but it includes code I copied inappropriately and don't have the right to reproduce myself, can Github? (No) This one is why indemnity clauses exist on contracted works.

* I create a git repo for work and select the MIT license but I don't actually own the copyright on that code and so that license is worthless.

There is no difference when it comes to MIT and GPL here. If your model outputs my MIT licensed code, you still need to provide attribution in the form of a copyright notice as required by the MIT license.

Have the copyleft people, or anyone else, produced some boilerplate licenses that explicitly deny use in training models?

I would think it is pretty obviously not.

Is taking away a drunk driver's keys (before they get in the car) destruction of the evidence of their drunk driving?

This is not what I meant. By placing a copyright filter and claiming it never happened (please read the line I was replying to) before the system can be audited, they're indeed taking away the drunk driver's keys, which is a good thing, but also removing the offending car before the police arrive.

In this metaphor, removing the car of someone who was going to drink and drive but didn't is certainly not a crime. Presumably, though, you mean removing the car after drunk driving actually took place, which might be, but that probably depends a lot on whether the person knew and what the intent of the action was.

In the current case, it's unclear if any crime took place at all; it seems clear that the primary intent was to prevent future wrongdoing, not to hide evidence of past wrongdoing. Most importantly, the past version of the app is (presumably) not destroyed. GitHub still has the version of the software without the copyright filter. If relevant and appropriate, the court could order them to produce the original version. It can't be destroying evidence if the evidence was not destroyed.

Yes, sorta. We're talking about software, so a piece of code that does something programmatically isn't like a drunk driver in a car who may cause more accidents; although we aren't sure about that, we take the keys anyway just to be safe. The software would most certainly repeat its routine, because it has been written to do so. That's why I wondered about destruction of evidence: by removing or modifying the software, or placing filters on it, they would prevent it from repeating the wrongdoing, but also take away any means of auditing it to find out what happened and why.

Not in any way I'm aware of, and it would be required if they were served a DMCA notification or cease and desist against a specific prompt.

The people who think Copilot is infringing their copyright would be happy with that, I would think? Unless they take a much stricter definition of fair use than current courts do.

No more so than scanner/printer manufacturers adding tech to prevent you from scanning and printing currency is destruction of evidence that they are in fact producing illegal machines for counterfeiting.

> The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.

Well, if the copyright filter is working, they indeed aren't happening. Putting in safeguards to prevent something from happening doesn't mean you're guilty of it. Putting a railing on a balcony doesn't imply the balcony with the railing is unsafe.

> LLMs are prone to paraphrasing. Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it

Copyright infringement and plagiarism are different things. Stuff can be copyright infringement without being plagiarized, and can be plagiarized without being copyright infringement. The two concepts are similar but should not be conflated, especially in a legal context.

Courts decide based on laws, not on gut feeling about what is "fair".

> They clearly know the problem is real

They know the risk is real. That is not the same thing as saying that they actually committed copyright infringement.

A risk of something happening is not the same as actually doing the thing.

> "Ner ner ner ner ner, you can't prove it to a boomer judge".

It's always a cop-out to assume that they lost the argument because the judge didn't understand. I suspect the judge understood just fine, but the law and the evidence simply weren't on their side.

> Well if the copyright filter is working they indeed aren't happening. Putting in safe gaurds to prevent something from happening doesn't mean you're guilty of it. Putting a railing on a balcony doesn't imply the balcony with railing is unsafe.

Doesn't mean you weren't, at some point, guilty of it, either. It doesn't retcon things.

Sure, which is why we require evidence of wrongdoing. Otherwise it's just a witch hunt.

After all, you yourself probably cannot prove that you didn't commit the same offense at some point in the past. Like Russell's teapot, it's almost always impossible to disprove something like that.

Yeah but I think the main concern in this situation is copilot moving forward, not their past mistakes.

This is so stupid. Going after likeness is doomed to fail against constantly mutating enemies like booming tech companies with infinite resources. And likeness itself isn’t even that big of a deal, and even if you win it’s such a minor case-by-case event that puts an enormous burden of proof on the victims to even get started. If the narrative centers around likeness, they’ve already won.

The main issue, as I see it, is that they took copyrighted material and made new commercial products without compensating (let alone acquiring permission from) the rights holders, ie their suppliers. Specifically, they sneaked a fair use sticker on mass AI training, with neither precedent nor a ruling anywhere. Fair use originates in times before there were even computers. (Imo it’s as outrageous as applying a free-mushroom-picking-on-non-cultivated-land law to justify industrial scale farming on private land.) That’s what should be challenged.
