Your code is not in that thing. That thing has merely read your code and adjuste...

klabb3 · on Nov 3, 2022

> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

This is kinda smug, because it overcomplicates things for no reason, and only serves as a faux technocentric strawman. It just muddies the waters for a sane discussion of the topic, which people can participate in without a CS degree.

The AI models of today are very simple to explain: it's a product built from code (already regulated, produced by the implementors) and source data (usually works that are protected by copyright and produced by other people). It would be a different product if it hadn't used the training data.

The fact that some outputs are similar enough to source data is circumstantial, and not important other than for small snippets. The elephant in the room is the act of using source data to produce the product, and whether the right to decide that lies with the (already copyright protected) creator or not. That's not something to dismiss.

nickelpro · on Nov 3, 2022

It's not something to dismiss, but it is something that has already been addressed: Authors Guild v Google. Google Books is built upon scanning millions of books from libraries without first gaining permission from copyright holders; this was found not to be a violation of copyright.

Building a product on top of copyright works that does not directly distribute those works is legal. More specifically, a computer consuming a copyright work is not a violation of copyright.

TAForObvReasons · on Nov 3, 2022

At the time the suit was launched, Google search would only display snippet views. The very nature presents the attribution to the user, enabling them to separately obtain a license for the content.

This would be more or less analogous to Copilot linking to lines in repositories. If Copilot was doing that, there wouldn't be much outrage.

The fact that they are producing the entire relevant snippet, without attribution and in a way that does not necessitate referencing the source corpus, suggests the transgression is different. It is further amplified by the fact that the output itself is typically integrated in other copyrighted works.

nickelpro · on Nov 4, 2022

Attribution is irrelevant in Authors Guild; the books were not released under open source licenses where attribution is sufficient to meet the licensing terms. Google never sought or obtained licenses from any of the publishers, and the court ruled such a license was not needed as Google's usage of the contents of the books (scanning them to build a product) did not represent a copyright infringement.

Attribution is mentioned in this filing because such attribution would be sufficient to meet the licensing terms for some of the alleged infringements.

It's an irrelevant discussion though, the suit does not make a claim that the training of Copilot was an infringement which is where Authors Guild is a controlling precedent.

couchand · on Nov 4, 2022

Attribution goes directly to factors 1, 3, and 4 of the fair use test.

nickelpro · on Nov 4, 2022

In some contexts it's used to characterize the purpose of the copying, but it's not a consideration that was made in Authors Guild.

klabb3 · on Nov 4, 2022

> Authors Guild v Google. Google Books is built upon scanning millions of books from libraries

I agree it's relevant precedent, but not exactly the same. Libraries are a public good and, more importantly, Google Books references the original works. In short, I don't think that's the final word in all seemingly related cases.

> More specifically, a computer consuming a copyright work is not a violation of copyright.

I don't agree with this way of describing technology, as if humans weren't responsible for operating and designing the technology. Law is concerned with humans and their actions. If you create an autonomous scraper that takes copyrighted works and distributes them, you are (morally) responsible for the act of distributing them, even if you didn't "handle" them or even see them yourself.

Neither of the important aspects – remixing and automation – is novel, but the combination is. That's what we should focus on, instead of treating AI as some separate anthropomorphized entity.

nickelpro · on Nov 4, 2022

Your disagreement and feelings about how copyright and the law should work are valid, but they have very little to do with how copyright is addressed judicially in the United States.

forgotpwd16 · on Nov 4, 2022

>Authors Guild v Google

In which case Google paid some hundred million $ to companies and authors, created a registry collecting revenues and giving them to rightsholders, provided an opt-out for already scanned books, etc. Hey, it doesn't sound that bad for the same thing to happen with Copilot.

yenwodyah · on Nov 3, 2022

But Copilot has been shown to distribute (parts of) the copyrighted works used to create it. That’s the difference.

nickelpro · on Nov 4, 2022

A) No it doesn't, there's nothing in the Copilot model or the plugin that represents or constitutes a reproduction of copyright code being distributed by GH/MS. The allegation is it generates code that constitutes a copyright violation. This distinction is not academic, it's significant, and represents an unexplored area of copyright law.

B) "parts of" copyright works are not themselves sufficient to constitute a copyright violation. The violation must be a substantial reproduction. While it's up to the court to determine if the alleged infringements demonstrated in the suit (I'm sure far more will be submitted if this case moves forward) meet this bar, from what I've seen none of them have.

Historically the bar is pretty high for software, hundreds or thousands of lines depending on use case. A purely mechanical description of an operation is not sufficient for copyright, you cannot copyright an implementation of a matrix transformation in isolation no matter what license you slap on the repo. Recall that the recent Google v Oracle case was litigated over tens of thousands of lines of code and found to be fair use because of the context of those lines.

I've yet to see a demonstrated case of Copilot generating code that is both non-transformative and represents a significant reproduction of the source work.

fsflover · on Nov 4, 2022

> The allegation is it generates code that constitutes a copyright violation.

The weights of the Copilot very likely contain verbatim parts of the copyrighted code, just like in a zip archive. It chooses semi-randomly which parts to show and sometimes breaks copyright by displaying large enough pieces.

https://news.ycombinator.com/item?id=33458603

nickelpro · on Nov 4, 2022

Speculation, and furthermore the model itself isn't distributed to consumers.

xtracto · on Nov 3, 2022

Say you publish a song and copyright it. Then I record it and save it in a .xz format. It's not an MP3, it is not an audio file. Say I split it into N chunks and I share it with N different people. Or with the same people, but I share it on N different dates. Say I charge them $10 a month for doing that, and I don't pay you anything.

Am I violating your copyright? Are you entitled to do that?

To make it funnier: Say instead of the .xz, I "compress" it via π compression [1]. So what I share with you is a pair of π indices and data lengths for each of them, from which you can "reconstruct" the audio. Am I illegally violating your copyrights by sharing that?

[1] https://github.com/philipl/pifs
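
To make the "π index" idea above concrete, here is a minimal sketch. It is an illustration only, not how pifs itself is implemented: it works on strings of decimal digits rather than bytes, and the names `pi_digits`, `encode`, `decode` and the `search_limit` parameter are made up for this example. The digit generator is Gibbons' spigot algorithm.

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi (3, 1, 4, 1, 5, ...) via Gibbons' spigot."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = (
                10 * q, 10 * (r - n * t), t, k,
                (10 * (3 * q + r)) // t - 10 * n, l,
            )
        else:
            q, r, t, k, n, l = (
                q * k, (2 * q + r) * l, t * l, k + 1,
                (q * (7 * k + 2) + r * l) // (t * l), l + 2,
            )


def first_digits(count):
    """Return the first `count` digits of pi as a string, starting with '3'."""
    return "".join(str(d) for d in islice(pi_digits(), count))


def encode(payload, search_limit=1000):
    """'Compress' a digit string to an (offset, length) pair into pi's digits."""
    offset = first_digits(search_limit).find(payload)
    if offset < 0:
        raise ValueError("payload not found within search_limit digits")
    return offset, len(payload)


def decode(offset, length):
    """Rebuild the payload from nothing but the (offset, length) pair."""
    return first_digits(offset + length)[offset:]


pair = encode("592")
print(pair, decode(*pair))
```

What gets shared is only the pair of numbers, yet the payload comes back exactly. For realistic payloads the offset tends to need about as many digits as the data itself, so nothing is really saved; the sketch only shows why "I never sent you the file" does not settle the question being asked.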

Aeolun · on Nov 3, 2022

What you are actually giving people is a set of chords that happen to show up in your song, the machine can suggest an appropriate next chord.

It’s also smart enough to rebuild your song from the chords _if you ask it to_.

varajelle · on Nov 3, 2022

I take your code and I compress it in a tar.gz file. I'll call that file "the model". Then I ask an algorithm (gzip) to infer some code using "the model". The algorithm (gzip) just learned how to code by reading your code. It just happened to have it memorized in its model.
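
A minimal sketch of this thought experiment, using Python's `zlib` (the same DEFLATE compression gzip uses) in place of an actual tar.gz. The sample source, the license header, and the `extract_function` helper are all hypothetical, invented for the illustration; this says nothing about how Copilot stores or produces code.

```python
import zlib

# Hypothetical "your code", including a license header.
source = (
    "# Copyright (c) Example Author. Licensed under GPL-3.0.\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "def scale(v, k):\n"
    "    return [x * k for x in v]\n"
)

# "Training": the compressed bytes play the role of "the model".
model = zlib.compress(source.encode("utf-8"))

# "Inference": a fixed sequence of math steps brings the text back verbatim.
restored = zlib.decompress(model).decode("utf-8")


def extract_function(text, name):
    """Return the text of function `name`, up to the next blank line."""
    start = text.index(f"def {name}(")
    end = text.find("\n\n", start)
    return text[start:] if end == -1 else text[start:end]


# Only "a function of interest" is emitted, without the copyright header.
snippet = extract_function(restored, "add")
print(snippet)
```

The round trip is lossless, and the emitted snippet is a verbatim part of the original with the header left behind, which is the scenario the replies below argue about.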

Aeolun · on Nov 4, 2022

Yeah, and that’s completely fine.

I’ve seen this point made before, but it assumes you use the entire input as output, which is silly.

varajelle · on Nov 4, 2022

Oh no, I'm not using the entire input, just a few functions of interest. And not the copyright headers of course.

BizarroLand · on Nov 3, 2022

With the exception that there are infinite types of chords in this case, and even though many musicians follow familiar chord structures the underlying melodies and rhythms are unique enough for any familiar person to be able to differentiate "Red Hot Chili Peppers" from "All-American Rejects", and now there is a system where All-American Rejects hit a few buttons and a song is generated (using audio samples of "Under the Bridge") that sounds like "Under the Bridge pt 2, All-American Rejects Boogaloo".

That's why it's actionable and why there is meat on the bone for this case. The real issue is going to be whether they can convince a jury that this software is just stealing code, and whether it's wrong if a robot does it.

williamcotton · on Nov 3, 2022

Actually the real issue is if Copilot can stand up to these legal doctrines:

https://en.wikipedia.org/wiki/Idea–expression_distinction

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...

2muchcoffeeman · on Nov 3, 2022

I was thinking of something similar as a counter argument and lo and behold, it’s a real thing maths has solved with a real implementation.

obiefernandez · on Nov 3, 2022

This analogy is flawed

andrewmcwatters · on Nov 3, 2022

This is demonstrably false. It is a system outputting character-for-character repository code.[1]

[1]: https://news.ycombinator.com/item?id=33457517

adriand · on Nov 3, 2022

If I use Photoshop to create an image that is identical to a registered trademark, is the rights violation my fault or Adobe’s fault?

xigoi · on Nov 3, 2022

Photoshop can't produce copyrighted images on its own.

metadat · on Nov 3, 2022

To play devil's advocate: Copilot can't reproduce copyrighted work without appropriate user input.

Just trying to demonstrate a point- this analogy seems flawed.

heavyset_go · on Nov 3, 2022

If I draw some eyes in Photoshop, it won't automatically draw the Mona Lisa around it for me.

metadat · on Nov 3, 2022

Until you sprinkle a bit of Stable Diffusion V2 or 3 on it, or perhaps some GAN.

The more I think about it, the more this all seems like another dimension of Jack and the Magic Beanstalk crossed with The Matrix.

WithinReason · on Nov 3, 2022

If you Google Mona Lisa the result is the Mona Lisa. If you query Copilot for a common piece of code you get that code.

heavyset_go · on Nov 4, 2022

Google doesn't sell its search feature as a product that you can just plagiarize the results from and they're yours. Microsoft does that with Copilot.

Copilot is as much of a search engine as Stable Diffusion or DALL-e are, which is to say they aren't at all. If you want to compare it to a search engine, despite it being a tortured metaphor, the most apt comparison is not to Google, but to The Pirate Bay if TPB stored all of their copyrighted content and served it up themselves.

WithinReason · on Nov 4, 2022

With Copilot it's your responsibility not to use it as a search engine to copy-paste code. It's completely obvious when it's being used as a search engine so it's not a problem at all.

Stable Diffusion works on completely different principles and can't exactly replicate pixels from its training data.

adriand · on Nov 4, 2022

So the problem you have with it is the UI?

kyruzic · on Nov 3, 2022

No, because that's not a trademark violation in any way. Using GPL code in a non-GPL project is a violation of copyright law though.

Aeolun · on Nov 3, 2022

Ok, cool. Presumably that is because it’s smart enough to know that there is only one (public) solution to the constraints you set (like asking it to reproduce licensed code).

Now, while you may be able to get it to reproduce one function, reproducing one file, and definitely the whole repository, seems extremely unlikely.

naikrovek · on Nov 3, 2022

[flagged]

xigoi · on Nov 3, 2022

Individual words can't be copyrighted.

pmarreck · on Nov 3, 2022

It can be modified to not do that (example: mutating the code to a "synonym" that is functionally but not visually identical).

It can also be modified to be opt-in only (only code whose authors permit it to be learned from can be used by the product).

falcolas · on Nov 3, 2022

Perhaps you are right, and it could be so modified.

Could be, but isn’t. And that matters.

ImPostingOnHN · on Nov 5, 2022

plagiarism with some words swapped is still plagiarism

Cort3z · on Nov 3, 2022

Just to be clear: I cannot prove that they have used my code, but for the sake of argument, let's assume so.

They would have directly used my code when they trained the thing. I see it as an equivalent of creating a zip-file. My code is not directly in the zip file either. Only by the act of un-zipping does it come back, which requires a sequence of math-steps.

Filligree · on Nov 4, 2022

But there is no equivalent of "unzipping" for Copilot.

This is a generative neural network. It doesn't contain a copy of your code; it contains weightings that were slightly adjusted by your code. Getting it to output a literal copy is only possible in two cases:

- If your code solves a problem that can only be solved in a single way, for a given coding style / quality level. The AI will usually produce the same result, given the same input, and it's going to be an attempt at a solution. This isn't copyright violation.

- If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?

account42 · on Nov 4, 2022

There is no guarantee that a ML network only produces the input data under those two conditions. But even for

> If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?

Replication is not a violation if the terms of the license are followed. Many open source projects are replicated hundreds of times with no license violation - that doesn't mean that you can now ignore the license.

But even if they did violate the license, that doesn't give you the right to do it too. There is no requirement to enforce copyright consistently - see e.g. mods for games which are more often than not redistributing copyrighted content and derivatives of it but usually don't run into trouble because they benefit the copyright owner. But try to make your own game based on that same content and the original publisher will not handle it in the same way as those mods. Same for OSS licenses: The original author does not lose any rights to sue you if they have ignored technical license violations by others when those uses are acceptable to the original author.

heavyset_go · on Nov 3, 2022

Neural nets can and do encode and compress the information they're trained on, and can regurgitate it given the right inputs. It is very likely that someone's code is in that neural net, encoded/compressed/however you want to look at it, which Copilot doesn't have a license to distribute.

You can easily see this happen, the regurgitation of training data, in an overfitted neural net.
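
The regurgitation effect can be shown with a toy stand-in. The sketch below is not a neural net and makes no claim about Copilot; it is a character-level n-gram model, with hypothetical names `train` and `generate`, deliberately "overfitted" by giving it a long context and a single training snippet. Under those assumptions, prompting it with the first few characters makes it emit the rest of its training data verbatim.

```python
from collections import Counter, defaultdict


def train(text, k):
    """Count which character follows each k-character context in `text`."""
    model = defaultdict(Counter)
    for i in range(len(text) - k):
        model[text[i:i + k]][text[i + k]] += 1
    return model


def generate(model, seed, max_len=200):
    """Greedily extend `seed`, always picking the most frequent next character."""
    k = len(seed)
    out = seed
    while len(out) < max_len:
        context = out[-k:]
        if context not in model:  # unseen context: the model has nothing to say
            break
        out += model[context].most_common(1)[0][0]
    return out


# Hypothetical training data: one small function.
training_code = "def add(a, b):\n    return a + b\n"
k = 8

model = train(training_code, k)
completion = generate(model, seed=training_code[:k])
print(completion == training_code)
```

The model holds only context-to-next-character counts, not a stored copy of the file, yet the output matches the training text character for character. With more varied training data and a shorter context, the same code produces blends rather than copies, which is the distinction the replies below go on to debate.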

CuriouslyC · on Nov 3, 2022

This is not necessarily true, the function space defined by the hidden layers might not contain an exact duplicate of the original training input for all (or even most) of the training inputs. Things that are very well represented in the training data probably have a point in the function space that is "lossy compression" level close to the original training image though, not so much in terms of fidelity as in changes to minor details.

heavyset_go · on Nov 3, 2022

When I say encoded or compressed, I do not mean verbatim copies. That can happen, but I wouldn't say it's likely for every piece of training data Copilot was trained on.

Pieces of that data are encoded/compressed/transformed, and given the right incantation, a neural net can put them together to produce a piece of code that is substantially the same as the code it was trained on. Obviously not for every piece of code it was trained on, but there's enough to see this effect in action.

naikrovek · on Nov 3, 2022

> which Copilot doesn't have a license to distribute

when you upload code to a public repository on github.com, you necessarily grant GitHub the right to host that code and serve it to other users. the methods used for serving are not specified. This is above and beyond the license specified by the license you choose for your own code.

you also necessarily grant other GitHub users the right to view this code, if the code is in a public repository.

eropple · on Nov 3, 2022

Host that code. Serve that code to other users. It does not grant the right to create derivative works of that code outside the purview of the code's license. That would be a non-starter in practice; see every repository with GPL code not written by the repository creator.

Whether the results of these programs are somehow Not A Derivative Work is the question at hand here, not "sharing". I think (and I hope) that the answer to that question won't go the way the AI folks want it to go; the amount of circumlocution needed to excuse that the not actually thinking and perceiving program is deriving data changes from its copyright-protected inputs is a tell that the folks pushing it know it's silly.

naikrovek · on Nov 3, 2022

copilot isn't creating derivative works: copilot users are.

the human at the keyboard is responsible for what goes into the source code being written.

to aid copilot users here, they are creating tools to give users more info about the code they are seeing: https://github.blog/2022-11-01-preview-referencing-public-co...

devmor · on Nov 3, 2022

Your argument is essentially the same as the argument that the pirate bay didn't infringe copyright, it only facilitated infringement.

And we all saw how well that went legally.

account42 · on Nov 4, 2022

Actually, Pirate Bay was even less of an infringement, as they did not distribute the copyrighted content or derivatives themselves, only indexed where it could be found. With Copilot, all the content you're getting goes through Microsoft.

AnnasVirtual · on Nov 4, 2022

that is not similar at all, that is not how machine learning works OMG

devmor · on Nov 5, 2022

Machine learning is not important to this line of argument. We are talking about the legal responsibility of a tool.

Filligree · on Nov 4, 2022

Pirate Bay couldn't be used to do anything but infringe copyright, practically. That's not true for Copilot.

devmor · on Nov 4, 2022

Nonsense. It tracked millions of legitimate torrents.

8note · on Nov 4, 2022

The page surrounding the code in the GitHub UI is a derivative work, isn't it?

It's an html file containing both the licensed code and some other html

BenjiWiebe · on Nov 4, 2022

It still has attribution.

vharuck · on Nov 4, 2022

The relevant part of GitHub's terms of service:

"4. License Grant to Us

We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program."

https://docs.github.com/en/site-policy/github-terms/github-t...

I don't think these terms allow using content for Copilot.

heavyset_go · on Nov 3, 2022

It's served under the terms of my licenses when viewed on GitHub. Both attribution and licenses are shared.

This is like saying GitHub is free to do whatever they want with copyrighted code that's uploaded to their servers, even use it for profit while violating its licenses. According to this logic, Microsoft can distribute software products based on GPL code to users without making the source available to them in violation of the terms of the GPL. Given that Linux is hosted on GitHub, this logic would say that Microsoft is free to base their next version of Windows on Linux without adhering to the GPL and making their source code available to users, which is clearly a violation of the GPL. Copilot doing the same is no different.

LtWorf · on Nov 4, 2022

Then GitHub should make sure that people only upload stuff they are the copyright owner of… which it has never done, warned about or tried to enforce.

vkou · on Nov 3, 2022

> It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.

So what? Why shouldn't we update the rules of copyright to catch up to advances in technology?

Prior to the invention of the printing press, we didn't have copyright law. Nobody could stop you from taking any book you liked, and paying a scribe to reproduce it, word for word, over and over again. You could then lend, gift, or sell those copies.

The printing press introduced nothing novel to this process! It simply increased the rate at which ink could be put to pages. And yet, in response to its invention, copyright law was created, that banned the most obvious and simple application of this new technology.

I think it's entirely reasonable for copyright law to be updated, to ban the most obvious and simple application of this new technology, both for generating images, and code.

civilized · on Nov 3, 2022

> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.

Completely incorrect. False dichotomy. It's widely known that AI can and does memorize things just like humans do. Memorization isn't a defense to violating copyright, and calling memorization "adjusting a generative model" doesn't make it stop being memorization.

If you memorized Microsoft's code in your brain while working there and exfiltrated it, the fact that it passed through your brain wouldn't be a defense. Substituting "generative model" for "brain" and the fact that it's a tool used by third parties doesn't change this.

moralestapia · on Nov 3, 2022

Whatever you say man :^)

https://twitter.com/docsparse/status/1581461734665367554

NicoleJO · on Nov 3, 2022

You're wrong. See exposed code. https://justoutsourcing.blogspot.com/2022/03/gpts-plagiarism...

lamontcg · on Nov 3, 2022

> but snippets of it cannot

Yeah they can, and the whole functions that Copilot spits out are quite obviously covered by copyright.

> especially when they are used in a different context.

That doesn't matter.

ouid · on Nov 3, 2022

it is essentially a weighted sum of your code and other copyright holders' code. Do not let the mystique of AI fool you. Copilot does not learn, it glues.

tevon · on Nov 3, 2022

I agree.

If I read JRR Tolkien and then go and write a fantasy novel following an unexpected hero on his dangerous quest to undo evil, I haven't infringed, even if I use some of Tolkien's better turns of phrase.

LtWorf · on Nov 4, 2022

Games aren't even allowed to use the word "hobbit" without paying royalties. I'm sure you have no idea what you're talking about.

Filligree · on Nov 4, 2022

Hmm. Are you sure that's true?