
We've reached the era of Trurl's Electronic Bard.

https://electricliterature.com/wp-content/uploads/2017/11/Tr...


I agree that interested parties are trying to steer the ship there. I just don't see the legal arguments that will get them there.

Given that images are transmitted to a person in a manner that doesn't violate copyright (and even if it does, the transmitter, not the receiver, is guilty of infringement), training an AI is not something that copyright law limits.

The AI weights that result are about the farthest thing from a derivative work, as the weights, as a separate object, don't seem to contain the slightest remnant of the original work.


I am also not a lawyer; I have some background and training in IP law as it pertains to engineering.

As far as I can tell, the image you describe and your example sentence are closer than you might think to each other. Mickey Mouse is a copyrighted character, and Disney could certainly claim infringement for both. Whether you have a fair use claim is down to the tenets of fair use, and whether they sue you is down to their estimation of how likely it is it'd be profitable for them to do so.

So what is fair use? https://www.law.cornell.edu/uscode/text/17/107

Put simply, you have to argue about it in court and decide on a case by case basis, but the factors are:

The nature of use, such as for profit vs. non-profit.

The nature of the copyrighted work. Your art might be considered literary criticism. How central to that message is Mickey Mouse?

The amount and substantiality of the copyrighted work appearing in your work. Mickey Mouse is the sole feature, so large.

How likely is it that your Mickey Mouse creation will serve as a substitute for people consuming normal Mickey Mouse content?


Aren't some versions of Mickey Mouse out of copyright now...


Steamboat Willie.


If you put it on the internet, someone can read it.


I'm a fan of decriminalization rather than legalization, mainly because I believe that people should be able to engage with it, but I don't want companies marketing it. Marketing in the US is too effective, and I don't want to load that gun with anything that makes people less healthy, because marketing will find a way to maximize consumption.


Wouldn't it make more sense to legalize it and ban advertising it?


You try writing a law that corporate America can't corrupt.


When are people going to get that this isn't a right folks have?

If your code is readable, the public can learn from it.

Copyright doesn't extend to function.


The public is not learning from it. A person or corporation is creating a derivative work of it. Training a model is deriving a function from the training data. It is not "a human learning something by reading it".


It's an extreme stretch to say that the model weights are a derivative work of the training data given the legal definition of "derivative work".


It's no more of a stretch than saying that re-encoding a PNG as a JPEG is a derivative work even though the process is lossy and the resulting bits look nothing alike.


I'm not sure you're being intellectually honest.

You think that a model that's capable of being prodded into producing an infringing output in addition to all the other non-infringing outputs it could produce is no different than a compression algorithm?


It is processed data at the end of the day. And no, it is not like human reading. You can't read the whole of GitHub.


That doesn't make it a derivative work.

If I "process data" by doing a word count of a book, and then I publish the number of words in that book (not the words themselves! Just a word count!) I haven't created a derivative work.

Processing data isn't automatically infringement.
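The word-count example above can be sketched in a few lines. This is a minimal illustration (the book text here is made up); the point is that the output is a single number containing none of the original expression.

```python
# "Processing" a copyrighted text down to a word count: the output
# (one integer) carries no expressive content from the original.

def word_count(text: str) -> int:
    """Return the number of whitespace-separated words in text."""
    return len(text.split())

book = "It was the best of times, it was the worst of times."
print(word_count(book))  # prints 12, a bare number, not a derivative work
```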


People aren't going to get it, because you don't get them.

People have the right to learn non-copyrightable elements from your code.

The claim is that AI learns copyrightable elements.


The comment chain you are replying to includes a request to not train an AI on one's code.

I agree it's certainly possible for AI to produce infringing output.

Nevertheless, people don't have the right to enforce a limitation on training.


And to give a concrete example, in my view it should be allowed to use any source code to train a model such that the model learns that code is bad or insecure or slow or otherwise undesirable. In other words, it should be allowed to train on anything as long as the model does NOT produce that training data verbatim.


Maybe you should update your view with 17 USC 106.

https://www.law.cornell.edu/uscode/text/17/106


What copyrightable elements of the original work persist in the model, if it is incapable of outputting them? I can derive a SHA-1 hash from a copyrighted image, and yet it would be absurd to call that a derivative work.
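The SHA-1 point above is easy to make concrete. The input bytes here are a stand-in (no real image is used); whatever the input, the digest is a fixed-size fingerprint from which none of the original expression can be recovered.

```python
# A SHA-1 digest is "derived" from a copyrighted file only in the
# loosest sense: a 160-bit fingerprint, irreversible by design.
import hashlib

def sha1_of(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

fingerprint = sha1_of(b"any copyrighted image bytes would go here")
print(fingerprint)  # 40 hex characters, regardless of input size
```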


Not in copyright. The work speaks for itself, and the function of code is not a copyrightable aspect.


The intent of the work can matter when determining if de minimis applies as well as fair use.


Part of my point is that fair use doesn't apply.

Training a model doesn't involve reproducing a copyrighted work, preparing a derivative work, distributing that work, or performing that work.

Fair use isn't required because none of the exclusive rights afforded by copyright apply.


You might not get your ass kicked. Copyright doesn't protect function, to the point where the court will assess the degree to which the style of the code can be separated from the function. In the event that they aren't separable, the code is not copyrightable.

https://www.wardandsmith.com/articles/supreme-court-announce...

https://easlerlaw.com/software-computer-code-copyrighted#:~:...


Software like Blackduck or Scanoss is designed to identify exactly that type of behaviour. It is used very often to scan closed source software and to check whether it contains snippets that are copied from open source with incompatible licenses (e.g. GPL).

To be able to do so, these tools build a syntax tree of your code snippet and compare the tree structure with similar trees in open source software, without being fooled by variable names. To speed up the search, they also compute a signature for these trees so that the signature can be more easily searched in their database of open source code.
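A minimal sketch of that technique, using Python's own ast module: parse two snippets, erase identifier names so renamed variables don't matter, and compare a hash of the remaining tree structure. Real tools like Black Duck or Scanoss are far more sophisticated; this only illustrates the idea.

```python
# Compute a structural signature of Python source: parse to an AST,
# strip variable/function names, then hash the dumped tree so that
# structurally identical code matches even after renaming.
import ast
import hashlib

def structural_signature(source: str) -> str:
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Erase identifiers so only the tree shape remains.
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.name = "_"
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

a = "def add(x, y):\n    return x + y\n"
b = "def plus(first, second):\n    return first + second\n"
print(structural_signature(a) == structural_signature(b))  # True: same shape
```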


And that's all well and good, but code that is asserted to be protected by the GPL still has to pass the abstraction-filtration-comparison test.

The plain fact is that you can claim copyright on plenty of stuff that isn't copyrightable.

Consider AI model weights at all: they're the result of an automatic process and contain no human expression; almost by definition, model weights shouldn't be copyrightable, but people are still releasing "open source" models with supposed licenses.


But there has to be a threshold. If a GPL project contains a function which takes two variables and returns x+y, and I have functionally identical code in a project I made with an incompatible license, it is obviously absurd to sue me.


You are right but there is no legally defined threshold so it's subjective.

As a matter of fact, the Eclipse Foundation requires every contributor to declare that every piece of code is their own original creation and is not a copy/paste from other projects, with the exception possibly of other Eclipse Foundation or Apache Foundation projects because their respective licenses allow that. Even code snippets from StackOverflow are formally forbidden.

If I am not mistaken, in the Oracle-Google trial over Java on Android, in the end Google's re-implementation of the Java API on Android was considered fair use, because Google kept the original "signatures" of the Java SDK API and rewrote most of the implementation, copying only "0.4% of the total Java source code," which was deemed minimal. [1] However, the trial came to this conclusion only after several iterations in court.

[1] https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....


You're right, there is. The threshold is whatever a court decides is "substantial similarity" in that particular case. But there's no way to know that ahead of time as the interpretation/decision is subjective.


The simple version is that code is copyrightable as an expression. And the underlying algorithm is patentable.

The legal term you're looking for here is the "Abstraction-Filtration-Comparison" test; What remains if you subtract all the non-copyrightable elements from a given piece of code.


Algorithms have become patentable only very recently in the history of patents, without a rationale being ever provided for this change, and in some countries they have never become patentable.

Even in the countries other than USA where algorithms have become patentable, that happened only due to USA blackmailing those countries into changing their laws "to protect (American) IP".

It is true however that there exist some quite old patents which in fact have patented algorithms, but those were disguised as patents for some machines executing those algorithms, in order to satisfy the existing laws.


Doesn't really matter, the point is that they're patentable. They clearly shouldn't be IMO, but they are.


US copyright does protect for "substantial similarity" [0]. And at the other end of the spectrum, this has been abused in absurd ways to argue that substantially different code has infringed.

In Zenimax v. Oculus they basically argued that a bunch of really abstract yet entirely generic parts of the code were shared: we are talking nested for loops and certain combinations of if statements. Due to the courtroom's lack of a qualitative understanding of code, syntax, common patterns, and what might actually qualify as substantively novel code, this was accepted as infringing. [1]

Point is, the legal system is highly selective when it comes to corporate interests.

[0] https://en.wikipedia.org/wiki/Substantial_similarity

[1] https://arstechnica.com/gaming/2017/02/doom-co-creator-defen...


> US copyright does protect for "substantial similarity"

Substantial similarity refers to three different legal analyses for comparing works. In each case what the analysis is attempting to achieve is different, but in no case does it operate to prohibit similarity, per se.

The Wikipedia page points out two meanings. The first is a rule for establishing provenance. Copyright protects originality, not novelty. The difference is that if two people coincidentally create identical works, one after another, the second-in-time creator has not violated any right of the first. (Contrast with patents, which do protect novelty.) In this context, substantial similarity is a way to help establish a rebuttable presumption that the latter work is not original, but inspired by the former; it's a form of circumstantial evidence. Normally a defendant wouldn't admit outright they were knowingly inspired by another work, though they might admit this if their defense focuses on the second meaning, below. The plaintiff would also need to provide evidence of access or exposure to the earlier work to establish provenance; similarity alone isn't sufficient.

The second meaning relates to the fact that a work is composed of multiple forms and layers of expression. Not all are copyrightable, and the aggregate of copyrightable elements needs to surpass a minimum threshold of content. Substantial similarity here means a plaintiff needs to establish that there are enough copyrightable elements in common. Two works might be near identical, but not be substantially similar if they look identical merely because they're primarily composed of the same non-copyrightable expressions, regardless of provenance.

There's a third meaning, IIRC, referring to a standard for showing similarity at the pleadings stage. This often involves a superficial analysis of apparent similarity between works, but it's just a procedural rule for shutting down spurious claims as quickly as possible.


> Point is, the legal system is highly selective when it comes to corporate interests.

I don't even think it's that. In recent cases like Oracle v. Google and Corellium v. Apple, Fair Use prevailed with all sorts of conflicting corporate interests at play. The Zenimax v. Oculus case very much revolved around NDAs that Carmack had signed and not the propagation of trade secrets. Where IP is strictly the only thing being concerned, the literal interpretation of Fair Use does still seem to exist.

Or for a more plain example, Authors Guild v. Google, where Google defended its indexing of thousands of copyrighted books as fair use.


In fact, I'd go so far as to argue your example of Authors Guild v. Google is a good indication that most cases will probably go an AI platform's way. It's a pretty parallel case to a number of the arguments. Indexing required ingesting whole works of copyrighted material verbatim. It utilized that ingested data to produce a new commercial work consisting of output derived from that data. If I remember the case correctly, Google even displayed snippets when matching a search so the searcher could see the match in context, reproducing the works verbatim for those snippets, and one could presume (though I don't recall if it was coded against) that with sufficiently clever search prompts, someone could get the index search to reproduce a substantial portion of a work.

Arguably, the AI platforms have an even stronger case as their nominal goal is not to have their systems reproduce any part of the works verbatim.


> In fact, I'd go so far as to argue your example of Authors Guild v. Google is a good indication that most cases will probably go an AI platform's way.

The more recent Warhol decision argues quite strongly in the opposite direction. It fronts market impact as the central factor in fair use analysis, explicitly saying that whether or not a use is transformative is in decent part dependent on the degree to which it replaces the original. So if you're writing a generative AI tool that will generate stock photos, and that trained by scraping stock photo databases... I mean, the fair use analysis need consist of nothing more than that sentence to conclude that the use is totally not fair; none of the factors weigh in favor of it.


I think that decision is much narrower than "market impact". It's specifically about substitution, and to that end, I don't see a good argument that Co-Pilot substitutes for any of the works it was trained on. No one is buying a license to co-pilot to replace buying a license to Photoshop, or GIMP, or Linux, or Tux Racer. Nor is Github selling co-pilot for that use.

To the extent that a user of co-pilot could induce it to produce enough of a copyrighted work to both infringe on the content (remember that algorithms are not protected by copyright) and substitute for the original by being licensed in lieu of it, I would expect the courts to examine that in the way they currently view a xerox machine being used to create copies of a book. While the machine might have enabled the infringement, it is the person using the machine to produce and then distribute copies that is doing the infringing, not the xerox machine itself nor Xerox the company.

Specifically in the opinion the court says:

> If an original work and a secondary use share the same or highly similar purposes, and the secondary use is of a commercial nature, the first factor is likely to weigh against fair use, absent some other justification for copying.

I find it difficult to come up with a good case that any given work used to train co-pilot and co-pilot itself share "the same or highly similar purposes". Even in the case of say someone having a code generator that was used in training of co-pilot, I think the courts would also be looking at the degree to which co-pilot is dependent on that program. I don't know off hand if there are any court cases challenging the use of copyright works in a large collage of work (like say a portrait of a person made from Time Magazine covers of portraits), but again my expectation here is that the court would find that while the entire work (that is the magazine cover) was used and reproduced, that reproduction is a tiny fraction of the secondary work and not substantial to its purpose.

Similarly we have this line:

> Whether the purpose and character of a use weighs in favor of fair use is, instead, an objective inquiry into what use was made, i.e., what the user does with the original work.

Which I think supports my comparison to the xerox machine. If the plaintiffs against Co-Pilot could have shown that a substantial majority of users and uses of Co-Pilot were producing infringing works or producing works that substitute for the training material, they might prevail in an argument that Co-Pilot is infringing regardless of the intent of GitHub. But I suspect even that hurdle would be pretty hard to clear.


Of the various recent uses of generative AI, Copilot is probably the one most likely to be found fair use and image generation the least likely.

But in any case, Authors Guild is not the final word on the subject, and anyone trying to argue for (or against) fair use for generative AI who ignores Warhol is going to have a bad day in court. The way I see it, Authors Guild says that if you are thoughtful about how you design your product, and talk to your lawyers early and continuously about how to ensure your use is fair and will be seen as fair in the courts, you can indeed do a lot of copying and still be fair use.


I agree. Nothing is going to be the final word until more of these cases are heard. But I still don't think Warhol is as strong even against other uses of generative AI, and in fact I think in some ways argues in their favor. The court in Warhol specifically rejects the idea that the AWF usage is sufficiently transformed by the nature of the secondary work being recognizably a Warhol. I think that would work the other way around too, that a work being significantly in a given style is not sufficient for infringement. While certainly someone might buy a license to say, Stable Diffusion and attempt to generate a Warhol style image, someone might also buy some paints and a book of Warhol images to study and produce the same thing. Provided the produced images are not actually infringements or transformations of identifiably original Warhol works, even if they are in his style, I think there's a good argument to be made that the use and the tool are non-infringing.

Or put differently, if the Warhol image had used Goldsmith's image as a reference for a silk screen portrait of Steve Tyler, I'm not sure the case would have gone the same way. Warhol's image is obviously and directly derived from Goldsmith's image and found infringing when licensed to magazines, yet if Warhol had instead gone out and taken black and white portraits of prince, even in Goldsmith's style after having seen it, would it have been infringing? I think the closest case we have to that would have been the suit between Huey Lewis and Ray Parker Jr. over "I Want a New Drug"/"Ghostbusters" but that was settled without a judgement.

I do agree that Warhol is a stronger argument against artistic AI models, but it would very much have to depend on the specifics of the case. The AWF usage here was found to be infringing, with no judgement made of the creation and usage of the work in general, but specifically with regard to licensing the work to the magazine. They point out the opposite case that his Campbell paintings are well established as non-infringing in general, but that the use of them licensed as logos for soup makers might well be. So as is the issue with most lawsuits (and why I think AI models in general will win the day), the devil is in the details.


A key finding by the judge in the Authors Guild v. Google case was that the authors benefited from the tool that Google created. A search tool is not a replacement for a book, and is much more likely to generate awareness of the book, which in turn should increase sales for the author.

AI platforms that replace and directly compete with authors cannot use the same argument. If anything, those suing AI platforms are more likely to bring up Authors Guild v. Google as a guiding case to determine when to apply fair use.


Copyright is abused often. Our modern version of copyright is BS and only benefits large corps who buy a lot of IP.


Yep. Now it is a legal cudgel wielded most effectively by corporate giants. It has mutated to become completely philosophically opposed to what it was expressly created to protect.


If I were to license a cover of a song for a music video, I'd have to license both the original song and the cover itself.

I'd say this is extremely relevant in this case.


if that is the case why do people ever license covers?

to clarify - I thought you just had to negotiate with the cover artist about rights and pay a nominal fee for usage of the song for cover purposes - that is to say you do not negotiate with the original artist, you negotiate with a cover artist and the whole process is cheaper?


You're maybe thinking about this in a way that's not helping you to understand the system and why it works the way it does. It's very clear when you think of a specific case.

Say you want to make a recording of "Valerie" by the Zutons. You need permission (a license) from the songwriters (the Zutons presumably) to do this. You usually get this permission by paying a fee. Having done that, you can do your recording. Whenever that recording is played (or used) you will get a performance royalty and they will get a songwriting royalty.

Say you want to use a cover of "Valerie" by the Zutons in your film or whatever. Say the Mark Ronson version featuring Amy Winehouse. You need permission (a license) from the person who produced that version (Mark Ronson or his company) and will need to pay them a fee, some of which goes to the songwriter as part of their deal with Mark Ronson which gave him the license to produce his cover in the first place.

The Zutons don't have the right to sell you a license to Mark Ronson's version so if that's the version you want you have to negotiate with him. Likewise he doesn't have the right to sell you a license like the license he has (ie a license to do a recording/performance) so if you want that you have to negotiate with them.


OK, that seems to be exactly what I thought and described, and the opposite of what the parent poster described. The parent poster said that if you want to use the cover of a song you need to negotiate with both the people who did the cover and the original rights owner.

The closest I could get to a situation like that would be if I told Band B do a cover of Song A for my movie and I paid the licensing costs as part of my deal with Band B, but still not the same as the parent poster's description.


Cover songs have a special and explicit law covering them. Not relevant.


While correct, the example given is that they COPY the code, then make adjustments to hide the fact. I suspect this is still a copyright violation. It’s interesting that a judge sees it differently when it’s just run through a program. I’m not a legal expert, so I’m guessing it’s a bit more complex than the headline?


Ok, I read the article, and it looks like the issue is the DMCA specifically, which requires the code to be more identical than is presented. I’m guessing separate claims could still come from other copyright laws?


No copy-paste was explicitly used. They compressed it into a latent space and recreated from memory, perhaps with a dash of "creativity" for flavor. Hypothetically, of course.

The distinction is pedantic but important, IMHO. AI doesn't explicitly copy either.


But isn’t that the same as memorising it and rewriting the implementation from memory? I’m sure “it wasn’t an exact reproduction” is not much of a defence.


I sure think so. I also think that (to first order) this is exactly what modern AI products do. Is a lossy copy still a copy?


I would have thought so but I’m not a lawyer. The article suggests DMCA is intended for direct copies so that’s why it failed here. Maybe more general copyright laws would apply for lossy copies.


When you write something on the internet, you automatically obtain a copyright on it.

Copyright provides the exclusive rights to reproduce, adapt, publish, perform, and display that thing.

Training an AI model isn't any of those things.

If you transmit a thing to me, and I have those bits on my computer, you don't get to determine that I can't train an AI on it, unless we signed an agreement further restricting my use prior to you transmitting it to me.

Now. My AI might produce a work that is sufficiently similar to your work that it is considered a reproduction or adaptation, but that doesn't mean that the training was an infringement.

Also, courts have repeatedly held that webscraping is entirely legal.

If you don't want folks (or their computers) learning from things you create, don't put them on the internet.

NOW for the hilarious follow-on: Copyright is not granted for the results of an automatic process. Training an AI is an automatic process, and it's plausible that attempting to claim copyright on model weights would fail if it were litigated fully. It's more likely they'd qualify for trade secret protection.


Per-app vpn settings when?

