"But Defendants’ LLMs endanger fiction writers’ ability to make a living, in that the LLMs allow anyone to generate—automatically and freely (or very cheaply)—texts that they would otherwise pay writers to create"
This kind of luddism sees copyright as a way to enrich rights holders, as opposed to "promoting the progress of science and the useful arts".
It appears this lawsuit is complaining that ChatGPT can write fan fiction and they don't like that.
I was on board initially, thinking we were talking about OpenAI ingesting Game of Thrones as training material, but it appears George et al. are just mad because it can make stories with their characters.
This is far from the authorship/copyright problem of AI.
If you read the claims for relief (starting on page 44 of the complaint), it's mostly just standard copyright infringement during training. The claims about ChatGPT writing works that infringe theirs seem to me to be an attempt to head off a fair-use defense: one of the tests for fair use is the effect on the potential market for the original work.
Theft is the word used by their lawyers. Seems fair to use in the title. The difference between theft and copyright infringement isn't important in this case anyway.
Either way, it will be interesting to see how this goes. There are weird arguments on both sides, so the rulings could go in any number of directions.
Here's a point that I struggle with. Let's imagine a point in the future where technology has progressed to the point that a machine can become "assisted memory" for a human. This could be useful for degraded memory conditions or even just to buff up human capabilities. In this scenario how do we deal with licensing and copyright? The "memory" is trained on books, artwork, etc and then human intelligence accesses that computer aided memory and constructs something new.
Seems like this lawsuit could set a precedent that the future I describe would not be allowed.
I think it depends on the use. The kind of AI you're suggesting wouldn't need to ingest the entirety of every book and movie you've seen; it would just need to store a summary and a location for the original file.
It's legal for me to make digital copies of copyrighted works I own so long as they're not redistributed. If I have a local AI maintaining my personal data for reasons other than generating competing works, that shouldn't involve copyright at all.
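To make that concrete, here's a minimal sketch of the kind of summary-plus-pointer store I mean (all names here are hypothetical, just to illustrate the idea):

```python
# Hypothetical sketch: a personal "assisted memory" that stores only
# summaries and pointers to originals, not full copyrighted text.
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    summary: str        # short description in the system's own words
    source_path: str    # where the full original lives on the user's disk

class AssistedMemory:
    def __init__(self):
        self.entries: list[MemoryEntry] = []

    def remember(self, summary: str, source_path: str) -> None:
        self.entries.append(MemoryEntry(summary, source_path))

    def recall(self, query: str) -> list[MemoryEntry]:
        # Naive keyword match; a real system might use embeddings.
        words = set(query.lower().split())
        return [e for e in self.entries
                if words & set(e.summary.lower().split())]

memory = AssistedMemory()
memory.remember("Epic fantasy about warring noble houses", "~/books/agot.epub")
print(memory.recall("fantasy houses"))
```

The point is that the "memory" itself never redistributes the work; it only points back at the user's own legal copy.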
I think the question boils down to what is fair use in these situations. I actually think it would be useful to have a public effort to build a pan-humanity model that claims fair use for public benefit and consumes all the productions of humanity.
What specifically hurts OpenAI is the monetization and commercial benefit from other people’s work. They also fall clearly within the space of civil claims of copyright violation, if not criminal (though the use of models to produce copyrighted material is likely criminal infringement). (IANAL, but a law professor friend made these claims to me, YMMV)
I think this is all mostly huff and puff and frankly going to be irrelevant. At some point anyone with a decent enough home computer will be able to train their own models and use them as they see fit. Any law that tries to stop that is going to be stifling and unwieldy in the extreme.
And it'll be cross border, so even more difficult to enforce.
How is it different from a search index? It takes existing content as input, processes it, and then outputs data structures from it. Those data structures are then used to power full-text search.
An LLM does the same thing, but instead of search results it emits a stream of tokens.
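To make the analogy concrete, here's a toy inverted index, a sketch of the kind of derived data structure a full-text search engine stores instead of a readable copy:

```python
# Toy inverted index: ingests text, stores only derived data
# structures (term -> document postings), not readable copies.
from collections import defaultdict

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    index: defaultdict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

docs = {
    "doc1": "Winter is coming to the North",
    "doc2": "The North remembers",
}
index = build_index(docs)
print(index["north"])   # {'doc1', 'doc2'}
```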
DMCA 2024 - A nice big report button for when generated content is too close to copyrighted content. It is then on the AI company to supplement the training materials around that content, to dilute the generation of content that could be seen as infringing. So instead of George RR Martin prequels with the same names and characters (because of a lack of training materials), it generates something more generic for the input prompt.
The actual complaint is about using their copyrighted works in the training of the LLM without a license. OpenAI is claiming it's fair use, the authors disagree. It's going to take a ruling from a judge to get clarity on the issue, and no matter what it'll be appealed until it hits the SC.
That's what discovery will be for; the complaint alleges that the likely source was LibGen. Most of these authors haven't released DRM-free ebooks, and it seems unlikely that OpenAI has a large-scale book-scanning effort (and even if it did, the authors would likely claim that to be infringement itself).
What if it never accessed the book, but read everything relevant like episode summaries, fan wikis, and forum discussions? It would still be as conversant. Is it still infringement?
Ideas have never been within the scope of copyright, and covering them was never part of its democratic mandate. If creatives want that changed, fine: advocate for a change in the law.
This isn't about ideas, it's about a specific individual's work, given that the reproduced text lifts literal characters out of Martin's books. That has always been covered by IP law. Canonical example: you cannot write a novel about Harry Potter, but you can write a book about a wizard going to a magical school.
If a model generates large amounts of text that is very close to something you've written, because there isn't much else like it, how is that "inspired"? It needs more dilution.
We would have to change the law to allow the kind of ‘inspiration’ you are talking about, which is why there are multiple lawsuits here. That’s what OpenAI is asking for: a redefinition of ‘fair use’. NNs aren’t copying ideas; they train on what copyright calls ‘fixation’, dealing with text, audio, and pixels, not ideas. We keep hoping and looking for understanding in the NNs, but we have ample evidence that they don’t actually understand much, if anything; they are just really good at copying in a way that makes understanding seem plausible to the layperson.
It’s a good idea to make this easier to report, but… shouldn’t it be on the AI company to train on legally acquired content in the first place? It’d be great if the training data were opt-in and curated. Wouldn’t that be better than a shoot-first, ask-questions-later policy? There’s definitely room to improve copyright and room to allow AI to exist, but do we really want to allow AI to ingest all copyrighted material and call it ‘fair use’? That would give them a ridiculous and unprecedented amount of freedom to take any and all content, turn around, and auto-generate enough to obsolete the people who made the training material. It seems like the race is on to supplant Google as the portal for information, and downloading everything in the world and then crying fair use after the fact feels like wishful thinking that more or less admits to copyright violation.
>shouldn’t it be on the AI company to train using legally acquired content in the first place
I don't think so. It's not illegal to look at or learn from copyrighted materials. If you start producing the materials it becomes a different question. I think the same applies to AI.
Your argument doesn’t work because OpenAI has admitted that ChatGPT is producing copyrighted material. They’re trying to carve out an exception for AI, but have already acknowledged that training copies the materials, literally, and that it does not “learn” from them the same way humans do. The intent with AI may be to remix them, but the whole reason there are multiple lawsuits here (as well as with Stable Diffusion and other NNs) is that they have repeatedly demonstrated they sometimes memorize the training data and can reproduce it more or less verbatim. They have violated current copyright law. In that light, we have two primary options: change the law, or enforce the current law. OpenAI is hoping to change the law, but whether they have copied some training data and produced it as output is not even up for debate; this is already the “different question” you referred to.
Or disagreement anyway, about how comparable photocopiers & copyright are to generative models and protection from unauthorized automated style reproduction.
How I look at it:
1. In both cases, reproduced copies or reproduced styles, automation destroys economic incentives for creators to make any sustained effort.
Without economic protection, it isn’t even a question of less motivation. Creators, like everyone else, need to eat.
2. So we protect creative works from complete copies in order to have more creative works.
And it is primarily about automation and mass reproduction.
Nobody is worried about people hand copying Atlas Shrugged.
3. But we also protect copyrighted works from partial copying.
Only copying chapters 1-3? Not allowed.
Only copying the plot but changing all the names, locations, fashion and colors? Not allowed.
4. So now it turns out a different substantial part of a work can be copied via automation. It’s style.
Well, if you can protect a work’s plot from automated copies, why not a work’s style?
It is a substantial piece of a creative work.
Reasons for protecting style come down to protecting any major part of a copyrighted work.
The only thing different now is we have “style reproducers”.
So we have to decide, is this essentially the same situation as copyright addresses, or not?
5. It is.
The exact same trade-offs between protection and incentivization exist for extracted and mass-reproduced style as they do for extracted and reproduced plot.
How many books have taken some of their style or concepts from other books?
Stranger Things borrows liberally from Stephen King, with Spielberg elements, not outright but in spirit and tone. Why isn't Stephen King suing the Duffer brothers for reading his shit and coming up with ideas based on that?
One is that basic plots are copied all the time and there’s a meme that there are only seven basic plots. Of course there’s much more variety at the detail level.
Was Sword of Shannara pretty derivative of Tolkien? Yeah. But I assume it was pretty far from a copyright violation.
So, 7 basic plots. But an actual plot for an original story isn’t just a basic plot, is it? It’s an original work.
Movies are sued all the time for copyright infringement due to substantially copying plot and character elements. [0]
Because these cases tend to each be unique, the line between infringement and non-infringement gets settled very much on a case by case basis.
As a result of this inherent unpredictability, most cases involve the accused settling with the aggrieved party to get the lawsuit dismissed.
This is common in many areas of civil law.
A few examples:
1. *"The Island" (2005)*
- Accusation: Similarities to the 1979 film "Parts: The Clonus Horror."
- Outcome: Settled out of court. [1]
2. *"Frozen" (2013)*
- Accusation: Claimed similarities to a short film named "The Snowman."
- Outcome: Disney settled the case. [2]
3. *"Coming to America" (1988)*
- Accusation: Art Buchwald claimed the movie was based on his script.
- Outcome: Paramount settled for an undisclosed amount. [3]
4. *"The Terminator" (1984)*
- Accusation: Harlan Ellison claimed it was similar to an episode of "The Outer Limits."
- Outcome: Settled out of court, and an acknowledgment was added to later copies. [4]
5. *"Disturbia" (2007)*
- Accusation: Accused of being similar to Alfred Hitchcock's "Rear Window."
- Outcome: Initially dismissed, but a settlement was reached. [5]
Authors currently don't control who or what reads their works, of course.
Personally I currently feel that (at life +70 years) the copyright pendulum has gone too far towards the rights of publishers (not necessarily authors) as is.
That said, I'm open to good arguments to change my mind. Why do you feel that authors should be given this additional right to control what is used for AI training? What would be the public good or public trade-off here?
And they don’t control fair use or promulgating the ideas in the book. That Wikipedia article summarizing the key contents in an editor’s own words? Perfectly legit.
What copyright buys is that no one else can distribute verbatim copies of large amounts of your work. But a lot of other uses are allowed.
It covers the expression of ideas, which in the case of a book is mostly the text as written. And, yes, doing some substitution of character names etc. may still violate copyright, but you certainly can’t keep me from writing an article about the main points you make in your book.
> What would be the public good or public trade-off here?
Consider the aesthetic landscape where creators do not have control over whether their work is used to train an AI versus one where they do. It's hard to predict with certainty, but my model is this:
No control: Anyone's work is fair game to be trained on. If I want to make a prompt of "A graphic novel in the visual style of Moebius, written by Stephen King, set in Westeros", I can get something based on King's and Martin's actual words and Moebius' actual drawings, without compensating them. Neat! However, potential new novelists see that quality novels can just be churned out for free or at low cost, and so actually sitting down to write a new novel becomes a niche, geeky thing to do. There's no money in it. These new novels just get thrown into the ML bin, fodder for the next version.
With control: Novelists and other creators know they can make money from their work because they can make business decisions about how and when their work trains a model. We all get to see more new, professional-quality creativity. Those who want to read Conan as written by Lord Dunsany can still see that, since those works are in the public domain.
Training an AI feels inherently commercial, with intent to commercially distribute, and that sort of use is licensed specially in most domains.
In a sense it's like driving down the highway with a duffel bag of cannabis flower in a state where possessing and traveling with a few ounces is no problem: something commercial is probably happening. Why is that prohibited? Perhaps another debate, but I'm just trying to connect the implied-intent aspect.
If an AI were being trained for strictly academic reasons then I'd agree with fair use and that type of arguments. But if the AI itself has a subscription fee, then whoever is subscribing is also probably using the work for real or anticipated commercial gain. Hence investing money.
True, hobbyists spend money with no intention of gaining commercially, and we may do that at a higher-than-average rate as tech workers because we usually have a decent amount of excess money from our work. But money is pretty scarce for most people and for businesses with set non-investment budgets, so if they're spending it on AI there's little doubt it's with commercial intent.
So in conclusion, I do think there's both merit to the authors' case related to intent to commercialize and room for doing unlicensed non-commercial AI training.
What is inevitable is that copyright is dead. Do you think China will respect western copyrights? They already don't. People would just use LLMs hosted elsewhere.
This is a good thing. We're going to see an explosion of indie games, movies, and more that never could've been made before by a single, dedicated person.
I also tested this with Harry Potter books (which are still not public domain AFAIK). It starts auto-completing the correct words quickly, and then hangs and eventually stops producing output. You can call the API to generate more tokens, but it stops again after producing a few more correct words.
I think for a few high-profile authors (for books like Harry Potter and Game of Thrones), OpenAI probably installed some output filters in order to not get sued too hard. Of course, I can't definitively check that without access to the raw model. Which OpenAI conveniently doesn't provide.
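Roughly, the probing loop looks like this. This is a sketch assuming the completions endpoint and the gpt-3.5-turbo-instruct model; substitute whatever model and prompt you're testing, since I can't verify which models exhibit the hanging behavior:

```python
# Sketch: repeatedly ask for more tokens after each premature stop.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
text = "<first 13 words of the book go here>"  # placeholder prompt

for _ in range(10):
    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=text,
        max_tokens=32,
        temperature=0,  # greedy decoding makes memorization visible
    )
    chunk = resp.choices[0].text
    if not chunk.strip():
        break            # the model stopped producing output
    text += chunk

print(text)
```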
The complaint mentions LibGen, Z-Library and Bibliotik, and Sci-Hub in a footnote.
Thought experiment:
What if every person who downloads materials from the above sources claimed that they were doing so only to "train AI"?
Many such persons who download from those sources are probably doing so for noncommercial purposes, for example academic research. Whereas, according to this complaint, OpenAI "intend[s] to earn billions from this technology."
Not disagreeing with your end conclusion, but surely the concept of a limited company exists exactly to have a distinction between legal entities, some of which are not humans but may still violate laws. Take it this way: if the EU fines a company for GDPR violations, it doesn't really fine an individual. Perhaps no individual broke the law explicitly, but as a collective the end result is a law violation.
Technically yes, but how that is handled is up to the country. In the US, a concept known as "corporate personhood" exists. A strange concept, because if a company murders someone, neither the company nor any of its executives goes to prison.
GRR Martin, the author of Game of Thrones, had the audacity to join this lawsuit. The only thing I expect from AI in this context is NOT to reproduce the shitshow GOT ended up being.
I'm wondering what the authors will do if we develop AIs that are able to find new artistic styles that are not in the dataset. Would it still pose a problem to use their content to learn how NOT to imitate them?
Seems it is possible in collaborative filtering:
> Yes, in collaborative filtering, finding empty classes is possible. To recommend items for these gaps, utilize adjacent class information or employ techniques like matrix factorization, content-based filtering, or hybrid systems. These methods predict preferences based on observed patterns, similarities between items, and user preferences, filling in missing data.
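For the curious, here is a minimal matrix-factorization sketch (toy data and illustrative names) of how observed ratings can fill in the empty cells it mentions:

```python
# Minimal matrix factorization: predict missing user-item ratings
# from observed ones via stochastic gradient descent.
import numpy as np

R = np.array([            # 0 marks a missing rating
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items, k = R.shape[0], R.shape[1], 2
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors

lr, reg = 0.01, 0.02
for _ in range(5000):
    for u, i in zip(*R.nonzero()):             # only observed entries
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(np.round(P @ Q.T, 1))  # predictions, including the empty cells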
People misleadingly conflate these two concepts all the time, but no, human learning and machine learning are not equivalent.
If you want to learn to draw in the style of Moebius then go for it. If you want to train an AI on Moebius' work, you should ask permission from Moebius' estate, since they stand to lose revenue by your model.
What if I only train on licensed human art "in the style of Moebius" but legally distinct? You can't copyright a "style". A lot of critics seem to be arguing for a conception of copyright that's far broader than what actually exists, as a subsidy for artists (by which everyone in effect means Disney et al).
You're right, because clearly anyone who takes an AI-generated Moebius-style image and puts it on a blog post about message queues or a recipe or whatever would have just bought a licensed Moebius image otherwise.
I'm replying to "you should ask permission from Moebius' estate, since they stand to lose revenue by your model" - there are no damages. Whether it's infringement or not is not really relevant to that point.
Why do we think that the moebius estate deserves that revenue? Why is that the optimal state of the world, rather than just a capitalist incentive to create more art?
Your issue is with capitalism apparently. Like it or not, creators rely on copyright to make a living and have an incentive to create more, and quality, art.
Copyright protects the work, not the thoughtspace. Unless the LLM recreates their work in a form similar enough to be legally defined as the copyrighted material, there is no copyright issue, period.
Setting a legal precedent where copyright spreads like a fungus and encompasses any thought related to the copyrighted work seems like a horrible idea, one that can only lead to a dystopian future.
> Copyright protects the work, not the thoughtspace.
I think that is a good argument, but it conflates is and ought, and I have two counters:
copyright owners can dictate how their work is used (with some exceptions), and if that use hurts the copyright owner, the owner should have the right to forbid it.
the intent of copyright is to reward and encourage creators and creativity. If a script kiddie can just train a model and duplicate the hard-won aesthetic work of Molly Crabapple or Ralph Steadman or anyone at all, and either dilute the value of it or actually profit from it, what is the incentive for creators to create new work at all?
> copyright owners can dictate how their work is used (with some exceptions), and if that use hurts the copyright owner, the owner should have the right to forbid it.
Consider a poet who publishes poetry in some unique meter, or has some other unique stylistic structure for which they are well known... should they be allowed to sell copies of their poems that can be used for reading only, but prevents usage of those stylistic devices by other authors?
I'm going to assume that we agree the answer is "no, the author should not be able to prevent those uses" at least for human consumers of their works. This is how art has always developed... even though that use "hurts the copyright owner" by diluting the market for works with that style, the owner does NOT have the right to forbid it.
Now, let's say that same poet drew a lot of inspiration from a bunch of out-of-copyright poets. Let's also say that I train an AI model on the poet's inspirations, but NOT on the poet's work directly. Then I ask the AI to write a poem in the style of the poet's inspirations, and to include the unique stylistic device for which that poet is famous. In your world, is this OK?
> Consider a poet who publishes poetry in some unique meter, or has some other unique stylistic structure for which they are well known... should they be allowed to sell copies of their poems that can be used for reading only, but prevents usage of those stylistic devices by other authors?
I don't think this is a fair analogy. Unfortunately, the analogy breaks down because the technology is unprecedented. So, to answer directly: no, the poet cannot copyright the unique meter, but no, machine learning is not that either.
If you need an analogy, think copy machine, not human learning. An LLM can only regurgitate that which it has seen before. Absent the poet, the other poets can still make other poetry, but the LLM literally cannot make poetry that it has not seen before. If it produces a poem with that unique meter then it definitely copied that poet, and was not "inspired by" the poetry. If you wrote poetry inspired by EE Cummings your process for doing that would be very different from an LLM's, which would programmatically use his material.
What about the second part of my post, where the LLM has NOT been trained on the specific meter, but it does have some "concept" (maybe not the right word, but bear with me) of what meter is, so the human prompter can say "write a poem about subject S, with meter M" and get something in the style of that poet, without having been trained on it... sounds like you're OK with that scenario?
Full disclosure: I think I probably disagree with you on some points you've made in this thread, but I'm not going for any gotchas right now, I am just trying to map the contours of what you think is OK and not OK. We're all sort of flying blind on this stuff, so getting a sense of what others are thinking is really important in my mind. Appreciate the engagement.
I think you're coming from a fundamental place of misunderstanding. LLMs don't just regurgitate what they've seen before. After you understand how they work, I think the rest will become clear to you.
This is argued by the people who say it's akin to human learning, but who then turn around and say we don't know enough about human learning. It's an utterly fallacious argument.
> copyright owners can dictate how their work is used (with some exceptions), and if that use hurts the copyright owner, the owner should have the right to forbid it.
Like if I read their books and then write better ones in their style, undermining their profits?
Copyright doesn't let you stop Nazis (or insert an objectionable set of people of your choice here) from reading your books or seeing your art if they obtain a legal copy. You can't just say "I forbid it!". Why should we create a new restriction on our freedoms to allow for that?
Copyright absolutely allows you to control who you sell your work to, who you license it to, and who they can sublicense it to. Copyright allows an author to control who displays, reproduces, performs, etc. their works. The misunderstandings of copyright I've seen here are at times shocking in how incorrect they are, but they seem to be consistent with a lot of the positions taken by the posters who express them. For example, someone upthread said that copyright only protects against verbatim copying!
My issue is not with capitalism but with assuming the present rules of asset ownership are optimal! If we could make housing free - conjure it out of thin air - it would be really bad for landlords. They rely on that income to make a living!
We should still obviously do it. More of a thing that people want is usually good.
Conjuring things out of thin air also tends to have side-effects, and it's better not to stop at the first-order effect of an action before going ahead and "just doing it". Concretely with content generation: if the disregard for copyright leads to a world where people no longer make the effort to produce and think about new things, the only things that you will consume will be produced by AI. Reminds me of The Matrix :-)
Conjuring things out of thin air does not have side effects because it is not possible.
The whole point of the phrase was to describe a hypothetical situation with no side effects to avoid sideways arguments about "but actually here's some bad things that would happen if you did that unrelated to the central argument".
I actually agree completely with that; my initial reply doesn't quite put the focus where it needs to be.
So let me try again: in my view, you shouldn't reason about policies or laws that impact real people by placing yourself outside of reality in an idealized case, because the hard bit is not conjuring ideals but finding a way of making them happen.
It's always a lot messier where the rubber meets the road. People have already died and suffered because of ideals (specifically around asset ownership) that weren't quite thought through but caught on. Take communism as an example.
Part of my point is that such "implementation details" are not as unrelated to the central argument as they seem. This is very different from the software world where it might be ok to assume that in 2 years we'll have the computations be 10x as fast and work out a solution backwards from there.
There is a distinct lack of capitalist incentive to spend your time developing a novel style only to have your style replicated for anyone to use, based on your work, and without permission.
In terms of automation used to destroy incentives for original creators, based on original creators’ works, photocopying machines and generative models are in the same quadrant.
Yes, it is value destroying for the creators. It cheapens their work. However, it gives millions more people access to their ideas. Maybe that’s better.
In this particular hypothetical, it's not value-destroying for the creator; he's dead and presumably has no more interest in money. Whether the heirs of his estate make more or less has no impact on the fact that Moebius will be producing no more art.
My entire career has been selling copyrighted works for good money. And I have been able to confidently do that for many years because of copyright protection.
So “it just delays the inevitable” means what? I have to give my money back later?
The transition from horses to automobiles is a prime example of how protectionism can't halt technological progress. When cars first arrived, they faced stiff opposition, especially from the horse-and-carriage industry. Some countries even introduced protectionist policies, like high tariffs on imported cars or regulations favoring horse-drawn vehicles, to shield their traditional industries. In the end, the end user will decide; or better said, they already have with regard to generative art.
The inevitable is that your industry will change, and the distinction between an "Artist" and an "Operator" will change even more. If most of your clients come to you for the end product, you are most likely an "Operator", i.e. "Do X like I want it to be done." If you are an "Artist", clients come to you because you are either in the Zeitgeist or because people like your way of thinking and the process behind your art. The end product is collaborative.
If you are an "Operator" you will have problems in the near future; if you are an "Artist" you will be fine. It's like in VFX and the mark the writers' strike made on the industry: the "Artists" are all fine because the studios want to retain them, and the "Operators" are left on the street.
And I agree, requiring permission to use copyrighted works in model building will give legs to the current paradigm, but only delay the change.
However, AI is going to upset everyone’s apple carts, so anything that allows changes to happen more smoothly, less disruptively, is probably worth the effort.
AI is likely to devalue all human labor, except for the provenance value of creations (creator, history, associations) and a preference for the human element (many personal services, or services with a personal touch element).
Imagine the kind of hellish world we'd live in if you could just ignore the law and remove the capacity of the market (and by extension all of society) to make human expression possible. Yes, even artists need to be given the possibility of being able to feed themselves.
Imagine what I could accomplish if I was a tech startup flush with cash unencumbered by the law. Surely a planetary hostile takeover would only be a few years and existential gambits away.
As of today, it is by no means clear that a law is being broken and the IP lawyers I know tend to think not. But the courts and perhaps Congress will decide.
I have no idea what you mean by laundering IP. Existing IP is built on all the time in ways that are or are not permissible depending upon the nature of the IP protection, if any, and the nature of the extension.
And sometimes things end up going to court, especially in the context of patents.
And lots of things that aren’t generally protectable like new artistic styles and techniques are co-opted all the time by other artists.
Ghaff is an intellectually dishonest poster expressing nonsensical views about copyright law (that it only protects against verbatim copying); it is not worth the effort to undo his "bullshit asymmetry principle": https://en.wikipedia.org/wiki/Brandolini%27s_law
Human authors cannot read and perfectly memorize millions of books in a day, and are therefore not comparable to computers running machine learning software.
That's not how machine learning works. They don't "perfectly memorize" anything. They do learn much quicker than humans, of course, but that alone doesn't seem like a good argument.
You are looking at a Large Language Model (specifically GPT3.5) output full paragraphs from a book it was trained on. The prompt is the first 13 words of the book, shown with a white background in the first screenshot. The words with green background are the LLM outputs.
The second screenshot is a diff that compares the LLM output with the original book.
The book is Frankenstein. You might argue that it is a public-domain book, but it demonstrates that these LLMs can and do memorize books.
It also works with Harry Potter, but the API behaves weirdly: after quickly producing correct output at the beginning, it suddenly hangs. You can continue generating the correct words by doing more API calls, but it only does a few words at a time before stopping. It clearly knows the right content, but doesn't want to send it all at once.
I think there is some output filtering for "big" authors and stuff that is too famous that they filter in order to avoid getting sued.
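If anyone wants to reproduce the diff from the second screenshot, something like this works (file names are hypothetical; supply your own reference text and saved model output):

```python
# Compare LLM output against the original text and measure
# verbatim overlap, word by word.
import difflib

original = open("frankenstein_excerpt.txt").read()   # your reference copy
generated = open("llm_output.txt").read()            # saved model output

a, b = original.split(), generated.split()
matcher = difflib.SequenceMatcher(None, a, b)
print(f"similarity: {matcher.ratio():.1%}")

for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":                               # show only divergences
        print(tag, " ".join(a[i1:i2]), "->", " ".join(b[j1:j2]))
```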
Do you have a credible source that says an LLM like the ones trained by OpenAI can perfectly memorize millions of books? Or be trained in a single day?
It's provably impossible for a model with 1.76 trillion floating-point parameters (like GPT-4) to memorize millions of books?
How many bytes do you think a million compressed books take? Consider that the way these models are trained is basically by completing the next symbol based on the previous words, which is how most compressors work.
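Back-of-envelope, with admittedly rough assumptions about book size and compression ratio:

```python
# All figures are rough assumptions, not measurements.
million_books   = 1_000_000
bytes_per_book  = 1_000_000            # ~1 MB of plain text per book
compression     = 4                    # ~4:1 for decent text compression
corpus_bytes    = million_books * bytes_per_book / compression

params          = 1.76e12              # rumored GPT-4 parameter count
bytes_per_param = 2                    # fp16
model_bytes     = params * bytes_per_param

print(f"compressed corpus: {corpus_bytes / 1e12:.2f} TB")  # ~0.25 TB
print(f"model weights:     {model_bytes / 1e12:.2f} TB")   # ~3.52 TB
```

So on raw capacity alone, "provably impossible" is far from obvious.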
Are you really trying to argue that there exist humans that can learn facts at a similar pace to a datacenter running GPT training software on petabytes of scraped data?
My point still stands.
"Your honor, I don't know human minds work, but clearly LLMs work the same way" The legal burden of evidence is on LLM proponents to establish that what they are doing is the same as the human mind and therefore should be treated the same way.
That's a bit of a false dilemma. To address copyright issues, we needn't prove that machine learning models learn exactly like humans or not. The more relevant point is that neither human learning nor machine learning has the intent to store or replicate copyrighted material; both aim to generalize from data to produce new content. It is in this way that they are similar.
It's the argument that is being made. Intent isn't a requisite for copyright infringement. Your re-characterization of the argument is so general that it's useless.
I wonder if this might be because Slavboj might be a monist/physicalist, and you might be a dualist[2]? If that's the case, we'd all argue until we're blue in the face if we don't at least recognize this underlying difference. For the record, since I've studied biology, I'm probably closest to some form of mechanism[3] (due to the rejection of vis vitalis[4] in the early 20th c.).
Of course, you could also just be a very skeptical monist mindful of the Kluger Hans effect[5] and working from there.
Let me know which (if any), maybe we can still find middle ground!
It's like saying I can kill people with a swiss army knife so I should be able to own a nuclear bomb since it's _literally_ the same thing, besides the scale but that's a detail right :)
Imagine if, after having read and trained a single, bright student through university, you could clone them a million times and get their clones to churn out content for a penny per 4k tokens.
One solution to this would be to say this:
* The copying done while training a neural net is "fair use", similar to copying the DVD into RAM and onto the screen while watching it.
* The resulting neural net is a derivative work of all copyrighted works used during its training: no copying of the derivative work is allowed without permission of all copyright holders
* The output of the neural net is subject to normal copyright laws for humans: i.e., it's only a violation of copyright if it's obviously a copy.
Basically, you can either train every instance of a neural network separately (like you have to do with humans) or get a license for all your training data.