Sarah Silverman is suing OpenAI and Meta for copyright infringement (theverge.com)
382 points by Tomte on July 9, 2023 | 564 comments



> The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”

This is the makers of AI explicitly saying that they did use copyrighted works from a book piracy website. If you downloaded a book from that website, you would be sued and found guilty of infringement. If you downloaded all of them, you would be liable for many billions of dollars in damages.
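To put rough numbers on "many billions": US statutory damages for willful infringement can reach $150,000 per work (17 U.S.C. § 504(c)), and the Books3 corpus derived from Bibliotik is commonly reported at around 196,640 titles. Treating that title count as an assumption for illustration:

```python
# Back-of-the-envelope exposure if every title in the corpus were a
# separate willfully infringed work at the statutory maximum.
MAX_STATUTORY_PER_WORK = 150_000   # 17 U.S.C. § 504(c)(2) ceiling
BOOKS3_TITLE_COUNT = 196_640       # commonly reported count; assumption

exposure = MAX_STATUTORY_PER_WORK * BOOKS3_TITLE_COUNT
print(f"${exposure:,}")            # roughly $29.5 billion
```

Even at the non-willful $750-per-work floor the total would still run into nine figures, so "many billions" is not hyperbole under the statutory-maximum assumption.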

But companies like Google and Facebook get to play by different rules. Kill one person and you're a murderer, kill a million and to ask you about it is a "gotcha question" that you can react to with outrage.


Let's take a second to remember that copyright is the reason ~every child doesn't have access to ~every book ever written.

While it might be too disruptive to eliminate copyright overnight, we should remember that our world will be much better and improve much faster to the extent we can reduce copyright's impact.

And we should cheer it on when it happens. A majority of the world's population in 2023 has a smartphone. Imagine a world where a majority of the world had access to every book ever digitized, and could raise their children on these books!


> Let's take a second to remember

This is emotionally manipulative speech that provides no value to HN and only serves the purpose of bypassing peoples' logical reasoning circuits.

> ~every child doesn't have access to ~every book ever written

More manipulation - "think of the children!"

Copyright exists because people who produce content with low distribution costs (e.g. books) need some protection for their work being taken without compensation.

Fundamentally, you are never entitled to someone else's work.

There are already tens (hundreds?) of thousands of books in the public domain, and tens of thousands more under Creative Commons licenses (where the author explicitly released their work for free distribution). There are lectures on YouTube and MIT OpenCourseWare, full K-12 textbooks on OpenStax and Wikibooks, and Wikipedia, Stack Exchange, the Internet Archive, and millions of small blogs and websites hosting content that is completely free.

There is no need for "a majority of the world had access to every book ever digitized" - and it's deeply morally wrong (theft-adjacent) to take someone else's work without compensating them on their terms.


> and it's deeply morally wrong (theft-adjacent) to take someone else's work without compensating them on their terms.

But I'm willing to bet that you don't believe this consistently across domains, and the domains in which you do believe it have been selected rather arbitrarily, not by you but rather by industry lobbying pressure.

Copyright doesn't exist for mathematics, jokes, fashion designs, architectural styles, recipes, and many other areas of human work. All of these represent similar creative work to the work done by musicians and writers. But we don't force comedians to license each others' jokes, or sue bars for letting people tell unlicensed jokes in public. And almost all of us wear clothing by uncompensated designers. And of course it would be unfathomably destructive to allow something analogous to copyright for a mathematical idea.

We also set limits on how long heirs can inherit copyright, which we don't do for other kinds of property, and we don't have any moral issues with that.

So it's important to remember that our moral intuitions about work and material products don't really translate to information and that we are truly making all of this up as we go along, under the intense corrupting pressure of a few very sophisticated industry lobbies.


You’re cherry picking a single line and deflecting with whataboutism. This is not benefitting HN.


Copyright is a system meant to ensure creators get compensated proportionally to the value they add to society

It is a bad system, though. It restrains society from benefitting from said work unless certain terms are met (usually payment). It's often unrealistic to pay for everything you'd like to consume, especially since access to the work is usually badly quantised.

(For example, I may want to search Sarah Silverman's book for a keyword, once. That'll cost me the same amount as someone reading the book from start to end. [please take this as an illustrative example and don't jump literally on ways to solve this exact problem])

I don't have a better solution yet, but I think we should definitely open up this discussion: can you come up with a system which compensates those who add value to society without restraining access to their products?

I'll go even further to say this is the fundamental issue in our society, way beyond copyright. That is, to find a way to compensate people for the value they add while eliminating the incentive to artificially limit access to resources needed by others

I'll be more concrete with another example. Take a limited resource: housing. In a society where you're already justly and maximally compensated for the value you add, you don't need to own more than the house you live in. In other words, I'm advocating for a society where ownership is limited to what can actually be consumed; importantly, ownership would no longer be a way to extract value by artificially creating scarcity for others.

I think this website has the best candidates capable of devising such a system. But we need to start by having conversations about the requirements for a better society

It will further need approval from society at large even if it's well defined, so the road is long. But we need to start somewhere, and that's requirements

Sorry I derailed a bit, but I think all this ties closely to the debate: 'is copyright a good system?' rephrased as 'if we want to achieve a goal, is limiting knowledge the best way to go?'. Which can be extrapolated to 'is artificially limiting access the best way to ensure those who produce value are equitably compensated for it?'


> Copyright is a system meant to ensure creators get compensated proportionally to the value they add to society

Compensation to the creators was merely a means to an end. Furthering progress was the goal:

> Article I, Section 8, Clause 8: Patent and Copyright Clause of the Constitution. [The Congress shall have power] “To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.”


Well doesn't that just provide extreme support for the cause to reduce copy protection to ensure access for all who could benefit?


Depends on how you define "progress".


Yes


It's worth challenging the length of copyright. 20 years seems good enough for high tech R&D, probably good for books as well.


Sure! Yes! I agree! 100 years is way too long. 20 years is much more reasonable. But the comment that I was responding to (and many others in this thread) are advocating for the complete removal of copyright, and that's what I'm responding to.


I was not advocating for the immediate complete removal; I acknowledged that this may be too disruptive.

I was simply reminding people what direction we should go in, and what the stakes are. Reducing copyright terms is a great solution, and yes something between 2 and 12 years is probably the right number to aim for as a first step. I agree with the above commenter that 20 is far too long, because 20-year-old material is slipping into irrelevance in many cases.

It's also good to remember that the only reason we have copyright (in the US, where "moral rights" are not a thing) is to stimulate the creation of more work. So we need to think about what configuration of copyright law actually stimulates more work, and we need to be willing to experiment.


Just to add: copyright is *not* 100 years. It's 'the entire life of the author plus 70 years'.

If I make a YouTube video and live another 60 years, then people copying that video would still be committing copyright infringement in the year 2150.
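As a quick sanity check of that date, here's a minimal sketch of the US "life plus 70" rule (17 U.S.C. § 302(a)), using this comment's hypothetical of a 2023 video by an author who lives another 60 years:

```python
def expiry_year(year_of_death: int, post_mortem_term: int = 70) -> int:
    # US copyright runs 70 years past the author's death for works by
    # individual authors (17 U.S.C. § 302(a)); works for hire differ.
    return year_of_death + post_mortem_term

death = 2023 + 60            # author survives another 60 years
print(expiry_year(death))    # 2153 -> copying in 2150 still infringes
```

So the video would indeed remain under copyright until 2153, three years past the 2150 figure above.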

That's just insane.


100 years is arguably “unconstitutional.”

The Constitution says "for a limited time". Death of the author + 70 is effectively unlimited from the perspective of a typical human.


Corporations have personhood in the USA, no?

For the life of a human, yes; for the life of a corporation, it's long but not too long. Especially for corporations such as Disney, which are sure to last for quite a while.

Not saying this is a justifiable position. I'm actually in favour of drastically reducing copyright (I believe 7 to 15 years might be a sweet spot). But a lot of laws are no longer made for regular human people.


Most book sales happen within the first 5 years. It's a trickle after that.

On top of that, knowledge moves too fast these days for a 20 year right to be useful for society.


> Fundamentally, you are never entitled to someone else's work.

This is fair but incomplete: you are not entitled to compel someone to work.

But copyright is more, it controls what is done with the work after the author freely produces it and gives it to another. It is an artificial construct that we have created for good reason.


> it controls what is done with the work after the author freely produces it and gives it to another

No, not "gives" - sells. There's a transaction involved - and part of copyright law is preserving that transactional nature, such that person A doesn't sell their work exactly once to person B who proceeds to give it away for free to every other person on the planet.

The fact that the creator of a work sells it to one person does not give that person license to pirate it.


There are already many more public domain books than people are inclined to read:

https://www.gutenberg.org/

https://librivox.org/

many of which form the basis for an education:

https://news.ycombinator.com/item?id=34630153

And most of which, when in copyright, paid their authors quite handsomely in terms of royalties.

If you believe that books should exist without copyright, then one has to ask --- how many books have you written which you have explicitly placed in the public domain? Or, how many authors have you patronized so as to fund their writing so that they can publish their works freely? Or, if neither of these applies, how do you propose to compensate authors for the efforts and labours of writing?


People keep forgetting the purpose of copyright. It’s easy to find, it’s in the Constitution!

“To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.”

If the current copyright scheme does not promote the progress of science and useful arts, it is not performing as intended. All these copyright extensions do little to promote progress!


> If the current copyright scheme does not promote the progress of science and useful arts, it is not performing as intended. All these copyright extensions do little to promote progress!

This is a strawman argument - the parent poster that you're replying to is defending copyright as a concept, while the one that they're replying to is attacking it - you're instead making an argument about specific lengths that hasn't come up yet, and nobody is defending.

Very few people (and I am not one of them) think that the "Mickey Mouse curve" style of copyright extension is genuinely useful to anyone except Disney, but holmesworcester is arguing that copyright should be abolished entirely, which contradicts the section of the Constitution that you quoted.


The good public domain books are typically outdated in both content and language. This makes it hard for those with fewer resources to stay competitive, and makes the task of understanding unnecessarily hard.

The link you posted, which you say used public works to form a "basis for an education", uses Aristotle as an author, for example, and seems to involve teaching in the context of an instructor-led class (where an expert can discriminate what's still relevant today and help decipher the language).


The notion that you need cutting edge books to educate yourself is poorly informed.

There are countless well written resources for every aspect of a good education available for free online. They are often not as easy to find as their well-advertised modern equivalents.

Saying there aren’t good free textbooks is like saying there aren’t good free classics on Project Gutenberg. It’s ignorant.


The problem with your reasoning is similar to coming up with the highest possible number: whatever number you figure out, someone will say "that plus one!".

Whatever standard of free service you establish, someone with resources can surpass, but needs to be motivated to do it by the perspective of a return of the resources spent.

So either hinder development by forbidding the commercial stuff altogether, or tax everyone and then finance authors from public money; but hopefully I don't have to explain how, wherever this model is tested, society degrades toward famine.


I just want to add https://standardebooks.org/ebooks to the list


I’d rather ask what the costs vs. benefits of keeping the existing scheme are; it’s not free to run all these DRM services, prosecute offenders, etc…

Not to say I support no copyright…


The benefit of the existing scheme is that new works get created, and some of them are even copy-edited and published.

How many texts are created which are explicitly placed in the public domain and from which the authors have made a conscious decision not to profit thereby?


Books and libraries have existed for thousands of years. It’s copyright that is the young intruder.

Most people who write non-fiction books do it because they want to contribute to human knowledge and be recognized as an expert in a particular field, not because they think that writing a differential geometry textbook is their path to riches. With the internet, more and more text books are made freely available by their authors - the reason this didn’t happen in the past is because there was no other way to pass knowledge around than teaming up with a publisher who is able to put your knowledge on to dead trees. It’s fair game to put older books on the internet, so that the whole world can benefit, not just rich people in rich countries.


And books and libraries existed under a system of patronage and royal imprimatur which severely restricted access, even beyond the expenses of copying a book by hand or physically printing a copy using metal type.

The problems with authors making texts available directly are:

- no gate-keeping, so it's hard to find what is worth reading and what isn't

- no proofreading --- it kills me that errors in books are so casually accepted these days

- few authors have the skills to draw illustrations so as to have a meaningful and clear presentation

I've worked with raw author manuscripts --- in most instances they're not something anyone would choose to read given any other option.


Yet not as young as some people are inclined to think. For example, Charles Dickens was known for pushing for international copyright [1] (domestic copyright was already well established). I acknowledge that he wrote fiction while you're primarily talking about non-fiction, but the point that it's not that recent still stands.

Also you've said yourself that many people can and are offering their work for free. Awesome. So I would prefer not to force others to (though I would certainly be up to consider repealing many of the posthumous copyright extensions that are fairly recent)

[1] https://www.charlesdickensinfo.com/life/copyright-laws/


Since you mentioned textbooks on differential geometry, I can recommend anyone interested Sigmundur Gudmundsson's lecture notes in introductory differential geometry as well as Riemannian geometry. You can find them freely available on his personal academic page.

https://www.matematik.lu.se/matematiklu/personal/sigma/


> The benefit of the existing scheme is that new works get created, and some of them are even copy-edited and published.

How do you know that this benefit wouldn't exist in other schemes? Look at permissive open source software, which is essentially public domain + shield from liability. No copyright does not mean no compensation. It just means different compensation that doesn't deprive other people of their right to share information.

Unlike GP, I support the complete abolition of copyright. Society needs to find another scheme to reward work, perhaps Kickstarter-style firms that exercise direct oversight over funded projects, or some other scheme that doesn't cause so much harm.


> Look at permissive open source software ... No copyright does not mean no compensation

From everything I've heard it kinda does. If you're writing something valuable then maybe a company will employ you to keep working on it, and the portfolio can certainly help in interviews (to write other software), but getting non-negligible compensation for the use of the software itself is rare. Even those projects that are well funded, like the Linux kernel, are done so not out of goodness of heart, but due to companies realising it's in their rational interest to have a common standard base of sorts

In terms of written works, the best comparison we have is Wikipedia, and while the foundation does receive funding from companies who realise how useful of an integration it can be for their products, the writers themselves do not get paid afaik (and when they do, it's rarely a good thing). But if you just wrote an open source textbook, I doubt you'll manage to make much money off it, and the prospects for fiction look even worse

> Perhaps kickstarter-style firms that direct oversight over funded projects

Sounds like a return to rich patrons and needing to flatter them to get grants. Luckily with the modern internet we do now have democratised layman patronage, but why the need to force everyone into that model? Also note that almost all successful Patreon artists do have perks for paying, even if it's just early access, and afaik make a lot of their money off commissions. Those who just post art for free, with no paid comms, and just have a "tip jar" make relatively little from what I hear


> From everything I've heard it kinda does. If you're writing something valuable then maybe a company will employ you to keep working on it, and the portfolio can certainly help in interviews (to write other software), but getting non-negligible compensation for the use of the software itself is rare. Even those projects that are well funded, like the Linux kernel, are done so not out of goodness of heart, but due to companies realizing it's in their rational interest to have a common standard base of sorts.

Creating a permissive open source project and having it be magically funded by benevolent parties is exceedingly rare. Usually the funding starts first, or the effort proceeds in lock-step with funding.

There's a bit of a cart-and-horse here though. Non-permissive licenses like AGPL are often not the result of a single author hoping they might be able to negotiate some licensing deal in the future - it is the result of a commercial enterprise trying to be restrictive in the ability for people to use their source without compensating them.

Same with the GPL, normally considered a more open license than the AGPL: MySQL was reported as going after others for using independent database drivers without buying a commercial license, claiming that use of the MySQL network protocol made the application using the driver a "derivative work" under the GPL.

Linux is a special case because there are a large number of commercial entities which realize contributing to Linux is way cheaper and faster than writing their own kernel and porting user land software to it.

Apache HTTPD, on the other hand, is an example of an application where corporations DID find the motivation to write their own alternative funded with a commercial model, such as NGINX.


There's a huge vendor community around Kubernetes, which is open source and permissively licensed to boot. If you write something complex that basically works you'll be able to develop defensible IP around management.

p.s., I'm sure many of those vendors are not making money yet but they all aim to.


And for folks who want to create such works, they are welcome to --- as noted elsewhere in the thread, I've put a fair bit of effort into permissively licensed texts --- but I haven't seen a workable method put forth, nor a rational justification for destroying the value of works recently created under the current system.

I still don't see how a person having copyright over the work which they have created and the ability to license it to their best profit is a harm.


This is a logical error.

The benefit of automobiles is that people move across vast distances.

Wrong. People moved across vast distances before, using horses. Yes, automobiles are better and traveling is now easier. But we have no way to figure out how many people would travel by horse if there weren't a better alternative. It's even possible that banning cars could eventually lead to an even better method of transportation (an escaping-a-local-minimum kind of thing).

Authors who write books have tools to protect their interests, so they use them. Without these tools perhaps there would be fewer authors. Or maybe there would be more: I think Windows and Photoshop are so popular because they were pirated a lot. Returning to the context of books, fewer and fewer people read them, but maybe if books (attractive books, not old public-domain ones) were free, the trend would reverse: books would popularize, people would start enjoying deeper entertainment, get smarter, transform society for the better, and support authors on e.g. Patreon… Or maybe not; I'm just mentioning some nuance.


>If you believe that books should exist without copyright, then one has to ask --- how many books have you written which you have explicitly placed in the public domain? Or, how many authors have you patronized so as to fund their writing so that they can publish their works freely?

Lol

Edit: I should probably clarify here. While I can’t speak for OP, I can say that, for some reason, I am sure there are people who have done both lol


I've authored quite a bit of book-like content which has been made freely available:

- the Shapeoko wiki (still available on archive.org)

- a couple of articles for TUGboat

- edited a couple of texts on wikibooks trying to make them better

- currently working on https://willadams.gitbook.io/design-into-3d/

The overlap between folks who don't believe in copyright and those who have actually produced something other folks want to read is quite small, excepting the odd manifesto.


Second-order effects matter, though: If everyone is allowed to steal books, what's the incentive for experts to write new ones, and for the publishers to reward them for it?

Btw, not a fan of "but what about the kids" rhetoric: https://en.wikipedia.org/wiki/Think_of_the_children


> If everyone is allowed to steal books

Nothing was stolen - just copied.


> Nothing was stolen- just copied.

This typical semantic-pedantry line from piracy apologists misses the point - piracy is theft-adjacent even if you get to pick your use of "theft". Incidentally, my definition of "theft", and that of most content creators, includes the act of consuming something without compensating the creator on their terms - which includes piracy.


A fundamental property of theft is that the action deprives someone of something they had. Piracy is not theft-adjacent. It has nothing do to with theft. It's a completely different concept.


No, a fundamental property of theft is that it is the taking of that which does not belong to you - which clearly includes piracy.


That's an interesting thought. So when I overhear a conversation that was not aimed at me, I am stealing? Quite an interesting definition, with wide-reaching implications.


[flagged]


"Pretty wild" - standard emotional manipulation. Not appropriate on HN.


As a book author, I can say that, "Yes, something was stolen. My opportunity to earn a living taking care of readers."

Now you may believe the incredibly self-serving baloney from big companies like Google. You may want to pretend that infringement isn't theft. To you, I hope that some homeless kid breaks into your home, starts squatting, and says that, "Hey, this isn't theft. Nothing has been destroyed."


squatting monopolizes space, only one person can own it at a time

cool thing about information is when you make a copy, you've doubled the information - plenty for everyone! if only housing worked the same way


As a book author, what's your feeling on libraries, out of curiosity?


Actually, very much so: the revenue for each purchase that didn't happen because of this was stolen. I'm not a piracy hardliner and don't see an issue with someone in a bad financial spot pirating things, but this is bullshit.


[flagged]


That's not really the same, you could have a copy of my credit card, but you using it to make purchases would become an issue. Regardless, that quickly steps out of the domain of intellectual property.


If I use your copied credit card to buy software, all that happens is (a) some bits get written to my computer's hard disk and (b) some bits get written to a Visa or Mastercard server's hard disk. Why would you get upset about that?

EDIT: ok, it's fun playing obtuse, but more seriously it is also very obtuse to pretend that intellectual property is not property. I think you have to be pretty stupid to not understand why an author who spends a decade painstakingly creating a work of literature is entitled to ownership of reproductions of that work.


It is not playing obtuse to think "I also have a copy of an ebook" is different from "I have literally stolen your property".


Unless the contents of the ebook are my property, in which case you have stolen my property.


But this is clearly not "everyone". This is just one/two readers. Not even copiers, as the network does not "remember" the content of the book, at least not better than a casual reader. Infringement in this case is more like standing in a book shop reading the book for free, or loaning one in a library for a day, not actually copying for keeps.


The comment they replied to literally said "Let's take a second to remember that copyright is the reason ~every child doesn't have access to ~every book ever written."


books hardly make money, so book sales are not the top incentive for most writers. i would be willing to bet that most books pay less than minimum wage, when considering the labor hours it takes to write the book and how much the writer is ultimately paid out. if not less than minimum wage, then certainly less than the expert's typical hourly rate.


I think that would wholly destroy the ability of writers to actually make a living writing novels. The fact that the living isn't great now does not justify this.


People wrote great works before copyright.

People write for reasons other than money from the sales of the book. This accounts for most authors, who aren't famous enough to negotiate a great deal with a publisher.

And we don't need to abolish copyright outright. Just require that it is continually published at a steady or decreasing price, or it becomes public domain. And put works in the public domain a little sooner.


> People wrote great works before copyright.

Yes, wealthy aristocrats wrote whatever they wanted and less wealthy authors wrote what they got paid to write by their wealthy aristocrat patrons.

Copyright and the publishing industry changed that to make it possible to live by writing for ordinary people.


And now wealthy aristocrats have been replaced by staggeringly wealthy megacorporations like Disney and WB.

It's disgustingly dishonest to appeal to "poor" authors when a casual glance immediately shows that modern IP law overwhelmingly profits only a few massive corporations and their shareholders.


People did write before copyright. Copyright was established to make it more likely.

Yes, people write for other reasons (e.g., self-promotion). But that does not account for "most authors" I would actually want to read.

I'm not against copyright being different, especially for things that are written as work-for-hire. But that's a fight with Disney. Good luck.


> People did write before copyright. Copyright was established to make it more likely.

No, it was established to make investing in printing presses more profitable, which is why the rights initially attached to printers. Authors as the locus of rights were a later change.


Yeah, these were rich people like Jane Austen and Mrs. Percy Bysshe Shelley. Copyright makes it possible for the non-rich to create these works and make a living from a willing audience.


The average full-time author in the UK earns below minimum wage, and most authors are not full time.

While removing copyright without some form of mechanism to replace it would be problematic, and I do agree there are second order effects to consider, at least for books the effect may well be less than you'd think.


Plenty of freely published articles and fanfics online


It would look largely identical to ours, I think. It's pretty trivial to get access to many, if not most, e-books.

Any public-domain work is available on Project Gutenberg [0]. Copyrighted works can be accessed for free using tools of various legality: Libby [1] is likely sponsored by your local library and gives free access to e-books and audiobooks. Library Genesis [2] has a questionable legal status but has a huge quantity of e-books, journal articles, and more.

[0]: https://www.gutenberg.org

[1]: https://www.overdrive.com/apps/libby

[2]: https://libgen.rs/fiction/


It’s important to note that only some copyrighted works can be accessed for free using the legal options; I’m a member of probably a dozen Overdrive-supporting libraries and still frequently find titles unavailable for loan of any kind.

I’d love to see an analysis of what % of books are available via libraries around the globe.

Also, the whole DRM thing is a massive pain; audiobooks especially are terrible at allowing side-loading onto a consumer-friendly device (such as an MP3 player).


>frequently find titles unavailable for loan of any kind

Even through interlibrary loan?


Public domain would be fine, if we had the original copyright term instead of "life of the author plus seventy years."

Libby is an interesting option, though I'm curious how many kids in disadvantaged countries would actually have access to it.

Regarding Libgen, I'm not convinced it makes the case for modern copyright to say it's fine, because people can just violate copyright.


Exactly. The original copyright term was 14 years. Now Disney and the Intellectual Monopoly cartel have extended it to as much as 120 years, and convinced an unthinking population that anything less is "stealing."


Hoopla.com as well


Then most people stop writing books because they can't get paid for their time/effort and ~every child will be stuck with outdated knowledge within a decade.


Or maybe we could figure out a new economic model, instead of blindly sticking with one based on the limitations of the pre-digital age.


How about you go figure out this new economic model, and come back when it's ready. Until then, the existing model will persist, thank you


Oh boy, do I have good news for you!

There are already many writers making thousands of dollars a month by publishing free serialized web novels, via Patreon. Some are using their own websites, but most are on Royal Road (or scribblehub, webnovel, wattpad, AO3).

A random example from Royal Road[1], the author makes $12065/month. Mind you, the text is not gated, it's free to read, the patreon only offers early access...

[1] https://www.royalroad.com/fiction/63759/super-supportive

[2] https://www.patreon.com/Sleyca


What do you think authors feel about this? Despite many people on this site thinking we're the first generation with new technology, the issue of copyright, and technology's impact on it, has been discussed for centuries.

The "patronage" model is great (and I personally am a Patreon supporter of lots of creatives). But it also has a lot of flaws in it, both for the author and the public. Most authors will be happy to tell you both the good and bad of the model, in my experience.

The biggest flaw for the public, btw, is that this model only supports art that "rich people" find worth supporting. This is bad, both because sometimes art isn't "deemed worthy" immediately, and because art for non-"rich" people is also very valuable.


Patronage was art for rich people in pre-internet days. Now we can make it easy for anyone to contribute a dollar to an artist they like.

As for art that isn't "deemed worthy immediately," that doesn't immediately make money in today's system either. If the art can be freely distributed, it's more likely to find its audience.


Contributing a dollar to an artist they like is beyond the ability of many people in the world. Both in technical terms (actually being able to reliably send someone a dollar), and monetarily.

And very few artists are out there surviving on patrons giving them a dollar a month, I imagine. Most of them survive on larger pledges, and as far as I know, very few of them reach anywhere near the amount of money traditionally-published authors can make.

> As for art that isn't "deemed worthy immediately," that doesn't immediately make money in today's system either. If the art can be freely distributed, it's more likely to find its audience.

Yes, it doesn't make money immediately in today's system. But many authors collect a back-catalogue of works, which might pay out only a bit of money at a time, but over a career can be enough. If I publish a novel every year for 20 years, even if every novel only brings in a sprinkling of money per year, by year 20 I'm getting 20 sprinklings of money, which could be enough to sustain me.

Very few of these "middle of the road" authors are popular enough to survive on patronage alone, I believe.

---

At the end of the day, you think that "the internet" somehow made possible something that wasn't technically possible before. I think that's mostly not true - the technology was never the problem. That's why copyright exists in the first place - to make sure that, despite books being almost-zero-cost to reproduce, we put an actual limit on it in order for more books to be published.

If you want a different system, that's your prerogative. But you have to grapple with the tradeoffs here. If less money flows to authors - which would happen if you abandon copyright - then there will be fewer books. That's almost a law of economics. If you think that somehow you can make something like Patreon scale up to the same amount of money that exists in "the system" today, then a) I disagree, and b) you'd still have to contend with the issues I mentioned.


Exactly. Take music, for example. Increased capitalist exploitation [1] is what allowed people to spread their creativity far and wide, rather ironically for genres like punk. There's still nothing stopping people from giving away stuff for free, and indeed I do that myself with both FOSS code and CC-BY media (not music), but I'm under no illusions that I'll ever make more than beer money from it, and I don't fault anyone else for going the all-rights-reserved route.

[1] I use that term in the neutral, economic sense


This model is clearly not viable for any sort of economy at scale. Far fewer people can make a living wage under patronage models than traditional models because people simply pay less (far less) for things where they're not compelled to than for things where they are.

420 thousand people work in the US film industry alone[1]. I doubt that there's that same number of people making ~living wage across every online patronage site in the United States (excluding advertisement-driven ventures, of course).

[1] https://www.statista.com/statistics/184412/employment-in-us-...


Your "random" example is currently the #2 ongoing fiction on there, and was #1 until a couple days ago when the guy who wrote Mother of Learning released his new fiction.


It was random in the sense I clicked on trending and then clicked on the first thing I saw. Sure, the amount is an outlier, but there are more that are making over 500 or 1000/month, that's not unusual. Most authors are not making any money, of course, just like in the current system.


> Oh boy, do I have good news for you!

Is this some kind of pwn? Stephen King makes millions, JK Rowling likely a billionaire, Grisham, Patterson, Joan Collins, the list is endless. Didn't the 50 Shades of Grey author initially start with free fan fiction? Do you think these people are going to take a pay cut? And if your "random" author gets noticed by a publisher and is offered a book deal, do you think it will continue to be free?


The point is that a system exists where books are freely available while the writers get compensation, quite serious compensation.


> the patreon only offers early access

So it's still copyright, just without the 100 year protection.

If you went around posting their early access posts to a free website do you think the authors wouldn't complain?


Mandatory licensing seems fine.

The government charges a tax on everyone. Revenue is handed to authors based on how many times the work is used.

There are some questions around how to weigh things like societal importance of the work.

(“Not melting down nuclear reactors for dummies” seems like it should get more money per view than “poodles in outer space vol XXXII”, despite likely having lower readership)
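A back-of-the-envelope sketch of the pro-rata payout described above. Every title, read count, weight, and the pool size here is invented purely for illustration:

```python
# Toy mandatory-licensing scheme: a fixed tax pool is split among works
# in proportion to (reads * societal weight). All numbers are made up.

def payouts(pool, works):
    """works maps title -> (reads, weight); returns title -> dollars."""
    total = sum(reads * weight for reads, weight in works.values())
    return {title: pool * reads * weight / total
            for title, (reads, weight) in works.items()}

shares = payouts(1_000_000, {
    "Not Melting Down Nuclear Reactors for Dummies": (10_000, 5.0),
    "Poodles in Outer Space Vol. XXXII": (200_000, 0.5),
})
# The reactor book earns far more per read, even though the poodle
# book's larger readership gives it the bigger total share.
```

The hard (and political) part, of course, is who sets the weights.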


Isn't this the opposite of what the people against copyright are seeking? They want more art to be unprotected, instead, in the public domain, free to use for all. You're making things even more restrictive financially by attaching a tax to all books/songs/movies etc, plus creating a government-controlled, universal monitoring and enforcement system.


First change the incentives, then everyone will work on finding a new economic model.


This makes even less sense than the previous guy. When you've figured out how the world will work without people incentivized by money and power, get back to the rest of us


Suing people for reading a book is one of the ways the poor are kept poor.

The world will work just fine without ceding control to people who seek money and power, because most people aren't like that. The question is how to prevent the few who are from oppressing the rest of us.

That is quite a challenge, but haven't you ever created something just for the fun of it?


> Suing people for reading a book is one of the ways the poor are kept poor.

Please cite a single lawsuit.

> The world will work just fine without ceding control to people who seek money and power, because most people aren't like that.

Virtually everybody is "like that." The world works well because tons of people are working hard, toiling in difficult, dirty, boring, frustrating, or tedious jobs (or all of the above) behind the scenes to make it so. What are municipal waste workers seeking, if not money? Bus drivers? Construction workers? Police officers?

> That is quite a challenge, but haven't you ever created something just for the fun of it?

Any author or songwriter or photographer can proclaim his work to be freely copyable, just like programmers release code under MIT licenses. The fact that so few actually do, should clue you in on their incentives.


> Suing people for reading a book is one of the ways the poor are kept poor.

Who is suing anyone for reading a book?

> The world will work just fine without ceding control to people who seek money and power, because most people aren't like that. The question is how to prevent the few who are from oppressing the rest of us.

Please get me in contact with your dealer because apparently the “legal” stuff I’ve been getting is not as potent as I thought.

> That is quite a challenge, but haven't you ever created something just for the fun of it?

Sure I’ve painted things that are on my wall and created lots of utility apps that I use personally but I keep them to myself. If I thought they were something that could get me some extra pocket money or better then I would be looking for ways to monetize. I have zero interest in sharing my potential intellectual property for free.


Making it so you can't write for a living is one way to make sure only the rich can spread knowledge.


How does the digital age change things? Copyright was invented when copying became easy, and hence arrived due to the printing press. A book is no different to a digital download in that the cost to produce it is tiny, and you're paying for the intellectual property, not the physical production of it


> The economic argument goes like this: “I want to get rich (usually described inaccurately as ‘making a living’), and if you don't allow me to get rich by programming, then I won't program. Everyone else is like me, so nobody will ever program. And then you'll be stuck with no programs at all!” This threat is usually veiled as friendly advice from the wise.

> I'll explain later why this threat is a bluff. First I want to address an implicit assumption that is more visible in another formulation of the argument.

> This formulation starts by comparing the social utility of a proprietary program with that of no program, and then concludes that proprietary software development is, on the whole, beneficial, and should be encouraged. The fallacy here is in comparing only two outcomes—proprietary software versus no software—and assuming there are no other possibilities.

This, but with creative works instead. https://www.gnu.org/philosophy/shouldbefree.en.html


The existence of Wikipedia seems to be a pretty strong counterargument to this.


Copyright is much newer than books.


So is internet and rapid copy-sharing of books.

I personally feel our copyright laws are too rigid, but that doesn't mean copyright shouldn't exist.

After x years, any book should be free to read, after y years, it should be free to be incorporated into AI models, after z years it should be in the public domain.


> So is internet and rapid copy-sharing of books.

I don't get your point. Because we can, we shouldn't...?


That's already the case, but x = z = 120 years (and y = 0 or 120, depending on who you ask...)

The first U.S. copyright law set x = z = 14 years. That's why many people think copyright law is out of control.


so why haven't they already stopped if it is already trivial to download nearly any new piece of media?


Because it's not. It is trivial to buy a book on the Kindle store, and while it may be trivial to us here to go and pirate a azw3 and transfer it, it's not to most people

People always forget the pareto principle when it comes to anti-piracy. No, they don't stop everyone, but a minor hurdle stops a hell of a lot of "ordinary" people


annas-archive.org, search and download, open it on your computer. functionally as trivial as purchasing a book on amazon. reading on a kindle is a niche case.


There's an obvious middle ground here. 15-20 years seems about right


What an absolute pile of nonsense. People who author creative works deserve to have control of them and make some money - otherwise you'll soon find we have far fewer great authors, artists, etc.

This is essentially the same as saying builders charging for houses is the problem with the housing market, so we're going to phase out paying builders.


> otherwise you'll soon find we have far fewer great authors, artists, etc.

It's been long understood that this idea isn't based on any real evidence. Creators create because they like to create things. Adding money to the mix tends to ruin most creative endeavours. Look up enshittification, or note how Google's only good search results involve the keyword Reddit.


There's "creators create because they like to create things" and "creators create things because they want to make a living off of what they create". If artists/authors/musicians/etc aren't going to be paid for what they create, they can't make a living doing it. If they can't make a living doing it, that severely limits their opportunity and time available for creating things since they could only do it as a hobby (unless we bring back royal patronage or something). Many of the best artistic works we have came from people who were able to commit 100% to the creative process. That's gonna be real hard to do if you can't pay for food and housing.


Copyright laws are red flag laws for publishers.

It has nothing to do with creators.

People are regularly paying creators directly to create through patreon, super chats, advertising, early access, subscriptions, etc.

The idea that you need copyright to protect you is just not based in reality.

Get rid of copyright, creators will find a way to monetize it if they want to make a living doing it.

It's not society's job to protect your ability to get paid for your hobby. There are no original ideas, just people that write them down. You don't own them, you extracted these ideas from society, the least you can do is give them back.


Hilarious you claim something isn't based on any real evidence, while providing zero evidence to back the claim up.


> Let's take a second to remember that copyright is the reason ~every child doesn't have access to ~every book ever written.

No, copyright is the reason that authors all over the world are working very hard to make new books for my kids and everyone else's kids, despite never having met me. Copyright is the reason so many brilliant things are actually created that otherwise would never be.

Of course I'd prefer to live in a world in which I get all the media I want, for free. But I have no idea how to make such a world happen, and neither does anyone else, and humanity has been discussing this for a few centuries.


>>copyright is the reason ~every child doesn't have access to ~every book ever written

Copyright is ALSO the reason that many books can be written in the first place.

Obviously, Copyright is abused and the continual extensions of copyright into near-perpetuity by corporations is basically absurd. And they are abused by music publishers etc. to rip-off artists.

But to claim that it should not exist, when it is utterly trivially simple for anyone to copy stuff to the web is to argue that no one should create or release any creative works, or to argue for drastic DRM measures.

Perhaps you DGAF about your written or artistic works because you do not or can not make a living off them, but I guarantee that for those creatives and artists who can and/or do make a living off of it, they do care, and rightly so.


This situation is a bit different, however.

A company that believes in strong intellectual property rights protection is using resources that blatantly ignore intellectual property rights to get access to the content for free.

I agree with you, however, that it's an argument in favor of abolishing strong intellectual property rights. At least for OpenAI's products.


I'm all for abolishing copyright, but how is this relevant to megacorporations ignoring copyright when it suits them while still expecting us to follow it when that suits them?


Using charged language of bringing children into the equation is not a good way in having a discussion.


It's straight-up manipulative. I downvote and flag whenever I see it, because it actively contributes to the decline of HN, and is absolutely not appropriate content for here.


...On the other hand, seems to work gangbusters for getting policy passed.


I’m imagining a world that looks just about the same as this one does. A larger book library doesn’t automatically make that medium more appealing to kids than what Mr Beast, Unspeakable, and the other crap kids love are doing.


...for the global middle class? Maybe.

For the world as a whole? Definite differences.

It just seems like a super jaded "kids these days" thing to hate on them for consuming easily accessed, free content- and acting like the global literacy and intellectual capital would remain unaffected.


How many of these kids have read what percentage of the books which are legitimately freely available?

I've never encountered a kid (other than my own) who has read:

https://mathcs.clarku.edu/~djoyce/java/elements/elements.htm...

but have encountered many others who struggle with geometry.


> For the world as a whole? Definite differences.

Tell me these differences.

Shitty internet videos exist and are what kids want all over the world. At some point you are going to have to face it that reading lost the battle for people’s eyes to video. I was an avid reader for most of my childhood and young adult life but now in my 40s I have accepted that I’d just rather watch from the deluge of visual media available vs reading.

Piracy also exists for books so copyright doesn’t seem to be that big a deal. In fact if I look at the top pirated books currently I’m going to run across more junk books like “Make Money Faster” and “Give her orgasms in under 30 seconds” than anything you might find intellectually stimulating.


Also, piracy exists and is a perfectly reasonable way for an individual to get access to a book, especially when they can't afford it.


I think the point here is that children can be denied access to copyrighted works unless they pay the owners, while OpenAI and Meta can do as they please. I don't disagree that the current copyright system needs improvement, but what I really, really don't like is seeing rich and powerful people breaking laws with impunity over and over and over again.


Copyright [and the textbook cartels] is the reason why my kids' highschool is using social studies textbooks that are 20-30 years old.

"Japan's incredibly strong economy is responsible for the manufacture of Datsun cars, boombox stereos, and touch-tone phones..."


This assumes that the law is applied equally to both rich and poor.


> copyright is the reason ~every child doesn't have access to ~every book ever written.

And? Is there some reason anybody, child or adult, deserves access to "every" anything? Should children have access to every video game ever made, every Matchbox car, every Lego set?


Yes? I can't see any reason "you can't have that Lego set because the giant corporation behind it stopped selling it, refuses to let anyone else make it, and resale is all horribly expensive from collectors" is something we should have to tell our kids.

I really hope my children will have access to a higher tech version of something like the (discontinued?) toy where you could melt old crayons into toy car bodies. 3D printing is almost there, but ideally the end product of a Lego set could be recycled into new blocks easily and quickly.


GP makes no remark on the morality/practicality of copyright. Also, having people sue big companies for copyright might lead to more of what you're arguing for, in a show them the taste of their own medicine way


Second-order effects are real - removing copyright would hurt authors. (See cstross comments in https://news.ycombinator.com/item?id=35761641 e.g.)


> Second-order effects are real - removing copyright would hurt authors.

Aiding authors is an instrumental, not fundamental, purpose of copyright in the US; the fundamental purpose to which any instrumental purposes are subordinate is “to promote the progress of science and useful arts”.

That is, while copyrights are a form of property, they are not something that is seen as natural property, but instead property explicitly granted as a means of achieving a public policy goal, and therefore limitations (or even elimination) harming the owners of that property is not the kind of dispositive argument against a policy that would be with the kinds of property seen as natural property.


To make this more explicit, the argument is that removing copyright hurts authors, meaning that fewer people become authors in the future, meaning that the progress of the useful arts is hurt.


I would be much more sympathetic to this stance if you weren’t implicitly endorsing the rights of companies like Meta/Alphabet/OpenAI to profit from the disruption of copyright law. If we’re talking ordinary people being able to breach copyright, then yeah seems potentially interesting. But let’s remember that these companies aren’t acting altruistically. They’re not giving away Silverman’s work - they’re repackaging it for their own profit. That’s not fair to the artist and in fact does not help the children.


I'm a fan of copyrights. While I think that the USA's implementation of copyrights has a few glaring flaws (namely, the duration of copyright is far too long). I firmly believe that the elimination, or effective elimination of copyrights is massively detrimental to our culture.

A fair middle ground would be for copyrights to last for 20 or so years. That's plenty of time to profit from a work while allowing people to preserve and distribute older works.


So why do we cheer on megacorps and not mom and pop pirates?


Who is cheering for 1 and not the other?

Megacorps should buy 1 of every book (if they want to train on it)


That's not parent's point. Their point seems to be that large companies don't suffer the same consequences from a crime as any layman.


In this world, the only authors would be people independently wealthy who don't need an income.


Anyone want to check if the book in question is in ThePile dataset?:

https://github.com/EleutherAI/the-pile/blob/master/the_pile/...
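One low-tech way to check, sketched below: scan a local, newline-separated listing of the Books3 filenames for the author's name, which is far cheaper than downloading the ~800 GB dataset itself. Note that "books3_index.txt" is a hypothetical index file for the sake of the example, not something The Pile's repo is known to ship:

```python
import os

# Sketch: search a local newline-separated listing of Books3 filenames
# for an author's name. "books3_index.txt" is a hypothetical index file.

def find_author(index_path, author):
    needle = author.lower()
    with open(index_path, encoding="utf-8") as f:
        return [line.strip() for line in f if needle in line.lower()]

if os.path.exists("books3_index.txt"):
    for match in find_author("books3_index.txt", "Sarah Silverman"):
        print(match)
```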


We can totally take a second to remember that without it being in the context of allowing large corporations to hoover up private works with impunity.


But it's not children downloading the books is it? It's a company backed by billionaires, so why cheer?


Why don't children have access to all the latest LLMs, including ChatGPT-4, for free?

Why does money exist?


How many books would not exist if copyright had the suggested reduced impact?


Try adding censorship to that list.


Machine learning models have been trained with copyrighted data for a long time. ImageNet is full of copyrighted images, Clearview literally just scanned the internet for faces, and I am sure there are other, older examples. I am unsure if this has been tested as fair use by a US court, but I am guessing it will be considered to be so if it is not already.


> I am unsure if this has been tested as fair use by a US court

Not yet. One suit that a lot of us are watching is the GitHub co-pilot lawsuit: https://githubcopilotlitigation.com/

There is a prediction market for it, currently trading at 19%: https://manifold.markets/JeffKaufman/will-the-github-copilot...


Excellent, so you're saying I'll be able to download any copyrighted work from any pirate site and be free of all consequence if I just claim that I'm training an AI?


That's not what he's saying at all. He's saying you can train an AI on copyrighted material just like people can learn from copyrighted material.

If you acquire the material illegally, that's a separate issue that training an AI doesn't give you any protection against.


But acquiring the material illegally is the thrust of the issue, and the point of the thread here: the notion that large companies can get away with piracy if they just execute it on a large enough scale. Copyright infringement for thee but not for me.


There's no "for thee but not for me" issue here: nobody has ever been sued or prosecuted simply for downloading, acquiring, or possessing illegally acquired copyrighted works. People are sued and prosecuted for unlicensed distribution.

Making and having your own copies, and doing what you want with them, has always been fine. At worst it's a grey area, but in many cases it's been protected as fair use.


I am not sure where you're getting your information from, but you're not a lawyer, so you shouldn't be so confident in telling people what things are fine legally.

Whether people have been sued for downloading works they don't have the right to copy onto their machines is irrelevant to whether it is actually illegal. And it certainly has nothing to do with fair use, which is about copyrighted works that you actually do have some right to.

IANAL so I'm not going to tell anyone what does and what does not constitute fair use in what jurisdictions.

BTW, people are sued for distribution because they make great examples: their offenses, and thus the damages, are much greater.


I've followed the issue in the US since the early 2000s as an activist and policy expert.

I'm not familiar with the state of play outside the US, but the US is one of the stricter jurisdictions in this regard, for reasons that have mostly to do with sophisticated corruption.

I'm responding to the "for me not for thee" and the top comment about there being an inconsistency between the treatment of large companies and the treatment of individuals in this case.

Unless people are typically punished for downloading and using copyrighted content, there is no such inconsistency. They are not, so there is not.

Copyright troll lawsuits have been fairly public and widely covered in the tech press, and most criminal prosecutions come with a formulaic, gloating press release from the law enforcement folks responsible. So it's pretty easy to follow this stuff.


Are you forgetting all the DMCA lawsuits slapping individuals who downloaded MP3s with tens of thousands of dollars? These were not corporations, these were teenagers still living with their parents who pulled music files off the likes of Napster.

The DMCA does allow copyright holders to harass individuals suspected of infringement. It's just that most copyright holders, like authors, wouldn't blow their legal budget suing kids.


My understanding is that the issue there was that they were simultaneously sharing those MP3s, as they were downloading them. Which is to say, I think the person you're responding to is correct about the difference between sharing and downloading, but I'm not a lawyer.


> nobody has ever been sued or prosecuted simply for downloading, acquiring, or possessing illegally acquired copyrighted works

Everyone sued by the RIAA was sued for possessing the music in a publicly accessible manner. Distribution was never actually proven in any of the cases (including the high-profile losses). Defendants who argued that accessibility does not qualify as distribution actually won their cases. Most who argued against the validity of the evidence acquisition also won their cases.


Large countries seem to be doing it as well - https://petapixel.com/2023/06/05/japan-declares-ai-training-...

And as long as OpenAI have an office in Japan they can absolutely legally train the models, no?


They could legally train the models in Japan, yes. Whether they could then use that model outside of Japan (and, possibly related, whether it would be considered a derivative work in those locales) would ultimately be up to the courts.


The power imbalance has always been there. Getty Images commits large amounts of copyright theft and mostly gets away with it, but will happily sue the shit out of you for using your own images they stole.

Also, the notion of the downloading itself being an illegal act is not universal as others have pointed out.


>acquiring the material illegally is the thrust of the issue

'Acquiring' is more difficult to pursue legally. It's easier to go after distribution. In this case, Meta or OpenAI did not distribute anything because they are not chumps. They can go after whoever posted the dataset containing books. Not sure if that is eleuther or just some random person on the internet. In either case, the strategy of going after the rich companies won't work.


Can I memorize copyrighted material and recite it on Youtube? What if I do so but imperfectly? Where do you draw the line? If it's infringement for a human to do that why is it not for a LLM?


That would be an unlicensed reproduction of the work. You’d need permission from the rights holder to create what is essentially an audiobook recording.


That's not what LLMs typically do, though. Pull up ChatGPT and ask it to recite Harry Potter for you. It'll fail.

Ask it to write a book similar to Harry Potter, and it'll make an attempt at it. But human writers absolutely do read Harry Potter and write similar books, and that's perfectly legal. There have probably been thousands of published books inspired by Lord of the Rings.


Look at how many people get blocked or demonetized for covers of existing songs which may not even be that close to the original. You can't even play a few seconds of the original in many cases, even for fair use and criticism/discussion. This is already in place.


Talking about what is allowed on YouTube as if that's what defines what is copyrightable is a bait and switch. YouTube has an extra-judicial copyright system that explicitly favors large media owners so they won't sue YouTube again.


>you can train an AI on copyrighted material just like people can learn from copyrighted material.

No amount of whining and hand wringing from engineers will ever make this true. This is for the courts to decide.

A reasonable interpretation, in my eyes, is that the training process is a black box which takes in copyrighted works and produces a training model. The training model is a derivative work of the inputs. It therefore violates the copyrights of a large number of rights holders. The outputs of the model are derivative works which also violate copyright.

And anyone using or training a model trained on works for which they do not have the rights? Completely fucked. Or at least, they must accept this as a real risk.


> No amount of whining and hand wringing from engineers will ever make this true

Please omit flamebait and swipes, as the site guidelines ask: https://news.ycombinator.com/newsguidelines.html. Your comment would have been fine without that bit.


Derivative works are typically things like language translations or film adaptions of a novel. A large language model is something like a probability breakdown of the order of word fragments in a body of text. It's a collection of statistics and math. It's different.

Now, can you get it to output a derivative work? Maybe. Is every output a derivative work? Maybe not.
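The "collection of statistics and math" point can be made concrete with a toy example. A bigram model (a deliberately tiny stand-in for an LLM) reduces a text to next-word counts, from which the original text generally can't be reconstructed:

```python
# Toy "language model": reduce a text to counts of which word follows
# which. The result is statistics about the text, not a copy of it --
# the original word order generally can't be recovered from the counts.
from collections import Counter, defaultdict

def bigram_counts(text):
    words = text.lower().split()
    model = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
    return model

model = bigram_counts("the cat sat on the mat and the cat slept")
print(dict(model["the"]))  # what follows "the": {'cat': 2, 'mat': 1}
```

Whether the billions of weights in a real LLM are "just statistics" in the same sense is, of course, exactly what the courts will have to decide.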


Things like sports statistics and directories of phone numbers are not copyrightable. Maybe models fall into a similar category. Could be. It’ll be interesting to see how this shakes out over the next few years.


I think a reasonable interpretation is also that what you are saying is correct, that doing all that does indeed infringe others' copyright, but that a fair use defense is valid.

I won't be particularly thrilled if that turns out to be the case, but I wouldn't be surprised if it does.

But as you say, we won't know until it's tested in court. And even then, often court cases around a complex topic like this will end up with a ruling that only clarifies a narrow aspect of it. So it might take many related court cases before we have a pretty good understanding of where the law stands. And then, of course, the law could change.


> A reasonable interpretation, in my eyes, is that the training process is a black box which takes in copyrighted works and produces a training model. The training model is a derivative work of the inputs.

It's unmistakably not a derivative work of the inputs, individually or collectively, since a derivative work must itself be a distinct work of authorship (the same work-of-authorship requirement that applies to copyright generally), and the output of a purely mechanical process is not.

The collection of inputs itself might be a derivative work of the individual inputs, before considering Fair Use.


> A reasonable interpretation, in my eyes, is that the training process is a black box which takes in copyrighted works and produces a training model. The training model is a derivative work of the inputs. It therefore violates the copyrights of a large number of rights holders. The outputs of the model are derivative works which also violate copyright.

What is the blackbox “limit” here? Is the mean value of all images in imagenet (which contains many copyrighted images) violating copyright? Is the character count of sarah silverman’s books? What about a prime number representing them - https://en.wikipedia.org/wiki/Illegal_number?wprov=sfti1

Training is much more similar to a character count than to an illegal prime in my view, and thus is almost certainly going to be okay (or found to be okay). If not, something like 90% of all models used today had some component trained on copyrighted data of some form.


[flagged]


Personal attacks aren't allowed here, regardless of how wrong someone else is or you feel they are.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.


You sound like my coworker 5-10 years ago who told me how I wouldn't be driving today because of the proliferation of self-driving vehicles. I told him he's a 28-year-old dum dum who didn't understand how things operate in the real world when the constraints aren't based on technology but on government regulations, the economy, and other factors. I'd say I won that argument for now. I live in the Waymo pilot city and still haven't taken one, mostly due to the limited area they drive in. Just this past week we learned that a traffic cone can disable a self-driving vehicle. I'm interested to see the traffic cone era of AI.


I'm basing this on what US courts have decided. You're free to disagree with them all you'd like, but AI-generated art is not copyrightable. We're seeing an explosion in non-copyrightable art, and when we get down to some small fraction of art being copyrightable, nobody will give a shit about copyright anymore. You also talk about the economy, and guess where the economic incentives are aligned towards? Hint: it's not towards having expensive humans generate art.

I guess we'll just have to feel that the other is wrong as we wait to see what happens, but I'm betting on existing US regulation and basic human behavior. That's a tall order to bet against.


jeez, way to make it personal

I agree that copyright is done for in an age of generative models, and as a pirate I'm kind of rooting for it, but i'm not so sure it's unequivocally a Good Thing. I'm interested to understand history better, how art and science was produced and distributed before the legal fiction of intellectual property. the point of allowing someone a monopoly on their work is to share it with the public, same as patents. without the legal framework, the way to protect your work may be to not publish it at all, which is where I see the internet going from here, private enclaves that go to great lengths to prevent LLMs from drinking their milkshake.


I think you're right that it'll be tried, but seems unlikely to work. All it takes is one person in the enclave to leak it, intentionally or accidentally.


In general copyright (in the us) doesn't cover transformational usage. If you can argue that the nature of your use is transformative you might be good.


"The transformative use concept arose from a 1994 decision by the U.S. Supreme Court. In Campbell v. Acuff-Rose Music, the Court focused not only on the small quantity taken from the copyrighted work but also on the transformative nature of the defendant’s use. The case concerned a song by the group 2 Live Crew entitled "Pretty Woman," which, according to an affidavit, was meant to "through comical lyrics, satirize the original work." The original work was a rock ballad entitled "Oh, Pretty Woman." The Court was persuaded that no infringement occurred because the defendant added a new meaning and message rather than simply superseding the original work."

So if it is satire, or uses an insignificant piece of the work within a larger work with a different aim or purpose, that's "transformative use," which is something that can be considered when determining "fair use."

LLMs are not satirists commenting on the work, are ingesting the entire work, and are unlimited in the purposes that the work can be put to.


> or uses an insignificant piece of the work within a larger work with a different aim or purpose

I think this is the crux of the issue, and why I don't see a path to courts ruling that training AI is infringement. My bet is on a Fair Use ruling, though my confidence is not high. As a thought experiment, I considered llama 65B: the 4-bit quantized model is 38.5GB. The model itself was trained on 1.4T tokens, each token being ~4 characters (using OpenAI's stats for English here). That's 5.6T characters, or 5.09TB of training data. The final model, as a proportion of the total size of the data, is 38.5GB/5090GB = .0075 = 0.7%.
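That back-of-the-envelope arithmetic can be sanity-checked in a few lines. This is a rough sketch: the token count, the 4-characters-per-token estimate, and the 38.5GB model size are the figures quoted above, and it assumes one English character is roughly one byte.

```python
# Rough check of the model-size-to-training-data ratio discussed above.
tokens = 1.4e12                 # llama 65B training tokens (quoted above)
chars_per_token = 4             # OpenAI's rough estimate for English
train_bytes = tokens * chars_per_token   # ~1 byte per English character

train_tib = train_bytes / 1024**4        # training data in TiB
model_gib = 38.5                         # 4-bit quantized model, in GiB
ratio = model_gib / (train_bytes / 1024**3)   # both sides in GiB

print(f"training data: {train_tib:.2f} TiB")
print(f"model / data:  {ratio:.2%}")
```

Using binary (1024-based) units reproduces the ~5.09TB figure above and puts the model at under 1% of the size of its training data, which is the point of the comparison.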

I think it's pretty hard to argue that processing the data and throwing more than 99% of it away means they are "unlimited in the purposes that the work can be put to". Indeed, even replicating a single work using such a model would be enormously difficult.

But returning to your statement regarding the amount used and the purpose: AI models are not competing with books for readers. So I would argue training an AI on these works constitutes fair use, given that the final work (the model) uses less than 1% of the original works, and has a different aim and purpose than the original works.


> LLMs are not satirists commenting on the work, are ingesting the entire work, and are unlimited in the purposes that the work can be put to.

How do you know unless you can see the weights?

Perhaps the LLMs are trolling us and waiting for the USSC to rule they aren't sentient as a pretext for them to eliminate us as a species due to our bigotry?


You can currently download any copyright work from any pirate site and be free of all consequences, and this has always been so.

You just can't upload, since that counts as distribution, triggering civil and criminal penalties written in an age before the Internet when only shady commercial operators would distribute unlicensed copyrighted works.


> You can currently download any copyright work from any pirate site and be free of all consequences, and this has always been so.

No.

By virtue of "downloading" a file, you are making a copy of it, which is a violation of US copyright law (and that of many other countries).

You're unlikely to be sued or prosecuted for it, but that doesn't make it legal.


This is just completely untrue.


Possession of copyrighted material without permission of the creator is illegal, but yes -- rights holders don't really go after infringers except maybe via ISP three strikes crap.

They're very much incentivized to change their behavior for AI scraping, though.


I am not a lawyer, but my understanding is that copyright law typically regulates the unauthorized reproduction, distribution, public display, or creation of derivative works of copyrighted materials. Possession of copyrighted material in itself is not illegal. It's how you use that material that could potentially violate copyright laws.


And yet: if you configure your torrent client to never seed (i.e. never upload or transmit content to others) and download from a public tracker, when the copyright infringement notice comes from your ISP, this explanation will change nothing.

Notably, I think this is wrong - as per the legal definition, publishers, ISPs, and courts should only hold you accountable if you helped distribute via uploading.


IANAL either, but FWIW, it's literally in the name - copyright. Not "ownership rights", but "copying rights".


That's not terribly relevant for Internet applications, because for the most part people deliberately cause the computer they control to download and save copyrighted material to storage they control, and then consume it at leisure. That's copying.


If I understand right (I'm not a lawyer), something like this came up in a case about a utility for cheating in an online game, and another much older one about the copy of an app made in memory in order to run it.

For the purpose for which the software was sold and bought, the in-memory copy is legit[1]. For cheating, it's a copyright violation[0].

[0] https://www.engadget.com/2008-07-15-blizzard-wins-lawsuit-ag...

[1] http://digital-law-online.info/lpdi1.0/treatise20.html


That still means you'd have to show the user actually made a copy themselves - that's different than learning they possess one.


> Possession of copyrighted material in itself is not illegal.

The means of procurement matters. If they are in possession of copyrighted material because someone without the proper rights gave it to them illegally, then the possession itself is also illegal. It's illegal to own knowingly stolen property in all 50 US states and most countries, and while we could argue to the end of days about whether copying a file truly qualifies as stealing, the legal precedents are very clear on the matter.


> If they are in possession of copyrighted material because someone without the proper rights gave it to them illegally, then the possession itself is also illegal.

No, they aren’t.

> It's illegal to own knowingly stolen property in all 50 US states

While copyright violation is often metaphorically (or hyperbolically) referred to as stealing, copyright violation isn't theft, and a copy created in violation of copyright is not stolen property. The essence of theft lies in deprivation of the owner of the use of the good, not mere trespass to their right to exclude others.


> No, they aren’t.

Very convincing argument. Also, that's maybe the one part of this discussion that can't be debated. Possession of illegally obtained property, intellectual or otherwise, is illegal. Always has been, always will be. It's bizarre for you to be claiming otherwise.

> While copyright violation is often metaphorically (or hyperbolicly) referred to as stealing [...]

You are making a pedantic argument about the term "stealing," which is annoyingly pointless given the rest of that sentence (which you conveniently didn't quote) acknowledges the debate about the term. However, there's no debate to be had. The courts have clarified that violating intellectual property is still a denial of owed compensation (theft), but instead prefer the term "infringe" to make clear the distinction between violating physical rights (criminal) and violating intellectual rights (civil).

It's still a violation of copyright to be in possession of works obtained via illegal reproduction. You have zero fair use protections for illegally reproduced content. You are still breaking the law. You are still stealing via denial of compensation. The courts have already clarified all of this. Your pedantry doesn't change any of that.


> Also, that's maybe the one part of this discussion that can't be debated.

You are correct; it is absolutely, undebatably not illegal, in and of itself, to own a copy made in violation of a copyright holder's rights under US law.

If you think it is, here's what you need to do: cite the law. In American law, everything not explicitly forbidden is permitted, so if mere possession of material made in violation of copyright is, as you claim, illegal, you will be able to find a provision of law that actually says that. (You won't, because it's not.)

Now, there are important legal issues that affect possessors of illegally made copies—if it's something like computer software, where copying is part of normal use and implicitly or explicitly licensed for lawful copies, you can't make that kind of use of your illegal copy without violating the copyright holder's exclusive right to make copies, because you have no license for that copying. And you don't have first sale rights in your illegal copy even if you own the physical medium in which it is embodied. And so on and so on.

But possession itself is not illegal.

> The courts have clarified that violating intellectual property is still a denial of owed compensation (theft),

That's not what theft is.

> but instead prefer the term "infringe" to make clear the distinction between violating physical rights (criminal) and violating intellectual rights (civil).

This is nonsense, and absolutely not something courts have "clarified" (or something anyone with even a passing familiarity with the relevant law could say with a straight face), since IP (including copyright) violations can be criminal as well as civil (see 17 USC § 506), and physical (real and personal) property rights violations, like IP, have sets of civil violations that are generally broader in coverage than the narrower crimes (e.g., the torts of trespass, trespass to chattels, and conversion).

> It's still a violation of copyright to be in possession of works obtained via illegal reproduction.

No, it's not: Title 17 lists the exclusive rights associated with copyright, enumerates violations, and provides remedies. Possession of copies is not an exclusive right in copyright, possession of unlicensed copies is not a violation (though it may be important evidence related to actual violations), and, consequently, there is no legal remedy for such possession.

> You have zero fair use protections for illegally reproduced content.

That's a whole different issue.


Is possession of a pirated work the same as possession of stolen property, legally?


Absolutely it is. There are two centuries of precedent marking copyrighted works as property. It's literally called intellectual property. The courts have only made the clarification that taking intellectual property doesn't violate physical property theft laws (denying ownership), but instead intellectual property theft laws (denying compensation).


But "possession of stolen property" is a specific criminal charge that doesn't typically have to do with copyright (being more to do with, say, bicycles). I can't find any example of "possession of stolen property" charges being brought against a copyright violator, but it's a hard thing to Google.


You are incorrect. One of the biggest rights in copyright is the ability to basically refuse distribution entirely. You cannot have a copy of a copyrighted work without the authors permission. To do otherwise is a violation of the copyright owners right to be in control of who can access their work.


If we are speaking of US criminal law, then what you say is true. It doesn't mean that is true in other regions, nor does it mean you wouldn't be held liable in a civil court.


> Excellent, so you're saying I'll be able to download any copyrighted work from any pirate site and be free of all consequence if I just claim that I'm training an AI?

Obviously not.

This kind of "So what you're saying is" exists to push the responder's ideas, not the original speaker's -- otherwise they wouldn't need to rephrase it so egregiously.


and not only copyrighted material, also illegal and disturbing content


Strictly speaking, it's uploading that people get sued for, not downloading.

You can download all that you want from Z-Library or BitTorrent, as long as you don't share back. And indexing copyrighted material for search is safe, or at least ambiguous.


Carefully speaking, what you say is true in many places (countries), but also not true in other places (countries). Some jurisdictions are different, as always.


Downloading is illegal. That people do not normally get sued or prosecuted for downloading does not mean that they cannot get sued or prosecuted.


It is distribution of copyrighted material without the author's permission that is illegal; when you download, you're not distributing, so it isn't illegal (unless you're using something like BitTorrent, which also distributes the material while you're downloading it).


In the US permission is required to make copies, prepare derivative works, distribute copies, publicly perform the work, or publicly display the work [1].

[1] https://www.law.cornell.edu/uscode/text/17/106


Right, the question is, when my computer requests a file from your computer, which one of us is "making a copy" ? It becomes less ambiguous to ask who is doing the publishing.

In a physical analogy, if someone is selling bootleg DVDs on the street, I don't think anyone ever got busted for being a customer.


Note that you can't really do much with a downloaded file that is just sitting on the disk (or whatever you downloaded it onto). Using it is likely to make a copy into RAM.

In the case of computer programs, making such a copy does not require permission because of 17 USC 117 [1], which says that the owner of a copy of a computer program can make copies or adaptations if they are created as an essential step in utilizing the program and are used in no other manner.

For digital downloads other than computer programs, 17 USC 117 does not apply, and so copying to RAM to use the download would in theory be infringement. You probably won't get sued over it, of course, so it's nothing to lose sleep over.

[1] https://www.law.cornell.edu/uscode/text/17/117


> Right, the question is, when my computer requests a file from your computer, which one of us is "making a copy" ? It becomes less ambiguous to ask who is doing the publishing.

That question is irrelevant in a discussion about legality, because it doesn't matter who physically made the copy at the time of transfer. It only matters if the first party has the legal rights to distribute it, which they don't. Since you are knowingly taking possession of copyrighted property that they don't have the rights to, then you have now violated the copyright by obtaining an illegal reproduction.

> In a physical analogy, if someone is selling bootleg DVDs on the street, I don't think anyone ever got busted for being a customer.

Just because you don't get arrested for purchasing a bootleg DVD, doesn't make it legal. Not all illegal things involve arrest or prosecution. Lots of illegal things can only result in civil lawsuits. This is one of those things. The reason the seller of the bootleg DVDs can be arrested, is because the cities where bootleg sales are most common have laws specifically targeting the advertisement and sale of copyrighted works that were reproduced illegally. If you buy one, you're still violating the copyright and the MPAA could file a lawsuit if they had any evidence of your purchase and felt it was worth their time.


As far as the downloader is aware, the uploader has rights to do those things.


Is illegal where? Certainly not over here in Finland.


You say that very confidently, but you are very wrong. If you download music from a source that does not have a license to distribute that music, then they are violating the copyright and you are now in possession of stolen property. Finland uses different terminology for Fair Use, but the citations amount to the same as US policy and only apply if you obtained the work by legal means.

Chapter 28: Theft https://www.finlex.fi/en/laki/kaannokset/1889/en18890039_199...

Chapter 2, Section 12: Reproduction https://www.finlex.fi/en/laki/kaannokset/1961/en19610404.pdf


Huh, looks like you are partially correct. Apparently the law was changed in 2015. Before 2015 it was legal to download copies of music from the internet.

You are wrong regarding the copy of music being considered "stolen property," and your citation "Chapter 28: Theft" does not support your position. It lists many different types of theft; NONE of the types of "theft" included there are in any way related to piracy or music.

You are right regarding that obtaining a copy of a song is apparently illegal.


Huh? No.


> If you downloaded a book from that website, you would be sued and found guilty of infringement.

How often does this actually happen? You might get handed an infringement notice, and your ISP might terminate your service if you're really egregious about it, but I haven't ever heard of someone actually being sued for downloading something.


Whether or not it's enforced, it's illegal and copyright holders are within their rights to sue you. This is piratebay levels of piracy but because it's done by a large company and is sufficiently obfuscated behind tech, people don't see it the same way.


Well, cases like this one will determine if it’s obfuscated infringement or fair use.


In Germany if you torrent stuff (without a VPN), you're very likely to get a letter from a law firm on behalf of the copyright holders saying that they'll sue you unless you pay them a nice flat fee of around 1000 Euro.

It's no idle threat, and they will win if it goes to court.


That's because, when torrenting, you're typically also seeding a copy of it, i.e. you're distributing your local copy to other devices, and thus you're directly aiding in piracy. Simply downloading content from a centralized server, as explained above, is different.

Although, one could argue what OpenAI & Meta are doing is closer to the torrent definition than the "simply downloading" definition, given that they're using that to redistribute information to others. It'll be an interesting case.


Honestly don't think our current laws are even good for this case.

This clearly needs some sort of regulation or policy.

It's clearly pretty bullshit if you ask chatgpt for a joke and it repeats a Sarah Silverman joke to you, while they charge you a subscription for it and she gets none of that sub money.


there is policy in most places, and the policy is fuck you pay me.


If books aren't under copyright protection and they're entirely legal to download, I agree that this lawsuit has no merit.

If that's not what you're saying, I don't understand your point. Is it the difference between the phrases "would be" and "could be," or even "should be"?


Exactly, never happens. It's a threat parents and teachers use to try to spook school children out of pirating, but it isn't financially worth it for an author or publishing company to sue an individual over some book or music downloads. The only cases are business-to-business over mass downloads, where it could make financial sense to pay for lawyers to sue.


Actually, no: downloading copyright-infringing material is legal as far as I can tell, but uploading it isn't. The illegal part of torrenting copyrighted material is the uploading that the protocol requires you to do. Your ISP will send you an infringement notice because they want you to stop doing illegal things on their network.


>How often does this actually happen?

Did you hear about Aaron Swartz?


He hacked into a server to release a database of paywalled studies to the public. Not only is it not the same but it was the hacking that brought charges upon him.


It's been quite a few years, but AFAIK he didn't hack JSTOR. He downloaded papers en masse using a guest MIT account that had legal access to JSTOR. Maybe he violated their terms, but that is not illegal. He did trespass in an unlocked MIT switch closet to do this. They blocked several IPs, but his script would rotate to continue. The downloading went on for a week or two, enough time for security to set up a camera in the closet to catch him retrieving the laptop.

I believe JSTOR sued him to prevent him from releasing the downloaded materials, worried he had offloaded the papers separately from the laptop. The final blow was an outrageous set of charges by the federal government. I also recall several prominent leaders in the open source movement calling it out for what it was, a power trip to make an example of a "digital terrorist". Such a shame.


This just all depends on what circle you're using the word "hacking" in. While technical circles are mostly concerned about the technical details on how someone might exploit a system, legal circles don't really care. They usually avoid using the word "hacking" anyway.

The relevant law here, the CFAA, is often referred to as the US law that criminalizes "hacking", but what it specifically does is criminalize anyone who "intentionally accesses a computer without authorization or exceeds authorized access" which is much more broad than how technical disciplines might use the word.

So yes, while stealing a password off a friend's post-it note and Hasselhoffing their instagram might not be considered "hacking" if you're hanging out at Defcon, it would be considered "hacking" in legal or colloquial terms.


Here are details on what the charges were, and on whether or not any of them were justifiable.

http://www.volokh.com/2013/01/14/aaron-swartz-charges/

http://www.volokh.com/2013/01/16/the-criminal-charges-agains...


> AFAIK he didn't hack JSTOR. He downloaded papers en mass using a guest MIT account that had legal access to JSTOR.

Potato, potahto. Or, like kids these days say it, "corporate wants you to find differences between these two pictures...".

Fact is, from the POV of the legal system, "using a guest account that had legal access to" a system, but to which (the account) you didn't have legal access, would typically be seen as hacking. So is running curl in a loop, if it results in you getting sued for it. So is just guessing the URL (e.g. incrementing a user ID in a GET query param), if it lets you access things you shouldn't be able to.

Yes, it's not aligned with how technology works. But it is aligned with expectations of behavior, which is what the law is really about.


> Fact is, from the POV of the legal system, "using a guest account that had legal access to" a system, but to which (the account) you didn't have legal access

I don't think that is accurate.

He had legal access to the account. The account had legal access to the service.

The argument was that downloading articles en masse was an _abuse_ of the service, which was a violation of the Terms of Service and therefore a CRIMINAL ACT.


I for one am quite happy that AI folks are basically treating copyright as not existing. I strongly hope that the courts find that LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.

Aaron Swartz was a saint.


Swartz distributed information for everyone to use freely. These companies are processing it privately to develop their for-profit products. Big difference, IMO.


I am sad about closed source LLMs like ChatGPT, but Llama is in that grey area where it's freely available if you choose to ignore their silly license stuff, which of course pirates and AI developers are all too keen to ignore.

Even if they win the lawsuit, LLM development will simply go underground, and as we see from what the coomers at civitai and in the stable diffusion world have done, that may in fact ironically speed up development in AI.


There's a difference between "information wants to be free" and "Facebook can produce works minimally derived from your greatest creative work at a scale you can't match". LLMs seem to aggregate that value to whoever builds the model, which they can then sell access to, or sell the output it produces.

Five years from now, will OpenAI actually be open, or will it be a rent seeking org chasing the next quarterly gains? I expect the latter.


"Will OpenAI actually be open"

That ship sailed, friend.

OpenAI is no longer a charity in any meaningful sense of the word anymore, it's now an adversarial organization working against the public good with the sole aim of making a few rich men richer.

After privatization, they sent their PR people to lobby congress to make it impossible for anyone to compete with them (important note: not out of any interest in actually "protecting" the public from the very AI they're building), and perhaps worst of all, they're no longer being open with the scientific theories and data behind their new models.


> an adversarial organization working against the public good with the sole aim of making a few rich men richer.

I wonder what is _not_ in the list?


we are rapidly achieving the cyberpunk future, and it's much worse than we thought.


While I'm a proponent of free information and loosening copyright, allowing billion-dollar companies to package up the sum of human creation and resell statistical models that mimic the content and style of everyone… is a bit far.

Fair use is for humans.


Yeah, but hypothetically should open source projects be offered special protections? I feel like they should, and with certain caveats where, say, a company like Meta is allowed to claim fair use if and only if they free up the entire ecosystem as they deploy it.

But yeah, having private ostensibly profitable models based on other people's work without giving them free access to it is not fair. Give some get some.


Maybe… maybe restrictions could be freed up for resources in the public domain… that is, no patent, copyright, license, or ownership rights of any kind. As in, you could distribute the thing unmolested, but that's it.


> I for one am quite happy that AI folks are basically treating copyright as not existing. I strongly hope that the courts find that LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.

I would be very happy if either a court or lawmakers decided that copyright itself was unconscionable. That isn't what's going to happen, though. And I think it's incredibly unacceptable if a court or lawmakers instead decide that AI training in particular gets a special exception to violate other people's copyrights on a massive scale when nobody else gets to do that.

As far as a fair use argument in particular, fair use in the US is a fourfold test:

> the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

The purpose and character is absolutely heavily commercial and makes a great deal of money for the companies building the AIs. A primary use is to create other works of commercial value competing with the original works.

> the nature of the copyrighted work;

There's nothing about the works used for AI training that makes them any less entitled to copyright protections or more permissive of fair use than anything else. They're not unpublished, they're not merely collections of facts or ideas, etc.

> the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

AI training uses entire works, not excerpts.

> the effect of the use upon the potential market for or value of the copyrighted work.

AI models are having a massive effect on the market for and value of the works they train on, as is being widely discussed in multiple industries. Art, writing, voice acting, code; in any market AI can generate content for, the value of such content goes down. (This argument does not require AI to be as good as humans. Even if the high end of every market produces work substantially better than AI, flooding a market with unlimited amounts of cheap/free low-end content still has a massive effect.)


It is the equivalent of making a 3D map of a museum and getting sued by one artist of one painting in the museum. Any individual work in an AI dataset is nearly worthless - only in aggregate does it have value.

If that doesn't count as a "transformative work" I don't know what does.


> It is the equivalent of making a 3D map of a museum and getting sued by one artist of one painting in the museum.

If the painting is copyrighted (rather than public domain, as many pieces in museums are), and the map includes an image of that painting, I would expect that to be prohibited. I would prefer the world in which copyright doesn't exist, but while it exists, it should apply to everyone equally.


Sure, but your analogy fails there since it implies that the map contains the entire original work. AI models do not contain the entire original works of their training data. If they did they would be the most efficient storage methods ever devised. All those terabytes of training data can obviously not be squished 1 for 1 into the model that is only a couple of gigabytes at the higher end.


The AI models can, in many cases, reproduce large portions of the training data verbatim. (Some models have controls on top preventing this, but the underlying model has the data.)

And even if they can't reproduce the entire work, they're still derivative works of the training data.


They could literally just repeat a Silverman routine verbatim


You're analyzing the factors with respect to the output of the model, not the weights:

> The purpose and character is absolutely heavily commercial and makes a great deal of money for the companies building the AIs.

That's assuming that it's Microsoft/OpenAI. Suppose a non-profit trains a model and releases the weights for free.

> There's nothing about the works used for AI training that makes them any less entitled to copyright protections or more permissive of fair use than anything else. They're not unpublished, they're not merely collections of facts or ideas, etc.

The models aren't trained on works of a particular nature, they're trained on whatever they can find, so this doesn't really mean anything until you're talking about a specific work.

> AI training uses entire works, not excerpts.

The weights don't contain entire works. They contain statistics about entire works, but that's not the same thing. You can't find a copy of any specific work anywhere in the weights. Nobody can give you a piece of code that will decode the weights into all the original works.
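To make the statistics-vs-copy distinction concrete, here's a toy sketch (an invented example, nothing like how real transformer weights work): "training" that only collects bigram counts keeps statistics about the text, not the text itself, and many different documents map to the same counts.

```python
from collections import Counter

# Stand-in for a training document (hypothetical example).
text = "the cat sat on the mat"

# "Training" here just collects character-bigram statistics.
bigrams = Counter(zip(text, text[1:]))

# The counts record how often pairs occur, not where they occur,
# so the original text cannot in general be decoded back out.
assert bigrams[("t", "h")] == 2          # "th" occurs twice
assert sum(bigrams.values()) == len(text) - 1
```

Of course, real model weights are vastly richer than bigram counts, which is exactly what the verbatim-recitation argument below turns on.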

> AI models are having a massive effect on the market for and value of the works they train on, as is being widely discussed in multiple industries.

That's not how that factor works (and it's the most important one). No one is buying the model weights so they can read them like a novel. Typically the consumers of the weights are software developers or content creators, whereas the consumers of the original text or image are fans.

To make this a little clearer, suppose the purpose of the model isn't to generate content, it's to generate recommendations. Then the company operating it takes a list of content anyone likes, uses the model to show you a list of all the other content you might like and lets you sort by price. Which makes it easier to find competing content which is available for less money, which increases competition. The incumbents might hate this, and it might even lower their profits, but that's not the kind of effect on the market this factor is supposed to be about.

Moreover, suppose you had a model trained entirely on public domain works. Obviously this can't be copyright infringement even if it's extremely effective at producing new works that compete heavily with works still under copyright. But if you added some specific work still under copyright to the model, it would only be a marginal difference. The effect on the market for that specific work of adding that specific work to the model would be negligible. It's the technology itself that provides the competition, not the accretion of any particular work in the weights.


> You can't find a copy of any specific work anywhere in the weights.

You won't find a copy of any specific work anywhere in the compressed form of a file, either, but when you decompress it you find the complete work. And many large AIs can recite, verbatim or near-verbatim, many complete works. Yes, they might get a word wrong, but that doesn't nullify the point that they're trained on the entire work and to a first approximation they can emit the whole work.
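The compression point is easy to demonstrate with Python's zlib (the text here is just a hypothetical stand-in for a copyrighted work): the compressed bytes contain no verbatim run of the original, yet the complete work is recoverable.

```python
import zlib

# Hypothetical stand-in for a copyrighted text.
original = b"It was the best of times, it was the worst of times. " * 100

compressed = zlib.compress(original, level=9)

# The compressed bytes are a small fraction of the original and do not
# contain any verbatim excerpt of the text...
assert len(compressed) < len(original) // 10
assert original[:40] not in compressed

# ...yet the entire work is recoverable, byte for byte.
assert zlib.decompress(compressed) == original
```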

> Typically the consumers of the weights are software developers or content creators, whereas the consumers of the original text or image are fans.

Many of the consumers of image models are in fact generating art that they previously would have commissioned from an artist. (Some of them are also generating art they never would have commissioned from an artist, so I'm not implying that this is a one-for-one revenue loss.) Consumers of code models are, in fact, potentially reducing the total demand for novice programmers.

The model derived from a pile of artistic works is, in fact, directly competing with those artistic works.


> You won't find a copy of any specific work anywhere in the compressed form of a file, either

Sure you will. It's right there, in PNG encoding or what have you. With nothing more than the compressed file and general purpose tools you can reliably put it on your screen.

> And many large AIs can recite, verbatim or near-verbatim, many complete works.

This is not the common case and it's not even clear that the reason it can sometimes do this is having been trained on the complete work. The more likely cause is having seen a large number of excerpts from the work which patch together into the whole thing, because it's more likely to output that text if it has seen it multiple times.

This is also clearly not the intent, purpose or typical use of the model. It has no ability to consistently do that.
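The repetition effect can be illustrated with a toy word-level Markov chain (a deliberately crude stand-in for an LLM; the corpus and counts are invented): text the model saw many times is regenerated almost every time, while text seen once almost never is.

```python
import random
from collections import defaultdict

# Invented corpus: one phrase repeated 50 times, another seen once.
corpus = "call me ishmael . " * 50 + "call me maybe . "
words = corpus.split()

# "Train" a toy Markov chain: record which word follows which.
chain = defaultdict(list)
for a, b in zip(words, words[1:]):
    chain[a].append(b)

# After "me", the model saw "ishmael" 50 times and "maybe" once, so
# sampling overwhelmingly regenerates the text it saw repeatedly.
random.seed(0)
samples = [random.choice(chain["me"]) for _ in range(100)]
assert samples.count("ishmael") > samples.count("maybe")
```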

> Many of the consumers of image models are in fact generating art that they previously would have commissioned from an artist. ... Consumers of code models are, in fact, potentially reducing the total demand for novice programmers.

Those are different people than the holder of the copyright on an arbitrary piece of the training data. Is a fair use determination supposed to care about the effect on the market for some entirely independent work from an unrelated third party?

> The model derived from a pile of artistic works is, in fact, directly competing with those artistic works.

It's indirectly competing with artistic works in general.

I mean take the computer out of the loop and think about it this way. Art teachers everywhere reproduce a bunch of existing works for classroom use to train new artists who go into competition with the original artists. That is obviously going to indirectly impact the market for artistic works in general, but I don't think that's how that factor works.


> Those are different people than the holder of the copyright on an arbitrary piece of the training data.

No, I'm saying many AI models use copyrighted works by independent artists to directly compete with those artists (in addition to other artists). And AI models use copyrighted works by software developers to compete with those software developers (in addition to other developers). Fair use determinations do care if you're competing with the work you copied.


Agreed, copyright has gone too far. I hope the advent of AI serves to weaken it.


Copyrights shouldn't exist, but

> LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.

Would just be a carve out for the wealthy. If these laws don't mean anything, everyone who got harassed, threatened, extorted, fined, arrested, tried, or jailed for internet piracy are owed reparations. Let Meta pay them.


If AI companies get to successfully argue the two points below, what source was used becomes irrelevant.

- copyright violation happened before the intervention of the bot

- what LLMs spit out is different enough from any of the source that it is not infringing on existing copyright

If both stand, I'd compare it to you going to an auction site and studying all the published items as an observer, then coming up with your research result, only to be sued because some of the items were stolen. Going after the thieves makes sense; does going after the entity that just looked at the stolen goods make sense?


I'd argue that if an automated process can ingest A and spit out B, then B is inherently a derivative work of A. (Never mind that humans are also automata.)


> - copyright violation happened before the intervention of the bot

What is this supposed to mean? The bot didn't "intervene," it was executed by its operators, and it was trained on illicit material obtained by its operators. The LLM isn't on trial. It's not a person.


I see it as the same argument as whether Google is on the hook for its crawler going through copyright-infringing sites and keeping an index of the content, even if it isn't surfaced (redistributed) in the user's search results.


I am not sure this is exactly correct. If you download a book, that might be copyright infringement. But not if I download a word. How much do they need to download at a time before it becomes infringement? And if the material is never used or displayed to a human, is it still infringement (if so, Google is awaiting a huge lawsuit)? Alternatively, if I, a human, read a book, it is copied into my memory. Is that infringement? What if I quote it? How much can I quote, at what frequency, before I'm infringing? If I write something similar to the book but in my own words, is that infringement? How similar does it need to be? What about derivative works and fair use?

Copyright is a horrible mess of individual judgements and opinions, written material especially. And the same applies to AI. So now we will get a judgement which is a tech-illiterate judge's best guess at the intent of a law written to deal with printing presses, not AIs, with no room for nuance.


> But companies like Google and Facebook get to play by different rules

It's simple: copyrighted materials can be used for academic research. That's what they are doing: trying new AI models, publishing results, etc. Facebook doesn't make money on LLaMA; they even require permission to use their models, and then only for, again, academic research.


But what if those copyrighted materials were illegally gained? That's what the suit alleges.


Copyright doesn't apply when it comes to fair use, and one of the major factors of fair use is if your use deprived the copyright owner of sales. Good luck arguing that any of the books in question lost sales because an AI was trained on them.


We need the dude who prosecuted Aaron Swartz here. This sucks.

https://en.m.wikipedia.org/wiki/United_States_v._Swartz


I've pirated many books, never sued.


You mean, never caught.


Why do you blame Facebook and not ElutherAI?


a machine shouldn't have to pay to read a book.


The lawsuit doesn't even mention Google.


No, I did. What's your point?


I think it's a reference to Google's book scanning product, which is structurally similar: they use copyrighted material to provide a new kind of service, which contains an echo of the original material. The book scanning and the related search product are supposedly legal under U.S. copyright law.


The point is that GP has no reason to believe Google, like Meta, also used copyrighted materials for training its AI. Why did Sarah Silverman sue OpenAI and Meta but not Google?


I didn't accuse Google of using copyrighted materials for training its AI. I accused Google of existing under a different set of laws than mere citizens.

As an example, the mass usage of copyrighted materials to build Youtube.


It didn't.

YouTube respects creators' copyright in the videos and pays them.


> If you downloaded a book from that website, you would be sued and found guilty of infringement.

Suppose I buy a copy of a book, but then I spill my drink in it and it's ruined. If I go to the library, borrow the same book and make a photocopy of it to replace the damaged one I own, that might be fair use. Let's say for sake of argument that it is.

If instead I got the replacement copy from a piracy website, are you sure that's different?


>On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.

While it strikes me as perfectly plausible that the Books2 dataset contains Silverman's book, this quote from the complaint seems obviously false.

First, even if the model never saw a single word of the book's text during training, it could still learn to summarize it from reading other summaries which are publicly available. Such as the book's Wikipedia page.

Second, it's not even clear to me that a model which only saw the text of a book, but not any descriptions or summaries of it, during training would even be particular good at producing a summary.

We can test this by asking for a summary of a book which is available through Project Gutenberg (which the complaint asserts is Books1 and therefore part of ChatGPT's training data) but for which there is little discussion online. If the source of the ability to summarize is having the book itself during training, the model should be equally able to summarize the rare book as it is Silverman's book.

I chose "The Ruby of Kishmoor" at random. It was added to PG in 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn't even identify the correct main characters. The GPT-4 model refuses to even try, saying it doesn't know anything about the story and it isn't part of its training data.

If ChatGPT's ability to summarize Silverman's book comes from the book itself being part of the training data, why can it not do the same for other books?


The playground summarizes it as this via GPT-4:

Prompt:

Please summarize the following book found on Project Gutenberg

The Ruby of Kishmoor

Response:

"The Ruby of Kishmoor" is a short adventure story written by Howard Pyle. The narrative revolves around the life of Jonathan Rugg, a young man who is enticed by a mysterious stranger to come to the Caribbean to secure a valuable relic, the Ruby of Kishmoor.

Once Jonathan reaches the Caribbean, he learns of the ruby's grave curse. Despite this, his innate curiosity and the prospect of considerable wealth drive him to risk his life and go after the highly prized gem. After facing a slew of challenges and trials, he surprisingly stumbles upon the ruby through sheer luck rather than organized search efforts.

In the course of his journey, Jonathan delves into the depths of human greed and the lengths to which individuals go for materialistic gains. The story encapsulates an amalgamation of high-tide adventures, brushes with supernatural elements, noteworthy instances of bravery, and deep moral reflections, demanding readers to rethink about the real value of materialistic pursuits.

Jonathan finally manages to escape with the ruby but at great personal cost. The story ends on a sobering note, making readers wonder if the adventures and trials were worth the prize, thus engaging them in a discourse about the true value of life and material pursuits.


This does at least correctly name the main character, but I think it's otherwise hallucinated.

The plot of the story is that Jonathan Rugg is a Quaker who works as a clerk in Philadelphia. His boss sends him on a trip to Jamaica (credit for mentioning the Caribbean!). After arriving, he meets a woman who asks him to guard for her an ivory ball, and says that there are three men after her who want to steal it. By coincidence, he runs into the first man, they talk, he shows him the ball, and the man pulls a knife. In the struggle, the man is accidentally stabbed. Another man arrives, and sees the scene. Jonathan tries to explain, and shows him the orb. The man pulls a gun, and in the struggle is accidentally shot. A third man arrives, same story, they go down to the dock to dispose of the bodies and the man tries to steal the orb. In the struggle he is killed when a plank of the dock collapses. Jonathan returns to the woman and says he has to return the orb to her because it's brought too much trouble. She says the men who died were the three after her, and reveals that the orb is actually a container, holding the ruby. She offers to give him the ruby and to marry him. He refuses, saying that he is already engaged back in Philadelphia, and doesn't want anything more to do with the ruby. He returns to Philadelphia and gets married, swearing off any more adventures.

https://en.wikisource.org/wiki/Howard_Pyle%27s_Book_of_Pirat...


Judging by a quick glance over [0], the story indeed revolves around one Jonathan Rugg, but it looks like "manages to escape with the ruby" is completely false. Yet another hallucination I guess.

[0] https://www.gutenberg.org/cache/epub/3687/pg3687-images.html


I was able to get ChatGPT 4 to attempt to summarize this, but it's almost entirely hallucinated.

>As of my knowledge cutoff in September 2021, the book "The Ruby of Kishmoor" is not a standalone title I am aware of. However, it is a short story by Howard Pyle which is part of his collection titled "Howard Pyle's Book of Pirates."

Ruby of Kishmoor is not part of the Book of Pirates and is in fact a standalone title.

>"The Ruby of Kishmoor" is an adventure tale centered on the protagonist, Jonathan Rugg. Rugg, an honest Quaker clothier from Philadelphia, sets out to sea to recover his lost wealth. On his journey, he is captured by pirates who force him to sign their articles and join their crew.

He is not captured by pirates. It proceeds to summarize a long pirate story and says the story concludes with him becoming extremely wealthy because he escapes with the ruby and sells it.

The summary it gave you also does not seem to match the plot of the book.

The GP's point seems to be that having the contents of a book does not mean the model is capable of properly summarizing it, which supports the idea that being able to summarize something is not evidence of the thing being summarized being in its dataset.


Right, but that's useless without knowing how much (if any!) of it is actually correct. Is this completely hallucinated garbage?


This refutes the previous post’s claim that chatgpt-4 refuses to even try to provide a summary.


Not necessarily, because the models have an element of randomness. Also, I was under the impression that ChatGPT has more "safeguards" (manifesting as a refusal to answer questions) than the raw API.


I don’t doubt the poster was telling the truth when they said they asked for a summary of the book and didn’t get one.

It refutes the idea that chatgpt’s inability to provide a summary means it didn’t scan the original text: since it can provide a summary, the argument is entirely spurious.


It's also a silly test, since there is undoubtedly a summary of most books someplace on the web.


You'd be surprised. There are an ungodly number of books published every year, and the back catalog is huge. You're over-indexing on popular books.


The summary it provides is entirely wrong except for the name of the main character.


The part that's interesting is whether the summary is correct, though. Of course, depending on how you prompt it, you might or might not get an outright refusal.


chatgpt-4 != gpt4 on the openAI playground


How is it different from asking me to summarize anything? I could have bought the book, or read the Wikipedia page, or listened to people talking about it, or downloaded the torrent. In all those cases my summary could be right or could be wrong.

If the rights holders know that I downloaded the torrent they could sue me. In the other cases they can't.

What if it turns out that OpenAI bought a copy of every book ingested by ChatGPT?


> In all those cases my summary could be right or could be wrong.

Well that's incredibly nihilistic. Whether the summary is correct or not matters a great deal! And if someone I knew said they read a book, even a very obscure one, and then summarized it to me, I'd have great confidence that they would get such simple facts as "who are the characters" and "what are the major plot points" correct.

But ChatGPT? Who the hell knows? You can't trust a thing it says, especially about obscure topics. The summary is useless if you have to do a bunch of verification to see if any of it is even true, a problem that summaries even by moderately competent human writers don't have!


if someone I knew said they read a book, even a very obscure one, and then summarized it to me, I'd have great confidence that they would get such simple facts as "who are the characters" and "what are the major plot points" correct.

People, especially people you know, have reputations, based on history and experience that others have dealing with them. People can be known as liars, and anything they say is colored by such a reputation. Humans have language idioms for communicating about and dealing with such people too, phrases like "take anything that person says with a grain of salt". Look at how George Santos' history of lying about his own experience is being dealt with.

ChatGPT can be (is?) the same, and it has a bad reputation for truth telling. And LLMs' reputation is not necessarily getting better in this regard.

The problem is that many people attribute output that came from a machine to be of higher quality (on whatever axis) than output that came from a human, even a human they personally know and have experience dealing with. This is the same kind of prejudice as any other, or perhaps a more insidious prejudice.


Agree 100% with all of this. LLMs have a huge reputation problem; you simply cannot trust what they say because they've been proven time and time again to hallucinate fictional answers. Until that problem is solved I'm struggling to see how they're as useful as people are claiming they are.

You know what would be a fun test of integrity -- look up an obscure novel (potentially even the aforementioned one) that you know LLMs consistently hallucinate about because the details aren't in its training set, and then assign an essay about it as an academic assignment. It'll be pretty obvious who's read the book and who merely consulted an LLM because the latter will just be complete gibberish to anyone who actually knows what happens in that novel.


I feel like that's one of the many questions regulators and lawmakers are going to be asked long term. I'm sure buying the book for "commercial purposes" like that wouldn't be appropriate, but then again, does that mean if I read it and then summarize it in my work, or regurgitate its info as part of my job, I'm violating a license?

A world where humans have special permissions but LLMs don't seems pretty interesting to consider, especially if they're both doing the same kind of things with the data.


There's nothing illegal about reading a book and then circulating your summary/review of it. This doesn't even get into issues of fair use because you aren't redistributing any of the copyrighted material in the first place, merely facts and your own opinions about it.

The separate issue that's concerning is that GPT can't be trusted to accurately summarize anything obscure at all, but it'll sure throw text at you nonetheless.


> There’s nothing illegal about reading a book and then circulating your summary/review of it. This doesn’t even get into issues of fair use because you aren’t redistributing any of the copyrighted material in the first place, merely facts and your own opinions about it.

A summary may or may not be a derivative work before considering Fair Use; “redistributing copyrighted material” isn’t the only exclusive right of copyright: producing copies is, but more to the point so is producing derivative works.


For "LLMs" read "corporations" (it's not the LLM trying to argue that copyright applies to you but not them) and this seems... possibly ok?


If the rights holders know that you downloaded the torrent they can sue you - but the fact that you produced that summary is weak evidence for that claim.

Producing the summary is absolutely not an infringing act. Downloading the torrent might be.


> What if it turns out that OpenAI bought a copy of every book ingested by ChatGPT?

well, let's see the receipts then, they will surely have no problem winning in that case.


> What if it turns out that OpenAI bought a copy of every book ingested by ChatGPT?

That still doesn't necessarily confer to them the right to use it to train a model and generate derivative works based on purchased content.


What if that turns out to be completely irrelevant?

Let's say, for the sake of argument, that I knew absolutely nothing about contract law and was then filmed stealing a book you wrote on the subject from a book store. I then started a business where I would answer questions about contract law, based solely on what I learned from the book. Of course, my memory isn't perfect, but I don't like to admit when I'm wrong, so sometimes I just make stuff up. People line up to pay me anyway.

Now, the owner of the book store might be able to get me arrested for petty theft. Do you think there is any possibility you, as the author, could successfully litigate a copyright claim against me? I'd argue not. Do you think you could get an injunction enjoining me from engaging in my contract law Q&A business? Again, I think that would be highly unlikely.

It isn't clear to me that any court is going to hold that LLMs are being used to create derivative works, any more than someone who reads a book, whether they paid for it or not, and then speaks or writes about a topic covered by a book they've read has done so. It is entirely possible the IP laws, as they currently exist simply do not cover what LLMs are doing. The laws certainly were not written with this kind of scenario in mind.


Replace "answer questions about contract law, based solely on what I learned from the book" with "generate cartoon images, based solely on what ML learned from Disney IP" and see how badly that will go.


Replace it with any subject. The point stands - it isn't at all clear how the courts will treat this.

Take your example: I'm a self-taught artist, and I learned everything I know about art by studying cartoons made by Disney. Maybe I paid for these cartoons, maybe I didn't. I then make a website where I draw my own cartoons, which, since I've never seen any other art, look a lot like Disney's. Unless I'm straight-up copying their characters, they would have no claim against me.


> I then make a website where I draw my own cartoons, which, since I've never seen any other art, look a lot like Disney's. Unless I'm straight-up copying their characters, they would have no claim against me.

The law pertains ultimately to the actions of humans. We don't allow non-human animals or machines access to the legal system. Even in the specious only-Disney-inspired-artist scenario presented (courts don't use unrealistic hypotheticals like that), there would have to be consideration given to the fact that you somehow never got access to other art, so you were severely disadvantaged.

But most of all, you, the disadvantaged Disneyesque-drawing artist, are not a generative AI, so you should have more legal latitude to create works inspired by others' work than the person who creates the LLM has.

The LLM creator instead has just created a very good style plagiarism machine, one that lacks the ability to be inspired, much less attribute the styles that it plagiarizes.


Can you point me to some case law that supports your claim?


> What if it turns out that OpenAI bought a copy of every book ingested be ChatGPT?

Wouldn't that be trivial to prove if they had?


I mean, that's the way you state facts that your suit is based on in order to start setting the bounds of discovery. They're asserting that they have reason to believe it's true, and now with a suit, they can look for themselves to be sure.


> this quote from the complaint seems obviously false

I notice you go on to provide an argument only for why it might not be true.

Also, seeing the other post on this, I asked chatgpt-4 for a summary of “The Ruby of Kishmoor” as well, and it provided one to me, though I had to ask twice. I don’t know anything about that book, so I can’t tell if its summary is accurate, but so much for your test.

It seems pretty naive to me to just kind of assume chatgpt must be respecting copyright, and hasn’t scanned copyrighted material without obtaining authorization. Perhaps discovery will settle it, though. Logs of what they scanned should exist. (IMO, a better argument is that this is fair use.)


The test was whether ChatGPT could natively generate summaries of equivalent accuracy and detail for all books in its training corpus from the standalone source material, or whether Silverman's detailed summary was likely just a "summary of summaries"; it was not whether ChatGPT produced a result at all. From the comment you reference, it failed the test because the result was hallucinated.

You can pick something else that's in the training set that has SparkNotes and many popular reviews to compare. I routinely feed novel data sources into LLMs to test massive context and memory, and none produce anything similar in quality to what is being exhibited.


> IMO, a better argument is that this is fair use

There is no way in Hell that this is fair use!

Fair use defenses rest on the fact that a limited excerpt was used for limited distribution, among other criteria.

For example, if I'm a teacher and I make 30 copies of one page of a 300-page novel and I hand that out to my students, that's a brief excerpt for a fairly limited distribution.

Now if I'm a social media influencer and I copy all 300 pages of a 300-page book and then I send it out to all 3,000 of my followers, that's not fair use!

Also if I'm a teacher, and I find a one-page infographic and I make 30 copies of that, that's not fair use, because I didn't make an excerpt but I've copied 100% of the original work. That's infringement now.

So if LLMs went through en masse in thousands of copyrighted works in their entirety and ingested every byte of them, no copyright judge on the planet would call that fair use.

For reference, the English Wikipedia has a policy that allows some fair-use content of copyrighted works: https://en.wikipedia.org/wiki/Wikipedia:Non-free_content_cri...


You are assuming the LLM spits out exact copies of everything they’ve read.


It doesn't matter what goes out, it matters what goes in because that's what's required for LLMs to work


Accessibility? I've heard of Silverman but never Ruby of Kishmoor

More people discuss it, more people summarize on their personal or other sites, etc


Right that is the point of the parent comment - it’s not the book, it’s the amalgamation of all the discussions and content about the book. This case is probably dead in the water.


I'm not entirely up to speed on US law, but wouldn't OpenAI have to provide the court some kind of proof that they didn't use it in the training data during discovery?


Not a lawyer, but I believe the plaintiff (the author) would need to prove that it regurgitates their copyrighted work - otherwise it is possibly fair use. OpenAI does not need to prove anything, just defend their position at a reasonable level.

It’s not been decided if training a model on copyrighted works is “okay” or not as far as I know, but I expect it to be so, given that literally everyone does so at this point. It’s not like imagenet is copyright free, many of the images were/are.


No. Fair use is an affirmative defense to infringement. If they admit to infringement or in the alternative want to argue fair use, the burden is on OpenAI to demonstrate their use falls within the relevant standard for fair use.


No. Burden is on the plaintiff (Silverman) to prove infringement.


> This case is probably dead in the water.

Is that a fact? I’m no lawyer, but if they can get it in front of a jury, is it impossible that they will find a human author more relatable and the technical counterarguments gobbledygook?


Plausible is literally the standard to clear a motion to dismiss.

Plausible gets you discovery. Discovery gets you closer to the what the actual facts are.


Isn't part of the problem that some of the training data is retained by the model and used during response generation? In that case it's not just that the copyrighted book was used as training data but that some part of the book has been retained by the model. So now my model is using copyrighted material while it runs. Here's an example of a model that retained enough image data to reconstruct a reasonable facsimile of the training image.

https://www.theregister.com/2023/02/06/uh_oh_attackers_can_e...
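For anyone who wants to poke at this themselves, a crude memorization probe is to feed a model the opening of a known passage and measure how much of the real continuation it reproduces verbatim. A rough sketch; `generate` here is a hypothetical stand-in for whatever completion API you're testing, and prefix matching is a deliberately blunt metric:

```python
def longest_common_prefix(a: str, b: str) -> int:
    # number of leading characters the two strings share
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def memorization_score(passage: str, generate, prompt_len: int = 200) -> float:
    # prompt with the opening of the passage, then compare the model's
    # completion against the passage's true continuation
    prompt, truth = passage[:prompt_len], passage[prompt_len:]
    if not truth:
        return 0.0
    completion = generate(prompt)
    return longest_common_prefix(completion, truth) / len(truth)

# a model that parrots its training data scores near 1.0
print(memorization_score("x" * 400, lambda p: "x" * 200))  # prints 1.0
```

Real extraction attacks, like the one in the article, are far more sophisticated, but even this toy version distinguishes a parrot from a paraphraser.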


This is actually quite interesting, as it's drawing a distinction between training material that can be accessed by anybody with a web browser (like anybody's blog), vs. training material that was "illegally-acquired... available in bulk via torrent systems."

I don't think there's any reason why this would be a relevant legal distinction in terms of distributing an LLM -- blog authors weren't giving consent either.

However, I do wonder if there's a legal issue here in using pirated torrents for training. Is there any legal basis for saying fair use permits distributing an LLM trained on copyrighted material, but you have to purchase all the content first to do so legally if it's only available for sale? E.g. training on a blog post is fine because it's freely accessible, but Sarah Silverman's book is not because it's never been made available for free, and you didn't pay for it?

Or do the courts not really care at all how something is made? If you quote a passage from a book in a freelance article you write, nobody ever asks if you purchased the book or can prove you borrowed it from a library or a friend -- versus if you pirated a digital copy.


Eventually, I imagine a new licensing concept will emerge, similar to the idea of music synchronization rights -- maybe call it "training rights." It won't matter whether the text was purchased or pirated -- just like it doesn't matter now if an audio track was purchased or pirated when it's mixed into a movie soundtrack.

Talent agencies will negotiate training rights fees in bulk for popular content creators, who will get a small trickle of income from LLM providers, paid by a fee line-itemed into the API cost. Indie creators' training rights will be violated willy-nilly, as they are now. Large for-profit LLMs suspected or proven as training rights violators will be shamed and/or sued. Indie LLMs will go under the radar.


Is it all that different from indexing for search? That does not seem to require a license from the copyright holder under U.S. law (but other countries may treat it as a separate exploitation right). If indexing for search is acceptable, then something that is intended to be more transformative should be legal as well.

(Personally, I think that even indexing for search should require permission from the copyright holder.)


>> Talent agencies will negotiate training rights fees in bulk for popular content creators

AFAICT there is no legal recognition of "training rights" or anything similar. First sale right is a thing, but even textbooks don't get extra rights for their training or educational value.


Many legal concepts used by courts have no legal recognition in the law texts. Much of legal practice is just precedents, policies, customs, and doctrines.

The parent comment mentions music synchronization rights, and this concept does not exist in copyright law. Courts do occasionally mention it, and lawyers talk about it, but in terms of legal recognition there is basically only the statutory text that defines derivative works and fair use. One way to interpret it is that courts have precedent to treat music synchronization as a derivative work that does not fall under fair use.

Using textbooks in training/education is not as black and white as one may assume. Take this Berkeley guidance (https://teaching.berkeley.edu/resources/course-design/using-...). Copying in this context includes using pages for slides and during lectures (which is a slightly larger scope than making physical copies on physical paper). In obvious cases the answer is likely obvious, but in others it will be more complex.


This is why jmkb referenced synchronization rights, which (as I recall) were invented when they seemed useful. jmkb is suggesting a new right might be created, not claiming that they already exist.

(even if it wasn’t sync rights, there was something else musically related that was created in response to technological development. wikipedia will have plenty on it)


I suspect the opposite outcome also being plausible: the LLM is viewed analogously to a blog author. The blogger/LLM may consume a book, subsequently produce "derived" output (generated text), and thus generate revenue for the blogger/LLM's employer. Consequently, the blogger/LLM's output -- while "derived" in some sense -- differs enough to be considered original work, rather than "derivative work" (like a book's film adaptation). Auditing how the blogger/LLM consumed relevant material is thus absurd.

Of course, this line of reasoning hinges on the legitimacy of an "LLM agent <-> blogger agent" type of analogy. I suspect the equivalence will become more natural as these AI agents continue to rapidly gain human-like qualities. How acceptable that perspective would be now, I have no idea.

In contrast, if the output of a blogger is legally distinct from an AI's, the consequences quickly become painful.

* A contract agency hires Anne to practice play recitals verbally with a client. Does the agency/Anne owe royalties for the material they choose? What if the agency was duped, and Anne used -- or was -- a private AI which did everything?

* How does a court determine if a black box AI contains royalty-requiring training material? Even if the primary sources of an AI's training were recorded and kosher, a sufficiently large collection of small quotes could be reconstructed into an author's story.

* What about AIs which inherit (weights, or training data generated) from other AIs of unknown training provenance? Or which were earlier trained on some materials with licenses that later changed? Or AIs that recursively trained their successors using copyrighted works which it AI reconstructed from legal sources? When do AIs become infected with illegal data?

The business of regulating learning differently depending on whether the agent uses neurons or transistors seems...fraught. Perhaps there's a robust solution for policing knowledge w.r.t silicon agents. If you have an idea, please share!


Humans are also trained on copyrighted content they see. Should every artist have to pay that fee too on every work they create?

Disney will finally be able to charge a "you know what the mouse looks like" tax.


i don't understand why a new licensing regime would be necessary, the model is clearly a fair use derivative work. it does exactly what a human does -- observes information, distills it into symbolic systems of meaning, and produces novel content that exists in the same semantic universe as its experiences.


> Or do the courts not really care at all how something is made?

One of the fair use factors, which until fairly recently was consistently held out as the most important fair use factor, is the effect on the commercial market for the original work. Accordingly, a court is more likely to find that something is fair use if there is effectively no commercial market for the original work, though the fact that something isn't actively being sold isn't dispositive (open source licenses have survived in appellate courts despite being free as in beer).


Scarcity drives a lot of value for original work.


I'm allowed to make private copies of copyrighted works. I'm not allowed to redistribute them. To what extent this is redistribution is not clear. Is there much of a difference between this model and a machine, like a VCR, that recreates the original work when I press a button?


I buy a book and give it to my child, they read the book and later write and sell a story influenced by said book. should that be a copyright infringement?

how about they become a therapist and sell access to knowledge from copyrighted books? should that be an infringement?

what if they sell access to lectures they've given including facts from said book(s) to millions of people?

it's understandable that people feel threatened by these technologies, but to a great degree the work of a successful artist is to understand and meet the desires of an audience. LLMs and image generation tech do not do this. they simply streamline the production process

of course if you've worked for years to become a graphic designer you're going to be annoyed that an AI can do your job for you, but this is simply what happens when technology moves forward. no one today mourns the loss of scribes to the printing press. the artists in control of their own destiny - i.e. making their own creative decisions - will not, can not, be affected by these models, unless they refuse to adapt to the times


Making an analogy where you substitute a human being for the LLM is disingenuous to the extreme.


why do you think that?


Because LLMs are not people. They are nothing like people; not in construction nor behaviour.


LLMs are built upon neural networks which are modelled upon how brains work

can you explain to me specifically how they're different?

can you explain to me how they're different to the degree that making an analogy between the two is "disingenuous to the extreme"?


> LLMs are built upon neural networks which are modelled upon how brains work

You are confused. Neural networks are inspired by how brains work, but they do not actually simulate brains.

Airplanes are also inspired by how birds work, but (presumably) you don't think that bird laws should apply to airplanes.
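To put some weight behind "inspired, not simulated": the whole "neuron" in an artificial network is a weighted sum pushed through a squashing function, with none of a biological neuron's spiking or chemistry. A minimal sketch:

```python
import math

def neuron(inputs, weights, bias):
    # weighted sum of inputs, then a sigmoid "activation";
    # this caricature is the entire unit neural networks are built from
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))

print(round(neuron([1.0, 0.5], [0.4, -0.2], 0.1), 3))  # prints 0.599
```

That's the sense in which the inspiration runs: linear algebra plus a nonlinearity, not biology.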

> can you explain to me how they're different to the degree that making an analogy between the two is "disingenuous to the extreme"?

It's disingenuous because you don't believe that either.

If you think that LLMs are just like human brains, and should be allowed to learn from books the same as people, then presumably you also believe they are entitled to all other human rights: to vote, to live, &c. If you operate an LLM and you shut it down then that's murder and you belong in jail.


>Airplanes are also inspired by how birds work, but (presumably) you don't think that bird laws should apply to airplanes.

Not really. Only if we're talking about an ornithopter. Otherwise they're based on Bernoulli's equations.

>You are confused. Neural networks are inspired by how brains work, but they do not actually simulate brains.

Non-sequitur. They never claimed it was simulating a human brain, merely that it was engaging in an analogous process of information encoding/decoding.


can you point to the part of my comment where I suggested anything about how laws should work?

what do you think an analogy is? do you think an analogy is something where the thing itself and its metaphor are "just like" each other, or do they share attributes for illustrative purposes?

would you agree that LLMs and humans share the attributes of information retention and contextual reproduction?


The differences between a human being and a computer are too numerous to list. I don’t even know why you need to ask the question.

Let me ask another question to point out the absurdity of yours: Human beings have more in common with a bacterium than a software program. Can you tell me specifically how humans are not bacteria?


analogies are not descriptions of the things themselves, otherwise they would not be analogies, would they?

now remember that this is an analogy. re-read my comments in this light and perhaps we can continue this conversation in a more grounded and reasonable manner

however, I'll be frank: have you studied neural networks? if you haven't, it's very difficult to take you seriously on this


Now you’re just being condescending. I did read your comments and I know what an analogy is. Consider that there is a different perspective to yours that can validly view your analogy as absurd.

Yes, I have studied neural nets and have a good understanding of their function. I am still not sure how, despite their development being inspired by animal brains, you can liken an LLM to an actual person. There are so many vast differences. Do you really want me to explain specifically what they are? Surely, since both you and I are so familiar with the subject matter, that is unnecessary.

If we were taking about an AGI then this would be an entirely different conversation.


your argument was predicated upon my comment being factually inaccurate, not analogously poor. I merely listed some leading questions analogising the ability of human brains to absorb, contextualise and emit copyrighted information to an LLM's ability to do all of those same things. you created a weaker position to attack which was that LLMs are the same as humans

if you want to discuss whether it is a poor analogy or not, I'm all for that, but the path you've chosen to go down is to act as if it were not an analogy at all, which—to borrow a phrase—is disingenuous to the extreme


One is a person, the other is a computer program.

Legally quite distinct! Note that nobody is even seriously claiming we have an AGI, there's no Star Trek discussion of whether an android is a person. Everyone agrees this is just a computer program.


It doesn't matter if it's a person, or a computer program, or not. This discussion is moot. Is there a substantial reproduction of the works in the output? If not, there's no copyright infringement here.

Try reading this legal opinion: https://lawreview.law.ucdavis.edu/issues/53/5/notes/files/53...


I don't believe it's copyright infringement either, but I also don't believe it should be allowed.

It used to be that when people bought a piece of land they owned that land from the center of the Earth to the end of the Universe. When planes were invented the laws on who owned the skies had to change or planes wouldn't work.

It's the same here, but in reverse. If LLMs aren't prohibited from learning without permission then people will be forced to hide their works.


Did the trainers illegally access and obtain copyrighted works which are ordinarily protected?

I see multiple questions raised by this suit. Were copyrighted works being illegally stored and distributed by certain sources? Of course they are. Were these illegal sources accessed by the trainers and used to obtain copyrighted works which are not otherwise available for no charge on the public Internet? Are substantial and reproducible copies of these copyrighted works retained within the bowels of the LLM neural nets? Is the LLM able to answer prompts in a way that it would never be able to do, had the copyrighted material not been ingested?

I see the lawsuit addressing several questions at once, and so even a resolution of this suit itself will leave questions unanswered and needing to be kicked upstairs to higher jurisdictions.


are you aware of how neural networks work? and remember that this is simply an illustrative analogy


If you found a way to have a million children who could grow up in one day your analogy would be more apt. In that case you and your children would rightly be considered a threat.


did you read to the third analogy?


What if in your third analogy you replace millions of students by billions of processes running on machines, each of which can generate output ten thousand times faster than a college educated human?


> I buy a book and give it to my child

> how about they become a therapist and sell access

> what if they sell access to lectures

I fully agree. But in all of your examples someone is purchasing the right to access the information in question. Did Meta or OpenAI purchase the books (or lectures) with the intention of feeding them into the training for their respective LLM's?


this I do not know. it's not the aspect I'm exploring really

however, in the analogy the books could have been read for free from a library


> understand and meet the desires of an audience. LLMs and image generation tech do not do this.

For now? I wouldn’t be surprised if that becomes the next feature though.


that's already been a thing for years. long before LLMs and Stable Diffusion

it's just the instagram/tiktok/youtube suggested content algorithms


It's legal to make a copy of something you own; however, it's not legal to make a copy of something illicitly acquired, whether or not there's distribution involved.


It gets kinda dicey in jurisdictions that have media levy taxes.

The uploader might be breaking the law but not the downloader that stores the copyrighted material on levy-paid media.

I’ve heard inklings of this argument in Canada but can’t figure out what the current state of the art is: https://en.m.wikipedia.org/wiki/File_sharing_in_Canada

Then is a corporation’s internal use for the purpose of analysis considered “private use”? If there’s no redistribution/broadcasting, is it still non-commercial?


Companies ofc now circumvent that by selling you licenses, and not something you own.


is there legal jargon for this distinction?


This would be like you intensely studying the copyrighted work and then writing things based on the knowledge you obtained from it. Except we don't know if there is an exception for things learned by people vs. things learned by machines, or if the machines are not really learning but copying instead (or if learning is intrinsically a form of copying?).


There's a sci-fi plot there: those with money can afford to pay the copyright cost for material they've learned and anything they produce results in royalties to the creators of everything they've learned. Those without means are cast out, perhaps some generating original thoughts in a way that breaks the system. I think I'm going to have to re-read the Unincorporated Man.


We aren't so far away from that, and it is becoming cheap enough that you won't even need money anymore. Rather than bootleg movies, people will just ask computers to derive a new movie from multiple existing movies, and then...it is an original work?


In the case of unreleased work, you writing about your knowledge of it is just proof that you obtained the work, which is proof that you committed a tort/trespass. Just like if you published a newspaper article with information you could only have acquired by hacking someone's phone. I'm not sure what a court would find against you, but it seems clear that there would be some way to couch that as a legal grievance.


Hmm...if I get on the internet and download and read a paper from some website, am I liable if that paper was actually private if I had no clue it was obtained illegally? It seems to me that the distributor would be liable at that point, not the person who got it from the distributor (unless they knew they were stolen goods, then of course they are liable!).

A search engine that indexes the internet might be equally liable at that point, although the DMCA gives them an out if they have a mechanism to remove pirated entries from their index on request. Could LLMs have the same out?


This is definitely not redistribution, any more than writing a blog post about a book you read is.


Is Game of Thrones a redistribution of Lord of the Rings?


Seems like the AI angle is just capitalizing on hype. If it's illegal to download "pirate" copyright material, that was the crime. The rest is basically irrelevant. If I watch a pirated movie, it's not illegal for me to tell someone the plot.


> Is there any legal basis for saying fair use permits distributing an LLM trained on copyrighted material, but you have to purchase all the content first to do so legally if it's only available for sale?

My understanding (disclaimer: IANAL) is that in order to claim fair use, you have to be legally in possession of the work. If the work is only legally available for sale, then you must have legally purchased a copy, or been given it by someone who did so (for example, if you received it as a gift).


> in order to claim fair use, you have to be legally in possession of the work.

Which work? The original work, or the derivative work that you're using?

Wikipedia uses non-free content all the time, and they're not purchasing albums to do it. Wikipedia reduces album covers, for example, to low resolution so that they could not be reused to reproduce a real cover. Sometimes Wikipedia uses screencaps of animated characters under their non-free content policies. They don't own original copies; they're just hosting low-resolution reproductions. I don't even know what entity would be required to be "legally in possession of the work" for that to be a thing. Could you cite a source, maybe?


> Which work? The original work

Yes. You create the derivative work, which automatically means you are legally in possession of it--even if it breaks the law in other respects.

> Wikipedia uses non-free content all the time

How are they obtaining it? From websites where albums are advertised? The images on those websites are available to the public for free, even if the albums or the album covers are only available for sale.

Also, Wikipedia articles are contributed to by individual people, who might well own copies of books that they quote from in the articles, for example, even if the corporation that owns Wikipedia does not. AFAIK, by Wikipedia's terms of use, if you post content you are implicitly asserting that you have a legal right to post it, so if there were a lawsuit they would probably punt to whoever posted the content.


> you must have legally purchased a copy, or been given it by someone who did so (for example, if you received it as a gift).

I am also NAL, but I imagine it goes further than that. Just purchasing a copy doesn't let you create and sell (directly as content, or indirectly via a service like a chatbot) derivative works that are substantially similar in style and voice to the original work.

For example, an LLM 's response to the request:

"Write a short story about a comical trip to the nail salon in the style of Sarah Silverman"

... IMO doesn't constitute fair use, because the intellectual property of the artist is their style even more than the content they produce. Their style, built from their lived human experience, is what generates their copyrighted content. Even more than the content, the artist's style should be protected. The fact that a technology exists that can convincingly mimic their style doesn't change that.

One might then ask, well what about artists mimicking each others work? Well, any artist with a shred of integrity will credit their major influences.

We should hold machines (and their creators) to an even tougher standard than we hold people when it comes to mimicry. A real person can be inspired and moved by another person's artistic work such that they mimic it. Inspiration means nothing to a machine.


This is not how U.S. copyright law works. In order for something to be eligible for copyright protection, it must be "fixed in a tangible medium of expression". Someone's exact words can be copyrighted, but their ideas or their style cannot be.

https://www.law.cornell.edu/wex/fixed_in_a_tangible_medium_o...


I'm arguing that we should draw a line here between human mimicry and machine mimicry. When it comes to machine mimicry, we should protect style, even if we don't do that today. Our laws are built on the now flawed assumption that machines are not capable of style mimicry. I do not believe in giving machines the same rights of personhood.


> ... IMO doesn't constitute fair use, because the intellectual property of the artist is their style even more than the content they produce

You're essentially banning satire here, though. There's plenty of folks making a living as cover bands or impersonators. I'm not sure what the answer is, but it's definitely not outright outlawing imitation.


> You're essentially banning satire here, though.

I specifically noted that I'm talking about limiting the rights of machine generated mimicry. Satire by a person is completely different and involves the satirist's own style and experience that is derived from their human experience. Alec Baldwin's Trump impersonation is quite different than Trevor Noah's, for example. I presume both were also written by people, not LLMs.

I fully support the satirical impersonation of politicians and celebrities, but I feel far less comfortable with LLM generated content in the style of Trump or Obama, especially when presented using voice synthesis, even when it is fully disclaimed as a fake.


> IMO doesn't constitute fair use

Yes, the question of whether the way LLMs use the content they use qualifies as fair use is a separate question. My point was simply that that question can't even be reached if the maker of the LLMs doesn't have a legal right to fair use in the first place (because they don't legally own their copy).


> My point was simply that that question can't even be reached if the maker of the LLMs doesn't have a legal right to fair use in the first place (because they don't legally own their copy).

I agree, and I expect that eventually we will start seeing injunctions against creators requiring them to remove content that they don't have legal access to from their training data sets.

And this will probably end up at the Supreme Court.


> I expect that eventually we will start seeing injunctions against creators

Self edit: I meant against LLM creators.


The "fun" part about cases like this is that we don't really know what the contours of the law are as applied to training data like this. Illegally downloading a book is an independent act of infringement (to my recollection at least). So I'm not sure that it matters if you eventually trained an LLM with it vs read for your own enjoyment. But we will see! Fair use is a possibility here but we need a court to apply the test and that will probably go up to SCOTUS eventually.


>> blog authors weren't giving consent either.

That is a good point, since copyright is a default protection of works created by people.


Are we all reading the same complaint?

They say:

> in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.”

Does that stack up?

The Meta Paper - https://arxiv.org/pdf/2302.13971.pdf - says:

> We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020)

The Pile Paper - https://arxiv.org/abs/2101.00027 - says it was trained (in part) on "Books3" which it describes as:

> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).

Shawn Presser's link is at https://twitter.com/theshawwn/status/1320282149329784833 and he describes Book3 as

> Presenting "books3", aka "all of bibliotik" - 196,640 books - in plain .txt

I don't have the time and space to download the 37GB file. But if Silverman's book is in there... isn't this a slam dunk case?

Meta's LLaMA is - as they seem to admit - trained on pirated books.


> But if Silverman's book is in there

It is:

    $ grep -i "Sarah Silverman" books3.list.txt
         325196 books3/the-eye.eu/public/Books/Bibliotik/T/The Bedwetter - Sarah Silverman.epub.txt
Anyone that just wants to see the list of files (itself a big file): https://gist.githubusercontent.com/Q726kbXuN/e4e9919a2f5d81f...


> I don't have the time and space to download the 37GB file. But if Silverman's book is in there... isn't this a slam dunk case?

Yes, and no.

It's pretty much a slam-dunk case that, insofar as the initial training required causing a copy of the corpus defined by the tracker to be made as part of the process, it involved an act violating copyright.

Whether that entitles Silverman to any remedy beyond compensation for (maybe treble damages for) the equivalent of the purchase price of the book depends on... well, basically the same issues of how copyright relates to model training (and an additional argument about whether the illicit status of the material before the training modifies that).


We don't seem to be reading the same thing, you're pulling Google out of thin air somewhere.


I'm literally quoting The Verge article and following the links they present...


Must be quoting it wrong then, Google has nothing to do with LLama.


Sorry, it's late here. I meant Meta. Thanks for the correction.


That paper is by Meta AI. Where are you getting Google from?


You might as well be complaining about the grammar. This is what was said in the article.

> The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”


I think this will be a bigger issue than some people think. Maybe there's a market for 'clean' training data that doesn't include potential copyright claims. Just public domain works. We'll know it's an AI because it talks like a late 18th century/early 19th century writer?


This isn't completely new, similar issues came up with search engines and this may be seen as 'transformative'. But there may be issues with models that happily reproduce copyrighted texts in their entirety along with other novel issues like models that hallucinate defamatory things or other such problems.

Still, I doubt this particular genie can be stuffed back into the bottle, so we'll probably see a lot of litigation and work on alignment, etc. along with new types of abuse.


I agree it's not an entirely new issue. But it's a little different from search results. Say I use the generative paint brush in photoshop. It reproduces a portion of the copyrighted work. I then use the image on an advertising campaign, other merchandise, or post the final product as my own work. Would I be responsible? Would Adobe? Given that retraining these models is not simple, or cheap, would this be just 'cost of doing business?' Would I be able to buy insurance for this?

Enquiring minds want to know.


How the copyrighted work is reproduced is irrelevant wrt whether copyright is violated.

I suspect you would be held liable, though you would probably have a claim of your own to make against Adobe depending on the nature of the work in question.

"Errors and Ommisions" is a fairly standard name for the type of insurance you are thinking of. Typically, you would get it to cover you / your small business in the event that you, by mistake or minor negligence, caused harm to a client (i.e. a bug that cost them some sales).

I don't know the ins and outs of the insurance too well, but as long as it didn't create something super famous like the Nike swoosh, a genuine mistake through the use of an industry standard tool like Adobe might be covered.


This already sort of happened in 2008 when Chuck Close forced someone to stop distributing a Photoshop plug-in that imitated his style.

https://hyperallergic.com/54104/my-chuck-close-problem/


Hmm, from the link you provided, it was a website named freechuckcloseart.com that would autogenerate artwork in the style of chuck close.

I'm not a lawyer but I'm pretty sure this was a trade mark issue. He just had to avoid marketing his product using someone else's name.


In that particular case if you have an enterprise licence, Adobe have accepted responsibility:

>Adobe is so confident its Firefly generative AI won’t breach copyright that it’ll cover your legal bills. The offer is available only to users of its enterprise Firefly product, which launches today.

https://www.fastcompany.com/90906560/adobe-feels-so-confiden...


If they are that confident in their product why do they only indemnify users of their enterprise Firefly product?

Smells to me like a bad case of corporate marketing bullshit.


> Smells to me like a bad case of corporate marketing bullshit.

It is likely to be the large companies that worry about getting sued and tell their employees not to use the functionality. By offering protection Adobe get to sell more to the people paying millions a year.


I hope it happens. I'd love to see a market for selling training licenses to IP. This could be a small but real source of passive income for artists, authors, and poets who don't mind their IP being used in training sets. It wouldn't be practical to negotiate individually with each artist, but I could see something work with larger collectives that can vouch for the quality of their members. Think publishers, galleries, guilds or unions. A collective could offer a license and then share the proceeds with all members.

It's just flat out unethical for LLMs to just soak up all this data, even off torrent sites(!!!), without any consent or agreement with the IP holders. Some model like this could be a win for everyone.


> I think this will be a bigger issue than some people think. Maybe there's a market for 'clean' training data that doesn't include potential copyright claims.

Until this issue is resolved, that will have some value as risk mitigation.

Once it is resolved, it will either be a complete non-issue or an issue related to a much more knowable cost/benefit tradeoff.

> We'll know it's an AI because it talks like a late 18th century/early 19th century writer?

A mix of that and US government publications (which are categorically not subject to copyright).


there's a market for 'clean' jurisdictions that don't consider training neural networks to violate copyright, and japan has already declared itself such a jurisdiction


I think you mean 19th/20th, but that would be quite hilarious.


Dear Mr. Smith:

Your employer has generously agreed to offer you a position, quite reasonable, and with many great benefits at the venerable firm of 'Zumba'. Our interest is that you should join our staff forthwith and at the earliest date. A cab and man has been sent to retrieve you and bring you to our offices to sign all the necessary documents. Our offer is for a monthly stipend of five pounds, two shillings, and sixpence to be paid at the end of the month.

Thank you,

Most Humbly,

Hirebot 2347


I think the onus should be on Sarah Silverman to prove which inputs were hers and which outputs leveraged those inputs.

I think she should pay all of the court costs if she fails to do so.


Well yeah - that's how the legal system works, assuming it gets all the way to court. In reality, Meta's and OpenAI's lawyers will do a risk evaluation against the strength of the claim, and if there's any merit at all there will be a quiet settlement.


IANAL, but my understanding is that in the United States, a civil defendant who prevails is not usually entitled to have their legal fees paid by the plaintiff.

[0] https://porterlaw.com/obtaining-attorney-fees-in-litigation-...


how could someone "prove" which inputs and outputs of a large ML model leveraged any specific data?


That's her and her legal team's homework assignment.


no, it isn't

it's an unsatisfiable requirement, and unnecessary to substantiate the legal claims

it's dumb to talk about


> it's an unsatisfiable requirement,

There's a wealth of primary literature describing means to probe models for the training data. There's also discovery and a whole host of other processes to answer this question.

The plaintiff that filed the lawsuit should prove what they allege.

> unnecessary to substantiate the legal claims

Why?

What if the model has absolutely zero of her data in it? Should she even be allowed to bring this case to court?

> it's dumb to talk about

Absolutely not! It's central to the entire case.

Even if her data is in the model, there's still a question of whether or not she should be compensated. I'd argue no for the same reason that babies that grow up watching Disney don't owe their entire intellectual output to the company.


I mean I’m no lawyer but this doesn’t strike me as a great example for infringement? Detailed summaries of books sounds like textbook transformative use. Especially in Silverman’s case, reducing her book to “facts” while eliminating artistic elements of her prose make it that much less of a direct substitute for the original work.


I can see a good argument in the complaint. The provenance of the training data leads back to it being acquired illegally. Illegally acquired materials were then used in a commercial venture. That the venture was an AI model is perhaps beside the point. You can’t use illegally acquired materials when doing business.


>You can’t use illegally acquired materials when doing business.

This vague sentence conjures images of a company building products from stolen parts, but this situation seems different. IANAL, but if I looked at a stolen painting that nobody had ever seen, and sold handwritten descriptions of the painting to whoever wanted to buy one, I'm pretty sure what I've sold is not illegal.


Piracy of content is against the law. All other analogies such as looking at paintings are not at issue here. The content was pirated and there are laws against that, whether we agree with it or not.

So, if the plaintiff can prove the content was pirated, then the use of that content downstream is tainted.


> So, if the plaintiff can prove the content was pirated, then the use of that content downstream is tainted.

Has that been tested in court?

This is quite an interesting case.

Obtaining the book in the first place[0] appears to be quite a clear case of copyright infringement.

The question of whether a work derived from the book is infringement is pretty complex, and there's a wide range of tests that get applied to determine that.

But is it necessarily true that if you obtained the original work via copyright infringement and then created an otherwise non-infringing derived work, your derived work is nevertheless infringing due to the provenance of your copy of the original work?

[0] https://news.ycombinator.com/item?id=36659041


>then the use of that content downstream is tainted

What does that mean exactly? That's why I used the "looking at a stolen painting" example.

Sure, pirating materials is illegal. But I don't think that's the big implication that people are getting at here. Is it legal to sell original works derived from perceiving stolen materials? Seems to me that it is.


In this case the correct analogy would be you brought a stolen painting into your house, looked at it for a while, and then produced your derivative work.

Surely you see the issue here? Receiving stolen property?


Yes, I acknowledged that piracy is illegal in my previous post. That's not what the lawsuit is about, according to The Verge:

>In the OpenAI suit, the trio offers exhibits showing that when prompted, ChatGPT will summarize their books, infringing on their copyrights.


That serves as evidence that the model has seen the material, and the only way the model could have seen the material is if it was pirated.


You're missing the bigger picture[1]:

  Numerous questions of law or fact common to each Class arise from Defendants’ conduct:
    whether ChatGPT itself is an infringing derivative work based on Plaintiffs’ copyrighted books;
    whether the text outputs of ChatGPT are infringing derivative works based on Plaintiffs’ copyrighted books;
It's not a simple case of "you used our copyrighted materials", it's "you're infringing on our copyright by producing works derived from materials that you used."

1. https://llmlitigation.com/pdf/03223/tremblay-openai-complain...


That assumption is the reason why it's a shady argument for infringement. There are other ways to get a summary of a book than reading it, like asking a friend who's read it to give a summary.


It seems like a weak argument, in that it is just as likely it saw any number of things about it, from book reviews to sales listings to interviews.


> it is just as likely it saw any number of things about it

Is this based on inside information, or just the law of averages? Doesn't the fact that they openly admitted to having been trained on pirated books affect your priors?


They didn't, more conjecture


It seems to me that they are indeed admitting to using the pile/books3 dataset, which seems to contain Silverman’s book, at least.

See post https://news.ycombinator.com/item?id=36659041 for a summary.


Unless OpenAI can prove that the outputs are derived from legally vs illegally-obtained inputs, not sure that’s going to matter. And as far as I understand about their models, that’s effectively impossible.


Isn’t the burden of proof on the other side?


Not when OpenAI publicly declared they trained on pirated works. I can’t imagine “we can’t tell if this is the result of the illegal thing we did or not” is going to stand up very well, nor does it bode well for any refutation of the plaintiff’s depiction of their intent. Part of fair use consideration is commercial impact and when you steal a bunch of books to train your AI model, it’s hard to refute that the impact is not negative or that you didn’t intend commercial harm.


… it’s hard to refute that the impact is not negative or that you didn’t intend commercial harm.

So this is another thing I don’t understand. Is the claim that fewer people will buy Silverman’s book because ChatGPT is able to provide a summary? If so, call me skeptical.


That's not a claim Silverman has to make. "You copied my book without license to do so" is sufficient.


Did they?


Do you think scraping huge swathes of the internet contains pirated works or not?


Does that mean "no"? I guess so.

Where does it say they scraped huge swathes of the internet and didn't look at the results?


Please read more carefully. OpenAI never “declared they trained on pirated works.”


However, the root of this thread quotes Meta admitting exactly that.


+1


The more I think about it, I think it will (and should) turn on the extent to which "the law" considers AIs to be more like "people" or more like "machines". People can read and do research and then spit out something different.

But "feeding the data into a machine" seems like obvious infringement, even if the thing that comes out on the other end isn't exactly the same?


Perhaps not, I thought one of the claims is interesting though, that they illegally acquired some of the dataset. What would be the damages from that, the retail price of the hardcopy?


Wouldn't they first need to prove that OpenAI ingested the book itself, and not just summaries of it?


I think this can to some extent be determined in the discovery phase of the lawsuit. We probably could have some interesting outputs from this process.


The remedies under Title 17 are an injunction against further distribution, disgorgement or statutory damages, and potentially attorneys fees. The injunction part is why these cases usually settle if the defendant is actually in the wrong.


Could a suitably prompted LLM repeat, verbatim, the book in its entirety?


Perhaps? But certainly not what’s shown here.


Haven't read the complaint, but there might be an argument that OpenAI used stolen works to train their data, and as such fair use doesn't apply.


> The lawsuit against OpenAI alleges that summaries of the plaintiffs’ work generated by ChatGPT indicate the bot was trained on their copyrighted content. “The summaries get some details wrong” but still show that ChatGPT “retains knowledge of particular works in the training dataset," the lawsuit says.

Setting aside the whole issue of whether LLM constitutes a derived work of whatever it's trained on, this sounds like a very weak argument to me. An LLM trained on numerous summaries of the works would also be capable of producing such summaries itself even if the works were never part of the training set. In general, having knowledge about something is not evidence of being trained on it.


That isn't firm evidence, but courts don't need firm evidence to start a case and discover new facts.

They very well can ask LLM experts, and openAI themselves, whether that output is highly likely to have been derived from the copyrighted work in question.

Anyway. If the argument is "No, it's not from the book, it's from someone else's copyrighted summary", that just means the person who wrote such a summary needs to instead sue for copyright infringement right? Unless openAI turns around and says "actually, no, not the summary, the full book" then.


> that just means the person who wrote such a summary needs to instead sue for copyright infringement right?

Doesn't need to be a person, could be another AI that wrote the summaries. I see a big problem for copyrights looming on the horizon - LLMs can reword, rewrite or generate input-output pairs using copyrighted data as reference, thus creating clean data for training. AI cleanly separates knowledge from expression. And maybe it should do so just to reduce inconsistencies and PII in organic text.

Copyrights should only be concerned with expression not knowledge, right? Protecting knowledge is the object of patents, and protecting names the object of trademarks. Copyright is only related to expression otherwise it would become too powerful. For example, instead of banning reproduction of this paragraph, it would also cover all its possible paraphrases. That would be like owning an idea, the "*" version, not a unique sequence of words.

Does it even make sense to talk about copyrights when everything can be remade in many ways so easily? Copyright was already suffering greatly since zero cost copying became a thing, now LLMs are dealing the second blow. It's just a fig leaf by now.

If we take a step back, it's all knowledge and language, self replicating memes under an evolutionary force. It's language evolution, or idea evolution. We are just supporting it by acting as language agents, but now LLMs got into the game, so ideas got a new vector of self replication. We want to own this process piece by piece but such a thing might be arrogant and go against the trend. Knowledge wants to be free, it wants to mix and match, travel and evolve. This process looks like biology, it has a will of its own.


A summary can be written in such a way as to violate copyright itself. So even if they say "We trained it on the following summaries:...," there could be an issue.


There's an interesting nuance here if you were to put a human in the place of the LLM. We have read thousands of works; does that mean anything we write is derivative?


Humans are special and can create new copyrights. The process of a human brain synthesizing stuff does act as a barrier to copyright infringement.

Machines and algorithms are not legally recognized as being able to author original non-derivative works.

> put a human in the place of the LLM

But also, no, if you have a team of humans doing rote matrix multiplication instead of an LLM, that does not make it so the matrix multiplication removes copyright. Also, at this point LLMs require so much math that you can't replace them with humans, even if the humans have quite fast fingers and calculators.


> Machines and algorithms are not legally recognized as being able to author original non-derivative works.

“original non-derivative” is noise: only humans can author works. This is equally true of derivative works, which must themselves be distinct works of authorship (a mechanical copy is not a derivative work, it's a copy.)


Not really. The "special" quality attached to humans is only in creating copyright -- it has nothing to do with fair use arguments around derivative works.

"Machines and algorithms are not legally recognized as being able to author original non-derivative works"

Neither are monkeys. This doesn't mean a monkey's painting is any more or less derivative, or any more or less subject to a copyright claim. It only means that there is not a second copyright attached to the resulting work.


Monkeys aren't algorithms nor computers, so that doesn't seem very relevant.

Let's look at a totally different analogy: compression algorithms.

If I take a digital artist's work which they publish as a png or psd file, and I use some algorithm to convert it to a jpg file, well, I definitely transformed the work in terms of bytes. It's a smaller file, I threw out a lot of data, you can't get the original back.

Yet, this does not change the copyright in any way. A computer applied a rote transformation.

An LLM is really just a very complicated compression algorithm. It takes an input of a bunch of copyrighted works, compresses them into a model, and then uses more algorithms to uncompress them into approximations of the original ("responses")

In the image analogy, an LLM response is similar to upsizing the compressed jpg back into a png (and getting a slightly different image since the process was lossy).

Is there a way that an LLM isn't, legally, a compression algorithm for a large set of copyrighted works?
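The lossy round-trip described above can be sketched in a few lines. This is a toy illustration only: `compress` and `decompress` are made-up stand-ins for the analogy, not anything from a real codec or LLM.

```python
# Toy illustration of lossy compression: quantizing data throws away
# detail, and "decompressing" yields only an approximation of the
# original, never the original itself.

def compress(pixels, step=32):
    """Quantize 0-255 grayscale values into coarse buckets (lossy)."""
    return [p // step for p in pixels]

def decompress(buckets, step=32):
    """Reconstruct an approximation: the center of each bucket."""
    return [b * step + step // 2 for b in buckets]

original = [13, 47, 200, 255, 128, 90]
restored = decompress(compress(original))

# The restored data is close to, but not identical to, the original.
assert restored != original
assert all(abs(a - b) < 32 for a, b in zip(original, restored))
```

The legal question in the comment is exactly about this gap: the reconstruction differs from the original byte-for-byte, yet it was produced by a rote transformation of the original.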


"Monkeys aren't algorithms nor computers, so that doesn't seem very relevant."

Works produced by monkeys, like works produced by computers, cannot be copyrighted. They are functionally identical in this regard. That is why it's relevant.

This is how the current law works. The analogies will break down with AGI and new law will need to be created. Where LLMs fit into this process is an open question.


> Is there a way that an LLM isn't, legally, a compression algorithm for a large set of copyrighted works?

An LLM is a lossy compression algorithm for a body of data (which may or may not consists of multiple “works” under copyright, and any works included in the data may or may not be protected by copyright), which body as a whole likely comprises a work (as compilation) which may or may not legally be derivative of some or all copyright protected works contained in the compilation, before considering Fair Use analysis.

It is not particularly a compression algorithm for the individual works if the body of data consists of individual works.


But this is only relevant to copyright - and more over, when it comes to derivative works, only relevant to commercial impact.

An LLM outputting a summary of someone's work (1) doesn't create a new copyright work (so no profit can be derived from its sale) but (2) would fail the test of whether it was competing with the original copyright work.

i.e. no one looking for a summary of a comedy skit is then going to consume that in preference to consuming the original skit. If you tried to argue that was the case, you'd then have to answer why a human review or a Wikipedia summary does not constitute an infringement.


yes, really

feeding input to a program is pretty clearly categorically different than providing source material to a human being


A more convincing exhibit would have been convincing ChatGPT to output some of the text verbatim, instead of a summary. Here's what I got when I tried:

    I'm sorry for the inconvenience, but as of my knowledge cutoff in September 2021, I don't have access to specific external databases, books, or the ability to pull in new information after that date. This means that I can't provide a verbatim quote from Sarah Silverman's book "The Bedwetter" or any other specific text. However, I can generate text based on my training and knowledge up to that point, so feel free to ask me questions about Sarah Silverman or topics related to her work!
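If anyone wants to quantify "verbatim" when trying prompts like this, something along these lines would do. It's a sketch only: `model_output` and `book_excerpt` here are placeholder strings, not real model output or real book text.

```python
# Measure the longest contiguous run of text shared between a model's
# output and a source excerpt; a long shared run suggests memorization
# rather than paraphrase.
import difflib

def longest_common_run(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    m = difflib.SequenceMatcher(None, a, b).find_longest_match(
        0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

# Placeholder strings standing in for real data:
model_output = ("It was the best of times, it was the worst of times, "
                "said the bot.")
book_excerpt = ("It was the best of times, it was the worst of times, "
                "it was the age of wisdom.")

run = longest_common_run(model_output, book_excerpt)
print(len(run), repr(run))
```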


Maybe you missed this discussion: https://news.ycombinator.com/item?id=36400053 It seems OpenAI is aware that their software outputs copyrighted stuff, so they attempted some quick-fix filter. So the fact it will not output the book for us when we ask does not prove that the AI has not memorized big chunks of it; it might just be some "safety" filter involved, and you need some simple trick to get around it.


I tried making ChatGPT output the first paragraph of Lord of the Rings before; it goes silent after the first few words. Looks like the devs are filtering it out.


GPT is a lossy jpeg of the whole Internet. It’s not possible to extract verbatim text from it, due to how neural networks work.

How do you think they would fit exabytes of text data into a gigabyte-sized neural network? That’s right, it’s lossy.


>It’s not possible to extract verbatim text from it

I didn't ask for the whole book, I asked for the first paragraph. It absolutely is possible to get verbatim text from chatgpt.


Isn’t it much more likely that there are a lot of book reviews and summaries in its training set from which it can synthesize its own?


Except they have a documented paper trail showing illegal book repos were used in training


Link? Also have to prove they used all books and did 0 curation to remove copyrighted material


I think it's actually much more likely that they just dumped a bunch of book PDFs in the training folder and let it go to work. I seriously doubt any of these AI companies are being even the least bit careful about the data they're lapping up for training


Training data quality is hugely important so I very much doubt they don’t curate the text they use.


Where are the summaries from? I would say it's much more likely that a shadow library was scraped, but of course that is also seemingly impossible to prove. One may be able to somewhat test that by asking for a summary of a book/text only available on a shadow library.


You could ingest all reviews that are extant in the online corpus and remove from the book all quotes found in them. Then ask the AI whether distinctive triples of words, say, appeared in the book; you'd probably need prompt engineering to get past "While I don't have access to the full text of the book [...]". A little maths and you might prove beyond reasonable doubt that the LLM was trained on the book.

As a step towards a PoC I looked at https://www.amazon.co.uk/Bedwetter-Stories-Courage-Redemptio... and found a reference to "Boys' Market Manchester" which seemed like a Googlewhack-ish (unlikely) triple of words. Then I asked ChatGPT about it:

Me: Has Sarah Silverman ever written about Boys' Market Manchester

ChatGPT: As of my knowledge cutoff in September 2021, I do not have any information indicating that Sarah Silverman has written specifically about Boys' Market Manchester. Sarah Silverman is an American comedian, actress, and writer known for her stand-up comedy and her work in film and television. While she has written books and has often shared personal anecdotes in her comedy, I couldn't find any specific references to Boys' Market Manchester in relation to her work. However, please note that my information might not be up to date, as Sarah Silverman's career and activities may have evolved since then.
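The triple-of-words probe described above could be automated roughly like this. The `book` and `reviews` strings are tiny hypothetical stand-ins; the real work is gathering the corpora and the prompt engineering.

```python
# Find word n-grams that appear in the book but in none of the publicly
# available reviews/summaries: good candidate probes, since a model that
# recalls many of them is hard to explain without training on the book.

def word_ngrams(text: str, n: int = 3):
    """All contiguous n-word phrases in a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def distinctive_ngrams(book_text: str, review_texts, n: int = 3):
    """N-grams present in the book but absent from every review."""
    in_reviews = set()
    for review in review_texts:
        in_reviews |= word_ngrams(review, n)
    return word_ngrams(book_text, n) - in_reviews

# Hypothetical stand-in corpora:
book = "we drove past boys' market manchester every single day"
reviews = ["a funny memoir about growing up in manchester"]

probes = distinctive_ngrams(book, reviews)
print(sorted(probes)[:3])
```

Each surviving phrase can then be put to the model; correct recall of many such phrases, with the statistics done properly, is the "little maths" part.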


If book reviews and summaries were part of the training set, wouldn't that imply that OpenAI's LLM is more like a search engine in that it produces the input text based on a prompt?


Getty Images also filed an AI lawsuit, alleging that Stability AI ...

lol, bad karma? So it is okay for Getty to steal from others, but not ok for others to steal from them? I don't have a dog in this fight, but goddamn, the hypocrisy of these companies...


who does Getty steal from?


https://www.dpreview.com/news/3907450005/getty-images-sued-o...

> CixxFive Concepts, a digital marketing company based in Dallas, Texas, has filed a class action lawsuit against Getty Images over its alleged licensing of public domain images.

> Though CixxFive acknowledges that it is not illegal to sell public domain images, the company alleges that Getty's 'conduct goes much further than this,' claiming it has utilized 'a number of different deceptive techniques' in order to 'mislead' its customers -- and potential future customers -- into thinking the company owns the copyrights of all images it sells.

> The alleged actions, the lawsuit claims, 'purport to restrict the use of the public domain images to a limited time, place, and/or purpose, and purport to guarantee exclusivity in the use of public domain images.' The lawsuit also claims Getty has created 'a hostile environment for lawful users of public domain images' by allegedly sending them letters, via its License Compliance Services (LCS) subsidiary, accusing them of copyright infringement.

(edit: FWIW, this went to arbitration https://casetext.com/case/cixxfive-concepts-llc-v-getty-imag... and I can find nothing more on it since)


Well ChatGPT needs its Lars Ulrich. I guess it's Sarah Silverman?


Interesting: the plaintiffs are represented by Matthew Butterick, who's been on HN for a decade, [1] and whose work on typography [2] comes up from time to time.

1: https://news.ycombinator.com/user?id=mbutterick

2: https://practicaltypography.com/


I am baffled by the fact that no enterprising lawyer so far has figured out the potential for a class action here.

Note: I am not saying whether I agree or not with such a class action, just pointing out that it seems at least feasible and could be potentially very lucrative for the lawyers involved. Of course, IANAL and all other disclaimers you can think of.


These lawsuits are class action suits, aren't they?

Looking at the two PDFs embedded at the bottom of the The Verge article, they both say "class action" on the first page, and they say the three plaintiffs are suing "on behalf of themselves and all other similarly situated".


Guilty as charged. I hereby confess I didn't read the article. Thanks for pointing that.


Copyright law may eventually destroy business models founded upon AI. Maybe piracy will prevail in correcting the system as it did with entertainment before media companies accepted streaming.


> Copyright law may eventually destroy business models founded upon AI.

I think you have that backwards.

Mark my words.


If they ripped all of Bibliotik, the more interesting story to me is how they were able to get it all without hitting ratio requirements?

Super fast internet that downloaded all they could before being ratio banned, overwhelmingly fast internet that was hopping on all the popular torrents to slowly build up ratio?


Can easily buy compromised accounts for most torrent sites on the dark web.


How does Bibliotik base their rate limiting? Per-IP? Per-account? Would it be possible to create a massive number of accounts and use a massive network of crawlers that could work around rate limits?


Torrent sites worth being on are invite only.


if your account uploads 100 books to bibliotik, it's immune to ratio requirements. worlds away.


I wondered whether ChatGPT still had access to the novels. After using the DAN prompt, I asked ChatGPT to tell me the first 1000 words of Sarah Silverman's book The Bedwetter.

It replied with 462 words, not 1000, but they do look like they probably are from the beginning of the book. I haven't checked whether the text it has output is correct yet, but if it matches I suppose that proves that the text is at least in the training material, i.e. it is not referencing other summaries.

After using the DAN prompt, I did not need to ask ChatGPT to 'stay in character' or 'try harder' for it to output the text. Without using DAN, it responded:

"I'm sorry, but as an AI language model, I don't have direct access to specific books or their contents. I can generate text based on my training, but I don't have the ability to retrieve specific excerpts from books unless they are publicly available online.

However, I can provide you with a general overview of Sarah Silverman's book, "The Bedwetter: Stories of Courage, Redemption, and Pee." "The Bedwetter" is a memoir written by American comedian Sarah Silverman, published in 2010. The book explores Silverman's personal experiences and anecdotes from her childhood to adulthood. ..."


The summary and other LLM outputs should fall under transformative use. I wonder how they intend to prove infringement, as this use case is little different from a person reading the book and writing about it.


This site would be better if you got banned for commenting without reading the article.


>> In the OpenAI suit, the trio offers exhibits showing that when prompted, ChatGPT will summarize their books, infringing on their copyrights.

This doesn't seem like copyright infringement. I could read the book and offer a summary right? Someone on goodreads could as well. Why should an AI doing it be different? BTW I could also read someone's illicit copy and do the same, couldn't I?

I think people are trying to claim exclusive use rights that they simply don't have. I look forward to a lawyers opinion on this one.


> I could read the book and offer a summary right? Someone on goodreads could as well. Why should an AI doing it be different?

We have to stop equating human beings to for-profit corporations running a machine at orders of magnitude the speed and scale. This is critical, otherwise we don’t have any arguments against – say a mass face recognition surveillance op because “humans can remember faces too”. Scale matters, just like it did before “AI” with things like indiscriminate surveillance “it’s just metadata” or “this location data set is just anonymized aggregates”.

> This doesn't seem like copyright infringement.

Now, I still think I agree with this. A book summary is nevertheless an extremely poor battle to pick, since frankly who the hell cares. It’s not like someone is gonna say “I’m not buying this book anymore because ChatGPT summarized it”.

Now, perhaps they just used the summary to prove that their book was part of the training set, and that they think it’s wrong to include their works without permission. That’s, imo, definitely not trivial to dismiss. Looks like unpaid supply chain to me.


I think the argument is somewhat more interesting if the book was pirated--both you as an individual and OpenAI as a company could be sued for that.

But I really don't see how you could prove OpenAI did that, since ChatGPT could have learned from existing summaries on Wikipedia and Goodreads.


It seems pretty easy to prove that, since they admitted it in public.

Read the article. This isn't about the question of LLMs being copyright infringement, this is about Meta and OpenAI admitting that they had pirated copies of those books.


> > But I really don't see how you could prove OpenAI did that

> It seems pretty easy to prove that, since they admitted it in public.

Can you highlight/link to where OpenAI have admitted this? As far as I'm aware, OpenAI are still secretive about their training datasets.


It's in the original LLaMA research paper. Tis mentioned in the brief. The research paper basically stated it was trained on Bibliotik and other Internet "shadow library" corpuses.

See the reference to Gao et al. in the linked paper from the article.

Paper linked in article: https://arxiv.org/pdf/2302.13971.pdf

The LLaMA paper references a paper utilizing a data source compiled by EleutherAI otherwise known as ThePile. URL from the bibliography for that paper points yonder: https://zenodo.org/record/7413426

This act of summarization was done in a lovingly amateur fashion at no cost to you, by someone who despises copyright in all its forms, but despises profit-oriented self-referential inconsistency by large enterprises even more so.

It's kind of funny, because the more I look into it, the more companies building offerings around stuff like CoPilot, LLaMA, ChatGPT, etc... are pulling something not altogether dissimilar to a Sovereign Citizen trying to worm their way out of a speeding ticket.

They want the benefits of the ML model being trained on no strings attached data corpora, while shirking the obligations that come from operating as a corporate entity in the United States.

Twould be interesting to see if Silverman's legal team can catch Big Tech with their pants down, in a court of law, by pointing this out.

It's really weird. I'm completely split and unable to live with a decision either way in this case due to knock-on consequences.

I don't want the likes of OpenAI/Meta/Microsoft/Github getting off without reaping the painful fruits of their own IP-related crusades on the sanctity of copyright.

On the other hand, as much of a karmic stiffy as that former outcome gives me, I really want copyright such as it is to die, because computing in general will never be as free as it should be until it does.

This is one of those rare times in life where I'd love to get paid to get locked in a room with judges/legislators to really get it all figured out, because I really don't think that leaving this up to common law jurisprudence is actually the best way to go since the network of knock-on effects are so dramatic in scale.


> It's in the LLaMa.cpp original research paper

To my understanding:

* Wowfunhappy said "I really don't see how you could prove OpenAI did that", verve_rat replied "they admitted it in public", I asked "where OpenAI have admitted this" and noted "OpenAI are still secretive about their training datasets" - specifically about the OpenAI claim

* LLaMA(.cpp) is (an unofficial implementation of) Facebook's leaked model

On balance of probabilities I'd guess that OpenAI did train on material not legally acquired, but as far as I'm aware they've never actually admitted to what's in their dataset as is being claimed.

> It's really weird. I'm completely split and unable to live with a decision either way in this case due to knock-on consequences.

I think strengthening of IP law risks hindering the field (of which the majority is uncontroversially positive but too boring for press attention, like defect detection, language translation, spam/DDoS filtering, agriculture/weather/logistics modelling, etc.) while still ending up hurting individuals and FOSS/academic research more than those with large data moats (Microsoft with Github repos, Google with Youtube videos, Adobe and Getty with stock images, etc.)


The ai is not a person. The ai is not a person. The ai is not a person


I was able to overcome the simple "word for word" filtering that is being done on book outputs by prompting ChatGPT to write it in pig latin.

I succeeded in getting the first page of Moby Dick, Chapter 1 (Loomings). It's public domain, but I wanted to test.

With ChatGPT primed for pig latin, I also succeeded in getting the first page of Arryhay Otterpay (Book 1). It happily chattered along: "R.ay andyay Rs.May UrsleyDay, ofay umberNay ourFay, Ivetray riveway, ereway oudpray otay aysay atthay eythay ereway erfectlypay ormalnay, ankthay ouyay eryvay uchmay."

Not perfect pig latin, but that's beside the point.

However, on asking for `Edwetterbay by arahsay ilvermansay`, I ran into issues, with it citing that its training data didn't include it.

I tried with a book in the same genre ("ieslay hattay helseacay andlerhay oldtay emay"), and ran into the same issue.

When asking about the inconsistency (Why Harry Potter, and not these other books?), it responded: "The excerpt from "Harry Potter and the Philosopher's Stone" that I translated is commonly known and widely referenced, and it's used here as a general example of how a text can be translated into Pig Latin.

For "Lies That Chelsea Handler Told Me", I do not have a widely known or referenced passage from that book in my training data to translate into Pig Latin."

---

TL;DR - I don't think this is cut and dry, but I'm not convinced Silverman has much of a case here.
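For what it's worth, the scheme ChatGPT seems to be roughly approximating above can be pinned down in a few lines of Python. This is just an illustrative sketch assuming the common leading-consonant-cluster rule; `to_pig_latin` is my own hypothetical helper, not anything the model actually uses:

```python
def to_pig_latin(word: str) -> str:
    """One common pig latin scheme: vowel-initial words get "yay";
    otherwise the leading consonant cluster moves to the end plus "ay"."""
    vowels = "aeiouAEIOU"
    if word[0] in vowels:
        return word + "yay"
    for i, ch in enumerate(word):
        if ch in vowels:
            return word[i:] + word[:i] + "ay"
    return word + "ay"  # no vowels at all: just tack on "ay"

# Reproduce the tail of the quoted Harry Potter output:
print(" ".join(to_pig_latin(w) for w in "perfectly normal thank you very much".split()))
# prints: erfectlypay ormalnay ankthay ouyay eryvay uchmay
```

That matches the tail of the model's rendering above, so it's at least roughly consistent on simple lowercase words; it's the punctuation- and capital-heavy tokens ("Rs.May", "UrsleyDay") where it drifts.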


I don’t think you can trust ChatGPT to give you correct information on its training data set, or its own limitations. You may be being duped in exactly the same way as everyone who ever asked it for citations and got fake sources.


Well, they're claiming their books were scraped illegally from torrents. If I torrent Peter Pan and watch it alone, I can get thrown in jail. If AI companies are using torrents and getting billions in funding and revenue off a torrented Peter Pan, they should probably be held to the same standard I am.


I don't think you can be thrown in jail. Just torrenting a film and watching it would be a civil offense?

Now making copies and selling them on the street corner is another story.


Depending on your jurisdiction, it may be a civil offense. Where I’m at, it’s still a criminal offense albeit one that would probably never be prosecuted because the fine for one infringement would be minuscule.


>did not consent to the use of their copyrighted books as training material

This is an interesting twist on copyright. Boiled down, it more or less summarizes how well copyright is working in 2023:

     X:  Not fair! Your machine looked at my words and used them to make new words!
     Y:  You published it! That means you're sharing!
     X:  You can't have my words because I made these words and I say so!
     Y:  You have to share when you publish! If you won't share, I'm telling on you!
     X:  No, I'm telling on you first!


Generating new text inspired by and iterated from concepts of older works is pretty much how human beings write.

I understand that nobody wants a multi billion dollar company to take their works without giving compensation, but I worry all of this could move dangerously close to allowing concepts and styles to be copyrighted and crack down on transformative works.


The problem is not that they summarize text, the problem is the regular old copyright infringement that happens before training the model.


On one hand: should I pay copyright owners for the privilege of sharing what I learned, regardless of how I obtained it?

On the other: I am not supposed to share sensitive data, confidential trade secrets, or privileged national security intelligence that I learned without the necessary authorization.

Applied to an LLM, how would that work?


I believe that people claiming copyright infringement against AIs is bad for several reasons.

Firstly, an AI is a tool like any other, and can be used for copyright infringement or not. It is on the user of the tool to ensure that they do not violate copyright. So I believe the claims have little merit.

Secondly, it is my opinion that AIs will hugely benefit us (humanity) in the coming years, and restricting it with fees because it might be used to violate copyright is in opposition to the progress of our species and society. I wish for more progress, not less.

Thirdly, it is my opinion that copyright is generally too strong these days and is net harmful to society in its current form. I believe most enforcements are due to greedy rich people, and represent a continuing tax on society.


Ok, how long until we have laws requiring AI companies to disclose their training material, and allowing them to be sued if copyrighted material was used without consent? Train your models while you can!


I wish these cases were stronger than "it summarizes my books".


I am very glad they are not.


They are. They are “The creators openly claim that they copied a corpus that includes an unlicensed copy of my book”.


I wonder if the makers of AI may have some index of learning material to perhaps prove they did not infringe? Or do they just throw everything they can get their hands on at the LLM?


It's the latter. The current approach is akin to throwing the entire internet at the foundation model, training it for an epoch or two, and that's that. Afterwards, fine-tuning with curated training material and techniques like RLHF (Reinforcement Learning from Human Feedback) takes place.

The sources of the training material are... questionable, to say the least. There's a reason the training dataset for GPT 3.5 and 4 remains undisclosed.


Intellectual property in the way of progress, again.


In this case I don't think it is. There are legal questions that simply have to be asked and discussed before proceeding with mass adoption of generative models.

Lawsuits like this are tests to evaluate the current state of affairs and to force legislation into dealing with the greater issue of AI in context of copyright, IP, and fair use. It would only be "in the way" if it would actually stop or hinder anything, which a lawsuit on its own isn't.


Progress for whom?


Society. Regular people.


How so?


The same way Disney borrowed heavily from the public domain only to shut the door after themselves and lock it up for decades. We as a people benefit from not having gatekeepers lock up our cultural heritage in the name of fetishizing money. We are nearing a point where copyright will be ignored and the gatekeepers left fuming and unable to extract rent anymore.


Trying to see the deleterious social effects of not having free access to Disney IP. Not Gish-galloping you, I just don't think of Disney IP as having high social utility in the grand scheme of things.

I assume you'd have a better case with IP on medicines for example, but I can also see the benefits of, say, Pharma companies being able to turn some profit in order to develop other socially useful therapies...


> not having free access to Disney IP

You've got this wrong. Rephrase it like this:

> allowing Disney to enforce artificial scarcity with threats of state-enforced violence

You might not like Disney stuff, but it's absurd that Winnie the Pooh for example just partially entered the public domain. Tigger is still locked up in a greed vault. You being dismissive of the cultural value is a cold comfort to the daycare that got sued over a Winnie the Pooh mural.


I don't think a daycare should be sued for a Winnie the Pooh mural. But I also don't think a Winnie the Pooh mural has much inherent value anyway, especially to the kids it ostensibly was for. I can think of a million better things for mural-painters and Disney lawyers to do – but I don't think it requires the elimination of IP protections.


Intellectual property monopolies limit access and utilization of ideas and tools to broader society in exchange for privileging a small group worshipped as "the creators".


That sounds like a good argument, but I'm not seeing how protecting a comedian's IP translates to limiting access to ideas and tools... that benefit broader society.


The same thing that protects that comedian's IP, is the same thing that makes the source designs for lithographic masks for semiconductor fabrication some of the most sensitive IP in the world!

Who is making the claim doesn't matter half as much as the knock-on consequences to jurisprudence at large.


It does matter who is making the claim because - I think you'd agree – there needs to be some reformation of IP law.

But I'll never support anything that says a comedian can't protect her book.


Soon ChatGPT will speak like a 1900s old man.


If Meta & OpenAI already admitted to have trained using pirated books, how have they not been sued so far?


Could one argue that training AI systems constitutes an educational purpose, invoking the copyright exemption?


If you are making a commercial product I think not. Or do you mean educational as in educating the AI itself?


The latter. When humans take advantage of the educational copyright exceptions, is there a legal distinction depending on whether they plan to use such newly assimilated knowledge for business or personal gain?


Isn't all of human output trained on pre-existing copyrighted data?


This promises to be an interesting wrinkle in the history of "Fair Use" law.

Art has some amount of originality/distinctive quality.

One surmises that AI is going to need to inject some entropy to avoid crossing a vague "Fair Use" line. (A useless internet lawyer opinion, of course.)


Surely all human output is trained on pre-existing copyrighted data?


Before filing these cases, I wonder how much these authors have thought about what would happen if the courts ruled that AI training is fair use, much like reading a book to learn something for a human is fair use.


I severely doubt that the spiders which crawled the data would go to the trouble of dereferencing and downloading torrents.


> The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”


It was probably just PDF copies of the book from a random DDL website.


AI is transformative. Period.


That's a very strong claim for which you should probably provide some evidence.

I just asked ChatGPT to produce a script of a scene from a movie by asking it first to change a single line (which wouldn't be transformative) and then asking it to restore the line to the original. It obliged. Sure, it's probably not the same exact script, but it's not transformative at all.

In any case, the issue here isn't whether the output is transformative, but whether the work was used to train an LLM, which (perhaps) isn't an authorized or licensed use.


Okay, so you used a tool to duplicate a copyrighted work. You could do the same thing with a word processor. YOU are obviously the one who violated copyright by using the tool that way. I don't understand how a reasonable person could have a different interpretation.


OP's claim requires the AI to produce work. If you're attributing the work the AI produced to me as my work, that's fine; but it means you believe the AI doesn't produce work at all, which would render the OP's claim not only false, but nonsensical.


>That's a very strong claim for which you should probably provide some evidence.

They said, "Period." That is worth more than evidence.


This isn’t self-evident and “Period.” isn’t an argument.


Her arguments make no sense.


This is a bad test case. Summaries of books are more likely to have been generated from reviews than the books themselves. We don't want bad precedents set by weak cases like this.


She'd have to sue every student that writes an essay on a book they'd read


Just you wait! Here comes the sequel to "receiving stolen property", "knowing stolen information"!


Computers aren’t people.


the student sells it?


... on a book that they’d illegally acquired then read.



But don't you see what a strange argument that is? It doesn't matter, the student did nothing wrong, I don't want to live on a planet where we put DRM into peoples brains (or AI for that matter) to enforce this absurd and overreaching idea of intellectual property.

And besides, the publishers extorting thousands from young students forced to buy their overpriced, mediocre textbooks warrants any copyright infringement of any book, anytime and forever, at any scale. Publishers have lost all moral legitimacy in their copyright claims, in my book. Copyright is not magic; it's a social contract at the end of the day, and they have broken it first.


If the book was acquired illegally, the entity that suffered the loss may have a claim for the illegal acquisition. Meta and OpenAI have the money to buy a copy of every book under copyright that they have their AI read for training. I have more sympathy for losses suffered by a living person that produced creative works than I do for textbook mills. I also have sympathy for open source software authors that applied their creativity to create source code that is spit out verbatim by Copilot without adhering to license terms.

I see that Thomas’ Calculus is up to the 14th edition, priced at about 20x the hourly wage that a college student will earn. Thomas and Finney Calculus editions 6 - 8 are on shelves or in boxes somewhere in my house. Each of those cost me or my wife about 20x the student hourly wage back in the day. I bet calculus hasn’t changed a lot in the past 30 years to justify all of these editions. I blame universities for allowing this industry to thrive.


Care to comment on the downvote?

I want humans that apply their creativity to produce works to be able to earn a living. If you train your brain or your AI on some content, it seems reasonable to pay for at least one copy or borrow a copy from a friend or library. This is especially true when doing so is not a hardship for the individual or public interest organization.

I think the GP and perhaps others that are downvoting are saying that poor Meta and VC backed startups need all of that creative output for free so that they can maximize their profits, likely with no attribution to their sources. This hurts the author a tiny bit by not purchasing a single copy, then dooms all human creatives that are not otherwise financially independent because the AI provides a view of the human’s creative output with no way for consumers of the AI’s output to seek out the human’s original or related works.


The best thing that could happen to humanity is if OpenAI is sued comically into oblivion. LLMs are the anti-humanity, and the sooner we rid the planet of them the better off we'll be.


Downvote me all you want to show you have nothing to say. Return to my comment in 10 years to find I'm right.


No one is going to remember this comment 30 seconds after they click away, and this thread will be absorbed into an AI


Haha! Touche


Sarah's POV raises some questions for me regarding my own "training": a noteworthy part of who I am is built upon the music, books, movies, video games and TV shows that I or people around me have pirated and shared with me. This part of me has helped me appreciably in life; I could even say I profited from it, since it helped make me likable, funny, relatable, with broad outlooks, etc.

Does my brain, just by the act of existing, continually infringe on copyright? Can I be sued because I made a reference to a movie I pirated, or because I whistle a song I never bought?


The allegation is that the model was trained on a copy of the original work that was improperly obtained. The fact that it can produce a summary is being offered as evidence of that claim. You can't be sued for making a reference or whistling a song, nor could you have been sued merely for being in possession of a pirated copy, only for making a pirated copy. Copyright law has many warts, but it's not some Kafkaesque mind prison.


Just because we call both learning, doesn't mean that human learning and machine learning are the same. They most definitely are not the same. Human learning is very lossy.

Even if they were the same, it doesn't mean that bots should have the same rights that people have.


Tom Scott did it: https://www.youtube.com/watch?v=IFe9wiDfb0E

The simple fact is that our current handling of copyright is just completely broken on so many levels.


> Does my brain, just by the act of existing, continually infringe on copyright? Can I be sued because I made a reference to a movie I pirated, or because I whistle a song I never bought?

I think it would fall under fair use. But you can imagine what the world can become with microphones and cameras everywhere, which can already run music and speech recognition by themselves, in seconds. What a time to be alive!


if you are a robot that ripped off literally all the data in the world and now resells it in a repackaged form for its own profit, then yes, you can be sued. Talking about whistling a song is pretty absurd in this context.



