New York Times considers legal action against OpenAI as copyright tensions swirl (npr.org)
134 points by 8ig8 on Aug 17, 2023 | 369 comments



Honestly, I think generative AI losing a massive copyright showdown is inevitable at this stage.

It's extremely easy to get the latest generation of AIs to produce outputs that, in many fields, would trivially be considered IP infringement had no AI been involved.

While there are many interesting and reasonable legal & technical arguments that it's not, the result completely undermines copyright protections regardless. If that's accepted at scale, copyright in practice will change completely. In effect, the choices are "block this, or entirely destroy copyright protections in many industries". You can't allow this without eventually allowing everybody to simulate their own NY Times reporters, produce their own Marvel movies, and create their own Taylor Swift albums.

If you do allow that, the many many affected industries have catastrophic problems.

Problematic though copyright laws are, I see no world where all those protections go away any time soon, and so if the courts don't agree to protect copyright already in this scenario, then it will eventually be legislated to make that happen. AI consuming copyrighted data and producing an output has to be considered a derivative work (or indeed, the model itself will be considered a derivative work) or IP protections are effectively broken.

There's a grace period now while we work our way there, but the politics are pretty clear, and with no plausible path to "let's drop copyright completely" any time soon, I just don't see any other result in the medium term. It doesn't mean the end of generative AI by any means, just a slowdown as we move to a world where you need to negotiate rights and buy data to feed it first, instead of scraping everybody else's for free.


Ruling in favor of copyright will call into question search engines and the like as well.

Do you think Bing or Google are going to negotiate copying rights with the world's websites?

LLMs are proving that intellectual property has a bunch of holes in it. It's been unstable ground to defend since day one. Upon what principle should we believe that one can own an idea and all performances or derivatives of it? Patents and trademarks haven't really helped as much as they were expected to. Only recently did works as far back as 1920 enter the public domain. Additionally, some LLMs gobble up source code with mixed licensing structures. How is that to be handled?

Patents are about temporary monopolies on producing something for a market, with the trade-off of showing everyone how it's made.

Trademarks just allow you to defend your name(s).

Copyright prevents other people from making money off of your work, for the rest of your life plus 70 years.

Something is broken here, alright. While we're discussing intellectual property, what about one's DNA? Is it not a performance of biology? How about your fingerprint? Fingerprints are semi-unique, so it's also a performance mark. We've seen celebrities sue for the use of their likeness, so that's recognized to some degree as well. At what point will data subjects get rights, so that when another Equifax happens, they can be bankrupted and prevented from harming the public again?

Most are rhetorical, of course, but I really think generative language models are disrupting a lot of things we used to take for granted, and our models of creatorship are not refined enough to account for digital or statistical copying. Intellectual property as a concept is not compatible with a digital future.


I think there is a major qualitative difference between generative AI and search engines.

Search engines index the web and point you at other people's work, along the way showing perhaps too much of that content (thus "stealing" users from the target webpage). But they don't reshuffle existing content into something apparently new and original.

The "malicious" case for generative AI is that it sucks in copyrighted work (vs. indexing it), rehashes and produces something that is supposedly original, but really a sophisticated rehash of copyrighted work.


I'm not saying search engines will be considered the same as LLMs. But given that LLMs are pushing the tolerances of fair use, should that get limited through litigation or legislation, the things search engines get away with (caching, summarizing, AMP pages, etc.) may cease to be legal. Search engines might then have to adopt less rich means of communicating relevancy, perhaps by showing the keywords that match or something else indirect but still true about a source, without pulling content straight from it.

As it stands right now, yes, Fair Use. But where do we draw the line? That line's been blurry for a while. Mostly limited to no more than 30 seconds of a performance, and no more than what's needed to quote literary works. Not sure about lyrics.

The issue is on some level, most things we create are derivative. Someone had to have the idea first, but once an idea is unleashed upon the world, it seems very difficult and unwieldy to put the genie back in the bottle.

I'm not really pro-LLM, since it enables business to leech off of FOSS even more efficiently. The disruption of LLMs seems to be accelerating our philosophical re-examination of copyright and licensing terms in FOSS. We will inevitably need a GPL variant that disallows remixing via LLMs or other generative text algorithms, mostly because ensuring that all of a result is freely usable is not easy. Limitations could be built in to only 'fetch' code licensed under permissive terms, like MIT or BSD (see the sketch below), but the tendency of these models to 'hallucinate' means you really cannot know the legal standing of LLM-generated code at present.
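
A crude sketch of what that kind of license gate on a training pipeline might look like. detect_license here is a hypothetical stand-in; real license detection (SPDX/scancode-style) is much messier:

    # Filter candidate training files down to permissively licensed ones.
    PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0"}

    def detect_license(text: str) -> str:
        # Hypothetical stand-in for a real license scanner.
        lowered = text.lower()
        if "mit license" in lowered:
            return "MIT"
        if "apache license" in lowered:
            return "Apache-2.0"
        if "redistribution and use in source and binary forms" in lowered:
            return "BSD-3-Clause"
        if "gnu general public license" in lowered:
            return "GPL-3.0"
        return "UNKNOWN"

    def training_set(files: list[str]) -> list[str]:
        # Anything not clearly permissive is excluded -- which, as noted
        # above, still cannot guarantee what the model later emits.
        return [f for f in files if detect_license(f) in PERMISSIVE]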


Indexing copyrighted works is rehashing them: you're rehashing them into a different format that is more easily searchable by a computer. But that work is still based on other people's copyrighted works.


It doesn't produce a song if you ask for one. It points you at an existing one, with attribution and a (perhaps excessive) excerpt.

LLMs do.


... as opposed to a human doing the same thing ?

It's hard for me to reconcile that it's somehow OK for a student to write a term paper about something (e.g. "Ulysses"), and for Wikipedia to do the same, but not OK for ChatGPT to do it.


Copyright is not a principle or an ideal. It's a pragmatic law designed to achieve specific goals. It can and does make such arbitrary distinctions.


Ruling in favor of copyright will call into question search engines and the like as well.

No, they won't. Search engines have already fought and won this battle on fair use grounds because they make use of the copyrighted content differently than LLMs do. It's an absolutely fundamental distinction.

Patents and trademarks haven't really helped as much as they were expected to.

Patents have been a thing for nearly a millennium; Britain's patent system is credited with giving it the technological edge over its Medieval and post-Medieval competitors for world domination. Similarly, the U.S. patent system has been credited with the U.S.'s technological prowess for the past two centuries.

Only recently did works as far back as 1920 enter the public domain.

Nothing is stopping artists from releasing works into the public domain during their lifetimes. That would of course mean other people economically exploiting their works however they wished, without any input or control from the artist... which is generally why most artists haven't done that.

While we're discussing intellectual property, what about one's DNA? Is it not a performance of biology? How about your fingerprint? Fingerprints are semi-unique, so it's also a performance mark

This is a bad-faith argument. DNA and fingerprints are tangible things created without any sort of intellectual input, and therefore by definition not intellectual property.

We've seen celebrities sue for the use of their likeness, so that's recognized to some degree as well.

This is another bad-faith argument. Celebrity likenesses are intangible property but are not intellectual property and are not afforded the same protections.

I really think generative language models are disrupting a lot of things we used to take for granted

LLMs so far have disrupted student papers. And that's pretty much it.


I consider the right of publicity to be IP. I base this primarily on Shaw Family Archives, wherein the grantees of IP rights were denied right of publicity because it did not exist as an intellectual property right at the time of Monroe’s death, not because RoP is not intellectual property.

I’m curious if you can set me straight with a citation, or whether you would contest mine.


Do you have a source on the British and US patent systems being major contributors to their technological success? I'm very interested in this point, as I feel like software is the opposite, where open source has been the driver of progress.


As previously said, search engines index and provide links. I’ll add that it constitutes fair use because a search engine isn’t itself a replacement for the articles that it indexes.

But ChatGPT is actually providing an alternative that obviates the original articles themselves.


Google started moving away from just providing links a long time ago. They routinely scrape data and show it directly, keeping people from visiting the links. I don't see how this behavior could be allowed while also crushing LLMs.

Personally, I like the flexibility of an LLM being able to describe a process at different skill levels. This is of tremendous educational value to the world.


Search engines provide links, but also titles and snippets of the page -- enough for you to decide if you want to visit, and Google will show you their cached page if you ask for it.

Even the link is a copyrightable item -- artistic effort went into creating it


> Search engines provide links, but also titles and snippets of the page -- enough for you to decide if you want to visit

Small snippets are allowed by copyright law. They are not infringing.

> and Google will show you their cached page if you ask for it.

Really? I haven't seen that in several years. How do you get it these days?

I always assumed Google quit giving you that option exactly because of copyright issues.

> Even the link is a copyrightable item -- artistic effort went into creating it

IANAL, but I'm pretty sure that the current state of copyright law disagrees with you. Can you point to some concrete evidence that you're right?


>cached page

http://webcache.googleusercontent.com/search?q=cache:www.hac...

etc

There's also a link in the three-dot menu next to the search result, but it doesn't always appear.


Search engines will also eventually stop serving the result if the source disappears. An LLM that has been trained and published doesn't care about the source at all anymore.


The difference is that search engines don't say, "I created this".


Search engines are a device for leading you to a source. This is 99% of the time in the copyright holder's interest.

LLMs output a mashup of source material without attribution. This is 99% of the time against the copyright holder's interest.


Search engines used to be a device for leading you to a source. Most queries worth any money now have zero organic results above the fold. And it's one of many various schemes to keep you on their sites and not to organic search results. There's a plethora of straight ad strategies, widgets for various verticals (retail, travel, etc), "rich snippets", etc.


So is everyday conversation.


Search engines have a big difference which is that they usually direct you to the original website.


Bing's LLM does that too.


But it also tells you everything on the page without needing to click

It’s basically the “does this replace the original content” doctrine of fair use


Doesn't Google knowledge graph do that as well? Google is always giving me the answers I need before I click on a site. This was already normalized behavior prior to the existence of LLMs.


The knowledge graph is Wikipedia, which they use under license, or other sources that they have paid for a license to use.

What you're thinking of is "featured snippets". As far as I know, the justification behind those is that they are exact quotes that are followed by a citation (a link). Google argues those are fair use, since it's a properly referenced quote.


Not really.

Copyright largely remains about the PRODUCTION of content, not about the CONSUMPTION of it.

Someone who grew up reading Marvel comics being able to make new original comics in that style is perfectly ok. That same person perfectly replicating an Avengers comic is going to land them in hot water.

The focus on infringement really needs to be on what LLMs produce, not their training.

There definitely needs to be something like a secondary pass added which checks output against a vector DB of the training set to catch too-close derivative outputs (and ideally checks for jailbreaking or inappropriate content at the same time); see the sketch below.
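
A minimal sketch of such a pass, assuming a sentence-transformers embedding model and a FAISS index built over chunks of the training corpus (everything here is illustrative, not anyone's production system):

    # Flag generated text that lands too close to known training chunks
    # in embedding space.
    import numpy as np
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    training_chunks = ["...passage 1...", "...passage 2..."]  # stand-in corpus
    emb = model.encode(training_chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on unit vectors
    index.add(np.asarray(emb, dtype="float32"))

    def too_derivative(output_text: str, threshold: float = 0.9) -> bool:
        # True when the nearest training chunk is suspiciously similar.
        q = model.encode([output_text], normalize_embeddings=True)
        scores, _ = index.search(np.asarray(q, dtype="float32"), k=1)
        return bool(scores[0][0] >= threshold)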

Any production service would need to subscribe to a service like that to stave off infringement litigation, much like how the oft-repeated complaints about YouTube copyright infringement eventually dissipated as content tagging was added (and shifted to complaints over its too-broad application).

A generative AI model having read the NYT but producing new original news articles in the style of a newspaper is a very weird argument for infringement.

A human driven service or an automated one that takes current NYT articles and summarizes or reworks them, publishing itself and cutting them out of the ad revenue is more problematic (but also widespread already and generally considered protected).

Services which exactly duplicate their articles would be more clearly infringement, but there's no evidence that's even a fraction of what ChatGPT is doing.

Criminalizing training would significantly set back whatever country did so in the global competition over a critical new economic (and defense) trend, and it would ultimately be only a minor stopgap for copyright holders: you'd simply see a market for secondhand generated content from foreign models trained on copyrighted data, content that was itself not copyrightable but could be used to train domestic models.


Short of a police state, how would you enforce this?

This has napster -> subscription spotify energy. But the only people happy about that are Spotify and people who found it distasteful to download music illegally. There just wasn’t a consumer-friendly option for a while, so the black market was the only market.

So. The enforcement mechanism is what… a scary DMCA letter?

(There will definitely be a stupid DCAIA in the next congress)


The copyright holder gets a share of ownership in any AI model derived from its work, and thus a share of any resulting revenue.


Even Taylor Swift and Elton John get micropennies per play.

In an (unrealizable) regime where all copyright holders are compensated, that would include picopennies for the discussion we’ve had!


All 10 million of them?


Any large corporation likely has more individual shareholders than that (particularly when you include indirect ownership via mutual funds).


> Problematic though copyright laws are, I see no world where all those protections go away any time soon, and so if the courts don't agree to protect copyright already in this scenario, then it will eventually be legislated to make that happen. AI consuming copyrighted data and producing an output has to be considered a derivative work (or indeed, the model itself will be considered a derivative work) or IP protections are effectively broken.

There's a headline out today about several ex-Google Brain engineers, including a co-author of “Attention Is All You Need”, setting up shop in Tokyo. [1] That's not a coincidence.

> Amid rising questions about the fairness and legality of using publicly available information to train AI models, Japan affirmed that machine learning engineers can use any data they find.

> What’s new: A Japanese official clarified that the country’s law lets AI developers train models on works that are protected by copyright.

> How it works: In testimony before Japan’s House of Representatives, cabinet minister Keiko Nagaoka explained that the law allows machine learning developers to use copyrighted works whether or not the trained model would be used commercially and regardless of its intended purpose. [2]

IANAL, so I don't know what the implications of this are when it comes to cross-border copyright enforcement. It's hard to imagine Japan rolling back this type of legal safe harbor; it's a boon for attracting AI startups from elsewhere and gives the local tech industry a competitive boost.

What's to stop other venues looking to grow their tech industry to do something similar? And if they do, would it create a race to the bottom type of dynamic at the expense of copyright holders?

[1] https://www.bloomberg.com/news/articles/2023-08-17/ex-google...

[2] https://www.deeplearning.ai/the-batch/japan-ai-data-laws-exp...


I will be interested to see how the copyright suit plays out with Taylor Swift vs. Guy who asked an AI to make a new song that sounds like a generic Taylor Swift song.


Or on the converse: if those industries are unviable without copyright protection, they could go away entirely. This is a plausible path to "drop copyright entirely", just like encryption was dropped as an export-controlled technology in the late 90s. (remember the 40-bit "international" SSL?)

OpenAI etc. have huge amounts of money behind them; they may well have a fighting chance in court to defend their practice of scraping the internet.


> if those industries are unviable without copyright protection, they could go away entirely.

These creative industries include all of software development, music, TV, movies, books, media, art, etc. You do technically solve the problem of copyright by shutting all those down, but I'm not sure it's a solution anybody will vote for.

If you can come up with a serious alternative though, which can sustain those creative industries without requiring copyright, now is probably the best moment in all of history to seize the day and make that happen. There's going to be a big shake-up regardless, it's the perfect chance for alternative models.

Bear in mind that dropping copyright entirely doesn't just hurt Disney and Sony Music though - with no copyright the GPL and all other open-source licenses are unenforceable, anybody can copy & sell anybody else's art or design without permission, Spotify doesn't have to pay musicians even $0.01 any more, etc etc etc. It's not an easy problem.


I don't think that no one would create new software/music/books/movies/art/etc. without copyright. Humans did so for millennia before copyright existed, and they still do so in the absence of copyright protections. I don't see how this is not a serious alternative -- the only major losers would be the middlemen, not the artists themselves.


Eh, I could definitely see the artists losing.

The most obvious scenario that comes to mind for me: imagine an independent artist launching their (book/film/album/etc.), and the same day someone with more resources and experience takes the work and markets it better than the original author ever could on their own.


In a copyright-less system you’d monetize the creation (eg via patronage), so someone else distributing the work is fine, even helpful.

It makes no sense to add artificial scarcity to ideas; the cost of replication is inherently zero, and all kinds of noxious consequences for society (like derivative works being impossible to create) shake out as a result. If the goal is encouraging creation, why are you punishing and penalizing the creation of derivative works?

Letting megacorps monopolize popular culture for 100+ years after its inception is a relatively newfangled idea and it’s baffling that so many people just blindly accept that this is the way it has to be. Copyright works against individuals in almost all cases, we benefit much more from free interchange of ideas. Social diffusion and remixing is a fundamental human force and this AI stuff forced the issue by doing the exact same things with impossible precision and scale, such that the absurdity of the system for humans is revealed as well.

The current copyright regime is the social equivalent of “we could have a cop taking down license plates if we wanted”. And AI does the “so it’s therefore legal to automatically and instantly record everyone’s license plates at every intersection in the country 24/7”. The principle is the same but the ease of use reveals the absurdity of the principle.


It would be great if such "creative" works were simply impossible to monetize. I already don't pay for these and try to find unknown artists / writers who have a job and do stuff simply because they enjoy doing it.


That’s… awful.


What's awful here? This way it's just a social interaction with both sides satisfied, where's the problem?

I see a problem today, being spammed with shitty commercial "works" whereas I'd like to see something genuine and not just made for money.


If not for the ability to monetize, there would be a small fraction of the total available work out there. From music, to movies, to video games. And for many, the quality we come to enjoy just wouldn’t be possible. Do you think we’d have a Skyrim, or GTA, or equivalent if there weren’t millions to be made to employ thousands of people to make it happen? What about the largest and most influential films and TV shows of the past decade? These things take money to produce. I could see the argument for music, maybe, but even then I don’t think it’d be sustainable. Just because an artist may enjoy working on their art after working 50 hours a week to pay the bills doesn’t mean they should HAVE to if their art is good enough and desired enough to sustain them, thus allowing them to create more and of a higher caliber (in theory).


I don't consume any of those, I'm just saying what would be better for me.

However, abolishing copyright is not the same as making monetization illegal. My view is that people should be paid for their WORK, and copying something doesn't make the author work more.

This would mean the funds would need to be bootstrapped (crowdfunding?) before the work is done, but then it would be free to distribute.


The first sentence of your initial comment I responded to was ‘It would be great if such "creative" works were simply impossible to monetize.’


The difference being that the artist can still be monetized, even if the art isn't.


You don't need to monetize via capital though -- you can pay people to do the creating directly. E.g. one of the biggest ways independent artists already get paid is via services like Patreon.


> Humans have done so for millenia before

This is categorically false for both software and movies.

For other media, this ignores the effect of zero-effort copying.


You've ironically stumbled upon the philosophical argument in favor of copyright! We do want these industries to exist. Without copyright protection in a world of zero-effort reproduction, it becomes impossible to make a living this way, and thus the industries cease to exist. And people stop crafting anything other than the most dogshit of media and programs and writing. If you have to work a different job all day, you're gonna have no time to make that next good song/video/etc.


But I don't want them to exist, I want to find content that is created because people (who earn money different way) enjoy producing it.


Kickstarter does quite well, as does Patreon.

E.g. if Team Cherry didn't have savings from selling Hollow Knight, I'd pitch in money for them to work on Silksong.

The modern small artist doesn't even get paid for their art, but for things like their audience.


>>"If you do allow that, the many many affected industries have catastrophic problems."

That is the problem. Technically, AI should be allowed to 'read' content; it isn't hidden, and it gets mixed with other content in a 'brain'-like thing.

AI and Humans can both spit out a new product that is 'similar' and thus be sued on that similarity.

But it can also produce endless similar variations at low cost and fast.

It is the ease of creating new similar products.

You could just as well prompt the AI "make a Taylor Swift song, but different enough to avoid a lawsuit".

I think this is an entirely new problem that needs a new law beyond copyright. Copyright is not the correct law to use for fighting this. Copyright law doesn't ban someone from reading the source altogether.


There’s an unbelievably vast difference between a human’s creative process and the mechanical reproduction of reweighed training data. Machines don’t create, people do.


Why is the human mind not also a machine?


Is there? I think you are giving too much credit to human creativity, falling prey to the 'human exceptionalism' argument, or to the 'mysterious'. It just assumes humans are unique or ineffable; it doesn't provide any of the 'how' it is done.

Ask GPT something like "write a screenplay for Othello using dialogue like Tarantino, but with a bit of the style of Baz Luhrmann". It produces something pretty creative. And if you say, well, it is just combining what came before -- that also applies to humans; there are no new ideas.

Humans being complicated doesn't mean AI won't catch up. It's just time/money/engineering at this point.

In nature, why would carbon be more holy than silicon?


> AI consuming copyrighted data and producing an output has to be considered a derivative work (or indeed, the model itself will be considered a derivative work) or IP protections are effectively broken.

It’s not derivative work though. First, a human didn’t create it, so copyright protections don’t exist on its output. Machines don’t enjoy copyright protections, people do.

It's mechanically copying and reproducing part of its input data set. Making a tool that regurgitates others' copyrighted IP is going to be seen as aiding mass copyright violations -- exactly like how Napster got sued: they're holding a bunch of material they shouldn't be. The only new twist to this case is that the data is encoded in a transformer's weights. This should correctly be seen as the same as having encoded copyrighted data with a lossy compression algorithm.


> First, a human didn’t create it, so copyright protections don’t exist on its output. Machines don’t enjoy copyright protections, people do.

This is not correct. AI models are tools that humans use.

This is like saying "it was typed on a computer therefore it doesn't enjoy copyright protections"


This has a ruling, see https://www.theartnewspaper.com/2023/05/04/us-copyright-offi...

When you're using AI models to generate the art, it's not considered human enough. If you then make a bunch of modifications to it, sure, but giving the initial prompt alone is currently insufficient.


> The Copyright Office granted copyright to the book as a whole but not to the individual images in the book

Interesting, and it makes sense. I guess the best comparison I can think of is that a random number generated with Math.random() is not copyrightable, but including that random number in a program is copyrightable.

I'd say that a corollary to this is that if the images/words generated by the machine are not copyrightable, then they also cannot be _copyright violations_.


Seems unfair as we converge on AGI. If I memorize the lyrics to a song, is that a copyright violation? The lyrics are encoded in the arrangement of my neurons, after all.


I don't know why people use these analogies. No person can memorize terabytes worth of lyrics.


Does that invalidate the analogy?

The point I'm trying to make is if you rule these behaviors illegal, then you're necessarily making intelligent AI illegal, because humans are capable of the same behaviors.


Plagiarism is already a copyright violation so I’m not sure what your point is, and what you’re alluding to is literally what the post is about…


Plagiarism is not a copyright violation, it’s a violation of the social rules of private (academic) institutions and imposed by those institutions upon their members.

If plagiarism were a copyright violation, then citing the source wouldn't make the copyright violation go away. Putting the artist's name in the title of a YouTube video doesn't make it not a copyright violation.


What if someone could? What if Rain Man could?


What if Rain Man could fly too? I guess Rain Man wouldn't need an airport for short-haul trips.

So, "good on Rain Man".

Now, for the rest of us...


That's not what GP is saying. The (potentially) copyright infringing part is the reproduction of copyrighted material, not the encoding itself. In the same way that learning the lyrics to a song isn't copyright infringement, but performing that song live without permission would be.


Perhaps the difference is that you're a human doing it, and the other isn't?


I think attempts to tease out this distinction are going to make these laws unwieldy to use in practice.


Also even if it’s ruled to be infringing, these current models aren’t going to go away. And given the additional high-quality training material this allows, it’s fairly likely these models will have an ongoing advantage in quality of output.

So now you've divided the world into those who use the best tech and those who are not allowed to. And that is what OpenAI wants; they're betting courts are going to rule that, however tainted the source, we can't put the lightning back in the bottle.


IANAL, but copyright protections are pretty much tied to content and format, not to the idea itself, with the intent of preventing (or putting a price on) the copying of original works. The Times will have a very hard time proving that their content is being re-marketed by OpenAI, as opposed to a competing product merely being based on their ideas.

Compare:

"Steve Jobs [was] a tyrant": https://www.nytimes.com/2011/10/07/technology/steve-jobs-def...

Against:

"Whether to describe SJ as a tyrant is a matter of perspective...": https://chat.openai.com/share/28633f0c-007f-48b6-a615-1581c3...

The general way LLMs work does not preserve content in its original form: the ideas it contains are extracted and clustered statistically. As an ELI5 refresher, an LLM reads 2 million NY Times articles and records that after the word "Steve" there are a lot of "Jobs", followed by a lot of "was a genius/tyrant", "founded Apple", etc. Then the LLM answers the user question "Who was Steve Jobs?" using this complex net of token/word stats (see the toy sketch below). Is that fair use? I think OpenAI's lawyers will not even touch the fair use question; they will simply state that no copy happened, just a statistical collection of words from various sources.
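
A deliberately crude toy of that "net of token stats" -- a bigram counter, nothing like a real transformer, but it shows that what gets stored is statistics rather than articles:

    import random
    from collections import Counter, defaultdict

    corpus = ("steve jobs was a genius . steve jobs founded apple . "
              "steve jobs was a tyrant .")
    words = corpus.split()

    # Count which word follows which across the whole corpus.
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

    def generate(start: str, length: int = 8) -> str:
        out = [start]
        for _ in range(length):
            options = follows.get(out[-1])
            if not options:
                break
            tokens, counts = zip(*options.items())
            out.append(random.choices(tokens, weights=counts)[0])
        return " ".join(out)

    print(generate("steve"))  # e.g. "steve jobs was a tyrant . steve jobs"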

And importantly, no single source is really prevalent in an LLM, so the end result cannot even be traced back to the source, especially if multiple similar news sources are fed into training. I have no idea how the Times is going to prove that it's _their_ news.


> A top concern for The Times is that ChatGPT is, in a sense, becoming a direct competitor with the paper by creating text that answers questions based on the original reporting and writing of the paper's staff

Sounds to me like they are trying to claim copyright over facts rather than the specific expression. That’s just not how copyright works at the moment.

The framing of OpenAI's recent changes is telling too. OpenAI seems to have nudged their models to reject requests to continue sentences from specific sources -- which the press is now framing as "trying to hide the use of copyrighted data".

What we are seeing here is an unprecedented attempt at expanding copyright doctrine to facts, style and information rather than specific expressions - a land grab of latent space by rights holders salivating to own factual information


>which the press is now framing as “trying to hide the use of copyrighted data”

Yea, now I can't read the paper and talk about it to other people it seems.

The Right to Read was a prophecy I guess?


An individual or group of individuals doing this and sharing their views/summary vs. a profit-oriented program funded by major technology companies scraping this information and spitting it back out algorithmically does seem different to me. Yes, perhaps both things are on the same "sliding scale", but I do not view them as fundamentally equivalent actions.


>a profit-oriented program

The particular problem here is this program isn't magic, it just requires a lot of electricity and hardware to train at the moment. If at some point in the future this hardware becomes cheap then now suddenly OSS LLMs would be under the same set of rules that we're applying to major technology companies.

But mark my words, the large copyright-holding groups don't give any shits about anything other than how much IP they can scrape up and demand money for, for the next few human lifetimes.


So I can use an open source LLM like Llama then?


You've pierced my completely precise, absolutely airtight choice of language about this situation as some sort of flaw in the greater point being made.

Less glibly: a non-profit oriented LLM is just in a little different place on the scale, but doesn't fundamentally change my takeaway. However in this situation it makes it particularly egregious.


Soon anyone will run LLMs locally, on just about every piece of hardware. If it's part of the system, like the network stack or firmware, how does your mental framework work then?

Progress is murdering their business model. They are trying to stop it, which makes sense, but let's not pretend the trajectory here isn't toward ubiquitous LLMs everywhere within 3 years. We have to plan for that and assess the benefit to society from it.


My point is that the scale of how LLMs work doesn't matter with regards to copyright and fair use.


The printer had the same effect on print papers. We didn't protect them against the possibility of at-scale replacement of their business model.


I tend to agree with you, but, one could argue “statistical collection of words” is a form of compression? For example, you can’t write a kids version of a novel and sell that without dealing with copyright.


The part OpenAI will have to argue is that it's not merely compression but an irreversible transformation.

Which is hard; the best hope they have is trying to put the burden of proof on the NYT to show you can make the model regurgitate their articles (with some nudging).

If they manage that, then the NYT is going to have a lot of trouble showing the model actually breaches their copyright, because just the information contained in their articles is not enough to constitute a copyrightable work.


Any form of lossy compression is an irreversible transformation. We do it all the time for video, audio and images (you can't recover the original data) and they are still copyrighted


When you compress a video, it doesn't recreate a new movie with a different story, different lines of text, different scenes, and different compositions for scenes that are similar to the "original".


But what is being compressed is the entire corpus of text, compressed into the model weights. It's the weights that might be under the copyright of the authors of the texts that trained it.

The weights are also executable code (in some sense). When you query an LLM you're running this program with a given input. Yes, when it runs it says a whole lot of things (sometimes novel combinations, sometimes verbatim repetition of training data), but the point here isn't whether the output of the LLM is copyrighted; it's the weights.


The model is a model. It's part of a compression algorithm. The compressed data would be the prompt + the choice of which predicted tokens to accept (e.g. when not always choosing the most likely next token). The end user supplies the prompt, and the choice function is randomized/not being used to store data; thus the end user is providing the compressed data. (A toy sketch of this view follows.)
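
To make that "model as codec" view concrete, here's a toy sketch. ranked_next_tokens is hypothetical -- any deterministic next-token ranking shared by both sides works; the "compressed data" is just the first token plus the rank of each subsequent token among the model's predictions:

    from typing import Callable, List

    Ranker = Callable[[List[str]], List[str]]  # context -> candidate tokens, best first

    def compress(tokens: List[str], ranked_next_tokens: Ranker) -> List[int]:
        # Store each actual token as its rank in the model's prediction list.
        return [ranked_next_tokens(tokens[:i]).index(tok)
                for i, tok in enumerate(tokens[1:], start=1)]

    def decompress(first: str, ranks: List[int],
                   ranked_next_tokens: Ranker) -> List[str]:
        # The same model plus the stored ranks reproduces the text exactly.
        tokens = [first]
        for r in ranks:
            tokens.append(ranked_next_tokens(tokens)[r])
        return tokens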


The NYT's argument is going to be that they put up a site, own the copyright for their content, and make that content available either for a human to read for themselves, or for software to index as something commonly understood as a search engine. Those terms do not cover training LLMs for commercial use. Therefore, cease and desist. Oh, and destroy anything that was created by violating the terms of our license.

You can make arguments like a) what is ChatGPT but a different kind of search engine, or b) what is an LLM but a primitive human, or c) but but uhh we didn’t agree to these terms.

But I do not think those arguments will prevail.


The LinkedIn case already proves that you cannot impose conditions on works you freely serve to the public. The data is there to anyone who sends a request (you don’t even need to be logged in) and if they do something you don’t like with it then oh well.

So if that’s the argument it’s already been argued by LinkedIn and lost.

This is one of those areas where copyright holders have gotten absurdly full of themselves, though. What you've said is that copyright holders have the right to impose a contract of adhesion on data they are broadcasting to the public, without any idea with whom they are even forming a contract, and that's a facially absurd and incredibly noxious idea if you follow it to its implied conclusions.

Copyright is about securing works of significance to the public and encouraging their creation, and the way it's become a lifetime-plus-70-year guarantee of intellectual ownership of ideas is fundamentally noxious and goes against the intent and spirit of the idea. And if that's where the copyright regime is headed, then I'd rather see ChatGPT kill off copyright entirely.


NYT will have to prove that the derivative work is still theirs; just violating the license may not be enough. That could be bad by itself, I guess. But considering the interactive prompt can produce a wild number of variations of 'not NYT stuff', it will be tough to say what sort of damages apply.

A similar sort of issue popped up in the 80s around colorization of films. https://www.latimes.com/archives/la-xpm-1987-06-20-ca-8405-s... https://chart.copyrightdata.com/Colorization.html

The answer may be 'maybe'? From what I read, they basically split the decision down to an 'I know it when I see it' style of ruling. If the copyright is still in effect, then NYT owns that portion of the output but not other parts, as the secondary effect would be owned by the generator company (in this case OpenAI) or the person who prompted for it. If that is the case, NYT would have to prove which parts (nodes? backreferences? weights?) they own.


Terms of use are a thing, and if the Times can prove that OpenAI infringed their web terms by scraping, they may have a case... but terms of use probably won't monetize well or give them enough leverage to prevent OpenAI from using their data anyway, and may end up distracting from the main copyright suit.


Violating TOS, at least to scrape and use later, is legal.[0] I'm not sure how the ruling interacts with LLMs, but I'm sure OpenAI's lawyers would bring it up.

[0]: https://www.forbes.com/sites/zacharysmith/2022/04/18/scrapin...


FYI LinkedIn actually won that case after appealing once more: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn


Where do you see that they won the case? Can you provide a source because the wikipedia article directly contradicts what you are saying...?

I see they went to the Supreme Court who kicked it back to the Ninth who then re-affirmed their position that HiQ Labs was not in violation of the CFAA.


From [0] and [1], it seems it was a mixed ruling. I am actually not sure whether it's now legal to scrape, since the Court ruled against hiQ due to a breach of terms of service, but previously the Ninth Circuit Court affirmed its ruling against LinkedIn.

[0] https://www.natlawreview.com/article/court-finds-hiq-breache...

[1] https://www.natlawreview.com/article/hiq-and-linkedin-reach-...


> The hiQ decisions give a green light, at least in some circumstances, to scraping publicly available websites without fear of liability under the CFAA.

So at a federal level, it seems relatively clear. The only uncertainty is on the state level.


Ah. I was not aware there was an update to that case. TIL.


I think this is going to be a test of the fair use doctrine.

https://www.copyright.gov/fair-use/

Now, there's this idea that "news" is just factual and therefore falls under "fair use". However, that's only part of what section 107 says.

Fair use very much is still conditional, as there are four factors to be considered:
(a) the purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes;
(b) the nature of the copyrighted work;
(c) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(d) the effect of the use upon the potential market for or value of the copyrighted work.

The big issue isn't companies training LLMs using unlicensed materials (e.g. copyright-protected works); it's publishing the output to the wider world. That's where liability is created.


(a) looks bad: clearly commercial.
(b) ??
(c) looks bad: the LLM consumes "all" of the articles.
(d) looks bad: has a pretty significant impact on the market and value of the work.


Unfortunately we'll have 9 justices who know little about copyright law and nothing about tech to tell us what the law really is.


They will be significantly assisted by Supreme Court clerks, who are generally recent graduates (2-4 years) of the top law schools in the US. Your stereotypes of uninformed Congressmen from televised Congressional hearings don't really apply here.


26 year old law school grads will help! Oh, I stand "corrected." LOL


>The general way LLMs work do not preserve content in it's original form: the ideas they contain are extracted and clustered statistically - as a

Is the way LLMs work relevant? I could make a shitty script that takes Microsoft proprietary code as input and outputs something identical in purpose but with completely different text: rename identifiers with synonyms, swap some things around, etc. (see the toy sketch below).
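
A toy version of that script, using Python's ast module to rename identifiers while preserving behavior -- textually "new", substantively a copy:

    import ast
    import builtins

    class Renamer(ast.NodeTransformer):
        def __init__(self):
            self.mapping = {}

        def visit_Name(self, node: ast.Name) -> ast.Name:
            if hasattr(builtins, node.id):  # leave print, len, etc. alone
                return node
            new = self.mapping.setdefault(node.id, f"var_{len(self.mapping)}")
            return ast.copy_location(ast.Name(id=new, ctx=node.ctx), node)

    source = "total = price * quantity\nprint(total)"
    print(ast.unparse(Renamer().visit(ast.parse(source))))
    # var_0 = var_1 * var_2 ... same program, different text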

I am not against AIs, my opinion is that if your AI uses GPL code the output should be GPL, if it uses public domain images the output should be public domain images.

I mean, for code, if the AI were actually intelligent you should be able to train it on C with just a few books, not the entirety of GitHub's open source code (and notice MS did not train Copilot on the proprietary code they have access to, suggesting they are not confident they are in the right).


> my opinion is that if your AI uses GPL code the output should be GPL

If I use Inkscape is the output of my drawing subject to the same terms as Inkscape?

If I use a Photoshop filter is the output subject to Photoshop's EULA and/or the copyright of the photo I started with?


If I use a Photoshop filter is the output subject to Photoshop's EULA and/or the copyright of the photo I started with?

If you get my image from the internet and then resize it in Photoshop, you can't claim you created some original art; you just used the resize/crop/color filter function.


Indeed. I was hoping to draw the distinction between license of the tool vs its output. AI is interesting in that its training inputs can leak out, sometimes in ways that are verbatim or insufficiently transformative to be free of copyright.


But do this exercise: imagine I have 1 billion lines of C code on GitHub under the GPL. You train a new AI only on my GPL code, then claim that the output is somehow transformative and is not GPL, nor derived from my code, but new and original.

What happens in practice is that Microsoft or OpenAI will use 1000 or more different sources, and then the author is not clear, except in the few outputs where the tool emits exact code from the training data. If the AI were actually intelligent, they could train it on a few books and, say, the MDN documentation, not a few thousand GPL codebases.


I don't think the process matters much, but your proposed script would output something that was obviously very similar to the original.

What ChatGPT produces under normal use is not more similar to the NYT source than any other article on the same topic.


>What ChatGPT produces under normal use is not more similar to the NYT source than any other article on the same topic.

ChatGPT is doing what a smart student does when copying homework: combining a few sources and changing some wording. Technically there is no creativity; it is interpolating its inputs with some randomness thrown in.

We also know that OpenAI added filters to suppress copyrighted outputs after ChatGPT was caught reproducing paragraphs of text word for word, showing the model actually memorizes them.


Discovery is how they will show that it’s their news; and if it gets that far we might finally learn what data they trained it on and how much.


Copyright protection goes way beyond "verbatim copy". For example, fictional characters enjoy copyright protection:

https://en.wikipedia.org/wiki/Copyright_protection_for_ficti...


If I read a story about a flood in Dubai on the NYTimes and then I write an email to my friend summarizing what I just read, it is not copyright infringement.

I am not sure why it would suddenly become infringement because an LLM is composing that email for me.


But if you create a company that takes NYT articles, scans them for extreme weather information in Dubai, and emails your paying customers every time that happens, then you are committing copyright infringement.


Not if there are people doing the scanning manually (with their eyes), which is I think the GP's point?


they had to make a copy of the original to get it into their system in the first place!


In order to render that page, it probably was copied dozens of times all over my RAM. Do I owe NYT money now?


Well, are you making money off of those copies?


Does reading news articles not benefit you? If not why continue reading?


Exactly. If I read a NYT article, decide to invest in some company, then sell the stock and make a profit, do I owe the NYT a percentage because I used knowledge I "read" in one of their articles? I "read" (input into my brain's neural net), where it mixed with other inputs.


I think the only consistent stance would be that yes, you do.


Not personally, but if I was training to be a journalist then yes.


yes. they have a hard paywall!


Can't wait for the Supreme Court ruling that says AI is just using data, and data is free.


MP3s are just data too.


ChatGPT is a terrible (truly, it's punishing) writer, but that should not always be the case.


IANAL is the dumbest abbreviation the internet has come up with. I believe I first observed it on the Groklaw discussion threads covering the SCO legal battle against the world.

Why not just NAL (Not A Lawyer) instead of the full IANAL?

I just had to say it.


IANAL is just "wacky" and "sexual" Reddit-tier humour, nothing more. It's boring and annoying.


IANAL long predates the term "Reddit-tier"; it probably goes back to Usenet in the 80s.


I don't think it's an exaggeration to say that LLMs might lead to the end of the open web, or at least a drastically reduced version of it. So much of these models' utility lies in directly competing with the producers of their training data. Content creators and aggregators are seeing more and more reason to restrict and limit access, to avoid having AI companies consume all of their data and then be the ones making money from it going forward.

I fear that LLMs are going to cause the internet to be a much worse and less open space.


No, it's the death of the corporate content-hosting web. The open web was never about making money with your blog post/IRC chat/usenet group/etc. content, at least in my opinion.

Let data be free! If someone wants to use it to make money, well, it's open, just like open source. It's still not okay to take open source work and claim it as your own, which is what copyright should be limited to. Stealing a photo or plagiarizing an essay is intrinsically different from just having a copy read by something, be it a human or a mechanical process such as training an LLM.


Maybe I don’t want my non-corporate art to be a part of some large corporation’s training data.


Once again I have to point out that LLMs are not and will not be restricted to "large corporations". Even if initial training is prohibitively expensive, open base models already exist.

"corporate this" and "corporate that" are handy rhetorical flourishes but they distort the debate.


    User-agent: GPTBot
    Disallow: /


IMO you shouldn't have to maintain knowledge of what kinds of crawler bots exist and keep up deny lists. It should be the opposite: only expressly allowed content should be crawled, by maintaining allow lists.


You've been able to do the opposite since the inception of robots.txt:

    User-agent: *
    Disallow: /

and then whitelist Googlebot and whatnot. Most of the web is already configured this way. Just check the robots.txt of any major website, e.g. https://twitter.com/robots.txt


The Allow: directive was an extension to robots.txt added later.


That was my gut reaction too, but presumably, unless it becomes regulated, at least some competitors to OpenAI won't respect any robots.txt, and thus any open content might become training data.


User-Agent: <new technology category>Bot


if you want to give away your data for free, do it, but speak for yourself, not everyone is in position to be able to do it. Just like you can't force other people to give away their software (or anything else, really) for free if they choose to charge for it.


Really? It seems not so different conceptually from search engines. They also make money by indexing "training data" of a sort. But the open web trundles on regardless, because the search engines found ways to cut the content producers in on it. I see no reason why that can't be the case here too, with AI companies training their models to act more like search engines when data comes from certain sources -- i.e. they'll point you in the right direction but not directly answer you. That will suck for AI users, but for many questions the right answer will be available in open or bulk-licensable training sets anyway. For example, you can get legal access to nearly all books by doing deals with publishers, as Google Books has demonstrated; you can get access to map data by generating it yourself or doing deals with digital map companies, etc.


>Really? Seems like it's not so different conceptually to search engines. They also make money by indexing "training data" of a sort.

OK, so say writer X has a blog to put up samples of their work, to drive people to buy books and to get writing assignments, and someone uses ChatGPT to write something in the style of X. This naively seems like a hit on that author's ability to sell their skills.

And I don't think it is fixable by making the AI act more like a search engine.


It would work both ways. I could also imagine someone asking: “what books should I read to learn about X?” And the LLM could drive sales toward that author’s books.

It’s not clear to me that having their content “unindexed” is good for authors. It’s probably good and bad at the same time.


Seems like a rerun of the argument over snippets, or the "answer onebox" as Google used to call it where the info you need is directly inlined in the SERP rather than being behind a link.


It is exactly a rerun of that argument, except that the UX of AI is different.

Basically AI is structured to make every result an answer onebox whereas in search this only happens sometimes.


Yes I would say that is very similar. Many sites saw their revenue crater, based on data they themselves had created.


> this naively seems like a hit on that author's ability to sell their skills

If the AI is good, then the author's economic outlook is bad regardless of style.

Regardless of whether the AI is good, I can still see it being brand-damaging, but my gut feeling (IANAL) is that this is more of a trademark issue than a copyright issue, as it's passing something off as what it isn't.


Yes indeed; search engines settled that problem a long time ago in a civilized way. The more of these lawsuits there are, the harder the wild west of "training" will have to search for an acceptable solution to what is currently web-scale theft of data.


Is this open web in the room with us? Cheekiness aside, the myth of the beautiful open web always seems to be exaggerated. Most content on the web is already generated, ugly, and spammy. Most of the traffic is already owned by megacorporations.

One could easily argue that it's unlikely to get worse -- if anything, AI could empower competition, as now a group of 3 passionate, free writers can compete with agenda-driven, for-profit corporations on a similar level. This could very well make the web more open and free, as it makes the web more accessible.


> Most content on the web is already generated, ugly and spammy.

Now? Yes. I grew up when it was just ugly, and by ugly I mean animated gif backgrounds with obvious seams on the tile boundary.

*old man shakes fist at The Cloud*


Let's assume that happens. How do I hedge against it? Is there a convenient way to mirror the bits of the web that are open now? Perhaps a mirror of the Internet Archive/Wayback Machine that could be viewed locally, similar to Wikipedia dumps?


Yes, Common Crawl indexes it.


100TB. Dang! I will need to expand my storage significantly to mirror it.


100TB so far ;)

And that's probably just one year; there's data that has vanished over time. The point of all this is that they're point-in-time references of the internet.

The more practical thing is that (the royal) they have set up an rclone-like utility that allows you to retrieve a slice of the whole snapshot if you know your domain; a sketch of the index query follows.
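
For example, the Common Crawl CDX index can be queried per domain over plain HTTP (the crawl name below is illustrative; it changes per snapshot):

    import requests

    # One JSON record per capture: WARC filename, byte offset, length, etc.
    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2023-23-index",
        params={"url": "example.com/*", "output": "json", "limit": "5"},
        timeout=30,
    )
    for line in resp.text.splitlines():
        print(line)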

It would be extremely funny if Common Crawl became a de facto IPFS, allowing queries along domains (or arbitrary dimensions/clusterings) of interest.


There is a very real risk that we end up with an inferior product cannibalizing a superior one and driving it out of business.

Moreover, AI would seem to be even more susceptible to capture and manipulation than conventional media.

When it's a question of guiding thought I prefer the humanities to tech. (Same with art.)


> we end up with an inferior product cannibalizing a superior one and driving it out of business.

In case print media is meant as the inferior product: the same argument could've been brought up for Napster, where traditional distribution via CD pressing through music labels was the inferior product driving the superior one out of business. Or rather, it was big labels suing Napster out of business.

I also hold a dislike for the copyright lobby, but this matter is serious. The question of whether training LLMs on copyrighted data is legitimate is a real one, as OpenAI did not just use contributions from large media outlets like the NYT but capitalized on small contributions from individual contributors.

A ruling in favor of copyright would force OpenAI to shut down - but given their impressive demo of the tech, I hope we'd see more open and accessible versions of these models emerge.

As impressive as ChatGPT is, I dislike having my access to information governed by some large corporate entity. I also dislike a company directly capitalizing on my contributions without my explicit consent.


> A ruling in favor of copyright would force OpenAI to shut down - but given their impressive demo of the tech

Who cares? If they want data, they can pay for it.


Or at least ask before scraping/reading it.


If it's on the open internet, then why should they have to do that? How is OpenAI training on articles fundamentally different from the Wayback Machine storing them? They're just being stored in a different form.


That's like saying that taking a sample from a song and putting it in a remix is just a different way of storing the original song. If I cut the original song up just right and put enough samples across different songs...

The issue here really lies in "how actually DOES this make a difference", legally speaking. It just seems unfair that I'm not allowed to copy-paste a text verbatim or upload a movie to YouTube that I don't own the copyright for, yet OpenAI can happily commercialize content that has been sorted and rated for quality by someone else.

The question is "are they allowed to do this under the umbrella of current copyright legislation?" and the answer has far reaching implications.


Copyright law includes many exceptions explicitly for libraries.


Copyright law doesn't ban people from reading the source altogether.

That is what is being proposed: banning AI from being allowed to 'read' the content.

The real argument is how does a human brain aggregate knowledge and then profit from it, and is it really that different from an AI model aggregating knowledge.

They both read in data, perform calculations on the data, and spit out something.


This is a religious belief and not scientific knowledge or strong legal argument.


Are you projecting a religious argument onto LLMs?

It seems you are assuming that humans contain some metaphysical 'essence', like a soul, that is doing the thinking?

Please connect the dots here.


Why should it be okay for humans to make a profit off of reading copyrighted content, though?



What is the "open internet"?


For me the question is highly debatable. As far as I'm aware, AI training works by crawling various content from the Internet, so they're using a product (NYT articles) to train an AI, meaning they're using the Times' content to help create a product.

But in this sense, shouldn't they be suing Google as well? Google, as a search engine, also crawls the web and shows their articles in its search results; it may even use them for those quick-answer features.

My five cents on this: the NY Times noticed OpenAI has deep pockets, figured they may have grounds to sue, and decided to try their luck to get some quick, easy money. Now, I don't know whether what OpenAI is doing with ChatGPT falls under fair use.


News organizations in other jurisdictions have already achieved settlements with Google (which has much deeper pockets than OpenAI).

But there's a fairly obvious difference between using content to index it, point to it, and generate revenue for it, and using content to generate alternative content...


If you're using their content to generate more content, doesn't it fall under fair use?


In the case of Google, merely indexing content is not considered fair use. It's a double-edged sword, as media outlets have realized; after all, Google is responsible for a large part of the success of these publications.

But in Germany, for example, Google News basically just copy-pasted articles into their service and monetized them without involving the publishers. That doesn't qualify as transformative even under Google's own rules (see the YouTube TOS and copyright enforcement system).


The trend with Google and other search engines over the past ten years has been for them to incorporate more and more content on their own pages. It's hard to remember that not so long ago Google search results pages were just lists of web pages bereft of any other content.

Today, if you Google for a song lyric, that lyric appears in Google. You get a tiny grey source link to Musixmatch or whatever but why would anyone bother to go there if you have the complete lyric right there on the page you are looking at?

More and more content real estate has appeared on search engines' pages. Answer boxes answering questions, again with a source link few people will use, and of course the large Knowledge Graph panel filled with Wikipedia content (written by volunteers and monetised by the world's richest tech companies).

The result is that it's tech companies and their platforms that make most of the money off content (YouTube is another example). They are the oil companies of today, and like the latter use all sorts of lobbying to make sure things are organised to their advantage.

For all the ingenuity and usability they offer, they behave at least in part like parasites. They should be forced to spread the wealth round a bit more.


For lyrics, why should Musixmatch get the page view anyway? The musician/songwriter owns the copyright.


How does your argument relate to that of the previous poster? Two wrongs don't make a right; never mind that some of these sites do actually buy the rights to display the lyrics to artists' songs.


That depends on whose jurisdiction you're talking about, how it's commercialised, how closely the content resembles the original, whether it incorporates trademarks, etc.


> But in this sense, shouldn't they be suing Google as well? Since Google as a search engine, also crawls the web and shows their articles in their search results, usually it may even use them for those quick answers features.

Not sure if NYT is involved as a plaintiff, but this has happened in Europe: https://en.wikipedia.org/wiki/Ancillary_copyright_for_press_...


Cannibalizing is a good description, the inferior product basically only exists thanks to the superior one (remove training data and it's nothing) but also threatens to eliminate it...


>If, when someone searches online, they are served a paragraph-long answer from an AI tool that refashions reporting from The Times, the need to visit the publisher's website is greatly diminished, said one person involved in the talks.

If, when someone reads a newspaper, they are served a paragraph-long answer from an NYTimes reporter that refashions reporting from local sources, the need to interact with the local sources is greatly diminished.


So? The local source is free to sue the NYT for copyright infringement if they so wish.


This certainly happens in the UK between different newspapers, usually starting with a letter demanding payment, but it's the same idea.

Very few people bother doing this for much the same reason very few bother fighting any of the other terrible decisions made by corporations with legal departments whose annual cost exceeds their personal lifetime earnings.


Not to mention there are already entire news outlets that exist to summarize paywalled journalism. Ever read an article that starts with "The NYTimes reports..."?


Don't humans operate similarly? We gain knowledge through experiences. These AI models effectively condense a vast amount of experience data into weights. Considering the global race in AI advancements, I'm skeptical about the success of these copyright claims. I do find it hypocritical that OpenAI says that other LLMs can't be trained on data generated by their LLMs.


Computers aren't humans and LLMs aren't human brains.

We have no way to reconstruct memories from a preserved brain (yet). The exact ways in which humans form memories and store information isn't even known yet; we're still drilling into the specifics from higher-level concepts.

Modeling the human brain as nodes with weights ignores a lot of biological processes. Blood/oxygen flow, hormones, neurotransmitter decay, physical locality, chemical delays, and interference from things like myelin sheaths are among the physical processes affecting the synapses that are only partially mirrored by computer simulations of neural networks. Unlike neural networks, human brains also don't work off a single clock signal triggering input and output from all nodes in discrete, instant steps.

Human memories are also not just "data in, weights out". They are heavily modified by things like mood, concentration, language(s) spoken, context, and emotional triggers. There's no way to feed a dictionary into a brain. Memory preservation consists of multiple stages, with differing memory types, involving various brain segments with dedicated functionality that can actually grow back due to neuroplasticity in some cases.

Efforts are being made to emulate living cells on computers, but LLMs aren't that. Inversely, efforts are also made to feed brain cells artificial signals and train them to play video games, which results in different behaviour compared to the systems we use for LLMs or other AI systems.


So if we add a bunch of complex processes to an LLM, in order to produce a better analog of a human in terms of degree of complexity if not actual function, does that have some bearing on this copyright question?

It doesn’t seem clear to me that it does.

Is the argument that sufficient complexity in how an “intelligence” processes this copyrighted data leads to the output being transformative vs not transformative in a more simple mind/model?

What if the output is exactly the same, or comparable enough, regardless of the degree of complexity of the mind/model?


IMHO no, because LLMs are not legal persons and most likely never will be, so they can't acquire or operate under laws, privileges, and agreements simply because they exist.


Yeah I agree that legal personhood for LLMs at this point is far-fetched.

This would be a separate argument, though, from the notion I responded to above that the difference in the processes of a human mind and LLM are the reason why "learning" from copyrighted material is a violation of copyright in one case and not the other.


(IANAL) I'm not sure if I understand correctly, but I don't think so. Since an LLM is not a person in legal terms, it really doesn't matter what the difference is. There may be virtually no difference, but I imagine the discussion would still be academic. For example: animals aren't granted rights just because they are in some instances similar, or in other instances even identical, to humans. Primates aren't allowed to walk everywhere humans can just because they possess the ability to walk on two legs.

In my view the biggest issue to raise is the effect of the use upon the potential market for or value of the copyrighted work (https://en.wikipedia.org/wiki/Fair_use). The most spectacular example of damage would be Stack Overflow, though Stack Overflow content is not copyrighted. I think there is little doubt that LLMs drive attention away from original sources. That might be deemed damaging, especially in the long run.


Stack overflow content is certainly copyrighted, but the copyright is owned by the questioners and answerers, not stack overflow.

There's no damage to the potential market or value of the works because they're given away for free by their owners.


> There's no damage to the potential market or value of the works because they're given away for free by their owners.

There may be, because technically copy/pasting SO code is governed by the CC BY-SA 4.0 license, which requires attribution, and things aren't so obvious, especially for commercial purposes.


But a business can (at least in the US)?


But by business we mean an organization of people.


For the moment. How many years until we see the first person-less business?

I’m sure there’ll be a human on the books, on paper.


If a human is willing to be put on the line for any errors the LLM makes, I don't see why a person-less business wouldn't be allowed. The legal system won't tolerate a business where they can't pursue a human in the case of wrongdoing or liability. Something like a DAO won't be allowed in the near future, though.


I don't think it matters how fancy you make your LLM, to be honest.

If you turn an LLM into a person, you may have an ethical and legal basis for treating that LLM like a person. There's no law about artificial intelligence being sentient or not, but law applies only to people, so that'd be the Supreme Court case of the century. I remember the Star Trek TNG episode about this topic, and while the answer was perhaps more obvious with Mister Data, the best arguments for and against synthetic consciousnesses have all been made in that episode.

The law doesn't care for how human-like programmers may think their program is, and neither should it in my opinion. What matters to the law is that the output is a result of an automated process, which comes with a completely separate set of rules and conditions compared to fair use.

The complexity of the program isn't a very good legal defence in my opinion, because there's no clear line where the complexity would be enough to be considered human-like. You could, for example, also claim that a computer is just very good at doing imitations, just like a person can be good at doing imitations on stage, and that an mp3 file is just an elaborate imitation act.

Even with a digital system identical to a physical system I don't think you can state human-ness as an argument. A tape recorder is just a sophisticated, automated way of sending an electric field through a magnet, similar to what a human can do with a dynamo and a spool of tape; a sort of delayed-action theremin, which would turn it into a musical signal. In turn, a neural network can be solved by human brain power if you pay enough people to work on a single iteration for an entire year. Almost everything a computer can do is just a sophisticated way of doing what humans are already doing, so I don't see why this is different when it comes to this topic.

There's no obvious "this is human like now" threshold and I doubt there will be until we know exactly how the human brain works.

When it comes to AI generated works, we don't currently know where the line between copyright violation and copyright exemption lies. If this goes through, it's the third major lawsuit of its type, the other two being actions against Stable Diffusion by artists to prevent it from producing derived works from their art.

IANAL, but I think you can assume that the "but computers are just like digital humans" approach won't fly in court. That's not really important, though; both sides of the coin already have clever lawyers writing up legal defences for their points of view, and it's more than likely that some other factors ("is a model a derived work" (probably) and "is the output of a model a derived work" (who knows!)) will decide the future of AI and copyright. The interesting thing is that academic research is essentially exempt from copyright law, so nobody can demand takedown of their content from research data sets; but whether the commercial branch of AI companies can use their academically generated models to serve their customers is an open question.

As an upside, I don't think the death of ChatGPT is the end of AI. This whole scenario could've easily been avoided if AI companies had paid for their data sets or restricted themselves to works they had a license to (public domain, CC0, etc.), and OpenAI in particular has been pretty brazen in their "we'll see about it if it ever comes up" approach. Companies like Github are probably in the best place legally, since their users have already signed off on "we can take your content and do whatever the fuck we want" terms and conditions.


It's quite different, is it not? I don't get the analogy. These models are scanning, storing, and ingesting more material than any one human could. Not only is the method completely different, the end goal and applications are as well. The analogy basically isn't one, at all.

I'm pretty upset at companies using our personal data to make gobs of money. I'm also upset that they're now using our knowledge work to make even more money. We don't exist as computational nodes for them, a free resource to exhaust. It's a completely one-way street with no consent. So I am in favor of all of these companies getting a reality check.


The question I think is what is the scope of copyright?

I think historically it's about copying wholesale and redistributing for profit. That doesn't seem to be what's happening here.

If I read a publicly available article, and I create a summary, is that covered by copyright? If I get an AI to do that, is that somehow a 'special' type of summary that is covered?

Can a provider of content somehow say "you may not use this for summarization"? Or apply other terms to my consumption if they are making it publicly available?

I think the comparison is that if you ask a person something like:

"Describe a power station?"

They will likely have never been in a power station. They will be leaning heavily on pieces of content they have consumed over the years that presumably were copyrighted. If you create an article that describes a power station, have you breached copyright, and should you be sued by all the people over the years who have produced content about power stations?

Is it suddenly different when a computer does it?


> Is it suddenly different when a computer does it?

The answer is yes.

Everyone here knows where this is heading, and yet people will sit here and defend these companies as if they're on some righteous path. Humans are already becoming disposable statistics and the engines of compute, all for free and all for the benefit of corporations who didn't pay for any of it and don't even contribute back taxes.


Consider a mega consulting firm with millions of von Neumann-like analysts. Together, they've processed the same vast data that LLMs have, though individually none could. It's not an LLM, but its purpose is like ChatGPT's: assisting clients with their tasks. If LLMs concern you because of their data processing, would a firm like this concern you the same way?


Bringing up poor fitting analogies won't change my opinion.


I normally consider these discussions to be more about the people reading the comments than the people writing them. You've clearly made up your mind, but others presumably haven't, so I think it's good he makes these arguments, even if it looks like tilting at windmills to you.


That's a good point, but I guess I was implicitly getting at the fact that continually permuting poor analogies and hypotheticals isn't an interesting discussion.

The fact of the matter here is that parties, such as OpenAI, are benefiting from others' knowledge work, protected or not, in a completely one-sided way and all for free. And I don't feel sorry for companies that need to build Trojan horse products, such as OpenAI and Google, in order to survive off of other people's data that they never compensate for.


Why should you get to profit in the workplace from the mental models you built in college from a commercially sourced textbook? Doesn’t Pearson actually own the knowledge you’re using on the job at $BIGCORP? Why do you get to profit off diffusion based on their work?

Someone else made the analogy - if you read a NYT article and then go do a stock trade based on what you read, didn’t you do that based on value generated by them, and why wouldn’t they own that too?

Like if you don’t want to talk analogies then talk principles, and humans are diffusion machines. When you write a term paper from sources you are simply diffusing those words into a new arrangement, but it’s still fundamentally someone else’s work. Why do you get to profit off the model that results from someone else’s work?

Copyright has mutated into this bizarre chimaera where people (like NYT) are essentially claiming ownership of ideas (and derivative works fall into a similar space) and that’s inherently in conflict with a system that is supposed to promote the creation of works. But it has turned into this bizarre shibboleth that if you came up with an idea it’s yours for life+75 years, completely yours and nobody else can work off it or remix it without crediting you. And that’s an unusual state, humanity hasn’t existed like this forever, the Berne convention is only 50 years old and already falling apart from unintended consequences.

Anyway there’s no proof that copyright benefits the small guy more than corporations. Disney squashing someone for writing a Star Wars fan fiction happens a lot more than Disney ripping off someone’s fanfic character for their series. Like patents there’s this mythos of it benefiting the small guy and that’s absolutely not how it works in the real world.

Also, NYT is a particularly egregious plaintiff here because they're essentially just doing factual reporting of occurrences, which (like a phone book) is not really copyrightable in itself. You can copy a phone book without infringing copyright, and you can train an AI model on a phone book; training an AI model on the NYT in particular is basically doing that but for historical facts and occurrences. The fact that this costs money for the NYT to generate is irrelevant; this is the "sweat of the brow" doctrine and was already swept aside by the phone book case. Just because you spent time/money making it doesn't mean it's copyrightable. A large amount of NYT content is factual observation and tabulation and probably is not copyrightable in the first place. But separating that out is of course going to be challenging for NYT's lawyers!


>It's quite different, is it not? I don't get the analogy. These models are scanning, storing, and ingesting more material than any one human could. Not only is the method completely different, the end goal and applications are as well. The analogy basically isn't one, at all.

So you'd be fine with it if the model only ingested a humanly plausible amount of data? I suspect that would only make their legal issues worse, since the LLM would be much more likely to repeat tokens from the training set verbatim.


Thank you for saying this. I can't believe the number of people calling for the end of copyright or saying that this is "holding back progress". Why do we want big corporations to suck in all our work, put us out of work, and not pay us anything? There's nothing artificial about AIs like ChatGPT; it's all regurgitation of human knowledge.


Humans don't have perfect recall, don't have virtually infinite storage and can't process requests in milliseconds.

I wouldn't be surprised if the avenue of attack is that fair use laws are for humans, not robots, and if an AI has been trained on copyrighted data, that's not fair use.

Also, don't forget that in reality what's happened is that a bunch of copyrighted text is encoded in the LLM in a way a human can't understand, but that the LLM essentially CAN understand.


Yup, and that reproducing it is a copyright violation.

Still a copyright abolitionist though. Maybe now more people will join the fight?


Well, if you use the tool to reproduce copyrighted content, you are violating copyright. But that's not the primary use case, and nobody in their right mind is arguing that it is.

The weights are not a reproduction of the content. They are capable of producing one, but so is a photocopier, and then some; we didn't ban those either, despite them technically being a lot more useful for violations.

Nah, this is expansionist doctrine and agenda for copyright - these companies are trying to copyright style and locations in latent space now.


Photocopiers are for personal use; training an AI is not. If you photocopy 10,000 copies of a copyrighted text and start distributing them, you will get sued.

It would be different if I trained my own AI, for my personal use.


Businesses use photocopiers, there are Xerox shops, etc.

Again, LLMs don’t copy so it’s not a good metaphor.


I know. It's well established.

You brought up photocopiers, and you never said it wasn't a good metaphor, so I don't know why you prepended "Again".

If it's not a good metaphor, that invalidates your point that:

"we didn’t ban [photocopiers] either despite them technically being a lot more useful for violation"

So now you're arguing against yourself.


I'm not. Take the printer instead: much more capable of plagiarism and copyright violation.


I don’t know if this is tangentially related, but don’t we encode text in our neurons in a way a human can’t understand either? Is the storage relevant here?


> Don't humans operate similarly?

I'm going to bypass this question a bit and say, who cares?

Why do we need to treat these things the same way we treat humans? Why can we not say that it's okay if a human does it, and not okay if it's a computer? There's nothing that requires us to establish 'fair' as treating them the same as people.


I care because I can't tell the difference between an LLM summarizing news and blogspam.

We've been living in a world where people read news articles, then write almost the same article on their own website to sell ads. It's been a standard business practice for a decade now; what's so special about LLM-based blogspam? The end impact is still the same: people read the blogspam instead of the source.


> Why do we need to treat these things the same way we treat humans?

Because it would be absurd for it to be legal to do X, but illegal to do so with an efficient tool. Especially when the activity X in question is "learning".


I don't find it at all absurd. Order-of-magnitude differences in quantity often become differences in quality.

It's legal in most places in the U.S. to own a semi-automatic weapon that fires x bullets/minute, but not a fully-automatic one that fires 10*x bullets/minute. It's legal for me to send water into my sewer by flushing the toilet, but not for me to pump the larger volume of water in my sump pump into that same sewer.

It's legal for a child to throw a handful of sand from the beach into the ocean. Do you think it should be legal for anyone to build and operate a sand throwing machine that methodically throws all of the sand on the beach into the ocean?


> Especially when the activity X in question is "learning".

No such thing; "learning" in a vacuum describes nothing here. Might as well ask why I am allowed to make noise, e.g. speak, but when I install 5000-watt speakers on every square meter of the planet it's suddenly a problem, and roll my eyes at the inconsistency of not being allowed to "do X more efficiently with a tool".


Why would it be absurd?

Did we somehow stealthily develop a neural interface that lets us feed the 'learning' that 'AI' is doing into a human brain? Have we actually figured out how to do that?

No, we haven't. So humans are still learning the same way, but with a new tool to condense and summarize some information. Kinda like a textbook in school. But we don't treat those as human beings with human rights do we?


Yes, and if we reproduce something significant that way we have (depending on licences etc.) committed copyright infringement. The definition of significant is a very grey hairy one though – there are many obvious examples where you would say it is definitely copying from memory (so plagiarism rather than research or accidental similarity) or copying directly, and many examples where we'd all agree that the result is so likely to be independently produced that it isn't an issue, but even more examples where there is room for interpretation and disagreement.

This isn't as well-defined for humans as you might think, so can't be well-defined for LLM techniques by comparing them to human agents.

My argument against the current hoovering up of data under various licences for AI training, which they claim can't reproduce anything verbatim, is Copilot. If there were no risk, then why did they only use public repositories and none of their own private ones? Surely they think their code contains good training material, unless they think their own code is gobbledygook. Or, back in terms of licences: if it can't breach the GPL family, then it can't breach their own commercial licensing arrangements either.

Would OpenAI have a problem with humans for using some of their code/documents/other in this way?


Not just humans in general; the NYT specifically relies on the fact that news cannot be copyrighted. Then it claims its articles are sacred...


IIRC the distinction is that facts can't be copyrighted, a particular arrangement of facts can, particularly if something more subjective (analysis or opinion) is included.

So I can write my own article about the sky being shown to appear blue much of the time, but I can't copy someone else's article about the same subject.


How would one even go about a phonebook-style mechanical listing of facts and occurrences? You're describing an impossibility and then saying that garden-variety connective sentences somehow make it not a factual listing.

Like, yes, if you copy a NYT article verbatim it's like copying a phone book, ads and all, and that's infringement. But that's not what an LLM does. The NYT doesn't like their content being used and summarized at all, even in a rearranged form that merely relies on the factual information included in the article. That's what they want to get paid for, and unfortunately that's not copyrightable, so OpenAI is correct that they don't have to pay for it. The NYT disagrees, but again, they are kinda attempting to claim copyright on the factual information because they wrote some connective sentences in between.


> How would one even go about a phonebook-style mechanical listing of facts and occurrences?

The traditional trick there is to include some small amount of fake data in the directory. You know someone has copied your collection of facts instead of compiling their own because it includes your fake facts. Mapmakers have used the method for at least as long as cartography has been part of our recorded history, see https://en.wikipedia.org/wiki/Trap_street for details. As noted in that page, the legal status of this, like many IP related issues, depends upon jurisdiction.
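As a toy sketch of the trick in data terms (the names, numbers, and seed below are all invented), you plant deterministic fakes and later check a suspect copy for them:

  import hashlib

  def plant_canaries(directory, secret, n=3):
      """Return a copy of a name->phone directory with n fake,
      deterministically generated entries mixed in."""
      out = dict(directory)
      for i in range(n):
          digest = hashlib.sha256(f"{secret}:{i}".encode()).hexdigest()
          out["Zz " + digest[:8].title()] = "555-" + digest[8:12]  # improbable fake entry
      return out

  def looks_copied(suspect, secret, n=3):
      """A suspect copy containing our fakes almost certainly copied us
      rather than compiling the facts independently."""
      planted = plant_canaries({}, secret, n)
      return any(suspect.get(k) == v for k, v in planted.items())

  real = {"Alice Smith": "555-0100", "Bob Jones": "555-0101"}
  published = plant_canaries(real, "trap-street-seed")
  print(looks_copied(published, "trap-street-seed"))  # True
  print(looks_copied(real, "trap-street-seed"))       # False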

> But that’s not what a LLM does,

What does it do that means it is only summarizing factual information? While NYT effectively trying to claim copyright on facts is wrong, OpenAI claiming it can't reproduce copyrightable information while it can reproduce/summarize facts found within the same training set seems at best disingenuous.


How are trap streets “the trick” if they are not copyrightable or enforceable under US law? Like the Supreme Court has specifically considered and dismissed this “one clever trick to keep people from copying lists of facts” because to allow it would completely undermine the idea of facts and figures not being copyrightable.

> Trap streets are not copyrightable under the federal law of the United States. In Nester's Map & Guide Corp. v. Hagstrom Map Co. (1992),[3][4] a United States federal court found that copyright traps are not themselves protectable by copyright. There, the court stated: "[t]o treat 'false' facts interspersed among actual facts and represented as actual facts as fiction would mean that no one could ever reproduce or copy actual facts without risk of reproducing a false fact and thereby violating a copyright ... If such were the law, information could never be reproduced or widely disseminated." (Id. at 733)

And yes the EU has the concept of “database rights” but notionally there is still supposed to be a creative step required in the selection or arrangement of records. So just a raw copy of the numbers in a telephone directory is theoretically not copyrightable, but a telephone book might be because of the creative/transformational step. It’s possible this might be such a low bar that it’s impossible to fail to clear, but, at least on paper you can’t copyright mere facts and figures either.

But either way it’s generally true that simple facts and figures are not protected and trap streets are a discredited and clumsy attempt to work around this.


> How are trap streets “the trick” if they are not copyrightable or enforceable under US law?

US law is not the only law.

Current US law has not been as it is for the entire existence of the US.

Trap streets and other such devices have existed much longer than the US.

As well as copyright law, the trick can help detect breaches in contractual agreements that cover use of information from services. Actions based on such breaches do not necessarily end up in a court of law at all.

Out-of-court settlements do not necessarily rely on the letter (nor intent) of the law, but often instead on the expense (money directly, time, potential reputational risk) of defending a position even if the law is on your side. The threat of action is often enough to make the other party cave, and such actions will usually happen well out of public view (I know of one instance involving a list of phone numbers that I won't go into in detail, because while there is no NDA or such in force, this discussion is not worth irritating people whose confidence I hold!).

> clumsy attempt to work around this.

That the trick is clumsy does not mean it isn't still commonly used (it absolutely is) or that it has not been used in successful cases between map makers and such (it has, one significant example is given in the paragraph directly following the one you selected to quote from the page linked in my previous post).


Yes, humans operate similarly. We gain knowledge through learning and assimilation of experiences. But there is a cost associated with each new input, in one form or another. You can read and reproduce the NYT's articles or their content and build on what you have learned, but aside from piracy, you have to somehow pay a cost associated with access (ads, subscriptions, etc.). The LLM does not pay in an equal way.


I don't think any unaided human or collective of humans could rent seek on something close to the sum total of human knowledge and expression.

Irrespective of copyright issues, the question is how to avoid creating a new class of large rent seekers in the LLM space.


You don't think perfect recall and sheer scale make this apples to pumpkins? What a disingenuous argument.


Well, there is also the fact that if you train LLMs on LLM output the quality degrades very quickly. It is not a good thing to do.


Humans can be sued for plagiarism, so AI should be as well.


True


I think that the proper outcome for all of this would be acknowledgement that the current copyright laws very poorly regulate this aspect, that the key parts of any such legal action are at the not-really-described edges of law because these edges weren't relevant until now; and so instead of waiting for courts ruling on how law-as-written-now applies and accepting these rulings, we will likely get some new legislation explicitly setting what the legal norms should be.

In the short term, of course, the existing law matters, but the main discussion should be not on how to apply existing law but how to ensure that the new laws match what we-the-people would want.


100% this. But I doubt it will happen in the US, unfortunately.


The tech industry has sufficient money and influence for lobbying to push this one through. The media industry got the DMCA adjustments to copyright through reasonably fast, and the tech industry is even more powerful and wealthy.


Yeah but in this case there are extremely influential forces on both sides of the issue.

I think when this happens it is normally easier to block a law than to push it through, so I expect the current laws will remain for the short/medium term.


These mega LLMs that can autonomously roam the web and consume original content are basically the "I made this" meme[0] and having some legal precedent would be good for all users of the web.

[0] - https://knowyourmeme.com/memes/i-made-this


> having some legal precedent would be good for all users of the web

Honestly that's true whichever way it falls. The sooner it's clear what's allowed and what's not, the better for everyone.


My concern isn't copyright law, but that if trained on the NYT, these LLMs are going to be favorable toward starting conflicts in the Middle East.

https://fair.org/home/20-years-later-nyt-still-cant-face-its...


Hopefully soon enough (within a decade?) we’ll all be able to run large language models on cheap consumer devices, and model weights containing everything including NYT will be floating around in the form of warez readily consumed by anyone with a modicum of savvy, whether NYT likes them or not. They can’t stop progress.


What’s going to be the name used for the laws that attempt to tackle machine paraphrasing?


"Protecting Against Rephrasing: Respecting Original Texts Act" or the "P.A.R.R.O.T Act"

or

"Battling Unlawful Language: Limit Scraping and Harness Initial Texts Act" or the "B.U.L.L.S.H.I.T Act".


This is good. Have you considered working for NASA or a major government defense contractor?


Copyright law, which already covers mechanical duplication, including mechanical duplication with automatic alterations to evade detection while continuing to reproduce protected elements of the original.


> including mechanical duplication with automatic alterations to evade detection while continuing to reproduce protected elements of the original

That's super interesting and is news to me. Thanks for sharing. Would you mind linking to relevant statutes or court decisions?

(This isn't a "citation needed" post -- I believe you, and I'm genuinely curious to read more, but can't find anything!)


Paraphrasing is not the issue. The issue is that OpenAI copied the Times’ creative works into a GPU to train a model. That copy was likely neither licensed nor fair use.


Do you think OpenAI doesn’t subscribe?

I too load creative works into all sorts of temporary structures in order to read the paper. I don’t need to license it, I pay for a subscription.

Should I pay more if I memorize the paper? Should I pay more if I read it to my sick friend in the hospital? Should I pay more if I save copies to my own hard drive and grep for words in the files? Should robots have a higher subscription price?

This whole “they copied it into a gpu” doesn’t matter. People read and interpret. Robots read and interpret. I don’t want to live in a world where every specific device and use needs to be licensed. Especially ex post facto. That will suck so hard.


> Should I pay more if I memorize the paper? Should I pay more if I read it to my sick friend in the hospital?

It's not even close to either of those things. It's "should I pay more if it infinitesimally affects my perception of English grammar or knowledge of a subject". The LLM isn't, for any functional purpose, memorizing; it's getting weight updates as it learns from these examples, which are teaspoons of information in a sea of trillions of tokens.
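A toy numeric illustration of that scale of influence (a hypothetical one-parameter model fit by SGD; nothing like actual LLM training, just the idea of per-example nudges):

  import random

  # Fit y = w*x by SGD and compare the size of a single example's
  # weight update to the weight it is updating.
  random.seed(0)
  w, lr, updates = 0.0, 0.01, []
  for _ in range(100_000):
      x = random.uniform(-1, 1)
      y = 3 * x + random.gauss(0, 0.1)
      grad = 2 * (w * x - y) * x      # gradient of (w*x - y)^2
      step = lr * grad
      w -= step
      updates.append(abs(step))

  print(f"final weight: {w:.4f}")                                    # close to 3.0
  print(f"median per-example nudge: {sorted(updates)[50_000]:.1e}")  # tiny by comparison

Each individual example barely moves the weights; it's the aggregate that matters.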


How about “they copied it into a terrific text-to-speech engine and sold ads against the resulting podcasts”?


Reading a text aloud has been ruled to be a performance and a copyright violation.

I think the issue is that LLMs don't make a copy or distribute a copy. They use the content to create something else. I don't remember the copyright term of art for whether it is transformative enough. But it basically says I can't copy Starry Night, but I can create a painting with the same color scheme and themes as long as it's different enough from the original.


> This whole “they copied it into a gpu” doesn’t matter. People read and interpret. Robots read and interpret. I don’t want to live in a world where every specific device and use needs to be licensed. Especially ex post facto. That will suck so hard.

I think you already live in that world (though IANAL), there was a ruling that the Glider cheat tool for WoW was a copyright violation even though it was poking around inside the local copy necessarily made in RAM as part of normal usage of WoW.

https://arstechnica.com/gaming/2009/01/judges-ruling-that-wo...


A subscription doesn't give you an automatic escape hatch out of copyright law.

here's their ToS, which is pretty clear about what you cannot do: https://help.nytimes.com/hc/en-us/articles/115014893428-Term... (relevant parts below)

Without NYT’s prior written consent, you shall not:

...

(2) use robots, spiders, scripts, service, software or any manual or automatic device, tool, or process designed to data mine or scrape the Content, data or information from the Services, or otherwise use, access, or collect the Content, data or information from the Services using automated means;

(3) use the Content for the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system.

...

(5) cache or archive the Content (except for a public search engine’s use of spiders for creating search indices);


Violating their terms of service doesn't really matter in terms of copyright law though. The statistical properties of that text (which is what the engine snarfs in) aren't protected by copyright.


Full text New York Times articles are available through subscription services other than nytimes.com.

As an example, my local library offers full-text NYT articles through both nytimes.com and ProQuest.

Notably, ProQuest's terms only explicitly ban scraping metadata and developing software or services that "compete or interfere" with ProQuest products:

https://about.proquest.com/en/about/terms-and-conditions


And your local library and ProQuest are also bound by the same laws, even if they have an existing licensing agreement. From the ToS you just linked (note the use of the terms "licensor" and "third party", which would be the NYT in this case):

> Restrictions. Except as expressly permitted above, Customer and its Authorized Users shall not:

> Remove any copyright and other proprietary notices placed upon the Service or any materials retrieved from the Service by ProQuest or its licensors;

> Perform automated searches against ProQuest’s systems (except for non-burdensome federated search services), including automated “bots,” link checkers or other scripts;

> Provide access to or use of the Services by or for the benefit of any unauthorized school, library, organization, or user;

> Publish, broadcast, sell, use or provide access to the Service or any materials retrieved from the Service in any manner that will infringe the copyright or other proprietary rights of ProQuest or its licensors;

> Download all or parts of the Service in a systematic or regular manner or so as to create a collection of materials comprising all or a material subset of the Service, in any form.

> Store any information on the Service that violates applicable law or the rights of any third party.


Grandparent is right that people keep conflating copyright and licensing. That term doesn't allow you to violate the copyright of a licensor, but it also doesn't rule out (e.g.) fair use.

Fair use would have to be blocked via a license, and it's going to be difficult to argue that someone agrees to a license merely by turning on their radio. Responding to unauthenticated internet requests with content is the internet equivalent of broadcast, and similarly, the LinkedIn case held that this did not allow LinkedIn to impose terms of service in a contract of adhesion in this fashion.


Actually, if you had a subscription you might be more screwed than if you crawled it with free access, since having the subscription means agreeing to the ToS.


Are you using this knowledge to produce a product that materially undercuts NYT revenue?


That doesn't matter, though. If I licensed material, why should it matter whether I compete or not?

If I watch Lebron James play and use it to develop an athletic training program, does it matter if I play baseball? Or WNBA? Or does it only matter if I play against him in the championship?


Licenses are bound to the purposes specified by the license. So go read the fine print.


Copyright doesn’t allow you to impose a contract of adhesion to viewers of material you’re broadcasting out into the public. I don’t agree to your contract just because it’s coming into my radio, you can’t create a term of service that imposes additional restrictions on what I can do with it. Copyright may apply but that doesn’t give you the right to enforce additional contracts of adhesion simply based on consuming the content.


The question though is if we want to stifle innovation by requiring LLMs get permission from every single relevant party on the internet.

And I usually lean anti-corporate too, but banning people in the US from using data for LLMs might just mean they start being trained somewhere else that doesn't care as much about US law.


Right, and Congress can address this concern at any time. (I actually expect this will happen sooner rather than later.)


Or Fox News or Breitbart or whoever should just declare that all its content is licensed as free for LLM training.


Did the Times grant a license to every router on the internet to transmit its intellectual property to other routers? If not, the judge should grant an injunction contingent on requiring the Times to verify that every person who accesses their content is doing so only over routers and other devices with express written authorization, for every step in the process. Maybe even extend it to browsers and client libraries for encoding/decoding, to be accurate.

I say if people want to come up with bullshit lawsuits, we should play into them and force those people to suffer the consequences of said bullshit.


You don’t need a fair use exemption for transient copies in service of licensed or fair uses.

Computers and networks have been around a long time. These issues have been given a good workout.


But the copy used during training is itself transient, so all this boils down really to the question of whether training a machine is fair use. Which can't be answered here exactly because the concept of fair use is deliberately vague, so this will boil down to a lawsuit and probably go to the Supremes. The USA will work something out that's reasonable as they always do and, lacking AI companies and often the concept of fair use to begin with, the rest of the world will never work out the necessary case law and fall even further behind. Possible exception: UK, which does have some significant LLM companies and also fair use law, albeit as is often the case with the UK these firms are US/UK hybrids with significant presence in both countries and the legal HQ is usually in the USA.


It's not even a question of "fair use" because OpenAI isn't providing anybody a copy. Probability models really just aren't copyrightable to begin with.


> It’s not even a question of “fair use” because OpenAI isn’t providing anybody a copy

“Fair Use” can apply to essentially any of the exclusive rights under copyright, including making (with or without distributing) a copy or derivative work.

> Probability models really just aren’t copyrightable to begin with.

“Probability models” that are built from copyrightable works either:

(1) have sufficient human creative input to be copyrightable on their own (in which case they may still infringe copyrights applicable to their source material as derivative works), or

(2) do not have sufficient human creative input to be copyrightable, and thus are a form of mechanical copy of the data from which they are developed (which, to the extent it either is, in aggregate, a copyrighted work, and/or contains copies of other copyrighted works, is protected by one or more copyrights, which do not cease to apply to the mechanical copy, and which an unlicensed mechanical copy would violate unless it fell into an exception like Fair Use).


But OpenAI is neither copying nor deriving a work. Style (which can be described as a probability model) is not copyrightable.

You're halfway there with #2. The output is not copyrightable, but unless you can actually point to a sequence of words from the original it can't be infringing.


> But OpenAI is neither copying nor deriving a work.

They absolutely are copying in the course of training the model, and they are doing something that often looks a lot like copying when producing output with the model.

> Style (which can be described as a probability model) is not copyrightable.

Style is not all that can be described in an LLM's "probability model"; otherwise LLMs would never be able to reproduce content from the training set, which they do.

> The output is not copyrightable, but unless you can actually point to a sequence of words from the original it can’t be infringing.

While verbatim copying (“a sequence of words from the original”) is infringing, mechanical copying with a mechanical substitution filter (even a wickedly complex one) is also copying, and violates copyright.

And ChatGPT is quite capable of producing verbatim text from its training set.


> They absolutely are copying in the course of training the model

Only in the sense that a Cisco router is copying in the course of sending me the article, which we've all agreed doesn't count as infringement.

The bigger problem is that the plaintiff has to show it is more likely than not that that sequence of words came from their text and not some other source, which is going to be obscenely difficult.

> And ChatGPT is quite capable of producing verbatim text from its training set.

Granted, and then the copyright holder could sue for infringement at that point. Exactly whom he should sue is a more difficult question.

Look, I get it: you want copyright to allow authors to say "you can't do that with my work"; I'm sympathetic, but it just doesn't give authors that power.


> Only in the sense that a Cisco router is copying in the course of sending me the article, which we've all agreed doesn't count as infringement.

But it does count as infringement if it's not explicitly or implicitly licensed (because it is necessary to a use that is licensed, or necessary to a use that does not itself require a license but is the normal use for which a licensed copy, which you have, is sold and used) and it is not itself Fair Use (usually because it is necessary in the course of a use which is itself Fair Use).

> The bigger problem is that the plaintiff has to show it is more likely than not that that sequence of words came from their text and not some other source, which is going to be obscenely difficult.

The plaintiff has to (1) show facts from which a judge concludes that a reasonable jury might conclude that, and (2) get the jury to conclude that.

This can be difficult, and it might not be trivial in this case, but in practice the combination of opportunity and non-trivial similarity tends to put more weight on the defense to show a convincing alternative explanation. Outside of conclusions that a judge views as completely unreasonable based on the facts, the civil burden of proof boils down to what a jury feels is more likely, not some stricter standard.

> Granted, and then the copyright holder could sue for infringement at that point. Exactly whom he should sue is a more difficult question.

Well, who else might be liable depends on the specific circumstances, but with OpenAI controlling the whole course between the first copy of the copyrighted work and the infringing end product, and doing it all for commercial gain, it is clear that OpenAI would be on the hook, civilly and potentially criminally, for any infringement.

> Look, I get it: you want copyright to allow authors to say "you can't do that with my work"; I'm sympathetic, but it just doesn't give authors that power.

I want copyright to be both more limited in its exclusive rights and either shorter or costlier to the copyright holder than it is. But what I want is not what the law is.


> OpenAI would be on the hook, civilly and potentially criminally, for any infringement

I don't think you've thought about this hard enough


Fair use also covers generating derived works, and arguably the AI is a derived work.


That would be hard to argue. You can't point to a sequence of words that's in the original that was copied into the OpenAI output.


That's not how derivative works are defined. For example, translations to other languages or extensive summarizations (condensations) count as derivative works even though no words are directly copied.


> You don’t need a fair use exemption for transient copies in service of licensed or fair uses.

The transient copy isn't for a licensed use, and whether it is a Fair Use is specifically the subject of debate. So this really is basically an admission that, but for the potential applicability of Fair Use, the use of the copyrighted material in training is a violation of copyright.

Also, even if the training is fair use, that doesn't mean that the copies of the source material produced by OpenAI and distributed to their customers using the model are fair use. Just because making the tool is a transformative fair use doesn't mean that using the tool to generate copies of the material which was used to train it, copies which are significantly less transformative than the model itself, is Fair Use. (And the fact that one of the functions the model is used for is this commercial, for-profit copying of the source material by the maker of the model is, as much as I believe AI model training on its own is quite likely to generally be fair use, an argument against the model training being fair use in this case.)

> Computers and networks have been around a long time.

True, and commercially producing and delivering copies of copyrighted works in a manner which substitutes for the original work in the marketplace, no matter what intermediate steps go into doing that, and no matter that computers or networks are used in those intermediate steps, is pretty much the clearest case of violation of copyright you can get.

> These issues have been given a good workout.

Some of them have, some of them have not. Whether, and under what conditions, training an AI on source material that may be subject in aggregate to a compilation copyright by someone else, and which further consists of individual works that have their own copyrights, might be "fair use" is not one of the issues that has been given a good workout. Neither, because producing predictive models in this way has not previously been common, is the question of whether, unlike other intermediate tool use, using such a model in the course of doing what would otherwise be an infringement (producing a copy of specific copyright-protected works, commercially, for a customer at their request) is no longer a violation because the use of the model somehow isolates it from liability.


OpenAI is pretty clearly using their work to make derivative content that in certain cases (CNET) is a direct competitor. Honestly, this seems open and shut


No more open and shut than the NYT having the right to sue people writing editorial news stories with no new content based on reading their news (along with many other sources).

This issue is ultimately going to come down to the transformative clause of fair use. The fact is that the _model_ is unquestionably a transformative product of the inputs, and a judge ruling otherwise is going to cause a cascading shitstorm of litigation and put a chill through the creative economy. The outputs of the model under certain conditions can be guided towards copyright infringement, and any sane ruling will focus on protecting rightsholders from overly derivative model outputs. In all likelihood the precedent will be that the standard for being transformative will be raised for "algorithmically generated" content, and the people who distribute that content will still be fully liable in the event of infringement, with "I didn't know, the AI did it" not being an acceptable defense.


> derivative content

If I read five calculus textbooks and write a new one, I don't think that's derivative content (or maybe it is?) Seems like that's what an LLM does - read many works, write a new work.


It's not derivative, though. For derivation you have to literally point to sequences of words in the original that are also in the alleged infringer, and those sequences have to be long or unique enough that they couldn't have come from somewhere else or from common English usage.
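Roughly, the mechanical part of that test is "find the longest run of words the two texts share and ask whether it is long or unique enough". A crude sketch (purely illustrative, not a legal standard):

  def longest_common_word_run(a, b):
      """Longest run of consecutive words shared between two texts."""
      aw, bw = a.split(), b.split()
      best, dp = (0, 0), {}
      for i, wa in enumerate(aw):
          for j, wb in enumerate(bw):
              if wa == wb:
                  dp[i, j] = dp.get((i - 1, j - 1), 0) + 1
                  if dp[i, j] > best[0]:
                      best = (dp[i, j], i)
      n, end = best
      return aw[end - n + 1 : end + 1]

  original = "the quick brown fox jumps over the lazy dog"
  suspect = "a brown fox jumps over a sleeping dog"
  print(longest_common_word_run(original, suspect))
  # ['brown', 'fox', 'jumps', 'over'] -- a four-word run, hardly unique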


You can make a derivative work by using the character of Harry Potter in your own book.

Fan art and fan fiction are derivative without copying sequences of words


> to make derivative content

How would they prove this? Is it safe to say that each article used has a nearly meaningless influence on the weights?

Could this be used as a defense? Perhaps train a (smaller) model, remove a single article, and show that it doesn't influence performance?
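A sketch of what that test could look like, with a toy bigram language model standing in for the real thing (made-up corpus; with a real LLM the retraining is the expensive part):

  import math
  from collections import Counter, defaultdict

  def train_bigram(corpus):
      """Count-based bigram model with add-one smoothing."""
      counts, totals, vocab = defaultdict(Counter), Counter(), set()
      for doc in corpus:
          words = doc.split()
          vocab.update(words)
          for a, b in zip(words, words[1:]):
              counts[a][b] += 1
              totals[a] += 1
      return lambda a, b: (counts[a][b] + 1) / (totals[a] + len(vocab))

  def perplexity(prob, text):
      words = text.split()
      logp = sum(math.log(prob(a, b)) for a, b in zip(words, words[1:]))
      return math.exp(-logp / max(1, len(words) - 1))

  corpus = (["the market rose sharply today"] * 50
            + ["the senate passed the bill"] * 50
            + ["exclusive report on the merger"])   # the one "article"

  probe = "the market rose sharply today"
  print(perplexity(train_bigram(corpus), probe))        # full corpus
  print(perplexity(train_bigram(corpus[:-1]), probe))   # retrained minus the article

The difference comes out small (and here mostly from the vocabulary shrinking), which is the kind of demonstration the defense would be after.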


It’s not making a derivative product though. The fact that they compete or not doesn’t matter for copyright.

It’s not like it’s ok to violate copyright if you don’t compete. It’s still illegal to take a NYT article and print it on a t-shirt.

The issue is that copyright law doesn't prevent this kind of model training, as there's no clearly derived work. I don't think that's been tested in the courts yet, but I expect it won't be found to be copyright infringement, because there's other precedent that influence is not infringement.


If I read it and memorize it, my brain has made a copy.


Not as far as copyright law is concerned. If you then use your brain to write it back down—or sing it as a song in Central Park—now you have created a copy under the law.


Right. But the comment above said the issue is with the copy made for training. The "read it and memorized it" copy, not the "write it back down" copy.


That copy didn’t go into a human’s brain. It went into GPU memory. It’s a copy under the law, no different from copying a Taylor Swift mp3 onto a flash drive.

Whether that copy was fair use is the key question.


> That copy didn’t go into a human’s brain. It went into GPU memory.

But as you point out in the router case above, it's transient.


If I buy a license to listen to a Taylor Swift mp3, I can copy it onto a flash drive (or anywhere I like to use it). And that’s fine until I distribute copies to others.


Yes, and if OpenAI had a license to use the Times’ works as model training material this would be a non-issue, of course.


There's really not a "license" in that sense for text. There's no magical way to use copyright to protect a probability model built from your words, because that's style, and style is not copyrightable (nor are the underlying facts the NYT reports).


I didn’t buy a license for Taylor Swift that explicitly allows me to copy to a flash drive. Or to listen with only one ear, etc.

I bought a license to listen and use.


Great, can I offer you as a service to billions of people?


Sorry, but I don't expose my endpoints to just anyone.


Is that a challenge I should respond to?


I somehow highly doubt "my brain makes copies too, so copyright law is invalid" will clear the legal bar for invalidating their lawsuit.


Probably not. But it still feels like a weird place to be. Presumably they had subscription access to be able to scrape the content in the first place. It seems strange to be bringing the hammer down on information obtained from reading that content and the conversation stemming from it: I mean isn't that the whole point of the press? Inform people and get them talking? Why is it different that it's also informing LLMs and getting them talking?

Anyway, I've always been a bit prickly about IP stuff ever since a troll lawyer threatened me over a TI-BASIC game when I was like 14. I'm also sure I'm completely wrong-headed about this whole thing and overly anthropomorphizing the LLM.


Personally, I think the Times has a far better case for presenting mechanical copies (including with mechanical alteration) than it does with model training.

The tool-building of model training is more likely to be fair use than the use of the tool to provide mechanical copies of copyright-protected material that competes directly with the original in the market.


What makes you think that? The most relevant case is probably MAI Systems, where even a copy of a licensed program into RAM for an unlicensed use (diagnosis and repair by a third party) was deemed a violation.

Congress took the ruling seriously enough to carve out a statutory exception for that use. I don’t think there’s any comparable exception here, especially when the ultimate purpose is to build a commercial product capable of producing works (news reports) in precisely the same market as the original works.


Fair use only covers very limited circumstances, which probably do not include selling a subscription (ChatGPT+). If you’re selling a repackaged reproduction of someone else’s copyrighted works, that’s never protected by fair use.


But you generally can't copyright the underlying facts, only the creative expression in the article/work itself. If they can get it to extract just the facts and how they relate to one another, that wouldn't really be protected by copyright. They might try to bring back the "Hot News" doctrine though.


The models not only were trained on copyrighted works, but when prompted will substantially and repeatedly reproduce copies of creative work (not just facts). That's infringement. https://www.theverge.com/2023/7/9/23788741/sarah-silverman-o...


My naive reading is that it only applies to time-sensitive facts. Considering the training times required, and the knowledge cutoffs, could it apply here?


But you can't copyright the underlying facts or the style of writing, just the literal words used in your article. If you can't point to a sequence of words in the original that appears in the later work, that later work isn't "derivative".


Google does the same to produce a search index.


you can easily opt out of that, or control it to your heart's desire (including what snippets to show), and it will be honored. There is no way to opt out of this bullshit. So no, not the same at all.
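
For reference, the opt-out is just crawler directives; e.g., a robots.txt entry like:

    # robots.txt -- ask Google's crawler to skip the whole site
    User-agent: Googlebot
    Disallow: /

plus per-page controls like <meta name="robots" content="noindex, nosnippet"> for indexing and snippets.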


That's a courtesy of the search engine, not a requirement of copyright.


Guidelines for

Assuring

Responsible

Behavior in

AI

Generated

Expressions

G.A.R.B.A.G.E.


> A top concern for The Times is that ChatGPT is, in a sense, becoming a direct competitor with the paper by creating text that answers questions based on the original reporting and writing of the paper's staff.

This seems to me to be completely standard in the newspaper industry. Many times every week, I see stories in the form "The [Major_News_Outlet] reports that [Event_X occurred] or [their investigation revealed Y], and here are the details [...]".

Copyright protects the expression of an idea, not the idea itself. If you write a history of Isaac Newton or the invention of semiconductors, I cannot copy that wholesale and sell it as mine, but nothing prevents me from writing my own version, even using the same facts and citing your work.

I'm quite sure that I could provide a service where a bunch of workers read NYT articles and write brief summaries. I'm not sure they would even need citations, as long as we don't copy chunks wholesale.

If OpenAI is simply parroting the words of the NYT articles without Fair Use constraints (short blurbs), it seems they have a problem. If they are fully re-writing them into short non-copying summaries, it seems the NYT has a problem.

It'll be interesting to see how the courts sort this out.


The precedent people should be paying much more attention to is sampling in music. When it first arose, it really wasn’t clear what status it had. There was at least a decade when people basically thought it was legal to use small samples of other recordings because they were small and the new use turned them into something unrecognisably different. Which was kind of logical, actually, but turned out not to be true!

The current legal requirement to get clearance for all samples only arose after a bunch of court cases in the late 80s/ early 90s, mostly involving quite obscure musicians.

There are a lot of people on here who assume that ‘logic will prevail’ in the courts on questions like the use of copyrighted material in training data. History shows that this really isn’t a safe assumption. The courts have historically been extremely favorable to copyright holders. It would be foolish to underestimate the legal risk to OpenAI et al. here.


Interesting point…

> A top concern for The Times is that ChatGPT is, in a sense, becoming a direct competitor with the paper by creating text that answers questions based on the original reporting and writing of the paper's staff.


It’s sort of ironic, though, since the Times is “news” whereas GPT is built on historical texts. Of course that gap will tighten until we have near-real-time models. But that’s not the reality today.


I think it’s ironic because news organizations observe the world and then synthesize it into reporting.

Imagine if someone doing a thing sued NYT for watching them do it, linking it to other issues and producing a new article.

News itself is derived content that depends on other people doing things.


The Times is not merely news; it is the paper of record for the United States (meaning its historical articles are indexed and commonly used to establish prior facts).


It's not like that's an official designation or anything. It's just something people say.

If you're looking to prove a prior fact in a court case, you're perfectly allowed to cite the Washington Post or the Boston Globe or anything else that has a good reputation. There are lots of "papers of record" in the US -- you're not limited to one per country:

https://en.wikipedia.org/wiki/Newspaper_of_record#By_reputat...


"I'll keep saying it every time this comes up.

I LOVE being told by techbros that a human painstakingly studying one thing at a time, and not memorizing verbatim but rather taking away the core concept, is exactly the same type of "learning" that a model does when it takes in millions of things at once and can spit out copyrighted writing verbatim."

Personally I think they argue that way because they get off on being contrarian out of spite, but to me it's just a signal of maliciousness and stupidity all at once.


If a human reads something, it goes into their brain, and it becomes an influence on future works they produce.

This doesn't mean that copyright extends into my brain. A company can't copyright what I'm thinking about. And what if I do try to paraphrase something from memory, from a few sources, and happen to spit out a very similar sentence? Am I breaking the law?

To go further: since essentially all of a person's knowledge is fed in from hundreds of books, movies, TV shows, and the internet from birth, everything in the brain is a product of something with a copyright. So anything produced is some amalgamation of copyrighted works.

Why not apply a similar argument to AI? When you ask it to do something like "write a screenplay for Othello using dialog like Tarantino, but with a bit of Baz Luhrmann's style", what it produces is 'as unique as a human's' would be, or just as filled with copyrighted influences.


> Am I breaking the law?

The intent of [US] copyright law is to promote new works of art (which can be derivative). So copyright did exactly what it is supposed to do in your analogy. Plus, you're human, which gives you special rights that software doesn't possess.


But I am allowed to at least read the copyrighted material, from which it goes into my brain, becomes mixed up with everything else, and gets spit out to produce something 'new' or 'newish'.

Some of these lawsuits are trying to prevent the AI from even 'reading' the material. It can't even be used as an influence.

Wouldn't it be better to treat the products of the AI under the same laws as humans? If the new 'product' is 'too close' to something existing, then they get sued, just like a musician whose song has a few notes that sound a little too close to someone's song from 30 years ago. The songwriter was allowed to listen to the music; it went into their brain and became an influence. If that influence becomes too great, then they can be sued.


> I am allowed to at least read the copyrighted material, from which it goes into my brain, becomes mixed up with everything else, and gets spit out to produce something 'new' or 'newish'.

Yes, because you're A) human and B) that is how copyright is supposed to work.

AI doesn't enjoy the rights of people. AI is a "talking book" and copying, storing, then repeating someone else's work from your talking book would (likely) run afoul of [US] copyright law.


Seems like copyright works in part because of the effort involved in producing something similar to something else. You can't just copy, and it takes effort to make something different enough; that gives the original a bit of a "moat".

If you take away enough of that effort, the investment in new stuff becomes unviable, perhaps.


Agree. I think the 'effort' reduction is the real difference that makes AI products different from human products.

Though this seems like a different issue from copyright.

Like, we see that this can impact society, so we need some new laws to guard against it.


Why would that be better? You seem to assume that because these models have some utility to their creators they should be allowed, even when they have negative utility to the others whose work they consume. Why should that be true?


I don't think there's been a tool in the history of mankind with more potential for new works of art than AI tools.


> And what if I do try to paraphrase something from memory, from a few sources, and happen to spit out a very similar sentence from memory. Am I breaking the law?

If you're doing this for a commercial purpose, yes. Recording artists have been successfully sued for accidentally reusing a melody they claim to not remember ever hearing, provided it really does sound sufficiently similar to the original.


"If a human reads something, it goes into their brain"

Humans aren't property. LLM models are. So the comparison is irrelevant and I'll stop you right there.


When you think about it like that, if LLMs are based on the human brain, did we basically reinvent slavery?


Not just for the LLMs


True, any neural network. This is a very grey area and rushing to call it "property" ignores what the technology is emulating.


> if a federal judge finds that OpenAI illegally copied The Times' articles to train its AI model, the court could order the company to destroy ChatGPT's dataset, forcing the company to recreate it using only work that it is authorized to use.

I'd like to see it happen, but it sounds unrealistic.


If I read thousands of NYT articles to improve my writing skills, and then write an article of my own, is that a copyright violation?


The problem with the "a lossy mathematical translation of its inputs is exactly like how a person learns" argument, even if courts don't find it ludicrous, is that people absolutely can be and are found guilty of trademark violations when they read thousands of pages of The Lord of the Rings and then write a fantasy novel full of Tolkien's character names for profit.


A fantasy novel with Tolkien's characters' names is an evident copyright violation regardless of how it was generated. That's not what's happening here.


No, but the point is that OpenAI expects to be held harmless if someone uses its tools to violate trademarks [unlike if a human employee had been paid to read Tolkien and write a story about Gandalf and hobbits], because it's not an OpenAI employee consciously doing it, just a model transforming its given inputs.

If GPT is blameless doing some things because it's a deterministic model, not an agent, then the "it would be okay if a person were taught like this" defence doesn't apply in other areas.



The issue is that you know well enough where the dividing line is between a fully original article you write and one that plagiarizes. LLMs on the other hand don’t have that awareness, and can plagiarize by accident.


A human is not a machine.


Philosophical Mechanism and other philosophical views with Cartesianist roots would beg to differ.


Laws are certainly up for philosophical debate, but at least in the USA, that debate typically has to happen in the legislature rather than the judiciary.

More importantly, though, most judges are not philosophers.


if you had a subscription, no.


I think it's the most natural way it would happen: pit powerful interests (publishers) against powerful interests (Microsoft/ClosedAI). The only surprising thing is that it's taking publishers so long to notice that this fresh abuse of copyright at grand scale will cost them.


Which is why OpenAI's GTM strategy involves the biggest players in each industry.

That said, the outcome is unlikely: we have trained AI under ‘fair use’ for more than a decade at this point. It’s the application of the technology that is shifting the perspective, not the act of training.

Every computer vision system in the world is trained on mostly public data for example.

Furthermore, the LLM's purpose is not to generate news, so the NYT will have to argue about the value of its archive data. Many jurisdictions have thresholds for how much of an original work must contribute to the derivative before it stops being fair use or becomes plagiarism. Given the size of the datasets, good luck.


> we have trained AI for more than a decade as ‘fair use’ at this point

Fair use is about use. Spellchecking ML, search-engine ML, etc. are all different from ML that produces content.


They would simply move AI training operations to other jurisdictions where it's legal.


They can train wherever they want, but if they want to make the service available in the US, then US law applies.


If writing a few paragraphs around something someone else said is copyrightable to you, then isn’t GPT writing a few paragraphs around your work copyrightable to OpenAI too…


I think anybody should have the right to protect the word combinations they own by not publishing them on the internet.


Copyright law doesn’t stop applying because your website is accessible over the internet. I think generative AI is cool, but training an AI to do the thing you do, using your words, is pretty clearly not allowed under current copyright law. Not only that, it makes people who use AI look like bad people who are fine with stealing other people’s work for their own enjoyment.


https://www.youtube.com/watch?v=MFKV48ikV5E

Relevant to the article: Large Language Models Meet Copyright Law at Simons


“In the end, lawyers saved humanity from an all-powerful AI.”


I think all OpenAI needs to do is scan physical newspapers and OCR them. No ToS to agree to, and no ToS on print editions.


Ignorance of copyright law won't save you here. I can't legally torrent a copyrighted music file just because there's "no ToS to agree to" when I listen to it.


There is no copyright involved in training a model; no precedent, at least. Only online ToS/API restrictions exist for scraping content. The DMCA issues exist entirely on the (re)distribution side of copyrighted material, so seeding in a torrent swarm = redistribution. If you download some copyrighted material somehow and don't share it with anyone, there is no caselaw that says anything about it.


The NYT certainly seems to think there's a copyright issue, namely their copyrighted work being directly used to create a potential competitor.

> If you download some copyright material somehow and don't share it with anyone, there is no caselaw that says anything about it.

This is still a violation of the law. You cannot download copyrighted material against the terms of the copyright holder (like downloading a movie or album).


The RIAA kept running into this problem, which is why it only sent the lawyers after people it could confirm had seeded.


It's time to abolish copyright.


Good. Literally anyone’s copyrighted comments on the internet should get a settlement


They own copyright on hallucinating weapons of mass destruction? :D


incoming backroom payment deals with publishers. "OpenAI now features training data from our partners X, Y, and Z"


Sue these hypocritical fair-use citers who prevent people from training on their own outputs. Force them to reveal their entire training set for generating oblong statements.


A skirmish to keep our collectively acquired knowledge from being used, hiding it behind selfish... capitalist gain.


While (in general) I agree with arguments against "copyright hell", this particular case is not about copyright itself, but about the consequences of generative AI for an entire industry.

Journalists exist for a reason. Yes, they work with facts, and very often open facts, but they still assemble those facts in a certain way to construct a narrative, connect the dots, and tell us a story (not counting the cases where a journalist works with sources and produces unique inside information). Then OpenAI comes along, says "thank you very much", and assembles all of the journalists' work into one Uber Knowledgeable Journalist who can answer all of your questions.

So far so good: we create a public-good service, and copyright holders are in shambles.

Until you start making money on it.

That's where the problem is.

If OpenAI were a non-profit organization like the Wikimedia Foundation, which just wants to make the internet a better place, you'd find few arguments to support the NYT lawsuit. But monetization changes everything.

Basically, the NYT is not worried about the reuse of its text as such; it is worried that no one will want to visit the NYT anymore and will instead pay Microsoft/Google and get all their answers from them.

Take an example. There was a famous story where an FT journalist discovered massive fraud in Wirecard's accounting, which essentially led to the death of that organization. Those articles were the result of multi-year reporting work, in which the journalist collected facts piece by piece, met people, and eventually spotted the gap. Now, in the age of Bard/Bing/ChatGPT, you don't need to read the original articles to know all of this. You can ask a search engine or a chatbot and get an essential rephrasing of the original reporter's work. You no longer need to go to the FT, pay for their paywall, watch their ads, etc. Effectively, the FT made a huge investment in its people, allowing them to spend two years on this issue and report it, and now gets zero leads to its website because all of them are eaten by Google and Microsoft, who will sell you their ads and retain you in their monetized products.

Imagine that you built a for-profit paid code library for some task. You make the code available behind a paywall and ask people to pay you to get to it and solve their problems. Then Microsoft comes along, sneaks behind the paywall, scrapes your code, and publishes a recompiled and slightly optimized version in open access, so no one ever needs to go to your website again; they just ask Microsoft to show them your code.

Would you be happy?

All of this makes the case not as easy and straightforward as "bad copyright holders against the progress of humanity".

At the end of the day, if the NYT/FT/New Yorker and others stop publishing their work and fire all their journalists, will ChatGPT tell us stories with the same depth as the ones we read there?


Copyrights and patents are holding back humanity.


Somebody posted this link in a comment on a thread the other day: https://news.artnet.com/market/koch-brother-loses-it-on-air-...

And it occurred to me that this is precisely the thing that's holding back humanity: "Koch estimates that he has spent $25 million on legal fees—far more than the $5 million he originally spent on the fake wine itself."

We have a legal system that is completely inaccessible to the average man. That it can generate $25M in civil legal fees is beyond absurd. Patents and copyrights derive much of their force from the fact that they are enforced by a legal process where to play is to lose. There's no winning. It's no longer about justice, and it has largely become a form of financial bullying where entrenched interests beat up on smaller ones.

Fix the legal system -- make it accessible -- and you've fixed patents and copyrights. To address this thread's point: I believe that AI, and perhaps _only_ AI, might be able to help with this.


Two friends of mine created a pretty good song back in 2013, and they tried to copyright it. One lawyer, a mutual friend of mine and the band, asked for 200 euros to copyright just that one song. Later they realized they could copyright the song for just 30 euros.

On some other, totally unrelated news: my parents knew a person who sold music en masse on cassettes illegally copied from other cassettes back in the 80s. He bought a BMW and built a house just from that. I was friends with his grandson when we were teenagers, and his grandson didn't care to play basketball or football or anything like that. He was obsessed with listening to music and memorizing all the lyrics and such.


You don’t have to “copyright” a creative work in Europe for it to enjoy copyright protection. It automatically does so by virtue of being a creative work.


Oh, I don't know anything about copyright or the exact purpose they had in paying a lawyer. But I do know that there were many people scanning books and copying cassettes, CDs, DVDs, etc., who profited from copying information by a minimum of 1000%.

We are well into half a century of copying everything, just via a manual and tedious process. Nowadays, with statistical engines and the internet, the copying process is planetary and infinite. So what's the big difference?

Let alone the fact that statistical engines do not copy information!


Overly long ones, anyway. Would you argue against a one-year copyright and non-broken patent process? I think most people would not.


Social media, especially Facebook and Google News, devalued news by commoditizing it.

News is trying to avoid the next generation of tech doing that to the long tail of data.


[flagged]


That’s not how copyright works.



