[flagged] Japan Goes All In: Copyright Doesn't Apply to AI Training (biia.com)
164 points by ashvardanian 11 months ago | 168 comments



The source referenced is the minutes of remarks a representative made in committee last April, specifically a clarifying question asked while discussing AI in education. It's not policy, not recent, and not true.


That's how I like all of my news stories: untrue, untimely, and based on the word of one person in the legislature.


Plus this website looks so SEO'ed that it's unclear whether it's even a real 'association'.


But at least it was written by a real human! /s


But was it submitted by a real human? /s


So I can see the logic in treating the inputs to the AI training data sets the same way we treat humans learning something.

A potential downside is that AI systems can 'mechanise' the creation of material that potentially infringes copyright (in the same way that human-generated content can infringe).

But a potential upside is that we can 'mechanise' the process by which we judge whether new content infringes the copyright of older material.


> So I can see the logic in treating the inputs to the AI training data sets the same way we treat humans learning something.

If we don't take this approach then we'll end up with a series of very lame legal loopholes that insert mechanical Turks [1] into the process, or with very "I know it when I see it" legislation.

So on both practical and philosophical grounds, I support this.

[1] https://en.wikipedia.org/wiki/Amazon_Mechanical_Turk


You don't even need that. Even "just" OpenAI has a valuation large enough to buy several major publishers and data brokers outright to secure access to data, if they need to license it.

And while I expect the NYT imagines that their archive is really valuable for training, they're just not that special. They may have broken more stories on average than many others, and have had influential op-eds etc., but the ones that matter will have been cited, referenced, and written about elsewhere. The irony is that by virtue of being so well known, their historically most important content is also less unique in terms of the accessibility of the information in it.

So while I'm sure OpenAI would love their archives, I'm also sure that if OpenAI and others have to license content and the NYT ends up being "difficult", OpenAI will just license content from (or buy) a suitably diverse portfolio of other papers instead.

In other words, beyond producing outright synthetic data, if AI companies are prevented from training on data they don't have a license to, the net effect will just be a scramble to buy licenses and/or buy companies that can provide sources of content, and the price for that content will be a lot lower than some of the people pursuing these copyright claims imagine.

In the end, if we go that route, all we'll have achieved as a society is creating massive moats protecting the companies already big enough to buy access to a broad enough set of content, and making open models harder to build.


This will also enable new and unique business models for social networks. You won't have to rely on ad targeting and user tracking any more; if you encourage people to make great content, you can make money by licensing that content to advertisers.


And "clean room" AI training other AI.

The end is inevitable: copyright protection against AI training is probably a lost cause.


We've learned to make distinctions based solely on the 'mechanisation' aspect:

- email vs. spam
- phone call vs. robo-call
- hand-drawn vs. a copy machine (to some extent)
- etc.


It is not a problem if people can mechanise copyright infringement. If the violation is material enough to matter in a public market then we can sue.


The issue is the output, not the input. If LLMs were just learning from their training set and generating novel output, the same way a human might, then there'd be no problem.

The issue is that generative AI - both for images as well as text - in effect memorizes sources as well as learns from them, and can end up regenerating training sources verbatim (or with minimal changes in the case of images).

I don't think any US court is going to accept "yes your honor, we copied this copyright material, but we used a TOOL to do it" as a way to avoid copyright.


I don't understand your point because humans generate novel output AND memorise and regurgitate source material.

I believe human artists consciously adjust their output to avoid copying previous artists too closely. And sometimes they choose to copy very closely or exactly.

Obviously the same feature can be implemented as an option on generative AI systems.


My point is that Japan can claim whatever they want about copyright not applying to LLM inputs (training data), but it makes no difference. Copyright infringement will be judged on what they output. Of course, many outputs will be novel enough to avoid copyright claims, but not all, same as the output of a human.

No doubt generative AI systems could be built to self-police and not emit any tentative outputs that are too close to training samples, but that's certainly not the way they are today, and it's not clear from the article that started this topic that this issue is addressed in any way by the Japanese law, which is just about training data.


Generative AI tweaks the original material such that it's modified beyond recognition, and non verbatim. A bit like how artists tweak other artist's work, put their own spin on it, and then pass it off as 'original'.


GenAI CAN do that, but it also can regenerate training sources as-is. The way LLMs work makes it quite likely that once it has started copying a training sample it will continue to do so (after copying/generating N consecutive words of a training sample, the most statistically likely next word will often be the N+1th word of that same sample).

The same thing happens for images - the more an output looks like an input, the more likely (as diffusion-based generation proceeds) it is to look more like it. There are recent examples of Dune posters being recreated essentially as-is.
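
A minimal sketch of that dynamic, using a toy next-word model built from a single memorized sentence (the corpus and function names here are hypothetical; real LLMs are vastly bigger, but the greedy-decoding mechanics are the same):

    from collections import Counter, defaultdict

    # Toy "training set": one memorized sentence.
    corpus = ("we hold these truths to be self-evident "
              "that all men are created equal").split()

    # Count which word follows which (a bigram model).
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def greedy_continue(prompt, n=8):
        """Repeatedly emit the statistically most likely next word."""
        out = prompt.split()
        for _ in range(n):
            followers = counts.get(out[-1])
            if not followers:
                break
            out.append(followers.most_common(1)[0][0])  # argmax next word
        return " ".join(out)

    # Once the prompt matches the start of the memorized sample,
    # decoding tracks that sample word for word:
    print(greedy_continue("we hold"))
    # -> "we hold these truths to be self-evident that all men"

Each reproduced word extends the match, which makes the next word of the sample even more likely, so the copying is self-reinforcing.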


Or reproduces the copyright holder's watermark, such that even Getty can see it's theirs [0][1], implying that Stability AI trained on Getty's images without their permission and tried to commercialize the result via DreamStudio.

[0] https://www.technollama.co.uk/high-court-rules-that-getty-v-...

[1] https://www.theverge.com/2023/1/17/23558516/ai-art-copyright...


An alternative viewpoint:

Generative AI algorithmically processes the inputs such that they can be roughly recreated after decoding. A bit like how JPEG encodes images "beyond recognition" (i.e. lossily).


If you ask it to, sure, but there are examples of ChatGPT reproducing whole NYT articles with not nearly enough alterations to constitute a new work.


What were the prompts? A whole article would barely fit in the output tokens, typically, yeah?


https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

Exhibit J (phone won’t cooperate with paste).


Except when it doesn't. See the NYT's lawsuit here - long runs of the training data are spit out verbatim.


There is a distinction to be made between inputs and outputs when it comes to AI and copyright. Most people focus on the former, and discuss whether you can train an LLM on copyrighted works or not, but ultimately the issues really only manifest in the latter.

An AI training itself on a million newspaper articles can be declared legal, sure, but what happens when it also starts spitting out the same articles with nearly no modification? Is that still fair use? This is the crux of NYT's lawsuit, and making laws about AI training isn't going to make a difference to that.


> This is the crux of NYT's lawsuit, and making laws about AI training isn't going to make a difference to that.

The claims of NYT are about more than just training. They're claiming that, as part of the ChatGPT software, it _looks up_ stuff in a database of articles. That is beyond _training_.

If a human kept around a briefcase of NYT articles they didn't pay for and let you view them for a fee I think everybody would agree that's copyright infringement.

The focus on "whats the difference between an AI and a Human learning" is a great slight of hand to bypass the obvious copyright infringement OpenAI is doing.


> If a human kept around a briefcase of NYT articles they didn't pay for and let you view them for a fee I think everybody would agree that's copyright infringement.

Sure. Slightly more interesting is if that same human with those same briefcases was taking money to answer questions and referenced those papers, but did not just provide the articles or headlines, and might not even be paraphrasing the articles at all. Is that okay?

To the extent it is merely paraphrasing articles, or outputting headlines that it just looked up, I agree that could well be infringement. If it more transformatively processes those articles into something distinct, then it is not nearly as clear-cut. The latter is arguably the intent of OpenAI, even if the current results might be closer to the former.


While LLMs are designed to generalize from their training data, rather than simply memorizing it, overfitting occurs in niche areas. There will be plenty of niche areas where today's LLMs merely repeat training data, as in the NYT case. Larger datasets and better algorithms will help to an extent, but you'll always have niche topics where overfitting happens. The intent has always been to generalize well; however, it may not be feasible to do so in the long tail of the internet. How should copyright law address this?


> it _looks up_ stuff in a database of articles.

That is clearly not the case. There isn't anything like enough storage space in typical LLMs to maintain a "database" of all the training data.

If an article can be reproduced from an extremely sparse representation using a stochastic algorithm, that seems to me to be prima facie evidence that the article didn't contain much (or possibly, any) significant creative content to begin with.

Certainly I would expect to find less creativity in an allegedly factual news article than in a piece of acknowledged fiction.

Only creative works are copyrightable in the United States.

There are soi-disant "artists" who produce "artworks" that are (e.g.) nothing but a pure white rectangle. That doesn't make <div style="background-color:white;width:100%;height:100%;"></div> any kind of copyright infringement.


> the obvious copyright infringement OpenAI is doing.

I don't see how it is obvious, nor how it is infringement.

How come this doesn't apply to the person who memorized the NYT article, and recited it verbatim?

It's obviously infringement for the person who pressed the "generate" button to produce the article. That's no different from someone who copied a picture using Photoshop. However, Photoshop itself (and the making of it) does not constitute any infringement whatsoever, as long as at the time of making the application the sources used are not infringing (and I presume OpenAI had the right to view the articles at the time of training).

The crux, to me, is that the information extracted and produced (i.e., the neural weights) does not itself constitute any infringement. Using those weights to generate copyrighted stuff is an infringement, but only for the person _doing_ the generation, not for the authors of the weights.


Sounds similar to the lawsuits against Google by the newspapers of the world for providing excerpts of their articles on the search results page.

So essentially you can ask ChatGPT to fetch and summarize a paywalled article, because OpenAI has a subscription for their crawler?


I believe it really only works with GPT-4 directly, because OpenAI's prompting to make ChatGPT a chatbot ruins the effect.

Specifically the NYT put in the first sentence of the article and asked GPT-4 to autocomplete it, which it did with >95% accuracy. It's really quite stark: https://nitter.net/jason_kint/status/1740146134767865895#m

The issue isn't that GPT-4 users can read NYT stories without paying for a subscription, though that is a legitimate concern. The issue is that for good-faith use cases - e.g. asking to write a summary about a recent current event - GPT-4 could very well copy an entire paragraph from NYT verbatim, without the user having any way of knowing. It's a serious problem.
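
As an aside, detecting this kind of verbatim overlap is mechanically straightforward; what's missing is products doing it. A rough sketch using Python's standard difflib (the texts here are made-up stand-ins, not actual NYT content):

    from difflib import SequenceMatcher

    def longest_copied_run(source: str, generated: str) -> str:
        """Longest contiguous character span shared by both texts."""
        m = SequenceMatcher(None, source, generated, autojunk=False)
        match = m.find_longest_match(0, len(source), 0, len(generated))
        return source[match.a:match.a + match.size]

    article = "The firm predicted a wave of defaults across the sector."
    output = "GPT wrote: The firm predicted a wave of defaults across the sector."
    copied = longest_copied_run(article, output)
    print(len(copied) / len(article))  # 1.0 here, i.e. fully verbatim

A summarization product could run a check like this against its licensed sources and warn the user when the ratio gets close to 1.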


Which lawsuits? Newspapers are free to remove themselves from google at any time.

You might be thinking of newspapers’ political campaigns such as C-18 in Canada, which were done in bad faith. The newspapers wanted links to stay but wanted money too.

OpenAI can’t do this for NYT without destroying their model and remaking it.


> If a human kept around a briefcase of NYT articles they didn't pay for and let you view them for a fee I think everybody would agree that's copyright infringement.

Except that's not what it's doing. Show me how I can get ChatGPT to show me the full text of a NYT article.


You can read the lawsuit yourself. It goes into great detail and obviously you didn't believe me based on the previous comment so you might as well do some original research then.

https://www.courtlistener.com/docket/68117049/the-new-york-t...


I did read the lawsuit. I'm guessing you didn't actually try it yourself. Try typing the NYT's examples from the lawsuit into ChatGPT and see what you get. Here's what I get: "Sorry, but I can't provide verbatim excerpts from copyrighted texts. How about I provide a summary or some information about the article instead?"

Did ChatGPT change this after the lawsuit was filed? Probably. Does it matter to this conversation when it's clearly possible to limit verbatim outputs of copyrighted text? No.


> Show me how I can get ChatGPT to show me the full text of a NYT article.

There are like 40 pages of examples in NYT's lawsuit showing exactly that.


Try any of those examples yourself and show me what you get. Not a single one works for me.


ChatGPT is non-deterministic, so obviously doing the same thing will not give you the same result. Maybe it’s time for legislation forcing LLMs to be deterministic so that answers can be reproduced in cases like this. Though that wouldn’t help because ChatGPT constantly changes and there is no way to access older versions.
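
To be precise, the non-determinism is a sampling choice rather than anything fundamental: at temperature 0 the model picks the single most likely token every time. A toy illustration (the logits are hypothetical):

    import math
    import random

    # Hypothetical next-token scores from a model.
    logits = {"the": 2.0, "a": 1.0, "this": 0.5}

    def sample_token(temperature: float) -> str:
        if temperature == 0:
            return max(logits, key=logits.get)  # greedy: same token every run
        weights = [math.exp(v / temperature) for v in logits.values()]
        return random.choices(list(logits), weights=weights)[0]

    print(sample_token(0))    # deterministic: always "the"
    print(sample_token(1.0))  # stochastic: varies run to run

The API exposes this temperature knob; the ChatGPT web UI does not, which is part of why these exhibits are hard to reproduce after the fact.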


> An AI training itself on a million newspaper articles can be declared legal, sure, but what happens when it also starts spitting out the same articles with nearly no modification? Is that still fair use?

That would not be transformative. If it is also provided as a commercial service that directly competes against the copyright holder, then that would not be 'fair use'.


I’ve come around somewhat on this, at least insofar as I do believe it’s a novel legal question that I genuinely don’t see a clean way through.

If AI training maximally wins, it could substantially erode the value of IP to the degree of threatening the business models of some very important societal pillars like news media.


> nearly no modification

How much modification is enough modification? The courts can wrangle with that endlessly.


Sure, and that's the point of having courts. If you look at the examples in NYT's filing though I don't think anyone can argue that it isn't clear-cut plagiarism.


How would that affect, for example, maintaining a large corpus of pirated work for training AI on? Presume that you didn't break any other laws in the process of pirating the works. Could I license a large corpus of pirated work to people explicitly using it for AI training? Like if I downloaded all the books, and then sold access to that collection to someone for AI training, with them agreeing to a EULA preventing them from actually reading the books?

All kinds of interesting legal questions.


My understanding is that piracy enforcement generally goes after those who do the distribution. So you downloading all possible material from wherever is likely rather low risk/low penalty.

However, the system will absolutely destroy you if you try to sell or distribute copies of the material you collected...


I know here we are talking about Japan, and I am not familiar with Japan's stance on downloading/uploading and copyright violations.

But here in Mexico, you can download all you want, and last time I read the law, as long as you did stuff "not for profit" you could also distribute digital content.

(Of course, I always say that here in Mexico it is also illegal to murder people, kidnap, and whatnot, and look at how little they prosecute people who do that. 95% of crime goes unpunished in Mexico [1].)

[1] https://www.nbcnews.com/news/latino/violent-crimes-rise-mexi...


Civil liability in many European countries, and in the USA too, seems to be the real risk. Seed a torrent of the wrong content from home and get a nice "blackmail" letter from some lawyer... Go to court and end up paying thousands, if not hundreds of thousands, based on some extremely bizarre calculations.


That's really what my question is about, are you allowed to distribute pirated content if, and only if, it's being used to train AI?


For all we know, in 5 years you could ask an AI to "reproduce the entire Supernatural TV show but change the character names, dialog and look just enough to bypass copyright issues - 7 seasons, 24 episodes each, MP4 format"


Is anybody going to want to watch it other than the one who ordered it? Would anyone order this modified version before having paid for and enjoyed the original? If the market is flooded with these cheap copies, wouldn't the original have even more value for authenticity?


Seven seasons? I thought you said reproduce the entire show? 37 seasons :D


I see we have a fan of that show! :)


(All IMHO) Which is great! Copyright is stupid, and it is time we repealed that 350-year-old law which at this point is stifling innovation.

As I said in another thread: sorry if your 40 hours of work won't pay you $10 a month forever. That's the case for most of the rest of us: we produce for 40 hours, we get paid for those 40 hours, regardless of what we do. Welcome to the club!!


I think it's notable that when the copyright holder is a business there is rarely any questioning that the copyright holder is king. But when the copyright holder is an individual or an artist, then it practically becomes a challenge for the government and/or private sector to try and strip that copyright away from you.


That’s because of publishing, mostly. Individual creators very often sign over some or all control of their work to companies in order to make more money.


AI training is no different than a human reading copyrighted material and training his biological brain. In neither case is the "brain" allowed to regurgitate source material in its original form. And as long as that doesn't happen there is no copyright violation.


Why should LLMs have the same rights as humans?


Because LLM is just a tool, and humans have the right to use whatever tools they want.


Something else is going to have to give... Copyright applies to artists and authors not in the sense that they can't learn from other people's content, but in the sense that they can't plagiarize it. So that end will have to hold the same for humans and AI, one way or the other.


Humans and AI aren't the same. There's no reason to think the treatment will have to be the same.


> Humans and AI aren't the same. There's no reason to think the treatment will have to be the same.

Exactly. It would be entirely legitimate to legally privilege human learning without allowing that privilege to transfer (by analogy) to machine learning.

There are a lot of people who want to use analogy to force the transfer of that privilege, because 1) they judge they can profit handsomely and/or 2) they're technology enthusiasts who've read too much sci-fi.


They do have to be the same, because learning is not something you can pragmatically constrain with law. Style and structure are not protected and can be mimicked, so it is trivial to do parallel construction for any stylistic ideal even if the law says you can't train on the original. Society does not benefit from these extra steps.

Specific IP like characters are already perfectly well protected by copyright. It does not infringe on anything to learn about Batman.

Just “training for the heck of it”, with no justification for limits beyond hurting the feelings of artists who put their art in public, kinda sucks for them. But the alternative is just asking for only large corporations to be able to train. And it's not like artists will be comped for that either; it'll just be them losing out because they put their work on a “free” platform.


You can train on all the shit you want, but if you train on someone else's materials without a license and your product makes money, you should be forced to stop selling your product and to transfer the profits to the people you took advantage of for your own unjust enrichment. If you do it a lot, you should go to jail.


So if you create a cool new art style, and I copy the style by hand, and then I use my copy to train a model that can now generate your style without having ever seen your work it’s ok?

Because if you think this is not ok then you are arguing that people own styles. And they don’t.

And if you think it is ok then you’re just arguing for pointless extra steps.


If you create derivative works of my projects to use in your commercial AI, you're still abusing my work for your own gain.


Copying a style is not a derivative work. You can copy someone’s style just fine. You cannot own a style.

I’m perfectly within my rights to make art with your artistic style. And I’m perfectly within my rights to train a bot on my own works. So what’s your argument? What am I not allowed to do with this poorly conceived law?


Ya, if you make art to train on, sure, because you own the rights to that art. If you take other people's art without a license, as has largely happened, no.


Both of these stories involve using the art to the same degree. It's just a matter of whether you needed a human in the middle.


>Both of these stories involve using the art to the same degree

"the art" -> the art you made that you have rights to use, or "the art" that someone else made that you don't have rights to use?

>It’s just a matter of whether you needed a human in the middle

No, your story was that you made totally new art that you had rights to use to train your AI. It's not a human in the middle, it's a human author who allows you to use the art at all. If the "human in the middle" in round two didn't give you the rights, you couldn't use those either. The human is doing the authorship of the work and also allowing or not allowing you to use it. They aren't in the middle; they're 100% of the issue and the difference between allowed and not allowed, human-authored or not human-authored.


Ok so you’re 100% on board that humans have no claim to their personal styles. Cool.

It’s ok to make a LegitShady bot as long as you can pay someone $30 to make a few works stylistically similar to yours?

Because if that’s true, you’ve just agreed that the value of your creativity is $0. You don’t even get the $30 here. Nobody really wants your specific works, and your creativity will be freely available. In fact it can probably be synthesized with just a human and some simple tooling in the near future / now.

Is that what you want?


At this point it's clear you are not engaging in quality discussion and this is a waste of time. Have a great day.


I am.

These are the consequences of your own rhetoric. You are ignoring them for cheap shots to take the moral high ground. You can't just say “I'm on the side of artists” and conclude that you're doing good. Your rule set is VERY WEAK and prone to abuse. It will not meaningfully protect artists at all.


Yes there is - why should a well-proven law against plagiarism just evaporate if a human gets an AI to do it instead of doing it themselves? Either the law should be revoked wholesale or it will have to be applied wholesale.


I don't think you understood what I wrote.


Good for Japan.

The copyright claims are spurious; the laws were never written with such powerful algorithms in mind. If copyright applied to training, and if law were based upon principles instead of raw power, then such a ruling would lead to strange places.

Bravo.


If AI is a winner-take-all industry, they are prudent to clear the way for it.


Big if.


So wait. If I encode something near-losslessly into a neural net then it's OK? Books and music are now free?


Being able to train an AI is different from using it to reproduce copyrighted works verbatim. I could easily see a rule where you can train a model, but it's still on you to make sure that you don't use output from it that is too close to a copyrighted work.

This to me seems like the right approach, and it's not much different from what humans do. Humans are free to read whatever source material they want, but they can't subsequently write it down from memory verbatim or nearly verbatim.

edit: And if you think I'm being hypothetical, GitHub Copilot already does exactly this.
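
For a sense of what such an output check might look like mechanically, here is a rough n-gram-filter sketch (the index, window size, and corpus are my own assumptions for illustration, not how Copilot actually implements its duplicate detection):

    def build_ngram_index(training_texts, n=5):
        """Index every n-word shingle seen in the training set."""
        seen = set()
        for text in training_texts:
            words = text.split()
            for i in range(len(words) - n + 1):
                seen.add(tuple(words[i:i + n]))
        return seen

    def looks_verbatim(candidate, index, n=5):
        """True if the output contains any n-word run from training data."""
        words = candidate.split()
        return any(tuple(words[i:i + n]) in index
                   for i in range(len(words) - n + 1))

    index = build_ngram_index(
        ["the quick brown fox jumps over the lazy dog near the river"])
    candidate = "he wrote: the quick brown fox jumps over the lazy dog"
    print(looks_verbatim(candidate, index))  # True -> suppress or regenerate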


How can it possibly fall on you to verify your works?

The only possible way to do this is for companies to provide a list of all of their sources and for me to then automate verification and hope it works!

The real answer is that I should be able to control whether my information is used to train models or not, because we already know that models spit out verbatim results with generic queries and there’s just no way for a user to otherwise check this.


There is already a blend of exactly what you desire in the US. But to answer your question: in the case of GitHub Copilot, it does the checking for you, so no, you don't need a list of all the sources.

Regarding you wanting to opt out of training, that's fine; you can already do that for many large models. But doing that will likely become the equivalent of putting your works in a safe where no one but you will ever read them or find them.

Banning all AI training is the equivalent of banning search with any modern search engine.

edit: Also, note that you are already on the hook for not infringing patents; to do that "perfectly", as you imply for copyright, you would need to search/read/understand the entire body of published patents. A task that is clearly not possible. Yet patent law functions (unfortunately).


I mean, you're really just saying that it should be fine for AI to copy stuff, and that because it's impossible to verify, nobody can use it.

What good is training an AI if nobody can use it?

Pretending that disallowing AI is the same as disallowing everyone is a strawman. People finding you is a whole lot different from your work being copied being deemed completely okay so long as it was an AI that did it.


Pretty much, yeah. While trying to make way for progress, Japan may have accidentally blown a gigantic hole in their IP rights system that renders the whole thing essentially moot.


If you can type a whole book into a word processor, can you now distribute it without breaking copyright?


apparently!

... or have a computer "type in" a book i.e. file copy i.e. OCR scan (or microphone for audio) (or video-record for video) ...


It was meant to be rhetorical. The answer is no, you can't. Same with ML. Training on copyrighted material? Just fine. Producing and distributing material that infringes on copyright? Not allowed.

It's no different than a photocopier or a VCR when you're using it for that purpose, it's the end result that matters.


It's easier just to pay the $10/month spotify subscription.


Sorry if I wasn't clear: under JP copyright, you can encode and then redistribute, presumably for profit or other gain, and not remunerate content creators at all. True wholesale piracy, which destroys the economics for all forms of content creation that don't have another obvious way to make money (e.g. authors can't sell concert tickets).


Will be interesting to watch regulatory arbitrage play out over the next couple of years if the US and EU diverge from this take in upcoming legal disputes.


I wonder how or if this will affect the NYT case (https://news.ycombinator.com/item?id=38781941). It could potentially shift the focus from protecting the source of the data used to train models to preventing the outputs from competing with the source.


Well, it's a different jurisdiction entirely, but if that case were happening in Japan after this ruling, it'd likely be extremely easy for OpenAI to win.


The only sensible approach IMO. It's copyright, not readingright. AI should be judged by its output. If it's close enough to the copyrighted material, then the publisher is liable. If it isn't, they aren't.


Well, this will be interesting. Either this will hinder innovation (why bother creating something if anyone can steal it and use it?) or it will create a new age of innovation based on sharing knowledge between citizens.

Time will tell.


Does that mean they are gunning to become AI Central? I mean, this was probably going to happen in Russia anyway, or any other country that doesn't care about copyright. So I guess that cat is out of the bag?

Interesting times!


It is not that hard to train on public-domain content only or content that has the permission of the creators. Stability knew that they would get into trouble if they dared to evade copyright laws in the music industry with StableAudio. [0]

We'll see how Sony Music Japan, Universal Music Group Japan and the rest think about this latest iteration of regulatory arbitrage with AI.

[0] https://stability.ai/research/stable-audio-efficient-timing-...


Policymakers are making a tradeoff between privacy and economic value. When governments see the potential for creating large tech companies, and loose copyright policy can give them an edge, they must choose between the two.

South Korea has mirrored Japan's policy: https://metanews.com/south-korean-government-says-no-copyrig...


As much as I want to be on the AI industry's side here, rumours of ChatGPT producing verbatim copies of NYT articles kinda suggest copyright is still going to apply to the output.

Obviously, there is a difference between training AI and using AI.

It's very hard to make a case that training AI breaches copyright in any way.

But it's copyright 101 that if what that AI spits out afterwards isn't transformative, the AI retailer is getting hefty fines.


Not rumors; it's in legal docs. I haven't read the docs yet, but someone else wrote a post claiming the model was primed with several paragraphs of the article. (I'm not sure what to believe, but the screenshots were disturbing.)

(N.B. this is a complex problem and the scenario isn't quite so simple. If I hire an artist that creates work with a copyrighted figure, or use a site that gives me random images it created that I know to sometimes replicate copyrighted figures, it's not the artist / site / even me having the image in my possession that's a problem.)


Will be great to see Japan become world leaders in AI and for the regulators to realize they're shooting themselves in the foot.


I think it's a weak argument anyway. A pencil can be used to recreate copyrighted works. Why should an LLM be different? The legal responsibility always has been, and should continue to be, on the person publishing or making available the work.

And at any rate, a country banning this tech will be missing the revolution. Protecting the buggy whip manufacturers and all that.


> A pencil can be used to recreate copyrighted works.

This is a very weak argument.

A pencil, an empty USB stick, an LLM without weights, and a small child can all be used to reproduce copyrighted works. A USB stick with a copy of a movie on it, an LLM trained on a book, and a human that has watched a Disney film can all actually infringe copyright.

There's an issue beyond copyright, though. A lot of companies seem to think it's okay to train LLMs on data that they at least have no moral rights to, and possibly no legal rights to either. (A TOS covering anything posted privately, IMO, does not mean that the person posting it had rights to it, nor do I believe that fine print ought to give anyone rights to anyone else's private information.) And a person who has read all your email and an LLM that has been trained on all your email can both easily infringe your rights to privacy.


A pencil can be used to recreate copyrighted works. Why should an LLM be different?

Speed and scale make these completely unrelated in my opinion.


How though? You don't need an LLM to copy and paste a Disney character and post it on a million websites. An LLM returning an output isn't the same as publishing something. It's not being redistributed by virtue of just being output by the LLM.


Speed and scale also makes the printing press completely different than the pencil, but the principle that the user is responsible for the output remains practical.


The printing press also indirectly triggered the Reformation, and through that the Thirty Years' War and the Eighty Years' War, and through all those together the invention of the concept of Westphalian sovereignty. And in the UK the "Licensing of the Press Act 1662" (full title "An Act for preventing the frequent Abuses in printing seditious treasonable and unlicensed Bookes and Pamphlets and for regulating of Printing and Printing Presses" so guess what they cared about here).

While the ideals of freedom these brought may be desirable, we may want to avoid repeating the bloodshed from this particular bit of history. And that goes double for anyone in government who takes personal exception to being treated like French royalty.


My opinion is that people should just state that conclusion rather than make these silly comparisons.


Not everybody thinks in the same way. Some, maybe most people benefit from intuition pumps such as analogies. Good analogies clarify the point, poor ones obscure it.


>Good analogies clarify the point, poor ones obscure it.

I definitely agree with this, but I think a "good analogy" is really rare.

In this case, I think that the analogy just obscures the main point about copyright. I think the original comment would have been stronger by omitting the analogy and just discussing where the legal responsibility falls.

In almost every case an analogy is made, people discuss the validity of the analogy over the actual point of the comment. (I'm guilty here too! My comment ended up being about the analogy rather than the main point.)


Cripple the future so we don't have to leave the ways of the past...


So if I use a 500-pages-per-minute scanner it will be fine?


My comment doesn't say what my opinion on the topic is, just that pencils and LLMs are hardly comparable due to difference in speed and scale.


Congratulations, by attaching qualitative legal importance to scale, speed, and size, you just successfully argued that the First Amendment doesn't apply on the Internet.

You'll make friends on both sides of the political aisle with that position, but it's a shame to see it so readily accepted around here.

(Can't reply due to HN's rate limiting algorithm that penalizes me after four or five posts while allowing the most-corrosive trolls imaginable to party all day, so I edited to clarify.)


Congratulations, you just successfully argued that the First Amendment doesn't apply on the Internet.

What?


You will still eventually run into absurd contradictions whenever you blame a tool for the actions of its user. Subjective factors like "scale and speed" should have no place in the law, even though they often do.

Never mind that every regulation that constrains AI development in the West is a giftwrapped blessing to China and other actors that DGAF about copyright law.


What's the difference between picking a cherry tomato from your streetside garden and eating it, and driving a combine harvester over your lawn and taking everything?


I'm not a lawmaker, but it's probably pretty hard to write a law that effectively distinguishes between "doing X" and "doing X at scale". As another commenter mentions, if you target the means of doing it (human doing X vs. machine doing X), someone will just use Mechanical Turk or something to hire 10,000 humans to do X.

If telling AI to study Spiderman and then output 10 pictures of Spiderman is illegal, how is that different from hiring 10,000 artists to study Spiderman, having them each do a drawing, and then hiring 100 talent judges to pick the top 10?


Isn't the latter already a copyright infringement? So in this case, both should be forbidden?

I think the more immediate issue with AI is that it's like having access to a close-to-zero cost human who doesn't care whether or not they're creating content which, if a human did it, would be considered a copyright infringement. And that they care so little about copyright (and other data rights) that they're basically incapable of even warning you if they are close to an existing character or living person.

I don't know how this is going to play out, but right now we're getting a more polite re-run of the Luddites smashing early industrial equipment. I can sympathise with the loss of purpose and economic disenfranchisement, but the economic power in that revolution went to those who did the most automation, and I expect the same to be true this time.


Not at all. We do it all the time. E.g. It's legal to consume fruits found in a National Park, as much as a single person can consume. It's illegal to harvest and take away anything more than that.


I'm usually sympathetic to arguments where scale can change the fundamental nature of the thing but this doesn't really track. Because the argument is that LLMs are tools -- you can disagree with that but that's the parent's position, and those tools can be used to create copyrighted material. The quality, availability, and ease of use of those tools don't particularly matter.

As far as copyright is concerned, it's the same whether one person has a scanner, camera, or blank cassette tape, or everyone has them.


Legally, the scale of the punishment / fine, but not the legality of the base act. Also, the latter involves trespassing and probably property damage.


Pathetically incongruent analogy


Exactly, as was the previous one. Catching on to this stuff I see.


Piling on to a bad example with a much dumber example is not helpful


What's the work? Would you say it ought to be legal to distribute a really good prompt that generates a copyrighted character? What about an embedding?


I would say it ought to be as legal as distributing a "how to draw mickey mouse" tutorial or a "how to sing taylor swift song" video or perhaps even a "how to make a twitter clone" tutorial.


In that case it seems like it would be much easier to just make it legal to distribute the copyright material in the first place.


That doesn't follow at all? Just because you can create something doesn't mean you can distribute it.

Just because I drew Mickey Mouse doesn't mean I can sell it. Just because I sing Taylor's song doesn't mean I can upload it to Spotify.

Just because GPT can return an image doesn't mean I'm allowed to sell it.

Creation and distribution are different, and the reality is that the 90% of consumers will never create, they will consume distribution.


> A pencil can be used to recreate copyrighted works. Why should an LLM be different?

I find this a pretty weak argument too. Does an LLM use a pencil? Can an LLM use a pencil? Does a pencil watch anime or read the NYT? Does a pencil create new work upon request?

Are these things really comparable? Are they even related?


I just tried asking my pencil to draw me a picture of Mario but nothing happened. What gives?


You're holding it wrong.


You need to use Google Images, and you will get millions of unauthorized electronic reproductions of Mario.


Copyright is a dumb system in the first place. Information wants to be free. People can still pay other people for a promise to create more content, or to create custom content. Companies can still have trademarks and patents for a couple of years. And states, royalty and rich people can still be patrons.


I feel like this will cause pseudo-companies to pop up in Japan just for training AI models. That way any company that wants to deflect blame can just deflect it to the pseudo-company, which then has no legal obligation to respect the copyright.


Output can still infringe, however. This is only talking about input.


Japan was falling behind technologically, and to prevent any crazy New York Times vs. OpenAI style litigation, they went ahead and did this. It makes sense.


What's the counter to this argument? "If you put something on the web, it's in the public domain. Free game."


It isn’t an argument, it is just a statement of something which is factual or not depending on your jurisdiction, or more importantly the jurisdiction of people who see your work.


"...whether it is content obtained from illegal sites or otherwise.” Keiko Nagaoka, Japanese Minister of Education


If AI has only the one benefit of destroying the very idea of copyright, this in itself will be a huge win for humanity.


Sounds like many places, like the US etc., will no longer be able to recognize/reciprocate with Japanese copyright laws and AI products. Japanese AI products might never be able to make it to the West.


It would follow that it doesn't apply to outputs either then, yes?


Moloch wins. Too much money to be made.


I fail to see how sensible copyright policy is equivalent to burning infants as human sacrifice to Canaanite deities.


Moloch in Ginsberg's sense


Still though, this seems like a reasonable copyright option.


Context: https://slatestarcodex.com/2014/07/30/meditations-on-moloch/

TLDR: A "Moloch Trap" is a generic term for situations like the the Prisoner's Dilemma, where optimal actions for the group are the opposite of optimal actions for each individual in that group.


Even with this context, seems a little dumb. Copyright maximalists are constantly trying to invent new rights that they've never had. If a person is allowed to read training material acquired legally, then so too must their LLM experiment be allowed to read it. The LLM reading it creates no new copies.

Moloch didn't win here. Everyone else did. We get new technologies; the copyright owners get everything they were promised, just not more.


I don't think the result is as clear either as the "Moloch" commenter, nor you, seem to think.

I agree that the result is not a "Moloch trap": as you say, the new technology actually does de-fang Google Search and empowers many smaller people to be able to do things they would never otherwise have been able to do. The contrary ruling would certainly have been enjoyed by "copyright maximalists" like Disney extracting value from the public domain without giving anything back.

But there are more people affected than just greedy copyright maximalists. Individual artists who have spent years developing a distinctive style and making it popular are seeing their style copied ad-infinitum for free. Organizations like the NYT that invest money doing investigative journalism are having their results slurped up and regurgitated.

To your comment:

> If a person is allowed to read training material acquired legally, then so too must their LLM experiment be allowed to read it. The LLM reading it creates no new copies.

In the past, each copyrighted work seen might train a single BNN (biological neural network). Only a small percentage of BNNs would actually study such work to learn to emulate it; only a handful would achieve parity or exceed the quality of the work. Each BNN was expensive to employ, would only work for a certain number of hours per day, and a certain number of years before retiring.

Now a single ANN (artificial neural network) can study works to emulate them in a month or two. That ANN is far less expensive to employ than a BNN; can be deployed 24/7 indefinitely; and can be duplicated to as many GPUs as someone can get their hands on.

Currently, legally, it may be that an ANN learning from an artist's work is the same as a BNN learning from an artist's work. But from a practical perspective, in the case of individual artists, it's clearly not the same.

Now maybe that's the inevitable price of progress; but 1) I don't think that's the inevitable conclusion, and 2) even if it is, we need to be honest about it.


Other comments here suggest that this is not a recent development, and that it is deeply editorialized.


90s Japan vs the US energy is back.


Copyright needs reform desperately. It has been outdated and not fulfilling its purpose for decades now. This is a first step by Japan in a good direction IMO. I hope it will have some influence on the rest of the world.


Can you please stop posting like this? It is excessive:

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


This is just an end to copyright. There is no definition of "AI"; it has always been a marketing term for monetising CS research. There is no difference between a lossy JPG, taking its pixels as weights, and the weights of an NN.

So if I just zip up copyrighted images using an NN, then what? They're public domain?

Regulators here are miles away from understanding the implications -- this is what happens when you let companies whose profit motive is selling "AI" be the "Experts" on the topic.

If you don't care about Disney, fine -- so what about your health records? This is also the prelude to an end to privacy.


Completely crazy take. Your health records are not currently protected by copyright law. If someone magically snapped their fingers and eliminated copyright law, the protections on your health data, scant though they may be, would be more or less the same (IANAL).


So to be clear -- you're well aware that, in the case of your private data, you have an interest in preventing it from being used to train AI.

Great. So d'you think you could outline a reason why you wouldn't have an interest in your creative works not being used also?

Either the training data is, as big-ad-tech says, essentially equivalent to generic human experiences -- i.e., weakly reproducible -- OR it is extremely reproducible, and equivalent more to standard contemporary data compression.

If you're kool-aid'ing the former on copyright, why not the latter on privacy?


Because my private data isn't protected by copyright, it's protected by things like HIPAA which doesn't matter one iota about human experiences and applies equally to humans and machines. It's about data sovereignty and who may access my data and for what purpose. A human is not allowed to share, retain, or reproduce my medical data.

So arguments like "I can get the AI to output my chart verbatim" start carrying weight, because it's been granted access to data that the humans who created the AI are not permitted to share in any form whatsoever, whereas copyright concerns what I may do with the data after it's produced. Copyright is full of exceptions for things that don't count as a reproduction or performance of the work, and this is just one more; it doesn't change the nature of copyright.


Equivocating health records to art style is a laughable proposition to build an argument on. I mean, c'mon.


If you don't care about Disney, fine -- so what about your health records?

Can you help me understand the relationship here?

Something can be protected by privacy laws without being under copyright, and vice versa.


see my reply to throwawaymaths


It'd be really interesting to open up a movie theater in Japan that just ingests blockbusters through a "do nothing" NN and then screens them royalty-free. This decision feels incredibly half-baked.


This is explicitly for training, not for the distribution of copyrighted material. Courts aren't stupid.


I'd clarify that in my example the screening would be of the potentially random output of a model... just one that was only trained by watching a specific blockbuster movie and thus extremely likely to just reproduce the source material. My example is obviously an extreme, but it gets at the core of the NYT case here in the States... I think it's a bad thing if we allow models to output data nearly indistinguishable from the copyrighted data they were trained on.

W.r.t. the NYT case: it's my opinion that it's completely reasonable to use a corpus of vetted English literature like the NYT as a way to train your model to comprehend language, but if the model also begins to echo the contents of those articles, then that may be a serious breach of the NYT's rights to monetize their work.


That will probably go about as well as distributing an encrypted copyrighted work along with its encryption key and then claiming that none of the bits are the same so you did nothing wrong. Courts historically have not had any problem sorting out nerdy fantasy workarounds of the type often posted on HN.


How is screening a movie training an AI?


It seems pretty clear what AI means in this context.

> There is no difference between a lossy jpg, taking its pixels as weights, and the weights of a NN.

Somehow I don't think this is going to hold up in court.


At that direct level, yeah, probably. And I do think copyright is dumb and should be, if not abolished, limited to 20 years or whatever. That said, imagine training a network from scratch entirely on Disney's catalog. Even if that model is then prompted to generate new characters, it seems weird to say that Disney's copyright wasn't infringed.


This is little more than baseless fearmongering.

"zip up copyrighted images using a nn" is trek level technobabble.

that's not how NNs work.

how the hell does it even connect to privacy?

copyright isn't what makes it illegal to expose and have your medical records

it's privacy violations, which this doesn't even touch


> "zip up copyrighted images using a nn" is trek level technobabble.

Look up 'overfitting', neural-network-based compression, etc., or that paper that basically used zip compression as a neural network. It's the farthest thing possible from 'technobabble' once you understand how inextricably linked compression and 'understanding' are.
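
To make the overfitting-as-compression point concrete, here's a toy sketch (assuming PyTorch; the sizes and seed are arbitrary). A tiny network trained far past generalization becomes, in effect, a lossy encoding of its training data, and querying it is the decompression step:

    import torch
    import torch.nn as nn

    # Toy "image": 64 pixel intensities to memorize.
    torch.manual_seed(0)
    pixels = torch.rand(64, 1)
    coords = torch.linspace(0, 1, 64).unsqueeze(1)  # pixel positions as inputs

    # Small MLP mapping coordinate -> intensity.
    net = nn.Sequential(nn.Linear(1, 128), nn.ReLU(), nn.Linear(128, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)

    # Train far past any notion of generalization: the weights become
    # a (lossy) code for this specific data.
    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(coords), pixels)
        loss.backward()
        opt.step()

    # "Decompression": query the net to reconstruct the memorized pixels.
    print(nn.functional.mse_loss(net(coords), pixels).item())  # near zero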


Regardless of whether it's "technobabble," it's a misunderstanding of how courts operate. The law is not a formally specified algorithm. If you overfit a NN to produce someone else's work, that's not going to get you off the hook in front of a court.



