
Note that not all jurisdictions have the concept of "fair use" (use of copyrighted material, regardless of transformation applied, is permitted in certain contexts…ish). Canada, the UK, Australia, and other jurisdictions have "fair dealing" (use of copyrighted material depends on both reason and transformation applied…ish). Other jurisdictions have neither, and only licensed uses are permitted.

Because the companies behind large models (diffusion, LLM, etc.) have consumed content created under non-US copyright laws and have presented it to people outside of US copyright jurisdiction, they are likely liable for misapplication of fair dealing, even if the US ultimately deems what they have done to be "fair use" (IMO this is unlikely because of the perfect reproduction problems that plague them all in different ways; there are likely to be equivalents of trap streets that will make this clearly a copyright violation on a large scale).

It's worth noting that while models like GitHub Copilot "freely" use MIT, BSD (except 0BSD), and Apache licensed software, they are likely violating the licenses every time a reasonable facsimile pops up, because those licenses require that copies of the licensing terms be included with full or partial distribution or derivation.

It's almost as if wholesale copyright violations were the entire business model.




You're right. I'm definitely taking a very US-centric view here; it's the only copyright system I'm familiar with. I'm really curious how jurisdictions with no concept of fair use or fair dealing work. That seems like a legal nightmare. I expect you wouldn't even be able to critique a copyrighted work effectively, nor teach about it.

When you speak of the "perfect reproduction" problem, are you referring to cases where LLMs have spat out code that is recognizably from the training data? I agree that that's a problem, but I expect the solution is to have a wider range of training data to allow the LLM to better "learn" the structure of what it's being trained on. With more/broader training data, the resulting output should have less chance of reproducing exactly what it was trained on _and_ could potentially introduce novel methods of solving a given problem. In the meantime, it would probably be smart to test outputs for recognizable reproduction and throw those answers out, perhaps with a link to the source material in their place.
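As a rough illustration of what that test might look like (purely my own sketch, not anything any vendor actually does), you could flag output whose character n-grams overlap heavily with a known source; the corpus, n-gram size, and threshold below are all illustrative assumptions:

    def ngrams(text, n=50):
        # All character n-grams of the text, whitespace-normalized.
        text = " ".join(text.split())
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

    def reproduction_score(output, source, n=50):
        # Fraction of the output's n-grams that appear verbatim in the source.
        out_grams = ngrams(output, n)
        if not out_grams:
            return 0.0
        return len(out_grams & ngrams(source, n)) / len(out_grams)

    def flag_reproduction(output, corpus, threshold=0.3):
        # Return the first corpus document the output appears to copy, if any;
        # the caller could then drop the answer and link to this source instead.
        for doc in corpus:
            if reproduction_score(output, doc) >= threshold:
                return doc
        return None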

There's also a point, however, where the same code is likely to be reproduced regardless of training. Mathematical formulas and algorithms come to mind. If there's only one good solution to a problem, even humans are likely to come up with the same code without ever seeing each other's output. It seems like there's a grey area here which we need to find some way to account for. Granted, this is probably the exception rather than the rule.
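As a concrete (if trivial) illustration of that convergence, here's my own sketch of Euclid's GCD; there's basically one natural way to write it, so two developers working independently will land on nearly identical code:

    def gcd(a, b):
        # Euclid's algorithm: repeatedly replace (a, b) with (b, a mod b).
        while b:
            a, b = b, a % b
        return a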

> It's almost as if wholesale copyright violations were the entire business model.

If I had to guess, this is probably a case where businesses are pushing something out sooner than it should have been released. I find it unlikely that any business is truly basing its model on something so obviously illegal. I'm fully willing to believe, however, that they're willing to ignore specific instances of unintentional copyright infringement until they're forced to do something about it. I'm no corporate apologist. I just don't want to see us throw this technology away because it has problems which still need solving.


I live in a fair dealing jurisdiction, and additional uses would need to be negotiated with the rights holders. (I believe that this is part of the justification behind the Canadian law on social media linking to news organizations.) It is worth noting that in addition to the presence or absence of fair dealing/fair use, there are also moral rights which must be considered (which is another place where LLM tech — especially the so-called summarization — likely falls afoul of the law: authors have the moral right to not be misrepresented and the LLM process of "summarization" may come to the opposite conclusion of what the author actually wrote).

Perfect reproductions apply not only to software, but to poetry, prose, and images. There is a reason diffusion model providers are facing lawsuits over "in the style of <artist>": some of those styles are very distinctive and include elements akin to trap streets on maps (this happens elsewhere — consider the lawsuit and eventual settlement over the tattoo image used in The Hangover 2).

With respect to "training it on more data", I do not believe you are correct — but I have no proof. The public statements made by the people who have done the training have suggested that they have done such training on extremely wide and deep sources that have been digitized, including a number of books and the wider Internet. The problem is that, on some subjects, there are very few source materials and some of those source materials have distinctive styles which would be reproduced when discussing those subjects.

I'm now more than thirty years into my career. Some algorithms will see similar code written by different humans, but most code has some variability outside of those fairly narrow ranges. Twenty years ago, I derived the Diff::LCS library for Ruby from the same library for Perl, but when I look back on the original code I ported from, I cannot recognize the algorithms (which is a problem when I want to consider how to implement things differently). Someone else might have ported it differently and chosen different trade-offs than I did. Even simple things like the variable names chosen will likely differ between two developers writing similarly complex code implementing the same algorithm.
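To illustrate that variability (with throwaway sketches of my own, not the actual Diff::LCS code): even for something as standard as computing the length of a longest common subsequence, two people implementing the same algorithm can differ in naming, structure, and memory trade-offs:

    def lcs_length_table(a, b):
        # One port: build the full dynamic-programming table.
        rows, cols = len(a) + 1, len(b) + 1
        table = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):
            for j in range(1, cols):
                if a[i - 1] == b[j - 1]:
                    table[i][j] = table[i - 1][j - 1] + 1
                else:
                    table[i][j] = max(table[i - 1][j], table[i][j - 1])
        return table[-1][-1]

    def lcs_length_rolling(xs, ys):
        # Another port of the same algorithm: keep only the previous row.
        prev = [0] * (len(ys) + 1)
        for x in xs:
            curr = [0]
            for j, y in enumerate(ys, start=1):
                curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
            prev = curr
        return prev[-1]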

There is an art to programming — and if someone has a particular coding style (in Ruby, think of the distinctive Seattle style) which shows up in Copilot output, then you have a possible source for the training.

Finally, I believe you are being naïve about businesses basing their model on "something which is so obviously illegal". Might I remind you of Uber (private car hires were illegal in most jurisdictions because they require licensing and insurance), AirBnB (private hotel-style rentals were illegal in most jurisdictions because they require licensing, insurance, and specific tax filings), Napster (all your music are belong to no one, at least until the musicians and their labels got involved), etc. I firmly believe that every single commercial LLM available now — possibly with the exception of Apple's, because they have been chasing licensing — is based on wholesale intentional copyright violations. (Non-commercial LLMs may be legal under fair use and/or fair dealing provisions, but that does not address issues for content created where neither fair use nor fair dealing apply.)

I am unwilling to give people like sama the benefit of the doubt; any copyright infringement was not only intentional, but brazen and challenging in nature.

I'm frankly looking forward to the upcoming AI winter, because none of these systems can deliver on their promises, and they can't even exist without misusing content created by other people.



