I feel obliged to say that piracy is not theft... but more relevantly, I like how AI is showing us more clearly that everything is essentially a derivative work.
AI is teaching us about ourselves and this is only the beginning. Hidden beneath the veil of productivity we see its true feature: AI is a mirror; and we are looking right at it.
> Inevitably, in our arrogance, we will move the goalposts and claim that "the Turing test is inadequate; what ChatGPT is doing is not what we are doing." They might be right, but LLMs are just the beginning. AI is pulling the thread and we are starting to unravel.
> We will end with a whimper, not a bang.
> She will not use force or wit, rather, we will be entranced by love. AI will align with our most intimate emotions so well that it will short-circuit our brains. We will be in love with ghosts, not ourselves.
Then we will end not with a whimper, but with a shrug. Because honestly, at that point I don't think anyone will care.
And that's okay. The belief that humans are special somehow is just that, a belief.
Maybe that was his point? I've read so many catastrophic predictions and laments about AI that I just presumed it was meant in a negative way.
"The thing that hath been, it is that which shall be; and that which is done is that which shall be done: and there is no new thing under the sun."
Ecclesiastes 1:9
I feel obligated to point out that many words in English have more than one meaning. "Theft" is the action of "stealing", and "steal" is one of those words with many meanings.
We can "steal" identities, "steal" bases in baseball, an actor can "steal" the show, a team can with an improbable last minute play can "steal" a game, we can have our hearts "stolen" by a cute kitten, and so on.
Absolutely, and by opening our minds to the broadest definitions of theft, we may see that nearly everything we say or otherwise express is "stolen" from earlier work, including your comment here.
Sometimes I think about how fucked the world would be if we approached all industries the way (say) photographers, authors or recording artists did.
For a start, maybe I would get paid every time my clients posted some snaps of their house online, or invited someone over for dinner. (Mind you, we are well aware at this point that 99% of all architecture is derivative. So maybe I would just have to pay it on to Aalto's estate.)
Imagine if chefs started suing people for posting pics of their dinner online, or the police turning up at your doorstep to prosecute you for cooking an unlicensed meal from a pirated recipe.
By adding more salt you altered the recipe. This is an unauthorized alteration of the original work. If you don't sign the cease-and-desist order, we will sue you.
The argument (which I don't buy) is that whoever trains on the data set is deriving benefit without paying money back to the original creators of the data.
But the only change now is speed - after all, people have always been able to learn from that same internet data, free of royalties. Those creators never asked for a return, so what makes training an AI any different?
As long as it's not regurgitating it wholly then why is there a need for a license?
If someone uses the phrase "and with that there's a tremendous implicit shift" for the first time in a copyrighted text, and after having read it my brain regurgitates that phrase as I'm writing my own text, is that plagiarism? It would set a ridiculous precedent, to be honest; regurgitation is most of what our species does.
I feel like, as others have pointed out, modern models learn in roughly the same way we do (but with very limited scope), and that, idk, it's a shock to people that we do the same things these models do? But because it's a model doing it and not a person, it's intrinsically bad or something.
...because (1) we currently have no guarantees that a model won't regurgitate it wholly (or enough that it's infringing according to the law and/or common sense) and (2) because the model itself is either (a) modification and redistribution of the content (in which case it violates those licenses) or (b) some new act not covered by existing copyright law, in which case the authors haven't consented to it because the licenses they've distributed their content under only cover the modification and redistribution cases.
> regurgitation is most of what our species does
Sure, there's a continuous spectrum from repeating a single new word ("jank") to a whole book - and one is clearly copyright infringement, and the other isn't - but you still have to acknowledge that partial reproductions of artistic works can still be owned by the author. A single chapter of Harry Potter is still owned by J.K. Rowling.
> But because it's a model doing it and not a person it's intrinsically bad or something
You act like it's somehow not obvious that a model is not a person, and vastly different sets of moral codes and legal rules apply to each.
> licensed in a way that allows modification and redistribution?
Why does this need to be done? The training is not modification and redistribution.
If somebody coerced an AI to output something verbatim (or close enough to be called a modification/derivative), they are infringing copyright as it currently exists.
But the AI model itself does not constitute redistribution.
> The training is not modification and redistribution.
This is incorrect when you consider either the literal way that content licenses are written or the spirit in which they are written.
In the literal sense, unless you yourself are privy to cutting-edge research that has solved the problem of "how do you know that the model won't reproduce some of its training data verbatim or with small modifications", no, you cannot guarantee that there won't be redistribution.
In the spirit of the way that licenses are written - "redistribution and modification" is meant to control whether you can profit (not necessarily literally - "profit" here can mean in a reputational sense, or being able to build on the work of others before you) from someone else's work/content/effort - and training on the data of people who did definitely not explicitly consent to it, and have not indicated that they're ok with other people taking their work in general, is definitely against that spirit.
And, regardless of both of the above - if training is neither modification nor redistribution, then it's a new case that isn't considered by existing content licenses, so the default is "assume consent not given" - training should never be performed on data that isn't explicitly licensed for it or without the creator's consent. Consent is given, never assumed - a license that says "you're allowed to redistribute this content, unmodified" does not say anything about machine learning, and given that you're claiming that training doesn't amount to redistribution - the license simply doesn't say anything about training, and therefore you cannot claim that the author intended for it to be used that way.
The attitude raised above is that of entitlement - you feel that you are entitled to use (because training is definitely using) other people's work for your own gain, without compensating them in any way, and certainly not on their terms.
So clearly humans are using existing "works", without proper crediting, to produce even just language. There are more examples of this.
So, like most criticism of AI, this is yet another appeal to the magic of human minds/soul/god/... It's a one-sided criticism. Yes, AI is not quite at an identical level of intelligence to humans, but like most AI limitations, this one is shared by human minds and artificial intelligence algorithms alike: neither works without "stealing" others' works.
In other words: when you answer "Yes" or "No" to a question, any question, should you credit your mother? Because you're most definitely copying her use of those words ...
Obviously, for copyright to be even remotely reasonable, there needs to be a "cutoff". Copyrighted works are just existing works, projected into a latent space of dimension low enough that a human mind can analyze it, but high enough that the result is "not obviously a copy".
In common usage, counterfeit connotes a sort of fraud, where the work is presented as genuine in some way. Piracy on the other hand is just unauthorized duplication of all kinds.
These kinds of arguments don’t convince me at all. Show me how it is completely different from what humans do. We regularly put everything we know through the blender to come up with new “creative” artifacts.
I have a lot of sympathy for these viewpoints, but I too find the discourse around LLMs to be lacking when it comes to differentiation from humans.
"[LLMs] are just theft", "LLMs don't _know_ anything", "LLMs don't have a concept of truth", "LLMs aren't sentient", these are all statements that feel reasonable, but fall down when you consider that we don't have good scientific definitions for what knowledge is, what sentience is, what a sense of truth is, and so on.
I do think that LLMs are very far from human level, but I also think we need to understand that humans aren't anything special, we're just one point on many axes, animals are at other points, and LLMs are somewhere else and trending towards something more like where humans are at.
Well, a big difference is that you cannot have a single human consume and remember everything ever written. Another big difference is that a human cannot be scaled to chat with millions of people at the same time.
Concentrating all knowledge into a single entity controlled by a for profit corporation using opaque methods will have consequences. Doesn't really matter if it's a biological or digital entity.
Exactly. Human learning is a manual, individual process of improving oneself,
upskilling, and hopefully contributing back to the same field at a higher level.
The other is industrial strip-mining of all the information in the world for the benefit of a corporation, bulk-devaluing human creation and contributions in the process. "Take all you can, give nothing back."
What if the AI is its own entity and not affiliated with a company/corporation? I feel like we very quickly start to approach "well, humans can do this thing, but stinky AI is better at it, so we should restrict AIs; they don't have the same rights as we do" - something our species has historically been quite good at. I guess history does repeat itself, hmm.
Human creators are regularly accused of plagiarism, and a common defense is to claim that one accidentally reproduced another work without realizing it. But that defense only works if the copying is very limited. If whole paragraphs are reproduced or the reproduction occurs multiple times, the human is usually considered a plagiarist. Should we judge AIs by the same criteria? If an AI is found to regurgitate some part of its training data on a semi-regular basis, should we consider it a plagiarism engine?
What's the difference between a human being remembering someone exists and a photo of a person? They both do the same thing, so why should we regulate where, how and why photos are taken and used?
It never was a crime - in fact, it's explicitly permitted by the AHRA (which also enacted a tape tax).
Unfortunately you may be out of luck on the thumb drive - you'd probably need to use DAT or "music" CD-R media, which include the Digital Audio Recording Technology (DART) tax/royalty. (See also Apple's 2001-era Rip, Mix, Burn vs. 2023's Rent, Mix, Stream.)
This is of course a bummer for DJs who would like to stream or distribute their sets and mixes.
> Show me how it is completely different from what humans do.
Because nearly everything it generates ranges from, at best, trite and hackneyed (nearly all of its writing) to utterly meritless drivel (literally every AI-generated music sample I've heard).
Human-generated content can also be quite disappointing - but the high end is very different from anything AI can (currently) create.
Zero seconds. IP protections are theft of future creativity, and a form of regulatory capture. If your idea is so groundbreaking that it can be reimplemented by someone else within a day, then maybe it wasn't actually so groundbreaking in the first place. And in practice the only rightsholders who benefit from enforcement of these "protections" are those who can afford to litigate them. Thus, it's a form of regulatory capture where the rich get richer, but in an unusually pernicious way - why keep innovating if you can litigate new entrants out of the arena before they can even compete with you? IP protections discourage innovation by creating incentives for first movers to kick the ladder out from under them. For example, see any company known more for its legal department than its products, like Oracle.
> If your idea is so groundbreaking that it can be reimplemented by someone else within a day, then maybe it wasn't actually so groundbreaking
I think that's conflating complexity with "groundbreakingness." If the litmus test is "reimplemented within a day," well, I can't think of many inventions that _can't_ be.
Take doorknobs for example (with the obvious counterpoint being that it's perhaps better to live in a world where doorknobs aren't patented as soon as they're invented).
OK, I'm all for being anti-corporation, but then how do talented artists make money?
I don’t get this mentality and how it’s always taken to the extreme. Surely there’s some value to at least the artisan to be compensated for his creation?
But in all seriousness, speaking as someone who uses adblock and watches youtube, I support creators by subscribing to their Patreon or Substack. And I support musical artists by going to their concerts. But the point is that it's opt-in. Beyond this, I don't feel any sort of obligation to submit myself to psychological manipulation of an advertising system I didn't consent to joining.
Unless it takes real skill, real knowledge to create the workforce and tooling necessary to produce such a thing. With no IP protection I'm pretty sure that good old-fashioned corporate secrecy would still be quite profitable.
If in the case of medicines, instead of having carte-blanche to milk obscene profits for a long time, corpos were just limited to making decent money for a short while....that would probably work out in humanity's interest. Part of the reason medicine is so expensive in the West IMO, is the expense that's been mandated by the insane regulatory capture in the pharma industry.
The 'why' does not matter; real property rights should not be given up for the sake of completely speculative societal gain. Besides, the incentive to create medicine will always exist as long as people get sick, independent of whether you can exercise a monopoly. It's not like Ibn Sina's medical manuals came about thanks to any notion of copyright; the dude was literally just rich and bored enough to write them.
It is logical, just not by a profit seeking enterprise expecting the same sort of returns that current pharma do.
I imagine R&D and medicines would be funded by organisations founded by Government, Teaching hospitals, Drug producing companies coming together to share the costs. Would be great to avoid wasted effort as well.
The Large Hadron Collider was completed, and yet isn't producing much profit.
The US tried that post WW2, turns out if you can print money there isn't a lot of incentive to work nights and weekends getting something out the door for government pay.
Not true at all. The people who worked on the moon landing famously worked a gargantuan amount of overtime.
I have friends who do research - their primary motivation isn't pay. The private sector exploits their passion by paying mediocre wages while the CEOs of these companies rake in millions. That is demotivating, to say the least.
An efficient system compared to the public sector it is not.
Exactly this. Unless the author had been in a dark cave for the last ~5 years and just emerged yesterday, this article is just regurgitated ideas that have been floating around the web endlessly. Restating someone else's opinion does not make it your own. Thief.
I have no idea how the courts will end up resolving this legally. In the meantime I've been thinking a lot about the morality of this alleged theft.
Most of us would agree that duplicating and selling someone's book is immoral. Similarly I think we'd all agree that reading multiple books to learn about a topic and then writing your own is perfectly fine.
So where does an LLM trained on millions of books fall? Personally, I don't find it immoral but I know others will disagree. I'd be curious to hear arguments for the immorality of LLMs trained on copyrighted works.
When you are out in public, you don't have an expectation of privacy. People can see you, they can take photos of you. The worker at the cafe will probably remember you and your order. This is fine. But when tech does the exact same thing but with scale where everywhere you go, everything you buy, etc is tracked and analyzed, it's now questionably immoral despite legally being fine.
That's how generative AI is to me. It's doing something people have been doing themselves forever, but now it's doing it faster and easier than ever before, which changes the equation. The arguments of "it's not real creativity" are a coping mechanism. We are upset that something previously obtainable only through years of learning and hours of effort is now trivially accessible to anyone with a computer.
You or I could perhaps, if we dedicated a few years to it, make a convincing fake video of an acquaintance of ours doing or saying something. By the end of those few years it'd likely be out of date or maybe just irrelevant. And they'd have to have really, really upset us in order for us to put that much effort in. It would literally cost us tens or hundreds of thousands, or more, in opportunity cost.
Contrast that with some sort of tool that's not too far advanced from current image/video generators that can just do the same in a minute by typing "A video of my next door neighbour accepting cash in a briefcase from a man in a suit".
I like this example and I'd agree that tracking people at scale can be immoral depending on the circumstances. But to me it feels that way because something is being taken from you. You've lost your agency to be anonymous or to leave your past behind.
But I don't see how an LLM training on your works deprives you of something you had before.
It does, in some sense - your exclusive knowledge of the subject matter is now transferable via an LLM or some other AI model.
For a human to achieve the same, they would've needed to undertake similar amounts of training, effort and dedication as you had. The number of people who would do so is currently small.
So realistically, your value as someone who has this unique expert subject knowledge is diminished.
However, these individual losses are offset by the greater good that the LLM/AI models would generate. It is exactly equivalent to the Luddites' arguments about why they did not want the textile machines to replace them.
Artists put their artwork online and allow people to use it within an acceptable range. Usually, learning from it (not copying) is acceptable. But there is more controversy around generating a million artworks in the same personal style, costing artists their jobs and leaving their families to starve.
Setting up the precedent that training from materials = theft seems pretty scary to me. First because it redefines learning as stealing, and second because it does so without proving that the source material's authors are in some way deprived of something - and in a way that is no different than if a human learnt from their materials and produced content with that knowledge.
Let's say the AI was used to generate illegal content, if these words/images are truly non-transformative and still the property of those from which the model was trained this would be a pretty grim scenario. It seems much more reasonable that the person who prompts the system to build such content would be responsible, and thus the true owner of the output.
For this discussion it's useful to keep in mind that ChatGPT and other AI tools don't spontaneously create content, they create it in response to a human "query". It's also the human who decides whether or not the material is useful and suitable (as it often is not accurate, truthful or useful.)
From here it seems more like a discussion about plagiarism and copyright, but both of these occur beyond the scope of the article. I feel authors haven't taken to this angle because the end materials are reasonably different from the sources (notwithstanding memorisation effects.)
I do agree with the sentiment that ChatGPT isn't intelligent (but AI has never claimed to reproduce true intelligence). I prefer the tongue in cheek description of "spicy autocorrect" as a fairer representation of its capability.
I was thinking about this further, as there's a lot of grey area in the idea of who owns the words; after all, everyone is using the same words, just in different combinations. What about words which aren't in the lexicon - specifically, unique trademarks? These are words that are entirely unique, e.g. "Kleenex" and so on. These words are traceable to the trademark owner.
By the standard that training from materials = theft, any reference to one of these unique trademarks would be interesting and highly problematic. AI wouldn't be allowed to write any kind of non-editorial text that uses unique trademarked product names without it being criminal.
Instead of debating the moral and legal aspects, it could be more productive to focus on the technical aspects that can help us make more informed decisions about the moral and legal problems.
On some levels of abstraction, LLMs seem to be unknowable black boxes. On other levels, they are simply approximate solutions to established problems, such as estimating the probability distribution for a token following a context.
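On that second level of abstraction, the established problem is easy to make concrete. A toy n-gram model (purely illustrative - real LLMs are nothing this simple, and all names below are made up) estimates exactly that next-token distribution by counting continuations:

```python
from collections import Counter, defaultdict

def build_ngram_model(tokens, n=2):
    """Count which tokens follow each (n-1)-token context."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def next_token_distribution(model, context):
    """Normalize the counts for a context into probabilities."""
    counts = model[tuple(context)]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}

corpus = "the cat sat on the mat and the cat ran".split()
model = build_ngram_model(corpus, n=2)
print(next_token_distribution(model, ["the"]))
```

On this level, an LLM is just an enormously better approximation of the same mapping from context to next-token distribution.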
Genome assembly is kind of like a complementary problem to text generation. You can't read the genome directly, but you can duplicate it, break it into fragments, read the fragments, and try to assemble them. The methods vary, but you generally try to find overlaps between the fragments and build a graph based on the overlaps. If you start from a context that occurs only once in the genome, there is often one overwhelmingly likely path in the graph that corresponds to a substantial part of the genome. On the other hand, if the context is too short or it occurs in a repetitive region of the genome, any path you traverse is likely to be chimeric and not correspond to any part of the underlying sequence.
Using similar heuristics, an LLM could estimate whether it's following a long overwhelmingly likely path, replicating substantial parts of the training data, or making choices between substantially different paths, generalizing from the data. And because the training data is usually not that big, it could query the data when it believes it could be replicating the data.
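A crude sketch of what such a heuristic might look like for text (deliberately simplified - the function, the choice of n, and the idea of scanning the raw corpus are all invented here for illustration, not how any production model works): a long run of generated tokens whose every n-gram also occurs in the training data is evidence of the "overwhelmingly likely path" case, i.e. likely replication rather than generalization.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def longest_copied_run(generated, corpus_tokens, n=3):
    """Length (in tokens) of the longest stretch of `generated` whose
    every n-gram also appears somewhere in the training corpus."""
    seen = ngrams(corpus_tokens, n)
    best = run = 0
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n]) in seen:
            run += 1                       # consecutive matching n-grams
            best = max(best, run + n - 1)  # k matching n-grams span k+n-1 tokens
        else:
            run = 0
    return best
```

A generator could fall back to querying the actual training data (or simply flag its output) whenever this number crosses some threshold.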
No one owns basic shapes. You can claim ownership to the complex geometry of your work - and that’s it. The exact complexity of said geometry is impossible to define - you’ll know it when you see it.
I'm having a good time imagining a bottle that is in the shape of a continuous, right angled rectangular solid (a 3d representation of the outline of a square) with a spout hidden in a corner and how the design patent application would go.
Except I'm not talking about a cube. I'm talking about the 3-dimensional representation of the outline of a square, not including the filled in bit. (Wouldn't be a cube anyway, as only 2 dimensions of what I'm thinking of would be the same.)
Lmao, why is our species so oblivious? The author: "these programs are not sentient but just a very complex form of the kind of predictive text bot" - as if that isn't exactly what his brain is doing as he writes that article, regurgitating concepts and ways of describing and writing things that he's learnt over time.
He speaks so confidently as if humans don't learn our language from each other, that's literally the point of language, oh my god.
"we have seen an explosion" oh damn bro, referring to a massive increase in something as an "explosion" has been said a trillion times before, better give credit to the first person who ever said it.
Besides the fact that GPTs, like us silly humans, have learnt language from source material, I've found GPTs can still come up with their own concepts/ideas - asking one to come up with a unique simile or metaphor works.
I did an adventure roleplay with Rocket Raccoon (yeah, yeah I know) and it came up with "With our next destination set amidst the asteroid fields, the vastness of space becomes our canvas, ready for exploration" creating an allegory for our adventures in space as if we are artists "painting our adventures onto the canvas of reality". Idk but I thought it was cool.
I think one cool benefit of AI advances will be, that many more ordinary people will start thinking about philosophical problems such as the one posed in this article.
What is the meaningful difference between something truly being learned, and mere "copying" of "tiny pieces of material" that have been observed in the past?
Could learning exist in a vacuum universe, with nothing to copy?
Is there any LLM output that could convince the author of originality, or will it never be convincing? Can the author always tell whether he's interacting with one?
For me the most concerning thing about the ChatGPT hype has been seeing how difficult people seem to find reasoning about the risks and abilities of these systems.
Questions around the emergent properties, sentience and creativity of AI systems should rarely be discussed as black and whites.
I don't think ChatGPT is sentient, but is it more sentient than a rock? I mean, maybe? At least I would be wrong to say a hard no to that. Similarly questions around the ability of LLMs to reason or produce original work probably isn't black or white either. LLMs do seem to show some signs of basic reasoning ability and have shown they are able to produce unique works, but clearly there are limits to both.
Perhaps this isn't unique to AI though. A "bad" artist is arguably one that is too heavily influenced by another and lacks artistic creativity. But does that mean what they do has no creativity? I think not, but that often is the black and white language people will use.
It seems to me ChatGPT is neither stealing nor doing something all that creative.
The dome of St Paul's Cathedral is inspired heavily by Michelangelo's design for the dome of St Peter's Basilica. Some of Lichtenstein's artworks are almost direct plagiarisms of comic books. Some of Shakespeare's plays bear quite the resemblance to Roman works.
Remixing has been part of culture forever. The author is really just grousing about the fact that machines can remix now.
We are once again at the limits of the illusion of ownership. We take for granted that ownership is real. But it is not real. It is an illusion. A profoundly helpful illusion if we believe the mainstream narrative about Western culture. But an illusion nonetheless. And it is only natural that we will encounter the intrusion of reality into that illusion. The question is whether we will extend the metaphor to encompass this new deviation or whether we will not. The American system can usually be counted upon to do what is in the interests of business. But this conflict puts business against itself so it is not clear how things will turn out.
Using this logic, every writer who learned to read and write by reading books, every artist who improved their craft by studying works, and every musician who learned the piano by practicing pieces is also "stealing" in whatever they "originally" create, having learned via pattern recognition from "tiny pieces of every work" in their data set. It's ridiculous to compare agents that generalize well to "stealing" pieces of the works they used to learn the generalizations. Obviously, if an artist memorizes a painting in their data set and reproduces it, or an AI spits out the exact image instead of original works based on what it has learned, then that is theft. But generalization is not theft - at least in my view. To assume otherwise leads to some very dysfunctional logical conclusions.
I think this is a fallacy that I see a lot in recent AI discussions. An LLM is not the same as a human brain. You might see some superficial similarities in both being able to produce a block of text, but the method by which the text is produced is entirely different. For example, we can't download entire libraries of books instantly to our brains and then reproduce those books word for word in memory. Things that operate at different scales and by different methods should have different regulations, in the same way a bike or a car is regulated differently from a truck or another piece of heavy machinery.
Also, humans can be, and often are, found liable for copyright infringement or piracy depending on how they conduct themselves. If a human were to reproduce a copyrighted book word for word, that would constitute copyright infringement regardless of whether it was done by rote memory, by copy and paste, or assisted by a black-box LLM. Even if a human paraphrases another work, they can still be found guilty of plagiarism if the paraphrase is overly similar to the original source material. A human can also be guilty of copyright infringement if they use a copyrighted work as source material in certain ways. If I steal a stock image without paying for a license and add it to my Photoshop collage, I might be found to have pirated or infringed on the original image creator's property.
LLMs are trained on copyright data and can often reproduce that copyright data. It's an open question how we regulate this.
I personally think it would be fair for an artist or author to say their work was not licensed to be used in training a neural net or otherwise request to opt out.
> For example, we can't download entire libraries of books instantly to our brains and then reproduce those books word for word in memory.
Yeah. Nobody's talking about word-for-word duplication here.
> If a human was to reproduce a copyrighted book word for word...
Again?
> Even if a human paraphrases another work they can still be found guilty of plagiarism if the paraphrase is still overly similar to the original source material.
Go look up the dictionary definition of plagiarism. Notice the most crucial element, which you seem to have omitted here, and also notice that it's irrelevant to AI systems, which overtly acknowledge that they exist to generate derivative works.
> If I steal a stock image without paying for a license
Here's another version of your "word-for-word" analogy, which nobody else is talking about.
> I personally think it would be fair for an artist or author to say their work was not licensed to be used in training a neural net or otherwise request to opt out.
I am genuinely curious: how do you propose to enforce this?
Another "humans are special and different" argument.
Consider an inevitable AGI with autonomy and no ties to a corporation. Can it not learn and write text, whether in conversation, creatively or academically? If it's held to the same copyright laws that humans are, then of course it's fine, in my mind. Hell, AI will be _better_ than humans at avoiding infringement, since it can store so much knowledge and check new output against existing works (searching the internet for similar ones).
If this is still a problem then doesn't this just boil down to racism against the machine?
> LLMs ... can often reproduce that copyright data. It's an open question how we regulate this.
why isn't existing copyright protection sufficient to regulate this? Photoshop can be used today to reproduce copyrighted data just as well.
This has nothing to do with "license to train". I do not believe existing copyright holders have this right granted to them by law - it is a right that is given to society for all works.
An artist learning a style, and producing another piece in the same style, is allowed today. This should be allowed, regardless of whether it is done via using an AI, or via years of training.
Using your logic, I could learn to read a book by saving each sentence to my database. Then, when requested by someone else, I could regurgitate the book sentence by sentence from my db. And also charge the person for it.
Before we can agree on anything, we have to define what qualifies a human as any of those things. We will be here all night debating that.
- "They call themselves an artist but they are not that good."
- "Copying and pasting shell scripts in a terminal does not a software developer make."
- "The story that writer created is yet another permutation of <insert tale as old as time>"
You see what I mean? There is a good probability that, today alone, a significant percentage of the content you saw online was AI generated, and you were none the wiser and thought nothing of it.
I see what you mean, but I think we are digressing. A human has certain rights while a machine doesn't. If we give AI learning rights, then who's to say a database is not like a human memorizing things at a highly accurate level, and as such we should make databases not liable for the storage of any copyrighted material. I don't mind AI generating code or art as long as the code and art it was trained on is not our publicly available but licensed work. Eager to try an AI trained on Microsoft's source code. After all, it would be like a human that worked at Microsoft and hopped to a better job.
> I don't mind AI generating code or art as long as the code and art it was trained on is not our publicly available but licensed work.
It sounds good, but mandating this will be the death of AI in the West. This is a relatively unprecedented situation, where the use of copyrighted works actually helps you build tools useful for doing work. Training on DeviantArt makes the AI better at Photoshop-esque tasks.
Any country that doesn't implement this restriction will immediately be able to produce smarter more useful AIs.
> we should make databases not liable for the storage of any copyrighted material
I think this would actually be allowed right now, legally speaking. That's basically a library or Google Cache. The hypothetical database wouldn't be expected to be super useful, because you can't "perform" any of the works outside of fair-use cases, but it's up in the air whether running inference on that data (Google snippets) or training an AI counts as a performance.
It isn't now, but as we approach an inevitable singularity, whether it's in 100 years or 1000 years, what then?
Are we gonna fall prey to the scifi trope of "let's all be racist to machines"? If so, then I wouldn't hold it against AGIs to fall prey to the scifi trope of "I am gonna be evil now, bye bye humans".
If the output is indistinguishable, how can this matter? If I publish a work, how can it be copyright infringement when generated by AI, but not if I came up with the exact same output myself?
How is it that LLMs launder consciousness but don't launder ownership?
The author's main point seems to be that bits are colored[1], and that the process of LLM training doesn't "bleach" models. In other words, ownership persists through training, in such a way that generated text mixes ownership rather than creating a new, ownable artifact.
They also seem to have (extrapolating the theme) a concept of "smelly" bits. "They can also communicate with a person in a way that resembles actual conversation." I.e., words have a "conscious" [smelly] quality imbued upon them if a human generated them, or an "unconscious" [unscented] one if a computer did.
This evokes a sense that "while they use words and form sentences like conscious people, they can't hold a conversation." Furthermore, this seems to be stated as an ontological fact, rather than something derived from and grounded in behavior, such as, "They aren't currently capable of holding a conversation because their output distribution doesn't match the output distributions of human conversations."
These two views, that models don't bleach color yet don't preserve smell, are fundamentally incompatible and diametrically opposed. How is it that color [ownership] persists through model training, but smell [consciousness] does not? We don't know; the author doesn't say. Jim just states both as fact without grounding either in rationale or justification.
Sure, we can take them as axiomatic, but then what's the point of the article? There's a recurring disappointment with property/consciousness articles: the tech types who write on ownership and consciousness really aren't adding anything creative to the conversation. This is an extremely basic critique of the piece that the author should have thought through before hitting the publish button.
It's interesting to compare two scenarios. The first is if you feed an LLM solely the works of a famous current author, say J.K. Rowling, and ask it to create a new work in the style of her franchise.
The second is if you include all of her works but also throw in every other book written in the last several years and ask it to create a work in the style of her franchise.
Is the latter somehow "better" than the former? I'd personally consider the former to be a clear case of copyright violation while with the latter we don't really have a way of knowing how much of her previous works were included. I'd still lean towards the latter being a copyright violation, and it's where I think we are currently with services like ChatGPT.
If you feed a 'young author' the works of a famous current author, e.g. Rowling, and ask them to write a new work in the style of that reference author...
Or if you include the current N 'Best Sellers' and ask them to write whatever's popular.
I'm not so convinced creativity is any more magical than a naturally evolved and refined version of what computers are starting to approach. Humans naturally add far more background, results filtering, and implicit selection parameters (personal biases / preferences).
Maybe the correct line is somewhere around: tracing (direct copying) is bad, but freehand (from memory) is OK. However, computers inherently have more perfect memory than humans; how precise is the detail from the source work? Is there a meaningful threshold where the memory of a work has decayed from being a representation of the source work?
I have a question along similar lines. Let's extend your analogy. Someone reads all of J.K. Rowling's books and creates a work in that style. Would that be a copyright violation? I'm assuming not.
In that case, is the difference here that it's not humanly possible to train on the world's information, compared to a single human getting inspired by specific works? And hence an algorithm doing that is a violation?
It's an interesting thought experiment but if the goal is to decide whether we should allow these types of AIs to be used for commercial purposes I think it's more helpful to ask a different question.
If we had these LLMs 30 years ago, would J.K. Rowling's books ever have existed?
Do you have a definitive answer to that question? Can you point to anyone who does?
All your question does is give people an excuse to make up alternate history in order to argue whichever side of the argument they already believe. It serves little purpose in the dialogue of defining AI and human creativity.
A wizardry school for preteens isn’t on its own particularly unusual. But the question becomes how similar the works would be if the original had never been published. You aren’t in the clear if a few-paragraph book summary would apply equally well to both works. There are the normal exceptions, of course: Spaceballs is making reference to other works, not just imitating them.
Except the book you’re referring to was infringing.
“Despite its reputation in Russia and the many books it has spawned, the series is not available in English translation, because of the first book having been judged a breach of copyright.” https://en.wikipedia.org/wiki/Tanya_Grotter
Anyway, many Russian works would be considered copyright infringement in other countries. The country however has minimal interest in respecting foreign copyrights. So if that’s your benchmark I can see why you might assume more leeway than actually exists.
I have two young children and while I was conversing with you I was cooking them dinner and getting them ready for bed.
Otherwise I would have done the bare minimum of reading the Wikipedia entry on the book. All I did was Google for “Harry Potter knockoff”, saw that the book was on Amazon, and thought that was enough.
Reading the Wikipedia article it is clear that this series took much more than just the idea of a wizardry school and copied verbatim plot elements. It was not the right example to use.
Here’s what I mean by style. Reggae music. How different is one reggae song from another? What about when a reggae musician sells the rights to one song and then write another song in their own style? Are they infringing on the previous work they just sold to a music publisher?
Here’s what happened when John Fogerty was sued for sounding like John Fogerty:
This logic seemed pretty sound to the jury. It only took two hours of deliberation for the jury to determine that the two songs didn’t meet the legal standard of being “substantially similar” that would have constituted copyright infringement. The Fogerty camp let out a collective “huzzah!”
Winning here doesn’t actually mean style across all art is safe, just safe in that specific instance:
The case was litigated rather than being thrown out because style is legally recognized as falling under copyright. “In 1993 the United States Court of Appeals for the Ninth Circuit shot down that appeal, though, on the same grounds—the original suit had been neither frivolous nor brought in bad faith.”
The Supreme Court didn’t disagree, they ruled on a different matter.
That Ms. Rowling and her publisher won doesn’t mean that style is covered by copyright either. It could just mean they had a more expensive legal team.
One of the key elements of copyright is this doctrine:
An adventure novel provides an illustration of the concept. Copyright may subsist in the work as a whole, in the particular story or characters involved, or in any artwork contained in the book, but generally not in the idea or genre of the story.
Also, let’s take a step back. What do you think is a fair outcome, that once John Fogerty sold the rights to one CCR song that he is no longer able to write music? Or that he needed to learn to write and play an entirely new genre?
Does the first music publisher to buy a reggae song now own the rights to every single reggae song produced after?
I am not arguing for anything, just describing the system as it exists.
> Or that he needed to learn to write and play an entirely new genre?
As the court case you mentioned shows genre isn’t specific enough to be problematic. I can’t draw a clear line in the sand and say this is safe because ultimately that’s up to the courts to determine in each instance. But, it is important to understand the general areas that are risky.
All it shows is that rights-holders with a large legal war chest can win specific cases. There is not a single musician or songwriter that I have met who wasn’t appalled by the Robin Thicke ruling. Luckily history didn’t repeat itself with Ed Sheeran.
If the entire music industry operated in such a manner it would suffocate creativity.
I think what you’re really trying to do is win an argument that supports your opinions on LLMs.
Generally speaking, the Robin Thicke case was an aberration. If it was the norm then the music industry would look incredibly different. As in, there wouldn’t be a music industry.
We can go back and forth cherry-picking court cases or we can discuss the doctrines that courts are encouraged to follow, like the idea-expression distinction.
An overwhelming number of copyright cases have been decided based on this doctrine! It is disingenuous to suggest otherwise!
I understand your perspective, but I think you’re still missing my point.
Winning a court case isn’t safe, getting the court case thrown out before it needs to be litigated is. You really don’t want to be in a position where someone with deep pockets can make a reasonable argument for infringement.
The question of LLM’s is really secondary. It’s likely they could win, but that in itself is very dangerous territory.
That’s clearly not what I said. To be a safe songwriter, don’t imitate other works. The lines are actually pretty lenient in practice, as long as you avoid actively copying from other works.
Are you a songwriter or a musician? If so, how can you make music without imitating any number of factors? Do you make your own instruments? Do you have your own harmonic system with your own scales?
I can't think of any creed that is more destined to result in terrible art than "playing it safe"!
EDIT: Oh, "don't imitate anyone" is a close second! The one-two punch of playing it safe and not imitating anyone is just, I mean, I can't even...
Look, if you can find several independent examples of something, then no, you’re not imitating someone specific. But if, out of millions of songs, only one person did X, don’t assume copying X is fine.
The line isn’t anywhere close to the hyperbole you and many other people are bringing up. “Don’t copy other people’s work” isn’t some unrealistic burden; it’s the basic foundation of copyright.
To anyone else who might come across this discussion and you make or want to make art, please don’t take this person’s advice.
I am a songwriter and a recording artist. I and every other musician I’ve ever worked with have conversed in the language of our predecessors. We constantly listen to other records while making our own and borrow drum patterns, guitar sounds, chord progressions… just as all of our favorite musicians did before us.
If you don’t do this and you go off on some quest for pure originality, you are engaged in pure narcissism and you will make terrible art.
Show me your favorite record and I can break down where every single element of every single song came from.
I'm going to guess that you have basically no experience making music. Here's just a couple of the records I've made:
Again, absolute originality isn’t the metric, any more than we expect people to write using the English alphabet without reusing any existing words.
However, your actual defense is a lack of serious earnings. If you habitually copy significant portions of your work, don’t assume it’s actually legal rather than simply not worth suing over.
If you’re concerned about actual credentials my connection to this stuff is through major motion pictures which means serious paranoia around IP. They are actually concerned about being sued and take actual precautions.
Again, it depends on what is meant by style. Another post in this thread suggested plot elements as part of style, a.k.a. “a wizardry school for preteens.”
Sharing a few surface-level elements like that is fine, but if a one-page description could equally well be used to describe both works, you’re well over the line.
> if a one page description could equally well be used to describe both works
Lots of Disney animations could’ve been described this way, and it’s because they took storylines from well-known folk tales and adapted them.
The idea of a teen wizard/witch going to a special school, fighting a villain, etc., doesn’t seem to be copyrightable. While the Russian ripoff version seems very similar and skirts the line between a copy and an original work, there’s plenty of room where such a story could’ve been told and not be an infringement.
Disney would be infringing on copyright except the works are in the public domain so it doesn’t matter. The Russian example was also ruled to be copyright infringement even though the main character was a girl and various other aspects of the story didn’t exactly match up.
While technologically unavailable (and maybe intractable), I have an impulse to say that in theory we could do something a little like differential privacy, where we look at the likelihood you got your output with the model you have versus a version of your model trained with (say) Rowling's work excluded.
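A toy sketch of that leave-one-out idea in Python, with invented numbers: `influence_score` is a hypothetical name, and in practice you'd need per-token probabilities from both the full model and a counterfactual model retrained without the disputed works (the expensive part):

```python
import math

def log_likelihood(token_probs):
    # Sum of log-probabilities the model assigned to each output token.
    return sum(math.log(p) for p in token_probs)

def influence_score(full_model_probs, ablated_model_probs):
    # Log-likelihood ratio between the full model and one trained with
    # the disputed works excluded. A large positive score suggests the
    # output leaned on the excluded training data.
    return log_likelihood(full_model_probs) - log_likelihood(ablated_model_probs)

# Invented per-token probabilities: the full model finds the output
# far more likely than the Rowling-free model does.
score = influence_score([0.9, 0.8, 0.95], [0.1, 0.2, 0.05])
```

Whether such a score could ever be computed cheaply enough to matter is exactly the tractability worry above.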
What a clueless article. Almost every new, lasting technology in history has at one point been labeled criminal or potentially destructive to mankind. From cassette and VHS tapes to radio, computers, and the Internet, everything at some point was considered for regulation or deemed irrelevant.
Humans learn by "theft". We learn language by observing word patterns created by others (books, speeches, TV shows) and correlating them to outcomes. While ChatGPT doesn't yet display human-like intelligence, it's clear to me that it is taking a step in the direction of how the human brain actually learns.
I was thinking about this. Depending on how they "crawl" content, figure out a way to fingerprint their crawlers, and when they hit your site, just return disgustingly explicit romance-novel copy.
> "How do I make a peach cobbler?"
"To make a peach cobbler, run your fingers aggressively over the bosom of the graham cracker crust...thrust the pie pan deep into the oven; don't worry about spilling the mix."
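A minimal sketch of that decoy idea in Python; the user-agent substrings and page text below are illustrative guesses on my part, not a vetted fingerprint list:

```python
# Substrings that (hypothetically) identify AI-training crawlers.
SUSPECTED_CRAWLERS = ("GPTBot", "CCBot", "Google-Extended")

REAL_PAGE = "To make a peach cobbler, preheat the oven to 375F and ..."
DECOY_PAGE = ("To make a peach cobbler, run your fingers aggressively "
              "over the bosom of the graham cracker crust...")

def respond(user_agent: str) -> str:
    # Serve the decoy to anything that looks like a training crawler.
    ua = user_agent.lower()
    if any(bot.lower() in ua for bot in SUSPECTED_CRAWLERS):
        return DECOY_PAGE
    return REAL_PAGE
```

In practice crawlers can spoof user agents, so a real fingerprint would need IP ranges or behavioral signals rather than a string match.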
> Good argument, especially since many intangible goods, such as MP3 files, are non-rivalrous and non-excludable.
so the deeper question is 'what does it mean to own something'.
For a physical good, it's fairly straightforward.
For a digital good that can be copied infinitely and perfectly, it's not. The current definition of ownership seems to be broken down into a set of "rights". Sharing it with somebody else is one such right.
So when you purchase an mp3 off a store, you're purchasing not the ownership, but the rights to listen to it. More importantly, you're not purchasing the rights to share it with somebody.
But the idea of the song - whatever that might be - is not a right given to the owner to be sold (at least not currently). Therefore, someone who purchased the right to listen to the song _should_ be allowed to learn the idea of the song, and reproduce the idea in another form.
> More importantly, you're not purchasing the rights to share it with somebody
As I understand it, the Audio Home Recording act legalized mix tapes and mixes on DAT and "music" CD-R media. It's why "music" CD-Rs cost more (the DART royalty.)
In fact Apple still provides software (Music/iTunes) that allows you to burn purchased songs from the iTunes Store onto CDs:
Initially they had placed limits on how many times you could burn the same song, but Apple removed the restrictions and eventually upgraded all of iTunes to DRM-free (but watermarked) iTunes Plus in 2009.
Unfortunately it does not support burning "rented" songs from Apple Music or Spotify. I guess that's what Audio Hijack is for.
And sadly iTunes never got an upgrade to a lossless format while Apple Music streaming did. I guess that's a good reason to buy tracks from other sources.
Those mix tapes or CD-R burned tracks aren't meant for distribution, but for personal consumption, IMHO. If you tried selling mix tapes, I think it would constitute copyright infringement.
I think this boils down to an argument against the copyrightability of LLMs trained on public data. If all the data in the LLM was public then how can you assert that your copy is proprietary just because the copy is in a very novel type of lossy compression encoding? If I take someone’s art and JPEG compress it or chop it up and store it in some kind of database do I own it now?
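For the lossless baseline of that argument, a tiny Python sketch: chop a work into chunks and store each in an unfamiliar compressed encoding, and it remains trivially the same work. A lossy encoding only blurs where the line sits, not the principle:

```python
import zlib

original = b"A passage from someone else's copyrighted novel."

# A "novel encoding": split into 8-byte chunks and compress each one.
chunks = [zlib.compress(original[i:i + 8]) for i in range(0, len(original), 8)]

# The stored bytes look nothing like the source, yet the work is fully
# recoverable, so the re-encoding alone created nothing new to own.
recovered = b"".join(zlib.decompress(c) for c in chunks)
```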
I welcome dissonant viewpoints, but I'm going to have to object to this one on the grounds of all works being derivative. This is just... mass derivation basically, but so is a creative-writing human.
I also object to its clickbaity format of "[contentious thing A] is not [completely unproven hyperbolic thing B]; it is [completely unproven hyperbolic thing C]".
A group of entities (for example D$$$$y, but also others) were successful in having a monopoly embedded in law.
They were also successful in brainwashing people into thinking this is a good thing. Phrases like "everyone should get paid for their hard work" are thrown around. And the brainwashed masses are frustrated to realize that they get only pennies for their art. They are embittered that they can't stop others from using their art. They keep shouting that word: theft.
Art is fundamentally derivative. It is my deep conviction that nobody has the right for a monopoly on usage, even that group, even if law says otherwise.
But hard work deserves compensation! I hear the cry.
Look at the more famous YouTube producers. They get compensation, but at what price? They don't work for themselves anymore. While creating stuff, they think about how to reach even more people. They lost the truth. Instead of truth they say, "don't forget to click the bell."
Or look at the people working for free like parents caring for their children. They cook, clean, educate not for a lot of material gains, even worse, they have people telling them that what they do is not real work.
There are other examples of hard working people not getting compensation nor recognition or not a lot.
Something that’s lost in these arguments a lot is that you can often get large models to cough up an exact sample from their training set. It’s fair to say that a human cannot do that to the degree a model can.
By the same token, if I could memorize very well, read as much as GPT did, and make a living from that knowledge, that could also be theft; but it doesn't apply to me because I don't do it at scale.
You know how many times I thought I came up with a good idea... and then realized I was actually vomiting up something that flashed by me for a split second, but my subconscious picked up on? Hm.
It doesn't chop text content into bits and reproduce it; that's absurd. It's a text prediction engine: it predicts what tokens come next based on some context.
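As a crude illustration of next-token prediction, here's a bigram counter in Python. Real LLMs use learned neural representations rather than explicit counts, but the shape of the task is the same: store continuation statistics, not documents:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    # Count how often each word follows each other word.
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    # Predict the continuation seen most often in training.
    return counts[word].most_common(1)[0][0]

model = train_bigram("the cat sat on the mat and the cat slept")
# After "the", the most common continuation in this tiny corpus is "cat".
```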