The logical endgame of all this isn’t “stopping LLMs.” It’s Disney happening to own a critical mass of IP, enough to more or less exclusively and legally train and run LLMs that make movies, firing all their employees, while no smaller company ever has a chance in hell of competing with a literal century’s worth of IP powering a generative model. This turns the already egregiously generous backward-facing monopoly into a forward-facing monopoly.
None of this was ever the point of copyright. The best part about all this is that Disney initially took off by… making use of public domain works. Copyright used to last 14 years; you’d be able to create derivative works of most of the art in your life at some point. Disney is ironically the proof of how constructive a system that regularly turns works over to the public domain can be. But thanks to lobbying by Disney, now you’re never allowed to create a derivative work of the art in your life.
Copyright is only possible because we the public fund the infrastructure necessary to maintain it. “IP” isn’t self manifesting like physical items. Me having a cup necessarily means you don’t have it. That’s not how ideas and pictures work. You can infinitely perfectly duplicate them. Thus we set up laws and courts and police to create a complicated simulation of physical properties for IP. Your tax dollars pay for that. The original deal was that in exchange, those works would enter the public domain to give back to society. We’ve gotten so far from that that people now argue about OpenAI “stealing” from authors, when the authors most of the time don’t even own the works — their employers do! What a sad comedy where we’ve forgotten we have a stake in this too and instead argue over which corporation should “own” the exclusive ability to cheaply and blazingly fast create future works while everyone else has to do it the hard way.
If I thought that nobody had a chance in hell of competing with generative models compiled by Disney from its corpus of lighthearted family movies, I'd be even less keen to give unlimited power to create derivative works out of everything in history to the companies with the greatest amount of computing power, which in this case happens to be a subsidiary of Microsoft.
All property rights depends on public funding the infrastructure to enforce them. If I believed movies derived from applying generative AI techniques to other movies were the endgame of human creativity, I'd find your endgame of it being the fiefdom of corporations who sold enough Windows licenses to own billions of dollars worth of computer hardware even more dystopian than it being invested in the corporations who originally paid for the movies...
1. You are assuming that "greatest computing power" is a requirement. I think we're actually seeing a trend in the opposite direction with recent generative art models: It turns out consumer grade hardware is "enough" in basically all cases, and renting the compute you might otherwise be missing is cheap. I don't buy this as the barrier.
2. Given #1, I think you are framing the conversation in a very duplicitous manner by pitching this as "either Microsoft or Disney - pick your oppressor". I'd suggest that breaking the current fuckery in copyright, and restoring something more sane (like the 14 + 14 year original timespans), would benefit individuals who want to make stories and art far more than it would benefit corporations. Disney is literally THE reason for half of the current extensions in timespan. They don't want reduced copyright - they want to curtail expression in favor of profit. This case just happens to have a convenient opponent for public sentiment.
---
Further - "All property rights depends on public funding the infrastructure to enforce them" is false. This is only the case for intellectual property rights, where nothing need be removed from one person for the other to be "in violation".
I'm assuming greater computing power is a requirement because creating generative feature-length movies (a few orders of magnitude more complex than creating PNGs) is something only massive corporations can afford the computing power for at the moment (and the implied bar for excellence is something we haven't reached). Certainly computing power and dev resources are more of a bottleneck to creating successful AI movies than not having access to the Disney canon, which was the argument the OP made for anything other than OpenAI having unlimited rights over everyone's content leading inexorably to a Disney generative AI monopoly.
(another weakness of that argument is that I'm not sure the Disney canon is sufficient training data for Disney to replace its staff with generative movies, never mind necessary for anyone else to ever make a marketable-quality movie again)
Given #1, I think the OP is framing the conversation in a far more duplicitous manner by assuming that in a lawsuit against AI which doesn't even involve Disney, the only beneficiary of OpenAI not winning will be Disney. Disney extending copyright laws in past decades has nothing to do with a 10-year-old internet company objecting to OpenAI stripping all the copyright information off its recent articles before feeding them into its generative model.
> Further - "All property rights depends on public funding the infrastructure to enforce them" is false. This is only the case for intellectual property rights, where nothing need be removed from one person for the other to be "in violation".
People who don't respect physical property are just as capable of removing it as people who don't respect intellectual property are capable of copying it. In both cases the thing that prevents them doing so is a legal system and taxpayer funded enforcement against people that don't play by the rules.
> All property rights depends on public funding the infrastructure to enforce them
Still true, because people generally depend on the legal system and police departments to enforce physical property rights (both are publicly funded entities).
All property rights absolutely depend on public infrastructure. The only thing keeping your house in your name is the legal system enforcing your right to it.
Either copyrights exist, and people can't copy creative works "owned" by somebody else, or copyrights don't exist and people can copy those at will.
"Copyrights exist, and people can copy others works if they have enough computing power to multiplex it with other works and demultiplex to get it back" is not a reasonable position.
I'm all for limiting it to 15 or 20 years, and requiring registration. If you want to completely end them, I'd be ok with that too (but I think it's suboptimal). But "end them to rich people" isn't acceptable.
> Either copyrights exist, and people can't copy creative works "owned" by somebody else, or copyrights don't exist and people can copy those at will.
That's not how copyright works; it's not a binary thing. Also, it's similar but not the same in every jurisdiction. You can make partial copies, you can make full copies as personal backup, and you can make copies to transform copyrighted material (like creating art and parodies).
These cases are going to decide whether Google Books was a fluke or whether there is indeed a limit to the power of the big copyright holders (not the artists/creators: those keep on starving, except for a few lucky ones).
> Either copyrights exist, and people can't copy creative works "owned" by somebody else, or copyrights don't exist and people can copy those at will.
Like most simple binaries, this is a false dichotomy, and not only do more options exist in possibility, but neither of those matches the overt state of the law (where copyrights exist, but so do a range of caveats and exceptions, so people can copy and otherwise make use of works by others without permission under certain circumstances, but not at will, satisfying neither of the two options you present as exhaustive of all possibilities.)
ClosedAI etc. are certainly stealing from open source authors and web site creators, who do own the copyright.
That said, I agree with putting more emphasis on individual creators, even if they have sold the copyright to corporations. I was appalled by the Google settlement with the Authors Guild: why does a guild decide who owns what and who gets compensation?
Both Disney and ClosedAI are in the wrong here. I'm the opposite of a Marxist, but Marx's analysis was frequently right. He used the term "alienation from one's work" in the context of factory workers. Now people are being alienated from their intellectual work, which is stolen, laundered and then sold back to them.
I don't think you need to be a Marxist to accept that his observation that people are being alienated from their work capacity is spot on.
The "Marxsist" name is either about believing on the parts that aren't true or about the political philosophy (that honestly, can't stand by its own without the wrong facts). The ones that fit reality only make one a "realist".
I mean, not to be that guy, but multiple Marxist and Marxist-adjacent people I know (and am) have been out here pointing out that this was exactly and always what was going to happen, ever since the LLM hype cycle really kicked into high gear in mid-2023. And I was told in no uncertain terms, many times, on here, that I was being a doomer, a pessimist, a luddite, etc., because I and many like me saw the writing on the wall immediately: while generative AI represented a neat thing for folks to play with, it would, like every other emerging tech, quickly become the sole domain of the monied entities that already run the rest of our lives, and this would be bad for basically everyone long term.
> But thanks to lobbying by Disney, now you’re never allowed to create a derivative work of the art in your life
As far as I can tell the only copyright term extension that might have been influenced by Disney lobbying in the US is the Copyright Term Extension Act of 1998, which extended the term from life+50 to life+70 (or from 75 to 95 years for works of corporate authorship).
The switch from fixed terms to life+50 came with the Copyright Act of 1976, which had nothing to do with Disney. They were probably for it, but so was nearly everybody, because it laid the groundwork for the US joining the Berne Convention and made its copyright system much more compatible with that of most other countries.
As far as copyright law outside the US goes, most countries were on life+50 or longer before Disney even existed.
> Me having a cup necessarily means you don’t have it. That’s not how ideas and pictures work. You can infinitely perfectly duplicate them.
This is a stupid argument, no matter how often it comes up.
If I hire Alice to come to my sandwich shop and make sandwiches for customers all week and then on payday I say, "Welp, no need to pay you—the sandwiches are already made!" then Alice is definitely out something, and I am categorically a piece of shit for trotting out this line of reasoning to try to justify not paying her.
If I do the same thing except I commission Alice to do a drawing for a friend's birthday, then I am no less a piece of shit if I make my own copy once she's shown it to me and try to get out of paying since I'm not using "her" copy.
(Notice that in neither case was the thing produced ever something that Alice was going to have for herself—she was never going to take home 400 sandwiches, nor was she ever interested in a portrait of my friend and his pet rabbit.)
If Alice senses that I'd be interested in the drawing but might not be totally swayed until I see it for myself, and she proactively decides to make the drawing upfront before approaching me, then it doesn't fundamentally change the balance from the previous scenario—she's out no less in that case than if I approached her first and then refused to pay after the fact. (If she was wrong and it turns out I didn't actually want it because she misjudged, and she will not be able to recoup her investment, fair. But that's not the same as if she didn't misjudge and I come to her with this bankrupt argument of, "You already made the drawing, and what's done is done, and since it's infinitely reproducible, why should I owe you anything?")
Copyright duration is too long. But the fundamental difference between rivalrous possession of physical artifacts and infinitely reproducible ideas really needs to stay the hell out of these debates. It's a tired, empty talking point that doesn't actually address the substance of what IP laws are really about.
This isn't really an argument though. It's an assertion that not honoring a commission agreement (or an employment contract) is equivalent to not paying for a license to an existing work. I tend to disagree. I could be persuaded otherwise, but I'd need to hear an argument other than "clearly these are the same thing."
> This isn't really an argument though. It's an assertion that not honoring a commission agreement
Wrong. It's that (not honoring an agreement negotiated beforehand) and an argument against treating past-action-thing as inherently zero-cost and/or zero-value; the fact that a prior agreement is an element in the offered scenarios doesn't negate or neutralize the rest of it (just like the fact that a sandwich shop is an element in one of the scenarios doesn't negate or neutralize the broader reality for non-sandwich-involving scenarios).
And that's before we mention: there _is_ such a prior agreement in the case of modern IP—you have to contend with the fact that if Alice is operating in the United States, which has existing legislation granting her a "temporary monopoly" on her creative output, and she generates the output on the basis that she'll be protected by the law of the land, and then you decide that you just don't agree with the idea of IP, then Alice is getting screwed over by someone not holding up their end of the bargain.
Agree with the sibling: committing fraud by intentionally not honoring a contract is not morally or logically the same as duplicating a piece of media under copyright. That is not to say that copyright violations are harmless (the scale and intent matter), but details can't be ignored.
A material difference between fraud and copyright violations as categories is the presence of lost profit. With fraud one has lost the time value of their work, but with media piracy there is some research (funded by the EU of all things) that it doesn't trade off with sales and may even help some sales.
You wanna, like, actually digest what I wrote there? The second comment here is so unlike the first that your "Saying it over and over again" remark can only lead to the conclusion that you either didn't read it or didn't grok it. They're two different comments about two different things.
> I'm sorry
Are you? I think you mixed up the words "insincere" and "sorry".
If you hire Alice, that means there is a contract you both have agreed to and need to honor. If Alice just shows up in your kitchen making burgers, she doesn't get to tell you what to do with the burgers after you kick her out. With copyright there is no explicit contract you can choose to enter. Instead everyone is effectively forced into a contract with every creator, a contract that is unconscionably biased to benefit the creator.
Do you think it would be reasonable for Mallory to sell burgers and then demand that if you share some of them with your friend you need to seek her permission? And of course, since the burger becomes part of your body, then perhaps Mallory should have a say in what you can do with that too, and can extract some fee for you existing after eating her burgers. That's how copyright is usually (mis)used - to extract rent in perpetuity for work that was done long ago. This kind of business model just doesn't exist outside of IP. It's entirely artificial.
> But the fundamental difference between rivalrous possession of physical artifacts and infinitely reproducible ideas really needs to stay the hell out of these debates. It's a tired, empty talking point that doesn't actually address the substance of what IP laws are really about.
On the contrary, it is a very important point. We don't have burgers just sitting around to feed everyone for their entire lives. We do have all kinds of art and entertainment, as well as productivity tools, that have essentially infinite free copies. We don't really NEED to artificially encourage more creation for a lot of these, whereas if people stopped producing food everyone would be in big trouble.
I'm answering in reverse order, because I think a lot of this comment covers stuff that we don't really disagree about. Thus I will answer the conclusion, and then I've put my responses to everything else below; I find them interesting, but not required for what I want to convey.
> Copyright duration is too long. But the fundamental difference between rivalrous possession of physical artifacts and infinitely reproducible ideas really needs to stay the hell out of these debates. It's a tired, empty talking point that doesn't actually address the substance of what IP laws are really about.
I would argue that it is perhaps the opposite of tired: ironically, it was less relevant in the past and becomes more relevant as technology advances and mere thought experiments become practical reality. I think many of these issues weren't dealt with in the past because these edge cases existed as mere hypotheticals. Kind of like a mathematician saying that copyright doesn't make any sense because he could write a program that iterates through all books: lawyers just roll their eyes, not because they have a counter-argument, but because they don't think that scenario exists as something they'd ever have to deal with. I think the idea of a computer that reads all the text in the world and learns from it is definitely tied to unresolved questions about the nature of data, but until very recently it would have been considered an annoying hypothetical in a serious discussion about copyright, allowing us to actually dismiss it and continue not addressing it.
We all agree that copyright is too long. And I also think this would just become a non-issue if we had a reasonable duration for copyrights. Even if you philosophically disagreed with it, it wouldn't be worth arguing over vs. just waiting it out.
> This is a stupid argument, no matter how often it comes up.
I knew bringing this up would rekindle these arguments from 20 years ago, but it was necessary for a later point, so I was hoping I was making it value-neutral enough that it wouldn't trigger this, but I guess I was wrong.
To be clear, I am not making the same argument you have seen several times before. I am making a strictly weaker argument. The only goal of this distinction is to demonstrate that these properties are "different", and that the law aims to make "intellectual property" behave like physical property. Notice for example that I didn't then assert that IP thus doesn't exist. I didn't even argue whether this goal of matching the behavior was good or bad. I am simply stating that it doesn't by default behave the way we seem to want it to, and, people don't seem to intuitively ascribe the same morality to it either. My only intention is to make the point that this goal thus requires work, and (as I'll explain in more detail below), more work than in the physical case. So far I don't think there is anything necessarily unreasonable about this as a set of premise conditions for establishing the terms under which the public at large agrees to take on the costs of maintaining said system.
> A bunch of stuff about Alice making sandwiches and drawing pictures
Disclaimer: I don't think we're really in disagreement about the important points, and I don't think this section is relevant to the important points which I return to below, however I find it intellectually interesting to talk about, so I have a retort here, which I believe is just an unrelated digression
This analysis of Alice making sandwiches and drawings (IMO) misses the actual meaningful differences in these scenarios, since it (IMO) focuses on the uncontroversial, but also irrelevant, breach-of-contract issues. In both these scenarios, the issue is not really the "property"; it is the refusal to comply with a previously agreed arrangement. You can see this if we add a third scenario where I pay Alice to do jumping jacks for a week, she does them, and then I refuse to pay at the end of the week. No need to pay you, you already did the jumping jacks! No one "got" anything here, other than I guess "satisfaction" or "exercise". We can make the example even more abstract by having me pay Alice to do nothing all week; she once again does a great job by sitting quietly in her room, and then I once again don't pay her. The sandwiches and drawings are just props in the original examples -- they're not actually necessary, since this is a contract question, not a theft question.
The actual interesting aspects of the sandwiches and drawings are 1) what happens much after this transaction, and 2) what happens with third parties. With the sandwiches, "what happens after" is straightforward: I either eat the sandwiches, resell them immediately, or they go bad. There's not much interesting there; no one needs to think hard about the "ramifications" of the sale of the sandwiches. Compare this to the drawing. What if, after I have paid you just like we agreed, I proceed to make my infinite copies? You might think that's not fair -- you thought you'd have a repeat customer. I assumed I was free to do as I please with the drawing. In fact, ironically enough, in this instance if I treat the drawing like physical property, where the expectation is I can do as I please with it, it creates this conundrum, because "putting the paper in the photocopier" is in the set of "do as I please". But let's go one step further: what if I make all those copies and then sell them?
I'm sure you'll now respond that the royalties or usage rights were all implied in your original story. Great! But that's my point: those were required. You needed a supremely complex web of laws and binding contracts (and litigation if they aren't followed) as a necessary component of that transaction, due to the existence of degrees of freedom that simply don't exist for the sandwiches. You can write up a contract around the resale of a sandwich, but most sandwich shops don't, because me eating sandwiches for the rest of my life by copying the original sandwich isn't a realistic scenario (so there's no need to price that into the original cost of the sandwich), and me out-sandwiching you by carbon-cloning the sandwich isn't feasible -- and even if it were, it would still have material ingredient costs that would bound its effect on my shop -- and even "figuring out the recipe" isn't that much of a worry, since you still need to, like, buy ingredients and make sandwiches as opposed to hitting paste over and over. These scenarios are dramatically different, and that's why sandwich shops usually don't employ lawyers but design shops do. And again, we didn't even go into third parties. What if someone manages to make a copy of your image just as you're handing it to the client? Now both of you are in compliance with your deal, and neither of you is angry at the other, but there's this weird situation where you were never expecting to get money from me, yet I have a copy of the picture now, and it's really hard to reason about what that means in terms of "gain" and "loss" if I never do anything other than hang it up in my room. This is simply not possible with the sandwich; no one could quickly "copy the sandwich" in transit and potentially introduce an entirely new threat to your business.
Again, my only point here is that it seems very strange to insist that physical property is identical to intellectual property, and that it isn't fairly complicated to make intellectual property approximate the relationships we have with physical property. And to be clear, nothing derogatory has been said about this goal yet; you could take everything I've written in this comment so far and use it as part of an argument for copyright. However, it is important precisely because of the explosion of complexity in possibilities that simply don't exist for the vast majority of physical items.
You're comfortable acknowledging that the relevant principle isn't specific to sandwich shop workers or women named Alice and that what's at issue in the example provided is something more general than either of those two details. You're insisting, though, that it is as specific as breaking a prior agreement and nothing broader, even though that, too, was contrived just the same in order to flesh out the example with detail. That unwillingness is an error.
I'm not merely comfortable acknowledging it, I specifically took the time to demonstrate how the outcome was property-type-independent by explicitly going over what happens when we remove the property from the example but leave everything else unchanged. If you contend that the breaking of the agreement is somehow equally superfluous, then I think it's on you to demonstrate that through a similar analysis. You seem pretty confident this is the case, but I am skeptical given that the story only really had two components: the property in question, and the agreement with regard to the property. They can't both be inconsequential details fleshing out an otherwise empty story, right?
But either way, I think the point that immediately follows that is even more important, right? The fact that the nature of the ownership, even in a "successful" transaction, is incredibly more complicated with the drawing. How we don't even properly understand how much "ownership" you have of the drawing without a contract. How that transaction potentially puts you in direct competition with Alice in the future. Etc., etc. Again, the entirety of my position is the fairly narrow set of statements that: 1) intellectual property is fundamentally different from physical property, 2) you thus cannot simply model intellectual property transactions by merely pretending you're dealing with physical objects (since there's fundamentally more dimensionality and ambiguity without explicitly outlining and agreeing to way more terms and details), and 3) intellectual property thus naturally requires significant infrastructure in order to create an environment that gets anywhere close to simulating the same "physical-like" properties. I don't think that's controversial.
I find "this wasn't the point of copyright", referring to the motivations of 18th century legislators, unpersuasive. They were making up rules that were good for the foreseeable future, but they didn't foresee everything and certainly not everyone being connected to a global data network.
Persuasive arguments should focus on what's good for the world today.
I hate to break it to you, but we continued that pattern of making up rules. The main difference is that we let lobbyists play a bigger role over time. The updated rules (life of the author plus 50 years) were passed in 1976, so I don't think they had global data networks in mind either. But perhaps you believe neither side has presented a persuasive argument.
I will however say that I think my comment was not just an appeal to authority. Again, I think the fact that using public domain works was critical to Disney's early success is a fairly important data point, especially considering some of those works would not have been allowed to be used under the current lifespans (e.g. Pinocchio's copyright would have lasted until 1960, 20 years after the film premiered).
But again, the most important thing I want taken away from this is that we the "consumers" of the content should not consider ourselves bystanders, but understand that we have an active stake here as well. Your first sentence is perhaps more important than you realize: making up the rules wasn't some one-off incidental property of being first to the table. We could choose to make up the rules too, so we should act like it, as opposed to trying to "deduce" the ownership of a sentence. This is unique; we don't have that ability with physical property. We can't simply declare that everyone gets a Ferrari tomorrow and then have them magically appear in everyone's garage. But we could declare that everyone gets the rights to Superman tomorrow, and they would "magically just have them".
There's no real baseline here. We should just weigh the pros and cons. The fashion industry operates more or less copyright-free. The infrastructure to enforce copyright has real costs, not to mention all the collateral damage from the abuse of copyright takedowns this system brings along with it. And any sort of appeal to authorship is also highly suspicious given that authors rarely end up owning these rights. Every time one of these Marvel movies comes out, there's a mini outcry when people see that the guy whose comic the movie is based on is just some dude who gets nothing from the making of the movie. On the flip side, we take for granted that every public domain character was of course at one point created: Robin Hood, Zorro, Dracula, Sherlock Holmes. Are we unhappy with the diversity of adaptations we've gotten from these? Would it be that Earth-shattering if Harry Potter joined that list? As things stand right now, no one on this website will likely ever get to legally publish "their take on Harry Potter"; the clock doesn't start ticking until after JK Rowling dies. It would have entered the public domain in 2011 under the original rules. In case you're curious, her net worth in 2011 was $500M, if you want to factor that into whether you think that would have been "fair" (and it's not like she stops making money at that point; it's just that other people start to be able to do stuff with the first book). I think it is worthwhile to imagine a different approach to this.
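For anyone double-checking the dates thrown around in this thread, the arithmetic behind the two regimes is simple enough to sketch (the function names and the no-renewal assumption behind the 2011 figure are mine):

```python
def expiry_original(pub_year: int, renewed: bool = False) -> int:
    """Original US regime: 14 years from publication,
    renewable once for another 14 (Copyright Act of 1790)."""
    return pub_year + (28 if renewed else 14)


def expiry_life_plus_70(death_year: int) -> int:
    """Current US regime for individual authors: life of the author + 70 years."""
    return death_year + 70


# Harry Potter and the Philosopher's Stone, published 1997:
# 1997 + 14 = 2011, the "public domain in 2011" figure above.
print(expiry_original(1997))

# Pinocchio: Carlo Collodi died in 1890, so under life+70 the book's
# copyright would have run until 1890 + 70 = 1960.
print(expiry_life_plus_70(1890))
```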
AI is absolutely a further wealth concentrator by its very nature. It will not liberate the bottom 3/4, it will not free up their time by allowing them to work a lot less (as so many incorrectly predict now). Eric Schmidt for example has some particularly incorrect claims out there right now about how AI will widely liberate people from having to work so many hours, it will prove laughable in hindsight. Those that wield high-end AI, and the extreme cost of operations that will go with it, will reap extraordinary wealth over the coming century. Elon Musk style wealth. Very few will have access to the resources necessary to operate the best AI (the cost will continue to climb over what companies like Microsoft, Google, Amazon, OpenAI, etc are already spending).
Sure, various AI assistants will make more aspects of your life automated. In that sense it'll buy people more time in their private lives. It won't get most people a meaningful increase in wealth, which is the ultimate liberator of time. That is, financial independence.
And you can already see the ratio of people that are highly engaged with utilizing the latest LLMs, paying for them, versus those who rarely or never use them (either not caring about or interested in utilizing them, or not understanding how to do so effectively). It's heavily bifurcated between the elites and everybody else, just as most tech advances have been so far. A decade ago a typical lower / lower-middle class person could have gone to the library, learned JavaScript, and over the course of years dramatically increased their earning potential (a process that takes time, to be clear); for the same reason that rarely happens of anyone's own volition, they also will not utilize LLMs to advance their lives despite their wide availability. AI will end up doing trivial automation tasks for the bottom 50%. For the top ~1/4 it will produce enormous further wealth from equity holdings and business-process productivity gains (boosting wealth from business ownership, which the bottom 50% lacks universally).
Copyright laws, in many ways, feel outdated and unnecessarily rigid. They often appear to disproportionately favor large corporations without providing equivalent value to society. For example, brands like Disney have leveraged long-running copyrights to generate billions, or even tens of billions, of dollars through enforcement over extended periods. This approach feels excessive and unsustainable.
The reliance on media saturation and marketing creates a perception that certain works are inherently more valuable than others, despite new creative works constantly being developed. While I agree that companies should have the right to profit from their investments, such as a $500 million movie, there should be reasonable limits. Once they recoup their costs, including a reasonable profit multiplier, the copyright could be considered fulfilled and should expire.
Holding onto copyrights indefinitely or for excessively long periods serves primarily to sustain a system that benefits lawyers and enforcement agencies, rather than providing meaningful value to society. For instance, enforcing a copyright from the 1940s for a multinational corporation that already generates billions makes little sense.
There should be a balanced framework. If I invest significant time and effort—say 100 hours—into creating a work, I should be entitled to earn a reasonable return, perhaps 10 times the effort I put in. However, after that point, the copyright should no longer apply. Current laws have spiraled out of control, failing to strike a balance between protecting creators and fostering innovation. Reform is long overdue.
I am personally in favor of strong, short copyrights (and patents). 90+ year copyrights are just absurd. Most movies make almost all their money in the first 10 years anyway, and a strong 10- or 20-year copyright would keep the economics of movie and music production largely the same.
I think the biggest change would be that if characters and stories lose their copyright after 10 years then you would see even more sequels and remakes because anyone could make their own Star Wars sequel series for example.
Is there a way to figure out if OpenAI ingested my blog? If the settlements are $2,500 per article, then I'll take a free used car's worth of payments if it's available.
I suppose the cost of legal representation would cancel it out. I can just imagine a class action where anyone who posted on blogger.com between 2002 and 2012 eventually gets a check for 28 dollars.
If I were more optimistic, I could imagine a UBI funded by lawsuits against AGI, some combination of lost wages and intellectual property infringement. Since you can't figure out exactly how much more impact an article on The Intercept had on shifting the weights than your Hacker News comments did, you might as well pay everyone equally, since we're all equally screwed.
If you posted on blogger.com (or any platform with enough money to hire lawyers) you probably gave them a license that is irrevocable, non-exclusive and able to be sublicensed.
There are reasons for that (they need a license to show it on the platform) but usually these agreements are overly broad because everyone except the user is covering their ass too much.
Those licenses will now be used to sell that content/data for purposes that nobody thought about when you started your account.
Wouldn't the point of the class action be to dilute the cost of representation? If the damages per article are high and there are plenty of class members, I imagine the limit would be how much OpenAI can actually pay out.
> It is unclear if the Intercept ruling will embolden other publications to consider DMCA litigation; few publications have followed in their footsteps so far. As time goes on, there is concern that new suits against OpenAI would be vulnerable to statute of limitations restrictions, particularly if news publishers want to cite the training data sets underlying ChatGPT. But the ruling is one signal that Loevy & Loevy is narrowing in on a specific DMCA claim that can actually stand up in court.
> Like The Intercept, Raw Story and AlterNet are asking for $2,500 in damages for each instance that OpenAI allegedly removed DMCA-protected information in its training data sets. If damages are calculated based on each individual article allegedly used to train ChatGPT, it could quickly balloon to tens of thousands of violations.
Tens of thousands of violations at $2,500 each would amount to tens or even hundreds of millions of dollars in damages. I am not familiar with this field; does anyone have a sense of how the total cost of retraining (without these alleged DMCA violations) might compare to these damages?
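For a sense of scale, here's a back-of-the-envelope sketch of the statutory-damages arithmetic. The $2,500 figure is from the suits quoted above; the article counts are purely illustrative assumptions, not numbers from any filing:

```python
# Statutory damages per article, per the Raw Story / AlterNet / Intercept suits.
DAMAGES_PER_ARTICLE = 2_500

# Illustrative article counts only (assumptions, not case figures).
for articles in (10_000, 50_000, 100_000):
    total = articles * DAMAGES_PER_ARTICLE
    print(f"{articles:>7,} articles -> ${total:,}")
```

So even the low end of "tens of thousands" of articles already implies $25M, and the range climbs quickly from there.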
If you're going to retrain your model because of this ruling, wouldn't it make sense to remove all DMCA-protected content from your training data, instead of just the content you were most recently sued over (especially if the ruling sets precedent)?
But all content is DMCA protected. Avoiding copyrighted content means having no content at all, since all material is automatically copyrighted. One would be limited to licensed content, which is another minefield.
The apparent loophole is the gap between copyrighted works and copyrighted works that are also registered. But registration can occur at any time, so there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.
In almost all cases before gen AI, scraping was found to be legal unless the bot accepted terms of service, in which case the bot was bound by the ToS. The biggest and clearest example is [1]. People have been scraping the internet for as long as the internet has existed.
It would make sense from a legal standpoint, but I don't think they could do that without massively regressing their models' performance, to the point that it would jeopardize their viability as a company.
> I guess I should have used the phrase "common sense stealing in any other context" to be more precise?
Clearly not common sense stealing. The Intercept was not deprived of their content. If OpenAI had sneaked into their office and server farm and taken all the hard drives and paper copies of the content, that would be "common sense stealing".
Very much common sense copyright violation though.
Copyright means you're not allowed to copy something without permission.
It's that simple. There is no "Yes but you still have your book" argument, because copyright is a claim on commercial value, not a claim on instantiation.
There's some minimal wiggle room for fair use, but clearly making an electronic copy and creating a condensed electronic version of the content - no matter how abstracted - and using it for profit is not fair use.
If the AI produces chunks of training set nearly verbatim when prompted, it looks like copying.
> And if so, why isn't someone learning from said work not considered copying in their brain?
Well, their brain, while learning, is not someone's published work product, for one thing. This should be obvious.
But their brain can violate copyright by producing work as the output of that learning, and be guilty of plagiarism, etc. If I memorise a passage of your copyrighted book when I am a child, and then write it in my book when I am an adult, I've infringed.
The fact that most jurisdictions don't consider the work of an AI to be copyrightable does not mean it cannot ever be infringing.
The output of a model can be copyright violation. In fact, even if the model was never trained on copyright content, if I provided copyright text then told the model to regurgitate it verbatim that would be a violation.
That does not make the model itself a copyright violation.
Yeah, good point. What's the difference between spidering content and training a model? It's almost like accessing pages of content the way a search engine does. What if the information is publicly available?
A product from a company is not a person. An LLM is not a brain.
If you transcode a CD to mp3 and build a business around selling these files without the author's permission, you'd be in big legal trouble.
Tech products that "accidentally" reproduce materials without the owners' permission (e.g. someone uploading La La Land to YouTube) have processes to remove them by simply filling out a form. Can you do that with ChatGPT?
It's legal for you to possess a single joint. It's not legal for you to possess a warehouse of 400 tons of weed.
The line between legal and not legal is sometimes based on scale; being able to ingest a single book and learn from it is not the same scale as ingesting the entire published works of mankind and learning from it.
> Are you describing what the law is or what you feel the law should be?
I am stating what is, right now.
I thought the weed example made that clear.
Let me clarify: the state of things, as they stand, is that the entire justice system, legislation and courts included, takes scale into account when looking at the line dividing "legal" from "illegal".
There is literally no defense of "If it is legal at qty x1, it is legal at any qty".
Excellent. Then the next question is: in which jurisdiction are you describing the law? And what are your sources? Not about the weed, I don't care about that. Particularly about "being able to ingest a single book and learn from it is not the same scale as ingesting the entire published works of mankind and learning from it".
The reason I'm asking is that you are drawing a parallel between criminal law and (I guess?) copyright infringement. The drug possession limits in many jurisdictions are explicitly written into the law. These are not some grand principle of law but the result of explicit legislative intent: the people writing the law wanted to punish drug peddlers without punishing end users (or wanted to punish them less severely, or differently). Are the copyright limits you are thinking about similarly written down? Do you have case references one can read?
I made it clear in both my responses that scale matters, and that there is precedent in law, in almost every country I can think of right now, for scale mattering.
I did not make the point that there is a written law specifically for copyright violations at scale (although many jurisdictions do have exemptions at small scale written into law).
I will try to clarify once again: there is no defence in law that because something is allowed at qty X1, it must be allowed at any qty.
This is the defence originally posted that I replied to, and it is not valid, because courts regularly consider the scale of an activity when determining the line between allowed and not allowed.
That might be the point. If your business model is built on reselling something you’ve built on stuff you’ve taken without payment or permission, maybe the business isn’t viable.
I wonder if they can say something like "we aren't scraping your protected content, we are merely scraping this old model we don't maintain anymore, and it happened to contain protected content from before the ruling." Then you've essentially won all of humanity's output, since you can already scrape the new primary information (scientific articles and other datasets designed for researchers to access freely), and whatever junk the content mills put out is just going to be a poor summarization of that primary information.
Another factor helping this combination of an old model plus new public-facing data be complete: other forms of media, like storytelling and music, have already converged onto certain prevailing patterns. For stories, we expect a certain style of plot development and complain when it's missing or not as we expect. For music, most of what people listen to is lyrics no one is deeply reading into, put over the same old chord progressions we've always had. For art, too few of us actually go out of our way to get familiar with novel work, versus the vast bulk of the world's present-day artistic effort, which goes toward product advertisement, which once again follows patterns people have been publishing in psychology journals for decades.
In a sense, we've already put out enough data and made enough of our world formulaic that I believe we've set up a perfect singularity in terms of what can be generated for the average person looking at a screen today. Because of that, I think even a complete lack of new training on such content wouldn't hurt OpenAI at all.
> I wonder if they can say something like “we aren’t scraping your protected content, we are merely scraping this old model we don’t maintain anymore and it happened to have protected content in it from before the ruling”
I'm not a lawyer, but I know enough to be pretty confident that that wouldn't work. The law is about intent. Coming up with "one weird trick" to work-around a potential court ruling is unlikely to impress a judge.
They might make it work by (1) having lots of public domain content, for the purpose of training their models on basic language use, and (2) preserving source/attribution metadata about what copyrighted content they do use, so that the models can surface this attribution to the user during inference. Even if the latter is not 100% foolproof, it might still be useful in most cases and show good faith intent.
The latter one is possible with RAG solutions like ChatGPT Search, which do already provide sources! :)
But for inference in general, I'm not sure it makes too much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc. Which is kind of too fundamental to be attributed to, IMO. (Attribution: Humanity)
But who knows. Maybe it can be done for more fact-like stuff.
On this point, I'm sure there is more than enough publicly and freely usable content to "learn how language works". There is no need to hoover up private or license-unclear content if that is your goal.
I would actually love it if that were true. It would reduce a lot of legal headaches, for sure. But if it were true, why were previous GPT versions not as good at understanding language? I can only conclude that it's because it isn't actually true: there's not enough digital public domain material to train an LLM to understand language competently.
Perhaps old texts in physical form, then? It'll cost a lot to digitize that, wouldn't it? And it wouldn't really be accessible to AI hobbyists. Unless the digitization is publicly funded or something.
(A big part of this is also how insanely long copyright lasts (nearly a hundred years!) that keeps most of the Internet's material from being public domain in the first place, but I won't belabour that point here.)
Edit:
Fair enough, I can see your point. "Surely it is cheaper to digitize old texts or buy a license to Google Books than to potentially lose a court case? Either OpenAI really likes risking it to save a bit of money, or they really wanted facts not contained in old texts."
And yeah, I guess that's true. I could say "but facts aren't copyrightable" (which was supported by the judge's decision from the TFA), but then that's a different debate about whether or not people should be able to own facts. Which does have some inroads (e.g. a right against being summarized because it removes the reason to read original news articles).
> Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc.
All of that and more, all at the same time.
Attribution at the inference level is bound to work more or less the same way humans attribute things during conversations: "As ${attribution} said, ${some quote}", or "I remember reading about it in ${attribution-1} - ${some statements}; ... or maybe it was in ${attribution-2}?...". Such attributions are often wrong, as people hallucinate^Wmisremember where they saw or heard something.
RAG obviously can work for this, as well as other solutions involving retrieving, finding or confirming sources. That's just like when a human actually looks up the source when citing something - and has similar caveats and costs.
Only half-serious, but: I wonder if they can dance with the publishers around this issue long enough for most of the contested text to become part of public court records, and then claim they're now training off that. <trollface>
Re-training can be done, but, and it is not a small but, models already exist and can be used locally, suggesting the milk has been spilled for too long at this point. Separately, neutering them effectively lowers their value relative to their non-neutered counterparts.
The onus is on the person collecting massive amounts of data and circumventing DMCA protections to ensure they're not doing anything illegal.
Saying "well, someone snuck in some DMCA-protected content" when sharing family photos doesn't suddenly make it legal to share that protected content along with your photos...
Eventually we're going to have embodied models capable of live learning and it'll be extremely apparent how absurd the ideas of the copyright extremists are. Because in their world, it'd be illegal for an intelligent robot to watch TV, read a book or browse the internet like a human can, because it could remember what it saw and potentially regurgitate it in future.
You have to understand, the media companies don't give a shit about the logic, in fact I'm sure a lot of the people pushing the litigation probably see the absurdity of it. This is a business turf war, the stated litigation is whatever excuse they can find to try and go on the offensive against someone they see as a potential threat. The pro copyright group (big media) sees the writing on the wall, that they're about to get dunked on by big tech, and they're thrashing and screaming because $$$.
If humanity ever gets to the point where intelligent robots are capable of watching TV like a human can, having to adjust copyright laws seems like the least of our problems. How about having to adjust almost every law related to basic "human" rights, ownership, being able to establish a contract, being responsible for crimes, and endless other things?
But for now your washing machine cannot own other things, and you owning a washing machine isn't considered slavery.
It's not copyright "extremism" to expect a level playing field. As long as humans have to adhere to copyright, so should AI companies. If you want to abolish copyright, by all means do, but don't give AI a special exemption.
It's actually the opposite of what you're saying. I can 100% legally do all the things that they're suing OpenAI for. Their whole argument is that the rules should be different when a machine does it than a human.
Only because it would be unconscionable to apply copyright to actual human brains, so we don't. But, for instance, you absolutely can commit copyright violation by reading something and then writing something very similar, which is one reason why reverse engineering commonly uses clean-room techniques. AI training is in no way a clean room.
Your ability to regurgitate a remembered copyrighted article does not make your brain a derivative work, because removing that specific article from the training set has an impact below the noise floor.
However reproducing the copyrighted material based on that is a violation because the created reproduction does critically depend on that copyrighted material.
(Gross simplification)
Similar to how you can watch and read a lot of Star Wars and then even ape Ralph McQuarrie's style in your own drawings: unless the result is unmistakably related to Star Wars there's no copyright infringement, but there is if someone looks at the result and goes "that's Star Wars, isn't it?"
Can you regurgitate billions of pieces of information to hundreds of thousands of other people in a way that competes with the source of that information?
If there was only one source for a piece of news ever, you might be able to make that argument in good faith, but when there are 20 outlets with competing versions of the same story it doesn't hold.
Go make a movie using the same plot as a Disney movie, that doesn't copy any of the text or images of the original, and see how far "not spitting out a copy" gets you in court.
AI's approach to copyright is very much "rules for thee but not for me".
That might get you pretty far in court, actually. You'd have to be pretty close in terms of the sequence of events, character names, etc. Especially considering how many Disney movies are based on pre-existing stories, if you were, to, say, make a movie featuring talking animals that more or less followed the plot of Hamlet, you would have a decent chance of prevailing in court, given the resources to fight their army of lawyers.
The same rules we already have: follow the license of whatever you use. If something doesn't have a license, don't use it. And if someone says "but we can't build AI that way!", too bad, go fix it for everyone first.
The problem is when a human-run company profits off its scrape... this isn't a non-profit run by volunteers, and it's a far cry from autonomous robots learning on their own.
We are discussing an emergent cause with social and ecological consequences. Servers are power-hungry things that may or may not run on a sustainable grid (which itself has a bazinga of problems, like heavy chemicals leaking during solar panel production, hydroelectric plants destroying their surroundings, etc.), and the current state of hardware production involves sweatshops and conflict minerals.
And that's leaving aside the violation of creators' copyright, which is written into the legal code of almost every existing country; no artist is making billions off the abuse of their creative rights (often they're pretty chill about getting their stuff mentioned, remixed, and whatever).
Leaving aside the hypothetical "live learning AGI" of the future (given that money is made or lost now), would a human regurgitating content that is not theirs - but presented as if it is - be acceptable to you?
I don't know about you but my friends don't tell me that Joe Schmoe of Reuters published a report that said XYZ copyright XXXX. They say "XYZ happened."
I have a friend who recites amazingly long pieces of literature by heart all day. He says he just wrote them. He also produces a vast number of paintings in all styles, claiming he is a really talented painter.
Reuters finds a new business model? What did horse and buggy drivers do, pivot to romance themed city tours? I'm sure media companies will figure something out.
So who and why will produce the news for your friend to steal? The horse and buggy metaphor is getting tiresome when its used as some sort signalling of "progress oriented minds" and creative destruction enthusiasts versus the luddites.
Someone who realizes that the raw information has no value in the age we're entering. Influencers have shown the way, brand and community engagement are the new differentiators.
This makes no sense. How can you source objective facts about the world by further devaluing and undermining those whose job it is to do it?
Influencers are parasites that have been made possible by broken, user-hostile platforms.
You are advocating for a deranged, dangerous world, where demagogues rule over large masses of idiots that can't tell the difference between AI junk and reality.
The problem is, we can't come up with a solution where both parties are happy, because in the end consumers choose one (getting information from news agencies) or the other (getting information from ChatGPT). So both are fighting for their lives.
Exactly. Also core to the copyright extremists’ delusional train of thought is the fact that they don’t seem to understand (or admit) that ingesting, creating a model, and then outputting based on that model is exactly what people do when they observe others’ works and are inspired to create.
And if I rip a blu ray to my hard drive and then give the hard drive to my friend so he can output a movie is just the same as if I had told him my recollections of the movie from my brain. Both are claims you can make without anything to back them up.
I understand that regulations exist and that there can be copyright violations, but shouldn't we be concerned that more lenient governments (mainly China) that are opposed to the US will use this to get ahead if OpenAI is significantly set back?
No. OpenAI is suspected to be worth over $150B. They can absolutely afford to pay people for data.
Edit: People commenting need to understand that $150B is the discounted value of future revenues. So... yes they can pay out... yes they will be worth less... and yes that's fair to the people who created the information.
I can't believe there are so many apologists on HN for what amounts to vacuuming up peoples data for financial gain.
The OpenAI that is assumed to keep being able to harvest every form of IP without compensation is valued at $150B, an OpenAI that has to pay for data would be worth significantly less. They're currently not even expecting to turn a profit until 2029, and that's without paying for data.
OpenAI is not profitable, and to achieve what they have achieved they had to scrape basically the entire internet. I don't have a hard time believing that OpenAI could not exist if they had to respect copyright.
I'm not convinced that the economic harm to content creators is greater than the productivity gains and accessibility of knowledge for users (relative to how competent it would be if trained just on public domain text). Personally, I derive immense value from ChatGPT / Claude. It's borderline life changing for me.
As time goes on, I imagine it'll increasingly be the case that these LLMs displace people from their jobs and careers. I don't know whether the harm done will be greater than the benefit to society. I'm sure the answer will depend on who you ask.
> That's a good thing! If a company cannot raise to fame unless they violate laws, it should not have been there.
Obviously given what I wrote above, I'd consider it a bad thing if LLM tech severely regressed due to copyright law. Laws are not inherently good or bad. I think you can make a good argument that this tech will be a net negative for society, but I don't think it's valid to do so just on the basis that it is breaking the law as it is today.
> I'm not convinced that the economic harm to content creators is greater than the productivity gains and accessibility of knowledge for users (relative to how competent it would be if trained just on public domain text).
Good thing whether or not something is a copyright violation doesn't depend on if you can make more money with someone else's work than they can.
I understand the anger about large tech companies using others' work without compensation, especially when both they and their users benefit financially. But this goes beyond economics. LLM tech could accelerate advances in medicine and technology. I strongly believe we're going to see societal benefits in education, healthcare, and especially mental health support thanks to this tech.
I also think that someone making money off LLMs is a separate question from whether or not the original creator has been harmed. I think many creators are going to benefit from better tools, and we'll likely see new forms of creation become viable.
We already recognize that certain uses of intellectual property should be permitted for society's benefit. We have the fair use doctrine, compulsory patent licensing for public health, research exemptions, and public libraries. Transformative use is also permitted, and LLMs are inherently transformative. Look at the volume of data they ingest compared to the final size of a trained model, and how fundamentally different the output format is from the input data.
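The data-volume-vs-model-size point can be made concrete with a rough order-of-magnitude calculation. Both figures below are illustrative assumptions (a corpus size and parameter count in the range commonly reported for large models), not numbers from OpenAI:

```python
# Rough ratio of training-corpus size to trained-model size.
# All figures are order-of-magnitude assumptions for illustration.
corpus_tb = 45            # assume ~45 TB of raw scraped text
model_params = 175e9      # assume a 175-billion-parameter model
bytes_per_param = 2       # fp16 weights

model_tb = model_params * bytes_per_param / 1e12
print(f"model  ~ {model_tb:.2f} TB")
print(f"corpus ~ {corpus_tb} TB")
print(f"ratio  ~ {corpus_tb / model_tb:.0f}:1")
```

Under those assumptions the model weighs in around 0.35 TB against 45 TB of input, roughly a 130:1 compression, which is the sense in which nothing close to the original works can be stored verbatim.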
Human progress has always built upon existing knowledge. Consider how both Darwin and Wallace independently developed evolution theory at roughly the same time -- not from isolation, but from building on the intellectual foundation of their era. Everything in human culture builds on what came before.
That all being said, I'm also sure this tech is going to negatively impact people. Like I said in the other reply, whether this tech is good or bad will depend on who you ask. I just think we should weigh these costs against the potential benefits to society as a whole, rather than simply preserving existing systems or blindly following the law as if the law were inherently just or good. Copyright law was made before this tech was even imagined, and it seems fair to now evaluate whether the current copyright regime makes sense if it turns out it would keep us in some local maximum.
That's not real money though. You need actual cash on hand to pay for things; OpenAI only has the money it's been given by investors. I suspect many of those investors wouldn't have been so keen if they had known OpenAI would need an additional couple of billion a year to pay for data.
That doesn’t mean they have $150B to hand over. What you can cite is the $10 billion they got from Microsoft.
I’m sure they could use a chunk of that to buy competitive I.P. for both companies to use for training. They can also pay experts to create it. They could even sell that to others for use in smaller models to finance creating or buying even more I.P. for their models.
We can, and do, choose to treat normal people different from billion dollar companies that are attempting to suck up all human output and turn it into their own personal profit.
If they were, say, a charity doing this for the good of mankind, I’d have more sympathy. Shame they never were.
The way to treat them differently is not by making them share profits with another corporation. The logical endgame of all this isn't "stopping LLMs," it's Disney happening to own a critical mass of IP to be able to legally train and run LLMs that make movies, firing all their employees, and no smaller company ever having a chance in hell of competing with a literal century's worth of IP powering a generative model.
The best part about all this is that Disney initially took off by… making use of public domain works. Copyright used to last 14 years. You'd be able to create derivative works of most of the art in your life at some point. Now you're never allowed to. And more often than not, the monopoly is granted not to the "author" but to the corporation that hired them. The correct analysis shouldn't be OpenAI vs. The Intercept or Disney or whomever. You're just choosing kings at that point.
People do get sued for making songs that are too similar to previously made songs. One defence available is that they've never heard it themselves before.
If you want to treat AI like humans then if AI output is similar enough to copyrighted material it should get sued. Then you try to prove that it didn't ingest the original version somehow.
I feel like at some point the people in favor of this are going to realize that whether the data was ingested into a training set is completely immaterial to the fact that these companies downloaded data they don't have a license to use onto a company server somewhere, with the intention of using it commercially.
Ah yes, humans and LLMs are exactly the same, learning the same way, reasoning the same way, they're practically indistinguishable. So that's why it makes sense to equate humans reading books with computer programs ingesting and processing the equivalent of billions of books in literal days or months.
“A person is fundamentally different from an LLM” does not need a legal argument and is implied by the fact that LLMs do not have human rights, or even anything comparable to animal rights.
A legal argument would be needed to argue the other way. This argument would imply granting LLMs some degree of human rights, which the very industry profiting from these copyright violations will never let happen for obvious reasons.
The other problem with the legal argument that it's "just like a person learning" is that corporations whose human employees have learned what copyrighted characters look like and then start incorporating them into their art are considered guilty of copyright violation, and don't get to deploy the "it's not an intentional copyright violation from someone who should have known better, it's just a tool outputting what the user requested" defence...
Also, it is only a matter of time until one of those employees (thanks to free will and agency) blows the whistle; it doesn’t scale; etc.
Frankly, the fact that such a big segment of the HN crowd unthinkingly buys big tech’s double standard (LLMs are human where copyright is concerned, but not human in every other sense) makes me ashamed of the industry.
Not sure about US or other jurisdictions, but that's not how any of this works in Germany. In Germany downloading anything from anywhere (even a movie) is never illegal and does not require a license. What's illegal is publishing/disseminating copyrighted content without authorization. BitTorrenting a movie is illegal because you're distributing it to other torrenters. Streaming a movie on your website is illegal because it's public. You can be held liable for using a photo from the web to illustrate your eBay auction, not because you downloaded it but because you republished it.
OpenAI (and Google and everyone else) is creating a publicly-accessible system that produces output that could be derived from copyrighted material.
I think it works like that in Canada and some other places too, because they pay an extra tax on storage media when they buy it, which essentially grants a blanket license for any copyrighted material that might be stored on that media.
>convert each word into a color, and create a weird piece of art work out of it? I think I can.
I agree, but the original author might get butthurt if you distribute it. Realistically copyright law in the US is a mess when it comes to weird pieces of art.
It is disingenuous to imply the scale of someone buying books and reading them (for which the publisher and author are compensated) or borrowing them from the library and reading them (again, for which the publisher and author are compensated) is the same as the wholesale copying, without permission or payment, of anything not behind a paywall on the Internet.
Isn't it a greater risk that creators lose their income and nobody is creating the content anymore?
Take, for instance, what has happened to news because of the internet. Not exactly the same, but similar forces at work. It turned into a race to the bottom, with everyone trying to generate content as cheaply as possible to maximize engagement, while tech companies siphoned off revenue. Expensive, investigative pieces from educated journalists disappeared in favor of stuff that looks like spam. Pre-internet news was higher quality.
Imagine that same effect happening to all content: art, writing, academic pieces. It’s a real risk that OpenAI has peaked in quality.
Lots of people create without getting paid to do it. A lot of music and art is unprofitable. In fact, you could argue that when the mainstream media companies got completely captured by suits with no interest in the things their companies invested in, that was when creativity died and we got consigned to genre-box superhero pop hell.
I don’t know. When I look at news from before, there never was investigative journalism. It was all opinion-swaying editorials, until alternate voices voiced their counternarratives. It’s just not in newspapers, because they are too politically biased to produce the two sides of stories that we’ve always asked them for. It’s on other media.
But investigative journalism has not disappeared. If anything, it has grown.
This type of argument is ignorant, cowardly, shortsighted, and regressive. Both technology and society will progress when we find a formula that is sustainable and incentivizes everyone involved to maximize their contributions without it all blowing up in our faces someday. Copyright law is far from perfect, but it protects artists who want to try and make a living from their work, and it incentivizes creativity that places without such protections usually end up just imitating.
When we find that sustainable framework for AI, China or <insert-boogeyman-here> will just end up imitating it. Idk what harms you're imagining might come from that ("get ahead" is too vague to mean anything), but I just want to point out that that isn't how you become a leader in anything. Even worse, if they are the ones who find that formula first while we take shortcuts to "get ahead", then we will be the ones doing the imitation in the end.
It's hysterical to compare training an ML model with slave labour. It's perfectly fine and accepted for a human to read and learn from content online without paying anything to the author when that content has been made available online for free, it's absurd to assert that it somehow becomes a human rights violation when the learning is done by a non-biological brain instead.
> It's hysterical to compare training an ML model with slave labour.
Nobody did that.
> It's perfectly fine and accepted for a human to read and learn from content online without paying anything to the author when that content has been made available online for free, it's absurd to assert that it somehow becomes a human rights violation when the learning is done by a non-biological brain instead.
It makes sense. There is always scale to consider in these things.
worble literally did make that comparison. It is possible for comparisons to be made using other rhetorical devices than just saying "I am comparing a to b".
No, their mention of "slave labor" is not a comparison to how LLMs work, nor an assertion of moral equivalence.
Instead it is just one example to demonstrate that chasing economic/geopolitical competitiveness is not a carte blanche to adopt practices that might be immoral or unjust.
Absolutely: if copyright is slowing down innovation, we should abolish copyright.
Not just turn a blind eye when it's the right people doing it. They don't even have a legal exemption passed by Congress - they're just straight-up breaking the law and getting away with it. Which is how America works, I suppose.
Exactly. They rushed to violate copyright on a massive scale quickly, and now are making the argument that it shouldn't apply to them and they couldn't possibly operate in compliance with it. As long as humans don't get to ignore copyright, AI shouldn't either.
> Should I be paying a proportion of my salary to all the copyright holders of the books, song, TV shows and movies I consumed during my life?
you already are.
a proportion of what you pay for books, music, tv shows, movies goes to rights holders already.
any subscription to spotify/apple music/netflix/hbo; any book/LP/CD/DVD/VHS; any purchased digital download … a portion of that sales is paid back to rights holders.
so… i’m not entirely sure what your comment is trying to argue for.
are you arguing that you should get paid a rebate for your salary that’s already been spent on copyright payments to rights holders?
> If a Hollywood writer says she "learnt a lot about writing by watching the Simpsons" will Fox have an additional claim on her earnings?
no. that’s not how copyright functions.
the actual episodes of the simpsons are the copyrighted work.
broadcasting/allowing purchases of those episodes incurs the copyright, as it involves COPYING the material itself.
COPYright is about the rights of the rights holder when their work is COPIED, where a “work” is the material which the copyright applies to.
merely mentioning the existence of a tv show involves zero copying of a registered work.
being inspired by another TV show to go off and write your own tv show involves zero copying of the work.
a hollywood writer rebroadcasting a simpsons during a TV interview would be a different matter. same with the hollywood writer just taking scenes from a simpsons episode and putting it into their film. that’s COPYing the material.
—-
when it comes to open AI, obviously this is a legal gray area until courts start ruling.
but the accusations are that OpenAi COPIED the intercept’s works by downloading them.
openAi transferred the work to openAi servers. they made a copy. and now openAi are profiting from that copy of the work that they took, without any permission or remuneration for the rights holder of the copyrighted work.
essentially, openAI did what you’re claiming is the status quo for you… but it’s not the status quo for you.
so yeah, your comment confuses me. hopefully you’re being sarcastic and it’s just gone completely over my head.
The problem is the anti-AI people who complain about AI are going for several steps in the chain (and often they are vague about which ones they are talking about at any point).
As well as the "copying" of content, some are also claiming that the output of an LLM should result in royalties paid back to the owners of the material used in training.
So if an AI produces a sitcom script, then the copyright holders of the TV shows it ingested should get paid royalties. In addition to the money paid to copy files around.
Which leads to the precedent that if a writer creates a sitcom then the copyright holders of sitcoms she watched should get paid for "training" her.
> If you copied an art piece using photoshop, you would've violated copyright. Photoshop (and adobe) itself never committed copyright violations.
the COPYing is happening on your local machine with non-cloud versions of Photoshop.
you are making a copy, using a tool, and then distributing that copy.
in music royalty terms, the making a copy is the Mechanical right, while distributing the copy is the Performing right.
and you are liable in this case.
> Somehow, if you swap photoshop with openAI and chatGPT, then people claim that the actual application itself is a copyright violation
OpenAI make a copy of the original works to create training data.
when the original works are reproduced verbatim (memorisation in LLMs is a thing), then that is the copyrighted work being distributed.
mechanical and performing rights, again.
but the twist is that ChatGPT does the copying on their servers and delivers it to your device.
they are creating a new copy and distributing that copy.
which makes them liable.
—
you are right that “ChatGPT” is just a tool.
however, the interesting legal grey area with this is — are ChatGPT model weights an encoded copy of the copyrighted works?
that’s where the conversation about the tool itself being a copyright violation comes in.
photoshop provides no mechanism to recite The Art Of War out of the box. an LLM could be trained to do so (like, it’s a hypothetical example but hopefully you get the point).
> OpenAI make a copy of the original works to create training data.
if a user is allowed to download said copy to view on their browser, why isn't that same right given to openAI to download a copy to view for them? What openAI chooses to do with the viewed information is up to them - such as distilling summary statistics, or whatever.
> are ChatGPT model weights an encoded copy of the copyrighted works?
that is indeed the most interesting legal gray area. I personally believe that it is not. The information distilled from those works does not constitute any copyrightable information, as it is not literary, but informational.
It's irrelevant that you could recover the original works from these weights - you could recover the same original works from the digits of pi!
> if a user is allowed to download said copy to view on their browser, why isn't that same right given to openAI to download a copy to view for them?
whether you can download a copy from your browser doesn’t matter. whether the work is registered as copyrighted does (and following on from that, who is distributing the work - aka allowing you to download the copy - and for what purposes).
from the article (on phone cba to grab a quote) it makes clear that the Intercept’s works were not registered as copyrighted works with whatever the name of the US copyright office was.
ergo, those works are not copyrighted and, yes, they essentially are public domain and no remuneration is required …
(they cannot remove DMCA attribution information when distributing copies of the works though, which is what the case is now about.)
but for all the other registered works that OpenAI has downloaded, creating their copy, used in training data, which the model then reproduces as a memorised copy — that is copyright infringement.
like, in case it’s not clear, i’ve been responding to what people are saying about copyright specifically. not this specific case.
> The information distilled from those works do not constitute any copyrightable information, as it is not literary, but informational.
that’s one argument.
my argument would be it is a form of compression/decompression when the model weights result in memorised (read: overfitted) training data being regurgitated verbatim.
put the specific prompt in, you get the decompressed copy out the other end.
it’s like a zip file you download with a new album of music. except, in this case, instead of double-clicking on the file, you have to type in a prompt to get the decompressed audio files (or text, in the LLM case).
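to make the zip-file analogy concrete, here’s a deliberately toy sketch (my own illustration, not how a transformer actually stores text): a “model” that has memorised one work behaves like an archive keyed by the right prompt — supply the exact prompt, and the verbatim original decompresses back out. The prompt string and quoted text are made up for the example.

```python
import zlib

# Toy illustration of the compression analogy: memorised (overfitted)
# training data stored in compressed form, keyed by a specific prompt.
memorised = {
    "recite the poem": zlib.compress(b"Shall I compare thee to a summer's day?"),
}

def generate(prompt: str) -> bytes:
    # The exact prompt acts like a filename inside a zip archive:
    # supply it, and the verbatim original is "decompressed" back out.
    # Anything else yields ordinary (non-memorised) output.
    blob = memorised.get(prompt)
    return zlib.decompress(blob) if blob else b"(novel output)"

print(generate("recite the poem"))  # prints the memorised text, verbatim
```

the point of the sketch is only that, from the outside, “prompt in, verbatim copyrighted text out” is functionally indistinguishable from decompression, which is the argument above.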
> It's irrelevant that you could recover the original works from these weights - you could recover the same original works from the digits of pi!
actually, that’s the whole point of courts ruling on this.
the boundaries of what is considered reproduction are in question. it is up to the courts to decide on the red lines (probably blurry gray areas for a while).
if i specifically ask a model to reproduce an exact song… is that different to the model doing it accidentally?
i don’t think so. but a court might see it differently.
as someone who worked in music copyright, is a musician, and sees the effects of people stealing musicians’ efforts all the time, i hope the little guys come out of this on top.
i’ve been avoiding replying to your comment for a bit, and now i realised why.
edit: i am so sorry about the wall of text.
> some are also claiming that the output of a LLM should result in paying royalties back to the owning of the material used in training.
> So if an AI produces a sitcom script then the copyright holders of those tv shows it ingested should get paid royalties. In additional to the money paid to copy files around.
what you’re talking about here is the concept of “derivative works” made from other, source works.
this is subtly different to reproduction of a work.
see the last half of this comment for my thoughts on what the interesting thing courts need to work out regarding verbatim reproduction https://news.ycombinator.com/item?id=42282003
in the derivative works case, it’s slightly different.
sampling in music is the best example i’ve got for this.
if i take four popular songs, cut 10 seconds of each, and then join each of the bits together to create a new track — that is a new, derivative work.
but i have not sufficiently modified the source works. they are clearly recognisable. i am just using copyrighted material in a really obvious way. the core of my “new” work is actually just four reproductions of the work of other people.
in that case — that derivative work, under music copyright law, requires the original copyright rights holders to be paid for all usage and copying of their works.
basically, a royalty split gets agreed, or there’s a court case. and then there’s a royalty split anyway (probably some damages too).
in my case, when i make music with samples, i make sure i mangle and process those samples until the source work is no longer recognisable. i’ve legit made it part of my workflow.
it’s no longer the original copyrighted work. it’s something completely new and fully unrecognisable.
the issue with LLMs, not just ChatGpt, is that they will reproduce both verbatim and recognisably similar output to original source works.
the original source copyrighted work is clearly recognisable, even if not an exact verbatim copy.
and that’s what you’ve probably seen folks talking about, at least it sounds like it to me.
> Which leads to the precedent that if a writer creates a sitcom then the copyright holders of sitcoms she watched should get paid for "training" her.
yes, there is already some very limited precedent, at least for a narrow specific case involving sheet music in the US.
the TL;DR IANAL version of the question at hand in the case was “did the defendants write the song with the intention of replicating a hook from the plaintiff’s work”.
the jury decided, yes they did.
this is different to your example in that they specifically went out to replicate that specific musical component of a song.
in your example, you’re talking about someone having “watched” a thing one time and then having to pay royalties to those people as a result.
again, the red line of infringement / not infringement is ultimately up to the courts to rule on.
—
anyway, this is very different to what openAi/chatGpt is doing.
openAi takes the works. chatgpt edits them according to user requests (feed forward through the model). then the output is distributed to the user. and that output could be considered to be a derivative work (see massive amount of text i wrote above, i’m sorry).
LLMs aren’t sitting there going “i feel like recreating a marvin gaye song”. it takes data, encodes/decodes it, then produces an output. it is a mechanical process, not a creative one. there’s no ideas here. no inspiration or expression.
an LLM is not a human being. it is a tool, which creates outputs that are often strikingly similar to source copyrighted works.
their users might be specifically asking to replicate songs though. in which case, openAi could be facilitating copyright infringement (whether through derivative works or not).
and that’s an interesting legal question by itself. are they facilitating the production of derivative works through the copying of copyrighted source works?
i would say they are. and, in some cases, the derivative works are obviously derived.
borrowing a book is not creating a COPY of the book. you are not taking the pages, reproducing all of the text on those pages, and then giving that reproduction to your friend.
that is what a COPY is. borrowing the book is not a COPY. you’re just giving them the thing you already bought. it is a transfer of possession, albeit temporary, not a copy.
if you were copying the files from a digitally downloaded album of music and giving those new copies to your friend (music royalties were my specialty) then technically you would be in breach of copyright. you have copied the works.
but because it’s such a small scale (an individual with another individual) it’s not going to be financially worth it to take the case to court.
so copyright holders just cut their losses with one friend sharing it with another friend, and focus on other infringements instead.
which is where the whole torrenting thing comes in. if i can track 7000 people who have all downloaded the same torrented album, now i can just send a letter / court date to those 7000 people.
the costs of enforcement are reduced because of scale. 7000 people, all found the same thing, in a way that can be tracked.
and the ultimate: one person/company has downloaded the works and is making them available to others to download, without paying for the rights to make copies when distributing.
that’s the ultimate goldmine for copyright infringement lawsuits. and it sounds suspiciously like openAi’s business model.
>borrowing a book is not creating a COPY of the book. you are not taking the pages, reproducing all of the text on those pages, and then giving that reproduction to your friend.
That's not what's happening with training AI models either though.
Learning from copyrighted work requires a license to access that work. You can extract information from the world's best books by purchasing those books. But no author is being compensated here: they download books.torrent, then use that pirated material to profit.
If you’re arguing that OpenAI should be compelled to make all their technology and models free then I think we all agree, but it sounds like you’re trying to weasel your way into letting a corpo get away with breaking the law while running away with billions.
Because the world shouldn't be run primarily by economically minded politicians??
I'm sure China gets competitive advantages from their use of indentured and slave-like labor forces, and mass reeducation programs in camps. Should the US allow these things to happen? What about if a private business starts?
But remember, they're just trying to compete with China on a fair playing field, so everything is permitted right?
You might want to look at the constitutional amendment enshrining slave labor "as a punishment for a crime," and the world's largest prison population. Much of your food supply has links to prison labor.
But don't worry, it's not considered "slave labor" because there's a nominal wage of a few pennies involved and it's not technically "forced." You just might be tortured with solitary confinement if you don't do it.
We need to point fewer fingers and clean up the problems here.
Am I even more concerned about the state having control over the future corpus of knowledge via this doomed-in-any-case vector of "intellectual property"? Yes.
I think it will be easier to overcome the influence of billionaires when we drop the pretext that the state is a more primal force than the internet.
100% disagree. "It'll be fine bro" is not a substitute for having a vote over policy decisions made by the government. What you're talking about has a name. It starts with F and was very popular in Italy in the early to mid 20th century.
Rapidity of Godwin's law notwithstanding, I'm not disputing the importance of equity in decision-making. But this matter is more complex than that: it's obvious that the internet doesn't tolerate censorship even if it is dressed as intellectual property. I prefer an open and democratic internet to one policed by childish legacy states, the presence of which serves only (and only sometimes) to drive content into open secrecy.
It seems particularly unfair to equate any questioning of the wisdom of copyright laws (even when applied in situations where we might not care for the defendant, as with this case) with fascism.
It's not Godwin's law when it's correct. Just because it's cool and on the Internet doesn't mean you get to throw out people's stake in how their lives are run.
> throw out people's stake in how their lives are run
FWIW, you're talking to a professional musician. Ostensibly, the IP complex is designed to protect me. I cannot fathom how you can regard it as the "people's stake in how their lives are run". Eliminating copyright will almost certainly give people more control over their digital lives, not less.
> It's not Godwin's law when it's correct.
Just to be clear, you are doubling down on the claim that sunsetting copyright laws is tantamount to nazism?
Get ahead in terms of what? Do you believe that the material in public domain or legally available content that doesn't violate copyrights is not enough to research AI/LLMs or is the concern about purely commercial interests?
China also supposedly has abusive labor practices. So, should other countries start relaxing their labor laws to avoid falling behind ?
If we presume it's illegal to train on copyrighted works, but a website summarizing the article, like Wikipedia, is perfectly legal, then what would happen if we got LLM A to summarize the article and used that summary to train LLM B?
If it is illegal to train on copyrighted work, it will also benefit actors that are free to ignore laws, like Chinese public private companies. It means Western companies will lose in the AI race.
Then we don't respect their copyrights? Why is this some sort of unsolvable problem and the only solution is to allow mega corporations to sell us AI that is trained on the work of artists without their consent?
The claim that's being allowed to proceed is under 17 USC 1202, which is about stripping metadata like the title and author. Not exactly "core copyright violation". Am I missing something?
The plaintiffs focused on exactly this part - removal of metadata - probably because it's the most likely to hold in courts. One judge remarked on it pretty explicitly, saying that it's just a proxy topic for the real issue of the usage of copyrighted material in model training.
I.e., it's some legalese trick, but "everyone knows" what's really at stake.
Yeah; I think that's essentially where the disconnect is rooted for me. It seems to me (a non-lawyer, to be clear) that it's damn hard to make the case for model training necessarily being meat-and-potatoes "infringement" as things are defined in Title 17 Chapter 1. I see it as firmly in the grey area between "a mere change of physical medium or deterministic mathematical transformation clearly isn't a defense against infringement on its own" and "giant toke come on, man, Terry Brooks was obviously just ripping off Tolkien". There might be a tension between what constitutes "substantial similarity" through analog and digital lenses, especially as the question pertains to those who actually distribute weights.
I think you're at the heart of it, and you've humorously framed the grey area here and it's very weird. Sans a ruling that, for example, computers are too deterministic to be creative, copyright laws really seem to imply that LLM training is legal. Learning and then creating something new from what you learned isn't copyright infringement, so what's the legal argument here? A ruling declaring this copyright infringement is likely going to have crazy ripple effects going way beyond LLMs, something a good judge is going to be very mindful of.
Ultimately, this is probably going to require congress to create new laws to codify this.
> Learning and then creating something new from what you learned isn't copyright infringement, so what's the legal argument here?
The legal argument is that copying, or creating what would otherwise be derivative works, solely within a human brain is exempt, because the human brain is not a medium in which a configuration of information constitutes either a copy or a new work until it is set down in another medium or performed publicly, whereas the storage of a computer absolutely is such a medium (both of which are well-established law). So the “learning” metaphor is not legally valid, even if it is arguably a decent metaphor for some other purpose. Furthermore, learning and then creating something new is often illegal, if the “something new” has sufficient proximity to the source material (that's the prohibition on unlicensed derivative works). GenAI systems often do exactly that, and are (so the argument goes) frequently enough used for that purpose, knowingly on the part of the service and model providers, that even if the training itself were not a violation, the standards for contributory infringement are met in the provision of certain models and/or services.
According to US law, is the Internet Archive a library? I know they received a DMCA exemption.
If so, you could argue that your local library returns perfect copies of copyrighted works too. IMO it's somehow different from a business turning the results of their scraping into a profit machine.
My understanding is that there is no concept of a library license and that you just say you're a library and therefore become one, and whether your claim survives is more a product of social cultural acceptance than actual legal structures but someone is welcome to correct me.
The internet archive also scrapes the web for content, does not pay authors, the difference being that it spits out literal copies of the content it scraped, whereas an LLM fundamentally attempts to derive a new thing from the knowledge it obtains.
I just can't figure out how to plug this into copyright law. It feels like a new thing.
Violations of 17 USC 1202 can be punished pretty severely. It's not about just money, either.
If, during the trial, the judge thinks that OpenAI is going to be found to be in violation, he can order all of OpenAI's computer equipment to be impounded. If OpenAI is found to be in violation, he can then order permanent destruction of the models, and OpenAI would have to start over from scratch in a manner that doesn't violate the law.
Whether you call that "core" or not, OpenAI cannot afford to lose these parts that are left of this lawsuit.
“ If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.”
That is exactly why I suggested companies train some models on public domain and licensed data. That risk disappears or is very minimal. They could also be used for code and synthetic data generation without legal issues on the outputs.
That's what Adobe and Getty Images are doing with their image generation models, both are exclusively using their own licensed stock image libraries so they (and their users) are on pretty safe ground.
The problem is that you don't get the same quality of data if you go about it that way. I love ChatGPT and I understand that we're figuring out this new media landscape but I really hope it doesn't turn out to neuter the models. The models are really well done.
If I steal money, I can get way more done than I do now by earning it legally. Yet you won’t see me regularly dismissing legitimate jobs by posting comparisons to what my numbers would look like if I were stealing IP.
We must start with moral and legal behavior. Within that, we look at what opportunities we have. Then, we pick the best ones. Those we can’t have are a side effect of the tradeoffs we’ve made (or tolerated) in our system.
People have been complaining about the DMCA for 2+ decades now. I guess it's great if you are on the winning side. But boy does it suck to be on the losing side.
And normal people can't get on the winning side. I'm trying to get Github to DMCA my own repositories, since it blocked my account and therefore I decided it no longer has the right to host them. Same with Stack Exchange.
GitHub's ignored me so far, and Stack Exchange explicitly said no (then I sent them an even broader legal request under GDPR)
When you uploaded your code to GitHub you granted them a license to host it. You can’t use DMCA against someone who’s operating within the parameters of the license you granted them.
GitHub's terms of service specify that the license is granted as necessary to provide the service. Since the service is not being provided, they don't have a license.
Hosting the code is providing the service, whether you have a working account or not.
Also was this code open source? Your stack exchange contributions were open source, so they don't need any ToS-based permission in the first place. They have access under CC BY-SA.
Some, not all. GitHub is unlikely to continue hosting the code on the basis that it's open source. If they do, I'll send them a GDPR request to detach my name from it, including in source code comments and package names.
It's not always clear that Stack Exchange always followed the CC license, and if they violated it once, it was terminated. The checkbox you have to click now to access the data dumps might be a violation. The data dumps don't come with copies of the licenses, so that's a violation.
Also, is there really any benefit to stripping author metadata? Was it basically a preprocessing step?
It seems to me that it shouldn't really affect model quality all that much, should it?
Also, in the amended complaint:
> not to notify ChatGPT users when the responses they received were protected by journalists’ copyrights
Wasn't it already quite clear that as long as the articles weren't replicated, it wasn't protected? Or is that still being fought in this case?
In the decision:
> I agree with Defendants. Plaintiffs allege that ChatGPT has been trained on "a scrape of most of the internet," Compl. ¶ 29, which includes massive amounts of information from innumerable sources on almost any given subject. Plaintiffs have nowhere alleged that the information in their articles is copyrighted, nor could they do so. When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer. Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs' articles seems remote. And while Plaintiffs provide third-party statistics indicating that an earlier version of ChatGPT generated responses containing significant amounts of plagiarized content, Compl. ¶ 5, Plaintiffs have not plausibly alleged that there is a "substantial risk" that the current version of ChatGPT will generate a response plagiarizing one of Plaintiffs' articles.
Isn't this the same thing Google has been doing for years with their search engine? Only difference is Google keeps the data internal, whereas openai spits it out to you. But it's still scraped and stored in both cases.
A component of fair use is to what degree the derivative work displaces the original. Google's argument has always been that they direct traffic to the original, whereas AI summaries (which Google of course is just as guilty of as openai) completely obsolete the original publication. The argument now is that the derivative work (the LLM model) is transformative, i.e., different enough that it doesn't economically compete with the original. I think it's a losing argument, but we'll see what the courts arrive at.
Is this specific to AI or specific to summaries in general? Do summaries, like the ones found in Wikipedia or Cliffs Notes, not have the same effect of making it such that people no longer have to view the original work as much?
Note: do you mean the model is transformative, or the summaries are transformative? I think your comment holds up either way but I think it's better to be clear which one you mean.
In my opinion (not a lawyer), Google at least references where it obtained the data and did not regurgitate it as if it had created something new. An LLM is obfuscated plagiarism. Some claim derivative works, but I have always seen that as quite a stretch. People here expect me to cite references, yet LLMs somehow escape this level of scrutiny.
the very idea of "this digital asset is exclusively mine" cannot die soon enough
let real physically tangible assets keep the exclusivity problem
let's not undo the advantages unlocked by the digital internet; let us prevent a few from locking down this grand boon of digital abundance such that the problem becomes saturation of data
This is, in fact, the core value of the hacker ethos. HackerNews.
> The belief that information-sharing is a powerful positive good, and that it is an ethical duty of hackers to share their expertise by writing open-source code and facilitating access to information and to computing resources wherever possible.
> Most hackers subscribe to the hacker ethic in sense 1, and many act on it by writing and giving away open-source software. A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.
If the Internet didn't kill copyright, perhaps AI will. (Hyperbole)
(Personally my belief is more nuanced than this; I'm fine with very limited copyright, but my belief is closer to yours than the current system we have.)
If I share my texts/sounds/images for free, harvesting and regurgitating them omits the requested attribution. Even the most permissive CC license (excluding CC0 public domain) still requires an attribution.
> A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.
In this view, the ideal world is one where copyright is abolished (but not moral rights). So piracy is good, and datasets are also good.
Asking creators to license their work freely is simply a compromise due to copyright unfortunately still existing. (Note that even if creators don't license their work freely, this view still permits you to pirate or mod it against their wishes.)
(My view is not this extreme, but my point is that this view was, and hopefully is, still common amongst hackers.)
I will ignore the moralizing words (eg "ruthless", "harvested" to mean "copied"). It's not productive to the conversation.
Moral rights involve the attribution of works where reasonable and practical. Clearly doing so during inference is not reasonable or practical (you'll have to attribute all of humanity!) but attributing individual sources is possible and is already being done in cases like ChatGPT Search.
So I don't think you actually mean moral rights, since it's not being ignored here.
But the first sentence of your comment still stands regardless of what you meant by moral rights. To that, well... we're still commenting here, are we not? Despite it with almost 100% certainty being used to train AI. We're still here.
And yes, funding is a thing, which I agree needs copyright for the most part unfortunately. But does training AI on, for example, a book really reduce the need to buy the book, if it is not reproduced?
Remember, training is not just about facts, but about learning how humans talk, how languages work, how books work, etc. Learning that won't reduce the book's economical value.
And yes, summaries may reduce the value. But summaries already exist. Wikipedia, Cliff's Notes. I think the main defense is that you can't copyright facts.
Look, either actually read the link and refute the points within, or don't. But there's no use discussing anything if you're unwilling to even understand and seriously refute a single point being made here, other than repeating "mine, mine, mine".
In the process, [OpenAI] trained ChatGPT not to acknowledge or respect copyright, not to notify ChatGPT users when the responses they received were protected by journalists’ copyrights, and not to provide attribution when using the works of human journalists
I won't necessarily argue against that moral view, but in this case it is two large corporations fighting. One has the power of tech, the other has the power of the state (copyright). So I don't think that applies in this case specifically.
Aren't you ignoring that common law is built on precedent? If they win this case, that makes it a lot easier for people whose copyright is being infringed on an individual level to get justice.
You're correct, but I think many don't realize how many small model trainers and fine-tuners there are currently. For example, PonyXL, or the many models and fine-tunes on CivitAI made by hobbyists.
So basically the reasoning is this:
- NYT vs OpenAI, neither is disenfranchised
- OpenAI vs individual creators, creators are disenfranchised
- NYT vs individual model trainers, model trainers are disenfranchised
- Individual model trainers vs individual creators, neither are disenfranchised
And if only one can win, and since the view is that information should be free, it biases the argument towards the model trainers.
What "information" are you talking about? It's a text and image generator.
Your argument is that it's okay to scrape content when you are an individual. It doesn't change the fact those individuals are people with technical expertise using it to exploit people without.
If they wrote a bot to annoy people but published how many people got angry about it, would you say it's okay because that is information?
> If they wrote a bot to annoy people but published how many people got angry about it, would you say it's okay because that is information?
Kind of? It's not okay, but not because it is usage of information without consent (this is the "information should free" part), but because it is intentionally and unnecessarily annoying and angering people (this is the "don't use the information for evil" part which I think is your position).
"See? Similarly, even in your view, model trainers aren't bad because they're using data. They're bad in general because they're exploiting creatives."
But why is it exploitative?
"They're putting the creatives out of a job." But this applies to automation in general.
"They're putting creatives out of a job, using data they created." This is the strongest argument for me. It does intuitively feel exploitative. However, there are several issues:
1. Not all models or datasets do that. For instance, no one is visibly getting paid to write comments on HN, or to write fanfics on the non-commercial fanfic site AO3. Since the data creators are not doing it as a job in the first place, it does not make sense to talk about them losing their job because of the very same data.
2. Not all models or datasets do that. For example, spam filters, AI classifiers. All of this can be trained from the entire Internet and not be exploitative because there is no job replacement involved here.
3. Some models already do that and are already widely accepted as morally fine. For example, Google Translate.
4. This may be resolved by going the other way and making more models open source (or even leaks), so more creatives can use it freely, so they can make use of the productive power.
"Because they're using creatives' information without consent." But as mentioned, it's not about the information or consent. It's about what you do with the information.
Finally, because this is a legal case, it's also important to talk about the morality of using the state to restrict people from using information freely, even if their use of the information is morally wrong.
If you believe in free culture as in free speech, then it is wrong to restrict such a use using the law, even though we might agree it is morally wrong. But this really depends if you believe in free culture as in free speech in the first place, which is a debate much larger than this.
I don't understand what the "hacker ethos" could have to do with defending openai's blatant stealing of people's content for their own profit.
Openai is not sharing their data(they're keeping it private to profit off of), so how could it be anywhere near the "hacker ethos" to believe that everyone else needs to hand over their data to openai for free?
Following the "GNU-flavour hacker ethos" as described, one concludes that it is right for OpenAI to copy data without restriction, it is wrong for NYT to restrict others from using their data, and it is also wrong for OpenAI to restrict the sharing of their model weights or outputs for training.
Luckily, most people seem to ignore OpenAI's hypocritical TOS clause against using their outputs for training. I would go one step further and say that they should share the weights completely, but I understand there are practical issues with that.
Luckily, we can kind of "exfiltrate" the weights by training on their output. Or wait for someone to leak it, like NovelAI did.
I did not contribute a vote either way to your comment above, but I would point out that you get more of what you reward. Maybe the reward is monetary, like an author paid for spending their life writing books. Maybe it’s smaller, more reputational or social—like people who generate thoughtful commentary here, or Wikipedia’s editors, or hobbyists’ forums.
When you strip people’s names from their words, as the specific count here charges; and you strip out any reason or even way for people to reward good work when they appreciate it; and you put the disembodied words in the mouth of a monolithic, anthropomorphized statistical model tuned to mimic a conversation partner… what type of thought is it that becomes abundant in this world you propose, of “data abundance”?
In that world, the only people who still have incentive to create are the ones whose content has negative value, who make things people otherwise wouldn’t want to see: advertisers, spammers, propagandists, trolls… where’s the upside of a world saturated with that?
Yes, I have no idea either. I find it disappointing.
I think people simply like it when data is liberated from corporations, but hate it when data is liberated from them. (Though this case is a corporation too so idk. Maybe just "AI bad"?)
I think you'll find that most people aren't comfortable with this in practice. They'd like e.g. the state to be able to keep secrets, such as personal information regarding citizens and the stuff foreign spies would like to copy.
Obviously we're all impacted in these perceptions by our bubbles, but it would surprise me at this particular moment in the history of US politics to find that most people favor the existence of the state at all, let alone its ability to keep secret personal information regarding citizens.
Would this deprecation of the state include disbanding the police and the armed forces? I'm guessing the people who are for the deprecation of the state would answer quite differently if the question specified details of government functions.
...I mean, police are deeply unpopular in the American political consciousness, and have been since prior to their rebrand from "slave patrols" in the 19th century. Surely you recall that, only four years ago, millions of people took to the streets calling for a completion to the unfinished business of abolition?
Obviously the armed forces are much less despised than the police. But given that private gun ownership is at an all-time high (with woman and people of color - historically marginalized groups with regard to arms equality - making up the lion's share of the recent increase), I'm not sure that people are feeling particularly vulnerable to invasion either.
Is the state really that popular in your circle? How do people express their esteem? Am I just missing it?
Will we see human-washing, where AI art or works get a "Made by man" final touch in some third-world mechanical turk den? Would that add another financially detracting layer to the AI winter?
The law generally takes a dim view of attempts to get around things like that. AI's biggest defense is claiming they are so beneficial to society that what they are doing is fine.
That argument stands on the mother of all slippery slopes! Just find a way to make your product impressive or ubiquitous and all of a sudden it doesn't matter how much you break the law along the way? That's so insane I don't even know where to start.
Why not, considering copyright law specifically has fair use outlined for that kind of thing? It's not some overriding consequence of law, it's that copyright is a granting of a privilege to individuals and that that privilege is not absolute.
There's no point in having third world mechanical turk dens do finishing passes on AI output unless you're trying to make it worse.
Artists are already using AI to photobash images, and writers are using AI to outline and create rough drafts. The point of having a human in the loop is to tell the AI what is worth creating, then recognize where the AI output can be improved. If we have algorithms telling the AI what to make and content mill hacks smearing shit on the output to make it look more human, that would be the worst of both worlds.
I think the point of the comment isn't to have this finishing layer to make things "better", but to make things "legal".
Humans are allowed to synthesize a bunch of inputs together and produce a new, novel copyrighted work.
An algorithm, if it mixes a bunch of copyrighted things together by itself, plausibly is incapable of producing a novel copyright, and instead inherits the old copyright.
Just like Clean Room Design (https://en.wikipedia.org/wiki/Clean-room_design) can be used to re-create the same software free of the original copyright, I think the parent is arguing that a mechanical turk process could allow AI to produce the same output free of the original copyright.
That will probably happen to some extent if not already. However I think people will just stop publishing online if malicious corps like OpenAI are just going to harvest works for their own gain. People publish for personal gain, not to enrich the public or enrich private entities.
One of the results the LLM has available to itself is a confidence value. It should, at the very least, provide this along with its answer. Perhaps if it did, people would stop calling it 'AI'.
My understanding is that this confidence value is not a measure of how likely something is correct/true, but more along the lines of how likely that sentence would be. Including it could be more misleading than helpful, for example if it is repeating commonly misunderstood information.
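To make that concrete, here's a minimal sketch (plain Python, with invented logit values for illustration) of why a per-token "confidence" is just a softmax probability over candidate continuations, i.e. how likely the next token is given the training data, not how true the resulting statement is:

```python
import math

def softmax(logits):
    """Convert raw model scores (logits) into a probability distribution."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the token after "The capital of Australia is".
# A model trained on text that frequently mentions Sydney can score the
# *wrong* answer highest: the score reflects co-occurrence, not truth.
logits = {"Sydney": 5.1, "Canberra": 4.2, "Melbourne": 2.0}

probs = softmax(logits)
best = max(probs, key=probs.get)
# The "confidence" is just the probability of the most likely continuation.
print(best, probs[best])
```

So a high probability on a commonly repeated misconception looks exactly like a high probability on a fact, which is why surfacing this number could mislead more than it helps.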
I'm not sure that it's possible to produce anything reasonable in that space. It would need to know how far it is away from correct to provide a usable confidence value otherwise it'd just be hallucinating a number in the same way as the result.
An analogy. Take a former commuter friend of mine, Mr Skol (named after his favourite breakfast drink). Seen on a minibus I had to get to work years ago, we shared many interesting conversations. Now he was a confident expert on everything. If asked to rate his confidence in a subject it would be a good 95% at least. However he spoke absolute garbage because his brain was rotten away from drinking Skol for breakfast, and the odd crack chaser. I suspect his model was still better than GPT-4o. But an average person could determine the veracity of his arguments.
Thus confidence should be externally rated, as an entity with knowledge cannot necessarily rate itself, for it has bias. Which then brings up the question of how you do that. Well, you'd have to do the research you were going to do anyway and compare. So now you've used the AI and done the research you would have had to do if the AI didn't exist. So the AI at this point becomes a cost over benefit if you need something with any level of confidence and accuracy.
Thus the value is zero unless you need crap information, which is at least here, never, unless I'm generating a picture of a goat driving a train or something. And I'm not sure that has any commercial value. But it's fun at least.
This is what a number of startups, such as Yurts.ai and Vannevar Labs, are racing to build for organizations. I wouldn't be surprised if, in 5-10 years, most large corps and government agencies had these sort of LLM/RAGs over their internal documents.
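For anyone unfamiliar with the pattern, retrieval-augmented generation (RAG) can be sketched in a few lines. This is a toy bag-of-words retriever with made-up document texts; real systems use learned embedding models and vector databases, but the shape is the same: retrieve the most relevant internal documents, then prepend them to the LLM prompt as grounding context.

```python
import math
from collections import Counter

def vectorize(text):
    """Toy bag-of-words vector; real RAG uses learned embeddings."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical internal documents.
docs = [
    "Expense reports must be filed within 30 days of travel",
    "The VPN client is mandatory for all remote connections",
]

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = vectorize(query)
    scored = sorted(docs, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return scored[:k]

query = "When are expense reports due?"
context = retrieve(query)[0]
# The retrieved text is prepended to the prompt sent to the LLM.
prompt = f"Context: {context}\n\nQuestion: {query}"
print(prompt)
```

The LLM then answers from the supplied context rather than from whatever it memorized in training, which is the whole appeal for internal corporate documents.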
ChatGPT Search provides this, by the way, though it relies a lot on the quality of Bing search results. Consensus.app does this but for research papers, and has been very useful to me.
More often than not in my experience, clicking these sources takes me to pages that either don’t exist, don’t have the information ChatGPT is quoting, or ChatGPT completely misinterpreted the content.
Yep, the game plan is to keep settling out of court so that (they hope) no legal precedent is set that would effectively make their entire business model illegal. That works until they run out of money I guess, but they probably can't keep it up forever.
Wouldn’t the better method be to throw all your money at one suit you can make an example of and try to win that one? You can’t effectively settle every single suit if you have no realistic chance of winning; otherwise every single publisher on the internet will come and try to get their money.
That's a good strategy, but you have to have the right case. One where OpenAI feels confident they can win and establish favorable precedent. If the facts of the case aren't advantageous, it's probably not worth the risk.
Side question: why don't other companies get the same attention? Anthropic, xAI and others have deep pockets, and scraped the same data, I'm assuming? It could be a gold mine for all these news agencies to keep settling out of court to make some buck.
I mean it could also be that this is just a case of an OpenAI spokesperson repeating the OpenAI party line? OpenAI's very existence depends on training LLMs on copyrighted works being considered fair use, so I would be extremely surprised if any spokesperson ever hinted that it might not be fair use.
I can see where you're coming from with wanting the government to be more proactive in clamping down on illegal practices. But it's pretty standard, from what I understand, that violations of civil law only have consequences if and when an aggrieved party goes to court.
Yeah, as I said, I see where you're coming from. I'm just saying that, as far as I understand, it's not unusual in any way. In fact, it almost seems to be a part of the Silicon Valley VC playbook to blatantly break the law and either grow so huge that the law has to change to accommodate, or cash out before the law breaking has consequences.
If OpenAI is evading paywalls, then the Aaron Swartz case should be considered precedent. The scale is just much, much larger, and it's for financial gain, not motivated by morals.
Fair. But I made a comment somewhere else that, if their models become better than ours, they'll be incorporated into products. Then we're back to being dependent on China for LLM model development as well, on top of manufacturing. Realistically that'll be banned because of National Security laws or something, but companies tend to choose the path of "best and cheapest" no matter what.
You don't have to pre-register copyright in any Berne Convention countries. Your copyright exists from the moment you create something.
(ETA: This paragraph below is diametrically wrong. Sorry.)
AFAIK in the USA, registered copyright is necessary if you want to bring a lawsuit and get more than statutory damages, which are capped low enough that corporations do pre-register work.
Not the case in all Berne countries; you don't need this in the UK for example, but then the payouts are typically a lot lower in the UK. Statutory copyright payouts in the USA can be enough to make a difference to an individual author/artist.
As I understand it, OpenAI could still be on the hook for up to $150K per article if it can be demonstrated it is wilful copyright violation. It's hard to see how they can argue with a straight face that it is accidental. But then OpenAI is, like several other tech unicorns, a bad faith manufacturing device.
You seem to know more about this than me. I have a family member who "invented" some electronics things. He hasn't done anything with the inventions (I'm pretty sure they're quackery).
But to ensure his patent, he mailed himself a sealed copy of the plans. He claims the postage date stamp will hold up in court if he ever needs it.
Is that a thing? Or is it just more tinfoil business? It's hard to tell with him.
It won't hold up in court, and given that the post-office will mail/deliver unsealed letters (which may then be sealed after the fact), will be viewed rather dimly.
Even if they did, it in fact cannot be checked. There is precedent that you cannot subpoena NSA for their intercepts, because exactly what has been intercepted and stored is privileged information.
Mailing yourself using registered mail is a very old tactic to establish a date for your documents using an official government entity, so this can be meaningful in court. However this may not provide the protection he needs. Copyright law differs from patent law and he should seek legal advice
The US moved to first-to-file years ago. Whoever files first gets it, except that if the inventor publishes it publicly there is a one-year grace period (which would not apply to a self-mail or private mail to other people).
Honestly I don't know whether that actually is a meaningful thing to do anymore; it may be with patents.
It certainly used to be a legal device people used.
Essentially it is low-budget notarisation. If your family member believes they have something which is timely and valuable, it might be better to seek out proper legal notarisation, though -- you'd consult a Notary Public:
Without registration you still have your natural copyright, but you would have to try to recover the profits made by the infringer.
Which does sound like more of an uphill struggle for The Intercept, because OpenAI could maybe just say "anything we earn from this is de minimis considering how much errr similar material is errrr in the training set"
Oh man it's going to take a long time for me to get my brain to accept this truth over what I'd always understood.
You have to register to sue, but you have the copyright automatically at the moment the work is created.
You can go register after an infringement and still sue, but you then won't be able to get statutory damages or attorney's fees.
Statutory damages are a big deal in general but especially here where proving how much of OpenAI's revenue is due to your specific articles is probably impossible. Which is why they're suing under this DMCA provision: it's not an infringement suit so the registration requirement doesn't apply, and there's a separate statutory damages provision for it.
It's so weird to me seeing journalists complaining about copyright and people taking something they did.
The whole of journalism is taking the acts of others and repeating them, so why does a journalist claim rights over someone else's actions when someone simply looks at something they did and repeats it?
If no one else ever did anything, the journalist would have nothing to report; journalism is inherently about replicating the work and acts of others.
That’s a pretty narrow view of journalism. If you look into newspapers, it’s not just a list of events but also opinion pieces, original research, reports etc. The main infringement isn’t with the basic reporting of facts but with the original part that’s done by the writer.