to get paid for*.AI has definitely reduced the influence pseudo-intellectuals have had on society. Now, you actually have to be smart enough to do something that isn't easily reproduced using LLMs.
No, I get your point. Unfortunately, alot of people here try to act high and mighty like they are posting here for some altruistic reason. The reason why I, you, and everyone else posts here is the human reason that we want others to engage with our posts. In order to do that, you have to put your best foot forward, which includes making sure the spelling and grammar of your posts is correct. While I do not use an LLM for this, I think that it is vaild to use these tools to make sure nothing gets in the way of whatever point you are trying to make.
> In order to do that, you have to put your best foot forward
In English. You have to put your best foot forward in English. And in your environment with the resources you have at your disposal.
For example, I'm currently engaging with you between steps in a chemistry process that's happening under the fumehood next to me while wearing a respirator, a muggy plastic chemical resistant gown and disposable gloves nitrile globes.
I am absolutely certain that these conditions are different than the ones I would need to 'put my best food forward' in this discussion. I'm also certain that quite certain that you and I would both absolutely stumble if we were obligated to particpate in this forum in a language that we're not proficient in as many users often attempt to do and are unfairly penalized for by other members of the community.
I'm with you on the LLM usage for grammatical issues for non-native speakers. I bet more in this community would feel the same way if Dang whimsically mandated that people had to use a language other than English on certain days of the week.
>It's better to communicate as an individual, warts and all, than to replace your expression with a sanitized one just because it seems "better."
It is definitely not true that it is better for a poster to communicate like an individual when it comes to spelling and grammar. People ignore posts that have poor grammar or spelling mistakes, and communications that have poor grammar are seen as unprofessional. Even I do it at a semi-subconscious level. The more difficult or the more amount of attention someone has to pay to understand your post, the less people will be willing to put in that effort to do so.
Exactly. Tell that to whoever is grading your next paper, or reviewing your resume, or watching your presentation. People are judged by their linguistic ability even in cases where it shouldn't matter. It's a well known heuristic bias. It's no surprise that many of the people here denying it are themselves quite literate.
>Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim? Sure, it can also produce extremely transformed versions, but is that really relevant if it holds within it enough information for a (near-)verbatim reproduction?
I feel as though, from an information-theoretic standpoint, it can't be possible that an LLM (which is almost certainly <1 TB big) can contain any substantial verbatim portion of its training corpus, which includes audio, images, and videos.
> I feel as though, from an information-theoretic standpoint, it can't be possible that an LLM (which is almost certainly <1 TB big) can contain any substantial verbatim portion of its training corpus, which includes audio, images, and videos.
It doesn't need to for my argument to make sense. It's a problem if it reproduces a single copyrighted work (near)-verbatim. Which we have plenty of examples of.
Do we? Even when people attempt to jail break most models with 1000s of prompts they are only able to get a paragraph or two of well known copyrighted works and some blocks of paraphrased text, and that's with giving it a substantially leading question.
It surely doesn't matter how leading or contorted the prompt has to be if it shows that the model is encoding the copyrighted work verbatimly or nearly so.
It definitely does, which is why I put substantial amount of verbatim material. If someone can recite the first paragraph of Harry Potter and the sorcerers stone from memory, it surely doesn't mean they have memorized the entire book.
Of course not. But if the passage they can recite is long enough that it is copyrightable, then surely distributing a thing that (contortedly or not) can do said recitation is a form of redistribution of the work itself?
No. It is against their TOS to attempt to jailbreak their models. While I don't agree that the models can recite longer periods of verbatim copyrighted material, even if it could, the person who is at fault is the person subverting the system, not the creator of the system. If I steal a library book and make copies of it to distribute illegally, it wouldn't make sense to hold the library at fault for infringing on the book publisher's copyright.
That's why he is saying it's not equivalent. For it to be the same, the LLM would have to train on/transform Minecraft's source code into its weights, then you prompt the LLM to make a game using the specifications of Minecraft solely through prompts. Of course it's copyright infringement if you just give a tool Minecraft's source code and tell it to copy it, just like it would be copyright infringement if you used a copier to copy Minecraft's source code into a new document and say you recreated Minecraft.
What if Copilot was already trained with Minecraft code in the dataset? Should be possible to test by telling the model to continue a snippet from the leaked code, the same way a news website proved their articles were used for training.
I feel as though the fact that you are asking a valid question shows how transformative it is; clearly, while the LLM gets a general ability to code from its training corpus, the data gets so transformed that it's difficult to tell what exactly it was trained on except a large body of code.
This would still be true of the case where you ask an LLM to rewrite a program while referencing the source. Unless someone was in the room watching or the logs are inspected, how would they know if the LLM was referencing the original source material, or just using general programing knowledge to build something similar.
The context window is quite literally not a transformation of tokens or a "jumbling of bytes," it's the exact tokens themselves. The context actually needs to get passed in on every request but it's abstracted from most LLM users by the chat interface.
>This feels sort of like saying "I just blindly threw paint at that canvas on the wall and it came out in the shape of Mickey Mouse, and so it can't be copyright infringement because it was created without the use of my knowledge of Micky Mouse"
IANAL, but that analogy wouldn't work because Mickey Mouse is a trademark, so it doesn't matter how it is created.
I agree there has to be a court case about it. I think the current argument, however, is that it is transformative, and therefore falls under fair use.
Yea, a finding that training is transformative would be pretty significant and it's likely that the precedent of thumbnail creation being deemed transformative would likely steer us towards such a finding. Transformative is always a hard thing to bank on because it is such a nebulous and judgement based call. There are excellent examples of how precise and gritty this can get in audio sampling.
Didn't know about thumbnails being fair use. In that case, I just don't see an argument that genAI training on source code is less transformative than thumbnails.
You don’t get to simply claim fair use based on how transformative your derivative work is.
“””
Section 107 calls for consideration of the following four factors in evaluating a question of fair use:
Purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes: Courts look at how the party claiming fair use is using the copyrighted work, and are more likely to find that nonprofit educational and noncommercial uses are fair. This does not mean, however, that all nonprofit education and noncommercial uses are fair and all commercial uses are not fair; instead, courts will balance the purpose and character of the use against the other factors below. Additionally, “transformative” uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work.
Nature of the copyrighted work: This factor analyzes the degree to which the work that was used relates to copyright’s purpose of encouraging creative expression. Thus, using a more creative or imaginative work (such as a novel, movie, or song) is less likely to support a claim of a fair use than using a factual work (such as a technical article or news item). In addition, use of an unpublished work is less likely to be considered fair.
Amount and substantiality of the portion used in relation to the copyrighted work as a whole: Under this factor, courts look at both the quantity and quality of the copyrighted material that was used. If the use includes a large portion of the copyrighted work, fair use is less likely to be found; if the use employs only a small amount of copyrighted material, fair use is more likely. That said, some courts have found use of an entire work to be fair under certain circumstances. And in other contexts, using even a small amount of a copyrighted work was determined not to be fair because the selection was an important part—or the “heart”—of the work.
Effect of the use upon the potential market for or value of the copyrighted work: Here, courts review whether, and to what extent, the unlicensed use harms the existing or future market for the copyright owner’s original work. In assessing this factor, courts consider whether the use is hurting the current market for the original work (for example, by displacing sales of the original) and/or whether the use could cause substantial harm if it were to become widespread.
“””
>I haven't claimed anything, The courts did: https://www.whitecase.com/insight-alert/two-california-distr.... And regardless, my point still stands that it is an open question; however, given the already present body of cases, it is tipping in the favor of the AI companies. Also, if thumbnails fall under fair use due to it being transformative of full-sized pictures, I cannot see an argument that AI training on data is somehow less transformative than downscaling an image for a thumbnail.
I would usually. Sometimes if it's like 2 * x + b, I would not, but personally, I hate chasing down bugs like this, so just add it to remove ambiguity. Also, for like b + 2 * a, I will almost always use parentheses.
Children and young students, certainly. Adult students: almost 100%. If writing is your job, then by definition, and your problem is more often finding something to say, not writing it.
You’re not counting all the office workers who have to write reports or emails, or all the scammers who write those websites to manipulate SEO or show you ads.
Everyone should think twice about putting their name on AI garbage, or garbage of any kind. But wishing doesn’t stop it from happening, especially when companies are explicitly selling you on doing just that. Remember the Apple Intelligence office ads?
reply