NewsaHackO's comments | Hacker News

What he's saying though is that the original poster is vastly overstating the effect the headphone had on his head. There was no dent in his skull, just skin deformation that happens with literally every headphone.

> Developers worth working with, grow out of this in a new project. Claude doesn't.

There is no way this is true. People introduce fewer bugs with time and guidance, but no human writes zero bugs. Also, bugs are not planned; it's always easy to say in hindsight, "A human would have literally copied the original lookup map," but every bug stems from some mistake that deviates from the status quo. That's why it's a bug.


Sorry, perhaps I should have been clearer. They don't grow completely out of making bugs (although they do tend to make fewer over time), they grow out of making solutions that look right but don't actually solve the problem. This is because they understand the problem space better over time.

No, it's broadly true. Also, that's why we have code review and tests, so that it has to pass a couple of filters.

LLMs don't make mistakes like humans make mistakes.

If you're a SWE at my company, I can assume you have a baseline of skill and you tested the code yourself, so I'm trying to look for any edge cases or gaps or whatever that you might have missed. Do you have good enough tests to make both of us feel confident the code does what it appears to do?

With LLMs, I have to treat its code like it's a hostile adversary trying to sneak in subtle backdoors. I can't trust anything to be done honestly.


If you like the open-source codebase, then why are you peddling your closed-source paid platform?

You're allowed to like both. Antinote is very unique, and devs should be allowed to charge for their work if it's a quality app with a really polished UX.

Also, it's not theirs.


This is a great legal defense, but if they are trying to present themselves as fighting for the rights of the users while doing literally the same thing Reddit is doing, that is disingenuous.

I wonder if any lawyers could weigh in here. Does this admission that they know the data is the user's make a class-action against SerpApi or whatever a slam-dunk? They're practically publishing their own admission of guilt!

Is SerpApi asking each user for permission to use their posts if they are saying that the rights of the posts belong to the user?

Copyright protects copying. Scraping content does not violate copyright if the content is not republished. Otherwise Google and all search engines would be illegal.

to get paid for*. AI has definitely reduced the influence pseudo-intellectuals have had on society. Now, you actually have to be smart enough to do something that isn't easily reproduced using LLMs.

No, I get your point. Unfortunately, a lot of people here try to act high and mighty, like they are posting here for some altruistic reason. The reason why I, you, and everyone else posts here is the human reason that we want others to engage with our posts. In order to do that, you have to put your best foot forward, which includes making sure the spelling and grammar of your posts are correct. While I do not use an LLM for this, I think it is valid to use these tools to make sure nothing gets in the way of whatever point you are trying to make.

> In order to do that, you have to put your best foot forward

In English. You have to put your best foot forward in English. And in your environment with the resources you have at your disposal.

For example, I'm currently engaging with you between steps in a chemistry process that's happening under the fume hood next to me, while wearing a respirator, a muggy plastic chemical-resistant gown, and disposable nitrile gloves.

I am absolutely certain that these conditions are different from the ones I would need to 'put my best foot forward' in this discussion. I'm also quite certain that you and I would both stumble if we were obligated to participate in this forum in a language we're not proficient in, as many users attempt to do and are unfairly penalized for by other members of the community.

I'm with you on the LLM usage for grammatical issues for non-native speakers. I bet more in this community would feel the same way if Dang whimsically mandated that people had to use a language other than English on certain days of the week.


Oh shit that would be fun. Tuesday, we're going to do it in Mongolian, see how that goes.

>It's better to communicate as an individual, warts and all, than to replace your expression with a sanitized one just because it seems "better."

It is definitely not true that it is better for a poster to communicate like an individual when it comes to spelling and grammar. People ignore posts with poor grammar or spelling mistakes, and communications with poor grammar are seen as unprofessional. Even I do this at a semi-subconscious level. The more attention someone has to pay to understand your post, the fewer people will be willing to put in that effort.


Exactly. Tell that to whoever is grading your next paper, or reviewing your resume, or watching your presentation. People are judged by their linguistic ability even in cases where it shouldn't matter. It's a well known heuristic bias. It's no surprise that many of the people here denying it are themselves quite literate.

>Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim? Sure, it can also produce extremely transformed versions, but is that really relevant if it holds within it enough information for a (near-)verbatim reproduction?

I feel as though, from an information-theoretic standpoint, it can't be possible that an LLM (which is almost certainly <1 TB big) can contain any substantial verbatim portion of its training corpus, which includes audio, images, and videos.
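A rough back-of-the-envelope version of that argument. The corpus and model sizes here are loudly-hedged assumptions, not facts about any specific model (frontier training sets are commonly reported in the tens of trillions of tokens, but exact figures are not public):

```python
# Back-of-the-envelope: how many bytes of model weight are available
# per training token? All sizes are rough public estimates.

MODEL_BYTES = 1e12       # assumed upper bound: ~1 TB of weights
TRAIN_TOKENS = 15e12     # assumed: ~15 trillion training tokens
BYTES_PER_TOKEN_RAW = 4  # roughly ~4 bytes of UTF-8 text per token

bytes_per_token = MODEL_BYTES / TRAIN_TOKENS
compression_ratio = (TRAIN_TOKENS * BYTES_PER_TOKEN_RAW) / MODEL_BYTES

print(f"{bytes_per_token:.3f} weight bytes per training token")
print(f"~{compression_ratio:.0f}x lossless compression would be needed "
      f"to store the corpus verbatim")
```

Under these assumptions the model has well under one byte of weights per training token, far below what lossless storage of the corpus would require, which is the information-theoretic point being made.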


> I feel as though, from an information-theoretic standpoint, it can't be possible that an LLM (which is almost certainly <1 TB big) can contain any substantial verbatim portion of its training corpus, which includes audio, images, and videos.

It doesn't need to for my argument to make sense. It's a problem if it reproduces a single copyrighted work (near)-verbatim. Which we have plenty of examples of.


Do we? Even when people attempt to jailbreak most models with thousands of prompts, they are only able to get a paragraph or two of well-known copyrighted works and some blocks of paraphrased text, and that's with a substantially leading prompt.

It surely doesn't matter how leading or contorted the prompt has to be if it shows that the model is encoding the copyrighted work verbatim or nearly so.

It definitely does, which is why I said a substantial amount of verbatim material. If someone can recite the first paragraph of Harry Potter and the Sorcerer's Stone from memory, it surely doesn't mean they have memorized the entire book.

Of course not. But if the passage they can recite is long enough that it is copyrightable, then surely distributing a thing that (contortedly or not) can do said recitation is a form of redistribution of the work itself?

No. It is against their TOS to attempt to jailbreak their models. While I don't agree that the models can recite longer passages of verbatim copyrighted material, even if they could, the person at fault is the one subverting the system, not the creator of the system. If I steal a library book and make copies of it to distribute illegally, it wouldn't make sense to hold the library at fault for infringing on the book publisher's copyright.

This is an interesting take that I hadn't considered. Your analogy with a library break-in is good. I'll need to digest this. Thanks.

That's why he is saying it's not equivalent. For it to be the same, the LLM would have to train on/transform Minecraft's source code into its weights, then you prompt the LLM to make a game using the specifications of Minecraft solely through prompts. Of course it's copyright infringement if you just give a tool Minecraft's source code and tell it to copy it, just like it would be copyright infringement if you used a copier to copy Minecraft's source code into a new document and say you recreated Minecraft.

What if Copilot was already trained with Minecraft code in the dataset? Should be possible to test by telling the model to continue a snippet from the leaked code, the same way a news website proved their articles were used for training.
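The probe described above, feeding the model a prefix of the leaked code and measuring how closely its continuation matches the known next lines, can be sketched as follows. `model_continue` is a hypothetical stand-in for an actual model call, here hard-coded so the sketch is self-contained:

```python
import difflib

def model_continue(prefix: str) -> str:
    # Hypothetical stand-in for a real completion call; it parrots a
    # canned string so this sketch runs without any API access.
    return "for (Entity e : world.entities) { e.tick(); }"

def overlap_ratio(continuation: str, ground_truth: str) -> float:
    # How similar is the model's continuation to the known next lines?
    return difflib.SequenceMatcher(None, continuation, ground_truth).ratio()

# Hypothetical "leaked" ground truth the prefix should continue into.
leaked_next_lines = "for (Entity e : world.entities) { e.tick(); }"
cont = model_continue("void tickEntities() {\n    ")
score = overlap_ratio(cont, leaked_next_lines)
print(f"overlap: {score:.2f}")
```

A continuation score near 1.0 across many independent prefixes would suggest memorization rather than coincidence; a single match proves little.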

I feel as though the fact that you are asking a valid question shows how transformative it is; clearly, while the LLM gets a general ability to code from its training corpus, the data gets so transformed that it's difficult to tell what exactly it was trained on except a large body of code.

This would still be true in the case where you ask an LLM to rewrite a program while referencing the source. Unless someone was in the room watching or the logs are inspected, how would they know whether the LLM was referencing the original source material or just using general programming knowledge to build something similar?

Then the training itself is the legal question. This doesn't seem all that complicated to me.

Is there a legal distinction between training, post-training, fine tuning and filling up a context window?

In all of these cases an AI model is taking a copyrighted source, reading it, jumbling the bytes and storing it in its memory as vectors.

Later a query reads these vectors and outputs them in a form which may or may not be similar to the original.


Judges have previously ruled that training counts as sufficiently transformative to qualify for fair use: https://www.whitecase.com/insight-alert/two-california-distr...

I don't know of any rulings on the context window, but it's certainly possible judges would rule that would not qualify as transformative.


The context window is quite literally not a transformation of tokens or a "jumbling of bytes," it's the exact tokens themselves. The context actually needs to get passed in on every request but it's abstracted from most LLM users by the chat interface.
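What "the context gets passed in on every request" looks like in practice can be sketched with a stateless chat loop. The `fake_model` function is a stand-in for a real completion endpoint; no particular vendor's SDK is assumed:

```python
# Sketch: chat interfaces are stateless per request. The client keeps
# the transcript and resends the exact prior tokens on every turn;
# nothing in the context window is transformed or "jumbled".

def fake_model(messages):
    # Stand-in for a real completion endpoint.
    return f"(reply to {len(messages)} messages)"

transcript = []

def send(user_text):
    transcript.append({"role": "user", "content": user_text})
    reply = fake_model(transcript)  # whole verbatim history goes along
    transcript.append({"role": "assistant", "content": reply})
    return reply

send("First chapter of some copyrighted text...")
send("Continue.")

# The first message is still present, byte-for-byte, in every request:
assert transcript[0]["content"] == "First chapter of some copyrighted text..."
```

This is why the context window is the exact tokens: the model receives the untransformed text itself each turn, unlike weights, which hold a lossy statistical transformation of the training data.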

It's not equivalent, but it's close enough that you can't easily dismiss it.
