
Human authors cannot read and perfectly memorize millions of books in a day, and are therefore not comparable to computers running machine learning software.



That's not how machine learning works. They don't "perfectly memorize" anything. They do learn much quicker than humans, of course, but that alone doesn't seem like a good argument.


They do not perfectly memorize but they can spit out entire chapters of books word for word.


I've never heard or seen this unless a model is way overtrained. Could you provide an example?


Is GPT3.5 way overtrained? Because it can regurgitate a ton of text verbatim from books.

https://ibb.co/album/sJBB11


What am I looking at? Do you have the prompts that were used?


You are looking at a Large Language Model (specifically GPT3.5) outputting full paragraphs from a book it was trained on. The prompt is the first 13 words of the book, shown with a white background in the first screenshot. The words with a green background are the LLM outputs.

The second screenshot is a diff that compares the LLM output with the original book.
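If you want to run this kind of check yourself, a word-level diff of the model output against the original text is easy to script. This is just a sketch: the two strings below are stand-ins, not the actual passage from the screenshots.

```python
import difflib

# Placeholder strings standing in for the original book passage
# and the model's continuation; substitute the real texts.
original = "I am by birth a Genevese, and my family is one of the most distinguished of that republic."
model_output = "I am by birth a Genevese, and my family is one of the most distinguished of that republic."

# Word-level unified diff; "-"/"+" lines mark any divergence.
diff = list(difflib.unified_diff(original.split(), model_output.split(), lineterm=""))

# difflib emits nothing at all when the sequences are identical.
print("verbatim match" if not diff else "\n".join(diff))
```

An empty diff means the regurgitation is word-for-word; a handful of `-`/`+` lines shows exactly where the model drifted.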


What book? I just want to verify your statements. I don’t take 2 cropped screenshots on an image site as evidence to be frank.


Frankenstein. You might argue that it is a public domain book, but it demonstrates that these LLMs can and do memorize books.

It also works with Harry Potter, but the API behaves weirdly: after quickly producing correct output at the beginning, it suddenly hangs. You can continue generating the correct words with more API calls, but it only produces a few words at a time before stopping. It clearly knows the right content, but doesn't want to send it all at once.

I think there is some output filtering for "big" authors and content that is too famous, in order to avoid getting sued.

I wrote more details about the weird Harry Potter behaviour here: https://news.ycombinator.com/item?id=37614697


They could, but my understanding is that with enough input the odds of that happening are very remote.

And just because something could do something doesn't make it a violation.


Do you have a credible source that says an LLM like the ones trained by OpenAI can perfectly memorize millions of books? Or be trained in a single day?


It's actually provably impossible for an OpenAI model to perfectly memorize millions of books.


It's provably impossible for a model with 1.76 trillion floating-point parameters (like GPT-4) to memorize millions of books?

How many bytes do you think a million compressed books takes? Consider that the way these models are trained is basically completing the next symbol based on the previous words, which is how most compressors are made.
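A rough back-of-the-envelope makes the point concrete. Every figure below is an assumption (the GPT-4 parameter count is a widely repeated rumor, not an official number, and ~1 MB per compressed book is a guess):

```python
# Back-of-the-envelope capacity comparison; all figures are assumptions.
BOOKS = 1_000_000
COMPRESSED_BYTES_PER_BOOK = 1_000_000   # assume ~1 MB per book after compression
PARAMS = 1_760_000_000_000              # rumored GPT-4 parameter count
BYTES_PER_PARAM = 2                     # fp16 storage

corpus_bytes = BOOKS * COMPRESSED_BYTES_PER_BOOK  # ~1 TB for the whole corpus
model_bytes = PARAMS * BYTES_PER_PARAM            # ~3.5 TB of raw parameters

print(f"corpus ~{corpus_bytes / 1e12:.1f} TB, model ~{model_bytes / 1e12:.1f} TB")
```

On this crude count the raw parameter storage is in the same ballpark as the compressed corpus, so "provably impossible" at least needs a stronger argument than sheer capacity.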


No, but the point still stands.


No it doesn't. Kim just took your entire thesis away.


Are you really trying to argue that there exist humans that can learn facts at a similar pace to a datacenter running GPT training software on petabytes of scraped data? My point still stands.


What point? To me it sounds like they shot down your entire comment.


Does anyone have a credible source indicating that LLM learning is anything like human learning? That's the first step in this argument


Well, we would probably have to understand human learning a lot better first.


"Your honor, I don't know how human minds work, but clearly LLMs work the same way." The legal burden of evidence is on LLM proponents to establish that what they are doing is the same as the human mind and therefore should be treated the same way.


That's a bit of a false dilemma. To address copyright issues, we needn't prove that machine learning models learn exactly like humans or not. The more relevant point is that neither human learning nor machine learning has the intent to store or replicate copyrighted material; both aim to generalize from data to produce new content. It is in this way that they are similar.


It's the argument that is being made. Intent isn't a requisite for copyright infringement. Your re-characterization of the argument is so general that it's useless.


Heh, we seem to be talking past each other.

I wonder if this might be because Slavboj might be a monist/physicalist[1], and you might be a dualist[2]? If that's the case we'd all argue until we're blue in the face if we don't at least recognize this underlying difference. For the record, since I've studied biology, I'm probably closest to some form of mechanism[3] (due to the rejection of vis vitalis[4] in the early 20th c.).

Of course, you could also just be a very skeptical monist mindful of the Clever Hans effect[5] and working from there.

Let me know which (if any), maybe we can still find middle ground!

[1] https://en.wikipedia.org/wiki/Physicalism

[2] https://en.wikipedia.org/wiki/Mind%E2%80%93body_dualism

[3] https://en.wikipedia.org/wiki/Mechanism_(philosophy)

[4] https://en.wikipedia.org/wiki/Vitalism

[5] https://en.wikipedia.org/wiki/Clever_Hans


So people with eidetic memory should be banned?

This is all becoming akin to the 'but on a computer' for patents.


> 'but on a computer'

It's exactly the issue here. The capability doesn't matter; what matters is the scale of it.

Would you be ok with your car billing you every single time you exceed the speed limit because "there already are speed cameras out there"?



