
Human authors cannot read and perfectly memorize millions of books in a day, and are therefore not comparable to computers running machine learning software.



That's not how machine learning works. They don't "perfectly memorize" anything. They do learn much quicker than humans, of course, but that alone doesn't seem like a good argument.


They do not perfectly memorize but they can spit out entire chapters of books word for word.


I've never heard or seen this unless a model is way overtrained. Could you provide an example?


Is GPT3.5 way overtrained? Because it can regurgitate a ton of text verbatim from books.

https://ibb.co/album/sJBB11


What am I looking at? Do you have the prompts that were used?


You are looking at a Large Language Model (specifically GPT3.5) outputting full paragraphs from a book it was trained on. The prompt is the first 13 words of the book, shown with a white background in the first screenshot. The words with a green background are the LLM outputs.

The second screenshot is a diff that compares the LLM output with the original book.
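If you want to run this kind of check yourself, a word-level diff of the model output against the original text is easy to script. This is just a sketch: the two strings below are stand-ins, not the actual passage from the screenshots.

```python
import difflib

# Placeholder strings standing in for the original book passage
# and the model's continuation; substitute the real texts.
original = "I am by birth a Genevese, and my family is one of the most distinguished of that republic."
model_output = "I am by birth a Genevese, and my family is one of the most distinguished of that republic."

# Word-level unified diff; "-"/"+" lines mark any divergence.
diff = list(difflib.unified_diff(original.split(), model_output.split(), lineterm=""))

# difflib emits nothing at all when the sequences are identical.
print("verbatim match" if not diff else "\n".join(diff))
```

An empty diff means the regurgitation is word-for-word; a handful of `-`/`+` lines shows exactly where the model drifted.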


What book? I just want to verify your statements. I don’t take 2 cropped screenshots on an image site as evidence to be frank.


Frankenstein. You might argue that it is a public domain book, but it demonstrates that these LLMs can and do memorize books.

It also works with Harry Potter, but the API behaves weirdly: after quickly producing correct output at the beginning, it suddenly hangs. You can continue generating the correct words with more API calls, but it only produces a few words at a time before stopping. It clearly knows the right content, but doesn't want to send it all at once.

I think there is some output filtering for "big" authors and content that is too famous, in order to avoid getting sued.

I wrote more details about the weird Harry Potter behaviour here: https://news.ycombinator.com/item?id=37614697


They could, but my understanding is that with enough input the odds of that happening are very remote.

And just because something could do something doesn't make it a violation.


Do you have a credible source that says an LLM like the ones trained by OpenAI can perfectly memorize millions of books? Or be trained in a single day?


It's actually provably impossible for an OpenAI model to perfectly memorize millions of books.


It's provably impossible for a model with 1.76 trillion floating-point parameters (like GPT-4) to memorize millions of books?

How many bytes do you think a million compressed books takes? Consider that the way these models are trained is basically completing the next symbol based on the previous words, which is how most compressors are made.
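A rough back-of-the-envelope makes the point concrete. Every figure below is an assumption (the GPT-4 parameter count is a widely repeated rumor, not an official number, and ~1 MB per compressed book is a guess):

```python
# Back-of-the-envelope capacity comparison; all figures are assumptions.
BOOKS = 1_000_000
COMPRESSED_BYTES_PER_BOOK = 1_000_000   # assume ~1 MB per book after compression
PARAMS = 1_760_000_000_000              # rumored GPT-4 parameter count
BYTES_PER_PARAM = 2                     # fp16 storage

corpus_bytes = BOOKS * COMPRESSED_BYTES_PER_BOOK  # ~1 TB for the whole corpus
model_bytes = PARAMS * BYTES_PER_PARAM            # ~3.5 TB of raw parameters

print(f"corpus ~{corpus_bytes / 1e12:.1f} TB, model ~{model_bytes / 1e12:.1f} TB")
```

On this crude count the raw parameter storage is in the same ballpark as the compressed corpus, so "provably impossible" at least needs a stronger argument than sheer capacity.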


No, but the point still stands.


No it doesn't. Kim just took your entire thesis away.


Are you really trying to argue that there exist humans that can learn facts at a similar pace to a datacenter running GPT training software on petabytes of scraped data? My point still stands.


What point? To me it sounds like they shot down your entire comment.


Does anyone have a credible source indicating that LLM learning is anything like human learning? That's the first step in this argument


Well, we would probably have to understand human learning a lot better first.


"Your honor, I don't know how human minds work, but clearly LLMs work the same way." The legal burden of evidence is on LLM proponents to establish that what they are doing is the same as the human mind and therefore should be treated the same way.


That's a bit of a false dilemma. To address copyright issues, we needn't prove that machine learning models learn exactly like humans or not. The more relevant point is that neither human learning nor machine learning has the intent to store or replicate copyrighted material; both aim to generalize from data to produce new content. It is in this way that they are similar.


It's the argument that is being made. Intent isn't a requisite for copyright infringement. Your re-characterization of the argument is so general that it's useless.


Heh, we seem to be talking past each other.

I wonder if this might be because Slavboj might be a monist/physicalist[1], and you might be a dualist[2]? If that's the case we'd all argue until we're blue in the face if we don't at least recognize this underlying difference. For the record, since I've studied biology, I'm probably closest to some form of mechanism[3] (due to the rejection of vis vitalis[4] in the early 20th c.).

Of course, you could also just be a very skeptical monist mindful of the Clever Hans effect[5] and working from there.

Let me know which (if any), maybe we can still find middle ground!

[1] https://en.wikipedia.org/wiki/Physicalism

[2] https://en.wikipedia.org/wiki/Mind%E2%80%93body_dualism

[3] https://en.wikipedia.org/wiki/Mechanism_(philosophy)

[4] https://en.wikipedia.org/wiki/Vitalism

[5] https://en.wikipedia.org/wiki/Clever_Hans


So people with eidetic memory should be banned?

This is all becoming akin to the 'but on a computer' for patents.


> 'but on a computer'

It's exactly the issue here. The capability doesn't matter; what matters is the scale of it.

Would you be ok with your car billing you every single time you exceed the speed limit because "there already are speed cameras out there"?



