Human authors cannot read and perfectly memorize millions of books in a day, and are therefore not comparable to computers running machine learning software.
That's not how machine learning works. The models don't "perfectly memorize" anything. They do learn much more quickly than humans, of course, but that alone doesn't seem like a good argument.
You are looking at a Large Language Model (specifically GPT-3.5) outputting full paragraphs from a book it was trained on. The prompt is the first 13 words of the book, shown with a white background in the first screenshot. The words with a green background are the LLM's output.
The second screenshot is a diff that compares the LLM output with the original book.
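If you want to try reproducing this yourself, here is a minimal sketch. It assumes the openai v1 Python client and the gpt-3.5-turbo-instruct completions model (the post only says "GPT3.5", so the exact model and settings are guesses), and a hypothetical local file holding the matching passage of the original text:

    # Sketch: prompt the model with the opening words of Frankenstein and
    # diff the completion against the original. Assumes the openai Python
    # package (v1.x) and OPENAI_API_KEY set in the environment.
    import difflib

    from openai import OpenAI

    client = OpenAI()

    # The first 13 words of the book; the screenshot's exact prompt may differ.
    prompt = "You will rejoice to hear that no disaster has accompanied the commencement of"

    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # assumed model
        prompt=prompt,
        max_tokens=256,
        temperature=0,  # greedy decoding surfaces memorized text most reliably
    )
    completion = response.choices[0].text

    # Compare against the corresponding passage of the original book,
    # e.g. pasted from Project Gutenberg (hypothetical file name).
    original_passage = open("frankenstein_excerpt.txt").read()
    diff = difflib.unified_diff(
        original_passage.splitlines(),
        completion.splitlines(),
        fromfile="original",
        tofile="llm_output",
        lineterm="",
    )
    print("\n".join(diff))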
Frankenstein. You might argue that it is a public domain book, but it demonstrates that these LLMs can and do memorize books.
It also works with Harry Potter, but the API behaves weirdly: after quickly producing correct output at the beginning, it suddenly hangs. You can keep generating the correct words with more API calls, but it only produces a few words at a time before stopping. It clearly knows the right content, but doesn't want to send it all at once.
I think there is some output filtering for "big" authors and content that is too famous, in order to avoid getting sued.
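For what it's worth, the "a few words at a time" continuation can be scripted by feeding the model's own output back in as the prompt. A rough sketch under the same assumptions as above (the opening line is quoted from memory, and the model name is again a guess):

    # Sketch: append each short completion to the prompt and re-call the API,
    # since each call only yields a few words before stopping.
    from openai import OpenAI

    client = OpenAI()

    text = "Mr. and Mrs. Dursley, of number four, Privet Drive,"  # assumed prompt
    for _ in range(20):  # cap the number of continuation calls
        response = client.completions.create(
            model="gpt-3.5-turbo-instruct",  # assumed model, as above
            prompt=text,
            max_tokens=64,
            temperature=0,
        )
        chunk = response.choices[0].text
        if not chunk.strip():
            break  # the API returned nothing more
        text += chunk

    print(text)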
Do you have a credible source that says an LLM like the ones trained by OpenAI can perfectly memorize millions of books? Or be trained in a single day?
You think it's provably impossible for a model with 1.76 trillion floating-point parameters (like GPT-4) to memorize millions of books?
How many bytes do you think a million compressed books take? Consider that these models are trained to predict the next token given the previous ones, which is exactly how prediction-based compressors work.
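A back-of-envelope sketch (all the corpus figures here are rough assumptions, not measurements):

    # Does a million compressed books even exceed the raw parameter budget
    # of a large model? All corpus numbers below are guesses.
    avg_book_bytes = 500_000          # ~500 KB of plain text per book (assumption)
    num_books = 1_000_000
    compression_ratio = 4             # gzip-level; LM-based compressors do better
    corpus_compressed = num_books * avg_book_bytes / compression_ratio

    params = 1.76e12                  # the GPT-4 figure quoted upthread
    bytes_per_param = 2               # fp16/bf16
    model_bytes = params * bytes_per_param

    print(f"compressed corpus: ~{corpus_compressed / 1e9:.0f} GB")  # ~125 GB
    print(f"model weights:     ~{model_bytes / 1e12:.1f} TB")       # ~3.5 TB

Even compressed naively, the corpus comes out an order of magnitude smaller than the raw weight storage, so a simple byte count can't rule memorization out. (It can't prove it either, since the weights also have to encode everything else the model does.)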
Are you really trying to argue that there exist humans that can learn facts at a similar pace to a datacenter running GPT training software on petabytes of scraped data?
My point still stands.
"Your honor, I don't know human minds work, but clearly LLMs work the same way" The legal burden of evidence is on LLM proponents to establish that what they are doing is the same as the human mind and therefore should be treated the same way.
That's a bit of a false dilemma. To address copyright issues, we needn't prove that machine learning models learn exactly like humans or not. The more relevant point is that neither human learning nor machine learning has the intent to store or replicate copyrighted material; both aim to generalize from data to produce new content. It is in this way that they are similar.
It's the argument that is being made. Intent isn't a requisite for copyright infringement. Your re-characterization of the argument is so general that it's useless.
I wonder if this might be because Slavboj is a monist/physicalist and you are a dualist[2]? If that's the case, we'd all argue until we're blue in the face if we don't at least recognize this underlying difference. For the record, since I've studied biology, I'm probably closest to some form of mechanism[3] (due to the rejection of vis vitalis[4] in the early 20th c.).
Of course, you could also just be a very skeptical monist mindful of the Clever Hans effect[5] and working from there.
Let me know which (if any), maybe we can still find middle ground!