Hacker News

What am I looking at? Do you have the prompts that were used?



You are looking at a Large Language Model (specifically GPT-3.5) outputting full paragraphs from a book it was trained on. The prompt is the first 13 words of the book, shown with a white background in the first screenshot. The words with a green background are the LLM's output.

The second screenshot is a diff that compares the LLM output with the original book.
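The comparison step can be sketched with Python's standard difflib. The prompt below is the actual opening of Frankenstein's first letter; the model output is a hypothetical placeholder standing in for what the real demo got back from the GPT-3.5 API:

```python
import difflib

# First 13 words of Frankenstein (Letter 1), used as the prompt.
prompt = ("You will rejoice to hear that no disaster has accompanied "
          "the commencement of")

# Hypothetical stand-in for the model's continuation; the real demo
# obtained this text from the GPT-3.5 API.
model_output = "an enterprise which you have regarded with such evil forebodings."

# The corresponding passage from the original book.
original = "an enterprise which you have regarded with such evil forebodings."

# Word-level diff: lines starting with "  " are unchanged, so an
# all-unchanged diff means the continuation matches the book verbatim.
diff = list(difflib.ndiff(original.split(), model_output.split()))
memorized = all(line.startswith("  ") for line in diff)
print("verbatim match:", memorized)
```

If the model had paraphrased instead of memorizing, the diff would contain `-`/`+` lines marking the divergent words.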


What book? I just want to verify your statements. To be frank, I don't take two cropped screenshots on an image site as evidence.


Frankenstein. You might argue that it is a public domain book, but it demonstrates that these LLMs can and do memorize books.

It also works with Harry Potter, but the API behaves weirdly: after quickly producing correct output at the beginning, it suddenly hangs. You can continue generating the correct words by making more API calls, but it only produces a few words at a time before stopping. It clearly knows the right content, but doesn't want to send it all at once.
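That stop-and-continue pattern can be sketched as a loop. Here generate_continuation is a hypothetical stand-in for the API call, simulated against a short memorized text; the real behaviour came from repeated GPT-3.5 requests:

```python
# Short stand-in for a memorized passage (public-domain, so safe to embed).
BOOK = "You will rejoice to hear that no disaster has accompanied the commencement"

def generate_continuation(words_so_far, chunk=3):
    # Hypothetical stand-in for an LLM API call that stops early:
    # it returns only the next few words of the memorized text.
    book_words = BOOK.split()
    n = len(words_so_far)
    return book_words[n:n + chunk]

text = "You will rejoice".split()          # the initial prompt
while True:
    chunk = generate_continuation(text)    # one "continue" API call
    if not chunk:                          # empty completion: truly done
        break
    text.extend(chunk)                     # append the few words we got

print(" ".join(text))
```

With the real API you would instead inspect the response's finish_reason to tell an early stop from a genuine end of output.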

I think there is some output filtering for "big" authors and works that are too famous, applied to avoid getting sued.

I wrote more details about the weird Harry Potter behaviour here: https://news.ycombinator.com/item?id=37614697



