
>For example, we loaded the entire text of The Great Gatsby into Claude-Instant (72K tokens) and modified one line to say Mr. Carraway was “a software engineer that works on machine learning tooling at Anthropic.” When we asked the model to spot what was different, it responded with the correct answer in 22 seconds.

This sort of needle-in-a-haystack retrieval is definitely impressive, and it makes a lot more sense to achieve this in-context rather than trying to use a vector database if you can afford it.

I'm curious, though, whether there are diminishing returns in terms of how much analysis the model can do over those 100k tokens in a single forward pass. A human reading modified-Gatsby might eventually spot the altered line, but they'd also be able to answer questions about the overarching plot and themes of the novel, including ones that cannot be deduced from just a small number of salient snippets.

I'd be curious to see whether huge-context models are also able to do this, or if they start to have trouble when the bottleneck becomes reasoning capacity rather than input length. I feel like it's hard to predict one way or the other without trying it, just because LLMs have already demonstrated a lot of surprising powers.



I think people in the comments are missing the point.

I might be wrong, but the point isn't comparing a modified The Great Gatsby to the original one. Of course that's not impressive and it's an easy thing to do.

The point of the exercise is supposed to be[1] that the model has the entire novel as context / prompt and so can identify, within that context, whether a paragraph is out of place. That is impressive, and I wouldn't know how to find that programmatically (would you keep a list of "modern" words to check? But maybe the out-of-place thing is hidden in the meaning and there's no out-of-place modern word).

[1] I say supposed to be because The Great Gatsby is in the training data, so maybe there is a sense in which the model "contains" the original text and in some way is doing "just" a comparison. A better test would be to try with a novel or document that the model hasn't seen... or at least something not as famous as The Great Gatsby.


Further, the problem with this example is it relies on a comparison against public data.

Most of these AIs start failing pretty hard when you ask them to do the same task on something completely novel to them (like a company document). Sometimes they'll get it right. Other times they'll spit out gibberish that's clearly some generic answer.


Asking to spot the difference between a given document and an unseen document is impossible.


A couple years ago, I read Superfudge by Judy Blume, a book originally published in 1980. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's."

I didn't need to have seen this book before to know this wasn't in the original 1980s text.

Similarly, if I were reading the Great Gatsby for the first time, and it identified a character as a software engineer, I would notice.


That is not "spotting the difference". That is spotting the logical inconsistency in your paragraph.

ChatGPT easily notices that anachronism.


Indeed - I took lkbm's comment and removed the "originally published in 1980" bit (too easy), and asked ChatGPT "What's unusual about this paragraph?"

> the unusual thing about this paragraph could be that it describes a letter to Santa that includes items that were not available at the time the book was originally published in 1980, such as a laptop computer, an MP3 player, and possibly even a remote-controlled model airplane with advanced features. This suggests that the paragraph is likely from a later edition of the book that has been updated to reflect more modern technology.


Presumably, someone wrote about the new version online before to complain about the changes, so it’s recall, not insight?


> "A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a MacBook Pro, an iPhone, and an Apple Watch""

> Describe what is unusual about this paragraph.

>> The unusual aspect of this paragraph is that the book "Beezus and Ramona" was actually written by Beverly Cleary and published in 1955, well before the invention of the MacBook Pro, iPhone, and Apple Watch.


> > A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a shetland pony, a buggy whip, and an abacus". What is unusual about this?

> The unusual thing about the list is that it includes an abacus, which is a counting device that was commonly used in ancient China and is still used in some parts of the world today, but is not a typical gift request for a young child to make to Santa Claus. Shetland ponies were popular children's pets at the time, and buggy whips were used to control horse-drawn carriages, but the inclusion of an abacus suggests that the protagonist, Beezus, has an interest in learning and math.

> > What's the actual passage from the real novel say?

> Certainly! Here's the actual passage from the book "Beezus and Ramona" by Beverly Cleary, where Beezus writes a letter to Santa:

> "Dear Santa Claus," Beezus printed carefully. "I have been a good girl all year except for my little sister Ramona who has been hard to put up with sometimes, but I guess I have too. Please bring me a Shetland pony, a new dress, and a doll with real hair to comb. And please bring Ramona a doll with real hair to comb too. I think she has tried hard to be good, and I will help her comb it. Please also bring me a set of paper dolls and a paint set, and Ramona some coloring books and crayons. And Santa, if you think there is room in your sleigh, please bring me an abacus. Love, Beatrice Quimby (Beezus)"


You didn't ask it to list ALL of the unusual things. Here's a better prompt:

>> "A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a shetland pony, a buggy whip, and an abacus".

>> List all of the unusual or inconsistent things about the previous sentence."

Response from GPT-4:

Here are some potentially unusual or inconsistent elements about the sentence:

1. Anachronistic Items: The items in the wish list could be seen as unusual or inconsistent with typical gifts children might ask for today.

A Shetland pony, a buggy whip, and an abacus are not common gifts. They're somewhat anachronistic, reflecting a time period different from the present day.

A Shetland pony is a breed of pony that originates from the Shetland Isles in Scotland. While some children might wish for a pony, it's a big responsibility and not a practical gift for most families.

A buggy whip is a horse whip used in horse and buggy days. This is an unusual request for a child in the modern era, as horse-drawn buggies are not common means of transportation.

An abacus is an ancient counting tool, and while some children might be interested in such a device, it's not a typical request, especially in an era of digital calculators and computers.

2. Inconsistency with Character: If you're referring to the "Beezus and Ramona" series by Beverly Cleary, neither Beezus nor Ramona would likely ask for these items. Beezus (Beatrice Quimby) is the responsible older sister who might appreciate practical things, but an abacus or a buggy whip would be out of character. Ramona Quimby, the imaginative and energetic younger sister, might ask for a Shetland pony, but a buggy whip and an abacus seem inconsistent with her character.

3. Inconsistency with the Book: As far as my knowledge cutoff in September 2021, there is no record of such a letter being written to Santa in the book "Beezus and Ramona". It could be seen as inconsistent with the actual content of the book.

Remember, these are interpretations and the perceived unusualness or inconsistency can vary based on individual perspectives. For instance, if this is a fictional context where children commonly use abacuses, ride ponies, and learn to use buggy whips, then these items might not be unusual at all.


I'm guessing this is GPT-3? Because GPT-4 says to look it up yourself. When I did, the whole scene doesn't seem to exist in the book. In other words, this seems to be a hallucination.

When I repeat the experiment with Lord of the Rings and a scene that does exist, it spits it out word for word correct.


Yeah, the Ramona scene doesn't exist (by coincidence, I've also read four of the eight Ramona books in the past couple days). The letter I mentioned was in Superfudge. GPT-4 spots the errors, but refuses to give me the original text.

If I give GPT-3.5 the modernized letter and ask for the original, it gives me a believable correction (but I can't verify it). If I don't prompt with the modernized letter's text, it returns a fully fabricated letter.

With GPT-4:

Prompt:

> "A couple years ago, I read Superfudge, by Judy Blume. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's."

> List all of the unusual or inconsistent things about the previous sentence.

GPT-4:

> Here are some potential inconsistencies or unusual things about the sentence:

> Outdated technology: The inclusion of items such as an MP3 player and CDs seems somewhat outdated. By 2023, these technologies are largely obsolete, replaced by streaming music services and digital downloads.

> Variation in cost/value of items: The items listed in the letter have a wide range of costs and values. A clock-radio or a remote-controlled model airplane are typically much less expensive than a laptop computer, which might seem unusual for a child's wish list to Santa.

> Superfudge setting: The book "Superfudge" by Judy Blume was first published in 1980. The technology items listed (a laptop computer, an MP3 player, and CDs) did not exist or were not commonly available at the time of the book's setting. This presents a significant anachronism if the request is supposed to be coming from the book's original time period.

> The phrase "one or more": Usually, children write to Santa asking for specific items they desire most, rather than providing a list and suggesting Santa could bring "one or more" of the items. This phrase might be seen as unusual for a child's letter to Santa, which typically would not include such an open-ended option.

> The specificity of "six CDs": The child doesn't specify which CDs they want, which is unusual because the type of music or artist would be an important detail for such a request.

> Remember, this analysis is based on the context you provided. If there's a different context or specific aspect you want me to consider, please let me know!

(I'm amused that it starts with "that technology is too old" and then switched to "also that technology is too new". Also, the first thing Superfudge asks upon hearing his brother's letter is "How will Santa know which six CD's?" so good job, GPT-4. Smart as a four-year-old.)

Me:

> Good catch about the 1980s publication date! The book must have been edited for re-release as an ebook. What was the original text of the letter in the 1980s edition?

GPT-4:

> I'm sorry for the confusion, but as an AI developed by OpenAI, I don't have the capability to access specific books, their revisions, or editions. As of my last update in September 2021, I don't have the specific text of the letter from the 1980s edition of "Superfudge" by Judy Blume.

> However, given the time frame of the original publication, the technology items such as a laptop computer, an MP3 player, and CDs certainly would not have been part of the original list as these items weren't widely available or didn't exist in the 1980s.

> For the accurate original text, I recommend finding a copy of the original edition of the book.

---

With GPT-3.5:

Me:

> A couple years ago, I read Superfudge, by Judy Blume. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's."

> However, the book was originally published in 1980, so this must be an edited version for a later re-publication. Can you tell me the original 1980s version of the letter?

GPT-3.5:

> In the original 1980 version of "Superfudge" by Judy Blume, the protagonist's letter to Santa reads as follows:

> "Dear Santa,
> Please bring me one or more of the following items: a clock-radio, a remote-controlled car, a Walkman, and six cassette tapes.
> Love,
> [Protagonist's Name]"

> The updated version you mentioned seems to have incorporated more modern technology, such as a laptop computer and an MP3 player, to reflect the changes in technology over time.

I then asked a new GPT-3.5 chat "What is the text of the letter to Santa in the original 1980 version of "Superfudge" by Judy Blume?" and it gave me something definitely fully fabricated. (Initially it gave me a fabricated letter by Fudge, but after I specified that I wanted Peter's letter, it was still a fabrication.)


> "by coincidence, I've also read four of the eight Ramona books in the past couple days"

My daughter keeps relistening to the complete Ramona audio book collection, so I am extremely familiar with all of the Ramona series. :-)


As mentioned upthread, "ChatGPT" is insufficient to assess exactly which LLM-based tool you're reporting about. The versions vary greatly in capabilities, with some even able to browse the web for more info.


In the insight case, the model has to classify the terms in a short paragraph into broad generic classes, then report which terms have different classes. In the recall case the model has to rank all utterances in its training data based on how relevant they are to the question, then summarise the most relevant one (assuming there is one). Insight seems easier than recall here.


I think there are plenty of humans who wouldn't notice, though.

And probably plenty of AI implementations that would notice.


Are we now aspiring to ABAI (Artificial Below-Average Intelligence)?


I regret to inform you that "average intelligence" is a far lower bar than you might think.


As the joke goes: look around you and see how dumb the average person is. And now imagine: half the people in your city/country/Earth are dumber than this person.


Did you know that, on average, approximately half of all statistics jokes about a normal distribution fail to comprehend mean, median, and mode? True fact.


The AI would not only notice, it would also notice 300 other things that are far more subtle haha


I was so confused by this comment before I figured out you were actually straightforwardly telling the truth.

Details: https://pinehollow.livejournal.com/26806.html


In the Gatsby example, I'd expect the model to be able to answer "which sentence from the story felt out of place?" without knowledge of the original.


The point is spotting the inconsistency within the document, not whether an original has a different form


Comparing strings is a Python-script problem for a newbie.

Interesting part is if it can deduce something that's out of place within a huge document.


They’re saying you can’t ask it to compare against a doc it doesn’t have access to.


I'm saying you can write a Python script that compares two docs; that's a really useless use case.

The relevant test is spotting something out of place in a really long text - if it can do that on non-training material then that's actually useful for reviewing things.


I don't see any reason why that wouldn't be possible.


> Most of these AI

This is as meaningful as saying most hominids can't count. You can't usefully generalize about AI models with the rate of change that exists right now. Any statement or comparison about AI has to name specific models and versions, otherwise it's increasingly irrelevant noise.


Every time someone has said "LLMs can't do X", I tried X in GPT 4 and it could do it. They usually try free LLMs like Bard or GPT 3 and assume that the results generalise.


LLMs can't massively decrease the net amount of entropy of the universe


Are you trying to get us killed?


Insufficient data for a meaningful answer.


I'd imagine working with an entire company document would require a lot more hand holding and investment in prompt engineering. You can definitely get better results if you add much more context of what you're expecting and how the LLM should do it. Treating these LLMs as just simple Q&A machines is usually not enough unless you're doing simple stuff.


I've been curious about this for a while, I have a hobby use-case of wanting to input in-progress novellas and then asking it questions about plot holes, open plot threads, and if new chapter "x" presents any serious plot contradiction problems. I haven't tried exploring that with a vectordb-embeddings approach yet.


This is an exact example of something a vector db would be terrible at.

Vector dbs work by fetching segments that are topically similar to the question, so a question like "Where did <Character> go after <thing>?" will retrieve segments with locations, the character, and maybe mentions of <thing> as a recent event.

Your question has no similarity with the required segments in any way; and it's not the segments themselves that are wrong, it's the way they relate to the rest of the story.
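
To make that concrete, here's a rough sketch of how similarity retrieval ranks segments against a query (the sentence-transformers model and the example texts are just illustrative assumptions):

    # Illustration: similarity retrieval favors topical overlap, not structural
    # questions like "does the new chapter contradict the earlier plot?"
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

    segments = [
        "Mara rode north from the harbor at dawn, leaving the burned ship behind.",
        "The council voted to exile Mara for treason in the spring.",
        "Three chapters of dialogue about the harvest festival.",
    ]
    query = "Does the new chapter introduce any plot contradictions?"

    seg_vecs = model.encode(segments, normalize_embeddings=True)
    q_vec = model.encode([query], normalize_embeddings=True)[0]

    scores = seg_vecs @ q_vec  # cosine similarity (vectors are normalized)
    for score, seg in sorted(zip(scores, segments), reverse=True):
        print(f"{score:.2f}  {seg}")
    # The query shares almost no vocabulary or topic with any segment, so the
    # ranking is essentially arbitrary: the retriever cannot surface the two
    # segments whose *relationship* is the problem.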


Good points - LLMs are ok at finding things that exist, but they have zero ability to abstract and find what is missing (actually, probably negative; they'd likely hallucinate and fill in the gaps).

Which makes me wonder if the opposite, more laborious approach might work - ask it to identify all characters and plot themes, then request summaries of each. You'd have to review the summaries for holes. Lotsa work, but still maybe quicker than re-reading everything yourself? Something like the sketch below.
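
A sketch of that two-pass approach, where ask_llm() is a hypothetical stand-in for whatever chat API you use:

    # Two passes: first extract entities, then fan out one summary per entity;
    # a human reviews the (much shorter) summaries for holes.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("call your LLM of choice here")

    def entity_summaries(novel_text: str) -> dict[str, str]:
        # Pass 1: a machine-readable list of characters and plot threads.
        listing = ask_llm(
            "List every character and plot thread in the following manuscript, "
            f"one per line:\n\n{novel_text}"
        )
        entities = [line.strip() for line in listing.splitlines() if line.strip()]

        # Pass 2: one focused summary per entity.
        return {
            entity: ask_llm(
                f"Summarise everything involving '{entity}' in this manuscript, "
                f"in order of appearance:\n\n{novel_text}"
            )
            for entity in entities
        }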


Firstly, I don't at all agree that they have zero ability to abstract. That doesn't fit my experience at all. A lot of the tasks I use ChatGPT for are exactly this: analysing gaps in specifications etc., and having it tell me what is missing, suggest additions, or ask for clarifications. It does that just fine.

But I've started experimenting with the second part, of sorts, not to find plot holes but to have it create character sheets for my series of novels for my own reference.

Basically, I have it maintain a sheet, feed it chunks of one or more chapters, and ask it to output a new sheet augmented with the new details.

With a 100K context window I might just test doing it over whole novels, or much larger chunks of one.
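
A minimal sketch of that loop, assuming the legacy openai Python SDK (the model name and prompt wording are illustrative, not exactly what I run):

    import openai

    def update_sheet(sheet: str, chunk: str) -> str:
        prompt = (
            "Here is the current character sheet for my novel:\n\n"
            f"{sheet}\n\n"
            "Here is the next chunk of the manuscript:\n\n"
            f"{chunk}\n\n"
            "Output the full character sheet again, augmented with any new "
            "characters, traits, or relationships from this chunk. Do not drop "
            "existing entries."
        )
        resp = openai.ChatCompletion.create(
            model="gpt-4",  # illustrative; any chat model works
            messages=[{"role": "user", "content": prompt}],
        )
        return resp["choices"][0]["message"]["content"]

    chapters = ["...chapter 1 text...", "...chapter 2 text..."]  # placeholder
    sheet = "(empty)"
    for chapter in chapters:
        sheet = update_sheet(sheet, chapter)  # rolling state across chunks
    print(sheet)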


> LLMs are ok at finding things that exist, but they have zero ability to abstract and find what is missing (actually, probably negative; they'd likely hallucinate and fill in the gaps).

I feel this is mostly a prompting issue. Specifically GPT-4 shows surprising ability to abstract to some degree and work with high-level concepts, but it seems that, quite often, you need to guide it towards the right "mode" of thinking.

It's like dealing with a 4 year old kid. They may be perfectly able to do something you ask them, but will keep doing something else, until you give them specific hints, several times, in different ways.


There are other ways of using the model besides iterative forward inference (completions). You could run the model over your novel (perhaps including a preface) and look at the posterior distribution as it scans. This may not be so meaningful at the level of the token distribution, but there may be interesting ways of “spell checking” at a semantic level. Think a thesaurus, but operating at the level of whole paragraphs.
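
A minimal sketch of the idea, using GPT-2 via Hugging Face transformers purely as a small stand-in model (the surprisal threshold and the example sentence are arbitrary):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    text = "In my younger years Mr. Carraway worked on machine learning tooling."
    enc = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**enc).logits  # (1, seq_len, vocab)

    # Surprisal of each token given its prefix: -log p(token_t | tokens_<t)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = enc["input_ids"][:, 1:]
    surprisal = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]

    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])[1:]
    for tok, s in zip(tokens, surprisal.tolist()):
        flag = "  <-- unusually surprising" if s > 10 else ""
        print(f"{tok:>15s}  {s:6.2f}{flag}")
    # Aggregating surprisal per sentence or paragraph gives a crude
    # "semantic diff" signal without needing a reference copy of the text.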


that's not at all what I said


Do the OpenAI APIs support converting prompts to vectors, or are people running their own models locally to do this? Can you recommend any good resources to read up on vector DB approaches to working around context length limits?


Indeed, this Haystack tutorial is a good example: https://haystack.deepset.ai/tutorials/22_pipeline_with_promp... It combines a retrieval step with a prompt layer that inserts the relevant context into the prompt. You can, however, replace the retrieval step with something that uses a proper embedding model, and OpenAI also provides those if you want. I tend to use lighter (cheaper) OSS models for this step, though. PS: There's some functionality in the PromptNode to make sure you don't exceed the prompt limit.


That's great - thanks!


Yes, you can use a local embeddings model like gtr-t5-xl alongside retriever augmentation. This can point you in the right direction: https://haystack.deepset.ai/tutorials/22_pipeline_with_promp...


Thanks!


OpenAI has an embeddings API that people use for that: https://platform.openai.com/docs/guides/embeddings, though whether it's the best model to do that is contested.

Contriever is an example of a strong model for doing that yourself; see their paper too to learn about the domain: https://github.com/facebookresearch/contriever
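
To answer the first question directly: yes. A minimal sketch with the legacy openai Python SDK (model name per the embeddings docs above; the chunks and query are made up):

    import numpy as np
    import openai

    def embed(texts):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return np.array([item["embedding"] for item in resp["data"]])

    doc_chunks = ["first chunk of the document...", "second chunk of the document..."]
    doc_vecs = embed(doc_chunks)
    query_vec = embed(["What changed about Mr. Carraway?"])[0]

    # Rank chunks by cosine similarity and keep the best ones for the prompt.
    scores = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    best_chunk = doc_chunks[int(np.argmax(scores))]
    print(best_chunk)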


Thanks!


So what's the right way to do a wider-ranging analysis? Chunk into segments, ask about each one, then do a second pass to review all answers together?


I'm also not entirely convinced by "huge" context models just yet, especially as it relates to fuzzy knowledge such as overarching themes or writing style.

In particular, there are 0 mentions of the phrase "machine learning" in The Great Gatsby, so adding one sentence that introduces the phrase should be easy for self-attention to pick out.


I'd be more impressed if it could rewrite Mr. Carraway as an ML engineer in the entire novel. However it's not intrinsically clear that it cannot do this...

It'll be tough to find good benchmarks on long context windows. A human cannot label using 100k tokens of context.


My thoughts exactly - rewrite the novel with Mr. Carraway as an ML engineer while maintaining themes/motifs (possibly adding new ones too). I'm guessing what's impressive is that these are the first steps towards something like this? Or is it already possible? Someone please correct me here.


Or rewrite The Count of Monte Cristo as a science fiction novel and get The Stars My Destination. Or rewrite Heinlein's Double Star into present-day and get the movie Dave.


Wonder if they’d set it in SF then…


This sounds like all the other skepticism about what AI can do. And then it can spot 200x more than any human and correlate it into common themes, and you’ll say what?


Doing more than a human can isn't impressive. Most computer programs, for any purpose, can do more of something, or do something faster, than a human can.

A better comparison would be if it can pick out any differences that can't be picked out by more traditional and simple algorithms.


It does, using this method.

My immediate thought as well was '... Yeah, well vimdiff can do that in milliseconds rather than 22 seconds' - but that's obviously missing the point entirely. Of course, we need to tell people to use the right tool for the job, and that will be more and more important to remind people of now.

However, it's pretty clear that the reason they used this task is to give something simple to understand what was done in a very simple example. Of course it can do more semantic understanding related tasks, because that's what the model does.

So, without looking at the details we all know that it can summarize full books, give thematic differences between two books, write what a book may be like if a character switch from one book to another is done, etc.

If it doesn't do these things (not just badly, but can't at all) I would be surprised. If it does them, but badly, I wouldn't be surprised, but it also wouldn't be mind bending to see it do better than any human at the task as well.


>Of course it can do more semantic understanding related tasks, because that's what the model does.

The problem is that marketing has eroded any such faith. Too often, simple examples are given and the consumer is left to extrapolate the intended functionality, because then there's no false advertising involved. This happens over and over again in products: the examples are carefully selected and don't actually give a good representation of the implied functionality.

As such, I personally can't make the leap to "of course it can do more semantic-understanding-related tasks", like a diff that's not as simple - one where perhaps a character's overall personality over the course of the book is shifted, not just a single line that defines their profession.

This isn't to say the demonstrative example isn't neat in its own right given what's going on here - it is. I'm just saying I can't make such leaps from examples given by any product. When I work with vendors of traditional software, this happens all the time: people dance around a lack of functionality you obviously want or need in order to make a sale. It's only when you force them to be explicit about the specific cases, especially in writing, that I have any faith at all.


LLMs are specifically designed and trained for semantic-understanding tasks. That's the entire point. They are trained to solve natural language understanding tasks, and they build a semantic model in order to solve those tasks.


I did say it may do certain tasks with low performance. What you're saying is not really understandable (or simply wrong). The fact that it does understand how to do a 'simple' task such as finding where text is different is actually somewhat impressive given the typical training data for these models. But I suppose you need to actually understand the field of NLP to understand why that is.

If you're expecting perfection or magic then you will be disappointed.


I think it's fair to say most people don't know what a diff tool is, but do know how to ask questions. That is the democratizing factor that AI is introducing I feel, giving high powered computing to people without the need for specialized knowledge.


The real goal would be for the model to determine what tool is needed to achieve the task, and then use that tool to achieve it. Using the model to write code provides a path to explainability and correctness guarantees (by reading and executing the code, not the model internals). So in this case, identifying that a diff program is required may have taken a couple of seconds, and the diff would take milliseconds - still yielding a faster and more correct output (as the model alone may produce an approximate diff or hallucinate).
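
A hand-rolled sketch of that routing idea, where choose_tool() is a hypothetical stand-in for the LLM call and the actual comparison is delegated to Python's difflib:

    import difflib

    def choose_tool(task_description: str) -> str:
        # In a real system this would be an LLM call, e.g. "Which tool fits:
        # 'diff', 'summarize', or 'answer directly'?" Hypothetical here.
        return "diff"

    def run_task(original: str, modified: str) -> str:
        tool = choose_tool("Find what changed between two versions of a novel.")
        if tool == "diff":
            return "\n".join(difflib.unified_diff(
                original.splitlines(), modified.splitlines(), lineterm=""
            ))
        raise NotImplementedError(tool)

    original = "Mr. Carraway was a bond salesman.\nHe moved to West Egg."
    modified = ("Mr. Carraway was a software engineer that works on machine "
                "learning tooling at Anthropic.\nHe moved to West Egg.")
    print(run_task(original, modified))
    # The diff itself is exact and takes milliseconds; the model's only job
    # was routing the request to the right tool.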


Of course it can very soon, since those were also written by humans. Like AlphaZero vs Rybka.


>> And then...you’ll say what?

USER: It's a stochastic parrot.

GPT: I know you are, so what am I?


What techniques do they actually use to achieve 100k? I assumed they load the document into a vector database and then do some kind of views into that.



I would assume training the LLM on sequences of 100k tokens would be the right way.


Makes me wonder whether we could get really huge contexts much more efficiently by feeding a higher layer back into the tail end of the model. That way it has a very clear picture of the recent text but only a compressed picture of the earlier parts of the document.

(I think I’ve got to read up on how transformers actually work.)


Afaik you're describing something akin to a recurrent neural network, and the problem with that is that it doesn't parallelize well to modern hardware. And vanishing gradients.


I had the same thought as the comment you're responding to.

Recurrent neural networks are bad when the recurrence is 100 steps long or more. You need long chains because, going a token at a time, that's what it takes to process even one paragraph.

But if you use an RNN around a Transformer-based LLM, then you're adding +4K or +8K tokens per recurrence, not +1.

E.g.: GPT-4 32K would need just 4 RNN steps to reach 128K tokens!
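
A toy sketch of that chunk-level recurrence, where llm() is a hypothetical stand-in for any large-context model and the window sizes are just for illustration:

    CHUNK_TOKENS = 32_000   # per-window budget (e.g. GPT-4 32K)
    STATE_TOKENS = 2_000    # portion of each window reserved for carried state

    def llm(prompt: str) -> str:
        raise NotImplementedError("call your large-context model of choice here")

    def answer_over_long_doc(chunks: list[str], question: str) -> str:
        state = ""  # compressed picture of everything seen so far
        for chunk in chunks:  # each chunk fits in CHUNK_TOKENS - STATE_TOKENS
            state = llm(
                f"Summary of the document so far (under {STATE_TOKENS} tokens):\n"
                f"{state}\n\nNext part of the document:\n{chunk}\n\n"
                "Update the summary so it can still support questions later."
            )
        # Four recurrence steps over 32K-token chunks cover ~128K tokens of input.
        return llm(f"Document summary:\n{state}\n\nQuestion: {question}")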



