
>For example, we loaded the entire text of The Great Gatsby into Claude-Instant (72K tokens) and modified one line to say Mr. Carraway was “a software engineer that works on machine learning tooling at Anthropic.” When we asked the model to spot what was different, it responded with the correct answer in 22 seconds.

This sort of needle-in-a-haystack retrieval is definitely impressive, and it makes a lot more sense to achieve this in-context rather than trying to use a vector database if you can afford it.

I'm curious, though, whether there are diminishing returns in terms of how much analysis the model can do over those 100k tokens in a single forward pass. A human reading modified-Gatsby might eventually spot the altered line, but they'd also be able to answer questions about the overarching plot and themes of the novel, including ones that cannot be deduced from just a small number of salient snippets.

I'd be curious to see whether huge-context models are also able to do this, or if they start to have trouble when the bottleneck becomes reasoning capacity rather than input length. I feel like it's hard to predict one way or the other without trying it, just because LLMs have already demonstrated a lot of surprising powers.



I think people in the comments are missing the point.

I might be wrong, but the point isn't comparing a modified The Great Gatsby to the original one. Of course that's not impressive and it's an easy thing to do.

The point of the exercise is supposed to be[1] that the model has the entire novel as context / prompt and so can identify, within that context, whether a paragraph is out of place. That is impressive, and I wouldn't know how to find that programmatically (would you keep a list of "modern" words to check? But maybe the out-of-place thing is hidden in the meaning and there's no out-of-place modern word).

[1] I say supposed to be because The Great Gatsby is in the training data, so maybe there is a sense in which the model "contains" the original text and in some way is doing "just" a comparison. A better test would be to try with a novel or document that the model hasn't seen... or at least something not as famous as The Great Gatsby.


Further, the problem with this example is it relies on a comparison against public data.

Most of these AIs start failing pretty hard when you ask them to do the same task on something completely novel to them (like a company document). Sometimes they'll get it right. Other times they'll spit out gibberish that's clearly some generic answer.


Asking to spot the difference between a given document and an unseen document is impossible.


A couple years ago, I read Superfudge by Judy Blume, a book originally published in 1980. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's."

I didn't need to have seen this book before to know this wasn't in the original 1980s text.

Similarly, if I were reading the Great Gatsby for the first time, and it identified a character as a software engineer, I would notice.


That is not "spotting the difference". That is spotting the logical inconsistency in your paragraph.

ChatGPT easily notices that anachronism.


Indeed - I took lkbm's comment and removed the "originally published in 1980" bit (too easy), and asked ChatGPT "What's unusual about this paragraph?"

> the unusual thing about this paragraph could be that it describes a letter to Santa that includes items that were not available at the time the book was originally published in 1980, such as a laptop computer, an MP3 player, and possibly even a remote-controlled model airplane with advanced features. This suggests that the paragraph is likely from a later edition of the book that has been updated to reflect more modern technology.


Presumably, someone wrote about the new version online before to complain about the changes, so it’s recall, not insight?


> "A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a MacBook Pro, an iPhone, and an Apple Watch""

> Describe what is unusual about this paragraph.

>> The unusual aspect of this paragraph is that the book "Beezus and Ramona" was actually written by Beverly Cleary and published in 1955, well before the invention of the MacBook Pro, iPhone, and Apple Watch.


> > A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a shetland pony, a buggy whip, and an abacus". What is unusual about this?

> The unusual thing about the list is that it includes an abacus, which is a counting device that was commonly used in ancient China and is still used in some parts of the world today, but is not a typical gift request for a young child to make to Santa Claus. Shetland ponies were popular children's pets at the time, and buggy whips were used to control horse-drawn carriages, but the inclusion of an abacus suggests that the protagonist, Beezus, has an interest in learning and math.

> > What's the actual passage from the real novel say?

> Certainly! Here's the actual passage from the book "Beezus and Ramona" by Beverly Cleary, where Beezus writes a letter to Santa:

> "Dear Santa Claus," Beezus printed carefully. "I have been a good girl all year except for my little sister Ramona who has been hard to put up with sometimes, but I guess I have too. Please bring me a Shetland pony, a new dress, and a doll with real hair to comb. And please bring Ramona a doll with real hair to comb too. I think she has tried hard to be good, and I will help her comb it. Please also bring me a set of paper dolls and a paint set, and Ramona some coloring books and crayons. And Santa, if you think there is room in your sleigh, please bring me an abacus. Love, Beatrice Quimby (Beezus)"


You didn't ask it to list ALL of the unusual things. Here's a better prompt:

>> "A couple years ago, I read Beezus and Ramona, by Beverly Cleary. In it, the protagonist writes a letter to Santa: "Dear Santa, I would like a shetland pony, a buggy whip, and an abacus".

>> List all of the unusual or inconsistent things about the previous sentence."

Response from GPT-4:

Here are some potentially unusual or inconsistent elements about the sentence:

1. Anachronistic Items: The items in the wish list could be seen as unusual or inconsistent with typical gifts children might ask for today.

A Shetland pony, a buggy whip, and an abacus are not common gifts. They're somewhat anachronistic, reflecting a time period different from the present day.

A Shetland pony is a breed of pony that originates from the Shetland Isles in Scotland. While some children might wish for a pony, it's a big responsibility and not a practical gift for most families.

A buggy whip is a horse whip used in horse and buggy days. This is an unusual request for a child in the modern era, as horse-drawn buggies are not common means of transportation.

An abacus is an ancient counting tool, and while some children might be interested in such a device, it's not a typical request, especially in an era of digital calculators and computers.

2. Inconsistency with Character: If you're referring to the "Beezus and Ramona" series by Beverly Cleary, neither Beezus nor Ramona would likely ask for these items. Beezus (Beatrice Quimby) is the responsible older sister who might appreciate practical things, but an abacus or a buggy whip would be out of character. Ramona Quimby, the imaginative and energetic younger sister, might ask for a Shetland pony, but a buggy whip and an abacus seem inconsistent with her character.

3. Inconsistency with the Book: As far as my knowledge cutoff in September 2021, there is no record of such a letter being written to Santa in the book "Beezus and Ramona". It could be seen as inconsistent with the actual content of the book.

Remember, these are interpretations and the perceived unusualness or inconsistency can vary based on individual perspectives. For instance, if this is a fictional context where children commonly use abacuses, ride ponies, and learn to use buggy whips, then these items might not be unusual at all.


I'm guessing this is GPT-3? Because GPT-4 says to look it up yourself. When I did, the whole scene doesn't seem to exist in the book. In other words, this seems to be a hallucination.

When I repeat the experiment with Lord of the Rings and a scene that does exist, it spits it out word for word correct.


Yeah, the Ramona scene doesn't exist (by coincidence, I've also read four of the eight Ramona books in the past couple days). The letter I mentioned was in Superfudge. GPT-4 spots the errors, but refuses to give me the original text.

If I give GPT-3.5 the modernized letter and ask for the original, it gives me a believable correction (but I can't verify it). If I don't prompt with the modernized letter's text, it returns a fully fabricated letter.

With GPT-4:

Prompt:

> "A couple years ago, I read Superfudge, by Judy Blume. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's."

> List all of the unusual or inconsistent things about the previous sentence.

GPT-4:

> Here are some potential inconsistencies or unusual things about the sentence:

> Outdated technology: The inclusion of items such as an MP3 player and CDs seems somewhat outdated. By 2023, these technologies are largely obsolete, replaced by streaming music services and digital downloads.

> Variation in cost/value of items: The items listed in the letter have a wide range of costs and values. A clock-radio or a remote-controlled model airplane are typically much less expensive than a laptop computer, which might seem unusual for a child's wish list to Santa.

> Superfudge setting: The book "Superfudge" by Judy Blume was first published in 1980. The technology items listed (a laptop computer, an MP3 player, and CDs) did not exist or were not commonly available at the time of the book's setting. This presents a significant anachronism if the request is supposed to be coming from the book's original time period.

> The phrase "one or more": Usually, children write to Santa asking for specific items they desire most, rather than providing a list and suggesting Santa could bring "one or more" of the items. This phrase might be seen as unusual for a child's letter to Santa, which typically would not include such an open-ended option.

> The specificity of "six CDs": The child doesn't specify which CDs they want, which is unusual because the type of music or artist would be an important detail for such a request.

> Remember, this analysis is based on the context you provided. If there's a different context or specific aspect you want me to consider, please let me know!

(I'm amused that it starts with "that technology is too old" and then switched to "also that technology is too new". Also, the first thing Superfudge asks upon hearing his brother's letter is "How will Santa know which six CD's?" so good job, GPT-4. Smart as a four-year-old.)

Me:

> Good catch about the 1980s publication date! The book must have been edited for re-release as an ebook. What was the original text of the letter in the 1980s edition?

GPT-4:

> I'm sorry for the confusion, but as an AI developed by OpenAI, I don't have the capability to access specific books, their revisions, or editions. As of my last update in September 2021, I don't have the specific text of the letter from the 1980s edition of "Superfudge" by Judy Blume.

> However, given the time frame of the original publication, the technology items such as a laptop computer, an MP3 player, and CDs certainly would not have been part of the original list as these items weren't widely available or didn't exist in the 1980s.

> For the accurate original text, I recommend finding a copy of the original edition of the book.

---

With GPT-3.5:

Me:

> A couple years ago, I read Superfudge, by Judy Blume. In it, the protagonist writes a letter to Santa: "Please bring me one or more of the following items. A clock-radio, a remote-controlled model airplane, a laptop computer, an MP3 player and six CD's."

> However, the book was originally published in 1980, so this must be an edited version for a later re-publication. Can you tell me the original 1980s version of the letter?

GPT-3.5:

> In the original 1980 version of "Superfudge" by Judy Blume, the protagonist's letter to Santa reads as follows:

> "Dear Santa,
> Please bring me one or more of the following items: a clock-radio, a remote-controlled car, a Walkman, and six cassette tapes.
> Love,
> [Protagonist's Name]"

> The updated version you mentioned seems to have incorporated more modern technology, such as a laptop computer and an MP3 player, to reflect the changes in technology over time.

I then asked a new GPT-3.5 chat "What is the text of the letter to Santa in the original 1980 version of "Superfudge" by Judy Blume?" and it gave me something definitely fully fabricated. (Initially it gave me a fabricated letter by Fudge, but after I specified that I wanted Peter's letter, it was still a fabrication.)


> "by coincidence, I've also read four of the eight Ramona books in the past couple days"

My daughter keeps relistening to the complete Ramona audio book collection, so I am extremely familiar with all of the Ramona series. :-)


As mentioned upthread, "ChatGPT" is insufficient to assess exactly which LLM-based tool you're reporting about. The versions vary greatly in capabilities, with some even able to browse the web for more info.


In the insight case, the model has to classify the terms in a short paragraph into broad generic classes, then report which terms have different classes. In the recall case the model has to rank all utterances in its training data based on how relevant they are to the question, then summarise the most relevant one (assuming there is one). Insight seems easier than recall here.


I think there are plenty of humans who wouldn't notice, though.

And probably plenty of AI implementations that would notice.


Are we now aspiring to ABAI (Artificial Below-Average Intelligence)?


I regret to inform you that "average intelligence" is a far lower bar than you might think.


As the joke goes: look around you and see how dumb the average person is. And now imagine: half the people in your city/country/Earth are dumber than this person.


Did you know that, on average, approximately half of all statistics jokes about a normal distribution fail to comprehend mean, median, and mode? True fact.


The AI would not only notice, it would also notice 300 other things that are far more subtle haha


I was so confused by this comment before I figured out you were actually straightforwardly telling the truth.

Details: https://pinehollow.livejournal.com/26806.html


In the Gatsby example, I'd expect the model to be able to answer "which sentence from the story felt out of place?" without knowledge of the original.


The point is spotting the inconsistency within the document, not whether an original has a different form


Comparing strings is a Python-script problem for a newbie.

Interesting part is if it can deduce something that's out of place within a huge document.


They’re saying you can’t ask it to compare against a doc it doesn’t have access to.


I'm saying you can write a Python script that compares two docs; that's a really useless use case.

The relevant test is spotting something out of place in a really long text - if it can do that on non-training material then that's actually useful for reviewing things.


I don't see any reason why that wouldn't be possible.


> Most of these AI

This is as meaningful as saying most hominids can't count. You can't usefully generalize about AI models with the rate of change that exists right now. Any statement or comparison about AI has to name specific models and versions, otherwise it's increasingly irrelevant noise.


Every time someone has said "LLMs can't do X", I tried X in GPT 4 and it could do it. They usually try free LLMs like Bard or GPT 3 and assume that the results generalise.


LLMs can't massively decrease the net amount of entropy of the universe


Are you trying to get us killed?


Insufficient data for a meaningful answer.


I'd imagine working with an entire company document would require a lot more hand holding and investment in prompt engineering. You can definitely get better results if you add much more context of what you're expecting and how the LLM should do it. Treating these LLMs as just simple Q&A machines is usually not enough unless you're doing simple stuff.


I've been curious about this for a while, I have a hobby use-case of wanting to input in-progress novellas and then asking it questions about plot holes, open plot threads, and if new chapter "x" presents any serious plot contradiction problems. I haven't tried exploring that with a vectordb-embeddings approach yet.


This is an exact example of something a vector db would be terrible at.

Vector dbs work by fetching segments that are topically similar to the question, so a question like "Where did <Character> go after <thing>?" will retrieve segments with locations, the character, and maybe mentions of <thing> as a recent event.

Your question has no similarity with the required segments in any way; and it's not the segments themselves that are wrong, it's the way they relate to the rest of the story.
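
To make that concrete, here's a rough sketch of how similarity retrieval ranks segments against a query (the sentence-transformers model and the example texts are just illustrative assumptions):

    # Illustration: similarity retrieval favors topical overlap, not structural
    # questions like "does the new chapter contradict the earlier plot?"
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

    segments = [
        "Mara rode north from the harbor at dawn, leaving the burned ship behind.",
        "The council voted to exile Mara for treason in the spring.",
        "Three chapters of dialogue about the harvest festival.",
    ]
    query = "Does the new chapter introduce any plot contradictions?"

    seg_vecs = model.encode(segments, normalize_embeddings=True)
    q_vec = model.encode([query], normalize_embeddings=True)[0]

    scores = seg_vecs @ q_vec  # cosine similarity (vectors are normalized)
    for score, seg in sorted(zip(scores, segments), reverse=True):
        print(f"{score:.2f}  {seg}")
    # The query shares almost no vocabulary or topic with any segment, so the
    # ranking is essentially arbitrary: the retriever cannot surface the two
    # segments whose *relationship* is the problem.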


Good points - LLMs are ok at finding things that exist, but they have zero ability to abstract and find what is missing (actually, probably negative; they'd likely hallucinate and fill in the gaps).

Which makes me wonder if the opposite, more laborious approach might work - ask it to identify all characters and plot themes, then request summaries of each. You'd have to review the summaries for holes. Lotsa work, but still maybe quicker than re-reading everything yourself? Something like the sketch below.
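
A sketch of that two-pass approach, where ask_llm() is a hypothetical stand-in for whatever chat API you use:

    # Two passes: first extract entities, then fan out one summary per entity;
    # a human reviews the (much shorter) summaries for holes.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("call your LLM of choice here")

    def entity_summaries(novel_text: str) -> dict[str, str]:
        # Pass 1: a machine-readable list of characters and plot threads.
        listing = ask_llm(
            "List every character and plot thread in the following manuscript, "
            f"one per line:\n\n{novel_text}"
        )
        entities = [line.strip() for line in listing.splitlines() if line.strip()]

        # Pass 2: one focused summary per entity.
        return {
            entity: ask_llm(
                f"Summarise everything involving '{entity}' in this manuscript, "
                f"in order of appearance:\n\n{novel_text}"
            )
            for entity in entities
        }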


Firstly, I don't at all agree that they have zero ability to abstract. That doesn't fit my experience at all. A lot of the tasks I use ChatGPT for are exactly this: analysing gaps in specifications etc., and having it tell me what is missing, suggest additions, or ask for clarifications. It does that just fine.

But I've started experimenting with the second part, of sorts, not to find plot holes but to have it create character sheets for my series of novels for my own reference.

Basically, I have it maintain a sheet, feed it chunks of one or more chapters, and ask it to output a new sheet augmented with the new details.

With a 100K context window I might just test doing it over whole novels, or much larger chunks of one.
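
A minimal sketch of that loop, assuming the legacy openai Python SDK (the model name and prompt wording are illustrative, not exactly what I run):

    import openai

    def update_sheet(sheet: str, chunk: str) -> str:
        prompt = (
            "Here is the current character sheet for my novel:\n\n"
            f"{sheet}\n\n"
            "Here is the next chunk of the manuscript:\n\n"
            f"{chunk}\n\n"
            "Output the full character sheet again, augmented with any new "
            "characters, traits, or relationships from this chunk. Do not drop "
            "existing entries."
        )
        resp = openai.ChatCompletion.create(
            model="gpt-4",  # illustrative; any chat model works
            messages=[{"role": "user", "content": prompt}],
        )
        return resp["choices"][0]["message"]["content"]

    chapters = ["...chapter 1 text...", "...chapter 2 text..."]  # placeholder
    sheet = "(empty)"
    for chapter in chapters:
        sheet = update_sheet(sheet, chapter)  # rolling state across chunks
    print(sheet)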


> LLMs are ok at finding things that exist, but they have zero ability to abstract and find what is missing (actually, probably negative; they'd likely hallucinate and fill in the gaps).

I feel this is mostly a prompting issue. Specifically GPT-4 shows surprising ability to abstract to some degree and work with high-level concepts, but it seems that, quite often, you need to guide it towards the right "mode" of thinking.

It's like dealing with a 4 year old kid. They may be perfectly able to do something you ask them, but will keep doing something else, until you give them specific hints, several times, in different ways.


There are other ways of using the model besides iterative forward inference (completions). You could run the model over your novel (perhaps including a preface) and look at the posterior distribution as it scans. This may not be so meaningful at the level of the token distribution, but there may be interesting ways of “spell checking” at a semantic level. Think a thesaurus, but operating at the level of whole paragraphs.
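
A minimal sketch of the idea, using GPT-2 via Hugging Face transformers purely as a small stand-in model (the surprisal threshold and the example sentence are arbitrary):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    text = "In my younger years Mr. Carraway worked on machine learning tooling."
    enc = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**enc).logits  # (1, seq_len, vocab)

    # Surprisal of each token given its prefix: -log p(token_t | tokens_<t)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = enc["input_ids"][:, 1:]
    surprisal = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]

    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])[1:]
    for tok, s in zip(tokens, surprisal.tolist()):
        flag = "  <-- unusually surprising" if s > 10 else ""
        print(f"{tok:>15s}  {s:6.2f}{flag}")
    # Aggregating surprisal per sentence or paragraph gives a crude
    # "semantic diff" signal without needing a reference copy of the text.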


that's not at all what I said


Do the OpenAI APIs support converting prompts to vectors, or are people running their own models locally to do this? Can you recommend any good resources to read up on vector DB approaches to working around context length limits?


Indeed, this Haystack tutorial is a good example: https://haystack.deepset.ai/tutorials/22_pipeline_with_promp... It combines a retrieval step with a prompt layer that inserts the relevant context into the prompt. You can, however, replace the retrieval step with something that uses a proper embedding model, and OpenAI also provides those if you want. I tend to use lighter (cheaper) OSS models for this step, though. PS: There's some functionality in the PromptNode to make sure you don't exceed the prompt limit.


That's great - thanks!


Yes, you can use a local embeddings model like gtr-t5-xl alongside retriever augmentation. This can point you in the right direction: https://haystack.deepset.ai/tutorials/22_pipeline_with_promp...


Thanks!


OpenAI has an embeddings API that people use for that: https://platform.openai.com/docs/guides/embeddings, though whether it's the best model to do that is contested.

Contriever is an example of a strong model for doing that yourself; see their paper too to learn about the domain: https://github.com/facebookresearch/contriever
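
To answer the first question directly: yes. A minimal sketch with the legacy openai Python SDK (model name per the embeddings docs above; the chunks and query are made up):

    import numpy as np
    import openai

    def embed(texts):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return np.array([item["embedding"] for item in resp["data"]])

    doc_chunks = ["first chunk of the document...", "second chunk of the document..."]
    doc_vecs = embed(doc_chunks)
    query_vec = embed(["What changed about Mr. Carraway?"])[0]

    # Rank chunks by cosine similarity and keep the best ones for the prompt.
    scores = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    best_chunk = doc_chunks[int(np.argmax(scores))]
    print(best_chunk)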


Thanks!


So what's the right way to do a wider-ranging analysis? Chunk into segments, ask about each one, then do a second pass to review all answers together?


I'm also not entirely convinced by "huge" context models just yet, especially as it relates to fuzzy knowledge such as overarching themes or writing style.

In particular, there are 0 mentions of the phrase "machine learning" in The Great Gatsby, so adding one sentence that introduces the phrase should be easy for self-attention to pick out.


I'd be more impressed if it could rewrite Mr. Carraway as an ML engineer in the entire novel. However it's not intrinsically clear that it cannot do this...

It'll be tough to find good benchmarks on long context windows. A human cannot label using 100k tokens of context.


My thoughts exactly - rewrite the novel with Mr. Carraway as an ML engineer while maintaining themes/motifs (possibly adding new ones too). I'm guessing what's impressive is that these are the first steps towards something like this? Or is it already possible? Someone please correct me here.


Or rewrite The Count of Monte Cristo as a science fiction novel and get The Stars My Destination. Or rewrite Heinlein's Double Star into present-day and get the movie Dave.


Wonder if they’d set it in SF then…


This sounds like all the other skepticism about what AI can do. And then it can spot 200x more than any human and correlate it into common themes, and you’ll say what?


Doing more than a human can isn't impressive. Most computer programs, for any purpose, can do more of something, or do something faster, than a human can.

A better comparison would be if it can pick out any differences that can't be picked out by more traditional and simple algorithms.


It does, using this method.

My immediate thought as well was '... Yeah, well vimdiff can do that in milliseconds rather than 22 seconds' - but that's obviously missing the point entirely. Of course, we need to tell people to use the right tool for the job, and that will be more and more important to remind people of now.

However, it's pretty clear that the reason they used this task is to give something simple to understand what was done in a very simple example. Of course it can do more semantic understanding related tasks, because that's what the model does.

So, without looking at the details we all know that it can summarize full books, give thematic differences between two books, write what a book may be like if a character switch from one book to another is done, etc.

If it doesn't do these things (not just badly, but can't at all) I would be surprised. If it does them, but badly, I wouldn't be surprised, but it also wouldn't be mind bending to see it do better than any human at the task as well.


>Of course it can do more semantic understanding related tasks, because that's what the model does.

The problem is that marketing has eroded any such faith. Too often, simple examples are given and the consumer is left to extrapolate the intended functionality, because then there's no false advertising involved. This happens over and over again in products: the examples are carefully selected and don't actually give a good representation of the implied functionality.

As such, I personally can't make the leap to "of course it can do more semantic-understanding-related tasks", like a diff that's not as simple - one where perhaps a character's overall personality over the course of the book is shifted, not just a single line that defines their profession.

This isn't to say the demonstrative example isn't neat in its own right given what's going on here - it is. I'm just saying I can't make such leaps from examples given by any product. When I work with vendors of traditional software, this happens all the time: people dance around a lack of functionality you obviously want or need in order to make a sale. It's only when you force them to be explicit about the specific cases, especially in writing, that I have any faith at all.


LLMs are specifically designed and trained for semantic-understanding tasks. That's the entire point. They are trained to solve natural language understanding tasks, and they build a semantic model in order to solve those tasks.


I did say it may do certain tasks with low performance. What you're saying is not really understandable (or simply wrong). The fact that it does understand how to do a 'simple' task such as finding where text is different is actually somewhat impressive given the typical training data for these models. But I suppose you need to actually understand the field of NLP to understand why that is.

If you're expecting perfection or magic then you will be disappointed.


I think it's fair to say most people don't know what a diff tool is, but do know how to ask questions. That is the democratizing factor that AI is introducing I feel, giving high powered computing to people without the need for specialized knowledge.


The real goal would be for the model to determine what tool is needed to achieve the task, and then use that tool to achieve it. Using the model to write code provides a path to explainability and correctness guarantees (by reading and executing the code, not the model internals). So in this case, identifying that a diff program is required may have taken a couple of seconds, and the diff would take milliseconds - still yielding a faster and more correct output (as the model alone may produce an approximate diff or hallucinate).
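
A hand-rolled sketch of that routing idea, where choose_tool() is a hypothetical stand-in for the LLM call and the actual comparison is delegated to Python's difflib:

    import difflib

    def choose_tool(task_description: str) -> str:
        # In a real system this would be an LLM call, e.g. "Which tool fits:
        # 'diff', 'summarize', or 'answer directly'?" Hypothetical here.
        return "diff"

    def run_task(original: str, modified: str) -> str:
        tool = choose_tool("Find what changed between two versions of a novel.")
        if tool == "diff":
            return "\n".join(difflib.unified_diff(
                original.splitlines(), modified.splitlines(), lineterm=""
            ))
        raise NotImplementedError(tool)

    original = "Mr. Carraway was a bond salesman.\nHe moved to West Egg."
    modified = ("Mr. Carraway was a software engineer that works on machine "
                "learning tooling at Anthropic.\nHe moved to West Egg.")
    print(run_task(original, modified))
    # The diff itself is exact and takes milliseconds; the model's only job
    # was routing the request to the right tool.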


Of course it can very soon, since those were also written by humans. Like AlphaZero vs Rybka.


>> And then...you’ll say what?

USER: It's a stochastic parrot.

GPT: I know you are, so what am I?


What techniques do they actually use to achieve 100k? I assumed they load the document into a vector database and then do some kind of views into that.



I would assume training the LLM on sequences of 100k tokens would be the right way.


Makes me wonder whether we could get really huge contexts much more efficiently by feeding a higher layer back into the tail end of the model. That way it has a very clear picture of the recent text but only a compressed picture of the earlier parts of the document.

(I think I’ve got to read up on how transformers actually work.)


Afaik you're describing something akin to a recurrent neural network, and the problem with that is that it doesn't parallelize well to modern hardware. And vanishing gradients.


I had the same thought as the comment you're responding to.

Recurrent neural networks are bad when the recurrence is 100 steps long or more. You need long chains because, going a token at a time, that's what it takes to process even one paragraph.

But if you use an RNN around a Transformer-based LLM, then you're adding +4K or +8K tokens per recurrence, not +1.

E.g.: GPT-4 32K would need just 4 RNN steps to reach 128K tokens!
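
A toy sketch of that chunk-level recurrence, where llm() is a hypothetical stand-in for any large-context model and the window sizes are just for illustration:

    CHUNK_TOKENS = 32_000   # per-window budget (e.g. GPT-4 32K)
    STATE_TOKENS = 2_000    # portion of each window reserved for carried state

    def llm(prompt: str) -> str:
        raise NotImplementedError("call your large-context model of choice here")

    def answer_over_long_doc(chunks: list[str], question: str) -> str:
        state = ""  # compressed picture of everything seen so far
        for chunk in chunks:  # each chunk fits in CHUNK_TOKENS - STATE_TOKENS
            state = llm(
                f"Summary of the document so far (under {STATE_TOKENS} tokens):\n"
                f"{state}\n\nNext part of the document:\n{chunk}\n\n"
                "Update the summary so it can still support questions later."
            )
        # Four recurrence steps over 32K-token chunks cover ~128K tokens of input.
        return llm(f"Document summary:\n{state}\n\nQuestion: {question}")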



