Recursively summarizing enables long-term dialogue memory in LLMs

whimsicalism · on Sept 2, 2023

To my intuition, all of these ways of building memory in “text space” seem super hacky.

It seems intuitive to me that memory would be best stored in dense embedding space that can preserve full semantic meaning for the model rather than as some hacked on process of continually regenerating summaries.

And similarly, the model needs to be trained in a setting where it is aware of the memory and how to use it. Preferably that would be from the very beginning (ie. the train on text).

nottheengineer · on Sept 2, 2023

It does seem hacky, but then again the whole concept of conversational LLMs is. You're just asking it to add an extra word to a given conversation and after a bit, it spits out an end token that tells your application to hand control back to the user.

I think latent space and text space aren't as far apart as you think. LLMs are pretty stupid, but very good at speech. They are good at writing code because that's very similar, but fall apart in things that need some actual abstract thinking, like math.

Those text space hacks do tend to work and stuff like "think step by step" has become common because of that.

LoRAs are closer to what you mean and they're great at packing a lot of understanding into very little data. But adjusting weights for a single conversation just isn't feasible yet, so we're exploring text space for that purpose. Maybe someome will transfer the methods we discover in text space to embedding space to make them more efficient, but that's for the future.

og_kalu · on Sept 2, 2023

>They are good at writing code because that's very similar, but fall apart in things that need some actual abstract thinking, like math.

Pretty odd assertion. LLMs are not "good at speech, bad at abstract thinking".

What do these have to do with speech ?

https://general-pattern-machines.github.io/

https://arxiv.org/abs/2212.09196

It doesn't even hold with your example because GPT-4 is pretty good at Math, nowhere near "falling apart".

nottheengineer · on Sept 3, 2023

Pattern reproduction is very close to speech in my opinion. Formal grammars even have it in the name and approaches like https://news.ycombinator.com/item?id=37125118 show that LLMs are indeed very fit for that purpose.

I think I have to walk that claim about math back and try to phrase what I meant differently: LLMs have a hard time with problems that don't translate well into the text space, i.e. abstract problems. Math used to be one of those because early tokenizers were designed just with text in mind and LLMs weren't good enough to overcome those limitations. OpenAI put in a lot of effort into their tokenizers to make GPT3.5 and GPT4 better at math specifically.

The second paper you linked is very interesting and I think it supports my original assertion of text space and latent space being close. The first graph shows GPT3.5 doing much better at pattern reproduction and language tasks while humans still hold an advantage in the more abstract tasks like story analogies. Higher order relations being a key thing that's measured maybe makes this task a bit too perfect for arguing my case, but it does show that humans have an advantage in more abstract situations.

I think any problem that can be viewed as being mostly a form of translation is a good one for LLMs and if you can express a problem as that, you can get better results.

To get back to the main point: latent space and text space, or feature space in general, being close is what I believe causes all of this. Happy to hear counterexamples.

riku_iki · on Sept 2, 2023

> GPT-4 is pretty good at Math, nowhere near "falling apart".

its good at tasks which were included into training dataset in some variations.

Nevermark · on Sept 2, 2023

I have played around with GPT-4 and some fairly simple but completely new math ideas. It was fabulous at identifying special cases I overlooked, that disproved conjectures.

sudokuist · on Sept 2, 2023

Example?

Nevermark · on Sept 2, 2023

I was playing around with prime numbers, and simple made up relationships between them, such as between the square of a prime N vs. the set of primes smaller than N, etc.

It caught me out with specific examples that violated my conjectures. In one case the conjecture held for all but one case, another conjecture was generally true but not for 2 and 3.

In one case it thought a conjecture I made was wrong, and I had to push it to think through why it thought it was wrong until it realized the conjecture was right. As soon as it had its epiphany, it corrected all its logic around that concept.

It was very simple stuff, but an interesting exercise.

The part I enjoyed the most was seeing GPT-4's understanding move and change as we pushed back on each other's views. You miss out on that impressive aspect of GPT-4 in simpler sessions.

sudokuist · on Sept 2, 2023

Have you tried formalizing your ideas with Isabelle? It has a constraint solver and will often find counterexamples to false arithmetical propositions[1].

1: https://isabelle.in.tum.de/overview.html

mannykannot · on Sept 2, 2023

I have not been able to figure out how that would help in the context of this discussion. As I see it, what’s very interesting here is that an LLM is able to do this.

riku_iki · on Sept 2, 2023

I think the point is that LLM is not right tool for deep reasoning, and isabelle and others are much better such tools, even community trying to apply LLM in this area following current wave of hype.

riku_iki · on Sept 2, 2023

curious why you referred specifically on isabelle, which looks ancient and over engineered, there are many other tools and langs in this area.

I am not criticizing, but curious about your opinion.

c-cube · on Sept 3, 2023

Isabelle is good at counter examples in ways few other proof assistants are. In general its automation is excellent, partly because it uses a less powerful logic (HOL instead of CIC; more expressive logics are harder to write automation for). It's not obsolete.

Nevermark · on Sept 2, 2023

I have not, thanks for the tip.

riku_iki · on Sept 2, 2023

its hard to judge how deep and unique your conjectures were.

I did similar testing of GPT4, and my observation is that it starts failing after 3-4 levels of reasoning depth.

golol · on Sept 3, 2023

Nice to see number of levels of reasoning depth mentioned. I personally believe the size of a (well-trained) LLM determines how many steps of reasoning in sequence it can approximate. Newer models get deeper and deeper, giving them deeper reasoning context windows. My hypothesis is that you don't need infinite reasoning depth, just a bit more than GPT-4 has. I think once you can tie your output together with thinking in terms of ~10+ reasoning steps you'll be very close to hunan performance.

tux1968 · on Sept 2, 2023

> failing after 3-4 levels of reasoning depth.

That sounds more like an implementation or resource limitation, rather than an inherent limitation of the technique in general.

riku_iki · on Sept 3, 2023

it is not obvious to me how you came to such conclusion.

LLMs got lots of investments: 10s of billions of dollars and tons of compute, maybe more than any other tech in history, and can't crack 3 steps reasoning. It sounds like tech limitation..

naasking · on Sept 3, 2023

None of these systems or their training sets have been specifically tailored to tackle abstract reasoning or math, so that seems like a premature conclusion. The fact that they're decent at programming despite that is interesting.

vikramkr · on Sept 3, 2023

They're also brand new and at some undetermined part of the sigmoid curve. Trying to predict where you are on the curve while in the middle of a sigmoid is a fools errand, the best you can do is make random predictions and hope you are accidentally correct so you can become a pundit later

PaulHoule · on Sept 2, 2023

Kinda able to do some math tasks some of the time whereas you can use techniques from the arithmetic textbook to get the right answer all of the time with millions of times less CPU even including the overhead of round-tripping to ASCII numerals which is shockingly large compared to what a multiply costs.

Kinda "the problem" with LLMS is that they successfully seduce people by seeming to get the right answer to anything 80% of the time.

sdenton4 · on Sept 2, 2023

The arithmetic issues are well documented and understood; it's a problem of sub-token manipulation, which has nothing to do with reasoning. (Similar to calling blind people unintelligent because they can't read the iq test.)

And the better llms can easily write code to do the attention that they suck at...

bigyikes · on Sept 3, 2023

Excellent anology. LLMs are capable of many extraordinary things, and it’s a shame people dismiss them because they fail to live up to some specific test they invented.

og_kalu · on Sept 2, 2023

Math is a lot more than just Arithmetic.

PaulHoule · on Sept 2, 2023

Yeah but if you can only do arithmetic right X% of the time you aren't going to get other answers right as often as would really be useful.

That said, LLMs have a magic ability to "short circuit" and get the right answer despite not being able to get the steps right. I remember scoping out designs for NLP systems about 5 years ago and frequently conclude that "that won't work" because information was lost at an early stage but in retrospect by short circuiting a system like that can outperform its parts but it still faces a ceiling on how accurate the answers are because the reasoning is not sound.

btilly · on Sept 2, 2023

Human reasoning is amazingly not sound.

When you add in various patterns, double-checks, and memorized previous results, what human reasoning can do is astounding. But it is very, vary far from sound.

riku_iki · on Sept 2, 2023

all currently available reasoning approaches are limited.

I guess the topic is how far GPT in reasoning is from human. We can take some simple tests:

- can GPT play chess as well as humans, as benchmark of reasoning games?

- did GPT prove some nontrivial math theorems or solve some math problems where humans couldn't find solution yet?

PaulHoule · on Sept 3, 2023

One thing I thought was amusing was that there was a burst of articles about Cyc that got mentioned when Doug Lenat died including this arXiv paper

https://arxiv.org/abs/2308.04445

and that one said that Cyc had over 1,100 special purpose reasoning engines. The general purpose resolution solver was nowhere near fast enough to be really useful.

Early one there was

https://en.wikipedia.org/wiki/General_Problem_Solver

which would be capable in principle of finding a winning move in a chess position but because it worked by exhaustive search it would practically take too long. The thing is that a good chess playing program is not generally intelligent just as a chess grandmaster isn't necessarily good at anything other than chess, it just has special purpose heuristics (as opposed to algorithms) that find good chess move.

ChatGPT-like systems will be greatly improved by coupling them to other systems such as "write a Python/SQL script then run it", "run a query against bing and summarize the results", and "go find the chess engine and ask it what move to make", that is, like Cyc, it will get a swiss army knife of tools that help it do things it's not good at but it doesn't create general intelligence any more than Cyc did.

Robert Penrose in the Emperor's New Mind suggests that there must be some quantum magic in the human mind because the human mind is able to solve any math problem whereas any machine is limited by Gödel's theorem. It's silly, however, because we don't humans are capable of proving any theorem: look at how we struggled with Fermat for nearly 360 years or how

https://en.wikipedia.org/wiki/Collatz_conjecture

seems not even tantalizingly out of reach.

The difference might be that humans feel bad when they get the wrong answer whereas ChatGPT certainly doesn't. (as much as its empty apology can be satisifying to people) This isn't just an attribute of humans, working with other animals such as horses I'm convinced that they feel bad when they screw up too.

naasking · on Sept 3, 2023

> it will get a swiss army knife of tools that help it do things it's not good at but it doesn't create general intelligence any more than Cyc did

How do you know general intelligence is its own thing and not just a Swiss army knife of tools?

> because the human mind is able to solve any math problem whereas any machine is limited by Gödel's theorem

Any machine can be programmed to solve any problem at all, if the proof system is inconsistent. Which is probably exactly the case with humans. We work around it because different humans have different inconsistencies, so checking each other's work is how we average out those inconsistencies.

PaulHoule · on Sept 3, 2023

(As a person who went down the rabbit hole of knowledge-based systems and looked at Cyc quite a bit.)

Three forms of intelligence are (i) animal intelligence, (ii) language use, and (iii) abstract thinking.

Animals are intelligent in their own way, particularly socially intelligent. My wife runs a riding barn and it is clear to me that one of the things horses are most interested in is what the people and other horses are up to and that a horse doesn’t just have an opinion about other horses but they have an opinion about what opinion about what the other horses think about a horse. (e.g. Cyc has a system of microtheories and modalized logic that tries to get at this. Of course visual recognition and similar things are a big part of animal intelligence and boy have neural nets made progress there.)

Language is a unique capability of humans. (which Cyc made no real contribution to.)

If you get a PhD what you learn is how to develop systems of abstract thinking or at the very least go to conferences and acquire them or dig through the literature, dust them off and get them working. There is the aspect of individual creativity but also the “standing on the shoulders giants” that Newton talked about.

Before Lenat started on Cyc he was interested in expert systems for building expert systems or at the very least a set of development tools for doing the same and that was a motivation of Cyc even if the point of Cyc was to produce new knowledge bases and reasoning procedures that would live inside Cyc. The trouble is that this was a tortuous procedure and I did go through a phase of thinking about evaluating OpenCyc for a project but it had the problem that it would have taken at least six months just to get started with a project that could be finished in some other way much more quickly.

My own journey led through twists and turns but I came to see it as something like systems software development where you build tools like compilers and debuggers that transform inputs into a knowledge base and put it to work, but I very much gave up on “embedding in itself”

As for problems in general I don’t really know if they can all be solved? Isn’t it possible that there is no finite procedure to prove the Collatz conjecture?

naasking · on Sept 3, 2023

> Language is a unique capability of humans.

No it's not. Language is well documented in dolphins, for instance. Crows have also demonstrated self awareness and ability to do arithmetic. I think your 3-part breakdown of intelligence is out of date. There's no rigourous evidence that intelligence breaks down in this way, it's just a "folk theory" at this point.

Aerbil313 · on Sept 5, 2023

According to Gödel’s incompleteness theorem, some truths aren’t provable in the sense of acquiring proof through human reasoning and logic.

og_kalu · on Sept 2, 2023

No it's just pretty good in general lol.

cerved · on Sept 2, 2023

my experience is that it's pretty subpar

astrange · on Sept 2, 2023

Are you using code interpreter? It's better.

The mobile app doesn't offer it though, and also has a system prompt that causes some strange behavior - sometimes it will put emojis in the text and then apologize for using emojis.

bigyikes · on Sept 3, 2023

Care to share a GPT conversation you’ve had? I’m interested in what sorts of prompts lead you to this opinion. My experience is the opposite.

cerved · on Sept 3, 2023

A bit too much of a hassle. But if you're willing to share some of your good experiences, I'm curious

sudokuist · on Sept 2, 2023

Can LLMs solve sudoku yet?

JieJie · on Sept 3, 2023

These folks think so.

https://github.com/jieyilong/tree-of-thought-puzzle-solver

sudokuist · on Sept 3, 2023

They don't have 9x9 puzzles. Any guesses as to why they only tried 3x3, 4x4, and 5x5 but not 9x9?

This work is interesting. I wouldn't have guessed 3x3 puzzles would be solvable by a large Markov chain. It would be interesting to know how large of a context is necessary to solve 9x9 puzzles. No existing model can currently solve 9x9 puzzles even though the recursive backtracking algorithm can solve any given puzzle in less than a second.

chaxor · on Sept 3, 2023

Why are people so intent on incorrectly asserting these models are Markov chains? It makes sense to use the analogy as an educational tool for exposition, but it more often seems that many use it as a way to minimize the notion that these models could ever possibly be useful for anyone. Is this just simply to make it more intuitive for others that it's a sequence model? Because it seems about as helpful as 'email is just bits' when everyone and their grandma knows about the relation between transformers, GAT, and circulant matrices.

JieJie · on Sept 3, 2023

Well, you just said sudoku.

As others have pointed out, maybe intelligence derived from language just isn't very good at math? It's not like linear algebra comes naturally to humans, we have to be specially trained. I've been taking Khan Academy classes and believe me, math sure doesn't come naturally to me.

I realize tempers are high on this subject, but I literally just wanted to point it out, in case you hadn't seen it. I wasn't trying to dunk on you or anything.

dongping · on Sept 3, 2023

I'm not sure about its math, but GPT-4 fails miserably at simple arithmetic questions like 897*394=?

The GPT-3.5 turbo is fined-tuned for arithmetic according to ClosedAI (noted in one of the change logs), so it is sometimes slightly better, but nevertheless always fails equations like 4897*394=?

mistercow · on Sept 3, 2023

Arithmetic is a pretty pathological case for ChatGPT because of BPE. Digits just tokenize in a way that makes arithmetic way more complicated.

That said, I just fed both of your examples into GPT-4 and it answered them correctly without using CoT.

dongping · on Sept 3, 2023

This was the response that I got from the GPT-4 API yesterday:

{'id': 'chatcmpl-7uVF5xGqR1oEzXITw3WZYsnB4Yzt8', 'object': 'chat.completion', 'created': 1693700819, 'model': 'gpt-4-0613', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '353538'}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 11, 'completion_tokens': 2, 'total_tokens': 13}}

Maybe they fine-tuned the ChatGPT version better, or fed it to an calculator.

mistercow · on Sept 6, 2023

I would guess that they have a later model on web than on API (I also see worse results on API with 0613). Further testing shows that it loses the plot after a few more digits, which wouldn't make sense if they were injecting calculations.

Filligree · on Sept 3, 2023

> I'm not sure about its math, but GPT-4 fails miserably at simple arithmetic questions like 897*394=?

That's, um, about 300,000?

...

353,418 actually. But I'm not going to blame the AI too much for failing at something I can't do either.

dongping · on Sept 3, 2023

One can resort to traditional vertical multiplication (which requires patience), or do

897*394 = (900-3) * (400-6) = 900*400 - 6*900 - 400*3 + 3*6 = 360,000 - (5,400 + 1,200) + 18 = 360,018 - 6,600 = 353,418

6510 · on Sept 3, 2023

   8*3=24 and 800*300 =240000
   8*9=72 and 800* 90 = 72000
   8*4=32 and 800*  4 =  3200
   9*3=27 and  90*300 = 27000
   9*9=81 and  90* 90 =  8100
   9*4=36 and  90*  4 =   360
   7*3=21 and   7*300 =  2100
   7*9=63 and   7* 90 =   630
   7*4=28 and   7*  4 =    28
   --------------------------
                       353418

dash2 · on Sept 3, 2023

But you are smart enough to use a computer or calculator. And AI is a computer. So the naive expectation would be that it would be capable of doing as well as a computer.

Also, you probably could do long multiplication with paper and pencil if you needed to. So a reasoning AI (which has read many many descriptions of how to do long multiplication) should be able to also.

ambrozk · on Sept 3, 2023

> And AI is a computer. So the naive expectation would be that it would be capable of doing as well as a computer.

Why would you judge an AI against the expectations of a naive person who doesn't understand capabilities AIs are likely to have? If an alien came down to earth and concluded humans weren't intelligent because the first person it met couldn't simulate quantum systems in their head, would that be fair?

dash2 · on Sept 3, 2023

The original question was whether LLM's are "smart" in a human-like way. I think that if you gave a human a computer, he'd be able to solve 3-digit multiplications. If LLM's were human-like smart, they could do this too.

Timon3 · on Sept 3, 2023

Did someone train LLMs with "access" to a computer? If not, why would you expect them to be able to use something they have never seen?

dash2 · on Sept 4, 2023

“It’s right there, you stupid llm! Dammit, YOU’RE RUNNING ON IT!”

Timon3 · on Sept 4, 2023

I mean, I'm running on incredible amounts of highly complex physics and maths, but that doesn't mean I can give you the correct answer to all questions on those.

walleeee · on Sept 3, 2023

I dunno, I simulate quantum systems (you, myself, my friends) in my head all the time

vidarh · on Sept 3, 2023

An AI is a program running on a computer.

Minecraft runs on a computer too, but you don't expect the Minecraft NPCs to be able to do math.

So it's a very naive assumption.

Most people struggle with long multiplication despite not only having learnt the rules, but having had extensive reinforcement training in applying the rules.

Getting people conditioned to stay on task for repetitive and detail oriented tasks is difficult. There's little reason to believe it'd be easier to get AIs to stay on task, in part because there's a tension between wanting predictability and wanting creativity and problem solving. Ultimately I think the best solution is the same as for humans: tool use. Recognise that the effort required to do some things "manually" is not worth it.

Timon3 · on Sept 3, 2023

> But you are smart enough to use a computer or calculator. And AI is a computer. So the naive expectation would be that it would be capable of doing as well as a computer.

I disagree. The AI runs on a computer, but it isn't one (in the classical sense). Otherwise you could reduce humans the same way - technically our cells are small (non-classical) computers, and we're made up of chemistry. Yet you don't expect humans to be perfect at resolving chemical reactions, or computing complex mathematics in their heads.

IanCal · on Sept 3, 2023

They can reason through it they just sometimes make mistakes along the way, which is not surprising. More relevant to your comment is that if you give gpt4 a calculator it'll use it in these cases.

Filligree · on Sept 3, 2023

I am indeed smart enough to do that. And so is the AI, if you use the right AI. (I.e, code interpreter.)

mabster · on Sept 3, 2023

I've got an engineers style mindset for these kind of calculations.

897 is about 900. 394 is about 400. 900×400 = 360,000. Only 2% error!

Closi · on Sept 3, 2023

> [GPT] always fails equations like 4897 x 394=?

In some ways, I think we should treat GPT like a human without access to a calculator.

If you ask a human what 4897 x 394 is, they will struggle.

isaacfung · on Sept 3, 2023

Sometimes chatgpt fails at tasks like counting number of a letter in a short string or checking if two nodes are connected in a simple forest of 7 nodes, even with chain of thoughts prompt. Human can solve those pretty easily.

hboon · on Sept 3, 2023

I think I understand your logic, but ChatGPT+GPT-4 gave me the correct answer for "What is 897*394?"

https://chat.openai.com/share/00f94e43-c353-400a-858a-50c10c...

(GPT-3 gave the wrong numeric answer though)

dongping · on Sept 3, 2023

Thanks for testing it. I canceled my ChatGPT Plus a few months ago (when they changed the color from black to purple IIRC).

So I only tested that with the GPT-4 API, with the following results:

{'id': 'chatcmpl-7uVF5xGqR1oEzXITw3WZYsnB4Yzt8', 'object': 'chat.completion', 'created': 1693700819, 'model': 'gpt-4-0613', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '353538'}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 11, 'completion_tokens': 2, 'total_tokens': 13}}

chaxor · on Sept 3, 2023

It absolutely FAILS for the simple problem of 1+1 in what I like to call 'bubble math'.

1+1=1

Or actually, 1+1=1 and 1+1=2, with some probability for each outcome.

Because bubbles can be put together and either merge into one, or stay as two bubbles with a shared wall.

Obviously this can be extended and formalized, but hopefully it also displays that mathematics isn't even guaranteed to provide the same answer for 1+1, since it depends on the context and rules you set up (mod, etc).

I should also mention that GPT-4 does quite astoundingly good at this type of problem wherein new rules are made up on the fly. So in-context learning is powerful, and the idea that it 'just regurgitates training data' for simple problems is quite false.

orbital-decay · on Sept 3, 2023

> LLMs are pretty stupid, but very good at speech. They are good at writing code because that's very similar, but fall apart in things that need some actual abstract thinking, like math.

Isn't that more of a training method issue? Try teaching a caveman to count by making him memorize the sticks and words pairs like LLM does, and you will get similar results, as he won't know about the stateful counting algorithm.

Humans get the powerful reasoning ability through gradual learning of new abstractions, and won't be able to extract anything useful from a textbook on quantum physics until they learned the basics first.

westurner · on Sept 2, 2023

FWIU recently there's?:

- Increase the input prompt token limit (2023-09: 32K tokens in: OpenAI GPT-4 Enterprise, Giraffe (LLama 2))

- Fine tune [a "LoRA" atop a foundation model]

- TODO: ~Checkpoint w/ Copy-on-Write

galaxyLogic · on Sept 2, 2023

Consider what happens if we use this method in our head. Recursively summarize the discussion so far. It will improve our memory. It seem 'hacky" to summarize things in your head but I think that is a big part of how memory works.

coldtea · on Sept 2, 2023

That's what memory is. Whether sensory information (what you remember for a place) or the points from a meeting (where we only remember the big picture items, not what was said minute to minute) it's all summarized/compressed.

whimsicalism · on Sept 2, 2023

This would be more equivalent to having some little spool in our head that writes down our summarized thoughts and then when we need to remember something we take the spool out of our head and look at it to reorient ourselves.

jonas21 · on Sept 3, 2023

Isn't that what people are doing when they take notes and read them later to refresh their memory?

bryant · on Sept 2, 2023

> rather than as some hacked on process of continually regenerating summaries.

incidentally, this isn't far off from how the human brain is believed to work (at least with long term memories).

https://news.northwestern.edu/stories/2012/09/your-memory-is...

whimsicalism · on Sept 2, 2023

> incidentally, this isn't far off from how the human brain is believed to work (at least with long term memories).

Not as literal words printed somewhere in our mind it isn't. This is more akin to things like the funnel transformer. [0] Nevermind that we hardly understand how our minds actually work.

[0]: https://github.com/laiguokun/Funnel-Transformer

sudokuist · on Sept 2, 2023

No one knows how the brain works and how it is connected with the body. Did you know your gut is directly connected with cognition? An unhealthy digestive system has been linked with several neurodegenerative diseases. Also, walking and cardio in general is known to create new neurons and prevent cognitive degeneration.

It's always funny to me when people on online forums confidently proclaim to know what cognition and thinking is all about and that furthermore it can be reduced to symbol shuffling on computers. No one has any clue how to make computers intelligent and anyone that claims to know is either doing marketing for some "AI" company or thoroughly confused about what counts as biological intelligence.

coldtea · on Sept 2, 2023

>No one knows how the brain works and how it is connected with the body

Yes, but we have some informed theories.

Nobody knew how physics works at Newton's time either, but they did know enough to model to satisifcation lots of phenomena like ballistics...

>Did you know your gut is directly connected with cognition?

Well, it is widely reported even in pop-science articles for several decades!

>No one has any clue how to make computers intelligent

For not having "any clue", sure the LLM guys did quite well in getting some of the way to the goalposts...

sudokuist · on Sept 2, 2023

Turns out Markov chains with a large context can do a lot and yet no one has figured out why LLMs can not solve sudoku puzzles. Why do you think that's the case if the goalposts have moved so much?

coldtea · on Sept 2, 2023

Perhaps because intelligence is multi-faceted, and the aspect required for Sudoku puzzles is not modelled well enough with an LLM style backend.

sudokuist · on Sept 2, 2023

Perhaps.

bigyikes · on Sept 3, 2023

“Perhaps”? Are you suggesting that intelligence is not multi-faceted? What exactly did the user you’re replying to say that is questionable?

bryant · on Sept 2, 2023

You're 100% right, no one knows how the brain works. And all the elements you described are probably relevant — including things you didn't mention, such as personality changes tied to heart transplants, etc.

But that's probably reading a little too deeply and seriously into what I said.

gremlinunderway · on Sept 3, 2023

There's actually some very interesting criticism of modern cognitive sciences, one of them contradicting the common idea that neural network based methods were inspired by a model of "how the brain worked". The criticism basically says that instead, we should be looking at the opposite, with modern cognitive science coming out at a time when computers were the technology de jour.

It builds off previous histories and theories of technology which debunk the notion of "neutral" technology. Humans are very much influenced and inspired by doing things because of the technology they have at-hand.

Great example is looking at the most common mechanical models of the human heart. The original models for the heart describe it as a system of "pumps" because it was designed right during the height of the end of the industrial revolution, when the technology at the time were big mechanical pumps and machinery. So, the abstract model of the heart as a pump was born and used, not because necessarily its an "accurate" model but mainly because the terminology made sense to people at the time. Nowadays this model is criticised for a lot of reasons.

Cognitive science suffers from a lot of assumptions which may have been indirectly influenced by computer science (afterall, you're a researcher coming in at a time when you have these powerful machines that could crunch data. Well surely then you'll need to model the brain in a binary and computer-readable format).

btilly · on Sept 2, 2023

Nobody knows, but this model works well enough to, for instance, treat PTSD. Read through https://www.ptsd.va.gov/understand_tx/emdr.asp to verify that.

chaxor · on Sept 3, 2023

For someone correctly claiming we don't know how the brain works, you seem to have a remarkable trust in the brain-gut axis pop sci articles. It's always good to remember that much of the 'established' data in neuroscience (such as the A beta hypothesis) are very well within the possibility of being based on a decent amount of fraud, as has been shown recently.

Not to say all brain-gut axis data is irrelevant, but, as with almost all of biology, the effect sizes of observables anyone typically cares about are pretty small.

However, the visual system in rodents is probably the best data source we have that maps to any formal theory right now, along with temporal difference learning and RL in neuroscience.

doctor_eval · on Sept 2, 2023

This is why the Turing Test tried to abstract away everything other than the text of a conversation. Because ultimately the mechanism is going to be inscrutable, and it’s the output that counts.

gaganyaan · on Sept 3, 2023

> this isn't far off from how the human brain is believed to work

> No one knows how the brain works

You sound very hostile, but I think OP agrees with you.

And I'm not sure why you're trying to dunk on them here. Perhaps at a high level the approach is very similar, even if the mechanics are very different. Would you scoff at someone saying "the robot moved its arm" and say "Ha! Robots don't have muscles, and humans don't have batteries! It's totally different!"?

fnordpiglet · on Sept 2, 2023

I would note that almost everything in computing we use today is super hacky stuff sufficiently abstracted and error handled such that it seems like it’s not at all a hack.

bigyikes · on Sept 3, 2023

I would note that all of biology is hacks on top of hacks! Have you seen a human being? They have this useless fucking organ that tends to burst and kill them.

elevaet · on Sept 3, 2023

This is getting way off topic, but the appendix may have an important role as a refuge for the gut microbiome: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3551545/#:~:tex....

tinco · on Sept 2, 2023

Why do you have the intuition that a dense embedding space could preserve full semantic meaning? From what I understand from embeddings is that they are inherently lossy. At least with a textual summary you could have an agent verify the summary actually accurately represents the information that it's meant to summarize.

chefandy · on Sept 2, 2023

From a technical perspective, sure: it's clear why it functions like that, and there's no technical reason it shouldn't. From a user interface perspective— likely what most people would judge with intuition— that doesn't matter. Processes that mimick familiar interaction patterns cause dissonance when they don't match our mental model of those interactions, and familiar conversation is about as familiar as you get. People we interact with know the things we explicitly tell them or guide them through figuring out, and people intuitively interact with these applications as they would with people because they've deliberately adopted the voice of an individual person. Additionally, for SaaS products, we're used to them maintaining state automatically.

(As annoying as our input can be, this is why dev teams that try to solve problems for non-dev users have designers.)

skybrian · on Sept 3, 2023

It seems to me that there's a major advantage, though: you can read it and fix any mistakes.

amelius · on Sept 3, 2023

I'm wondering how well the data is preserved if you put it in a buffer like that.

It reminds me of the game where someone tells A some story, then A tells it to B, B to C, etc. and then of course the end result is that the story became completely different.

digitcatphd · on Sept 2, 2023

Most things right now seem so. Seems like we are going through many rounds of iteration and guessing what works long term and what is a short term fix is frustrating

SubiculumCode · on Sept 2, 2023

It seems to me that sparse encodings would be more efficient and practical for medium term memory. Isn't rhe problem wirh dense embedding is memory usage?

whimsicalism · on Sept 2, 2023

> Isn't rhe problem wirh dense embedding is memory usage?

Not necessarily, given that a dense embedding can encode the equivalent of many many words or even higher order concepts not easily expressed in word space.

3abiton · on Sept 2, 2023

But there are hardware limitations for storing them in memory

kristopolous · on Sept 3, 2023

Sure but airplanes don't look like birds

ChatGTP · on Sept 3, 2023

They don't fly like them either...

gillh · on Sept 3, 2023

We have been doing this at CodeRabbit[0] for incrementally reviewing PRs and allowing conversations in the context of code changes, giving the impression that the bot has much more context than it has. It's one of the few tricks we use to scale the AI to code review even large PRs (100+ files).

For each commit, we summarize diff for each file. Then, we create a summary of summaries, which is incrementally updated as further commits are made on a pull request. This summary of summaries is saved, hidden inside a comment on a pull request, and is used while reviewing each file and answering the user's queries.

Some of our code is in the open source. Here is the link to the relevant prompt for recursive summarization - https://github.com/coderabbitai/ai-pr-reviewer/blob/main/src...

[0]: coderabbit.ai

ec109685 · on Sept 3, 2023

Do you run into issues with parsing the prompt result? And if so, have you tried using “function” calling versus parsing the more free text output?

gillh · on Sept 3, 2023

We haven’t used functions as our text parsing has been pretty high fidelity so far. It’s because we provide an example of the format that we expect. We didn’t feel like fighting too hard with LLMs to get structured output. You will also notice that our input format is not structured as well. Instead of unidiff format we provide the AI side-by-side diff with line number annotations so the it can comment accurately - this is similar to how humans want to look at diffs.

Our OSS code is far behind our proprietary version. We have a lot more going on over there and we don’t use functions in that version as well.

ec109685 · on Sept 5, 2023

We’ve found things like double quotes, emoji’s and other edge cases are a struggle when trying to get data back.

Roark66 · on Sept 3, 2023

>Code and scripts will be released later.

At this stage I'll not believe a single claim without "code and scripts". It may be true, it may be bullshit. Who knows. Without a low effort way to replicate the experiments I consider this paper(and others like it) something written just so authors can put it in their CVs.

I've been waiting for 6+ months for other "code to be released later" papers in this(LLMs) space and there is no indication it will ever be released. Some of these papers are so brazen to even include broken links that lead to parked domains.

It's time the community started recognising this behaviour.

phire · on Sept 3, 2023

Yes, it's a really simple idea that doesn't require much code. Really shouldn't be hard to clean it up and publish.

I was actually playing with similar ideas a while back. I didn't have any code at all, I was just experimenting with prompts directly in the API dashboard. The idea showed some promise, but it didn't seem like the API cost would be worth it. I suspect you would be much better off with a vector embeddings approach.

TuringNYC · on Sept 2, 2023

A bit of personal anecdote -- at work, we have thousands of "Briefings" which are hour-long (sometimes day-long) in-person panels. We've successfully summarized each and every briefing. The messy transcripts are well summarized into five paragraph of text.

More topical, we also 1:many categorized each briefing into topics and sub-topics with topics ending up with several dozen briefings and sub-topics of a dozen briefings. For this we summarized the subset of associated summaries and tested this comprehensively, and had great results with LLMs.

I was originally skeptical this would work, but it worked beautifully. Given a sufficiently large context window, we would not have done so, but thankfully this was not a problem.

ec109685 · on Sept 3, 2023

Even with large context windows, this technique is useful. I have found breaking down a problem into “map reduce” does much better than stuffing it into a huge 32k context window and asking it solve it in one shot.

TuringNYC · on Sept 3, 2023

It would require careful setup, but one thing that might be the case is losing light mentions across many documents and keeping heavy mentions clustered in one document.

With the summary-of-summaries approach, I think dense clusters in one document vs consistent mentions in many documents wouldnt get the same treatment.

kgeist · on Sept 2, 2023

I tried to build memory using recursive summarization months ago using open source models and what I found is that with a naive implementation, it would often get stuck on a certain topic forever because certain bits would survive all summarization rounds.

washadjeffmad · on Sept 2, 2023

Yeah, unless this substantially mitigates amplification, even when using manual chunk sizing on known materials, the context still hangs onto its "dying thoughts" in a way that remarkably resembles Alzheimer's.

cushpush · on Sept 2, 2023

plaque buildup?

wokwokwok · on Sept 3, 2023

Not only that, it’s provable that this approach doesn’t scale.

That is to say, specifically, that it is not possible, by any means, to take any block of text and reduce it to a smaller block of text without losing some information.

That is infinite compression; if you could do this, you could reduce any dataset to 1 bit and recover the data seamlessly.

You can’t.

What that means is that this approach is fundamentally just improving performance, it’s not solving the problem.

When you compress a conversation into a summary, some information will be lost. You can tweak and fold and do whatever clever things you want, but fundamentally you are losing information.

…and the process is recursive. So at some point you will be summarising a set of summaries… and you will lose some information you’ve summarised, to some extent.

You can’t not lose information with this process.

So… while, it probably helps in trivial cases, putting recursive summaries into your prompt is kind of daft, and almost certainly doesn’t actually work when you try use it to do useful things.

It probably looks like it’s working when you don’t use the recursive summaries heavily, because at that point you’re not losing much information.

…but, I’m going to bet that it doesn’t scale, and people using it will find that out pretty quickly as they use it.

anon012012 · on Sept 3, 2023

Technically you can reduce any amount of bits to 1 bit if the compressor/decompressor agree about what informations that bit means. So there would be the full data on client-side/server-side, and you can "communicate" the finale desired data with solely 1 bit. (it's like the blockchain algorithm)

This sounds trite/heretic but it is as fundamental in its nature as the concept of "incompressible data". I agree that you cannot spare any data in the absolute, but I argue that you can communicate information with less bits than the data holds if the peers and the "model" incorporates knowledge about what kind of data is being transmitted.

azeirah · on Sept 3, 2023

I don't disagree with your proof but I would like to add that your example doesn't mean it's not practical to recursively summarise a couple times.

One bit of information can mean more the more parameters an LLM has. The text "a8ckd" could easily mean the entire Declaration of Independence or whatever to an LLM (it doesn't, of course).

It has memorised a lot of stuff and similar to us "bubble sort" means

    // An optimized version of Bubble Sort
    void bubbleSort(int arr[], int n)
    {
        int i, j;
        bool swapped;
        for (i = 0; i < n - 1; i++) {
            swapped = false;
            for (j = 0; j < n - i - 1; j++) {
                if (arr[j] > arr[j + 1]) {
                    swap(&arr[j], &arr[j + 1]);
                    swapped = true;
                }
            }

            // If no two elements were swapped by inner loop,
            // then break
            if (swapped == false)
                break;
        }
    }

That's 26x as much text (counting comments and whitespace). Without comments and removing most indentation it's about 20x as much.

The more information stored in its parameters, the more information a single bit can hold.

All we need to do is find out how many times can you recur until it breaks down? 3 times? 5 times? Just a few times is still genuinely very useful!

arketyp · on Sept 2, 2023

Reminds me of a "bad trip" or OCD patterns. I sometimes think about how little it takes to derail the human mind, by trauma or ontogenically, and how wishful the idea of a human-like AI then seems.

keskival · on Sept 2, 2023

You can just advice it to forget (that is, skip in summarization) things which seem irrelevant.

BoorishBears · on Sept 2, 2023

Or you could use techniques from the 1970s that let you skip the kind of highly repetitive text that causes them to get stuck without losing important concepts (IDF)

potatoman22 · on Sept 3, 2023

This is interesting. How would you incorporate IDF into an LLM?

BoorishBears · on Sept 3, 2023

BM25F (which is based on IDF) to quickly identify relevant past responses based on the question and reply they got.

At each summarization step you can go through and find which Q/A pairs have the highest relevance across all of the other questions in the current batch, which helps solve LLMs tendencies to get stuck in high similar repetitions

Michelangelo11 · on Sept 2, 2023

Kind of a disappointing paper, really. Essentially zero details about their technique, just tables showing that, with the methodology they used, it does indeed achieve good results.

I get that that's par for the course for science these days, but to me as a developer working on LLMs, the paper is essentially valueless (of course, I'm sure it incrementally raises the authors' prestige in their corner of academia, as was no doubt its purpose).

glandium · on Sept 3, 2023

Check the very last page. Their prompt is there.

Roark66 · on Sept 3, 2023

Also: >Code and scripts will be released later.

Hey, wanna team up and author a few articles like that?</sarcasm> I wonder how many it'll take before one can put a title of a "ML researcher" on top of their CV.

ec109685 · on Sept 3, 2023

It’s a simple topic made much much more complex by embedding it into a scientific paper. Way more time should have been spent on examples and the prompt.

BoorishBears · on Sept 2, 2023

I mentioned this in a comment a few weeks ago, but people are oversimplifying the summarization part: https://news.ycombinator.com/item?id=37117515

For a given use-case, long term memory has a subtly different value proposition.

If I'm building a home assistant, I should be using NER to identify names and build an understanding of how they like to be spoken to in messages, or places and how they tend to be commuted to

If I'm building a CS bot, I should be identifying queries that resulted in extended exchanges, or led to a sudden abandoned cart

Summarization at the generic summarization level is enough for flashy demoes, but to build truly useful products right now you need to go a step further

roseway4 · on Sept 3, 2023

I’m struggling to understand what’s novel here. LLM-based summarization of chat history memory is a well-established technique implemented by many LLM frameworks. Summarizing on every message is, as proposed in the paper, a major performance bottleneck and adds significant latency to the chat loop.

Many implementations utilize a fixed sized buffer, progressively summarizing batches of older memories when they fall out of the buffer. Ideally, this is also done out of band to the chat loop.

I’m an author of Zep[0], an open source long-term memory store, and this is how we implemented summarization.

0: https://github.com/getzep/zep

anotherpaulg · on Sept 3, 2023

Aider does this too, using a background thread to summarize the messages older than the last N.

https://github.com/paul-gauthier/aider/blob/main/aider/histo...

px43 · on Sept 3, 2023

Yeah, I'm pretty novice, but I took that hourish long Andrew Ng class on LangChain, and they covered recursive summarization as a standard memory management technique.

https://www.deeplearning.ai/short-courses/langchain-for-llm-...

chrgy · on Sept 3, 2023

Exactly, There is nothing Novel here, even a middle school Chatgpt user would have known this.

Nevermark · on Sept 2, 2023

This is a little sideways to the article/discussion.

The short memory is a real limitation, but I have noticed most critiques of GPT-4's abilities apply as much, or more, to humans.

I don't think anyone alive could convince me they were GPT-4 in a Reverse Turing Test situation. GPT-4's fast organized responses alone hammer human abilities.

But even a team of humans, with 60 minutes to answer each question, could find it difficult to match GPT-4's responses to interesting queries. It would be a fun competition.

samanator · on Sept 3, 2023

The paper's implementation is to essentially append memory text as part of the prompt.

Why don't they use a storage/retrieval system that doesn't consume context window tokens? E.g. Storage could be to automatically categorize data with tags at insertion time (i.e. upon user prompt) and retrieval could be a query that filters using a tag guessed by the LLM (before responding to the user).

With a few initial rules like hard coded tag names/styles, my intuition is that this would produce great results.

MadsRC · on Sept 2, 2023

This seems very close to Langchain's "summary" memory functionality (which seems to have existed since March 2023?) - granted I haven't read the paper just yet

golol · on Sept 3, 2023

How is this original work? I bet you can find 100 implementations of some kind of recursive summarization memory for LLMs on Github, one of them on mine.

magicalhippo · on Sept 3, 2023

I'm not into ML at all, and this was the first idea I had when I learned about the limitations of the context length. Even commented on it here[1] some time later.

Of course, it's one thing to have a good idea, another is to actually implement it well...

[1]: https://news.ycombinator.com/item?id=36531771

wintermutestwin · on Sept 2, 2023

Probably my ignorance talking, but I don't understand why there isn't a layer on top of an LLM that uses good old computer logic to keep track of and validate data.

I am baffled by the fact that when I ask ChatGPT to look up some stock data (for example) it spits out errors. How does it not have a built in ability to check its own output?

a-r-t · on Sept 2, 2023

Because if you had a logic layer capable of validating LLM's output, you wouldn't need the LLM.

turmeric_root · on Sept 3, 2023

c'mon just write a function that takes in text and tells you whether or not it's true, how hard could it be

ben_w · on Sept 2, 2023

I think we still need an LLM to enable the system as a whole to understand vague and half-baked human input.

I can easily ask an LLM to write be a function in a random programming language, then feed the output to a compiler, and pipe errors from the compiler back to the LLM.

What doesn't work so well is typing "pong in java" into a bash shell.

This isn't a perfect solution (not even for small projects), but it does demonstrate that automated validation can improve the output.

a-r-t · on Sept 3, 2023

This is what ChatGPT's Code Interpreter does (writes code in Python and then runs it to check for errors). I'm not sure if it's enabled for everyone yet though.

kendalf89 · on Sept 2, 2023

Even something as simple as censoring swear words would be in line with what openAI are trying to accomplish but they keep lobotomizing the model instead.

BoorishBears · on Sept 3, 2023

It does have the ability, you haven't allowed it to.

If you're completing the sentence "What is the current price of Apple?..." based on the internet as a training source, the most likely reply is not:

"Well, let's think about how we'd go about this, I should do a search for AAPL, and then..."

The most likely completion is: "The price of AAPL is $X"

OpenAI had to "artificially" bias it to say "I can't answer that" or it'd happily tell you a million things like that.

—

On the other hand, give the LLM room to plan, it will plan. If you ask it "To answer x what do you need", it does better at answering x

The most likely completion to "How would you tell me Apple's current price", is a logical step by step process. And that's how it gets a chance to check itself.

I think people underestimate how much logic the model can do in a single pass, LLMs need to output things that we assume can be done with working memory

jamilton · on Sept 2, 2023

What do you mean by "check its own output"?

tayo42 · on Sept 3, 2023

There is, I've seen it called react pattern. I'm pretty sure this is what chat gpts plug-ins feature is

tomr75 · on Sept 2, 2023

people have built their own solutions to validate data like this

won't be one size fits all due to various types of data - and if you could validate all types - why do you need the llm!

caprock · on Sept 2, 2023

This sounds a lot like the chunking concept from learning theory.

m3kw9 · on Sept 2, 2023

This wouldn’t work for ChatGPT right? All it cares about is the input context

hansvm · on Sept 2, 2023

ChatGPT is used in the paper. The problem they're solving is how to throw a loop around the LLM (like ChatGPT) to generate an appropriate input context.

slavboj · on Sept 3, 2023

In human terms, we would call this "reading the minutes".

ipnon · on Sept 2, 2023

Very interesting academies these researchers are based at … makes you wonder what kind of new applications this research enables.

Philpax · on Sept 2, 2023

What are you saying?

yieldcrv · on Sept 2, 2023

7 people got together and wrote this paper

Can someone summarize why this is different/better/more important than what we’ve seen 4 months ago with AutoGPT or even longer ago with the guy who got ChatGPT to make a compression algorithm that other ChatGPT sessions could read and resume conversations from

keskival · on Sept 2, 2023

It seems to be exactly the same as what everyone has been doing for a long time already, and they fail in citing existing implementations.

codefreakxff · on Sept 2, 2023

I haven't read it yet to be certain, but my impression is that rather than using lossless compression of entire chats, this uses lossy summarization to get the gist of chats. There will be tradeoffs between the two methods, hopefully this paper covers that.