It does seem hacky, but then again the whole concept of conversational LLMs is. You're just asking it to add an extra word to a given conversation and after a bit, it spits out an end token that tells your application to hand control back to the user.
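Concretely, the loop the application runs is something like this. A minimal sketch, assuming a placeholder generate_next_token function and a made-up end-of-turn marker; neither is a real API:

    END_OF_TURN = "<|end|>"  # placeholder marker, not a real token name

    def run_turn(conversation, generate_next_token):
        """Ask the model for one more token at a time until it signals it's done."""
        reply = []
        while True:
            token = generate_next_token(conversation + reply)
            if token == END_OF_TURN:
                break  # hand control back to the user
            reply.append(token)
        return "".join(reply)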
I think latent space and text space aren't as far apart as you think. LLMs are pretty stupid, but very good at speech. They are good at writing code because that's very similar, but fall apart in things that need some actual abstract thinking, like math.
Those text space hacks do tend to work and stuff like "think step by step" has become common because of that.
LoRAs are closer to what you mean and they're great at packing a lot of understanding into very little data. But adjusting weights for a single conversation just isn't feasible yet, so we're exploring text space for that purpose. Maybe someone will transfer the methods we discover in text space to embedding space to make them more efficient, but that's for the future.
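For anyone unfamiliar with why LoRAs are so compact, here's a minimal numpy sketch of the idea; the shapes and the rank are made up for illustration:

    import numpy as np

    d_out, d_in, rank = 1024, 1024, 8

    W = np.random.randn(d_out, d_in) * 0.02   # frozen pretrained weight matrix
    A = np.random.randn(rank, d_in) * 0.02    # trainable low-rank factor
    B = np.zeros((d_out, rank))               # trainable, zero at init so W is unchanged at the start

    def forward(x):
        # Original path plus a low-rank correction B @ A on top of the frozen weights.
        return W @ x + B @ (A @ x)

    # Only A and B need to be stored per adaptation.
    print(A.size + B.size, "adapter parameters vs", W.size, "for a full weight update")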
Pattern reproduction is very close to speech in my opinion. Formal grammars even have it in the name and approaches like https://news.ycombinator.com/item?id=37125118 show that LLMs are indeed very fit for that purpose.
I think I have to walk that claim about math back and try to phrase what I meant differently:
LLMs have a hard time with problems that don't translate well into the text space, i.e. abstract problems. Math used to be one of those because early tokenizers were designed just with text in mind and LLMs weren't good enough to overcome those limitations.
OpenAI put a lot of effort into their tokenizers to make GPT-3.5 and GPT-4 better at math specifically.
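You can see the tokenization issue directly; a quick sketch using the tiktoken package (the exact splits depend on the encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("4897 * 394")
    # Digits are often grouped into multi-digit chunks rather than one token per digit,
    # which makes digit-level arithmetic awkward for the model.
    print([enc.decode([t]) for t in tokens])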
The second paper you linked is very interesting and I think it supports my original assertion that text space and latent space are close. The first graph shows GPT-3.5 doing much better at pattern reproduction and language tasks while humans still hold an advantage in the more abstract tasks like story analogies. The fact that higher-order relations are a key thing being measured maybe makes this task a bit too perfect for arguing my case, but it does show that humans have an advantage in more abstract situations.
I think any problem that can be viewed as being mostly a form of translation is a good one for LLMs and if you can express a problem as that, you can get better results.
To get back to the main point: latent space and text space, or feature space in general, being close is what I believe causes all of this. Happy to hear counterexamples.
I have played around with GPT-4 and some fairly simple but completely new math ideas. It was fabulous at identifying special cases I overlooked, that disproved conjectures.
I was playing around with prime numbers, and simple made up relationships between them, such as between the square of a prime N vs. the set of primes smaller than N, etc.
It caught me out with specific examples that violated my conjectures. In one case the conjecture held for all but a single example; another conjecture was generally true but not for 2 and 3.
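For flavor, this is roughly the kind of brute-force counterexample hunt involved. The conjecture below ("p^2 is congruent to 1 mod 24 for every prime p") is a made-up stand-in, not one from the actual session; it happens to fail only for 2 and 3:

    def primes_up_to(n):
        """Simple sieve of Eratosthenes."""
        sieve = [True] * (n + 1)
        sieve[:2] = [False, False]
        for i in range(2, int(n ** 0.5) + 1):
            if sieve[i]:
                sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
        return [p for p, is_prime in enumerate(sieve) if is_prime]

    def conjecture(p):
        """Hypothetical conjecture: p^2 is congruent to 1 modulo 24."""
        return p * p % 24 == 1

    print([p for p in primes_up_to(1000) if not conjecture(p)])  # [2, 3]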
In one case it thought a conjecture I made was wrong, and I had to push it to think through why it thought it was wrong until it realized the conjecture was right. As soon as it had its epiphany, it corrected all its logic around that concept.
It was very simple stuff, but an interesting exercise.
The part I enjoyed the most was seeing GPT-4's understanding move and change as we pushed back on each other's views. You miss out on that impressive aspect of GPT-4 in simpler sessions.
Have you tried formalizing your ideas with Isabelle? It has a constraint solver and will often find counterexamples to false arithmetical propositions[1].
I have not been able to figure out how that would help in the context of this discussion. As I see it, what’s very interesting here is that an LLM is able to do this.
I think the point is that an LLM is not the right tool for deep reasoning, and Isabelle and others are much better tools for it, even though the community keeps trying to apply LLMs in this area, following the current wave of hype.
Isabelle is good at counterexamples in ways few other proof assistants are. In general its automation is excellent, partly because it uses a less powerful logic (HOL instead of CIC; more expressive logics are harder to write automation for). It's not obsolete.
Nice to see the number of levels of reasoning depth mentioned. I personally believe the size of a (well-trained) LLM determines how many steps of reasoning in sequence it can approximate. Newer models get deeper and deeper, giving them deeper reasoning context windows. My hypothesis is that you don't need infinite reasoning depth, just a bit more than GPT-4 has. I think once you can tie your output together with thinking in terms of ~10+ reasoning steps, you'll be very close to human performance.
It is not obvious to me how you came to such a conclusion.
LLMs got lots of investment: tens of billions of dollars and tons of compute, maybe more than any other tech in history, and they still can't crack 3-step reasoning. It sounds like a tech limitation.
None of these systems or their training sets have been specifically tailored to tackle abstract reasoning or math, so that seems like a premature conclusion. The fact that they're decent at programming despite that is interesting.
They're also brand new and at some undetermined point on the sigmoid curve. Trying to predict where you are on the curve while in the middle of a sigmoid is a fool's errand; the best you can do is make random predictions and hope you're accidentally correct so you can become a pundit later.
Kinda able to do some math tasks some of the time, whereas you can use techniques from the arithmetic textbook to get the right answer all of the time with millions of times less CPU, even including the overhead of round-tripping to ASCII numerals, which is shockingly large compared to what a multiply costs.
Kinda "the problem" with LLMs is that they successfully seduce people by seeming to get the right answer to anything 80% of the time.
The arithmetic issues are well documented and understood; it's a problem of sub-token manipulation, which has nothing to do with reasoning. (Similar to calling blind people unintelligent because they can't read the IQ test.)
And the better LLMs can easily write code to do the arithmetic that they suck at...
Excellent analogy. LLMs are capable of many extraordinary things, and it's a shame people dismiss them because they fail to live up to some specific test they invented.
Yeah, but if you can only do arithmetic right X% of the time, you aren't going to get other answers right as often as would really be useful.
That said, LLMs have a magic ability to "short circuit" and get the right answer despite not being able to get the steps right. I remember scoping out designs for NLP systems about 5 years ago and frequently concluding "that won't work" because information was lost at an early stage, but in retrospect, by short-circuiting, a system like that can outperform its parts. It still faces a ceiling on how accurate the answers are, though, because the reasoning is not sound.
When you add in various patterns, double-checks, and memorized previous results, what human reasoning can do is astounding. But it is very, very far from sound.
and that one said that Cyc had over 1,100 special purpose reasoning engines. The general purpose resolution solver was nowhere near fast enough to be really useful.
which would be capable in principle of finding a winning move in a chess position, but because it worked by exhaustive search it would take too long in practice. The thing is that a good chess-playing program is not generally intelligent, just as a chess grandmaster isn't necessarily good at anything other than chess; it just has special-purpose heuristics (as opposed to algorithms) that find good chess moves.
ChatGPT-like systems will be greatly improved by coupling them to other systems such as "write a Python/SQL script, then run it", "run a query against Bing and summarize the results", and "go find the chess engine and ask it what move to make". That is, like Cyc, it will get a Swiss Army knife of tools that help it do things it's not good at, but that doesn't create general intelligence any more than Cyc did.
Roger Penrose, in The Emperor's New Mind, suggests that there must be some quantum magic in the human mind because the human mind is able to solve any math problem whereas any machine is limited by Gödel's theorem. It's silly, however, because humans aren't capable of proving just any theorem either: look at how we struggled with Fermat's Last Theorem for nearly 360 years.
The difference might be that humans feel bad when they get the wrong answer whereas ChatGPT certainly doesn't (as much as its empty apology can be satisfying to people). This isn't just an attribute of humans; working with other animals such as horses, I'm convinced that they feel bad when they screw up too.
> it will get a Swiss Army knife of tools that help it do things it's not good at, but that doesn't create general intelligence any more than Cyc did
How do you know general intelligence is its own thing and not just a Swiss army knife of tools?
> because the human mind is able to solve any math problem whereas any machine is limited by Gödel's theorem
Any machine can be programmed to solve any problem at all, if the proof system is inconsistent. Which is probably exactly the case with humans. We work around it because different humans have different inconsistencies, so checking each other's work is how we average out those inconsistencies.
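That's the principle of explosion: in an inconsistent system every proposition is provable. A one-line Lean 4 sketch of that point:

    -- From a proof of False (i.e. an inconsistency), any proposition Q follows.
    example (Q : Prop) (h : False) : Q := False.elim h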
(As a person who went down the rabbit hole of knowledge-based systems and looked at Cyc quite a bit.)
Three forms of intelligence are (i) animal intelligence, (ii) language use, and (iii) abstract thinking.
Animals are intelligent in their own way, particularly socially intelligent. My wife runs a riding barn and it is clear to me that one of the things horses are most interested in is what the people and other horses are up to, and that a horse doesn't just have an opinion about other horses; they have an opinion about what the other horses think about a horse. (e.g. Cyc has a system of microtheories and modalized logic that tries to get at this. Of course visual recognition and similar things are a big part of animal intelligence, and boy have neural nets made progress there.)
Language is a unique capability of humans. (which Cyc made no real contribution to.)
If you get a PhD, what you learn is how to develop systems of abstract thinking, or at the very least go to conferences and acquire them, or dig through the literature, dust them off, and get them working. There is the aspect of individual creativity but also the "standing on the shoulders of giants" that Newton talked about.
Before Lenat started on Cyc he was interested in expert systems for building expert systems, or at the very least a set of development tools for doing the same, and that was a motivation for Cyc, even if the point of Cyc was to produce new knowledge bases and reasoning procedures that would live inside Cyc. The trouble is that this was a tortuous procedure. I did go through a phase of thinking about evaluating OpenCyc for a project, but it had the problem that it would have taken at least six months just to get started on a project that could be finished some other way much more quickly.
My own journey led through twists and turns, but I came to see it as something like systems software development, where you build tools like compilers and debuggers that transform inputs into a knowledge base and put it to work. I very much gave up on "embedding in itself", though.
As for problems in general, I don't really know if they can all be solved. Isn't it possible that there is no finite procedure to prove the Collatz conjecture?
No it's not. Language is well documented in dolphins, for instance. Crows have also demonstrated self-awareness and the ability to do arithmetic. I think your 3-part breakdown of intelligence is out of date. There's no rigorous evidence that intelligence breaks down in this way; it's just a "folk theory" at this point.
The mobile app doesn't offer it though, and also has a system prompt that causes some strange behavior - sometimes it will put emojis in the text and then apologize for using emojis.
They don't have 9x9 puzzles. Any guesses as to why they only tried 3x3, 4x4, and 5x5 but not 9x9?
This work is interesting. I wouldn't have guessed 3x3 puzzles would be solvable by a large Markov chain. It would be interesting to know how large of a context is necessary to solve 9x9 puzzles. No existing model can currently solve 9x9 puzzles even though the recursive backtracking algorithm can solve any given puzzle in less than a second.
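For reference, here's a minimal sketch of that recursive backtracking solver for a standard 9x9 grid (0 marks an empty cell):

    def valid(grid, r, c, v):
        """Check whether value v can legally go at row r, column c."""
        if any(grid[r][j] == v for j in range(9)):
            return False
        if any(grid[i][c] == v for i in range(9)):
            return False
        br, bc = 3 * (r // 3), 3 * (c // 3)
        return all(grid[br + i][bc + j] != v for i in range(3) for j in range(3))

    def solve(grid):
        """Fill the grid in place; return True once a solution is found."""
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    for v in range(1, 10):
                        if valid(grid, r, c, v):
                            grid[r][c] = v
                            if solve(grid):
                                return True
                            grid[r][c] = 0
                    return False
        return True  # no empty cells left, puzzle solved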
Why are people so intent on incorrectly asserting these models are Markov chains? It makes sense to use the analogy as an educational tool for exposition, but it more often seems that many use it as a way to minimize the notion that these models could ever possibly be useful for anyone. Is this just simply to make it more intuitive for others that it's a sequence model? Because it seems about as helpful as 'email is just bits' when everyone and their grandma knows about the relation between transformers, GAT, and circulant matrices.
As others have pointed out, maybe intelligence derived from language just isn't very good at math? It's not like linear algebra comes naturally to humans; we have to be specially trained. I've been taking Khan Academy classes and believe me, math sure doesn't come naturally to me.
I realize tempers are high on this subject, but I literally just wanted to point it out, in case you hadn't seen it. I wasn't trying to dunk on you or anything.
I'm not sure about its math, but GPT-4 fails miserably at simple arithmetic questions like 897*394=?
GPT-3.5 Turbo is fine-tuned for arithmetic according to ClosedAI (noted in one of the change logs), so it is sometimes slightly better, but it nevertheless always fails equations like 4897*394=?
I would guess that they have a later model on web than on API (I also see worse results on API with 0613). Further testing shows that it loses the plot after a few more digits, which wouldn't make sense if they were injecting calculations.
But you are smart enough to use a computer or calculator. And AI is a computer. So the naive expectation would be that it would be capable of doing as well as a computer.
Also, you probably could do long multiplication with paper and pencil if you needed to. So a reasoning AI (which has read many many descriptions of how to do long multiplication) should be able to also.
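For comparison, the schoolbook procedure those descriptions spell out is just this; a toy sketch working digit by digit with carries:

    def long_multiply(a: str, b: str) -> str:
        """Schoolbook long multiplication on decimal strings."""
        da = [int(d) for d in reversed(a)]
        db = [int(d) for d in reversed(b)]
        result = [0] * (len(da) + len(db))
        for i, x in enumerate(da):
            carry = 0
            for j, y in enumerate(db):
                total = result[i + j] + x * y + carry
                result[i + j] = total % 10
                carry = total // 10
            result[i + len(db)] += carry
        while len(result) > 1 and result[-1] == 0:  # strip leading zeros
            result.pop()
        return "".join(str(d) for d in reversed(result))

    print(long_multiply("897", "394"))  # 353418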
> And AI is a computer. So the naive expectation would be that it would be capable of doing as well as a computer.
Why would you judge an AI against the expectations of a naive person who doesn't understand capabilities AIs are likely to have? If an alien came down to earth and concluded humans weren't intelligent because the first person it met couldn't simulate quantum systems in their head, would that be fair?
The original question was whether LLMs are "smart" in a human-like way. I think that if you gave a human a computer, he'd be able to solve 3-digit multiplications. If LLMs were human-like smart, they could do this too.
I mean, I'm running on incredible amounts of highly complex physics and maths, but that doesn't mean I can give you the correct answer to all questions on those.
Minecraft runs on a computer too, but you don't expect the Minecraft NPCs to be able to do math.
So it's a very naive assumption.
Most people struggle with long multiplication despite not only having learnt the rules, but having had extensive reinforcement training in applying the rules.
Getting people conditioned to stay on task for repetitive and detail oriented tasks is difficult. There's little reason to believe it'd be easier to get AIs to stay on task, in part because there's a tension between wanting predictability and wanting creativity and problem solving. Ultimately I think the best solution is the same as for humans: tool use. Recognise that the effort required to do some things "manually" is not worth it.
> But you are smart enough to use a computer or calculator. And AI is a computer. So the naive expectation would be that it would be capable of doing as well as a computer.
I disagree. The AI runs on a computer, but it isn't one (in the classical sense). Otherwise you could reduce humans the same way - technically our cells are small (non-classical) computers, and we're made up of chemistry. Yet you don't expect humans to be perfect at resolving chemical reactions, or computing complex mathematics in their heads.
They can reason through it; they just sometimes make mistakes along the way, which is not surprising. More relevant to your comment is that if you give GPT-4 a calculator, it'll use it in these cases.
Sometimes ChatGPT fails at tasks like counting the occurrences of a letter in a short string or checking if two nodes are connected in a simple forest of 7 nodes, even with a chain-of-thought prompt. Humans can solve those pretty easily.
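Both checks are trivial to do mechanically; a small sketch, where the seven-node forest below is made up for illustration:

    from collections import defaultdict, deque

    def count_letter(s, letter):
        """Count occurrences of a single character in a string."""
        return sum(1 for ch in s if ch == letter)

    def connected(edges, a, b):
        """Breadth-first search from a, looking for b."""
        adj = defaultdict(list)
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        seen, queue = {a}, deque([a])
        while queue:
            node = queue.popleft()
            if node == b:
                return True
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    print(count_letter("transformer", "r"))                   # 3
    print(connected([(1, 2), (2, 3), (4, 5), (6, 7)], 1, 5))  # False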
It absolutely FAILS for the simple problem of 1+1 in what I like to call 'bubble math'.
1+1=1
Or actually, 1+1=1 and 1+1=2, with some probability for each outcome.
Because bubbles can be put together and either merge into one, or stay as two bubbles with a shared wall.
Obviously this can be extended and formalized, but hopefully it also displays that mathematics isn't even guaranteed to provide the same answer for 1+1, since it depends on the context and rules you set up (mod, etc).
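One made-up way to formalize it, with an assumed 50/50 merge chance since the comment leaves the odds open:

    import random

    def bubble_add_one_plus_one(merge_prob=0.5):
        """Push two bubbles together: they merge into one, or stay two with a shared wall."""
        return 1 if random.random() < merge_prob else 2

    results = [bubble_add_one_plus_one() for _ in range(10_000)]
    print(results.count(1), results.count(2))  # roughly half and half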
I should also mention that GPT-4 does astoundingly well at this type of problem wherein new rules are made up on the fly. So in-context learning is powerful, and the idea that it 'just regurgitates training data' for simple problems is quite false.
> LLMs are pretty stupid, but very good at speech. They are good at writing code because that's very similar, but fall apart in things that need some actual abstract thinking, like math.
Isn't that more of a training-method issue? Try teaching a caveman to count by making him memorize stick-and-word pairs the way an LLM does, and you will get similar results, as he won't know about the stateful counting algorithm.
Humans get the powerful reasoning ability through gradual learning of new abstractions, and won't be able to extract anything useful from a textbook on quantum physics until they learned the basics first.