It does seem hacky, but then again the whole concept of conversational LLMs is. You're just asking it to add an extra word to a given conversation and after a bit, it spits out an end token that tells your application to hand control back to the user.
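Concretely, the loop the application runs is something like this. A minimal sketch, assuming a placeholder generate_next_token function and a made-up end-of-turn marker; neither is a real API:

    END_OF_TURN = "<|end|>"  # placeholder marker, not a real token name

    def run_turn(conversation, generate_next_token):
        """Ask the model for one more token at a time until it signals it's done."""
        reply = []
        while True:
            token = generate_next_token(conversation + reply)
            if token == END_OF_TURN:
                break  # hand control back to the user
            reply.append(token)
        return "".join(reply)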
I think latent space and text space aren't as far apart as you think. LLMs are pretty stupid, but very good at speech. They are good at writing code because that's very similar, but fall apart in things that need some actual abstract thinking, like math.
Those text space hacks do tend to work and stuff like "think step by step" has become common because of that.
LoRAs are closer to what you mean and they're great at packing a lot of understanding into very little data. But adjusting weights for a single conversation just isn't feasible yet, so we're exploring text space for that purpose. Maybe someone will transfer the methods we discover in text space to embedding space to make them more efficient, but that's for the future.
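For anyone unfamiliar with why LoRAs are so compact, here's a minimal numpy sketch of the idea; the shapes and the rank are made up for illustration:

    import numpy as np

    d_out, d_in, rank = 1024, 1024, 8

    W = np.random.randn(d_out, d_in) * 0.02   # frozen pretrained weight matrix
    A = np.random.randn(rank, d_in) * 0.02    # trainable low-rank factor
    B = np.zeros((d_out, rank))               # trainable, zero at init so W is unchanged at the start

    def forward(x):
        # Original path plus a low-rank correction B @ A on top of the frozen weights.
        return W @ x + B @ (A @ x)

    # Only A and B need to be stored per adaptation.
    print(A.size + B.size, "adapter parameters vs", W.size, "for a full weight update")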
Pattern reproduction is very close to speech in my opinion. Formal grammars even have it in the name and approaches like https://news.ycombinator.com/item?id=37125118 show that LLMs are indeed very fit for that purpose.
I think I have to walk that claim about math back and try to phrase what I meant differently:
LLMs have a hard time with problems that don't translate well into the text space, i.e. abstract problems. Math used to be one of those because early tokenizers were designed just with text in mind and LLMs weren't good enough to overcome those limitations.
OpenAI put a lot of effort into their tokenizers to make GPT-3.5 and GPT-4 better at math specifically.
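You can see the tokenization issue directly; a quick sketch using the tiktoken package (the exact splits depend on the encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("4897 * 394")
    # Digits are often grouped into multi-digit chunks rather than one token per digit,
    # which makes digit-level arithmetic awkward for the model.
    print([enc.decode([t]) for t in tokens])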
The second paper you linked is very interesting and I think it supports my original assertion that text space and latent space are close. The first graph shows GPT-3.5 doing much better at pattern reproduction and language tasks while humans still hold an advantage in the more abstract tasks like story analogies. The fact that higher-order relations are a key thing being measured maybe makes this task a bit too perfect for arguing my case, but it does show that humans have an advantage in more abstract situations.
I think any problem that can be viewed as being mostly a form of translation is a good one for LLMs and if you can express a problem as that, you can get better results.
To get back to the main point: latent space and text space, or feature space in general, being close is what I believe causes all of this. Happy to hear counterexamples.
I have played around with GPT-4 and some fairly simple but completely new math ideas. It was fabulous at identifying special cases I overlooked, that disproved conjectures.
I was playing around with prime numbers, and simple made up relationships between them, such as between the square of a prime N vs. the set of primes smaller than N, etc.
It caught me out with specific examples that violated my conjectures. In one case the conjecture held for all but a single example; another conjecture was generally true but not for 2 and 3.
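For flavor, this is roughly the kind of brute-force counterexample hunt involved. The conjecture below ("p^2 is congruent to 1 mod 24 for every prime p") is a made-up stand-in, not one from the actual session; it happens to fail only for 2 and 3:

    def primes_up_to(n):
        """Simple sieve of Eratosthenes."""
        sieve = [True] * (n + 1)
        sieve[:2] = [False, False]
        for i in range(2, int(n ** 0.5) + 1):
            if sieve[i]:
                sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
        return [p for p, is_prime in enumerate(sieve) if is_prime]

    def conjecture(p):
        """Hypothetical conjecture: p^2 is congruent to 1 modulo 24."""
        return p * p % 24 == 1

    print([p for p in primes_up_to(1000) if not conjecture(p)])  # [2, 3]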
In one case it thought a conjecture I made was wrong, and I had to push it to think through why it thought it was wrong until it realized the conjecture was right. As soon as it had its epiphany, it corrected all its logic around that concept.
It was very simple stuff, but an interesting exercise.
The part I enjoyed the most was seeing GPT-4's understanding move and change as we pushed back on each other's views. You miss out on that impressive aspect of GPT-4 in simpler sessions.
Have you tried formalizing your ideas with Isabelle? It has a constraint solver and will often find counterexamples to false arithmetical propositions[1].
I have not been able to figure out how that would help in the context of this discussion. As I see it, what’s very interesting here is that an LLM is able to do this.
I think the point is that an LLM is not the right tool for deep reasoning, and Isabelle and others are much better tools for it, even though the community keeps trying to apply LLMs in this area, following the current wave of hype.
Isabelle is good at counterexamples in ways few other proof assistants are. In general its automation is excellent, partly because it uses a less powerful logic (HOL instead of CIC; more expressive logics are harder to write automation for). It's not obsolete.
Nice to see the number of levels of reasoning depth mentioned. I personally believe the size of a (well-trained) LLM determines how many steps of reasoning in sequence it can approximate. Newer models get deeper and deeper, giving them deeper reasoning context windows. My hypothesis is that you don't need infinite reasoning depth, just a bit more than GPT-4 has. I think once you can tie your output together with thinking in terms of ~10+ reasoning steps, you'll be very close to human performance.
It is not obvious to me how you came to such a conclusion.
LLMs got lots of investment: tens of billions of dollars and tons of compute, maybe more than any other tech in history, and they still can't crack 3-step reasoning. It sounds like a tech limitation.
None of these systems or their training sets have been specifically tailored to tackle abstract reasoning or math, so that seems like a premature conclusion. The fact that they're decent at programming despite that is interesting.
They're also brand new and at some undetermined point on the sigmoid curve. Trying to predict where you are on the curve while in the middle of a sigmoid is a fool's errand; the best you can do is make random predictions and hope you're accidentally correct so you can become a pundit later.
Kinda able to do some math tasks some of the time, whereas you can use techniques from the arithmetic textbook to get the right answer all of the time with millions of times less CPU, even including the overhead of round-tripping to ASCII numerals, which is shockingly large compared to what a multiply costs.
Kinda "the problem" with LLMs is that they successfully seduce people by seeming to get the right answer to anything 80% of the time.
The arithmetic issues are well documented and understood; it's a problem of sub-token manipulation, which has nothing to do with reasoning. (Similar to calling blind people unintelligent because they can't read the IQ test.)
And the better LLMs can easily write code to do the arithmetic that they suck at...
Excellent analogy. LLMs are capable of many extraordinary things, and it's a shame people dismiss them because they fail to live up to some specific test they invented.
Yeah, but if you can only do arithmetic right X% of the time, you aren't going to get other answers right as often as would really be useful.
That said, LLMs have a magic ability to "short circuit" and get the right answer despite not being able to get the steps right. I remember scoping out designs for NLP systems about 5 years ago and frequently concluding "that won't work" because information was lost at an early stage, but in retrospect, by short-circuiting, a system like that can outperform its parts. It still faces a ceiling on how accurate the answers are, though, because the reasoning is not sound.
When you add in various patterns, double-checks, and memorized previous results, what human reasoning can do is astounding. But it is very, very far from sound.
and that one said that Cyc had over 1,100 special purpose reasoning engines. The general purpose resolution solver was nowhere near fast enough to be really useful.
which would be capable in principle of finding a winning move in a chess position, but because it worked by exhaustive search it would take too long in practice. The thing is that a good chess-playing program is not generally intelligent, just as a chess grandmaster isn't necessarily good at anything other than chess; it just has special-purpose heuristics (as opposed to algorithms) that find good chess moves.
ChatGPT-like systems will be greatly improved by coupling them to other systems such as "write a Python/SQL script, then run it", "run a query against Bing and summarize the results", and "go find the chess engine and ask it what move to make". That is, like Cyc, it will get a Swiss Army knife of tools that help it do things it's not good at, but that doesn't create general intelligence any more than Cyc did.
Roger Penrose, in The Emperor's New Mind, suggests that there must be some quantum magic in the human mind because the human mind is able to solve any math problem whereas any machine is limited by Gödel's theorem. It's silly, however, because humans aren't capable of proving just any theorem either: look at how we struggled with Fermat's Last Theorem for nearly 360 years.
The difference might be that humans feel bad when they get the wrong answer whereas ChatGPT certainly doesn't (as much as its empty apology can be satisfying to people). This isn't just an attribute of humans; working with other animals such as horses, I'm convinced that they feel bad when they screw up too.
> it will get a Swiss Army knife of tools that help it do things it's not good at, but that doesn't create general intelligence any more than Cyc did
How do you know general intelligence is its own thing and not just a Swiss army knife of tools?
> because the human mind is able to solve any math problem whereas any machine is limited by Gödel's theorem
Any machine can be programmed to solve any problem at all, if the proof system is inconsistent. Which is probably exactly the case with humans. We work around it because different humans have different inconsistencies, so checking each other's work is how we average out those inconsistencies.
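That's the principle of explosion: in an inconsistent system every proposition is provable. A one-line Lean 4 sketch of that point:

    -- From a proof of False (i.e. an inconsistency), any proposition Q follows.
    example (Q : Prop) (h : False) : Q := False.elim h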
(As a person who went down the rabbit hole of knowledge-based systems and looked at Cyc quite a bit.)
Three forms of intelligence are (i) animal intelligence, (ii) language use, and (iii) abstract thinking.
Animals are intelligent in their own way, particularly socially intelligent. My wife runs a riding barn and it is clear to me that one of the things horses are most interested in is what the people and other horses are up to, and that a horse doesn't just have an opinion about other horses; they have an opinion about what the other horses think about a horse. (e.g. Cyc has a system of microtheories and modalized logic that tries to get at this. Of course visual recognition and similar things are a big part of animal intelligence, and boy have neural nets made progress there.)
Language is a unique capability of humans. (which Cyc made no real contribution to.)
If you get a PhD, what you learn is how to develop systems of abstract thinking, or at the very least go to conferences and acquire them, or dig through the literature, dust them off, and get them working. There is the aspect of individual creativity but also the "standing on the shoulders of giants" that Newton talked about.
Before Lenat started on Cyc he was interested in expert systems for building expert systems, or at the very least a set of development tools for doing the same, and that was a motivation for Cyc, even if the point of Cyc was to produce new knowledge bases and reasoning procedures that would live inside Cyc. The trouble is that this was a tortuous procedure. I did go through a phase of thinking about evaluating OpenCyc for a project, but it had the problem that it would have taken at least six months just to get started on a project that could be finished some other way much more quickly.
My own journey led through twists and turns, but I came to see it as something like systems software development, where you build tools like compilers and debuggers that transform inputs into a knowledge base and put it to work. I very much gave up on "embedding in itself", though.
As for problems in general, I don't really know if they can all be solved. Isn't it possible that there is no finite procedure to prove the Collatz conjecture?
No it's not. Language is well documented in dolphins, for instance. Crows have also demonstrated self-awareness and the ability to do arithmetic. I think your 3-part breakdown of intelligence is out of date. There's no rigorous evidence that intelligence breaks down in this way; it's just a "folk theory" at this point.
The mobile app doesn't offer it though, and also has a system prompt that causes some strange behavior - sometimes it will put emojis in the text and then apologize for using emojis.
They don't have 9x9 puzzles. Any guesses as to why they only tried 3x3, 4x4, and 5x5 but not 9x9?
This work is interesting. I wouldn't have guessed 3x3 puzzles would be solvable by a large Markov chain. It would be interesting to know how large of a context is necessary to solve 9x9 puzzles. No existing model can currently solve 9x9 puzzles even though the recursive backtracking algorithm can solve any given puzzle in less than a second.
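For reference, here's a minimal sketch of that recursive backtracking solver for a standard 9x9 grid (0 marks an empty cell):

    def valid(grid, r, c, v):
        """Check whether value v can legally go at row r, column c."""
        if any(grid[r][j] == v for j in range(9)):
            return False
        if any(grid[i][c] == v for i in range(9)):
            return False
        br, bc = 3 * (r // 3), 3 * (c // 3)
        return all(grid[br + i][bc + j] != v for i in range(3) for j in range(3))

    def solve(grid):
        """Fill the grid in place; return True once a solution is found."""
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    for v in range(1, 10):
                        if valid(grid, r, c, v):
                            grid[r][c] = v
                            if solve(grid):
                                return True
                            grid[r][c] = 0
                    return False
        return True  # no empty cells left, puzzle solved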
Why are people so intent on incorrectly asserting these models are Markov chains? It makes sense to use the analogy as an educational tool for exposition, but it more often seems that many use it as a way to minimize the notion that these models could ever possibly be useful for anyone. Is this just simply to make it more intuitive for others that it's a sequence model? Because it seems about as helpful as 'email is just bits' when everyone and their grandma knows about the relation between transformers, GAT, and circulant matrices.
As others have pointed out, maybe intelligence derived from language just isn't very good at math? It's not like linear algebra comes naturally to humans; we have to be specially trained. I've been taking Khan Academy classes and believe me, math sure doesn't come naturally to me.
I realize tempers are high on this subject, but I literally just wanted to point it out, in case you hadn't seen it. I wasn't trying to dunk on you or anything.
I'm not sure about its math, but GPT-4 fails miserably at simple arithmetic questions like 897*394=?
GPT-3.5 Turbo is fine-tuned for arithmetic according to ClosedAI (noted in one of the change logs), so it is sometimes slightly better, but it nevertheless always fails equations like 4897*394=?
I would guess that they have a later model on web than on API (I also see worse results on API with 0613). Further testing shows that it loses the plot after a few more digits, which wouldn't make sense if they were injecting calculations.
But you are smart enough to use a computer or calculator. And AI is a computer. So the naive expectation would be that it would be capable of doing as well as a computer.
Also, you probably could do long multiplication with paper and pencil if you needed to. So a reasoning AI (which has read many many descriptions of how to do long multiplication) should be able to also.
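For comparison, the schoolbook procedure those descriptions spell out is just this; a toy sketch working digit by digit with carries:

    def long_multiply(a: str, b: str) -> str:
        """Schoolbook long multiplication on decimal strings."""
        da = [int(d) for d in reversed(a)]
        db = [int(d) for d in reversed(b)]
        result = [0] * (len(da) + len(db))
        for i, x in enumerate(da):
            carry = 0
            for j, y in enumerate(db):
                total = result[i + j] + x * y + carry
                result[i + j] = total % 10
                carry = total // 10
            result[i + len(db)] += carry
        while len(result) > 1 and result[-1] == 0:  # strip leading zeros
            result.pop()
        return "".join(str(d) for d in reversed(result))

    print(long_multiply("897", "394"))  # 353418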
> And AI is a computer. So the naive expectation would be that it would be capable of doing as well as a computer.
Why would you judge an AI against the expectations of a naive person who doesn't understand capabilities AIs are likely to have? If an alien came down to earth and concluded humans weren't intelligent because the first person it met couldn't simulate quantum systems in their head, would that be fair?
The original question was whether LLMs are "smart" in a human-like way. I think that if you gave a human a computer, he'd be able to solve 3-digit multiplications. If LLMs were human-like smart, they could do this too.
I mean, I'm running on incredible amounts of highly complex physics and maths, but that doesn't mean I can give you the correct answer to all questions on those.
Minecraft runs on a computer too, but you don't expect the Minecraft NPCs to be able to do math.
So it's a very naive assumption.
Most people struggle with long multiplication despite not only having learnt the rules, but having had extensive reinforcement training in applying the rules.
Getting people conditioned to stay on task for repetitive and detail oriented tasks is difficult. There's little reason to believe it'd be easier to get AIs to stay on task, in part because there's a tension between wanting predictability and wanting creativity and problem solving. Ultimately I think the best solution is the same as for humans: tool use. Recognise that the effort required to do some things "manually" is not worth it.
> But you are smart enough to use a computer or calculator. And AI is a computer. So the naive expectation would be that it would be capable of doing as well as a computer.
I disagree. The AI runs on a computer, but it isn't one (in the classical sense). Otherwise you could reduce humans the same way - technically our cells are small (non-classical) computers, and we're made up of chemistry. Yet you don't expect humans to be perfect at resolving chemical reactions, or computing complex mathematics in their heads.
They can reason through it; they just sometimes make mistakes along the way, which is not surprising. More relevant to your comment is that if you give GPT-4 a calculator, it'll use it in these cases.
Sometimes ChatGPT fails at tasks like counting the occurrences of a letter in a short string or checking if two nodes are connected in a simple forest of 7 nodes, even with a chain-of-thought prompt. Humans can solve those pretty easily.
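Both checks are trivial to do mechanically; a small sketch, where the seven-node forest below is made up for illustration:

    from collections import defaultdict, deque

    def count_letter(s, letter):
        """Count occurrences of a single character in a string."""
        return sum(1 for ch in s if ch == letter)

    def connected(edges, a, b):
        """Breadth-first search from a, looking for b."""
        adj = defaultdict(list)
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        seen, queue = {a}, deque([a])
        while queue:
            node = queue.popleft()
            if node == b:
                return True
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    print(count_letter("transformer", "r"))                   # 3
    print(connected([(1, 2), (2, 3), (4, 5), (6, 7)], 1, 5))  # False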
It absolutely FAILS for the simple problem of 1+1 in what I like to call 'bubble math'.
1+1=1
Or actually, 1+1=1 and 1+1=2, with some probability for each outcome.
Because bubbles can be put together and either merge into one, or stay as two bubbles with a shared wall.
Obviously this can be extended and formalized, but hopefully it also displays that mathematics isn't even guaranteed to provide the same answer for 1+1, since it depends on the context and rules you set up (mod, etc).
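One made-up way to formalize it, with an assumed 50/50 merge chance since the comment leaves the odds open:

    import random

    def bubble_add_one_plus_one(merge_prob=0.5):
        """Push two bubbles together: they merge into one, or stay two with a shared wall."""
        return 1 if random.random() < merge_prob else 2

    results = [bubble_add_one_plus_one() for _ in range(10_000)]
    print(results.count(1), results.count(2))  # roughly half and half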
I should also mention that GPT-4 does astoundingly well at this type of problem wherein new rules are made up on the fly. So in-context learning is powerful, and the idea that it 'just regurgitates training data' for simple problems is quite false.
> LLMs are pretty stupid, but very good at speech. They are good at writing code because that's very similar, but fall apart in things that need some actual abstract thinking, like math.
Isn't that more of a training-method issue? Try teaching a caveman to count by making him memorize stick-and-word pairs the way an LLM does, and you will get similar results, as he won't know about the stateful counting algorithm.
Humans get the powerful reasoning ability through gradual learning of new abstractions, and won't be able to extract anything useful from a textbook on quantum physics until they learned the basics first.