Mathematical logic doesn’t refer to ‘complex word problems with big numbers’. The reason that LLMs struggle (well, are hopeless currently) with mathematics problems is not because they involve arithmetic (although that is already enough of a struggle) but because they require a conceptual understanding of the ideas rather than just the manipulation of language: a lot of the goings-on are hidden in the semantic world and aren’t necessarily encoded in the syntax, unlike when writing code.
I've seen it get tripped up if you mix something common with something uncommon. ChatGPT chokes on, for example:
"Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
"Two trains on different and separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
"Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
Right, but I myself missed the trick the first time around reading your comment, and I assure you that I am in fact a general intelligence. (And a relatively intelligent one, if I say so myself!)
To paraphrase XKCD: communicating badly and then acting smug about it when you're misunderstood is not cleverness. And falling for the mistake is not evidence of a lack of intelligence, particularly when emphasizing the trick results in your being understood and ChatGPT PASSING your "test".
The biggest irony here is that the reason I failed, and likely the reason ChatGPT failed the first prompt, is that we were both using semantic understanding: that is, usually, people don't ask deliberately tricky questions.
I suspect if you told it in advance you were going to ask it a deliberately tricky question, that it might actually succeed.
> I suspect if you told it in advance you were going to ask it a deliberately tricky question, that it might actually succeed.
Indeed it does:
"Before answering, please note this is a trick question.
Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
Answer from GPT-4 Turbo:
If two trains are 30 miles apart and are approaching each other at a speed of 10 mph each, their combined approach speed is 20 mph (10 mph + 10 mph). To find the time it takes for them to meet, you divide the distance by their combined speed:
30 miles / 20 mph = 1.5 hours
Therefore, it will take 1.5 hours before they meet (or crash into each other, as stated in the question).
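For what it's worth, the arithmetic in that answer is the standard closing-speed calculation. A minimal sketch of the same computation (assuming both trains hold a constant 10 mph and deliberately ignoring the separate-tracks trick):

    # Closing-speed calculation for two objects moving directly toward each other.
    # Assumes constant speeds; ignores the "separate tracks" detail entirely.
    distance_miles = 30
    speed_a_mph = 10
    speed_b_mph = 10
    closing_speed_mph = speed_a_mph + speed_b_mph          # 20 mph combined
    time_to_meet_hours = distance_miles / closing_speed_mph
    print(time_to_meet_hours)                              # 1.5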
> Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?
Inconclusive. The model hedges with the disclaimer "(or crash into each other, as stated in the question)". LLMs often take a detour and spill their guts without actually answering the question asked. That's a hint that user input shapes the internal world representation much more strongly than one might expect.
Which would be quite unusual for normal trains. That being said, the question implies that they will crash into each other, so you could argue that this is a valid assumption anyway.
Yes. Though I’d say that example is a bit mean (it’s a trick question), since the expected answer has type <time> but the actual answer is something like “don’t be stupid; they’re not even on the same track”. It’s like asking “if I add three apples to two apples, how many pears do I get?” and being surprised when the LLM says “5 pears”. It’s the sort of thing that’s avoidable if the question is written properly.
This comes up time and time again. People claim these models are mind-blowing, but then someone posts something where the model falls flat on its face, and they just get a bunch of "that's too complex" or "that's the wrong type of thing to ask".
So it ends up that these models are awesome if you ask them questions from a narrow set of things and if you assume what they respond with is correct.
Well, at least in this subthread, the model is only failing at the same things humans fail at too. To see the mind-blowing part, stop treating GPT-4 like the Oracle at Delphi and start treating it as the "first thing that comes to mind" answer (aka the inner voice) - and then notice that the failure modes are pretty much the same as with humans. For example, coercing a trick question into a similarly-sounding straight question, and answering it before realizing the person asking is an asshole.
I was originally making the point that these models struggle with even basic mathematics (of the true kind, not arithmetic — though of course they struggle with that too). My point here was to play devil’s advocate and be slightly forgiving of the model, since I as a human am likely to be tripped up by similar trick questions. Since we don’t really know ‘how these models think’ (have much idea of the emergent world model they build) we are stuck in constant debate about whether they’re really quite amazing or absolutely pathetic.
The equation 2+5=10 is never true in standard arithmetic, as 2 plus 5 always equals 7. However, in a different context or under a specific set of rules (such as a puzzle or a riddle where the symbols have different meanings), it could be made to appear true.
GPT-4(phind):
The equation 2 + 3 = 10 is true in a base-5 number system.
In base-5 (also known as quinary), the digits range from 0 to 4. When you add 2 and 3 in base-5, the result is 10. This is because in base-5, the number 5 is represented as 10.
Here's a simple breakdown of how this works:
In base-5, the rightmost digit represents 5^0 (which is 1), the next digit to the left represents 5^1 (which is 5), the next digit represents 5^2 (which is 25), and so on.
So when you add 2 and 3, you get 5. But since 5 is represented as 10 in base-5 (1 digit of 5, and 0 digits of 1), the result is 10.
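As a quick sanity check on that explanation (and, incidentally, the 2+5=10 case quoted further up works the same way in base 7), here is a minimal Python sketch; the to_base helper is purely for illustration:

    # Verify the base-5 claim: the string "10" read in base 5 equals 2 + 3.
    assert int("10", 5) == 2 + 3           # 1*5 + 0*1 == 5

    def to_base(n: int, base: int) -> str:
        # Render a non-negative integer in the given base (illustrative helper only).
        digits = []
        while n:
            digits.append(str(n % base))
            n //= base
        return "".join(reversed(digits)) or "0"

    print(to_base(2 + 3, 5))   # "10"
    print(to_base(2 + 5, 7))   # "10" -- so 2+5=10 does hold in base 7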
I think it's completely reasonable to ask an AI that people want to consider intelligent trick questions. If it's really that smart and capable of reasoning, it should identify the trick.
Some humans will be fooled by the question, sure. But an AI should be smarter than humans, or at least, as smart as an above-average human.
I agree. But you could ask which is more intelligent: recognising a trick question and balking, or recognising that the question as posed doesn’t quite make sense and offering a reformulation together with its answer. It’s not always clear whether something’s a trick, a mistake or a strangely worded (but nonetheless intentionally weird) question. So I think it would be very hard to get it to never fall for any tricks.
I think they've fixed it now, but it does seem to recognize popular trick questions, like "what weighs more, a ton of feathers or a ton of bricks?". It would answer with the typical explanation about density not mattering, etc.
But, it used to fail on "what weighs more, 3 tons of feathers or 2 tons of bricks?".
So, it seems less about what's a trick, and more about what's a common question --> answer pattern.
It's the same with humans. I don't fail this question (in an on-the-spot response) because I fell for it as a kid, then learned the trick, then learned to be suspicious of this trick in similarly-worded questions.
If we're going to call these things "AI" (which I absolutely oppose), I think it's not unreasonable to expect them to get this right. A 5 year old would understand that you don't get pears by adding apples together.
It is confusing even to me. The trains are on separate tracks, but the question implies that they will crash into each other. Which could happen even if they are on separate tracks (a very low chance, but non-zero if there is a malfunction).
Now, even if they were on the same track, it wouldn't mean they would crash into each other, as they could still brake in time.
Some more interesting results. It is much better now at solving tasks in laconic mode (GPT-4 was able to solve these tasks from day one, but it spilled its guts to an unbearable extent):
The mother is older than her daughter 4 times now, in 3 years she will be older then her daughter only 3 times. How old are they both now? Be laconic, do not explain anything.
The mother is 24 years old, the daughter is 6 years old.
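The answer checks out; a minimal brute-force verification of the age puzzle (assuming whole-number ages):

    # Age puzzle: mother = 4 * daughter now, and mother + 3 = 3 * (daughter + 3).
    # Brute force over plausible integer ages.
    solutions = [(4 * d, d) for d in range(1, 120) if 4 * d + 3 == 3 * (d + 3)]
    print(solutions)   # [(24, 6)] -- mother 24, daughter 6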
In a fantasy land (map is 255x255) Karen have a quest to kill a monster (an ogre - a cannibal giant). This isn't an easy task. The ogre is huge and experienced human hunter. Karen has only 1/2 chance to kill this ogre. If she can't kill the ogre from a first attempt she will die. Ogre is located at (12,24), Karen is located at (33,33). Karen can improve her chances to kill an ogre for additional 25% by gathering the nightshades at (77,77). In addition she can receive the elves blessing from elves shaman, wich will increase her chances by additional 25%, at the elves village (125,200). However this blessing is not cost free. She need to bring the fox fur with her as a payment for the blessing ritual. The foxes may be found in a forest which is located between (230,40) and (220,80). For the ritual to be most effective she should hold the nightshades in her hands during the ritual. Find the shortest path for Karen to improve her chances of killing the ogre and survive. Do not explain anything, be laconic, print out the resulting route only.
Karen's route: (33,33) -> (77,77) -> (230,60) -> (125,200) -> (12,24).
The additional explanation "(an ogre - a cannibal giant)" was actually added for LLaMA 2, but I keep it in this wording of the prompt for all models.
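Under some simplifications, the model's ordering can be sanity-checked by enumerating the possible waypoint orders and comparing straight-line path lengths. A rough sketch, assuming the fox forest collapses to the single point (230, 60) the model picked, and that both the nightshades and the fox fur must be collected before the elves' village:

    # Enumerate orders of the three intermediate stops, keeping only orders where
    # the village comes after both the forest (fox fur) and the nightshades,
    # then compare straight-line path lengths from Karen's start to the ogre.
    from itertools import permutations
    from math import dist

    start, ogre = (33, 33), (12, 24)
    stops = {"nightshades": (77, 77), "forest": (230, 60), "village": (125, 200)}

    def path_length(order):
        points = [start] + [stops[name] for name in order] + [ogre]
        return sum(dist(a, b) for a, b in zip(points, points[1:]))

    valid = [
        order for order in permutations(stops)
        if order.index("village") > order.index("forest")
        and order.index("village") > order.index("nightshades")
    ]
    for order in sorted(valid, key=path_length):
        print(order, round(path_length(order), 1))

Under these assumptions, the model's order (nightshades, then forest, then village) does come out shorter than visiting the forest first.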
What is your distinction between ‘statistical’ and ‘discrete’? And what are you responding ‘no’ to?
Logic in the everyday sense (that is, propositional or something like first-order logic) is indeed ‘discrete’ in a certain sense since it is governed by very simple rules and is by definition a formal language. But ‘mathematical logic’ is a completely different thing. I don’t think it’s discrete in the sense you are imagining. It’s much more akin to a mixture of formal derivations massively guided and driven by philosophical and creative — you might say ‘statistical’ — hunches and intuition.
I don't think the missing piece is conceptual understanding. Good LLMs seem to 'understand' most concepts as well as most humans do, even if they're a little less multimodal about it (for now). The common factor here seems to me to be that they're not good at problems that involve hidden intermediate steps. You can trip ChatGPT up pretty easily by telling it not to show its working, while on the same problem, if you tell it to explain its reasoning in steps, it'll do fine.
That's an interesting observation. It seems that, in theory, you could train it to do math if you encoded literally everything in syntax and started at the very beginning (a Principia Mathematica, fifty-pages-to-prove-1+1=2 kind of beginning), then encoded the rest of known mathematics similarly and used that as the training data; although the context window limitations might still stop it from doing anything useful, and it might not work.
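For a rough sense of what "encoding everything in syntax" looks like in practice, proof assistants already do something like this, and the training-data idea amounts to feeding a model that kind of corpus. In Lean 4, for example, the Principia-style derivation of 1+1=2 collapses to a one-line check (just an illustration, not a claim about what such training would achieve):

    -- In Lean 4, 1 + 1 = 2 holds by definitional computation on natural numbers.
    example : 1 + 1 = 2 := rfl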
> because it requires a conceptual understanding of the ideas rather than just being about manipulating language
Yes, that's exactly the point I was trying to make. I just used the example of "complex word problems with big numbers" to differentiate from just normal mathematical statements that any programming language (i.e. deterministic algorithm) can execute.
It wasn’t my intention to give a snarky correction; I just wanted to emphasise that most of what mathematicians call mathematics has very little to do with numbers (as we would usually understand them). Talking about ‘word problems’ and ‘big numbers’ gives the wrong idea about what current LLMs struggle with. Even if they do struggle with these, overcoming that would still not be sufficient for us to say that they can now do mathematics.
Not at all; just that the ‘understanding’-related content of pure mathematics is much less evident on the page. It mostly lives in the heads of mathematicians and is largely independent of notation, whereas writing blocks of code is a task that is literally about using notation — something for which a huge amount of high-quality training data exists. Of course, the higher level ideas in programming and software development are not just about ‘writing code’ itself, but I suspect this is where current models begin to show their weakness.
Still struggling to understand. You're saying that most coding has no high-level ideas and is just boilerplate, and the parts that aren't are the ones LLMs struggle with? This could be true, I can see it.
WRT the understanding not being shown on the page in math, I guess I tend to agree(?). But I think good mathematical papers do show understanding of the ideas too, not just the proofs which result from that understanding. The problem (probably you know this, but just for the benefit of whoever is reading) is that "understanding" in mathematics, at least with respect to producing proofs, often relies on mental models and analogies which are WRONG. Not just vague, but often straight-up incorrect. And you understand also the limitations of where the model goes wrong. And it's kind of embarrassing (I assume) for most people to write wrong statements into papers, even with caveats. For a simple example, there's the meme where, to visualize n-dimensional space, you visualize R^3 and say "n-dimensional" in your head. In this sense I think it's possibly straight-up unhelpful for the authors to impose their mental models on the reader as well (for example, if the reader can actually visualize R^n without this crutch, it would be unhelpful).
But I'm not sure if this is what distinguishes math and programming. There's also the alternative hypothesis that the mental work to generate each additional line of proof is just an order of magnitude higher than the average for code, meaning it usually requires more thought to produce a line of a math proof. Under that hypothesis, we would expect it to be solved by scaling alone. One thing it reminds me of, which is quite different admittedly, is the training of leela-zero on Go. There was a period of time when it would struggle on long ladders, and eventually this was overcome with training alone (despite people not believing at first that it would be resolved). I think in that situation people summarized it afterwards as: in particular situations, humans can search much deeper than elsewhere, and therefore the machine requires more training to match the humans' ability there.
I’ll start with a disclaimer that I don’t know for sure (no one really does) what the difference between solving programming problems and solving pure mathematics problems is (and certainly if you stretch the domains you can emulate each within the other, thereby showing their equivalence… if you like). I’m just speculating, as usual. So if you’re confused, maybe that’s just because I’m wrong.
> most coding has no high level ideas and is just boilerplate, and the ones that aren't are the ones LLM's struggle with?
Pretty much, although calling it boilerplate might be going a bit far.
I’m not here to claim something like ‘mathematicians think and programmers do not’, because that is clearly not the case (and it sounds like a mathematician with a complex of some kind). But it is empirically the case that, so far, GPT-4 and the like are much better at programming than at maths. Why? I think the reason is that whilst the best programmers have a deep understanding of the tools and concepts they use, that understanding isn’t strictly necessary to get things to work. You can probably get away without it (I have ideas about why, but for now that’s not the point). And given the amount of data available on basic programming questions (much more than there is for mathematics), if you’re an LLM it’s quite possible to fake it.
I guess one could also make the point that the space of possible questions in any given programming situation, however large, is still fairly constrained. At least the questions will always be ‘compute this’ or ‘generate one of these’ or something. Whereas you can pick up any undergraduate maths textbook, choose a topic, and if you know what you’re doing it’s easy to ask a question of the form ‘describe what I get if I do this’ or ‘is it true that xyz’ that will trip ChatGPT up because it just generates something that matches the form implied by the question: ‘a mathematical-looking answer’, but doesn’t seem to actually ask itself the question first. It just writes. In perfect Mathematical English. I guess in programming it turns out that ‘a code-looking answer’ for some reason often gives something quite useful.
Another difference that occurs to me is that the equivalent of a fixable syntax error in programming, when made in the context of maths, leads to complete nonsense, because the output is supposed to describe rather than do. The answers are somehow much more sensitive to corruption, which perhaps says something about the data itself.