I've been playing with ChatGPT and am impressed by its ability to produce code and, to a certain extent, to seemingly reason about novel questions and situations and to engage in abstract analysis, even on entirely new subjects.
However, it really struggles with basic counting compared with most people.
For example, I tested it by asking about primes in certain intervals. First I asked, "Are there more prime numbers between 1 and 10 or between 10 and 20?" After it answered correctly, I asked, "Are there more prime numbers between 10 and 20 or between 20 and 30?" It answered:
"There are four prime numbers between 10 and 20: 11, 13, 17, and 19. There are also six prime numbers between 20 and 30: 23, 29, 31, 37, 41, and 43. So there are more prime numbers between 20 and 30 than between 10 and 20."
The middle sentence clearly contains an error: the only primes between 20 and 30 are 23 and 29, and 31, 37, 41, and 43 don't even fall in that interval. Why does ChatGPT make such errors and struggle so much with numbers?
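For reference, a quick brute-force check of the actual counts (just a throwaway Python sketch, nothing to do with ChatGPT itself):

```python
# Count the primes in each interval (endpoints excluded; none of them are prime anyway).
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

for lo, hi in [(10, 20), (20, 30)]:
    primes = [n for n in range(lo + 1, hi) if is_prime(n)]
    print(f"between {lo} and {hi}: {primes} ({len(primes)} primes)")
```

That prints four primes between 10 and 20 and only two between 20 and 30, so the correct conclusion is the opposite of the one it gave.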
What would it take to "fix" it so it could be more human-like in its use of numbers?
ChatGPT is the same kind of thing as ELIZA; it's just billions of times more complicated (and optimized by gradient descent instead of written by a person). But it's still really just responses to patterns.
So asking why GPT struggles is like asking why ELIZA struggles. It doesn't have a mind or an internal model of the world; it just has responses to patterns. There are "in distribution" cases where it gives the answer you expect, but outside of those, the model fails arbitrarily, and because it has no mind, it has nothing to sanity-check its output against, so it often looks silly.
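To make "responses to patterns" concrete, here's a toy ELIZA-style responder (my own illustration; the rules are made up, not Weizenbaum's actual script): a few regex rules mapped to canned replies, with nothing behind them that could check whether a reply makes sense.

```python
import random
import re

# Toy ELIZA-style responder: just pattern -> canned-reply rules, no model of the world.
RULES = [
    (r"\bI am (.+)", ["Why do you say you are {0}?", "How long have you been {0}?"]),
    (r"\bI feel (.+)", ["What makes you feel {0}?"]),
    (r"\b(mother|father|family)\b", ["Tell me more about your family."]),
]
DEFAULTS = ["Please go on.", "I see.", "Can you elaborate on that?"]

def respond(text):
    for pattern, replies in RULES:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return random.choice(replies).format(*match.groups())
    return random.choice(DEFAULTS)

print(respond("I am worried about prime numbers"))
print(respond("Are there more primes between 20 and 30?"))  # no rule matches -> stock reply
```

ChatGPT's "patterns" are vastly richer and learned rather than hand-written, but the basic situation is the same: output is selected by how well the input matches learned patterns, with no separate faculty checking whether it's right.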
You could try training a math-specific language model, though as I understand it, neural networks generally aren't good at math because they can't extrapolate (they interpolate very well). For example, it's challenging to train an NN that learns to tell whether numbers are even or odd from labeled even/odd examples, as sketched below.
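Here's the kind of experiment you could run to see that (a minimal sketch using scikit-learn; the exact setup is my assumption): train a small MLP on the raw integers 0-999 with parity labels, then test it on 1000-1099. In my experience a setup like this fails to extrapolate beyond the training range, and with the raw value as its only input it often struggles even in-range, which is rather the point.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Train a small MLP to predict parity from the raw integer value,
# then test it on numbers outside the training range.
train_x = np.arange(0, 1000).reshape(-1, 1)
train_y = train_x.ravel() % 2            # 0 = even, 1 = odd
test_x = np.arange(1000, 1100).reshape(-1, 1)
test_y = test_x.ravel() % 2

clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
clf.fit(train_x, train_y)

print("in-range accuracy:    ", clf.score(train_x, train_y))
print("out-of-range accuracy:", clf.score(test_x, test_y))
```

Feed it the binary digits as input features instead and the task becomes trivial (parity is just the last bit), which underlines that the network pattern-matches on whatever representation you hand it rather than learning "divisible by 2" as a concept.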
You could have some sort of math loop as part of a system that uses a language model, say to flag when it spits out incorrect math, but that would be equivalent to hard-coding something. As I understand it, ChatGPT has some kind of RL layer on top of the language model that performs that supervisory function, but since it's also a neural network it suffers from the same problem, and itself doesn't know when it's wrong.
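A toy sketch of what such a hard-coded math loop might look like (my own illustration, not how ChatGPT actually works): take the model's text, pull out claims of the form "N prime numbers between A and B", recompute the count, and flag mismatches. The `query_llm` call in the comment is a hypothetical stand-in for whatever API you'd actually use.

```python
import re

def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def check_prime_claims(answer):
    """Flag claims like 'There are six prime numbers between 20 and 30'."""
    words = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
             "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
    pattern = r"(\d+|\w+) prime numbers between (\d+) and (\d+)"
    issues = []
    for count, lo, hi in re.findall(pattern, answer, re.IGNORECASE):
        claimed = int(count) if count.isdigit() else words.get(count.lower())
        if claimed is None:
            continue
        actual = sum(is_prime(n) for n in range(int(lo) + 1, int(hi)))
        if claimed != actual:
            issues.append(f"claimed {claimed} primes between {lo} and {hi}, actual {actual}")
    return issues

# answer = query_llm("Are there more primes between 10 and 20 or 20 and 30?")  # hypothetical API
answer = ("There are four prime numbers between 10 and 20: 11, 13, 17, and 19. "
          "There are also six prime numbers between 20 and 30: 23, 29, 31, 37, 41, and 43.")
print(check_prime_claims(answer))  # -> ['claimed 6 primes between 20 and 30, actual 2']
```

Of course this only catches one very narrow class of mistakes, which is exactly the hard-coding problem: you'd have to anticipate every kind of claim you want to verify.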