The analogy I've used before is the old Eliza chatbot, which responds based on patterns in the input (e.g. "I am X" / "How long have you been X?").
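To make that concrete, here's a toy sketch of the pattern/response idea in Python. This is my own illustration, not the actual ELIZA script (the real thing also does tricks like swapping "my" for "your"):

```python
import re

# A couple of hard-coded pattern -> response rules, Eliza-style.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
]

def respond(utterance: str) -> str:
    # Return the canned response for the first rule that matches,
    # echoing the captured text back; otherwise fall back to a stock line.
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1).rstrip(".!?"))
    return "Tell me more."

print(respond("I am worried about my exam"))
# -> How long have you been worried about my exam?
```

There's no understanding anywhere in that loop; it just keys off surface patterns and fills in a template.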
ChatGPT is the same idea, just billions of times more complicated (and optimized by gradient descent instead of by a person). But at bottom it's still responses to patterns.
So asking why GPT struggles is like asking why Eliza struggles: it doesn't have a mind or an internal model of the world, just responses to patterns. On "in distribution" inputs it gives the answer you expect, but outside of that the model fails arbitrarily, and because it has no mind it has nothing to sanity-check its output against, so it often looks silly.
You could try training a math-specific language model, though as I understand it, neural networks generally aren't good at math because they can't extrapolate (they interpolate very well). For example, it's challenging to train a NN that learns to tell whether numbers are even or odd, given even/odd training pairs (see the toy sketch below).
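Here's roughly what I mean, as a toy experiment of my own (the exact numbers depend heavily on how you encode the inputs and on hyperparameters, so treat them as illustrative only):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Naive setup: feed the raw integer in and ask for its parity.
X_train = np.arange(0, 1000).reshape(-1, 1)
y_train = X_train.ravel() % 2

clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

# Compare numbers it was trained on vs. numbers well outside the training range.
X_out = np.arange(10_000, 11_000).reshape(-1, 1)
y_out = X_out.ravel() % 2

# Even the training accuracy is often unimpressive with this naive encoding
# (which is part of the point), and out of range it's typically ~0.5, i.e. chance.
print("train accuracy:", clf.score(X_train, y_train))
print("out-of-range accuracy:", clf.score(X_out, y_out))
```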
You could have some sort of math loop as part of a system that uses a language model, say to flag when it spits out incorrect math, but that would be equivalent to hard-coding something (a rough sketch of what I mean is below). As I understand it, ChatGPT has some kind of RL layer on top of the language model that performs that supervisory function, but being a neural network itself, it suffers from the same problem and doesn't know when it's wrong.
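By a "math loop" I mean something hard-coded along these lines. The function and regex here are mine, purely to illustrate the idea; this is not how ChatGPT actually works:

```python
import re

def check_arithmetic(text: str) -> list[str]:
    """Recompute simple '<int> <op> <int> = <int>' claims found in the text
    and return a warning for each one that doesn't check out."""
    warnings = []
    pattern = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    for a, op, b, claimed in pattern.findall(text):
        a, b, claimed = int(a), int(b), int(claimed)
        actual = ops[op](a, b)
        if actual != claimed:
            warnings.append(f"'{a} {op} {b} = {claimed}' looks wrong (should be {actual})")
    return warnings

print(check_arithmetic("We need 12 * 7 = 94 chairs in total."))
# -> ["'12 * 7 = 94' looks wrong (should be 84)"]
```

That kind of checker only catches the narrow cases you hard-code it for, which is exactly the limitation I'm pointing at.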