
Your comment is regarding LLMs, but Q* may not refer to an LLM. As such, our intuition about the failures of LLMs may not apply. The name Q* likely refers to a model based on deep reinforcement learning.

To comment: in my personal experience, reinforcement learning agents learn in a more relatable, human way than traditional ML models, which act like stupid aliens. RL agents try something a bunch of times, mess up, and tweak their strategy. After some extreme level of experience, they can make wider strategic decisions that are a little less myopic. RL agents can take in their own output, since their actions modify the environment. RL agents also modify the environment during training (which, I think you'll agree, is important if you're trying to learn the influence of your own actions as a basic concept). LLMs, and traditional ML in general, are never trained in a loop on their own output. But in DRL, this is normal.
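That try/mess-up/tweak loop can be sketched in a few lines. This is a minimal, illustrative tabular Q-learning agent in a made-up five-cell corridor environment (all names here are hypothetical, not from any library); note that the agent's own actions drive the state it trains on, which is the point above.

```python
import random

N_STATES = 5          # corridor cells 0..4; reward waits at the right end
ACTIONS = [-1, +1]    # step left or right
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """The agent's action modifies the environment's state."""
    nxt = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

random.seed(0)
for episode in range(200):              # try something a bunch of times...
    state, done = 0, False
    while not done:
        if random.random() < EPS:       # occasionally explore (mess up)
            action = random.choice(ACTIONS)
        else:                           # otherwise act on current strategy
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        # ...and tweak the strategy toward what worked
        target = reward + GAMMA * max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = nxt

# After training, the greedy policy heads right from every non-goal state.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
```

The loop is trained entirely on states the agent reached through its own earlier actions, which is exactly the self-consuming training signal that supervised LLM pretraining lacks.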

So if RL is so great and superior to traditional ML, why isn't RL used for everything? Well, the full time horizon a DRL agent can take into consideration is very limited, often a handful of frames, or distilled frame predictions. That prevents them from learning things like math. Traditionally, RL bots have only been used for things like robotic locomotion, chess, and Go: short-term decision making given one or a few frames of data. I don't even think any RL bots have learned to read English yet lol.
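One way to make the limited-horizon point concrete: with a discount factor gamma, a reward k steps in the future is weighted by gamma**k, so the agent is effectively blind beyond roughly 1/(1-gamma) steps. A quick illustrative calculation (the numbers are just arithmetic, not from any specific system):

```python
# Effective planning horizon under discounting: reward k steps ahead
# is weighted gamma**k, which decays fast past ~1/(1-gamma) steps.
for gamma in (0.9, 0.99):
    horizon = 1 / (1 - gamma)
    weight_100 = gamma ** 100
    print(f"gamma={gamma}: ~{horizon:.0f}-step horizon, "
          f"weight of a reward 100 steps away = {weight_100:.5f}")
```

At gamma=0.9 a reward 100 steps away is weighted under 0.0001, i.e. effectively invisible, which is why multi-day reasoning tasks like math are out of reach for a short-horizon agent.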

For me, as a human, my frame predictions exist on the scale of days, months, and years. To learn math I've had to sit and do nothing for hours and days at a time, consuming my own output. For a classical RL bot, math is out of the question.

But my physical actions, for ambulation, manipulation, and balance, are made for me by specialized high-speed neural circuits that operate on short time horizons, taking in my high-level intentions along with all the muscle positions, activations, sensor data, etc. Physical movement is abstracted away from me almost entirely. (RL has so far been good at tasks like this.)

With a longer frame horizon, one that predicts frames far into the future, RL could make long-term decisions. It would likely take a lifetime to train. So you see now why math has not yet been accomplished by RL, but I don't think the faculty would be impossible to build into an ML architecture.

An RL bot that does math would likely spin on its own output for many, many frames until deciding it is done, much like a person.



