Well, the model is based on llama-8b, which is quite bad at reasoning. Reasoning (or things that look and quack like reasoning) is more the domain of 70B+ models, and some of the newer 7B models.
The model does well on many reasoning tasks; what they've done is a massive step up from llama-8b. But it still makes some silly mistakes. I'd bet that if you ran the same finetuning procedure with Qwen-7B or llama-70B as the starting point, you'd get a quite competent model.
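For what it's worth, swapping the base checkpoint is usually a one-line change in the training script. A minimal sketch with Hugging Face transformers (the repo names here are my guesses, not whatever the authors actually used):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical base models: swap either in where llama-8b was loaded,
    # then rerun the same finetuning recipe unchanged.
    base = "Qwen/Qwen2.5-7B"  # or "meta-llama/Llama-3.1-70B"

    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")

The 70B run obviously costs a lot more compute, which may be exactly why they started from an 8B model.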