I don't get why this question is relevant for evaluating reasoning capacity. Gp...

wongarsu · 2024-11-30T08:23:30 1732955010

Large models have no trouble with this question. Even llama-70B handles it fine, and that is a lot smaller than GPT-4o. But for small models it is a challenging question: llama-8B gets it confidently wrong 4 out of 5 times, and gemma-2-9B gets it wrong pretty much every time. qwen-coder-7B can handle it, so it's not impossible. It's just uncommon for small models to get this question right reliably, which is why I find it noteworthy that this model does.
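
For anyone who wants to reproduce a "wrong 4 out of 5 times" style figure, here is a minimal sketch of the tally. It assumes you supply your own `ask_model` callable (hypothetical, not any particular library's API) that sends the prompt to a model and returns its text answer, and it uses a crude substring match to decide correctness, which may not suit every question.

```python
from typing import Callable


def pass_rate(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, answer text out
    question: str,
    expected: str,
    trials: int = 5,
) -> float:
    """Return the fraction of trials whose answer contains `expected`."""
    correct = 0
    for _ in range(trials):
        answer = ask_model(question)
        # Crude check (an assumption): case-insensitive substring match.
        if expected.lower() in answer.lower():
            correct += 1
    return correct / trials
```

With sampling enabled, a handful of trials only gives a rough picture, so a larger `trials` value is probably worth it before drawing conclusions about a given model.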

jb_briant · 2024-11-30T10:02:21 1732960941

Yes, that makes sense. I didn't take the model size into account, and now that you mention it, it makes a lot of sense.