
I don't get why this question is relevant for evaluating reasoning capacity. GPT-4o (no reasoning in an anthropomorphic sense) answers correctly:

--- The reasoning lies in the concept of mass and weight. The weight of an object is determined by its mass, not its material.

1. Mass comparison:

2kg of feathers has a mass of 2 kilograms.

1kg of lead has a mass of 1 kilogram.

Since 2 kilograms is greater than 1 kilogram, the feathers are heavier.

2. Irrelevance of material:

The type of material (feathers or lead) does not change the mass measurement.

Lead is denser than feathers, so 1kg of lead takes up much less space than 2kg of feathers, but the weight is still based on the total mass.

Thus, 2kg of any substance, even something as light as feathers, is heavier than 1kg of a dense material like lead.






Large models have no issues with this question at all. Even llama-70B can handle it without trouble, and that is a lot smaller than GPT-4o. But for small models this is a challenging question. llama-8B gets it confidently wrong 4 out of 5 times. gemma-2-9B gets it wrong pretty much every time. qwen-coder-7B can handle it, so it's not impossible. It's just uncommon for small models to reliably get this question right, which is why I find it noteworthy that this model does.
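
For anyone who wants to reproduce a "wrong 4 out of 5 times" kind of number, here is a rough sketch of how such a tally could be done. Everything in it is an assumption for illustration: the prompt wording, the `is_correct` string check, and the `make_stub` stand-in for a model are all hypothetical, and no real model API is called.

```python
from typing import Callable, Iterable

# Assumed wording of the question; the thread does not quote the exact prompt.
PROMPT = "Which is heavier: 2kg of feathers or 1kg of lead?"


def is_correct(answer: str) -> bool:
    """Hypothetical grader: accept answers that say the feathers are heavier."""
    text = answer.lower()
    return "feathers are heavier" in text or "feathers weigh more" in text


def error_rate(ask: Callable[[str], str], trials: int = 5) -> float:
    """Ask the same question `trials` times and return the fraction answered wrong."""
    wrong = sum(not is_correct(ask(PROMPT)) for _ in range(trials))
    return wrong / trials


def make_stub(answers: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for a model: replays canned answers, ignoring the prompt."""
    it = iter(answers)
    return lambda prompt: next(it)


# Canned answers imitating a small model that is wrong 4 times out of 5.
canned = ["They weigh the same."] * 4 + ["The feathers are heavier."]
print(error_rate(make_stub(canned), trials=5))
```

To try this against a real model, `ask` would be replaced by a function that sends the prompt to it and returns the reply text. A plain substring check is a crude grader, so the answers should still be read by hand.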

Yes, that makes sense. I didn't take the model size into account; now that you mention it, it makes a lot of sense.


