GPT-4o is different from GPT-4; you can "feel" that it is a smaller model that really struggles with reasoning and programming and has much weaker logic.
If you compare it to Claude Sonnet, the larger context window alone considerably improves the answers as well.
Of course there are no objective metrics, but from a user's perspective I can see that Anthropic's coding skills are much better (and it's funny, because in theory, according to benchmarks, Google Gemini is the best, but in reality it is absolutely terrible).
> GPT-4o is different from GPT-4; you can "feel" that it is a smaller model that really struggles with reasoning and programming and has much weaker logic.
FWIW, according to the LMSYS Chatbot Arena this is not the case. In coding, the current GPT-4o (and mini, for that matter) beats GPT-4-Turbo handily, by a margin of 32 points.
By contrast, Claude 3.5 Sonnet is #1, four points ahead of GPT-4o.
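For context on what those point gaps actually mean: Arena scores are Elo-style ratings, so a rating gap translates into an expected head-to-head win rate. A minimal sketch, assuming the standard Elo logistic curve with a 400-point scale:

```python
# Convert an Elo-style rating gap into the leader's expected
# head-to-head win probability (standard 400-point logistic scale).
def expected_win_rate(rating_gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

# A 32-point lead is only a ~55% expected win rate head-to-head,
# and a 4-point lead is barely above a coin flip.
print(f"{expected_win_rate(32):.3f}")  # ~0.546
print(f"{expected_win_rate(4):.3f}")   # ~0.506
```

In other words, "beats handily by 32 points" still means losing nearly half of the individual matchups, which is one reason subjective impressions can diverge so sharply from leaderboard rank.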
I'm a firm believer that the best benchmark is playing around with the model for an hour or so, on the kinds of tasks that are relevant to you and your work, of course.
I've also found GPT-4o to be subjectively less intelligent than GPT-4. The gap especially shows up when more complex reasoning is required, e.g., on macroeconomic questions or in other domains where the interactions between factors matter, or where subtle aspects of the question or domain are important.