I have found Claude 3.5 Sonnet really good for coding tasks along with the artif...

cubefox · 2024-07-23T18:23:24 1721759004

I have found it to be better than GPT-4o at math too, despite the latter being better at several math benchmarks.

wfme · 2024-07-23T23:20:38 1721776838

My experience reflects this too. My hunch is that GPT-4o was trained to game the benchmarks rather than output higher quality content.

In theory the benchmarks should be a pretty close proxy for quality, but that doesn't match my experience at all.

margorczynski · 2024-07-23T23:55:34 1721778934

A problem with a lot of benchmarks is that they are out in the open so the model basically trains to game them instead of actually acquiring knowledge that would let it solve it. Probably private benchmarks that are not in the training set of these models should give better estimates about their general performance.

Davidzheng · 2024-07-24T01:48:51 1721785731

I personally disagree. But i haven't used sonnet that much

cubefox · 2024-07-24T06:42:17 1721803337

I asked both whether the product of two odds (odds=(probability/(1-probability)) can itself be interpreted as an odds, and if so, which. Neither could solve the problem completely, but Claude 3.5 Sonnet at least helped me to find the answer after a while. I assume the questions in math benchmarks are different.

Alifatisk · 2024-07-24T08:43:38 1721810618

Yeah same experience here aswell, I found Sonnet 3.5 to fulfill my task much better than 4o even though 4o scores higher on benchmarks.