Hacker News new | past | comments | ask | show | jobs | submit login

For what it's worth, as always 99% benchmarks are very unreliable and per-task performance still greatly differs per model, with plenty of cases where results are wildly different.

I have a task I use in my work where Gemini 1.5-Pro is SOTA. Handily beating o1, Sonnet-3.5, Gemini-exp and everyone else, very consistently and significantly.

The newer/bigger models are better at reasoning and especially coding, but there's plenty of tasks that have little overlap with those skills.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: