
Thanks for the feedback! Yes, we're looking to improve quality in the coming months. A couple of notes:

- The initial use of the data is distillation, so we're less bound by question quality (anything that elicits output diversity is good). A rough sketch of what that step looks like follows after these notes.

- But moving on to RL, we'll need stronger quality. We have much better things planned for both data filtering and verification!

- Surprisingly, a lot of ML datasets actually look like this when you look under the hood. We're hoping that having more eyeballs on it will improve quality in the long run, compared to the less transparent status quo!
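
As promised above, a minimal sketch of the distillation step, assuming a teacher served through the Hugging Face transformers pipeline. The model name, prompts, and sampling settings here are placeholders for illustration, not our actual setup:

    from transformers import pipeline

    # Placeholder teacher; in practice this would be a much stronger model.
    teacher = pipeline("text-generation", model="gpt2")

    prompts = [
        "Explain why the sky is blue.",
        "What is 17 * 24?",
    ]

    # Sample several completions per prompt: for distillation, diversity of
    # the teacher's outputs matters more than how polished the questions are.
    for prompt in prompts:
        for out in teacher(prompt, max_new_tokens=128, do_sample=True,
                           temperature=0.9, num_return_sequences=4):
            print(out["generated_text"])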



I still don't understand why all the datasets have so many general knowledge questions and so much math, when so few people can do any of that stuff.

It makes sense for ASI research I suppose, but why are we trying to teach small models to do stuff almost no humans even try to do?

What happens if you train them with RAG context in the prompts and calculator calls in the CoT?
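Roughly what I have in mind, as a toy supervised example. The <calc> tag format and field names are made up for illustration, not any dataset's actual schema:

    import ast, re

    def run_calculator(expr: str) -> str:
        """Safely evaluate plain arithmetic (numbers and + - * / only)."""
        tree = ast.parse(expr, mode="eval")
        allowed = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
                   ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)
        if not all(isinstance(node, allowed) for node in ast.walk(tree)):
            raise ValueError(f"unsupported expression: {expr!r}")
        return str(eval(compile(tree, "<calc>", "eval")))

    example = {
        # Retrieved passages go in the prompt instead of relying on memorized facts.
        "prompt": ("Context: The Eiffel Tower is 330 m tall. The Empire State "
                   "Building is 443 m tall.\n"
                   "Question: How much taller is the Empire State Building?"),
        # The chain of thought delegates arithmetic to a tool call.
        "cot": "Difference in height: <calc>443 - 330</calc> = 113.",
        "answer": "113 m",
    }

    # At generation time a harness would intercept the tool call, run it,
    # and splice the result back into the chain of thought.
    call = re.search(r"<calc>(.*?)</calc>", example["cot"]).group(1)
    print(run_calculator(call))  # -> 113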


Many math questions are easy to verify, and math is a classic benchmark for reasoning, so it's a good hill to climb.
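
As a toy example of that verifiability: a binary RL reward can be as simple as an exact rational comparison of final answers. The normalization rules here are illustrative assumptions, not any particular pipeline's:

    from fractions import Fraction

    def normalize(ans: str) -> Fraction | None:
        """Parse an answer string like '0.5' or '1/2' into an exact rational."""
        ans = ans.strip().replace(",", "").rstrip(".")
        try:
            return Fraction(ans)
        except (ValueError, ZeroDivisionError):
            return None

    def reward(model_answer: str, reference: str) -> float:
        """1.0 if the answers are numerically identical, else 0.0."""
        a, b = normalize(model_answer), normalize(reference)
        return 1.0 if a is not None and a == b else 0.0

    print(reward("0.5", "1/2"))   # 1.0 -- equivalent forms match exactly
    print(reward("13", "13.0"))   # 1.0
    print(reward("12", "13"))     # 0.0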

I agree with your meta-point that better benchmarks testing more types of tasks would be good!



