optimalsolver 1 day ago | on: Some critical issues with the SWE-bench dataset

You need benchmarks with the following three properties:

1) No known solutions, so there's no "ground truth" dataset to train on
2) Presumably hard to solve
3) But easy to verify a solution if one is provided

This, of course, is easier done on the STEM side of things, but how do you automatically test creativity, or philosophical aptitude?
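As a rough illustration of the "hard to solve, easy to verify" property, here is a minimal sketch using integer factorization as a stand-in task. The task choice and the function name `verify_factorization` are assumptions made for this example, not something the thread or any benchmark specifies. Finding the factors of a large number is believed to be hard, but checking a proposed answer only takes a multiplication:

```python
from math import prod
from typing import Sequence


def verify_factorization(n: int, factors: Sequence[int]) -> bool:
    """Return True if `factors` is a nontrivial factorization of `n`.

    Hypothetical verifier for illustration only: it checks a proposed
    solution cheaply, without knowing how to find one itself.
    """
    # A nontrivial factorization needs at least two factors.
    if len(factors) < 2:
        return False
    # Each factor must lie strictly between 1 and n.
    if any(f <= 1 or f >= n for f in factors):
        return False
    # The check itself is just a product comparison.
    return prod(factors) == n
```

A grader built this way needs no stored answer key: `verify_factorization(91, [7, 13])` accepts the submission, while `verify_factorization(91, [7, 12])` rejects it. Nothing comparable exists for the creativity case, which is the gap the question points at.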

hsuduebc2 1 day ago

I guess it's purely subjective. Maybe some internal commission, when it comes to judging the quality of creative work?