
> I am developing an evaluation suite so I can keep watching the progress in a systematic way...

Sounds like something that should be published on GitHub.

Open benchmarks are vulnerable to saturation. I think benchmarks should have an embargo period during which only 3% of the question-answer pairs are released, along with an explicit warning not to keep using the benchmark more than 3 months after the full set is released.
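
For concreteness, a minimal Python sketch of that release scheme, assuming the benchmark lives in a hypothetical `benchmark.jsonl` file (one question-answer pair per line); the 3% public fraction and the 3-month window come from the comment above, while the file names and the 30-days-per-month approximation are illustrative assumptions:

    import json
    import random
    from datetime import date, timedelta

    # Assumed inputs/outputs: benchmark.jsonl -> public_sample.jsonl + embargoed.jsonl
    PUBLIC_FRACTION = 0.03
    EMBARGO_MONTHS = 3

    with open("benchmark.jsonl") as f:  # one {"question": ..., "answer": ...} per line
        pairs = [json.loads(line) for line in f]

    random.seed(0)          # fixed seed so the split is reproducible
    random.shuffle(pairs)

    cutoff = max(1, int(len(pairs) * PUBLIC_FRACTION))
    public, embargoed = pairs[:cutoff], pairs[cutoff:]

    # Approximate the embargo as 30 days per month
    release_date = date.today() + timedelta(days=30 * EMBARGO_MONTHS)

    with open("public_sample.jsonl", "w") as f:
        for p in public:
            f.write(json.dumps(p) + "\n")

    with open("embargoed.jsonl", "w") as f:
        for p in embargoed:
            f.write(json.dumps({**p, "do_not_release_before": release_date.isoformat()}) + "\n")

    print(f"{len(public)} public pairs, {len(embargoed)} embargoed until {release_date}")

Seeding the shuffle keeps the public sample stable across reruns, so the same 3% stays public instead of slowly leaking the whole set.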