
> I am developing an evaluation suite so I can keep watching the progress in a systematic way...

Sounds like something that should be published on GitHub.

Open benchmarks are vulnerable to saturation. I think benchmarks should have an embargo period during which only 3% of the question-answer pairs are released, along with an explicit warning not to keep using the benchmark more than 3 months after the full set is released.
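
For concreteness, a minimal Python sketch of that release scheme, assuming the benchmark lives in a hypothetical `benchmark.jsonl` file (one question-answer pair per line); the 3% public fraction and the 3-month window come from the comment above, while the file names and the 30-days-per-month approximation are illustrative assumptions:

    import json
    import random
    from datetime import date, timedelta

    # Assumed inputs/outputs: benchmark.jsonl -> public_sample.jsonl + embargoed.jsonl
    PUBLIC_FRACTION = 0.03
    EMBARGO_MONTHS = 3

    with open("benchmark.jsonl") as f:  # one {"question": ..., "answer": ...} per line
        pairs = [json.loads(line) for line in f]

    random.seed(0)          # fixed seed so the split is reproducible
    random.shuffle(pairs)

    cutoff = max(1, int(len(pairs) * PUBLIC_FRACTION))
    public, embargoed = pairs[:cutoff], pairs[cutoff:]

    # Approximate the embargo as 30 days per month
    release_date = date.today() + timedelta(days=30 * EMBARGO_MONTHS)

    with open("public_sample.jsonl", "w") as f:
        for p in public:
            f.write(json.dumps(p) + "\n")

    with open("embargoed.jsonl", "w") as f:
        for p in embargoed:
            f.write(json.dumps({**p, "do_not_release_before": release_date.isoformat()}) + "\n")

    print(f"{len(public)} public pairs, {len(embargoed)} embargoed until {release_date}")

Seeding the shuffle keeps the public sample stable across reruns, so the same 3% stays public instead of slowly leaking the whole set.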