The issue is the obsession with benchmark datasets and their flaky evaluation.



What else could you do to test it, besides "it works for me" and "this test said it's good at talking"?



