Yeah, people have a really hard time dealing with data leakage, especially on datasets as large as LLMs need.
Basically, if something appeared online or was transmitted over the wire, it should no longer be eligible to evaluate on. D. Sculley gave a great talk at NeurIPS 2024 (the same conference this paper was in) titled "Empirical Rigor at Scale – or, How Not to Fool Yourself".
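To make that concrete, here's a minimal sketch of one common decontamination heuristic: flag an eval example if any of its word n-grams also appears anywhere in the training corpus. The function names and the 13-gram window are illustrative assumptions, not something from the talk or the paper.

```python
# Hedged sketch of n-gram-overlap decontamination.
# Assumption: a 13-word overlap with training data marks an eval
# example as contaminated (the window size is an arbitrary choice here).

def ngrams(text, n=13):
    # All contiguous word n-grams of the text, lowercased.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(eval_text, train_ngrams, n=13):
    # True if the eval example shares any n-gram with the training set.
    return not ngrams(eval_text, n).isdisjoint(train_ngrams)

# Usage: build the training-side n-gram set once, then screen eval items.
train_docs = ["the quick brown fox jumps over the lazy dog " * 3]
train_ngrams = set().union(*(ngrams(d) for d in train_docs))
```

In practice people hash the n-grams (or use Bloom filters) because the training-side set for an LLM corpus won't fit in memory as raw tuples, but the overlap test is the same idea.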
Basically no one knows how to properly evaluate LLMs.
No, an absolutely massive number of people do. In fact, they have been doing exactly as you recommend, because, as you note, it's obvious and required for a basic proper evaluation.