
Every time someone on HN invokes an LLM's scoring on a benchmark as if it meant something, I'll post a link to this paper:

What Will it Take to Fix Benchmarking in Natural Language Understanding?

https://aclanthology.org/2021.naacl-main.385/

To summarise: language models have been beating NLU benchmarks left and right for quite some time now, but that still tells us very little about their true capabilities, because those benchmarks don't measure what they purport to measure.

Or, to rephrase in a way that shows how much I love pattern matching: you keep saying this word, "benchmark". I don't think it means what you think it means.




Thank you for the perspective; I'm an ML outsider, so I'm not familiar with the debates going on within the field. That said, I didn't mean to suggest that these benchmarks were measuring anything other than ...well, performance on the benchmark. I was trying to give the parent commenter something less "subjective" to read that I think connects with the somewhat squishier Twitter discussion.


I think that's a great motivation. But now you can also read the paper I linked above and be more informed about the discussion around benchmarks for NLU (Natural Language Understanding). You can also follow the references in that paper. That will then help you know how to contribute to discussions, in particular when it comes to pointing interlocutors to results, without increasing the amount of noise. We sorely need that right now, so please consider becoming better informed.


That paper makes some very fair criticisms. I think that GPT-4's performance on exams (bar exam, LSAT, AP Bio, etc.) is far more impressive than its ability to beat benchmarks.

My concern is that fragments of those exams might be found in the training set, and it is hard to see how to account for that. Even if we use exams ostensibly written more recently than the training data was collected, exam boards do tend to reuse questions (sometimes verbatim, sometimes slightly modified).
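To make the concern concrete: one rough heuristic for this kind of contamination is checking verbatim n-gram overlap between an exam question and the training corpus. This is just a sketch (the function names and the 8-word shingle size are my own choices, not any exam board's or lab's actual methodology), and it only catches verbatim reuse, not slightly modified questions:

```python
def ngrams(text, n=8):
    """Return the set of n-word shingles in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(exam_question, training_docs, n=8):
    """Fraction of the question's n-grams that appear verbatim
    somewhere in the training corpus (hypothetical heuristic)."""
    question_grams = ngrams(exam_question, n)
    if not question_grams:
        return 0.0
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return len(question_grams & corpus_grams) / len(question_grams)

# A reused question scores high; a freshly written one scores low.
reused = "What is the holding of Marbury v Madison and why does it matter"
training = ["Bar prep notes: what is the holding of Marbury v "
            "Madison and why does it matter for judicial review"]
print(contamination_score(reused, training))        # 1.0
print(contamination_score("A brand new question", training))  # 0.0
```

The limitation is exactly the one above: a question that was paraphrased rather than copied would score near zero here, which is why verbatim-overlap audits systematically understate contamination.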


This was posted on HN today (not by me!):

https://news.ycombinator.com/item?id=35245626

It talks about bar exams and contamination, so spot on :)



