
Every time someone on HN invokes an LLM's scoring on a benchmark as if it meant something, I'll post a link to this paper:

What Will it Take to Fix Benchmarking in Natural Language Understanding?

https://aclanthology.org/2021.naacl-main.385/

To summarise: language models have been beating NLU benchmarks left and right for quite some time now, but that still tells us very little about their true capabilities, because those benchmarks don't measure what they purport to measure.

Or, to rephrase in a way that shows how much I love pattern matching: you keep saying this word, "benchmark". I don't think it means what you think it means.




Thank you for the perspective; I'm an ML outsider, so I'm not familiar with the debates going on within the field. That said, I didn't mean to suggest that these benchmarks were measuring anything other than ...well, performance on the benchmark. I was trying to give the parent commenter something less "subjective" to read that I think connects with the somewhat squishier Twitter discussion.


I think that's a great motivation. But now you can also read the paper I linked above and be more informed about the discussion around benchmarks for NLU (Natural Language Understanding). You can also follow the references in that paper. That will then help you know how to contribute to discussions, in particular when it comes to pointing interlocutors to results, without increasing the amount of noise. We sorely need that right now, so please consider becoming better informed.


That paper makes some very fair criticisms. I think that GPT-4's performance on exams (bar exam, LSAT, AP Bio, etc.) is far more impressive than its ability to beat benchmarks.

My concern is that fragments of those exams might be found in the training set, and it is hard to see how to account for that. Even if we use exams ostensibly written more recently than the training data was collected, exam boards do tend to reuse questions (sometimes verbatim, sometimes slightly modified).
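To make the concern concrete: one rough heuristic for this kind of contamination is checking verbatim n-gram overlap between an exam question and the training corpus. This is just a sketch (the function names and the 8-word shingle size are my own choices, not any exam board's or lab's actual methodology), and it only catches verbatim reuse, not slightly modified questions:

```python
def ngrams(text, n=8):
    """Return the set of n-word shingles in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(exam_question, training_docs, n=8):
    """Fraction of the question's n-grams that appear verbatim
    somewhere in the training corpus (hypothetical heuristic)."""
    question_grams = ngrams(exam_question, n)
    if not question_grams:
        return 0.0
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return len(question_grams & corpus_grams) / len(question_grams)

# A reused question scores high; a freshly written one scores low.
reused = "What is the holding of Marbury v Madison and why does it matter"
training = ["Bar prep notes: what is the holding of Marbury v "
            "Madison and why does it matter for judicial review"]
print(contamination_score(reused, training))        # 1.0
print(contamination_score("A brand new question", training))  # 0.0
```

The limitation is exactly the one above: a question that was paraphrased rather than copied would score near zero here, which is why verbatim-overlap audits systematically understate contamination.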


This was posted on HN today (not by me!):

https://news.ycombinator.com/item?id=35245626

It talks about bar exams and contamination, so spot on :)



