There weren't any serious examples of degradation.
Does only GPT-4 have to suffer a penalty for HumanEval leaking into training data/RLHF data?
Ignoring those concerns, it fails a reasonableness smell test:
We'd have to pretend it's still the original GPT-4 release from March 2023 until GPT-5 comes out, and only then could OpenAI's work be compared against LLaMA-2 through LLaMA-N.
> 2. GPT-4 does not seem to have improved on real-world coding tasks since March, so it's unclear where any purported HumanEval gains could've come from
Once Markdown formatting is accounted for, the June model's answers on the Leetcode questions from the LLM Drift paper's tests improve to 70% (35/50), vs. the March model's 52% (26/50).
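For clarity, those percentages are just raw pass rates over the 50-question set; a trivial sketch (the helper name is mine, the counts are from the comment above):

```python
# Pass rate over the 50-question Leetcode set cited above.
# Counts come from the comment; the function is purely illustrative.
def pass_rate(passed: int, total: int) -> float:
    return passed / total * 100

print(pass_rate(35, 50))  # June model  -> 70.0
print(pass_rate(26, 50))  # March model -> 52.0
```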
1. TL;DR: OpenAI must verify HumanEval data wasn't used in training in order to compare it?
2. Link in the post you replied to.
3. Subjectivity is fine by me! There's a motte & bailey flavor to it if we combine your comment and this one, c.f. "This is why we use the official numbers."
Also, for a topic like this, subjectivity is all there really is. Even if you create some metric, what you prioritize is going to be subjective. Performance will vary across different sorts of tasks, and there is a literally infinite number of task categories, so you can never truly get a fair sampling.
Because of this, a sample of subjective opinions is probably much more valuable than any official metric, especially if that metric comes from, as you mentioned, individuals/orgs who are highly motivated to game it endlessly. Even when it comes from an external source, you end up with a similar risk of it being gamed. It's like how old-school Google puzzle interviews went from finding who was most clever [in that domain] to finding who'd studied up the most.
Which is both (1) a subjective selection to measure the effectiveness of various chatbots and (2) now subject to gaming from companies using opaque/closed/inaccessible/unverifiable systems, like OpenAI.