
What data did you use to train and how do you evaluate your model for overfitting? I ask due to the issues with the HumanEval dataset.

-------------

For those who are unfamiliar with the issues, allow me to elaborate. You can find the dataset at the parent's link or here[0], and the paper here[1].

I'll quote from the paper. The first quote is from page 2, right above the GitHub link; the second is from page 4, section 2.2. (Note: this paper has 58 authors... 58.)

> To accurately benchmark our model, we create a dataset of 164 original programming problems with unit tests. These problems assess language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.

> It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources. For example, there are more than ten public repositories containing solutions to Codeforces problems, which make up part of the recently proposed APPS dataset

So we take from this that the problems are simple, leetcode-style questions, and that the authors' verification that the test data is not in the training set amounts to having hand-written the problems from scratch. If you aren't laughing now, you should be. So let's check whether exact or near copies of this code existed in public GitHub repositories before May 2020, their training cutoff date.
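To be concrete about what I mean by "near," here is a minimal sketch (my own illustration, not the paper's methodology) of one way to flag near-duplicates: normalize the code a bit and measure string similarity. The 0.8 threshold is arbitrary.

    # Minimal near-duplicate check using only the standard library.
    import difflib
    import re

    def normalize(code: str) -> str:
        # Drop comments and collapse whitespace so formatting differences don't matter.
        code = re.sub(r"#.*", "", code)
        return " ".join(code.split())

    def near_duplicate(candidate: str, reference: str, threshold: float = 0.8) -> bool:
        ratio = difflib.SequenceMatcher(None, normalize(candidate), normalize(reference)).ratio()
        return ratio >= threshold

    # A rename of `scores` to `numbers` is all that separates these two lines.
    print(near_duplicate(
        'return sum(abs(x - mean) for x in scores) / len(scores)',
        'return sum(abs(x - mean) for x in numbers) / len(numbers)'))  # True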

Now let's look at some of the test questions and see if we can find them on GitHub. GitHub search is total garbage, so I'm going to pull results from the last time I looked (search my comment history for "godelski human eval"). I apologize in advance for the formatting.

HumanEval/4:

Prompt:

    from typing import List

    def mean_absolute_deviation(numbers: List[float]) -> float:
        """ For a given list of input numbers, calculate Mean Absolute Deviation
        around the mean of this dataset.
        Mean Absolute Deviation is the average absolute difference between each
        element and a centerpoint (mean in this case):
        MAD = average | x - x_mean |
        >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
        1.0
        """

canonical_solution:

    mean = sum(numbers) / len(numbers)
    return sum(abs(x - mean) for x in numbers) / len(numbers)

Found on Github[2], commit date Oct 5, 2019:

    if reduction == "median":
        return np.median(scores)
    mean = sum(scores) / len(scores)
    if reduction == "mean":
        return mean
    return sum(abs(x - mean) for x in scores) / len(scores)

A solution that is functionally equivalent: swap scores for numbers and remove the if statements. This constitutes a near collision, and ML models perform very well on near collisions. If you look at the testing method used for the evaluation, you will also see that this code passes the test. So our LLM could very easily copy-paste this code and pass, no problem. I'm not saying that's what happened, only that we cannot rule it out. What actually happened is an open question, and as a community we are far from being able to say whether LLMs are just fuzzy copy machines.
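To make the "swap and delete" point concrete, here is a sketch. The real unit tests ship with the dataset; I'm only checking against the docstring example from the prompt above, which is in the same assert-based spirit.

    from typing import List

    def mean_absolute_deviation(numbers: List[float]) -> float:
        # Body lifted from the 2019 repo above, with `scores` renamed to
        # `numbers` and the `reduction` branches removed: it is now the
        # canonical solution verbatim.
        mean = sum(numbers) / len(numbers)
        return sum(abs(x - mean) for x in numbers) / len(numbers)

    # The docstring example from the prompt.
    assert abs(mean_absolute_deviation([1.0, 2.0, 3.0, 4.0]) - 1.0) < 1e-6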

I also have this search query saved, and it still seems to work[3], but you'll have to check the dates manually.

You can repeat this process for many examples in the HumanEval dataset. Or simply look at the HumanEval questions and answers and ask yourself, "Have I written these exact lines of code before?" The answer is probably yes.
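If you want to repeat the exercise yourself, a sketch along these lines will dump every canonical solution so you can paste fragments into GitHub search (field names assume the published schema: task_id, prompt, canonical_solution, test):

    # Pull the dataset from the Hugging Face link at [0] and print each solution.
    from datasets import load_dataset

    ds = load_dataset("openai_humaneval", split="test")
    for row in ds:
        print(f"--- {row['task_id']} ---")
        print(row["canonical_solution"])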

But note that overfitting is perfectly okay in certain circumstances. HumanEval simply measures how good an LLM is at solving short leetcode-style questions. It does not measure an LLM's ability to write code in general, and certainly not its ability to write non-leetcode code. The model may very well be able to do those things, but this benchmark does not measure them. The benchmark can still provide utility, and these LLMs still learn a lot more than what HumanEval tests. My issue is with the metric and with claims about what the results indicate, not with the product itself. There is also a danger in chasing benchmarks like these, because you will not be able to disentangle overfitting from the training outcomes you actually want.

I am not critiquing OP's network nor the work they did to create it. I'll explicitly state here: well done, OP. This took a lot of hard work and you should feel very proud. I hope this question and context do not come off as pejorative or overly cynical. Your work is, without a doubt, something to be proud of and useful to our community.

This is a warning to all HN readers to help you avoid snake oil (I expect every ML person to already know this): scrutinize your metrics and know exactly what they measure. I mean precisely: there are no metrics that directly measure abstract things like "image quality," "language performance," or "code generation performance." With generative models it is exceptionally difficult to determine which model is better, and we are unfortunately at a point where many of our metrics cannot be taken at face value (remember: metrics are proxies for more abstract goals; metrics are models, and all models are wrong, just some more wrong than others), so you must do far more investigation to arrive at even a fuzzy answer to that question. Nuance is necessary.

[0] https://huggingface.co/datasets/openai_humaneval

[1] https://arxiv.org/abs/2107.03374

[2] https://github.com/danielwatson6/hate-speech-project/blob/8e...

[3] https://github.com/search?q=abs%28x+-+mean%29+for+language%3...



