I might be missing it, but I can't find where it says how the data was generated; it mostly refers back to the previous paper, which stated they used GPT-3.5.
I wouldn't be too surprised, but I can't find anything in the technical report saying they're using GPT-4 specifically.
Read the first paper "Textbooks Are All You Need".
> We annotate the quality of a small subset of these files (about 100k samples) using
> GPT-4: given a code snippet, the model is prompted to “determine its educational value for a student whose goal is to learn basic coding concepts”.
They used GPT-3.5 to generate 1B tokens of synthetic data.
They used GPT-4 to annotate data to train a classifier that filters human-written code.
The quote directly after yours:
> We then use this annotated dataset to train a random forest classifier that predicts the quality of
> a file/sample using its output embedding from a pretrained codegen model as features. We note that
> unlike GPT-3.5, which we use extensively to generate synthetic content (discussed below), we use GPT-4
> minimally only for annotations on the quality of a small subset of The Stack and StackOverflow samples.
> We thus view our usage of GPT-4 as merely a way to avoid tedious human-annotation efforts
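In other words, GPT-4's labels only bootstrap a cheap classifier that does the actual corpus filtering. A minimal sketch of that setup, assuming random vectors stand in for the codegen model's output embeddings and random binary labels stand in for GPT-4's quality annotations (neither is from the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in for embeddings of the ~100k annotated files; in the paper
# these come from a pretrained codegen model, here they're random.
train_embeddings = rng.normal(size=(1000, 64))
# Stand-in for GPT-4's binary "educational value" annotations.
train_labels = rng.integers(0, 2, size=1000)

# Train the random forest on (embedding, annotation) pairs.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_embeddings, train_labels)

# Score the (much larger) unlabeled corpus and keep high-quality files,
# so GPT-4 never has to see the full dataset.
corpus_embeddings = rng.normal(size=(5000, 64))
quality = clf.predict_proba(corpus_embeddings)[:, 1]
filtered = corpus_embeddings[quality > 0.5]
print(filtered.shape)
```

The point of the design is cost: the expensive model labels a small subset once, and the random forest then scores millions of files essentially for free.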