I might be missing it, but I can't find where it says how the data was generated; it mostly refers back to the previous paper, which stated they used GPT-3.5.
I wouldn't be too surprised, but I can't find anything in the technical report saying they're using GPT-4 specifically.
Read the first paper "Textbooks Are All You Need".
> We annotate the quality of a small subset of these files (about 100k samples) using
> GPT-4: given a code snippet, the model is prompted to “determine its educational value for a student whose goal is to learn basic coding concepts”.
They used GPT-3.5 to generate 1B tokens of synthetic data.
They used GPT-4 to annotate data to train a classifier that filters human-written code.
The quote directly after yours:
> We then use this annotated dataset to train a random forest classifier that predicts the quality of
> a file/sample using its output embedding from a pretrained codegen model as features. We note that
> unlike GPT-3.5, which we use extensively to generate synthetic content (discussed below), we use GPT-4
> minimally only for annotations on the quality of a small subset of The Stack and StackOverflow samples.
> We thus view our usage of GPT-4 as merely a way to avoid tedious human-annotation efforts
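In other words, GPT-4's labels only bootstrap a cheap classifier that does the actual corpus filtering. A minimal sketch of that setup, assuming random vectors stand in for the codegen model's output embeddings and random binary labels stand in for GPT-4's quality annotations (neither is from the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in for embeddings of the ~100k annotated files; in the paper
# these come from a pretrained codegen model, here they're random.
train_embeddings = rng.normal(size=(1000, 64))
# Stand-in for GPT-4's binary "educational value" annotations.
train_labels = rng.integers(0, 2, size=1000)

# Train the random forest on (embedding, annotation) pairs.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_embeddings, train_labels)

# Score the (much larger) unlabeled corpus and keep high-quality files,
# so GPT-4 never has to see the full dataset.
corpus_embeddings = rng.normal(size=(5000, 64))
quality = clf.predict_proba(corpus_embeddings)[:, 1]
filtered = corpus_embeddings[quality > 0.5]
print(filtered.shape)
```

The point of the design is cost: the expensive model labels a small subset once, and the random forest then scores millions of files essentially for free.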