
I might be missing it, but I can't find where it says how the data was generated; it mostly refers back to the previous paper, which stated they used 3.5.

I wouldn't be too surprised, but I can't find anything in the technical report saying they're using GPT-4 specifically.




Read the first paper "Textbooks Are All You Need".

> We annotate the quality of a small subset of these files (about 100k samples) using GPT-4: given a code snippet, the model is prompted to “determine its educational value for a student whose goal is to learn basic coding concepts”.


Right, they didn't use GPT-4 to generate data.

They used GPT-3.5 to generate 1B tokens of synthetic data.

They used GPT-4 to annotate data to train a classifier for filtering human-written code.

The quote directly after yours:

> We then use this annotated dataset to train a random forest classifier that predicts the quality of a file/sample using its output embedding from a pretrained codegen model as features. We note that unlike GPT-3.5, which we use extensively to generate synthetic content (discussed below), we use GPT-4 minimally only for annotations on the quality of a small subset of The Stack and StackOverflow samples. We thus view our usage of GPT-4 as merely a way to avoid tedious human-annotation efforts
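In other words, GPT-4 only supplies the quality labels; the actual filtering is done by a cheap classifier over embeddings. A minimal sketch of that step (not the authors' code; the embedding helper and binary label format here are assumptions for illustration):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def embed(code_snippet: str) -> np.ndarray:
        """Placeholder: return the output embedding of a pretrained
        codegen model for this snippet (e.g. a pooled hidden state)."""
        raise NotImplementedError

    def train_quality_filter(annotated):
        # annotated: list of (code_snippet, gpt4_label) pairs, where the label
        # is GPT-4's judgement of "educational value" (assumed binary here).
        X = np.stack([embed(snippet) for snippet, _ in annotated])
        y = np.array([label for _, label in annotated])
        clf = RandomForestClassifier(n_estimators=100)
        clf.fit(X, y)
        return clf

    def filter_corpus(clf, corpus):
        # Keep only files the classifier predicts as high educational value.
        return [s for s in corpus if clf.predict(embed(s)[None, :])[0] == 1]

Labeling ~100k samples with GPT-4 and then running a random forest over the rest is much cheaper than having GPT-4 score the whole Stack/StackOverflow corpus, which is presumably why they kept GPT-4 usage minimal.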



