* Trained on 17GB of code from the top 10,000 most popular Debian packages. The source files were deduplicated using a process similar to the OpenWebText preprocessing (essentially locality-sensitive hashing to detect near-duplicates; see the sketch after this list).
* I used the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) code for training. Training took about 1 month on 4x RTX8000 GPUs.
* You can download the trained model here: https://moyix.net/~moyix/csrc_final.zip, the dataset here: https://moyix.net/~moyix/csrc_dataset_large.json.gz, and the BPE vocab here: https://moyix.net/~moyix/csrc_vocab_large.zip
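For anyone curious what "a process similar to the OpenWebText preprocessing" can look like in practice, here's a minimal near-duplicate detection sketch using MinHash LSH via the `datasketch` library. The library choice, shingle size, and similarity threshold are my assumptions, not necessarily what was actually used:

```python
# Minimal near-duplicate detection sketch using MinHash LSH (datasketch).
# Assumptions: 5-token shingles and a 0.8 Jaccard threshold; the real
# preprocessing pipeline may differ.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def minhash_of(source: str) -> MinHash:
    """Hash overlapping 5-token shingles of a source file into a MinHash."""
    tokens = source.split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(tokens) - 4, 1)):
        shingle = " ".join(tokens[i:i + 5])
        m.update(shingle.encode("utf-8"))
    return m

def deduplicate(files: dict[str, str]) -> list[str]:
    """Return the paths of files kept after dropping near-duplicates."""
    lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
    kept = []
    for path, source in files.items():
        m = minhash_of(source)
        if lsh.query(m):      # a near-duplicate was already kept
            continue
        lsh.insert(path, m)
        kept.append(path)
    return kept
```

Any file whose shingle set collides with an already-kept file above the threshold gets dropped; everything else goes into the training set.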
Happy to answer any questions!
Did people find it to be as challenging when you showed it to them as some of us are here? Did you expect that level of complexity?
Yes, I think people definitely find it challenging. I'm keeping track of the correct and total guesses for each snippet, and right now people are at almost exactly 50% accuracy:
| correct | total | pct |
|--------:|------:|----:|
|    6529 | 12963 |  50 |
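(For context, that tally is just an aggregate over the guess log; the `guesses` table and its `correct` column below are hypothetical, since only the output columns are shown.)

```python
# Hypothetical aggregate producing the correct | total | pct tally above.
# The schema (a "guesses" table with a 0/1 "correct" column) is assumed.
import sqlite3

conn = sqlite3.connect("quiz.db")
correct, total, pct = conn.execute(
    "SELECT SUM(correct), COUNT(*), "
    "ROUND(100.0 * SUM(correct) / COUNT(*)) FROM guesses"
).fetchone()
print(correct, total, pct)  # e.g. 6529 12963 50.0
```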
What I found more worrying was how terrible some of the real code was :D
EDIT: another poster guessed GPT-2 every time and found that about 80 percent of their snippets were in fact GPT-2
Snippets are chosen uniformly at random: `SELECT id, code, real FROM code ORDER BY random() LIMIT 1`
It's unusual to be this lopsided (about a 1-in-7 chance), but not crazy.
It's rare that a p-value is what you actually want, but for answering "how unusual is this case?", it's exactly the right tool for the job.
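To make that concrete, the p-value here is just a binomial tail: under the null hypothesis that real and GPT-2 snippets are served with equal probability, how likely is a split this lopsided? The counts below are made up, since the poster didn't say how many snippets they saw:

```python
# How surprising is ~80% GPT-2 snippets if selection is really 50/50?
# The counts are hypothetical; the poster didn't report a sample size.
from scipy.stats import binomtest

k, n = 12, 15                    # e.g. 12 of 15 snippets were GPT-2 (80%)
result = binomtest(k, n, p=0.5)  # two-sided by default
print(f"p-value: {result.pvalue:.3f}")  # ~0.035 for these counts
```

With a smaller sample the same 80% split is much less surprising, which is how a figure like 1-in-7 can still be "not crazy".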
If you're fine-tuning an existing GPT-2 model, you have to use its BPE tokenizer; you could in theory swap in your own, but that wouldn't help performance-wise and you'd just end up with a lot of wasted token space.
The efficiency gains from a custom tokenizer on bespoke, esoteric content that doesn't look like typical internet text (like this) are why I recommend training your own tokenizer and a GPT-2 model from scratch if possible.
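As a rough illustration of what "train your own tokenizer" involves, here's a minimal sketch using the Hugging Face `tokenizers` library; the library, vocabulary size, and file paths are my assumptions rather than anything specified above:

```python
# Minimal sketch: train a byte-level BPE tokenizer on a directory of C files.
# Library choice (Hugging Face tokenizers), vocab size, and paths are assumptions.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

files = [str(p) for p in Path("csrc/").rglob("*.c")]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=50257,                 # same size as GPT-2's vocab, but fit to C code
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("csrc_vocab")    # writes vocab.json and merges.txt

# Quick check of how compactly C code tokenizes with the custom vocab
enc = tokenizer.encode("for (int i = 0; i < n; i++) { sum += a[i]; }")
print(len(enc.tokens), enc.tokens[:10])
```

The point of doing this from scratch is that the vocabulary gets spent on C idioms (keywords, operators, common identifiers) instead of English web text, so each snippet fits in far fewer tokens.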