If you want to look at some more fake samples, here are 256 generated snippets. These ones start at the beginning of the file each time, so you should get a better sense of context:
There are likely some "tells", but far fewer of them than I expected. I've seen it occasionally generate something malformed like "#includefrom", and like all GPT-2 models it has a tendency to repeat itself.
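If you want to poke at the model yourself, unconditional "start of file" sampling is straightforward. The sketch below uses HuggingFace's transformers API and assumes the Megatron-LM checkpoint has been converted to a HuggingFace-compatible GPT-2 model (the directory name is a placeholder, not something shipped with the release); it just conditions generation on the document-start token so every sample begins at the top of a file.

```python
# Illustrative sketch only: assumes a HuggingFace-compatible conversion of
# the released Megatron-LM checkpoint; "csrc_gpt2_hf" is a placeholder path.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_dir = "csrc_gpt2_hf"
tokenizer = GPT2TokenizerFast.from_pretrained(model_dir)
model = GPT2LMHeadModel.from_pretrained(model_dir).eval()

# Condition on the document-start token so the sample starts at the
# beginning of a "file", like the demo snippets.
input_ids = torch.tensor([[tokenizer.bos_token_id]])
with torch.no_grad():
    output = model.generate(
        input_ids,
        do_sample=True,       # sample instead of greedy decoding
        top_k=40,
        top_p=0.95,
        max_length=256,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```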
Yes, I think people definitely find it challenging. I'm keeping track of the correct and total guesses for each snippet; right now people are at almost exactly 50% accuracy:
I managed to get 20/25 correct, and most of the wrong ones occurred in the first 10. Once I realized I had to look closer than simple logic errors and slowed way down to read carefully, my accuracy increased significantly. The textual parts had the most clues: comments that didn't match the functionality, variable names that seemed like a soup of related terms, even one typo ("separated" became "separata"), and debug statements that didn't match the variables they should be outputting. Beyond that, the overall cohesiveness was better with real code. When I was a TA in school I found that reading through correct code is usually much easier than reading through incorrect code, because with incorrect code you have to start accommodating surreal hypotheticals in your head that might make it all come together in the end. Disclaimer: I work on NLP projects and I'm quite familiar with the limitations and typical failure modes of transformer models.
Yes, it's not hard to spot the fake code if you take some time; I scored 10/10 on the first try. The GPT-2 code definitely 'looks real', but it doesn't make sense most of the time: using uninitialized variables, computing things but not using them, almost random comments that have no relation to the code, etc. The only snippets where I had to guess were those that just repeated many variations of the same thing, as if they were machine-generated from some other input data (which sometimes makes sense to do in a 'real' project).
What I found more worrying is how terrible some of the real code was :D
Reminder that the p-value of a test is NOT the probability of H0 being true; see [0]. It only shows that, at a significance level of 0.05, we cannot reject the null hypothesis (in our case, that getting 89/200 correct is consistent with a binomial distribution with p = 0.5).
It only shows the probability of observing that statistic, or something even more extreme, under the null distribution. The implication that we can then "reject the null hypothesis" is more convention and heuristic than anything.
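To make that concrete, here's a quick way to check the number quoted above (89 correct out of 200) with an exact two-sided binomial test; scipy is just one convenient way to compute it.

```python
# Two-sided exact binomial test: is 89 correct out of 200 consistent
# with random guessing (p = 0.5)?
from scipy.stats import binomtest

result = binomtest(k=89, n=200, p=0.5, alternative="two-sided")
print(result.pvalue)  # roughly 0.14 -- well above 0.05, so we fail to
                      # reject the null hypothesis of 50% accuracy
```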
How can you distinguish between people genuinely trying to discern them and people randomly clicking one of the buttons just to see the answer? The latter would also result in 50% accuracy, regardless of the actual GPT quality.
I'm curious: how were you able to feed the GPUs? Did you simply preload the entire dataset into RAM (it certainly seems possible)? Did you pre-apply BPE? Did you train your own BPE?
Note that the encoded dataset (run through the GPT-2 BPE tokenizer) will be much, much smaller than 17GB, both on disk and in memory (in my experience it can be anywhere from 1/3 to 1/2 the size).

If you're fine-tuning an existing GPT-2 model, you have to use its BPE tokenizer; you could theoretically use your own, but it wouldn't make a difference performance-wise and you'd just end up with a lot of wasted token space.

The efficiency gains from using your own tokenizer on bespoke, esoteric content that doesn't match typical internet text (like this) are why I recommend training your own tokenizer and GPT-2 from scratch if possible.
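If you want to see the size difference for yourself, a rough check with the stock GPT-2 tokenizer looks something like this (the file path is a placeholder, and the uint16 storage assumption is a common choice rather than necessarily what was done here):

```python
# Rough check: bytes of raw source vs. GPT-2 BPE tokens for one file.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

path = "example.c"  # placeholder: any source file from the corpus
with open(path, "r", encoding="utf-8", errors="ignore") as f:
    text = f.read()

n_bytes = len(text.encode("utf-8"))
n_tokens = len(tokenizer(text)["input_ids"])

# If token IDs are stored as uint16 (a common choice for a ~50k vocab),
# the encoded file takes about 2 * n_tokens bytes on disk.
print(f"{n_bytes} raw bytes -> {n_tokens} tokens "
      f"(~{2 * n_tokens} bytes encoded)")
```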
Yeah, I trained my own BPE tokenizer for this and it results in pretty good compression. From 1024 BPE tokens you can generate anywhere from 2000-6000 actual characters of text. My guess is that it's a bit more efficient than English-BPE because there's a lot of repetitive stuff in source code (think spaces for indentation, or "if("/"while("/"for (int").
Yep, I trained my own BPE using HuggingFace's tokenizers library. During training I didn't keep the entire dataset in memory, because even on an RTX8000 the full dataset plus the model weights plus the optimizer state (Adam) is too big.
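For reference, training a byte-level BPE with HuggingFace's tokenizers library looks roughly like this; the file glob, vocab size, and output directory are placeholders rather than the exact settings used for this model.

```python
# Train a byte-level BPE tokenizer on a directory of source files.
# Paths and vocab_size are placeholders, not the exact settings used here.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

files = [str(p) for p in Path("debian_sources").rglob("*.c")]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

# Writes vocab.json and merges.txt, which the training code can load later.
tokenizer.save_model("csrc_bpe")
```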
* Trained on 17GB of code from the top 10,000 most popular Debian packages. The source files were deduplicated using a process similar to the OpenWebText preprocessing (basically a locality-sensitive hash to detect near-duplicates; a sketch of the general approach is below).
* I used the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) code for training. Training took about 1 month on 4x RTX8000 GPUs.
* You can download the trained model here: https://moyix.net/~moyix/csrc_final.zip and the dataset/BPE vocab here: https://moyix.net/~moyix/csrc_dataset_large.json.gz https://moyix.net/~moyix/csrc_vocab_large.zip
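For anyone curious about the deduplication step, the general MinHash + LSH approach used in OpenWebText-style preprocessing looks something like this sketch with the datasketch library; the shingle length and similarity threshold are illustrative, not the exact settings used here.

```python
# Near-duplicate detection with MinHash + locality-sensitive hashing.
# Shingle length and threshold are illustrative choices.
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128, shingle_len=5):
    """Build a MinHash signature from character shingles of a file."""
    m = MinHash(num_perm=num_perm)
    for i in range(len(text) - shingle_len + 1):
        m.update(text[i:i + shingle_len].encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = {}

def add_if_new(name, text):
    """Keep a file only if nothing already kept is a near-duplicate of it."""
    m = minhash_of(text)
    if lsh.query(m):      # any previously kept file above the threshold?
        return False      # near-duplicate: drop it
    lsh.insert(name, m)
    kept[name] = text
    return True
```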
Happy to answer any questions!