Hi, author here! Some details on the model:

* Trained on 17GB of code from the top 10,000 most popular Debian packages. The source files were deduplicated using a process similar to the OpenWebText preprocessing (basically a locality-sensitive hash to detect near-duplicates; see the sketch right after this list for the general idea).

* I used the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) code for training. Training took about 1 month on 4x RTX8000 GPUs.

* You can download the trained model here: https://moyix.net/~moyix/csrc_final.zip and the dataset/BPE vocab here: https://moyix.net/~moyix/csrc_dataset_large.json.gz https://moyix.net/~moyix/csrc_vocab_large.zip
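
If you're curious what that dedup step looks like concretely, here's a rough sketch of the MinHash/LSH idea. It uses the datasketch library purely for illustration; the paths, shingle size, and similarity threshold are placeholders, not the exact settings used for this dataset:

    import glob
    from datasketch import MinHash, MinHashLSH

    def minhash_of(text, num_perm=128, k=5):
        # hash character k-grams so near-identical files get similar signatures
        m = MinHash(num_perm=num_perm)
        for i in range(len(text) - k + 1):
            m.update(text[i:i + k].encode("utf8"))
        return m

    lsh = MinHashLSH(threshold=0.8, num_perm=128)   # Jaccard threshold is a placeholder
    kept = []
    for path in glob.glob("sources/**/*.c", recursive=True):   # placeholder location
        text = open(path, errors="ignore").read()
        m = minhash_of(text)
        if lsh.query(m):        # near-duplicate of a file we already kept
            continue
        lsh.insert(path, m)
        kept.append(path)
    print(f"kept {len(kept)} files after near-duplicate filtering")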

Happy to answer any questions!




For those trying to guess, you can find the real code using the Debian code search service:

https://codesearch.debian.net/


If you want to look at some more fake samples, here are 256 generated snippets. These ones start at the beginning of the file each time, so you should get a better sense of context:

https://moyix.net/~moyix/unconditional_samples.txt


Hi prof! This is super neat; Nick and I have actually been looking into code generation ourselves as part of Whize.


Thanks for stopping by! This is impressive. I would be curious to know if my hunch below about potential weaknesses/tells was at all correct.

Did people find it as challenging when you showed it to them as some of us are finding it here? Did you expect that level of complexity?


There are likely some "tells" but many fewer of them than I expected. I've seen it occasionally generate something malformed like "#includefrom", and like all GPT2 models it has a tendency to repeat things.

Yes, I think people definitely find it challenging. I'm keeping track of the correct and total guesses for each snippet; right now people are at almost exactly 50% accuracy:

     correct | total | pct 
    ---------+-------+-----    
        6529 | 12963 |  50


I managed to get 20/25 correct, and most of the wrong ones were in the first 10. Once I realized I had to look closer than simple logic errors, and slowed way down to read carefully, my accuracy increased significantly. The textual parts held the most clues: comments that didn't match the functionality, variable names that read like a soup of related terms, even one typo ("separated" became "separata"), debug statements that didn't match the variables they were supposed to output. Beyond that, the overall cohesiveness was better with real code.

When I was a TA in school I found that reading through correct code is usually much easier than reading through incorrect code, because with incorrect code you have to start accommodating surreal hypotheticals in your head that might make it all come together in the end. Disclaimer: I work on NLP projects and I'm quite familiar with the limitations and typical failure modes of transformer models.


Yes, it's not hard to spot the fake code if you take some time; I scored 10/10 on the first try. The GPT-2 code definitely 'looks real', but most of the time it doesn't make any sense: it uses uninitialized variables, computes things and never uses them, has almost random comments with no relation to the code, etc. The only snippets where I had to guess were those that just repeated many variations of the same thing, as if they were machine-generated from some other input data (which sometimes makes sense to do in a 'real' project).

What I found more worrying is how terrible some of the real code was :D


Are you presenting real samples and GPT2 samples to users with equal probabilities?

EDIT: another poster guessed GPT2 each time and found the frequency was 80 percent.


It should be equal: there are 1000 real and 1000 generated samples in the database, retrieved via:

    SELECT id, code, real FROM code ORDER BY random() LIMIT 1
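
If anyone wants to convince themselves that drawing one row at a time this way really is a 50/50 coin flip, here's a quick simulation against a throwaway SQLite table shaped like the one above (the row contents are placeholders):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute('CREATE TABLE code (id INTEGER, code TEXT, "real" INTEGER)')
    # 1000 real + 1000 generated placeholder rows
    db.executemany("INSERT INTO code VALUES (?, ?, ?)",
                   [(i, "...", 1 if i < 1000 else 0) for i in range(2000)])
    draws = [db.execute('SELECT "real" FROM code ORDER BY random() LIMIT 1').fetchone()[0]
             for _ in range(10000)]
    print(sum(draws) / len(draws))   # hovers around 0.5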


I guessed GPT2 each time, 200 times in a row, and GPT2 was the correct answer only 89/200 times, so about 45% of the snippets shown to me were GPT2.


    In [2]: scipy.stats.binom_test(89, 200, 0.5)
    Out[2]: 0.13736665086863936

Unusual to be this lopsided (1-in-7), but not crazy.
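
For what it's worth, the 1-in-7 figure falls straight out of the binomial distribution; with p=0.5 the two-sided test is just the probability of landing at least this far from 100 in either direction:

    from scipy.stats import binom

    # P(X <= 89) + P(X >= 111) for X ~ Binomial(n=200, p=0.5)
    p = binom.cdf(89, 200, 0.5) + binom.sf(110, 200, 0.5)
    print(p)   # ~0.137, i.e. roughly a 1-in-7 event under an even split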


Reminder that the p-value of a test is NOT the probability of H0 being true; see [0]. It only shows that, if we assume a significance level of 0.05, we cannot reject the hypothesis (in our case, that 89/200 is a draw from a binomial distribution with p=.5).

[0] https://en.wikipedia.org/wiki/Misuse_of_p-values#Clarificati...


Yes. That is what I said and how I interpreted it. If the split is even (H0 true), getting a result this lopsided is a 1-in-7 deal.

It's rare that a p-value is what you want, but for answering "how unusual is this case?" it's exactly the right tool for the job.


It only shows the probability of observing either that statistic, or something even more extreme, under the null distribution. The implication that we can then "reject the null hypothesis" is more parlance and heuristic than anything.


How can you distinguish between people genuinely trying to discern them and people randomly clicking one of the buttons just to see the answer? The latter would also result in 50% accuracy, regardless of the actual GPT quality.


Yep, there could be a lot of noise in there too from people guessing lazily.


Can you share the dataset too?



I am curious: how were you able to feed the GPUs? Did you simply preload the entire dataset into RAM (it certainly seems possible)? Did you pre-apply BPE? Did you train your own BPE?


Note that the encoded dataset (run through the GPT-2 BPE tokenizer) will be much, much smaller than 17GB, both on disk and in memory (in my experience it can be anywhere from a third to half the size).

If you're finetuning an existing GPT-2 model, you must use that model's BPE tokenizer; you could theoretically use your own, but it wouldn't make a difference performance-wise and you'd just have a lot of wasted token space.

The efficiency gains from using your own tokenizer on bespoke, esoteric content that doesn't match typical internet text (like this) are why I recommend training your own tokenizer and a GPT-2 from scratch if possible.
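
If you want to see the effect on your own data, a quick check with the stock GPT-2 tokenizer looks something like this (a sketch; the file path is a placeholder):

    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    text = open("example.c", errors="ignore").read()   # placeholder path
    ids = tok(text)["input_ids"]
    print(len(text), "chars ->", len(ids), "tokens",
          f"({len(text) / len(ids):.1f} chars per token)")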


Yeah, I trained my own BPE tokenizer for this and it results in pretty good compression. From 1024 BPE tokens you can generate anywhere from 2,000 to 6,000 actual characters of text. My guess is that it's a bit more efficient than the English BPE because there's a lot of repetitive stuff in source code (think spaces for indentation, or "if("/"while("/"for (int").


Yep, I trained my own BPE using HuggingFace's tokenizers library. During training I didn't keep the entire dataset in memory, because even on an RTX8000 the full dataset + model weights + data used by the optimizer (Adam) is too big.
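
For anyone who wants to replicate that step, training the tokenizer itself only takes a few lines with that library (a sketch; the file list and vocab size below are illustrative, not necessarily the settings used here):

    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(files=["csrc_part0.txt", "csrc_part1.txt"],   # placeholder file list
                    vocab_size=50257,                             # illustrative, GPT-2-sized
                    min_frequency=2,
                    special_tokens=["<|endoftext|>"])
    tokenizer.save_model("csrc_vocab")   # writes vocab.json and merges.txt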


This is crazy good. Well done.



