Try to guess if code is real or GPT2-generated (doesnotexist.codes)
201 points by AlexDenisov on Feb 23, 2021 | 103 comments



Hi, author here! Some details on the model:

* Trained on 17GB of code from the top 10,000 most popular Debian packages. The source files were deduplicated using a process similar to the OpenWebText preprocessing (basically a locality-sensitive hash to detect near-duplicates; a rough sketch is included below).

* I used the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) code for training. Training took about 1 month on 4x RTX8000 GPUs.

* You can download the trained model here: https://moyix.net/~moyix/csrc_final.zip and the dataset/BPE vocab here: https://moyix.net/~moyix/csrc_dataset_large.json.gz https://moyix.net/~moyix/csrc_vocab_large.zip
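
If anyone wants to reproduce the dedup step, here is a minimal sketch of the idea using the datasketch library (the shingling, the 0.9 threshold, and the `corpus` variable are illustrative placeholders, not my exact pipeline):

    from datasketch import MinHash, MinHashLSH

    def minhash_of(source, num_perm=128):
        # Hash the file's whitespace-token 5-grams into a MinHash signature.
        tokens = source.split()
        m = MinHash(num_perm=num_perm)
        for i in range(max(len(tokens) - 4, 1)):
            m.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
        return m

    # Files whose estimated Jaccard similarity to an already-kept file
    # exceeds the threshold are treated as near-duplicates and dropped.
    lsh = MinHashLSH(threshold=0.9, num_perm=128)
    kept = []
    for path, source in corpus:   # corpus: iterable of (filename, file contents)
        m = minhash_of(source)
        if not lsh.query(m):      # no near-duplicate seen so far
            lsh.insert(path, m)
            kept.append(path)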

Happy to answer any questions!


For those trying to guess, you can find the real code using the Debian code search service:

https://codesearch.debian.net/


If you want to look at some more fake samples, here are 256 generated snippets. These ones start at the beginning of the file each time, so you should get a better sense of context:

https://moyix.net/~moyix/unconditional_samples.txt


Hi prof! This is super neat; Nick and I have actually been looking into code generation ourselves for part of Whize.


Thanks for stopping by! This is impressive. I would be curious to know if my hunch below about potential weaknesses/tells was at all correct.

Did people find it as challenging when you showed it to them as some of us are finding it here? Did you expect that level of difficulty?


There are likely some "tells" but many fewer of them than I expected. I've seen it occasionally generate something malformed like "#includefrom", and like all GPT2 models it has a tendency to repeat things.

Yes, I think people definitely find it challenging. I'm keeping track of the correct and total guesses for each snippet; right now people are at almost exactly 50% accuracy:

     correct | total | pct 
    ---------+-------+-----    
        6529 | 12963 |  50


I managed to get 20/25 correct, and most of the wrong ones occurred in the first 10. Once I realized I had to look closer than simple logic errors and slowed way down to read carefully, my accuracy increased significantly. The textual parts hold the most clues: comments that didn't match the functionality, variable names that seemed like a soup of related terms, even one typo (separated became separata), debug statements that didn't match the variables they should be outputting. Beyond that, the overall cohesiveness was better with real code.

When I was a TA in school I used to find that reading through correct code is usually much easier than reading through incorrect code, because with incorrect code you have to start accommodating surreal hypotheticals in your head that might make it all come together in the end.

Disclaimer: I work on NLP projects and I'm quite familiar with the limitations and typical failure modes of transformer models.


Yes, it's not hard to spot the fake code if you take some time; I scored 10/10 on the first try. The GPT-2 code definitely 'looks real' but it doesn't make any sense most of the time: using uninitialized variables, computing things but not using them, almost random comments that have no relation to the code, etc. The only snippets where I had to guess were those that just repeated many variations of the same thing, as if they were machine-generated from some other input data (which sometimes makes sense to do in a 'real' project).

What I found more worrying is how terrible some of the real code was :D


Are you presenting real samples and GPT2 samples to users with equal probabilities?

EDIT: another poster guessed GPT2 each time and found the frequency was 80 percent.


It should be equal: there are 1000 real and 1000 generated samples in the database, retrieved via:

    SELECT id, code, real FROM code ORDER BY random() LIMIT 1


I guessed GPT2 each time, 200 times in a row, and found that GPT2 was correct only 89/200 times, so about 45% was GPT2 for me.


    In [2]: scipy.stats.binom_test(89, 200, 0.5)
    Out[2]: 0.13736665086863936

Unusual to be this lopsided (1-in-7), but not crazy.
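
For reference, the same check with the newer scipy.stats.binomtest API (binom_test is deprecated); the 1-in-7 figure is just 1/0.137:

    from scipy.stats import binomtest

    # Two-sided test: how surprising is seeing only 89 generated snippets
    # out of 200 if real and GPT2 samples are served with equal probability?
    result = binomtest(89, n=200, p=0.5)
    print(result.pvalue)      # ~0.137
    print(1 / result.pvalue)  # ~7.3, i.e. roughly a 1-in-7 event under H0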


Reminder that the p-value of a test is NOT the probability of H0 being true, see [0]. It only shows that, if we assume a significance level of 0.05, we cannot reject the null hypothesis (in our case, that 89/200 is the result of a binomial distribution with p=.5).

[0] https://en.wikipedia.org/wiki/Misuse_of_p-values#Clarificati...


Yes. That is what I said and how I interpreted it. If the split is even (H0 true), getting a result that lopsided is a 1-in-7 deal.

It's rare that a p-value is what you want, but for answering "how unusual is this case", it's exactly the right tool for the job.


It only shows the probability of observing either that statistic, or something even more extreme, under the null distribution. The implication that we can then "reject the null hypothesis" is more parlance and heuristic than anything.


How can you distinguish between people genuinely trying to discern them and people randomly clicking one of the buttons to see the answer? The latter would also result in 50% accuracy, regardless of the actual GPT quality.


Yep, there could be a lot of noise in there too from people guessing lazily.


Can you share the dataset too?



I am curious, how were you able to feed the GPUs? Did you simply preload the entire dataset into RAM (it certainly seems possible)? Did you preapply BPE? Did you train your own BPE?


Note that the encoded dataset (through the GPT-2 BPE tokenizer) will be much, much less than 17GB, both on disk and in memory (in my experience it can be anywhere from 1/3rd to 1/2 the size)

If finetuning an existing GPT-2 model, you must use that BPE tokenizer; you could theoretically use your own but that wouldn't make a difference performance-wise and you'd just have a lot of wasted tokenspace.

The efficiency gains from using your own tokenizer for bespoke, esoteric content that doesn't match typical internet text (like this) are why I recommend training your own tokenizer and GPT-2 from scratch if possible.


Yeah, I trained my own BPE tokenizer for this and it results in pretty good compression. From 1024 BPE tokens you can generate anywhere from 2000-6000 actual characters of text. My guess is that it's a bit more efficient than English-BPE because there's a lot of repetitive stuff in source code (think spaces for indentation, or "if("/"while("/"for (int").


Yep, I trained my own BPE using HuggingFace's tokenizers library. During training I didn't keep the entire dataset in memory, because even on an RTX8000 the full dataset + model weights + data used by the optimizer (Adam) is too big.
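
If anyone wants to reproduce that part, here is a rough sketch with the tokenizers library (the file name and vocab size are placeholders, not my exact settings):

    from tokenizers import ByteLevelBPETokenizer

    # Train a byte-level BPE vocab on the (deduplicated) source corpus.
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=["csrc_corpus.txt"],   # placeholder: one big text dump of the dataset
        vocab_size=50257,
        min_frequency=2,
        special_tokens=["<|endoftext|>"],
    )
    tokenizer.save_model(".")        # writes vocab.json and merges.txt

    # Source code compresses well: repetitive fragments like indentation
    # and "for (int" tend to end up as single BPE tokens.
    enc = tokenizer.encode("for (int i = 0; i < n; i++) {")
    print(len(enc.ids), enc.tokens)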


This is crazy good. Well done.


I got a function that assigned the same expression to three variables. Then it declared a void function with documentation stating "returns true on success, false otherwise". Apparently that code was written by a human, which makes me doubt either the correctness of the website or the quality of the code it was fed.


First code it showed me had getXXX() methods returning void, each of which contained nothing but a printf using the same string variable with no apparent connection to XXX, along with invalid format strings. Surely code this nonsensical has to be generated. Yet when I clicked "GPT2" it said I was wrong.


Don't underestimate the power of failed merges and indifference


I know some closed-source projects that suffered from that kind of getter. Sadly, the coders who complained most about it were also the ones who committed the most crimes against common sense in it.


This made me worried, so I went and spot-checked 5-6. Using the "cheat sheet" I was always able to guess correctly, so I think the site is working fine.

The list of packages the real snippets are drawn from is here (maybe if you want to avoid using them... ;) ):

https://moyix.net/~moyix/sample_pkgnames.txt

Note that the GPT samples are prompted with 128 characters randomly selected from those same packages, so you will see GPT2-generated code that mentions the package name etc. However, these packages were not used for training.


Same thought here - apparently humans read from uninitialized arrays immediately after declaring them! That said, it is still a pretty fun website :)


Are people surprised to see low quality code in the wild? Guess what, most code is subpar.


I actually ran into a case where I wanted to do this, but was forced not to.

What was the scenario? I had a couple of small, fixed-size char buffers and I wanted to swap their valid portions, but the obvious choice of swap_ranges(a, a + max(na, nb), b) would run into this issue. (n.b. this wouldn't be correct for non-POD types anyway, but we're talking about chars.)

On top of it being annoying to not be able to do the convenient thing, it made life harder when debugging, because the "correct" solution does not preserve the bit patterns (0xCC/0xCD or whatever) that the debug build injects into uninitialized arrays, making it harder to tell when I later read an uninitialized element from a swapped-from array.


Why would you ever want to swap an uninitialized value into a buffer? You're wasting CPU cycles writing out data that you are guaranteed to never want to use. Why not just do a copy from the source buffer to the uninitialized one (as that is likely the half of the swap that is desired)?


I literally explained why I want to do that in my last paragraph?


Your last comment just talks about not detecting when you read uninitialized values, but obviously, you wouldn't read uninitialized values _if you never wrote them_?

Unless your use case is swapping with an uninitialized buffer to mark a buffer as "done" and detect further use of it?


You're not understanding what I'm saying. I'm talking about what I see in the debugger when I'm debugging. When you see 0xCC in a variable in the debugger you know you probably had an out of bounds read. Because in debug mode the compiler and runtime leave these markers in uninitialized memory. For that to be helpful you need to swap until the max of the initialized sizes of both arrays, so that you preserve these markers. You defeat that helpful feature if you copy the uninitialized portion of the buffer instead of swapping.


Oh I think I see, I never considered the scenario where the buffer was half uninitialized, I thought you meant you were swapping an (entirely) uninitialized buffer with an initialized one.


Well that certainly could be a special case of the more general scenario I wrote; that's what would happen if it turned out the first buffer had nothing useful in it and the second one was full. But yeah, in general they'd both be partially filled: you have [0, ?, ?] and [1, 1, ?] and want to swap them. You wouldn't touch the last ? in either, but you would swap the [0, ?] with [1, 1], and in debug mode you'd see ? = 0xCC. Except the language doesn't really let you do that, even though fundamentally there should be nothing wrong with it, and in fact is likely to be desirable in practice.


Since these are snippets from a random position in the file, it's possible that the code that initialized them was outside the snippet?


Maybe the probability of GPT2 generating that sequence is nearly 0. Sometimes weird edge cases are more human.


This looks like overfitting to me. Some of the GPT samples were definitely real code, or largely real code. One looked like something from Xorg, another like it was straight from the COLLADA SDK. It’s really hard to define what “truly new code” is, if it’s just the same code copy pasted in different order. Blah blah Ship of Theseus etc.


The generated snippets are prompted with 128 characters from real code (but not code from the training data), so they can often pick up on the name of the project etc.


Apologies if my comment was dismissive. This is an impressive project!


Overfitting on 17GB of input data would be interesting, even though it's using the "large" 774M GPT-2 model.

It's possible that training for a month was too much.


I'm 90% sure I just got a Boost header which was apparently GPT-2 generated, hmmm.

Sadly, I can't go back and see it again.


I, on the other hand, got a section of code with a method that had a complex name and took a bunch of parameters but only ever returned true. I was sure it was auto-generated... but no, it was just bad (real) code.

I got 8/9


I got some code related to the VICE emulator. It looked pretty real, referring to concepts that make sense in the context of a C64 emulator, but the results said it was GPT, not real code. It even had the correct GPL license matching that project. It seems the GPT model has learned quite a bit about the real projects it was fed as input.


It has entirely memorized a bunch of common open source licenses, a bunch of contributor names/emails, and so on. However, when I've tried to locate the actual code it's producing in the training data, it's not there.


There was some code about TIFF headers, and it was apparently GPT2-generated.

TIFF is a real thing, so some human was involved in some part of that code; it has just been garbled up by GPT2... In other words, the training set shows quite visibly in the result.


Would be nice if the back button worked so you could see what you guessed wrong. This is a good example of where POST is used unnecessarily and URLs should be idempotent.


I was trying to avoid letting people figure out what id mapped to real/fake but in retrospect this was a mistake, yeah. Sorry about that!


For the ones that were just part of a header file, listing a bunch of instance variables and function names, it seems impossible. But for the actual code it is possible, though still quite difficult; I spent too long finding some logical inconsistency that gave it away.


This was so much harder than I thought it was going to be. I would get a few right and then be absolutely sure of the next one and be wrong. After a while I felt like I was noticing more aesthetic differences between the gpt and real, rather than distinguishing between the two based on their content. Very interesting...


I wonder how likely it is that code invariants found in the training set are preserved in GPT-2/3. In other words, if I train GPT-2 on C source produced by Csmith (a program generator which I believe produces C programs without undefined behaviour), would programs produced by GPT-2/3 also not exhibit undefined behaviour?

I understand that GPT-2/3 is just a very smart parrot that has no semantic knowledge of what it is outputting. So let's take a very dumb Markov chain that was "trained" on the following input:

a.c:

    int array[6];
    array[5] = 1;

b.c:

    int array[4];
    array[3] = 2;

I guess a Markov chain could theoretically produce the following code:

out.c:

    int array[4];
    array[5] = 1;

which is undefined behaviour, yet it is produced from two programs where no undefined behaviour is present. A better question would be: how can we guarantee that certain program invariants (like lack of undefined behaviour) are preserved in the produced code? Or if there are no guarantees, can we calculate a probability? (Sorry, not an expert on machine learning, just excited about a potentially new way to fuzz programs. Technically, one could instrument C programs with sanitizers and produce backwards static slices from the sanitizer instrumentation to the beginning of the program, and you would get a sample of a program without undefined behaviour... so there is already the potential for the training set to be expanded beyond what Csmith can provide.)
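
For what it's worth, a toy token-level bigram chain trained on just those two files really can emit out.c; here is an illustrative sketch (the whitespace tokenization is deliberately crude):

    import random

    a_c = "int array [ 6 ] ; array [ 5 ] = 1 ;"
    b_c = "int array [ 4 ] ; array [ 3 ] = 2 ;"

    # Build a bigram table: token -> list of observed next tokens.
    chain = {}
    for program in (a_c, b_c):
        tokens = program.split()
        for cur, nxt in zip(tokens, tokens[1:]):
            chain.setdefault(cur, []).append(nxt)

    # Sample a new "program" token by token.
    token, out = "int", ["int"]
    while token in chain and len(out) < 13:
        token = random.choice(chain[token])
        out.append(token)
    print(" ".join(out))
    # One possible sample: int array [ 4 ] ; array [ 5 ] = 1 ;   (an out-of-bounds write)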



This was a fun exercise; I definitely think this could be difficult to suss out for greener devs, or even more experienced ones. It'd be hilarious to have this model power a live screensaver in lieu of actually being busy at times.


I was confused on most examples I got because they started in the middle of a block comment. That's clearly wrong, but was it an artefact of the presentation rather than the generation?


It's easy after a while... all the terribly written code is human-made and the clean, tidy code is GPT2. I, for one, welcome our new programming overlords.


There was one sample where the code written by GPT-2 had nonsensical grammar in the comments, and another where it had an overloaded > operator with 3 parameters (is that even valid C++?).


I'd say always using proper grammar in your comments would be a dead giveaway :)


A mistake is a dead giveaway for a human; proper grammar is a crapshoot; nonsensical grammar is a giveaway for GPT (like adjectives modifying words that aren't nouns, I think because GPT is dropping a reference to an identifier into the comment).


I generally find spelling to be a much better tell than grammar. Most programmers I've known had (mostly) fine grammar but horrendous spelling. I frequently end up being the proofreader.

Which is fine with me.


This is actually quite impressive. Try reading the comments in the code. The comments often make perfect sense in the local context even if it’s GPT-2 gibberish.

The real examples have worse comments at times.

The only flaw is that it shows fake code most of the time so you can game it that way.


Some of the "real" ones have absolutely atrocious comments. Two variables labelled with literally just the name of the variable like so:

    bool hasdied // has died
and then a `// done` for seemingly no reason after initializing variables... where did this code come from?!


The “real” code came from these packages: https://moyix.net/~moyix/sample_pkgnames.txt


A lot of the real code seems superficially nonsensical as well!

Functions with lots of arguments while the body consists of "return true;"

I guess it tells us what AI often tells us about ourselves: that what we do makes much less sense than we think it does. It is thus easy to fake.

How is it possible to churn out so much music or so many books, or so much software? Well, because most creative works are either not very original or are quite procedural or random.

And this kind of work could be automated indeed (or examined if it needs to be done in the first place).


I found this impressively hard at first glance. It just goes to show how difficult getting into context is in an unfamiliar codebase. I think with any amount of knowledge of anything allegedly involved (or, you know, a compiler), these examples would fall apart, but it's still an achievement.

I'm also pretty sure there are formatting, commenting, and in-string-text "tells" that indicate whether something is GPT2 reliably. Maybe I should try training an AI to figure that out...


I tried using a weird indent as a signal of GPT-2... which gave me my first wrong answer. 4/5.


I was always able to correctly identify GPT2 but on a few occasions, I misidentified human-written code as being written by GPT2. Usually when the code was poorly written or the comments were unclear.

GPT2's code looks like correct code at a glance but when you try to understand what it's doing, that's when you understand that it could not have been written by a human.

It's similar to the articles produced by GPT3; they have the right form but no substance.


I ran into some value setters which simply logged their parameters but did not store them. Hah! Obviously GPT2, not understanding that a setter should, well, set.

Wrong, of course. Now maybe the human concerned was far down some inheritance tree and simply wanted to document misuse of a deliberately limited class, but assert()ing would have been too punitive. Or maybe it was indeed "right form but no substance", but authored by an SWE.


Haha, I also got tricked a few times by this sort of thing. One of the ones I got wrong had a comment saying '// end of for loop' after the closing brace of every 'for' loop, and I thought that there was no way a human would ever write such pointless comments...

I did manage to get 100% correct after a while but it takes a thorough reading of the code.


It's fascinating that the main reason this is hard is that some of the human code is bad enough that it's hard to believe it's not GPT-2 output. (The first time this happened for me, I had to look it up to convince myself it's really human code.)

It reminds me of how GPT-3 is good at producing a certain sort of bad writing.

My guess as to why this happens: we humans have abilities of logical reasoning, clarity, and purposefulness that GPT doesn't have. When we use those abilities, we produce writing and code that GPT can't match. If we don't, though, our output isn't much better than GPT's.


I got 4/4 GPT-2 guesses right. It is impressive but the "tell" I've found so far is just poor structure in the logic of how something is arranged. For example: a bunch of `if` statements in sequence without any `else` clauses with some directly opposing prior clauses. Another example was repeating the same operation a few times in individual lines of code which most human programmers would write in a simpler manner.

It's harder to do with some of the smaller excerpts though, and I'm sure there are probably examples of terrible human programmers who write worse code than GPT-2.


I found that a license comment, followed by C++ code but without any header inclusions in between, is a clear tell for GPT-2 generated code.


The two factors that seemed like dead giveaways were comments that didn't relate to the code, and sequences of repetition with minor or no variations.


If only. Humans leave dead comments all the time and I was wrong when I guessed “gpt” wrote this based on that. Being confusing isn’t reliable either unless it’s a syntax error


I saw what I'm almost certain was a syntax error but when I guessed GPT-2, it said I was wrong.

I really wish the site let me view my previous guesses.


This is difficult... because these models are just regurgitating after training on real code. Fun little site but I hope nobody reads too much into this.


I've tried searching for variable and function names and even bits of comments to see if they're copied from the training data. They're not!


Got 40/50 just smashing the GPT2 button.




I had about 50% picking one option over and over.


6/6, quitting while I'm ahead


same here :D


Cool! Sadly, it's really hard to tell whether OOP boilerplate is real or generated.


The giveaway seems to be the comments. Some comments a computer could obviously generate and some are obviously written by humans. The code seemed to be irrelevant in making a determination in my playing around with it.


I picked up on that and also numeric parameters that were not assigned to variables first.


I may have delusions of grandeur because I mainly work on compilers and libraries rather than business code, but there is some diabolically awful code in this training set.


Stay in your bubble. Really.

You have no idea. It'll make you hate computers


The goal when writing code is to be pretty machine-like and to keep things extremely simple. People also write dead or off-topic comments. That's why this is so hard.


The snippet I'm looking at isn't code at all. It is Polish, and there aren't any comment tokens or anything.


Did a few, got all of them right. GPT2-generated code always seems to follow a pattern.


How much of it is just regurgitating the training set and therefore chunks of real code?


There is a "codes" top-level domain? Codes? CODES??

What's next? Advices? Feedbacks? Rests?

I give ups.


"codes" as in "coupon codes".

From <https://icannwiki.org/.codes>:

The .CODES TLD is attractive and useful to end-users as it better facilitates search, self-expression, information sharing and the provision of legitimate goods and services.


Thank you. Faith in humanity restored just a tiny bit xD


Oh, you'd be surprised. Here are a few (TLD and target market)[0]:

    .kim - Kim (Korean surname)
    .lol - LOL: laughing out loud 
    .organic - organic gardeners, farmers, foods, etc. 
    .plumbing - plumbing businesses
[0] https://en.wikipedia.org/wiki/List_of_Internet_top-level_dom...


Isn't there a .pizza? There's also .black, and I think maybe a .travel? Or something along those lines.


The black background on white background makes it annoying to read.


How long before we can just write specs instead of code?


An interesting approach to lossy compression.


Ah, 0/5, I give up :)


Just invert your guesses then!


Why not use GPT-3 next time?



