
PHP code generated by GPT-2 - monort
https://gist.github.com/moyix/dda9c3180198fcb68ad64c3e6bc7afbc
======
ndpsr
Still better than WordPress core.

~~~
lmilcin
I especially like how the comments have absolutely nothing in common with the
code they are supposedly commenting, just like in real life.

~~~
smhg
And almost-real naming like 'DbAppAndFNAAppRegistrationService'.

~~~
tyingq
Looking forward to com.ai.gpt-2.autogen.java.lang.factory.thing.verb

~~~
sli
Really looking forward to some iOS code, with (real) identifiers like:

outputImageProviderFromBufferWithPixelFormat:pixelsWide:pixelsHigh:baseAddress:bytesPerRow:releaseCallback:releaseContext:colorSpace:shouldColorMatch:

kCMSampleBufferConduitNotificationParameter_UpcomingOutputPTSRangeMayOverlapQueuedOutputPTSRange

------
tyingq
The tweets that go with this:
[https://twitter.com/moyix/status/1096255984866082816](https://twitter.com/moyix/status/1096255984866082816)

~~~
fffrantz
"Accidentally picked up" Somehow they must have fed it with some JavaScript
and some php, right ?

~~~
kowdermeister
With sloppy scraping, I guess JS was picked up alongside the text.

PHP, though, would have to be intentionally used as training material.

~~~
lugg
Sloppy scraping?

You forget that HTML and JS are perfectly valid syntax to find in a .php
file.

~~~
kowdermeister
Yes, of course. But JS is mostly frontend, so I can _imagine_ it's easier to
accidentally scrape some JS along with text.

PHP, not so much; only if the server returns the source by accident.

~~~
yorwba
What if the server is GitHub? Or some random blog about PHP development? There
are lots of situations where it's very intentional that PHP is contained in
HTML.

------
userbinator
Any context on what this is supposed to be...? I can vaguely read PHP, but the
code does not appear to be doing anything of much substance.

At first I thought it was something to do with a second revision of GUID
Partition Tables.

~~~
bitexploder
It’s a text generation algorithm. It’s not meant to be real code, just look
like it. This is the infamous “too dangerous to release” GPT-2 making this
code.

~~~
userbinator
Reminds me somewhat of
[https://en.wikipedia.org/wiki/Article_spinning](https://en.wikipedia.org/wiki/Article_spinning)

...and now I hope that searching for code in the future won't become polluted
by such spam.

~~~
bitexploder
I think everyone has written a Markov Chain generator and fed it The Bible
plus a random flavor text. Right?
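
For anyone who hasn't: a minimal word-level sketch in Python (the corpus
filenames are just placeholders) looks something like this:

    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        # Map each `order`-word prefix to the words observed after it.
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, length=50):
        prefix = random.choice(list(chain.keys()))
        out = list(prefix)
        for _ in range(length):
            followers = chain.get(tuple(out[-len(prefix):]))
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)

    # Placeholder corpus: the Bible plus whatever flavor text you like.
    corpus = open("bible.txt").read() + " " + open("flavor.txt").read()
    print(generate(build_chain(corpus)))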

~~~
corobo
The flavour text was always jerkcity.txt in every markov bot I've ever come
across

------
glup
I have been working with a group that is trying to clone this dataset and make
it publicly available
([https://github.com/jcpeterson/openwebtext](https://github.com/jcpeterson/openwebtext)),
and I have noticed quite a bit of code in the scraped dataset. Future releases
of our dataset will be pre-filtered with another LSTM language model that will
filter sentences by their probability under more conversational / literary
datasets.
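
Roughly, the filtering step looks like this (the log_prob_per_token scorer is
a stand-in for whatever LSTM we actually use, and the threshold is made up):

    def filter_sentences(sentences, log_prob_per_token, threshold=-6.0):
        # log_prob_per_token: average log-probability per token of a sentence
        # under an LM trained on conversational / literary text (a stand-in
        # for the LSTM mentioned above).
        # threshold: made-up cutoff; code-like lines tend to score far lower.
        kept, dropped = [], []
        for s in sentences:
            (kept if log_prob_per_token(s) >= threshold else dropped).append(s)
        return kept, dropped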

------
pamparosendo
It will be interesting when AI finds out there's no need for it to generate
human-readable code.

------
aboutruby
With some automated formatting:
[https://pastebin.com/7F2Leqy1](https://pastebin.com/7F2Leqy1)

~~~
Navarr
What on earth did you use to format that?

This will be far more readable to PHP devs:
[https://gist.github.com/navarr/a20284c0533ea6f6ebc0946d62c96...](https://gist.github.com/navarr/a20284c0533ea6f6ebc0946d62c96ac7)

~~~
munk-a
Until GPT-2 can participate in a formatting holy war, all our jobs are
secure. It's time to get worried when it starts posting opinionated comments
on the internet about how "spaces make my code look the same on everyone's
machine"; that's when it'd be a good idea to invest in a bunker.

------
cubano
Just what I didn't need to see this morning.

I am _literally_ living in the streets, freezing my ass off and hungry,
looking for any kind of programming work for the past month, and now I have to
see some AI bot generating more inexpensive shit code that I am sure some
manager will convince themselves might get them that final career promotion by
lowering their labor costs to near zero.

WTG, geniuses, for developing AI that, before you know it, will have all of
us living in the streets and hungry...

I'll save you a spot.

~~~
PaulHoule
They're going to have to make PHP that compiles before we lose our jobs...

~~~
jayar95
PHP is interpreted o.o

------
gambler
This is very fishy. You can get code like this by substituting words in
identifier names for other words, but how can an algorithm trained on an
English dataset "learn" that keywords like 'function' and 'class' are exempt
from substitution? I know most people here have unwavering faith in the magic
of
deep neural networks, but you'd need _a lot_ of examples to deduce this with
any certainty, regardless of how you do it.

~~~
hnarn
> you'd need _a lot_ of examples to deduce this with any certainty

Are you saying that "a lot" isn't almost semi-trivial to obtain, seeing how
much code is available online?

~~~
gambler
Why would a model trained on English texts see "a lot" of PHP code? What was
the prompt used for generating this code?

~~~
IanCal
It was trained on the contents of links found on Reddit, wasn't it? Links to
sample code or Stack Overflow posts could be pretty prevalent.

~~~
gambler
So you're buying the idea that it looked at a bunch of code snippets embedded
in various pages, managed to build a sub-model for PHP (separate from all
other languages it should have encountered) and managed to generate a long,
nearly syntactically correct program uninterrupted by English text?

And while it makes tons of obvious mistakes in English (which is a much more
flexible and forgiving language), its PHP is somehow nearly syntactically
perfect?

-

Examples from the GPT-2 GitHub repo have a lot of code:

[https://raw.githubusercontent.com/openai/gpt-2/master/gpt-2-...](https://raw.githubusercontent.com/openai/gpt-2/master/gpt-2-samples/conditional-topk40.txt)

To me, this doesn't seem like an argument in favor of this model
"understanding" English (or C, or PHP). It seems more like an indication that
it memorizes way more information than the paper implies and then does clever
word substitution.

~~~
moyix
Yes, I do think that it learned a model of PHP and JavaScript syntax. 40GB of
text data is a _lot_, and PHP syntax is a lot simpler than English grammar,
which it learns quite well.

See also the example in the paper of accidentally learning to translate into
French even though they tried to remove French pages from the corpus.

------
chadbennett
Motivated by this post, I decided to test it out. It's impressive how powerful
the software is even with the limitations. I made a simple tutorial on how to
test GPT-2 out for yourself at [https://medium.com/heroic-com/how-to-quickly-generate-full-a...](https://medium.com/heroic-com/how-to-quickly-generate-full-articles-using-openais-gpt-2-3876870aeb5c)

------
scrollaway
Important note: The AI did not generate that exact version of the code. It was
_almost_ syntactically correct. Here's the diff:

[https://gist.github.com/moyix/dda9c3180198fcb68ad64c3e6bc7af...](https://gist.github.com/moyix/dda9c3180198fcb68ad64c3e6bc7afbc/revisions#diff-8074138f091ac83e2bef3faec88bdb05)

~~~
yorwba
Also, the original in
[https://raw.githubusercontent.com/openai/gpt-2/master/gpt-2-...](https://raw.githubusercontent.com/openai/gpt-2/master/gpt-2-samples/unconditional.txt)
has two more lines.

The first

// web/application/handlers/add-full-no-app-directly.md (1071 bytes)

and the last

public function registerPipeHandlerInterceptor

------
kodablah
Can someone shed more light behind this? What is the true source? Was it
generated via the unreleased full model by an OpenAI employee? Or did someone
generate it with the released "smaller model"? Can we, the curious public, see
the model and replicate the results?

~~~
moyix
I found it while browsing the "raw" samples generated by the full-sized model.
You can read through them all here:

[https://raw.githubusercontent.com/openai/gpt-2/master/gpt-2-...](https://raw.githubusercontent.com/openai/gpt-2/master/gpt-2-samples/unconditional.txt)

The PHP sample is Sample 195.

------
beager
This makes me think that something like Stack Overflow could be used to train
a model that generates code to answer a question—and that software
specifications that are decomposed into a series of requirements or
"questions" could be fed into this model to produce code that's equivalent to
a team of remote contractors.

Your model would be based on NLP/votes of the questions, NLP/votes of the
answers, and separating the text from the code in both.

The fact that many markdown/code formatting tools have you select the language
for syntax highlighting is useful for classifying code as well.
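
As a rough illustration of that last point, fenced Markdown blocks already
carry a language label a scraper could harvest (the regex and sample post are
just a sketch):

    import re

    # ```lang ... ``` fences; the language tag doubles as a training label.
    FENCE = re.compile(r"```(\w+)\n(.*?)```", re.DOTALL)

    def labeled_code_blocks(markdown_text):
        # Yield (language, code) pairs from fenced blocks in a post.
        for lang, code in FENCE.findall(markdown_text):
            yield lang.lower(), code.strip()

    post = "Try this:\n```php\necho 'hello';\n```\nor:\n```js\nconsole.log('hi');\n```"
    print(list(labeled_code_blocks(post)))
    # [('php', "echo 'hello';"), ('js', "console.log('hi');")]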

~~~
scrollaway
Finally, StackSort ([https://xkcd.com/1185/](https://xkcd.com/1185/)) could
actually be useful in a real world application.

~~~
52-6F-62
My god—the JOBINTERVIEWQUICKSORT is just brilliant.

------
rbrtdrmpc-
Look ma, no Laravel

------
gambler
BTW, I would like to point you to MIT's Genesis project as an example of what
a rule-based text comprehension system could do almost a decade ago.

