While I strongly doubt they would use Wikipedia as a training set, has anyone done a search of GitHub code to see if other projects have copied and pasted that function from Wikipedia into their more-permissive codebases?
Almost 2000 results for one of the comment lines. I'm not going to read through those or check the licenses, but I think it's safe to say that block of code exists in many GitHub codebases, and it's likely many of those have permissive licenses. Given how famous it is (for a block of code), that's not unexpected.
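For anyone who wants to reproduce that kind of count, here's a rough Python sketch against GitHub's REST code-search endpoint. The token and the query string are placeholders (substitute whichever comment line you're matching), and it assumes an authenticated request with an unqualified query is accepted; code search only covers indexed default branches and is heavily rate-limited, so treat total_count as approximate.

    import requests

    GITHUB_TOKEN = "ghp_..."  # placeholder personal access token
    # Placeholder: substitute the actual comment line you want to count.
    QUERY = '"some distinctive comment line from the function"'

    # GitHub's code-search endpoint returns a total_count alongside the first
    # page of matching files; we only need the count here.
    resp = requests.get(
        "https://api.github.com/search/code",
        headers={
            "Authorization": f"token {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        params={"q": QUERY, "per_page": 1},
    )
    resp.raise_for_status()
    print("approximate matching files:", resp.json()["total_count"])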
A question that popped into my head is: if the machine sees the same exact block of code hundreds of times, does that suggest to it that it's more acceptable to regurgitate the entire thing verbatim? Not that this would make the incident totally okay, but if it were doing this with code that existed in only a single repo, that would be much more concerning.
if the machine sees the same exact block of code hundreds of times, does that suggest to it that it's more acceptable to regurgitate the entire thing verbatim?
From a copyright standpoint, quite possibly. This is called the "Scènes à faire" doctrine: if certain elements have to appear in a roughly standard form to do a standard job, the doctrine applies and those elements aren't protected.
This would first need to be tested in court; apparently Microsoft is happy to generate thousands (or millions) of violations, knowing most programmers don't enforce their copyright.
I don't get it; that seems like standard fare for an R-rated movie. And then it seems like some people complained because they decided to start editing it down to a PG-13 movie?
Essentially, from my understanding: there was a data leak they never commented on, and they instituted a poorly made content filter without saying anything. The filter frequently produces false positives and negatives. Someone then discovered they had trained the game on content the filter was designed to block, meaning the AI itself would frequently output filter-triggering material. After a job ad went up, more people found out that their private, unpublished stories were being read by third parties, and some of those stories were posted on 4Chan; people recognized stories they had written that had triggered the filter among the ones posted. Then they started instituting no-warning bans.
I might have missed something, but that's the gist of it.
Also, before and while all this was going on, the quality of the AI's output had been steadily dropping, to the point where NovelAI.net now generates what is in many ways better writing.
That's GPT-J-6B, to be clear. A 6-billion-parameter model is producing better output than a 175-billion-parameter model, because of what I can only assume to be sheer incompetence on AI Dungeon's part. I've also used the raw GPT-3 API, and it does better at writing than either. In other words: doing nothing would have been better than whatever they've been doing.
It’s pre-trained, in part, on Wikipedia. GPT-2 did this sort of thing all the time: regurgitating examples from the fine-tuning training set by default is practically native to the architecture.
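If you want to poke at that yourself, a minimal probing sketch with the public GPT-2 checkpoint via Hugging Face transformers looks roughly like this; the prompt is just an illustrative, heavily duplicated passage, and whether any particular string comes back verbatim depends on the checkpoint and decoding settings.

    from transformers import pipeline

    # Greedy decoding makes memorized continuations easier to spot than sampling.
    generator = pipeline("text-generation", model="gpt2")

    # Illustrative prompt: the opening of a widely duplicated public-domain text.
    prompt = "We the People of the United States, in Order to form a more perfect Union,"

    out = generator(prompt, max_new_tokens=60, do_sample=False)
    print(out[0]["generated_text"])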