While I strongly doubt they would use Wikipedia as a training set, has anyone done a search of GitHub code to see if other projects have copied and pasted that function from Wikipedia into their more-permissive codebases?
Almost 2000 results for one of the comment lines. I'm not going to read through those or check the licenses, but I think it's safe to say that block of code exists in many GitHub codebases, and it's likely many of those have permissive licenses. Given how famous it is (for a block of code), that's not unexpected.
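For anyone who wants to reproduce that kind of count, here's a rough Python sketch against GitHub's REST code-search endpoint. The token and the query string are placeholders (substitute whichever comment line you're matching), and it assumes an authenticated request with an unqualified query is accepted; code search only covers indexed default branches and is heavily rate-limited, so treat total_count as approximate.

    import requests

    GITHUB_TOKEN = "ghp_..."  # placeholder personal access token
    # Placeholder: substitute the actual comment line you want to count.
    QUERY = '"some distinctive comment line from the function"'

    # GitHub's code-search endpoint returns a total_count alongside the first
    # page of matching files; we only need the count here.
    resp = requests.get(
        "https://api.github.com/search/code",
        headers={
            "Authorization": f"token {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        params={"q": QUERY, "per_page": 1},
    )
    resp.raise_for_status()
    print("approximate matching files:", resp.json()["total_count"])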
A question that popped into my head is: if the machine sees the same exact block of code hundreds of times, does that suggest to it that it's more acceptable to regurgitate the entire thing verbatim? Not that this would make the incident totally okay, but if it were doing this with code that existed in only a single repo, that would be much more concerning.
if the machine sees the same exact block of code hundreds of times, does that suggest to it that it's more acceptable to regurgitate the entire thing verbatim?
From a copyright standpoint, quite possibly. This is called the "Scènes à faire" doctrine: if certain elements have to appear in a roughly standard form to do a standard job, the doctrine applies and those elements aren't protected.
This would first need to be tested in court; apparently Microsoft is happy to generate thousands (or millions) of violations, knowing most programmers don't enforce their copyright.
I don't get it; that seems like standard fare for an R-rated movie. And then it seems like some people complained because they decided to start editing it down to a PG-13 movie?
Essentially, from my understanding: there was a data leak they never commented on, and they instituted a poorly made content filter without saying anything. The filter frequently produces false positives and negatives. Someone then discovered they had trained the game on content the filter was designed to block, meaning the AI itself would frequently output filter-triggering material. After a job ad went up, more people found out that their private, unpublished stories were being read by third parties, and some of those stories were posted on 4Chan; people recognized stories they had written that had triggered the filter among the ones posted. Then they started instituting no-warning bans.
I might have missed something, but that's the gist of it.
Also, before and while all this was going on, the quality of the AI's output had been steadily dropping, to the point where NovelAI.net now generates what is in many ways better writing.
That's GPT-J-6B, to be clear. A 6-billion-parameter model is producing better output than a 175-billion-parameter model, because of what I can only assume to be sheer incompetence on AI Dungeon's part. I've also used the raw GPT-3 API, and it does better at writing than either. In other words: doing nothing would have been better than whatever they've been doing.
It’s pre-trained, in part, on Wikipedia. GPT-2 did this sort of thing all the time: regurgitating examples from the fine-tuning training set by default is practically native to the architecture.
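If you want to poke at that yourself, a minimal probing sketch with the public GPT-2 checkpoint via Hugging Face transformers looks roughly like this; the prompt is just an illustrative, heavily duplicated passage, and whether any particular string comes back verbatim depends on the checkpoint and decoding settings.

    from transformers import pipeline

    # Greedy decoding makes memorized continuations easier to spot than sampling.
    generator = pipeline("text-generation", model="gpt2")

    # Illustrative prompt: the opening of a widely duplicated public-domain text.
    prompt = "We the People of the United States, in Order to form a more perfect Union,"

    out = generator(prompt, max_new_tokens=60, do_sample=False)
    print(out[0]["generated_text"])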