A probability distribution isn't the order of words; it's a fact about the order of words.

Pedants have been complaining about this kind of thing for years. If you generate random data, no one has a copyright on that. But if you XOR it with a copyrighted work of the same length, the result is also indistinguishable from random data. Given only the two outputs, no one could tell you which was generated randomly and which was derived from the copyrighted work. But XOR them back together again and you get the copyrighted work.
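
To make the XOR point concrete, here is a minimal Python sketch. The `xor_bytes` helper and the placeholder text are made up for illustration, and it assumes the random data is exactly as long as the work.

```python
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings together (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))

# Placeholder standing in for a copyrighted work.
work = b"Stand-in text for a copyrighted work."

# Random data that nobody holds a copyright on.
pad = secrets.token_bytes(len(work))

# Derived from the work, but on its own it looks just as random as pad.
mixed = xor_bytes(work, pad)

# XOR the two "random" blobs back together and the work reappears.
recovered = xor_bytes(mixed, pad)
```

Nothing about `pad` or `mixed` taken alone marks one as the "key" and the other as the "derived" copy; only the pair together carries the work.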

Things like that get solved pragmatically, not mathematically. There is no mathematical basis for saying that one set of random bits is infringing and the other isn't, but if you're distributing them for the sole purpose of allowing people to reconstitute the copyrighted work, you're going to be in trouble.

Now we have something with different practicalities. The purpose of training the model on existing works is so that it can, e.g., answer questions about Harry Potter, which most people want to be possible and which is the same class of thing that search engines need to be able to do. But the same model can then produce fan fiction as an emergent property, so what now?