You also have to solve the (very difficult) subtitle alignment problem before you can begin training.
I'm not saying you're wrong, necessarily. But if copyright is vague enough to allow that interpretation, that just shows how incoherent, contradictory, broken, and ultimately nonsensical copyright is.
It's not like you could take the neural net weights aggregated from thousands of movies and retrieve any form of entertainment from them. Is a derived work anything at all based on an original, or just something in a similar field, i.e. entertainment->entertainment?
> Not sure about the legal definition.
Then perhaps stating "You certainly couldn't release such a model under the GPL" with such certainty isn't a great idea?
And even so, that is no reason why we as open source collaborators cannot create a million or a billion or so samples of "Hello" in language foo as a training corpus for all to use.
But in order to use the movies for training you would need to buy thousands and thousands of DVDs.