I'm sure that nearly every DVD theatrical release has subtitles available. They also give you speech against a wide range of background noise: music, explosions, traffic, normal ambient noise, etc.
Seems like a good start for acquiring a large corpus of labelled speech.
As for errors in the subtitles, the data is still good enough. As long as the machine learning model can deal with label noise, it would simply fail to learn from the mislabelled examples and learn from the ones that are correct. It might even learn to abbreviate sentences itself, since subtitles often condense the spoken dialogue!
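One cheap way to cope with label noise, rather than hoping the model shrugs it off, is to filter the pairs first. Here's a minimal sketch, assuming a hypothetical seed model has already produced a rough transcript (`asr_text`) for each clip; the `SequenceMatcher` score and the 0.6 threshold are illustrative choices, not a recommendation:

    from difflib import SequenceMatcher

    def label_agreement(asr_hypothesis: str, subtitle: str) -> float:
        """Rough word-level agreement between a seed model's transcript and the subtitle."""
        return SequenceMatcher(None, asr_hypothesis.lower().split(),
                               subtitle.lower().split()).ratio()

    # Hypothetical (audio, subtitle) pairs; asr_text would come from the seed model.
    pairs = [
        {"asr_text": "get in the car now", "subtitle": "Get in the car, now!"},
        {"asr_text": "uh huh", "subtitle": "[explosion]"},  # non-speech caption, should be dropped
    ]

    # Keep only pairs where the subtitle plausibly matches what was said.
    clean = [p for p in pairs if label_agreement(p["asr_text"], p["subtitle"]) > 0.6]
    print(len(clean), "of", len(pairs), "pairs kept")

You'd then retrain on the kept pairs and could iterate the whole loop, which is roughly the self-training idea behind a lot of noisy-corpus bootstrapping.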
You also have to solve the (very difficult) subtitle alignment problem before you can begin training: subtitle timestamps only loosely track the actual speech, and the text is often a condensed paraphrase of what was said.
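To be clear about which part is hard: extracting the nominal cue windows is trivial. Here's a minimal sketch that parses .srt cues into (start, end, text) segments with only the standard library (the sample cues are made up); the hard part, correcting the drift between cue timing and the actual speech, would still need something like forced alignment on top of this:

    import re

    # Matches an .srt timing line, e.g. "00:00:01,500 --> 00:00:03,000".
    CUE = re.compile(
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
    )

    def to_seconds(h, m, s, ms):
        return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

    def parse_srt(srt_text):
        """Yield (start_sec, end_sec, text) for each subtitle cue."""
        cues = []
        for block in srt_text.strip().split("\n\n"):
            lines = block.splitlines()
            for i, line in enumerate(lines):
                m = CUE.match(line)
                if m:
                    start = to_seconds(*m.groups()[:4])
                    end = to_seconds(*m.groups()[4:])
                    text = " ".join(lines[i + 1:]).strip()
                    cues.append((start, end, text))
                    break
        return cues

    sample = """1
    00:00:01,500 --> 00:00:03,000
    Get in the car, now!

    2
    00:00:04,200 --> 00:00:05,100
    [explosion]"""

    for start, end, text in parse_srt(sample):
        print(f"{start:7.3f} {end:7.3f}  {text}")

Slicing the DVD's audio track at those windows gives you candidate (audio, text) pairs, but the windows are padded and drifted enough that you'd want to refine them before training on anything fine-grained.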
I'm not saying you're wrong, necessarily. But if copyright is vague enough to allow that interpretation, that only shows how incoherent, contradictory, broken, and, ultimately, nonsensical copyright is.
It's not like you could take the neural-net weights aggregated from thousands of movies and retrieve any form of entertainment from them. Is a derivative work anything at all based on an original, or only something in a similar field, i.e. entertainment -> entertainment?
> Not sure about the legal definition.
Perhaps stating "You certainly couldn't release such a model under the GPL" with such certainty isn't a great idea?
And even so, that is no reason why we, as open-source collaborators, cannot create a million or a billion samples of "Hello" in foo language as training data, a corpus for all to use.
But in order to use the movies for training, you would need to buy thousands and thousands of DVDs.