Not only is there a lot out there, but a lot of it was released by companies like IBM, Google, Yahoo, Baidu, Microsoft, etc. So while I'm generally sympathetic to the FSF's position, this case almost seems like a reversal of the usual situation: the problem isn't for-profit companies taking the fruits of volunteers' labor and building on top of them... instead, we have a surplus of riches, released as Open Source by a bunch of big companies. It just happens that most of it is under a permissive license like the ALv2.
Of course, one could argue that that state of affairs isn't natural or sustainable, and that it doesn't negate the issues the FSF is dedicated to addressing. So I support this effort, even if it seems somewhat redundant at the moment.
The hard bit is the training data. Good luck collecting 10k hours of transcribed speech, or 10k recordings of "Okay Google".
The best thing the assorted communities involved in this effort could do to accelerate the advance of open, accessible machine learning is to create good Creative Commons datasets that anyone could use to train models that could be released open source. And as an academic, let me say that hundreds of researchers would figuratively kiss the ground you walk on for doing so. :)
I'm sure that nearly every theatrical release on DVD has subtitles available. It also gives you speech against a wide range of background noise: music, explosions, traffic, normal ambient sound, etc.
Seems like a good start for acquiring a large corpus of labelled speech.
As for errors in the subtitles, the data is still good enough. As long as the machine learning model can deal with uncertainty, it would simply fail to learn from those examples and learn from the ones that are correct. It might even learn to abbreviate sentences itself, the way subtitles often do!
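One common way to "deal with uncertainty" like this is the small-loss heuristic: examples the model consistently gets wrong (e.g. a mistranscribed subtitle) keep a high loss, so you simply drop the highest-loss fraction before the next training round. A minimal sketch, with an illustrative `select_clean` helper and made-up loss values (not from any specific library):

```python
# Hypothetical sketch of the "small-loss" heuristic for noisy labels:
# keep only the examples the model finds easiest to fit, on the theory
# that persistently high-loss examples are likely mislabeled.

def select_clean(examples, losses, keep_fraction=0.8):
    """Keep the keep_fraction of examples with the smallest loss."""
    ranked = sorted(zip(losses, examples))          # low loss first
    n_keep = int(len(ranked) * keep_fraction)
    return [ex for _, ex in ranked[:n_keep]]

# Toy data: "b" and "e" have suspiciously high losses, like bad subtitles.
examples = ["a", "b", "c", "d", "e"]
losses   = [0.1, 2.5, 0.3, 0.2, 3.1]
print(select_clean(examples, losses, keep_fraction=0.6))  # -> ['a', 'd', 'c']
```

In practice you'd recompute the losses with the current model each epoch, so the "clean" subset adapts as training progresses.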
You also have to solve the (very difficult) subtitle alignment problem before you could begin training.
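That said, the subtitle files themselves give you a crude first alignment for free: each cue carries start/end timestamps, which can segment the audio before any real forced-alignment step. A minimal sketch of parsing standard SRT cues into `(start_sec, end_sec, text)` tuples (assuming well-formed `HH:MM:SS,mmm --> HH:MM:SS,mmm` lines; `parse_srt` is an illustrative name, not a library function):

```python
import re

# Matches a standard SRT timestamp line: "00:00:01,500 --> 00:00:03,000"
TIME = re.compile(r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)")

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(text):
    """Parse SRT text into (start_sec, end_sec, caption) tuples."""
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        m = TIME.match(lines[1])        # line 0 is the numeric cue index
        if m:
            start = to_seconds(*m.groups()[:4])
            end = to_seconds(*m.groups()[4:])
            cues.append((start, end, " ".join(lines[2:])))
    return cues

sample = """1
00:00:01,500 --> 00:00:03,000
Okay Google.

2
00:00:04,250 --> 00:00:06,000
Turn on the lights."""

print(parse_srt(sample))
# -> [(1.5, 3.0, 'Okay Google.'), (4.25, 6.0, 'Turn on the lights.')]
```

The timestamps are often off by a second or two relative to the actual speech, which is exactly where the hard forced-alignment work begins.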
I'm not saying you're necessarily wrong. But if copyright is vague enough to allow that interpretation, it only shows how incoherent, contradictory, broken, and ultimately nonsensical copyright is.
It's not like you could take neural net weights aggregated from thousands of movies and retrieve any form of entertainment from them. Is a derived work anything at all based on an original, or just something in a similar field, i.e. entertainment → entertainment?
> Not sure about the legal definition.
Perhaps stating "You certainly couldn't release such a model under the GPL" with such certainty isn't a great idea?
And even so, that's no reason why we as open source collaborators can't create a million, or a billion or so, samples of "Hello" in language foo as a training corpus for all to use.
But in order to use the movies for training, you would need to buy thousands and thousands of DVDs.
But yes, that is what we need more of - not matrix libraries.
Andrew Ng told a story about one of his first robots that was supposed to roam around the lab and collect coffee cups to deposit in the sink. He ran out of varieties of coffee cups to train the robot's vision well before the robot learned how to detect a coffee cup.
The key to being a successful AI company is to figure out how to get the world to send you data.
Edit: That or figure out how real brains work and how to scale them. Which is probably almost, but not quite, entirely unlike a convolutional neural net.
Edit2: This also leads us to a twist on the dogmatic refrain: if you're not the customer, you're the employee.
That, plus acquiring a team skilled enough to make good use of the code.