Hacker News

> Good luck collecting 10k hours of transcribed speech

I'm sure that nearly every DVD theatrical release has subtitles available. Speech against a wide range of background noise too, e.g. music, explosions, traffic, normal ambient noise, etc.

Seems a good start for acquiring a large corpus of labelled speech.




Aside from the potential problem with regard to copyright, it should also be noted that subtitles in general are not transcripts of dialogue. Subtitlers often have to shorten sentences of speech so that viewers have time to read before the next couple of subtitles appear on screen.


There shouldn't be any issues with copyright, as long as you aren't redistributing the original work. Otherwise all neural networks would be illegal, since most training data is copyrighted.

As for errors in the subtitles, the data is still good enough. As long as the machine learning model can deal with uncertainty, it would simply fail to learn from those examples and learn from the ones that are correct. It might even learn to abbreviate sentences itself!


Models trained on DVD audio are considered derived works. You certainly couldn't release such a model under the GPL.

You also have to solve the (very difficult) subtitle alignment problem before you could begin training.
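To make the alignment problem concrete: subtitle files at least carry timestamps, so a first step is extracting (start, end, text) segments that could then be refined against the audio. A minimal sketch of parsing the standard SRT format, assuming well-formed cues (the function names and the use of rough subtitle timings as alignment seeds are illustrative, not any established pipeline):

```python
import re
from datetime import timedelta

# SRT timestamps look like "00:01:02,345" (comma or dot before milliseconds).
TS = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def parse_ts(s):
    """Convert an SRT timestamp string to a timedelta."""
    h, m, sec, ms = map(int, TS.match(s).groups())
    return timedelta(hours=h, minutes=m, seconds=sec, milliseconds=ms)

def parse_srt(text):
    """Return (start, end, text) cues from SRT-formatted subtitle text."""
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start_s, _, end_s = lines[1].partition(" --> ")
        cues.append((parse_ts(start_s.strip()),
                     parse_ts(end_s.strip()),
                     " ".join(lines[2:])))
    return cues
```

Even with cues extracted, the hard part remains: the timestamps mark when text is displayed, not when the words are actually spoken, so you would still need forced alignment against the audio before the pairs are usable as training labels.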


I'm a neural net trained substantially on copyrighted books, music, tv, and movies. Does that mean I'm a derived work, and consequently all works I create are derived works as well?

I'm not saying you're wrong, necessarily. But the fact that copyright is vague enough to allow that interpretation shows how incoherent, contradictory, broken, and ultimately nonsensical copyright is.


What about the countless audiobooks available on archive.org [1]? Sure, you may be limited to just books in the public domain, but that's still plenty of books.

[1]: https://archive.org/details/audio_bookspoetry


Really?

It's not like you could take the neural net weights aggregated from thousands of movies and retrieve any form of entertainment from them. Is a derived work anything at all based on an original, or just something in a similar field, i.e. entertainment -> entertainment?


My own personal definition is whether the derivative work could survive if the first work did not exist, rather than the purpose for which it is meant to be consumed. Not sure about the legal definition.


Legally, there is a huge gradient between length(work), sha(work), train(transcription(work)), transcription(work), thumbnail(work), etc. Your personal definition of "derived" sounds a lot like the mathematical definition, which isn't amazingly useful in a copyright context.

> Not sure about the legal definition.

Perhaps stating "You certainly couldn't release such a model under the GPL." so surely isn't a great idea?


That actually depends on whether the audio is under copyright.

And even so, that is no reason why we as open source collaborators cannot create a million or billion or so samples of "Hello" in foo language as training data as a corpus for all to use.


> Models trained on DVD audio are considered derived works

[citation needed]


There are tons of freely available models based on copyrighted works. Are you sure this is true?

But in order to use the movies for training, you would need to buy the thousands and thousands of DVDs.



