
I agree with the general motivation that having too much AI research in the hands of software companies who keep it proprietary harms transparency and progress. But there is already a lot of neural-network free software, so why another package?

Not only is there a lot out there, a lot of it was released by companies like IBM[1], Google[2], Yahoo[3], Baidu[4], Microsoft[5], etc. So while I'm generally sympathetic to the FSF's position, this case almost seems like a bit of a reversal of things: there doesn't seem to be a problem with for-profit companies taking the fruits of the labors of volunteers and building on top of it... instead, we have a surplus of riches, released as Open Source by a bunch of big companies. It just happens that most of it is under a permissive license like the ALv2.

Of course, one could suggest that that state of affairs isn't natural and/or sustainable, and that this doesn't negate the issues the FSF is dedicated to. So I support this effort, even if it seems redundant on some level at the moment.

[1]: http://systemml.incubator.apache.org

[2]: http://tensorflow.org

[3]: http://yahoohadoop.tumblr.com/post/139916563586/caffeonspark...

[4]: https://github.com/baidu-research/warp-ctc

[5]: https://github.com/Microsoft/CNTK

Yeah, they've really missed the fact that it isn't the algorithms or code that we're missing out on. Companies are usually pretty open about these because they know that isn't the bit that is hard to compete on.

The hard bit is the training data. Good luck collecting 10k hours of transcribed speech, or 10k recordings of "Okay Google".

+this. Freely available data is a huge value to everyone. We use it in academia, and it's useful in companies big and small. (Even at Google -- I started exploring some of my research questions using MNIST and ImageNet because they're baselines that allow reproducibility, and because you don't have to deal with the privacy issues. For amusing anecdotes about this, consider what the Smart Reply team had to do: http://googleresearch.blogspot.com/2015/11/computer-respond-... It's much harder to train a network when you can't ever look at the training data!)

The best thing the assorted communities involved in this effort could do to accelerate the advance of open, accessible machine learning is to create good creative-commons datasets that anyone could use to train models that could be released open source. And as an academic, let me say that hundreds of researchers would figuratively kiss the ground you walk on for doing so. :)

>Another bizarre feature of our early prototype was its propensity to respond with “I love you” to seemingly anything. As adorable as this sounds, it wasn’t really what we were hoping for.

> Good luck collecting 10k hours of transcribed speech

I'm sure that nearly every DVD theatrical release has subtitles available. Speech against a wide range of background noise too, e.g. music, explosions, traffic, normal ambient noise, etc.

Seems a good start for acquiring a large corpus of labelled speech.

Aside from the potential problem with regard to copyright, it should also be noted that subtitles in general are not transcripts of dialogue. Subtitlers often have to shorten spoken sentences so that viewers have time to read them before the next subtitles appear on screen.

There shouldn't be any issues with copyright, as long as you aren't redistributing the original work. Otherwise all neural networks would be illegal, since most training data is copyrighted.

As for errors in the subtitles, that's still good enough. As long as the machine learning model can deal with uncertainty, it would just not learn from those examples and learn from the ones that are correct. It might even learn to abbreviate sentences itself!

Models trained on DVD audio are considered derived works. You certainly couldn't release such a model under the GPL.

You also have to solve the (very difficult) subtitle alignment problem before you could begin training.
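
To make the alignment point concrete, here is a minimal sketch that pulls cue timings out of subtitle text. It assumes the subtitles are in SubRip (SRT) form, which the thread does not specify, and `parse_srt`, `to_seconds`, and `SAMPLE` are hypothetical names invented for illustration. Note what it does not do: the timings say when a caption is on screen, not when each word is spoken, so this is only a rough starting point and not a solution to alignment.

```python
import re

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:05,000".
# Assumption: SubRip-style timings; other subtitle formats differ.
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(h, m, s, ms):
    """Convert hour/minute/second/millisecond strings to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_srt(text):
    """Return a list of (start_sec, end_sec, caption) tuples.

    The times are display times for the caption, which only loosely
    bracket the speech they correspond to.
    """
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                groups = match.groups()
                start = to_seconds(*groups[:4])
                end = to_seconds(*groups[4:])
                # Everything after the timing line is caption text.
                caption = " ".join(l.strip() for l in lines[i + 1:])
                cues.append((start, end, caption))
                break
    return cues


# Hypothetical sample data, not taken from any real release.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,250
How are you
doing today?
"""

cues = parse_srt(SAMPLE)
for start, end, caption in cues:
    print(f"{start:8.3f} {end:8.3f}  {caption}")
```

Each tuple could be used to cut a candidate audio segment, but because captions are often shortened and shown early or late, the pairs would still be noisy labels.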

I'm a neural net trained substantially on copyrighted books, music, tv, and movies. Does that mean I'm a derived work, and consequently all works I create are derived works as well?

I'm not saying you're wrong, necessarily. Since copyright is so vague as to allow that interpretation, that shows how much copyright is incoherent, contradictory, broken, and, ultimately, nonsense.

What about the countless audiobooks available on archive.org [1]? Sure, you may be limited to just books in the public domain, but that's still plenty of books.

[1]: https://archive.org/details/audio_bookspoetry


It's not like you could take the neural net weights aggregated from thousands of movies and retrieve any form of entertainment from them. Is a derived work anything at all based on an original, or just something in a similar field, i.e. entertainment->entertainment?

My own personal definition is whether the derivative work could survive if the first work did not exist, not for which purpose it was intended to be consumed. Not sure about the legal definition.

Legally, there is a huge gradient between length(work), sha(work), train(transcription(work)), transcription(work), thumbnail(work), etc. Your personal definition of "derived" sounds a lot like the mathematical definition, which isn't amazingly useful in a copyright context.
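
A rough sketch of the low end of that gradient, in Python. The function names mirror the ones in the comment above, but the bodies are assumptions made for illustration: `sha` is taken to mean SHA-256, `thumbnail` is a crude stand-in that keeps a short prefix, and `work` is a made-up byte string. It says nothing about where the law draws the line; it only shows how differently these functions preserve the original.

```python
import hashlib


def length(work: bytes) -> int:
    """A single number; countless unrelated works share it."""
    return len(work)


def sha(work: bytes) -> str:
    """A fixed-size digest; identifies the work but does not contain it."""
    return hashlib.sha256(work).hexdigest()


def thumbnail(work: bytes, keep: int = 16) -> bytes:
    """Crude stand-in for a thumbnail: a literal fragment of the work."""
    return work[:keep]


# Hypothetical stand-in for a copyrighted work.
work = b"It was a dark and stormy night; the rain fell in torrents."

print(length(work))     # one integer
print(sha(work))        # 64 hex characters, whatever the input size
print(thumbnail(work))  # an actual excerpt of the original
```

All three outputs are mathematically "derived" from `work`, yet only the last one reproduces any of it, which is the sense in which the mathematical definition is too broad to be useful here.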

> Not sure about the legal definition.

Perhaps stating "You certainly couldn't release such a model under the GPL." so surely isn't a great idea?

That actually depends on whether the audio is under copyright.

And even so, that is no reason why we as open source collaborators cannot create a million or billion or so samples of "Hello" in foo language as training data as a corpus for all to use.

> Models trained on DVD audio are considered derived works

[citation needed]

There are tons of freely available models based on copyrighted works. Are you sure this is true?

But in order to use the movies for training, you would need to buy thousands and thousands of DVDs.

That's the biggest by far and still only 1k hours.

But yes, that is what we need more of - not matrix libraries.

Huh, I wonder how much truth there is to your words. As an outsider to Machine learning and neural networks, it seems to me that algorithms can be very valuable and big companies do not lack training data. Of course, training data is expensive and very important, but if training data were that important, their most important resource would not be machine learning scientists, but an army of do-monkeys that provide training data. It won't be the victory of the smartest scientist but of the one with the most employees. And that does not seem to be the case.

It absolutely is the case. Why do you think Google has been providing such services as Google Voice, ReCaptcha, Street View, heck... Maps... and basically everything that's awesome and free? (other than the stuff that puts advertisements in your peripheral vision)

Andrew Ng told a story about one of his first robots that was supposed to roam around the lab and collect coffee cups to deposit in the sink. He ran out of varieties of coffee cups to train the robot's vision well before the robot learned how to detect a coffee cup.

The key to being a successful AI company is to figure out how to get the world to send you data.

Edit: That or figure out how real brains work and how to scale them. Which is probably almost, but not quite, entirely unlike a convolutional neural net.

Edit2: This also leads us to a twist of the dogmatic refrain: If you're not the customer, you're the employee.

It's a mix of who has the most data, who can pose the problem best, who can wield the largest computers, and who can handle the most complex algorithms.

> The hard bit is the training data.

That, plus acquiring a team skilled enough to make good use of the code.
