Not only is there a lot out there, but a lot of it was released by companies like IBM, Google, Yahoo, Baidu, and Microsoft. So while I'm generally sympathetic to the FSF's position, this case almost seems like a bit of a reversal: there doesn't seem to be a problem with for-profit companies taking the fruits of volunteers' labor and building on top of it... instead, we have a surplus of riches, released as Open Source by a bunch of big companies. It just happens that most of it is under a permissive license like the ALv2.
Of course, one could suggest that that state of affairs isn't natural and/or sustainable, and that this doesn't negate the issues the FSF is dedicated to. So I support this effort, even if it seems redundant on some level at the moment.
The hard bit is the training data. Good luck collecting 10k hours of transcribed speech, or 10k recordings of "Okay Google".
The best thing the assorted communities involved in this effort could do to accelerate the advance of open, accessible machine learning is to create good Creative Commons datasets that anyone could use to train models that could be released as open source. And as an academic, let me say that hundreds of researchers would figuratively kiss the ground you walk on for doing so. :)
I'm sure that nearly every DVD theatrical release has subtitles available. Speech against a wide range of background noise too, e.g. music, explosions, traffic, normal ambient noise, etc.
Seems like a good start for acquiring a large corpus of labelled speech.
As for errors in the subtitles, that's still good enough. As long as the machine learning model can deal with uncertainty, it would simply fail to learn from the erroneous examples and learn from the correct ones instead. It might even learn to abbreviate sentences itself!
You would also have to solve the (very difficult) subtitle alignment problem before you could begin training.
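For a feel of the naive first step, here's a rough sketch in Python (the file handling and trust-the-timestamps strategy are hypothetical simplifications, not how real pipelines do it) that pairs subtitle text with time spans from an .srt file. Real alignment still has to correct for subtitle lag, overlapping cues, and paraphrased dialogue:

    import re

    # Matches SRT timing lines like "00:01:02,500 --> 00:01:05,000"
    SRT_TIME = re.compile(
        r"(\d+):(\d+):(\d+)[,.](\d+) --> (\d+):(\d+):(\d+)[,.](\d+)")

    def parse_srt(path):
        """Yield (start_sec, end_sec, text) for each subtitle cue."""
        with open(path, encoding="utf-8", errors="replace") as f:
            blocks = f.read().split("\n\n")
        for block in blocks:
            lines = [l for l in block.splitlines() if l.strip()]
            if len(lines) < 2:
                continue
            # Some files start each cue with a numeric index, some don't.
            has_index = lines[0].isdigit()
            m = SRT_TIME.search(lines[1] if has_index else lines[0])
            if not m:
                continue
            h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
            start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000.0
            end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000.0
            text = " ".join(lines[2:] if has_index else lines[1:])
            yield start, end, text

Each (start, end, text) triple could then be used to slice the corresponding audio into a labelled training example; that's exactly where the hard alignment work begins, since subtitles routinely lead or lag the actual speech.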
I'm not saying you're wrong, necessarily. But the fact that copyright is vague enough to allow that interpretation shows how incoherent, contradictory, broken, and, ultimately, nonsensical it is.
It's not like you could take the neural net weights aggregated from thousands of movies and retrieve any form of entertainment from them. Is a derived work anything at all based on an original, or just something in a similar field, i.e. entertainment -> entertainment?
> Not sure about the legal definition.
Perhaps stating "You certainly couldn't release such a model under the GPL" with such certainty isn't a great idea?
And even so, that is no reason why we as open source collaborators cannot create a million or a billion samples of "Hello" in a given language as a training corpus for all to use.
But in order to use the movies for training, you would need to buy thousands and thousands of DVDs.
But yes, that is what we need more of - not matrix libraries.
Andrew Ng told a story about one of his first robots that was supposed to roam around the lab and collect coffee cups to deposit in the sink. He ran out of varieties of coffee cups to train the robot's vision well before the robot learned how to detect a coffee cup.
The key to being a successful AI company is to figure out how to get the world to send you data.
Edit: That or figure out how real brains work and how to scale them. Which is probably almost, but not quite, entirely unlike a convolutional neural net.
Edit2: This also leads us to a twist of the dogmatic refrain: If you're not the customer, you're the employee.
That, plus acquiring a team skilled enough to make good use of the code.
To be honest, I'm not sure how Gneural plans to compete with those packages without support for CUDA or cuDNN, both of which are distinctly not open source.
The Linux kernel undoubtedly has many features that Hurd lacks, but that is due to the latter's severe lack of manpower and the billions of dollars being poured into the former.
On the other hand the Hurd has features that the Linux kernel can never hope to achieve because of its architecture.
That's why GNU Hurd is essentially a dead project. Sadly it never attracted the attention and manpower necessary for it to survive.
> On the other hand the Hurd has features that the Linux kernel can never hope to achieve because of its architecture.
Also think of the effort it took to introduce namespaces to all the Linux subsystems. After a decade the user namespace still has problems. This is ridiculously easy on a distributed system, yet very hard on a monolithic one.
I don't see the point either. Gneural will probably never be better than Theano, Torch, TensorFlow, Caffe, et al., which are already open. If anything, time and resources are much better invested in contributing a polished, competitive OpenCL backend to one of these packages.
It seems like there must be more at play, but I'll admit a lack of insight and imagination on this one.
I think the reasons are twofold: 1. CUDA had a big head start over OpenCL. 2. NVIDIA has invested a lot in great libraries for scientific computing. E.g., they built a library of neural-net primitives on top of CUDA (cuDNN), which has been adopted by all the major packages.
AMD should have invested much more heavily in ML; if it had, its share price would probably look a bit better than it does now.
This looks interesting - running CUDA on any GPU. http://venturebeat.com/2016/03/09/otoy-breakthrough-lets-gam...
Also somewhat related: AMD seems to be moving towards supporting CUDA on its GPUs in the future: http://www.amd.com/en-us/press-releases/Pages/boltzmann-init...
It's sort of supporting CUDA, just like a car ferry sort of lets your car 'drive' across a large body of water.
Also, the premise of OpenCL is somewhat faulty. You end up optimizing for particular architectures regardless.
Yeah, this is one reason I'm really hoping some of the stuff AMD is pushing with regard to openness around GPUs gains traction. And why I'm hoping OpenCL continues to improve so that it can be a viable option. Being dependent on Nvidia for all time would blow.
I don't think this is wrong, per se, but it is... funny when the FSF portrays their work as morally superior to us horrible corporate permissive-license lovers, while inextricably depending on non-free components.
In an ideal world this project will become popular and will lead to someone on the Gneural team writing Nvidia-compatible free drivers that allow them to reject Nvidia's, but I'm not optimistic. Not because of some incompetence on the Gneural team's part, but because of Nvidia's long history of making life very difficult for open driver writers.
It will be prohibitively difficult to train models without some kind of hardware acceleration (CUDA). This means that if we're building an ImageNet object detector, even if the code implements the model correctly the first time, training it to close-to-state-of-the-art accuracy will take several consecutive months of CPU time. Torch has rudimentary support for OpenCL, but it isn't there yet. There are also very good pre-trained models under academic-only licenses that help fill the gap. (That is about as permissive as they could be licensed, since the ImageNet training data itself is under an academic-only license anyway.)
I'm not sure what niche this project fills. If you want an open-source neural network framework, you have several high-quality choices. If you need good models, you can either use any of the state-of-the-art academic-only ones, or you would have to collect a dataset entirely by yourself.
Does this necessarily follow, that a machine-learning model is a derived work of all data it's trained on? As far as I know, the law in this area isn't really settled, and many companies are operating on the assumption that this isn't the case. It would lead to absurd conclusions in some cases: for example, if you trained a model to recognize company logos, you'd need the logo owners' permission to distribute it.
(This is assuming traditional copyright law; under jurisdictions like the E.U. that recognize a separate "database right" it's another story.)
I'd like to note that some publishers, like Elsevier, give you access to their dataset (full texts of articles) under a license with the condition that you cannot freely distribute models learned from their data.
But the CPU fallback is there.
OTOH, it obviously matters a lot if you're constantly iterating and training multiple times a day or whatever.
It takes a week to train a standard AlexNet model on 1 GPU on ImageNet (and this is pretty far from state of the art).
It takes 4 GPUs 2 weeks to train a marginally-below-state-of-the-art image classifier on ImageNet (http://torch.ch/blog/2016/02/04/resnets.html) - the 101-layer deep residual network. This would be 20 weeks on an ensemble of CPUs. (State of the art is 152 layers; I don't have the numbers, but I'd guesstimate 3-4 weeks to train on 4 GPUs.)
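For rough intuition about where a figure like that comes from (the ~10x GPU-over-CPU-node speedup below is my own assumption for convnets, not a measured benchmark):

    # Back-of-envelope: 4 GPUs x 2 weeks of ResNet-101 training,
    # redone on CPU nodes assumed to be ~10x slower per device.
    gpu_device_weeks = 4 * 2                  # 8 GPU-weeks of compute
    cpu_node_weeks = gpu_device_weeks * 10    # ~80 CPU-node-weeks
    print(cpu_node_weeks / 4)                 # on 4 CPU nodes -> 20.0 weeks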
What I mean is: you're right, of course, that much better neural-network free software is already available, but GNU endorsing an official package could 1) get people whose concerns are geared more strongly toward free-software ethics to start paying attention to neural networks, and 2) make people, regardless of their ideological commitments, more aware both of the ethical issues and of neural-network software, just by virtue of GNU's mild fame.
And of course, unless the maintainer of this package makes weird choices and alienates other projects, there's the further benefit of projects learning from each other and poaching code and strategies for the greater good.
I hope that future versions take inspiration from other open source machine learning libraries, which show how to use linear algebra and backpropagation and are much more effective.
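To make that concrete, here's a minimal sketch of what "using linear algebra" buys you: a single dense layer trained by backpropagation, with the whole batch handled as matrix operations instead of per-neuron loops (numpy; the toy data, learning rate, and step count are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 10))                          # batch of 64 inputs
    y = (X.sum(axis=1, keepdims=True) > 0).astype(float)   # toy binary labels

    W = rng.normal(scale=0.1, size=(10, 1))
    b = np.zeros((1, 1))

    for step in range(500):
        z = X @ W + b                  # forward pass: one matrix multiply
        p = 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
        grad_z = (p - y) / len(X)      # gradient of mean cross-entropy wrt z
        W -= 0.5 * (X.T @ grad_z)      # backward pass: again just matrix algebra
        b -= 0.5 * grad_z.sum(axis=0, keepdims=True)

Every mature library (Theano, Torch, TensorFlow) is ultimately a heavily optimized, automatically differentiated version of this pattern.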
- FANN seems like a pretty good alternative
- The value at the big "monopolies" lies in the data, not necessarily the software
- This needs to be in some publicly accessible repo. Downloading a zip file and submitting patches? I thought we, as a society, were over that way of building OSS.
However, I find it quite amusing (and perhaps "out of place") that the maintainer uses a Gmail address.
It is a valid viewpoint to regard money-driven companies as a "bad thing" (or more exactly, companies whose main goal is to maximize shareholder value).
Data is the commodity. If you want to guess at the stock market from, say, what people were publicly discussing last week, there is nowhere to get good raw data except through Twitter. Much of that data is closed off or incomplete even through their API. There is no other option except to create another Twitter.
Am I mistaken, or is the source repository for this project just tarballs checked into CVS?
Nvidia has a near-complete monopoly on deep learning hardware and tooling. With the possible exception of Google (and maybe Facebook), essentially 100% of serious academic researchers are training their models on Nvidia hardware with Nvidia's proprietary CUDA toolkit. Using anything else is currently unthinkable. Amazon and Nvidia have even teamed up to make CUDA training cheap (in the short term) for EC2 users.
I'd love to be able to switch to OpenCL, but there's so much momentum and very little perceived benefit when your lab already has four (very expensive) Titan X cards.
GPL -> guarantee of OSS for the desktop
AGPL (Affero) -> guarantee of OSS for the cloud
??? -> guarantee of OSS for NNs
Now, Google does have access to a whole lot of data that the rest of the world doesn't. And FB, Google, etc. have more than a bit of a hardware advantage... for now, at least. Distribute a shared system over a P2P infrastructure, and you could change that. Perhaps rather significantly.
Shit, maybe I'm an AI.
I wish they had taken the initiative much sooner.
However, META ICBM is a joke as old as the META tag, which I guess dates to 1995.