I'm still bummed that with all these companies implementing voice recognition, there still isn't anything close to a FOSS option. It's a major field, and the kind of software that takes a huge amount of work to get right. I feel like in the future free operating systems are going to look archaic without it, but it does not seem like the kind of thing any small club of friends can pick up and build to match Google or Apple.
The same applies to OCR and other photo recognition techniques like faces or red eye. Tesseract is probably the largest free software OCR project, but it still seems to do so much worse than the proprietary Adobe and Microsoft products. At least the OCR reader that came with my S4 does a terrible job, though it might be using Tesseract behind the scenes, since I think it's the one from F-Droid.
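For anyone who wants to test Tesseract directly rather than through whatever app wraps it, here's a minimal sketch using the pytesseract Python wrapper (assuming the tesseract engine is installed; 'scan.png' is a hypothetical input image):

    from PIL import Image
    import pytesseract

    # Run Tesseract OCR on an image; results depend heavily on scan
    # quality and preprocessing (binarization, deskewing, resolution).
    text = pytesseract.image_to_string(Image.open('scan.png'))
    print(text)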
digiKam does all right at red-eye correction, but it does it with a layered filter rather than any actual recognition of eyes. It can also sometimes find faces, but not nearly as accurately as Google can.
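That filter approach amounts to a per-pixel operation: inside a user-selected region, find pixels where red strongly dominates the other channels and desaturate them. A rough numpy sketch of the idea (the threshold is a made-up value, not digiKam's actual algorithm):

    import numpy as np

    def reduce_red_eye(img, threshold=1.8):
        # img: HxWx3 uint8 RGB array, e.g. a crop around the eye.
        # A pure per-pixel filter: no eye detection involved.
        rgb = img.astype(np.float32)
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        # "Redness" mask: red channel much larger than the green/blue mean.
        mask = r > threshold * (g + b) / 2.0
        out = rgb.copy()
        # Replace red with the green/blue mean, which darkens the pupil.
        out[..., 0][mask] = (g[mask] + b[mask]) / 2.0
        return out.astype(np.uint8)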
All these fuzzy pattern-recognition fields take huge code bases and a lot of R&D to get right, and from what I can see nobody in the free software movement has the organization, or just the raw bank, to make them happen. Red Hat surely isn't investing in them (kind of outside their enterprise/server domain), and they're about the only company prominent and powerful enough to do it.
There's no shortage of FOSS software for automatic speech recognition. Most academic research these days uses the Kaldi toolkit (http://kaldi.sourceforge.net/about.html), which has an Apache license.
There are (at least) three problems that prevent the widespread availability of FOSS speech recognizers:
1. Data. Large corpora are available via Penn's Linguistic Data Consortium (LDC). Big academic institutions can afford their all-inclusive licenses; small corporations have to settle for the expensive a la carte options, and startups and hobbyists have to go without. Fortunately, there is now more and more freely available data, such as the LibriSpeech database (http://www.voxforge.org/home/forums/message-boards/audio-dis...), which is extracted from the LibriVox collection of public-domain audiobooks.
2. Task-specificity. Speech recognition systems need extensive customization to the intended use case to achieve good performance. Conventional wisdom is that your recognizer can be tuned to work with a wide variety of speakers, or support a large vocabulary, but not both. This customization requires lots of time, data, and expertise.
3. Expertise. Speech recognition development is a PhD-level activity. After 5 years of stable or declining real income, smart students either go work for big bucks at a large multinational, or can't get a visa and go back to their home country.
While there's some hope for #1, #2 and #3 aren't going away anytime soon.
> it does not seem like the kind of thing any small club of friends can pick up and build to match Google or Apple
Actually I think it's not out of the question now. The recent advances in recognition accuracy are mostly due to deep neural nets. The research is all published open access, and the cutting-edge tools are mostly open source (Theano, Torch, Caffe). Training neural nets is a lot simpler than the old methods of doing speech recognition; I think it's much more accessible to a small team. The only really difficult requirement is lots and lots of clean labeled data for training.
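To make "simpler" concrete: the core of training any of these nets is the same forward/backward/update loop, with no speech-specific machinery. A toy two-layer classifier on synthetic data, in plain numpy (sizes and learning rate are arbitrary):

    import numpy as np

    rng = np.random.RandomState(0)

    # Toy labeled data: 256 examples, 20 features, 4 classes.
    X = rng.randn(256, 20)
    y = rng.randint(0, 4, size=256)

    # Two-layer net: 20 inputs -> 32 hidden units -> 4 classes.
    W1 = rng.randn(20, 32) * 0.1; b1 = np.zeros(32)
    W2 = rng.randn(32, 4) * 0.1;  b2 = np.zeros(4)

    lr = 0.1
    for step in range(500):
        # Forward pass: tanh hidden layer, softmax output.
        h = np.tanh(X @ W1 + b1)
        logits = h @ W2 + b2
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # Backward pass: gradient of mean cross-entropy loss.
        d_logits = p.copy()
        d_logits[np.arange(len(y)), y] -= 1
        d_logits /= len(y)
        dW2 = h.T @ d_logits; db2 = d_logits.sum(axis=0)
        dh = (d_logits @ W2.T) * (1 - h ** 2)
        dW1 = X.T @ dh; db1 = dh.sum(axis=0)
        # Plain gradient descent update.
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print("train accuracy:", (logits.argmax(axis=1) == y).mean())

Swap the synthetic X and y for acoustic frames and phone or character labels and you have the skeleton of an acoustic model; tools like Theano mostly add GPU speed and automatic differentiation on top of this loop.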
I don't really see how the "old" methods were less accessible. There were tools such as HTK, CMU Sphinx, etc., or SRILM for language modelling, each with documentation and a large user base. Granted, a lot of fiddling is involved if one wants to use speaker adaptive training (MLLR, VTLN), feature transforms (HLDA, MLLT), MLP features (TANDEM), etc., but DNN approaches come with their own set of screws to tweak.
It's just hard to make something work really well for a specific use case; when contributors to an open-source project are all trying to scratch their own itch (make it work for their specific use [language, vocabulary, etc.]), the result may not be universally satisfying.
The difference is that the old methods were large systems made up of many different pieces that all required a ton of domain knowledge specific to speech and language. Training DNNs requires a lot of knowledge about DNNs, but not nearly as much knowledge about speech. Knowledge of how to train DNNs is highly transferable between domains like speech and vision. Similarly, the actual code can be mostly shared as well; something like Theano would be just as suited to running speech nets as vision nets.
I don't think we're quite there yet, but DNNs have the potential to replace every piece of the speech pipeline with one single net that gets audio samples on one side and spits out characters on the other. All those acronyms you mentioned (with many, many PhD theses behind them) will be irrelevant, in the same way that tons of previously successful specialized computer vision feature detectors (HOG, SIFT, SURF, etc.) are now irrelevant to the state of the art in object recognition.
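Here's a sketch of what that interface looks like, just to show how little structure is left. The weights below are untrained random numbers (in a real system they'd come from training on transcribed audio, with something like a CTC loss), so the output is gibberish, but the entire "pipeline" is one function:

    import numpy as np

    rng = np.random.RandomState(0)
    CHARS = list("abcdefghijklmnopqrstuvwxyz '") + ['<blank>']

    # Untrained random weights, purely to illustrate the data flow.
    W1 = rng.randn(160, 128) * 0.01
    W2 = rng.randn(128, len(CHARS)) * 0.01

    def recognize(samples):
        # Audio samples in, characters out; no hand-built feature pipeline.
        # Slice the waveform into fixed-size frames (10ms at 16kHz).
        n = len(samples) // 160
        frames = samples[:n * 160].reshape(n, 160)
        # Per-frame character scores from a tiny two-layer net.
        h = np.tanh(frames @ W1)
        best = (h @ W2).argmax(axis=1)
        # CTC-style greedy decode: collapse repeats, drop blanks.
        out, prev = [], None
        for i in best:
            if i != prev and CHARS[i] != '<blank>':
                out.append(CHARS[i])
            prev = i
        return ''.join(out)

    print(recognize(rng.randn(16000)))  # random noise in, gibberish out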
A lot of the methods I listed don't have anything to do with speech per se. They DO have something to do with how to use data to e.g. remove unwanted variability or achieve better class separation. What makes them old methods is that they were developed in the context of using Gaussian mixture models to model Hidden Markov model state output probabilities. As such you could perhaps apply them to classifying birdsong, gunshots, or whatever else (with varying success, of course).
I have no doubt that these methods and their acronyms will become irrelevant (perhaps they already are), but I guess some of the basic underlying ideas about variability will re-emerge in the training regime of DNNs. Sure, the algorithms (and implementations) for training DNNs are the same across domains, but these ideas get incorporated into the preparation and handling of the training data (compare data augmentation, like creating translated copies of images).
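For audio, the analogue of translated images would be something like random time shifts and added noise, applied fresh to each training example; a hypothetical minimal version (shift size and noise level made up):

    import numpy as np

    def augment(samples, rng, max_shift=800, noise_std=0.005):
        # Cheap audio augmentation: random time shift plus Gaussian noise.
        # The net never sees the exact same waveform twice, so shift- and
        # noise-invariance get baked into the model rather than the features.
        shift = rng.randint(-max_shift, max_shift + 1)
        return np.roll(samples, shift) + rng.randn(len(samples)) * noise_std

The knowledge about which variability is harmless (small shifts, noise, speaking rate) moves out of the feature pipeline and into the data preparation.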
Your prediction that DNNs will replace much of the pipeline is very interesting to me, but I hypothesize that you're at least partially wrong. I predict that DNNs will impact early stages of the pipeline which operate on continuously valued inputs, but I am skeptical that DNNs will ultimately be the best solution for late discrete processing (e.g., decoding, language modeling). That DNNs ever perform well in discrete classification tasks just tells me we haven't spent enough time feature-engineering.
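To be concrete about what I mean by late discrete processing, here's a toy count-based bigram language model of the sort that sits at the end of the conventional pipeline (add-one smoothing for brevity; a real system would use SRILM or similar with far better smoothing):

    from collections import Counter

    def train_bigram_lm(sentences):
        # Count-based bigram LM with add-one smoothing: purely symbolic,
        # no continuously valued inputs anywhere.
        unigrams, bigrams = Counter(), Counter()
        vocab = set()
        for s in sentences:
            words = ['<s>'] + s.split() + ['</s>']
            vocab.update(words)
            unigrams.update(words[:-1])
            bigrams.update(zip(words[:-1], words[1:]))
        V = len(vocab)
        def prob(prev, word):
            return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
        return prob

    p = train_bigram_lm(["the cat sat", "the dog sat", "a cat ran"])
    print(p('the', 'cat'))  # seen bigram: relatively high
    print(p('cat', 'dog'))  # unseen bigram: smoothed, small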
It's the ability of DNNs to replace feature engineering that makes them interesting. They have completely obsoleted feature engineering in object recognition in just a few short years. Have you seen the latest DNN results in translation and image captioning? I think DNNs are quickly going to surpass the state of the art in language modeling.