I've done something similar with Sphinx, which worked quite well too.
However, I feel the real challenge in productizing such a technology would be making the system recognise utterances spoken casually, over loud music, or over friends talking in the background.
Also, how would the speech-to-text system recognise the user's commands over sound playing from its own speaker? Suppose I set up a text-to-speech system to read a book to me. I want to be able to make it repeat a paragraph (because my attention has drifted). However, the microphone's input channel would contain not only me shouting "stop", but also the output of the text-to-speech system.
We humans have a way of factoring out our own voice when we are listening to someone else. Is there a way to make machines do that?
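For the machine's own output there is a standard answer: acoustic echo cancellation. Since the system knows exactly what it is playing, it can adaptively estimate how that signal arrives back at the microphone and subtract it. A minimal sketch using a normalized LMS (NLMS) adaptive filter, assuming you already have the microphone samples and the playback reference as NumPy arrays (the function and parameter names here are mine, not from any particular library):

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic`.

    mic: microphone samples (near-end speech + echo of `ref`)
    ref: the signal the device itself is playing (far-end reference)
    Returns the residual, i.e. an estimate of the near-end speech.
    """
    w = np.zeros(taps)           # adaptive FIR estimate of the echo path
    buf = np.zeros(taps)         # most recent reference samples, newest first
    out = np.empty(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)    # shift the delay line
        buf[0] = ref[n]
        echo_est = w @ buf       # predicted echo at this sample
        e = mic[n] - echo_est    # what remains after removing the echo
        # NLMS update: step size normalized by the reference energy
        w += mu * e * buf / (buf @ buf + eps)
        out[n] = e
    return out
```

Real echo cancellers (such as the ones in WebRTC or speexdsp) add double-talk detection and operate on frames in the frequency domain, but the core idea is this adaptive subtraction of a known reference.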
(These are my humbly presented thoughts, I am certainly no expert)
The crux of any method of differentiating the user's signal from background noise, I imagine, would look to leverage data that has a higher signal-to-noise ratio than the naive microphone recording of the environment.
At a high level, this is how I would expect we humans do it: we have some sort of sensory data from feeling the vibrations of our own voice, and we use it to "factor out" our voice from what our ears perceive.
As for how this might work for factoring background noise out of a microphone recording: if we had some kind of hardware (phone, smartwatch, keyboard) that could sense the vibrations of the user's voice (a phone in the pocket, a watch on the wrist, a keyboard you rest your palms on as you type), we might be able to apply signal processing algorithms to that signal and the environmental recording to extract the user's voice.
(One might ask "well, why not just use the signal from the hardware then?". In terms of quality, I'm not sure it would be enough---intuitively, I'm guessing it would sound muffled if you played back a recording of your voice as picked up by your palms resting on a keyboard. But it might be good enough to let a signal processing algorithm tune our noisy, but high-quality, microphone data toward the user's voice.)
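To make the side-channel idea concrete, here is a deliberately crude sketch: treat the vibration sensor as a voice-activity reference and gate the microphone signal frame by frame, keeping frames where the sensor shows energy and attenuating the rest. The function name, threshold rule, and parameters are all my own invention for illustration, not a real algorithm from any library:

```python
import numpy as np

def vibration_gated_denoise(mic, vib, frame=256, hop=128, floor=0.1):
    """Gate `mic` using a body-vibration signal `vib` as a rough
    voice-activity reference: frames where the vibration sensor shows
    energy keep full level, the rest are attenuated toward `floor`.
    Overlap-add with a Hann window, then renormalize per sample."""
    out = np.zeros(len(mic))
    norm = np.zeros(len(mic))
    win = np.hanning(frame)
    thresh = np.mean(vib ** 2)   # crude global activity threshold
    for start in range(0, len(mic) - frame + 1, hop):
        seg = mic[start:start + frame] * win
        active = np.mean(vib[start:start + frame] ** 2) > thresh
        gain = 1.0 if active else floor
        out[start:start + frame] += gain * seg
        norm[start:start + frame] += win
    return out / np.maximum(norm, 1e-8)
```

A real system would do something far smoother in the spectral domain (and learn the acoustic mapping between the two sensors), but this shows the shape of the idea: the low-fidelity channel only needs enough SNR to tell the algorithm *when and where* the user's voice is.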
One problem I have with things like this is non-customizable hot words. For example, Google Now has a feature where it always listens for "Ok Google" as long as your device is plugged in, which is great. The problem is that I have multiple devices, and saying "Ok Google" wakes all of them up, which is never what I want.
I think it can be used on other operating systems with little (if any) change, as long as you have all the required Python libraries. But I've only tested it on Ubuntu.
That would be great!
I have some future plans for using it with some robots (https://www.youtube.com/watch?v=X5nTHwl0K_w), with smart-house technology, like "turn on/off the lights".