I've done something similar with Sphinx, which worked quite well too.
However, I feel the real challenge in productizing such a technology would be making the system recognise utterances spoken casually, over loud music, or over friends talking in the background.
Also, how would the speech-to-text system recognise the user's commands over sound playing from its own speaker? Suppose I set up a text-to-speech system to read a book to me. I want to be able to make it repeat a paragraph (because my attention has drifted). However, the microphone's input channel would contain not only me shouting "stop", but also the output of the text-to-speech system.
We humans have a way of factoring out our own voice when we are listening to someone else. Is there a way to make machines do that?
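For the machine's own output there is a standard answer: acoustic echo cancellation. Since the system knows exactly what it is playing, it can adaptively estimate how that signal arrives back at the microphone and subtract it. A minimal sketch using a normalized LMS (NLMS) adaptive filter, assuming you already have the microphone samples and the playback reference as NumPy arrays (the function and parameter names here are mine, not from any particular library):

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic`.

    mic: microphone samples (near-end speech + echo of `ref`)
    ref: the signal the device itself is playing (far-end reference)
    Returns the residual, i.e. an estimate of the near-end speech.
    """
    w = np.zeros(taps)           # adaptive FIR estimate of the echo path
    buf = np.zeros(taps)         # most recent reference samples, newest first
    out = np.empty(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)    # shift the delay line
        buf[0] = ref[n]
        echo_est = w @ buf       # predicted echo at this sample
        e = mic[n] - echo_est    # what remains after removing the echo
        # NLMS update: step size normalized by the reference energy
        w += mu * e * buf / (buf @ buf + eps)
        out[n] = e
    return out
```

Real echo cancellers (such as the ones in WebRTC or speexdsp) add double-talk detection and operate on frames in the frequency domain, but the core idea is this adaptive subtraction of a known reference.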
(These are my humbly presented thoughts, I am certainly no expert)
The crux of any method of differentiating the user's signal from background noise, I imagine, would look to leverage data that has a higher signal-to-noise ratio than the naive microphone recording of the environment.
At a high level, this is how I would expect we humans do it: we have some sort of sensory data from feeling the vibrations of our own voice, and we use it to "factor out" our voice from what our ears perceive.
As for how this might work for factoring background noise out of a microphone recording: if we had some kind of hardware (phone, smartwatch, keyboard) that could sense the vibrations of the user's voice (a phone in the pocket, a watch on the wrist, a keyboard you rest your palms on as you type), we might be able to apply signal processing algorithms to that signal and the environmental recording to extract the user's voice.
(One might ask "well, why not just use the signal from the hardware then?". In terms of quality, I'm not sure it would be enough---intuitively, I'm guessing it would sound muffled if you played back a recording of your voice as picked up by your palms resting on a keyboard. But it might be good enough to let a signal processing algorithm tune our noisy, but high-quality, microphone data toward the user's voice.)
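To make the side-channel idea concrete, here is a deliberately crude sketch: treat the vibration sensor as a voice-activity reference and gate the microphone signal frame by frame, keeping frames where the sensor shows energy and attenuating the rest. The function name, threshold rule, and parameters are all my own invention for illustration, not a real algorithm from any library:

```python
import numpy as np

def vibration_gated_denoise(mic, vib, frame=256, hop=128, floor=0.1):
    """Gate `mic` using a body-vibration signal `vib` as a rough
    voice-activity reference: frames where the vibration sensor shows
    energy keep full level, the rest are attenuated toward `floor`.
    Overlap-add with a Hann window, then renormalize per sample."""
    out = np.zeros(len(mic))
    norm = np.zeros(len(mic))
    win = np.hanning(frame)
    thresh = np.mean(vib ** 2)   # crude global activity threshold
    for start in range(0, len(mic) - frame + 1, hop):
        seg = mic[start:start + frame] * win
        active = np.mean(vib[start:start + frame] ** 2) > thresh
        gain = 1.0 if active else floor
        out[start:start + frame] += gain * seg
        norm[start:start + frame] += win
    return out / np.maximum(norm, 1e-8)
```

A real system would do something far smoother in the spectral domain (and learn the acoustic mapping between the two sensors), but this shows the shape of the idea: the low-fidelity channel only needs enough SNR to tell the algorithm *when and where* the user's voice is.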
One problem I have with things like this is non-customizable hot words. For example, Google Now has a feature where it always listens for "Ok Google" as long as your device is plugged in, which is great. The problem is that I have multiple devices, and saying "Ok Google" wakes all of them up, which is never what I want.
I think it can be used on other operating systems with little (if any) change, as long as you have all the required Python libraries. But I've only tested it on Ubuntu.
That would be great!
I have some future plans for using it with some robots (https://www.youtube.com/watch?v=X5nTHwl0K_w), with smart-house technology, like "turn on/off the lights".