NatI: Multi-language voice control system for Ubuntu written in Python (github.com/rcorcs)
59 points by rcorcs on July 25, 2014 | 22 comments



I've done something similar with Sphinx, which worked quite well too.

However, I feel the real challenge in productising such a technology would be making the system recognise utterances spoken casually, over loud music, or over friends talking in the background.

Also, how would the speech-to-text system recognise the user's commands over sound playing from its own speaker? Suppose I set up a text-to-speech system to read a book to me. I want to be able to make it repeat a paragraph (because my attention has drifted). However, the microphone's input channel would contain not only me shouting "stop", but also the output of the text-to-speech system.

We humans have a way of factoring out our own voice when we are listening to someone else. Is there a way to make machines do that?


(These are my humbly presented thoughts; I am certainly no expert.)

The crux of any method of differentiating the user's signal from background noise, I imagine, would be to leverage data that has a higher signal-to-noise ratio than the naive microphone recording of the environment.

At a high level, this is how I would expect we humans do it: we have some sort of sensory data from feeling the vibrations of our own voice, and we use it to "factor out" that voice from what our ears perceive.

In terms of how this might work for factoring background noise out of a microphone recording: if we had some kind of hardware (phone, smartwatch, keyboard) able to sense the vibrations of the user's voice (a phone in the pocket, a watch on the wrist, a keyboard you're resting your palms on as you type), we might be able to run signal processing algorithms over that signal and the environmental recording to isolate the user's voice.

(One might ask "well, why not just use the signal from the hardware then?". In terms of quality, I'm not sure it would be enough; intuitively, I'm guessing it would sound muffled if you played back a recording of your voice as picked up by your palms resting on a keyboard. But it might be good enough to let a signal processing algorithm tune our noisy but high-quality microphone data towards the user's voice.)
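
To make that concrete, here is a very rough sketch of the masking idea, assuming the microphone signal and the auxiliary vibration signal are already the same length, time-aligned, and at the same sample rate; the soft mask and its scaling are pure guesswork on my part:

    import numpy as np
    from scipy.signal import stft, istft

    def mask_with_aux(mic, aux, fs=16000, nperseg=512):
        # STFTs of the noisy-but-high-quality mic signal and of the
        # low-fidelity-but-user-only auxiliary (vibration) signal
        _, _, M = stft(mic, fs, nperseg=nperseg)
        _, _, A = stft(aux, fs, nperseg=nperseg)
        # use the aux magnitudes as a soft mask: keep time-frequency
        # cells where the user's voice has energy, attenuate the rest
        mask = np.abs(A) / (np.abs(A).max() + 1e-9)
        _, cleaned = istft(M * mask, fs, nperseg=nperseg)
        return cleaned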


It is doable with some analysis of the sound signal. Honda's ASIMO does it, somehow. But I think it is still an active research topic.
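
For the narrower case upthread, where the interference is the system's own text-to-speech output, nothing exotic is needed: the machine knows exactly what it is playing, so it can subtract an adaptively filtered copy of that reference from the microphone signal (acoustic echo cancellation). A toy NLMS sketch in NumPy, assuming float arrays; the function and its parameters are mine, not anything NatI actually does:

    import numpy as np

    def nlms_echo_cancel(mic, reference, filter_len=256, mu=0.5, eps=1e-6):
        # adaptive FIR filter that learns the speaker-to-mic echo path
        w = np.zeros(filter_len)
        out = np.zeros_like(mic)
        for n in range(filter_len - 1, len(mic)):
            x = reference[n - filter_len + 1:n + 1][::-1]  # recent reference samples
            e = mic[n] - np.dot(w, x)                 # mic minus estimated echo
            w += (mu / (np.dot(x, x) + eps)) * e * x  # normalised LMS update
            out[n] = e                                # what remains is (mostly) the user
        return out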


Any links to relevant material? I'm fine with delving into current research topics.


These are very interesting papers from Google: http://research.google.com/pubs/SpeechProcessing.html


One problem I have with things like this is non-customizable hot words. For example, Google Now has a feature where it always listens for "Ok Google" as long as your device is plugged in, which is great. The problem is that I have multiple devices, and saying "Ok Google" wakes all of them up, which is never what I want.


Sounds like you need to give each device a name. Jarvis!
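
For what it's worth, CMU Sphinx (mentioned upthread) can do this offline with an arbitrary phrase. With the pocketsphinx Python bindings it's roughly the following; the threshold needs tuning per phrase:

    from pocketsphinx import LiveSpeech

    # wake each machine on its own phrase instead of a shared "Ok Google"
    for phrase in LiveSpeech(keyphrase='hey jarvis', kws_threshold=1e-20):
        print('wake word heard:', phrase)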


I've been working a lot with Julius, another open-source speech recognition engine.

How easy did you find it to work with NLTK?

Did you have a lot of NL experience before this project? Do you think something like NLTK would be easy to pick up?


I've studied NLTK before, from the book Natural Language Processing with Python. I even use NLTK within NatI.
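
For anyone curious, this is the flavour of thing NLTK gives you (a minimal taste, not necessarily how NatI uses it): tokenise a recognised utterance, then part-of-speech tag it so you can pick out the verb and its object:

    import nltk
    nltk.download('punkt', quiet=True)
    nltk.download('averaged_perceptron_tagger', quiet=True)

    tokens = nltk.word_tokenize("turn off the lights in the kitchen")
    print(nltk.pos_tag(tokens))
    # roughly: [('turn', 'VB'), ('off', 'RP'), ('the', 'DT'),
    #           ('lights', 'NNS'), ('in', 'IN'), ...]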


If it is written in Python, is this really only relevant to Ubuntu?


I think it can be used on other operating systems with little (if any) change, as long as you have all the required Python libraries. But I've only tested it on Ubuntu.


It would be nice if we all contributed a little bit to improve the repository :) I've already started giving a little of my time to it.


Hm, something like this would be nice in conjunction with my XBMC HTPC.


That would be great! I have some future plans for using it with robots (https://www.youtube.com/watch?v=X5nTHwl0K_w) and with smart-home technology, like "turn the lights on/off".


I thought the exact same thing :D
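
Wiring that up is mostly glue: XBMC exposes a JSON-RPC API over HTTP, so a recognised command just has to be mapped to a call like the sketch below (Python 2 in keeping with the era; host and port are whatever you've set in XBMC's network settings):

    import json
    import urllib2

    def xbmc_call(method, params=None, host='localhost', port=8080):
        payload = json.dumps({'jsonrpc': '2.0', 'id': 1,
                              'method': method, 'params': params or {}})
        req = urllib2.Request('http://%s:%d/jsonrpc' % (host, port), payload,
                              {'Content-Type': 'application/json'})
        return json.loads(urllib2.urlopen(req).read())

    # e.g. map a recognised "pause" utterance to:
    xbmc_call('Player.PlayPause', {'playerid': 1})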


I don't know about you all, but I don't know if I trust a project that checks .pyc files into version control.


Actually, these files are not needed; you can delete them and just execute the Python files.


It shows a lack of care and craftsmanship.

If someone can't be arsed to borrow a .gitignore (and GitHub will give you a fairly complete one for Python), what other corners will they cut?
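
For reference, the relevant lines from GitHub's stock Python .gitignore:

    *.py[cod]
    __pycache__/

plus a one-off "git rm --cached '*.pyc'" to untrack the ones already committed.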


So why, then, should they be in version control?


What ASR engine are you using?



I have just updated to use Google Speech API v2.
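
For anyone wanting to try that API from Python: the SpeechRecognition package wraps the same v2 endpoint (I don't know whether NatI calls it directly or goes through a library, so take this as one common route, not necessarily NatI's):

    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.Microphone() as source:   # needs PyAudio installed
        audio = r.listen(source)
    try:
        # swap the language code, e.g. 'pt-BR', for multi-language use
        print(r.recognize_google(audio, language='en-US'))
    except sr.UnknownValueError:
        print('could not understand audio')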



