
Ask HN: Open-source voice assistants like Siri? Or can I build one on my own? - mettamage
For a data-sensitive, home-grown project I want to integrate an open-source voice assistant or roll my own.

Do you guys have any tips or experience with this and how to get started? I expect there to be some gotchas that I am unaware of.

The voice assistant does not need to be perfect but does need to be good. It should capture at least 80% of what I say in formal non-slang English correctly. I want to be able to speak in sentences, like I do with Siri.

What would your approach be to build or integrate this? Is it even feasible?

I am willing to invest one to two months full-time on learning the required machine learning. I currently know basic neural nets (Michael Nielsen’s book), basic statistics (e.g. logistic regression) and basic machine learning (SVM, kNN, PCA, random forest, decision trees, bag of words).
======
synesthesiam
You might be interested in trying Rhasspy:
[https://github.com/synesthesiam/rhasspy](https://github.com/synesthesiam/rhasspy)

Rhasspy lets you describe the set of sentences you want to speak using a
simple grammar with annotations for named entities
([https://rhasspy.readthedocs.io/en/latest/training/#sentences...](https://rhasspy.readthedocs.io/en/latest/training/#sentencesini)).
It outputs JSON over HTTP/Websockets/MQTT, so it works well with NodeRED, Home
Assistant, etc.

Disclaimer: I created and maintain Rhasspy.

~~~
nmstoker
That looks really impressive, integrating a number of open source tools with a
simple but nice UI.

Do you have any measures of how well it recognises spoken commands?

And have you seen anyone using it with non-American accents for English? (I
ask because it relies on the CMU dictionary, and the tools I've seen that use
it tend to struggle with other accents, understandably.)

~~~
synesthesiam
Thanks! In tests with a high quality headset mic, I've been able to get up to
98% (intent) recognition accuracy. Keep in mind that this is for a closed
domain with only ~600k possible sentences. I'd expect performance degradation
with a free-standing mic like the PS3 eye, but I don't have any measurements
yet.

Because Rhasspy combines both a speech and intent recognizer, transcription
word error rate isn't as important if the correct intent/named entities can
still be recognized. When your intents have more distinct vocabularies, your
recognition rate will be higher.
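
A toy illustration of that point, assuming a simple bag-of-words matcher. This
is not Rhasspy's actual recognizer, and the intent names and vocabularies are
made up. It only shows how a transcript with a wrong word can still land on the
right intent when the intents share no vocabulary:

```python
# Hypothetical intents, each with its own distinct vocabulary.
INTENTS = {
    "LightOn": {"turn", "on", "light", "lamp"},
    "GetTime": {"what", "time", "is", "it"},
}


def recognize_intent(transcript):
    """Return the intent whose vocabulary overlaps most with the transcript."""
    words = set(transcript.lower().split())
    scores = {
        name: len(words & vocab) / len(vocab)
        for name, vocab in INTENTS.items()
    }
    return max(scores, key=scores.get)


# "light" was mis-transcribed as "night", but the intent is still recovered.
print(recognize_intent("turn on the night"))
# "time" was mis-transcribed as "dime".
print(recognize_intent("what dime is it"))
```

If the two vocabularies overlapped heavily, a single wrong word could tip the
score the other way, which is why distinct vocabularies help.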

I've had folks with British and Australian accents use Rhasspy successfully
(with the CMU English 5.2 acoustic model). Some recent tests with an nnet3
Kaldi model have produced better results
([https://github.com/gooofy/zamia-speech](https://github.com/gooofy/zamia-speech)),
especially with noise. I plan to add this model to Rhasspy in the near future.

If you're specifically thinking of an Indian accent, I noticed that CMU
recently published an Indian English acoustic model
([https://sourceforge.net/projects/cmusphinx/files/Acoustic%20...](https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Indian%20English/)).
I'd be happy to add this to Rhasspy if you're interested.

------
melling
I’ve seen this one mentioned:

[https://mycroft.ai/](https://mycroft.ai/)

[https://github.com/MycroftAI](https://github.com/MycroftAI)

~~~
mehhh
There is also this [https://snips.ai/](https://snips.ai/)

------
52-6F-62
There's Snips, which has something of a developed ecosystem surrounding it now:

[https://snips.ai/developers/](https://snips.ai/developers/)

[https://github.com/snipsco](https://github.com/snipsco)

------
nmstoker
This is definitely feasible. How you go about it will depend on whether you
are happy to go for tools like those mentioned here or want to roll your own
(which I expect would be enjoyable but much slower).

If you roll your own, you'll probably still want to reuse existing components
for wake words, ASR and TTS, simply training them for your specific needs.

One tip: aim for higher than that 80% target - the sentence you mention that
in had 16 words, so you'd expect 3.2 errors if you read that, which will
quickly get annoying (it could throw your intent recognition off completely).
If you've got the ability to restrict it to a narrow vocab then you can train
a language model just with the minimal words needed and that should help the
word error rate dramatically.
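
The arithmetic behind that tip, as a quick sketch. It assumes every word is
recognised independently with the same accuracy, which is a simplification,
and the function names are just for illustration:

```python
def expected_word_errors(num_words, word_accuracy):
    """Average number of wrong words in a sentence of num_words words."""
    return num_words * (1.0 - word_accuracy)


def prob_sentence_all_correct(num_words, word_accuracy):
    """Chance that every word is right, assuming independent word errors."""
    return word_accuracy ** num_words


# Compare the 80% target with two higher per-word accuracies
# for a 16-word sentence.
for accuracy in (0.80, 0.95, 0.99):
    errors = expected_word_errors(16, accuracy)
    all_correct = prob_sentence_all_correct(16, accuracy)
    print(f"accuracy={accuracy:.2f} expected_errors={errors:.2f} "
          f"p_all_correct={all_correct:.3f}")
```

Under that assumption, at 80% per-word accuracy a 16-word sentence comes out
fully correct only about 3% of the time.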

------
beshrkayali
I tried a few of the available open-source assistant-like systems on my
Raspberry Pi and found that none of them is even half as good as Google's or
Apple's, which is expected.

After thinking about it though, I found that I don't need the voice
recognition at all really. What I really wanted was a device that can help me
do a few things well, mainly listening to the radio and announcing calendar
events and train times (Stockholm in my case), so I just built that into a
Raspberry Pi with a tiny screen and a few buttons. This little device is more
than enough for my needs.

~~~
kleer001
After the initial excitement over the last six months with my new phone and
new assistant I found I was using it less and less. It just takes so much
longer than pressing the buttons and is far less accurate with complex things.
But then again I haven't been trained on what it can do.

------
digital_voodoo
I'm open-source oriented, and others have mentioned Mycroft so I'd add Leon AI
[https://getleon.ai/](https://getleon.ai/)

~~~
ginger_beer_m
It's useless because the core machine learning part is implemented in
JavaScript (Node.js). The problem is that most of the state-of-the-art code
out there is written in Python.

------
rotorblade
I have been wondering something similar. Just a simple voice-command thing.
Not a general-purpose, any-voice assistant, but just being able to say a word
10 times to train it (form some average) and then assign a command to that
word. Something like what CellWriter does for handwriting, but for voice. Does
this exist?
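
The "form some average" idea amounts to nearest-centroid matching. A minimal
sketch, assuming each recording has already been reduced to a fixed-length
feature vector; the feature extraction step is not shown, and the command names
and numbers below are made up:

```python
import math


def centroid(vectors):
    """Element-wise average of equally long feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def train(examples):
    """Build one averaged template per command from its example vectors."""
    return {command: centroid(vectors) for command, vectors in examples.items()}


def classify(templates, vector):
    """Return the command whose template is closest (Euclidean distance)."""
    return min(templates, key=lambda command: math.dist(templates[command], vector))


# Hypothetical training data: three recordings per command, two features each.
examples = {
    "lights_on": [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]],
    "play_radio": [[0.1, 1.0], [0.0, 0.9], [0.2, 1.1]],
}
templates = train(examples)
print(classify(templates, [0.8, 0.3]))
```

Real recordings vary in length and timing, so the hard part is getting
comparable feature vectors in the first place; this sketch skips that entirely.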

------
beamatronic
There was a voice controlled toy robot, Verbot, in the 1980s. Surely we can
improve on that.

------
couss
Like Iron Man?

------
sohodlers
I heard there are some projects on GitHub. You may want to check them out.

