
Ask HN: Non-cloud voice recognition for home use? - rs23296008n1
I'd like home-based voice recognition without some off-site cloud.

I'd like a kind of Echo Dot-like thing running on a set of Raspberry Pi devices, each with a microphone and speaker. Ideally they'd be all over the house. I'm happy if they talk back via wifi to a server in my office for whatever real processing. The server might have 16 cores and 128GB RAM. Might even have two of these if required.

What options do I have? What limits? I'd really prefer answers from people who have experience with the various options.

If it helps, I'm happy to reduce the vocabulary to a dictionary of words, as long as I can add more words as necessary. Training is also ok. I've already analysed my voice conversations with an Echo Dot and the vocabulary isn't that large.

Please remember: home use, no off-site clouds. I'm not interested in options involving even a free speech-to-text cloud. This eliminates Google voice recognition, Amazon, etc. They are great but out of scope.

So far I've identified CMU Sphinx as a candidate, but I'm sure there are others.

Ideas?
======
romwell
TL;DR: Win 10 IoT for RasPi does it.

-----------------

Windows 10 IoT for Raspberry Pi comes with offline speech recognition API.

At a hackathon, it was not hard to slap together some code that turns on a light when someone says "banana".

Sounds like exactly what you need.

>If it helps I'm happy to reduce vocabulary to a dictionary of words

You can do that with an XML grammar file for offline recognition[4].

[1] [https://docs.microsoft.com/en-us/windows/iot-core/tutorials/...](https://docs.microsoft.com/en-us/windows/iot-core/tutorials/rpi)

[2] [https://docs.microsoft.com/en-us/windows/iot-core/extend-you...](https://docs.microsoft.com/en-us/windows/iot-core/extend-your-app/speech)

Someone's demo project:

[3][https://www.hackster.io/krvarma/rpivoice-051857](https://www.hackster.io/krvarma/rpivoice-051857)

[4] [https://docs.microsoft.com/en-us/windows/uwp/design/input/sp...](https://docs.microsoft.com/en-us/windows/uwp/design/input/speech-recognition)

~~~
lucb1e
This is really interesting, but I have a few questions:

- The setup guide shows a Windows system making a Windows IoT version. Can't I just download an ISO and flash it to an SD card with dd? Does it need a license?

- The demo projects show C#, and while I can develop in MonoDevelop, I don't have a Windows machine to compile it with. Is a C# compiler included in Windows IoT's .NET distribution, or are there also cross-platform (interpreted) languages that run on Windows IoT (e.g. Python 3)?

~~~
Jaruzel
So, recently I played with Windows10 IoT.

Win10 IoT is written to the SD card and left to first-boot inside the Pi (this bit takes AGES). While it's doing that, you install the Win IoT dashboard toolkit onto your PC (Windows only). The Dashboard will find your Win10 IoT Pi on the network; there are a few demo apps pre-installed that you can play with. It's a free OS, but you need to pay for a dev licence if you want it not to reboot every 24 hours. (There's also an on-screen non-production warning.)

Now you fire up Visual Studio, which has gained the ability to build C# apps on ARM. You write a small app, including using the visual form designer, and you debug using the PC-based Win IoT emulator, or you deploy it directly (using VS) to the Pi.

Once you are happy with your app, you have to bake it into a new OS image that
gets written to the SD card(s) for proper deployment.

Win10 IoT can only run one app in the foreground. It does not have a classic
desktop, which should be fine for embedded or kiosk type applications.

Personally, I found it clunky and slow, even on a fast Pi. There are also a fair number of restrictions applied to your app (think the same restrictions as an Android or iOS app), so if you are used to your C# app having full (read) access to the machine it's running on, you won't get that on Win IoT.

If you want to develop GUI-rich apps on the Pi, there are far better alternatives (Mono, Python/GTK, etc. on Raspbian).

This is just my take on Win10 IoT. I'm a Windows guy by profession so I don't
have an anti-MS bias here.

------
ftyers
Mozilla DeepSpeech, trained on the Common Voice dataset for English. You can get pretrained models too. They have a nice Matrix channel where you can get help, and pretty good documentation. It is also actively developed by several engineers.
[http://voice.mozilla.org/en/datasets](http://voice.mozilla.org/en/datasets)
and
[http://github.com/mozilla/DeepSpeech/](http://github.com/mozilla/DeepSpeech/)
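
If you want a feel for the API, a minimal transcription sketch with the `deepspeech` pip package and a pretrained model looks something like this (file names are illustrative, and the constructor/scorer setup has changed between releases, so check the docs for your version):

    import wave

    import numpy as np
    import deepspeech

    # Load a pretrained acoustic model downloaded from the DeepSpeech releases page.
    model = deepspeech.Model("deepspeech-models.pbmm")

    # DeepSpeech expects 16 kHz, 16-bit mono PCM audio.
    with wave.open("utterance.wav", "rb") as wf:
        audio = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

    print(model.stt(audio))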

~~~
kingo55
Interesting. Are there any projects actively using this today?

~~~
nmstoker
I haven't tried it directly myself, but there is this project, Dragonfire, which looks quite reasonable and uses DeepSpeech:

[https://github.com/DragonComputer/Dragonfire](https://github.com/DragonComputer/Dragonfire)

There's a minimal demo app I put together here too:
[https://github.com/nmstoker/SimpleSpeechLoop](https://github.com/nmstoker/SimpleSpeechLoop)

------
albertzeyer
Are you searching for a complete solution, including NLP and an engine to perform actions? Some of these are already posted, like Home Assistant and Mycroft.

Sphinx is just for the automatic speech recognition (ASR) part. But there are
better solutions for that:

Kaldi ([https://kaldi-asr.org/](https://kaldi-asr.org/)) is probably the most comprehensive ASR solution, and it yields very competitive, state-of-the-art results.

RASR ([https://www-i6.informatik.rwth-aachen.de/rwth-asr/](https://www-i6.informatik.rwth-aachen.de/rwth-asr/)) is for non-commercial use only but otherwise similar to Kaldi.

If you want a simpler ASR system, nowadays end-to-end models perform quite well. There are a huge number of projects which support these:

RETURNN ([https://github.com/rwth-i6/returnn](https://github.com/rwth-i6/returnn)) is non-commercial and TF-based. (Disclaimer: I'm one of the main authors.)

Lingvo
([https://github.com/tensorflow/lingvo](https://github.com/tensorflow/lingvo)),
from Google, TF-based.

ESPnet ([https://github.com/espnet/espnet](https://github.com/espnet/espnet)),
PyTorch/Chainer.

...

~~~
rs23296008n1
Already got the action engine: all the lights, HVAC, TV, calculator, computers, etc. are all controllable. None require internet now, or any kind of location services for that matter.

I really just want the speech-to-text. Ideally I'd also like it to recognise
who's talking. But that's a bonus.

~~~
thomas536
I'll second the recommendation for Kaldi. It's more complicated to get running than pocketsphinx, but in my experience Kaldi has better accuracy and lower latency in general cases (with the caveats below).

[https://github.com/gooofy/zamia-speech/](https://github.com/gooofy/zamia-speech/) has been training good [acoustic] models which are worth looking at (including training for robustness against noise). They've also got lots of code, docker images, and documentation.

pocketsphinx isn't actually that bad to use with their latest acoustic models and small vocabularies (so its utility depends on your exact use case). But it's not generally good with far-field mics or DSP-processed audio, not really good with noise, and in my experiments not quite as fast as Kaldi.
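
Since the OP named CMU Sphinx as a candidate: the lowest-effort way to try pocketsphinx from Python is a sketch like this (assuming the older `pocketsphinx` pip bindings with their bundled US English model; keyword arguments such as `hmm=`, `lm=`, and `dic=` point it at custom models, and the API has changed between releases):

    from pocketsphinx import LiveSpeech

    # Stream from the default microphone and print each recognized phrase.
    for phrase in LiveSpeech():
        print(phrase)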

Better/larger language models in my experience make a _world of difference_ (esp. in the general-vocabulary case) for improving accuracy with either Kaldi or pocketsphinx. Nobody really seems to talk about this(?), since everyone always uses the news corpus from like the 80s as the default language model.

I haven't really ever gotten the various ~deepspeech systems working, so I
can't speak to them.

~~~
rs23296008n1
I'm happy to feed it plenty of voice logs as well as a training corpus as
necessary. Sounds like an interesting journey.

------
daanzu
I develop Kaldi Active Grammar [1], which is mainly intended for use with strict command grammars. Compared to normal language models, these can provide much better accuracy, assuming you can describe (and speak) your command structure exactly. (This is probably more acceptable for a voice assistant whose audience is more technical.) The grammar can be specified by an FST, or you can use KaldiAG through Dragonfly, which allows you to specify grammars (and their resultant actions) in Python. However, KaldiAG can also do simple plain dictation if you want.

KaldiAG has an English model available, but other models could be trained. Although you can't just drop in and use a standard Kaldi model with KaldiAG, the modifications required are fairly minimal and don't require any training or modification of its acoustic model. All recognition is performed locally and offline by default, but you can also selectively choose to do some recognition in the cloud.

Kaldi generally performs at the state of the art. As a hybrid engine, although training can be more complicated, it generally requires far less training data to achieve high accuracy compared to "end-to-end" engines.

[1] [https://github.com/daanzu/kaldi-active-grammar](https://github.com/daanzu/kaldi-active-grammar)

~~~
daanzu
Too late to edit, but I should probably have noted that KaldiAG also makes it easy to define "contexts" in which (groups of) commands are active for recognition. For example, if the TV is on, you could have commands for adjusting the volume, etc. But if it is off, those commands are disabled, so they can't be recognized; further, the engine knows this and can therefore better recognize the other commands that remain active.

~~~
barrystaes
Could Home Assistant use such commands by running a Docker container?

Also, the video demo is rather impressive in how accurately (and predictably) it recognises.

~~~
daanzu
I don't know much about Home Assistant, but that certainly should be possible to set up. The KaldiAG API is pretty low-level, but basically: you define a set of rules, send in audio data along with a bit mask of which rules are active at the beginning of each utterance, and receive back the recognized rule and text. The easy solution is probably to go through Dragonfly, which makes it easy to define the rules, contexts, and actions. It might be a little hacky, but you should be able to wire it up with Home Assistant somehow.
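
As a toy illustration of that contract (plain Python, deliberately not the real KaldiAG API, which decodes audio rather than strings):

    # Toy model of "rules + per-utterance active bit mask" recognition.
    RULES = [
        ("TvVolumeUp",   {"volume up", "louder"}),
        ("TvVolumeDown", {"volume down", "quieter"}),
        ("LightsOn",     {"turn on the lights"}),
    ]

    def recognize(text, active_mask):
        """Return (rule_name, text) for the first *active* rule matching text.

        Inactive rules can never be recognized, which is what lets the
        engine better recognize whatever remains active.
        """
        for (name, phrases), active in zip(RULES, active_mask):
            if active and text in phrases:
                return name, text
        return None, text

    # TV is on: volume rules active, lights rule disabled for this utterance.
    print(recognize("volume up", [True, True, False]))           # ('TvVolumeUp', ...)
    print(recognize("turn on the lights", [True, True, False]))  # (None, ...)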

Although I mainly use it for computer control as demonstrated in the video, I
do have many commands akin to home automation, like adjusting the lights,
HVAC, etc.

------
guptaneil
Disclaimer: I am the founder of Hiome, a smart home startup focused on private-by-design, local-only products.

What actions are you looking to handle with the assistant?

The reason I ask is that a voice assistant is a command-line interface with no auto-complete or visual feedback. It doesn't scale well as you add more devices or commands to your home, because it becomes impossible to remember all the phrases you programmed. We've found the person who sets up the voice assistant will use it for simple tasks like "turn off all lights", but nobody else benefits, and it gets little use beyond timers and music. They are certainly nice to have, but they don't significantly improve the smart home experience.

If you’re looking to control individual devices, I suggest taking a look at
actual occupancy sensors like Hiome ([https://hiome.com](https://hiome.com)),
which can let you automate your home with zero interaction so it just works
for everyone without learning anything (like in a sci-fi movie). Even if
you’re the only user, it’s much nicer to never think about your devices again.

Happy to answer any questions about Hiome or what we’ve learned helping people
with smart homes in general! -> neil@hiome.com

~~~
rs23296008n1
Sure. We've got a house with multiple buildings, including sheds, halls, etc.

Around 100 people need separate profiles, and each should be able to set alarms, timers, reminders, etc. If they want a routine to create any of those, or to tell them the time or date or temperature, they should be able to do that from any of the voice assistants in any room. They might only want such a routine in a particular room. They should be able to define a home device and a current device. The home device would usually be in a bedroom, for those of us that need them, etc.

I definitely don't want to have to create any of those routines etc. for any of them. Nothing about these should be fixed in stone. They have to be able to self-serve. We can assume they can navigate the iOS Amazon app as a baseline level of knowledge.

Room settings include temperature, lighting, curtains, TV on/off, channel, and volume, to name a few. The voice assistant in some rooms should be able to show web pages on-screen, or even YouTube etc., including the laptop someone plugged in on HDMI1.

...the coffee machine automation is also a requirement. It's controlled by a Flask app. The voice control should let you order a coffee: strong, black. Or a Dave#5.

We'd also like device detection to trigger when people's phones appear in
certain locations.

What kinds of options exist for this?

~~~
guptaneil
That's definitely a new situation we haven't seen before!

Are the 100 people using all of the different rooms, or do people mostly stick
to their own rooms (like a hotel/dorm)?

I'd love to see what you've built so far, and better understand the problems
you're trying to fix. For example, does each room have its own coffee machine,
or is it a communal coffee machine? Are the people living here permanently or
rotating regularly? What is the goal for device detection (e.g., do you want
to use that for presence detection or as a security system or something else)?

We have a prototype for a machine learning system that learns how you use your devices and then automates them by itself, so you don't have to set anything up. Our focus is lights because that's what most people have, but it can also control other on/off things like curtains or a TV right now. It sounds like it could be a good fit for a situation like this, and I'd be happy to chat with you more about whether it makes sense to try out!

Can you send me an email (neil@hiome.com)?

------
DataDrivenMD
Have you considered the Almond integration for Home Assistant?
([https://www.home-assistant.io/integrations/almond/](https://www.home-assistant.io/integrations/almond/))

Alternatively, you could just fork the Almond project directly and take it from there: [https://github.com/stanford-oval/almond-cloud](https://github.com/stanford-oval/almond-cloud)

~~~
rs23296008n1
No, I hadn't. This was why I asked: I know there are alternatives out there, but there's so much noise.

Thanks.

~~~
Shawnecy
Almond has been posted on HN previously[1]. User voltagex_ commented[2] that self-hosting, while possible, is not recommended in the official installation instructions[3], as it is considered significantly more challenging to manage. This may or may not affect your decision to go forward with Almond.

[1]
[https://news.ycombinator.com/item?id=17532003](https://news.ycombinator.com/item?id=17532003)

[2]
[https://news.ycombinator.com/item?id=17534793](https://news.ycombinator.com/item?id=17534793)

[3] [https://github.com/stanford-oval/almond-cloud/blob/master/do...](https://github.com/stanford-oval/almond-cloud/blob/master/doc/installing-almond-cloud.md)

~~~
saghm
Is there any product where self-hosting _isn't_ more difficult? That seems
like a generic warning that could apply to pretty much any product in this
space. It seems more like a warning to non-technical users who might not have
the experience or know-how to successfully set up a server.

~~~
Shawnecy
If it were simply an issue of being "more difficult", then it wouldn't be worth pointing out.

However, the words "significantly more challenging to manage" are straight from the documentation I linked, which I think makes it worth pointing out.

Whether or not it is too challenging is for each individual to decide for
themselves.

~~~
amenod
It also gives an indication of what the preferred method of deployment is (from the authors' point of view). In this case I read it as a warning that it might stop being supported in the future.

------
perturbation
If you don't mind getting your hands dirty a bit, I think Nvidia's model [Jasper](https://arxiv.org/pdf/1904.03288.pdf) is near SOTA, and they have [pretrained models](https://ngc.nvidia.com/catalog/models/nvidia:jaspernet10x5dr) and [tutorials / scripts](https://nvidia.github.io/NeMo/asr/tutorial.html) freely available. The first is in their library "NeMo", but they also have it available in [vanilla PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper) as well.

~~~
homarp
And there's a version for the Jetson Nano:
[https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_co...](https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/speech2text/jasper-Mini-for-Jetson.py)

with install scripts:
[https://github.com/NVIDIA/OpenSeq2Seq/blob/master/scripts/je...](https://github.com/NVIDIA/OpenSeq2Seq/blob/master/scripts/jetson_install_dependencies.sh)

------
nshm
You are welcome to try Vosk

[https://github.com/alphacep/vosk-api](https://github.com/alphacep/vosk-api)

Advantages are:

1) Supports 7 languages - English, German, French, Spanish, Portuguese, Chinese, Russian

2) Works offline, even on lightweight devices - Raspberry Pi, Android, iOS

3) Install it with a simple `pip install vosk`

4) Model size per language is just 50MB

5) Provides a streaming API for the best user experience (unlike the popular SpeechRecognition Python package)

6) There are APIs for other programming languages too - Java/C#, etc.

7) Allows quick reconfiguration of the vocabulary for best accuracy (see the sketch below)

8) Supports speaker identification besides simple speech recognition
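
A minimal sketch of the streaming API from Python (paths here are illustrative; point `Model` at an unpacked model directory):

    import json
    import wave

    from vosk import KaldiRecognizer, Model

    model = Model("model")               # unpacked model directory (~50MB)
    wf = wave.open("command.wav", "rb")  # 16 kHz, 16-bit mono PCM

    # Optional third argument restricts the vocabulary (point 7 above), e.g.:
    # rec = KaldiRecognizer(model, wf.getframerate(),
    #     '["turn on the light", "turn off the light", "[unk]"]')
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if not data:
            break
        if rec.AcceptWaveform(data):
            print(json.loads(rec.Result())["text"])

    print(json.loads(rec.FinalResult())["text"])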

~~~
rs23296008n1
Sounds great, thanks

------
notemaker
[https://rhasspy.readthedocs.io](https://rhasspy.readthedocs.io)

Haven't used it, but seems very nice.

[https://youtu.be/ijKTR_GqWwA](https://youtu.be/ijKTR_GqWwA)

~~~
synesthesiam
Rhasspy author here in case you have any questions :)

If you're looking for something for the command line, check out [https://voice2json.org](https://voice2json.org)

~~~
rs23296008n1
I think I'll dig in first. Cheers.

------
lukifer
I’m currently assembling an offline home assistant setup using Node-RED and voice2json, all running on Raspberry Pis:

[http://voice2json.org/](http://voice2json.org/)

[https://nodered.org/](https://nodered.org/)

Requires a little customization and/or coding, but it’s quite elegant, and all voice recognition happens on-device. Part of what makes the recognition much more accurate (subjectively, 99%ish) is the constrained vocabulary; the grammars are compiled from a simple user-defined markup language, and then parsed into JSON intents containing both the full text string and the appropriate keywords/variables split out into slots.
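
To give a flavor of the pipeline, here's a small sketch gluing the voice2json CLI together from Python (it assumes a trained profile; the subcommand names are from the voice2json docs, but the exact output fields are worth verifying against your version):

    import json
    import subprocess

    # Transcribe a WAV file, then turn the transcription into a JSON intent.
    transcription = subprocess.run(
        ["voice2json", "transcribe-wav", "turn_on_the_lights.wav"],
        capture_output=True, text=True, check=True,
    ).stdout

    intent = subprocess.run(
        ["voice2json", "recognize-intent"],
        input=transcription, capture_output=True, text=True, check=True,
    ).stdout

    result = json.loads(intent)
    print(result["intent"]["name"], result.get("slots"))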

I just finished a similar rig in my car, acting as a voice-controlled MP3 player, with thousands of artists and albums compiled into intents from an iTunes XML database. It works great, and it feels awesome to have a little 3-watt baby computer doing a job normally delegated to massive corporate server farms. ;)

~~~
stragies
Hi Lukifer, thanks for chiming in! I had a setup using Snips that I'm looking to replace. Please do document your setup and your little helper scripts in a blog post or such, and ping me/us :)

------
carbon85
I was not able to find the same article online, but Volume 72 of Make Magazine has a great overview of different non-cloud voice recognition platforms. Here is a preview:
[https://www.mydigitalpublication.com/publication/?m=38377&i=...](https://www.mydigitalpublication.com/publication/?m=38377&i=649256&p=0)

~~~
jvyduna
Today they published an online version which covers many of the platforms
listed in these comments:

[https://makezine.com/2020/03/17/private-by-design-free-and-p...](https://makezine.com/2020/03/17/private-by-design-free-and-private-voice-assistants/)

------
awinter-py
Important question.

I think there's a group of highly technical people who feel increasingly left behind by 'convenience tech' because of what they have to give up in order to use it.

~~~
rs23296008n1
Well, I've got all the gadgets controllable now without internet as a requirement. Only the voice part requires it now, for us at least. Google Home / Amazon Echo devices and phone apps can communicate with the house and surrounds without issue.

Loss of internet access is not an excuse for ignoring basic voice commands, in my opinion.

Privacy is also an important factor but not the primary driver for us.

------
skamoen
I've read good things about Mycroft [1], though I haven't tried it myself. It does tick all the boxes.

[1] [https://mycroft.ai/](https://mycroft.ai/)

~~~
ghaff
Mycroft can only kinda sorta be run offline, according to the project/company: [https://mycroft-ai.gitbook.io/docs/about-mycroft-ai/faq](https://mycroft-ai.gitbook.io/docs/about-mycroft-ai/faq)

------
reaperducer
I wish you luck with this, and more importantly, hope that it inspires many
people to start building similar projects.

I know virtually nothing about voice recognition, but my spidey sense tells me
that it should be possible with the hardware you specify.

A Commodore 64 with a Covox VoiceMaster could recognize voice commands and trigger X-10 switches around a house. (Usually. My setup had about a 70% success rate, but that was pretty good for the time!) Surely a 16-core, 128GB RAM machine should be able to do far more.

~~~
rs23296008n1
It's beginning to take shape. I've already got a bunch of good candidates for experimentation next week, based on the answers so far.

------
otodic
My company develops SDKs for on-device speech recognition on Android/iOS:
[https://keenresearch.com/keenasr-docs](https://keenresearch.com/keenasr-docs)
(Raspberry Pi is an option too; we'll have a GA release in Q2)

We license this on a commercial basis but would be open to indie-developer-friendly licensing. We offer a trial SDK that makes testing/evaluation super easy (it works for 15 minutes at a time).

Ogi

ogi@keenresearch.com

~~~
lunixbochs
> Currently, the SDK supports English and Spanish out of the box. Additional
> ASR Bundles for most major spoken languages can be provided upon request
> within 6-8 weeks.

Does this mean you have a standing offer to train a new language on demand?

~~~
otodic
Yes, for a number of European and some Asian languages we can do this; it's mainly a question of business opportunity.

~~~
lunixbochs
That's very cool. Did you somehow solve autogenerating a training corpus for a new language, as long as it's popular enough? Based on my impressions from working with other people in the space, coming up with the training data seems like the big bottleneck, as the available engines are already pretty good at learning new languages.

~~~
otodic
We have relevant training data for a number of languages.

------
winkelwagen
I've had some good experience with [https://snips.ai](https://snips.ai). Works as advertised, easy to implement. The hardest thing was getting the microphone and the Pi to get along.

~~~
arendtio
Have you visited the website lately? It doesn't seem to be an option anymore :-/

~~~
rs23296008n1
They only seem to do audio equipment. Did they once do things that were more general?

~~~
arendtio
They were acquired by Sonos. A few months ago they were offering tools to
build your own speech recognition agents.

------
JanisL
I was one of the maintainers of the Persephone project, which is an automated phonetic transcription tool. It came about from a research project that required a non-cloud solution. The project is open source and can be found on GitHub:

[https://github.com/persephone-tools](https://github.com/persephone-tools)

This may be a little too low-level for what you want, as there's no language model, but maybe it's helpful as part of your system.

------
gibs0ns
I was in the process of planning my multi-room voice AI setup based on SnipsAI (to be integrated with Home Assistant) when it was announced they were bought by Sonos, which killed their open source project. Since then I have been trying various projects to find one that meets my needs.

Among those, I tried Mycroft, which still requires a cloud account to configure various things, and it doesn't support a multi-room setup at this time.

I've since switched to Rhasspy, which offers a larger array of config options and engines, and also multi-room (I've yet to configure multi-room, though).

In the long term I plan to "train" the voice AI for various additions, including a custom wake word - no, I'm not calling it `Jarvis` ;)

I'm running each of these voice AIs on a Raspberry Pi 4 (4GB model), though I'm considering switching them to Pi 3s. I'm using the `ReSpeaker 2mic Pi-Hat` on each Pi for the mic input. I'm planning to configure all the satellite nodes (the voice AI in each room) to PXE boot, so that they don't require an SD card and I can easily update their images/configs from a central location.

~~~
OlympusMonds
Is it possible to make your config available, e.g. on GitHub?

I'm just starting to get going with Rhasspy, integrating it with Home Assistant, and the docs miss just enough that I hit walls every time I try.

Thanks for the info you've already provided though; it sounds like I want exactly what you do.

------
villgax
Google has papers on on-device speech recognition; these models are used in the keyboard & for Live Caption on Pixel devices.

~~~
teapourer
They are trained on a ton of non-public data though, and I’m not sure if pre-
trained models are around.

~~~
villgax
Nope, they aren't available. YouTube videos with closed captions, or radio broadcasts + transcripts, could prove helpful for multiple languages, as well as for creating a multilingual ASR.

------
coryrc
I tried to use Julius for this. I may have misconfigured it, but it would always match something to whatever it was hearing. I mapped some sounds in my grammar to error terms that it would detect in quiet noise (like 'aa' and 'hh'), but it would still occasionally match words when nothing was going on.

Later I worked on the Microsoft Kinect, with its 4-microphone array. With only a single microphone, it's so much harder to filter out background noise. If you don't find a system based on multiple microphones, I don't believe you can be successful if there's any ongoing noise (dishwasher, loud fans, etc.), but a system that works only in quiet conditions is possible.

------
beerandt
HomeSeer automation software has this built in, with client listening apps for different platforms. I haven't used the voice recognition beyond testing, but I've been very happy with the software overall. It's relatively expensive, but it goes on sale for about half price once or twice a year. There's a free 30-day trial.

I think there are two ID phrases per sub-device by default, but using virtual devices vastly expands the software's capability, especially for mapping virtual switches to multiple devices.

They also have Z-Wave devices that are, for the most part, much better than most.

[https://www.homeseer.com](https://www.homeseer.com)

------
Havoc
>I'm happy if they talk back via wifi to a server in my office for whatever real processing. The server might have 16 cores and 128GB RAM.

Pretty sure Mycroft is capable of that - in theory - but you'll need to configure it manually. The standard Raspberry Pi route isn't powerful enough for local processing.

Check out ReSpeaker for a Raspberry Pi microphone. You'll want one of the more expensive ones for range, though at like 40 bucks they're not that wildly expensive.

Make sure it's a Pi 4, since the wake word is processed locally. And you probably don't need 128GB RAM - no idea what they use, but I doubt it's that much.

~~~
rs23296008n1
As discussed elsewhere in this thread by others, Mycroft can't do offline processing, according to their FAQ at least.

128GB is the minimum I use for general-purpose servers, so this machine would be a repurposed machine rather than something specially ordered or built.

~~~
Havoc
>Mycroft can't do offline processing, according to their FAQ at least.

Not sure what FAQ you're reading there. There is a whole section on STT engines you can plug into it.

[https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customiz...](https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customizations/stt-engine)

Including a local one. It's gonna be a pain in the ass, but as I said, it's possible.

------
abrichr
Precise [1], Snowboy [2], and Porcupine [3] are all designed to work offline.

[1] [https://github.com/MycroftAI/mycroft-precise](https://github.com/MycroftAI/mycroft-precise)

[2] [https://github.com/kitt-ai/snowboy](https://github.com/kitt-ai/snowboy)

[3]
[https://github.com/Picovoice/porcupine](https://github.com/Picovoice/porcupine)

~~~
lukifer
All good projects (I’m using Snowboy), but these are all just wake-word engines, which are very different beasts from full speech recognition.

------
Animats
Is there a voice dialer for Android that doesn't use Google? There used to be, but it disappeared in the "upgrade" that has the mothership listening all the time.

------
LargeWu
The most recent edition of Make magazine had a pretty good overview of some different options. It doesn't go into too much depth, but it provides a good starting point.

------
lunixbochs
Hi, I'm the dev behind [https://talonvoice.com](https://talonvoice.com)

I've been working with Facebook's wav2letter project, and the results (speed on CPU, command accuracy) are extremely good in my experience. They also hold the "state of the art" for LibriSpeech (a common benchmark) on wer_are_we [1]. Granted, that's with a 2GB model that doesn't run very well on CPU, but I think most of the fully "state of the art" models are computationally expensive and expected to run on GPU. wav2letter has other models that are very fast on CPU and still extremely accurate.

You can run their "Streaming ConvNets" model on CPU to transcribe multiple live audio streams in parallel; see their wav2letter@anywhere post for more info [2].

I am getting very good accuracy with the in-progress model I am training for command recognition (3.7% word error rate on LibriSpeech clean, about 8% WER on LibriSpeech other, 20% WER on Common Voice, 3% WER on "speech commands"). I plan to release it alongside my other models here [5] once I'm done working on it.

There's a simple WER comparison between some of the command engines here [3]. Between this and wer_are_we [1], it should give you a general idea of what to expect when talking about Word Error Rate (WER). (Note: the wav2letter-talonweb entry in [3] is a rather old model I trained, known to have worse accuracy; it's not even the same NN architecture.)
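
(For reference, WER is computed against a reference transcript as

    WER = (S + D + I) / N

where S, D, and I count the substituted, deleted, and inserted words in the hypothesis, and N is the number of words in the reference.)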

----

As far as constraining the vocabulary: you can train a KenLM language model for Kaldi, DeepSpeech, or wav2letter by grabbing KenLM and piping normalized text (probably lowercased, with everything but ASCII letters and quotes removed) into lmplz:

    cat corpus.txt | kenlm/build/bin/lmplz -o 4 > model.arpa

And you can turn it into a compressed binary model for wav2letter like this:

    kenlm/build/bin/build_binary -a 22 -q 8 -b 8 trie model.arpa model.bin
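
For the "normalized" part, a throwaway filter along these lines is enough (Python; the exact character set to keep is a judgment call, and `normalize.py` is just an illustrative name):

    import re
    import sys

    # Lowercase, keep only ASCII letters, digits, apostrophes, and spaces,
    # then squeeze runs of whitespace down to single spaces.
    for line in sys.stdin:
        line = re.sub(r"[^a-z0-9' ]+", " ", line.lower())
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            print(line)

i.e. `cat corpus.txt | python normalize.py | kenlm/build/bin/lmplz -o 4 > model.arpa` for the first step.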
    

There are other options, like using a "strict command grammar", but I don't
have enough context as to how you want to program this to guide you there.

I also have tooling I wrote around wav2letter, such as wav2train [4], which builds wav2letter training and runtime data files for you.

I'm generally happy to talk more and answer any questions.

----

[1] [https://github.com/syhw/wer_are_we](https://github.com/syhw/wer_are_we)

[2] [https://ai.facebook.com/blog/online-speech-recognition-with-...](https://ai.facebook.com/blog/online-speech-recognition-with-wav2letteranywhere/)

[3] [https://github.com/daanzu/kaldi-active-grammar/blob/master/d...](https://github.com/daanzu/kaldi-active-grammar/blob/master/docs/models.md)

[4]
[https://github.com/talonvoice/wav2train](https://github.com/talonvoice/wav2train)

[5] [https://talonvoice.com/research/](https://talonvoice.com/research/)

~~~
rs23296008n1
Looks interesting. I've got a few GPUs I could use if the CPU is too much of a
bottleneck.

We've identified a dictionary of the types of commands and words we use, and we have recordings of all our Amazon and other commands. Training wave files are not an issue.

Have you had any issues with recognising multiple languages?

Thanks!

~~~
lunixbochs
Are you talking about recognizing multiple languages at once (e.g. you don't know which language the user will speak, and you want it to react appropriately to all of them)? You're going to have a harder time doing that with any system, but it is possible. You would likely need to run multiple models and pick between their outputs, or train a specialized separate model that can classify the language, then use it to pick the model.

I haven't personally tested wav2letter with other languages yet. I know zamia-speech trained a German model, and some users have been talking about training for other languages. I've been helping someone who is training several other languages, and they've reported great success as well.

If you want to make a new model from scratch in any language, you'll probably
want a couple hundred hours of transcribed speech for it, but it doesn't need
to be your own speech. Common Voice is a good data source for that.

~~~
rs23296008n1
I'd expected some training to be required for each language separately, so we've got a decent collection of voice examples. Some of us mix languages within a sentence, but that's a bad habit anyway, so it's unsupported. I don't see figuring out the language as being a problem, as my proof-of-concept already handles that well enough, with around 90% accuracy. Once it's selected a language, the appropriate model can be used. To make it fast we'd likely just keep the whole thing in RAM. Might need more, however.

The issue I see with Talon is that it's currently Mac-only. It would however still help one of us who lives on wheels (he's got a 16" MacBook IIRC, and a Mac mini as well). Different set of use cases, so things would be more relaxed.

I see some hints about a Linux version, however. I've got Windows/Linux VMs on the server but no other Macs. GPUs will be installed soon, when I decommission some old gaming rigs.

Plenty to think about.

~~~
lunixbochs
The Talon beta is on Windows/Linux/Mac. I was recommending wav2letter directly instead of Talon specifically because you mentioned thin clients, and I'm not really targeting something like headless Raspberry Pis yet.

I mostly mentioned wav2letter@anywhere because it could handle a bunch of audio streams centrally, so you can stream from 16 Pis to a central box, and it's very accurate.

------
belbob
Maybe [https://jasperproject.github.io/](https://jasperproject.github.io/)?

~~~
stragies
Jasper is another project that uses CMU Sphinx ([https://cmusphinx.github.io/wiki/faq/](https://cmusphinx.github.io/wiki/faq/)) under the hood.

------
microtherion
Apple platforms offer an API (SFSpeechRecognizer) which, for some languages, supports on-device recognition. Trivial to set up, super easy to use, and pretty reasonable accuracy.

Disclaimer: I work for Apple, not directly on this API but on related subjects.

~~~
deca6cda37d0
I guess not always.

“The speech recognition process involves capturing audio of the user’s voice
and sending that data to Apple’s servers for processing. The audio you capture
constitutes sensitive user data, and you must make every effort to protect it.
You must also obtain the user’s permission before sending that data across the
network to Apple’s servers. You request authorization using the APIs of the
Speech framework.“

[https://developer.apple.com/documentation/speech/asking_perm...](https://developer.apple.com/documentation/speech/asking_permission_to_use_speech_recognition)

Which languages are processed on device and not sent to Apple’s servers?

~~~
microtherion
> I guess not always.

Yeah, I suppose I should have formulated that more clearly. The API offers
cloud speech recognition for a set of languages, and on-device speech
recognition for a subset of these.

> Which languages are processed on device and not sent to Apple’s servers?

It's not a static set, because (1) availability tends to expand over time and
(2) when you start using a new language, the on-device model needs to be
downloaded first.

So what you need to do is create a SFSpeechRecognizer and then test the
supportsOnDeviceRecognition property. If that is set, you can set
requiresOnDeviceRecognition on the SFSpeechRecognitionRequest.

------
cosmic_ape
As an aside, it seems you're interested in speech recognition (speech-to-text), not voice recognition. _Voice recognition_ is a different problem, where the particular speaker needs to be recognized from their voice.

~~~
rs23296008n1
I'd primarily like speech-to-text and the ability to know who is speaking, though I have low expectations for speaker identification.

~~~
peglasaurus
If you want a lot of people to use your solution as you described, recognizing who is speaking could add a lot of extra possibilities.

------
vinniejames
I've been waiting for Mycroft to release something new,
[https://mycroft.ai/](https://mycroft.ai/)

Caveat: they keep having delays and may never release v2, imo.

------
thesuperbigfrog
Modern web browsers support the Web Speech API ([https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...](https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API/Using_the_Web_Speech_API)), which may or may not involve a cloud service.

Here is the Google Chrome Web Speech API demo page:
[https://www.google.com/intl/en/chrome/demos/speech.html](https://www.google.com/intl/en/chrome/demos/speech.html)

~~~
detaro
In which browser doesn't it involve a cloud service?

~~~
thesuperbigfrog
Use Firefox.

Google Chrome does use a cloud service.

Edit: Firefox does not support the Web Speech API at this time. There are not
currently any offline versions of this API as far as I can tell.

~~~
detaro
According to MDN, Firefox doesn't support it?

[https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...](https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API#Browser_compatibility)

EDIT: According to other documentation, it is behind a config flag. If you
enable it, it will send the data to Google's API through a Mozilla-operated
proxy: [https://wiki.mozilla.org/index.php?title=Web_Speech_API_-_Sp...](https://wiki.mozilla.org/index.php?title=Web_Speech_API_-_Speech_Recognition&oldid=1220468#Where_does_the_audio_go.3F)

~~~
thesuperbigfrog
You're right.

It looks like there is no offline version :(

------
ParanoidShroom
Android has a local speech recognizer; maybe give that a go? You'll have to make an Android app, though.

------
mirimir
If cloud services are such an issue (as they would be for me), then it's worth considering the security of local logs and the like. Maybe limit their lifetime, or even use FDE.

------
wtvanhest
This feels like an area where a "Dropbox of self-hosted solutions" will emerge.

------
stevewilhelm
Question: what aspect of your product restricts the software architecture from using an "off-site cloud"?

~~~
rs23296008n1
If you've got an Amazon Echo or equivalent, try this experiment: disconnect your internet, then issue a voice command to it. Why is your light not switching on? Why is Alexa not answering your questions?

Now you know why I want no "off-site cloud" for particular things. Not everything. Just what matters. A few things here and there.

There's nothing wrong with the cloud, and nothing stops me from using its benefits, except one minor detail: my use cases are likely different from others'. I wish to keep cloud access where it's genuinely needed, but not much beyond that.

------
gok
Don't bother.

The cloud-based solutions are so vastly superior to the current non-cloud solutions that unless you're something of an expert in ASR, you're just going to get frustrated. If you're worried about privacy, Google lets you pay a little extra to have the audio deleted immediately after you send it to their servers.

~~~
mister_hn
But no one can be sure that Google doesn't keep a copy of the data.

Cloud = giving out your data

~~~
hutzlibu
And some of us still have those odd privacy concerns...

But seriously, it sometimes feels like we are a very small minority.

