Hacker News
Ask HN: Non-cloud voice recognition for home use?
440 points by rs23296008n1 on March 14, 2020 | 127 comments
I'd like home-based voice recognition without an off-site cloud.

I'd like a kind of Echo Dot-like thing running on a set of Raspberry Pi devices, each with a microphone and speaker. Ideally they'd be all over the house. I'm happy if they talk back via wifi to a server in my office for whatever real processing. The server might have 16 cores and 128 GB RAM. I might even have two of these if required.

What options do I have? What limits? I'd really prefer answers from people who have experiences with the various options.

If it helps I'm happy to reduce vocabulary to a dictionary of words as long as I can add more words as necessary. Training is also ok. I've already analysed my voice conversations with an echo dot and the vocabulary isn't that large.

Please remember: home use, no off-site clouds. I'm not interested in options involving even a free voice speech-to-text cloud. This eliminates google voice recognition, amazon etc. They are great but out of scope.

So far I've identified CMU Sphinx as a candidate but I'm sure there are others.


TL;DR: Win 10 IoT for RasPi does it.


Windows 10 IoT for Raspberry Pi comes with offline speech recognition API.

At a hackathon, it was not hard to slap together some code that turns on a light when someone says "banana".

Sounds like exactly what you need.

>If it helps I'm happy to reduce vocabulary to a dictionary of words

You can do it with an XML grammar file for offline recognition[4].
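For a rough idea, a grammar file in the W3C SRGS XML format (which the Windows speech recognizer consumes) constraining recognition to a couple of commands might look like this. An illustrative sketch, not taken from a real project:

```xml
<?xml version="1.0" encoding="utf-8"?>
<grammar version="1.0" xml:lang="en-US" root="lightCommand"
         xmlns="http://www.w3.org/2001/06/grammar">
  <!-- Only "turn on the light" / "turn off the light" can be recognized -->
  <rule id="lightCommand">
    <item>turn</item>
    <one-of>
      <item>on</item>
      <item>off</item>
    </one-of>
    <item>the light</item>
  </rule>
</grammar>
```

Constraining the recognizer to a small grammar like this is a big part of why small-vocabulary offline recognition can be accurate.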



Someone's demo project:



The Microsoft offline speech recognizer is pretty good. I did some work with it many years ago [0]. The only problem we had was with accents: My French co-worker had to use his most obnoxiously over-the-top American accent for reasonable accuracy. ISTR that we could switch to Australian English for the Aussies and Kiwis.

[0] https://github.com/spc-ofp/ObserverLengthSampler

This is really interesting, but I have a few questions:

- The setup guide shows a Windows system creating a Windows IoT image. Can't I just download an ISO and flash it to an SD card with dd? Does it need a license?

- The demo projects show C#, and while I can develop in MonoDevelop, I don't have a Windows machine to compile it on. Is a C# compiler included in Windows IoT's .NET distribution, or are there also cross-platform (interpreted) languages that run on Windows IoT (e.g. Python 3)?

So, recently I played with Windows 10 IoT.

Win10 IoT is written to the SD card and left to first-boot inside the Pi (this bit takes AGES). While it's doing that, you install the Win IoT dashboard toolkit onto your PC (Windows only). The Dashboard will find your Win10 IoT Pi on the network; there are a few demo apps pre-installed you can play with. It's a free OS, but you need to pay for a dev licence if you want it not to reboot every 24 hours. (There's also an on-screen non-production warning.)

Now you fire up Visual Studio, which has gained the ability to build C# apps on ARM. You write a small app, including using the visual form designer, and you debug using the PC-based Win IoT emulator, or deploy it directly (using VS) to the Pi.

Once you are happy with your app, you have to bake it into a new OS image that gets written to the SD card(s) for proper deployment.

Win10 IoT can only run one app in the foreground. It does not have a classic desktop, which should be fine for embedded or kiosk type applications.

Personally, I found it clunky and slow, even on a fast Pi. There are also a fair number of restrictions applied to your app (think the same restrictions as an Android or iOS app), so if you are used to your C# app having full (read) access to the machine it's running on, you won't get that on Win IoT.

If you want to develop GUI rich apps on the Pi, there are far better alternatives (Mono, Python/GTK etc. on Raspbian).

This is just my take on Win10 IoT. I'm a Windows guy by profession so I don't have an anti-MS bias here.

You might be able to use .NET Core which is opensource and can run on Linux/Mac.

Sounds good to me. RasPis are solid performers for us. I'm assuming the XML would need to be updated as the dictionary changes. Sounds easy enough. The loading of languages might get fussy/impossible if I want multiple; a stretch goal is to support multiple languages from the same device.

I'd be hoping I can also load in text-to-speech, either separately or as part of the same application. From what I've read, the Windows approach to the Pi is more like an appliance: your application takes over the whole device. This is fine as long as I can load more functionality into that application.

I need to read more about this.

Thanks for the pointers.

Does the IoT version track everything you do and cram ads down your throat like the regular version of Win 10?

I don't think this question is in good faith, but the answer is no.

I worded it in an adversarial way, but it was a serious question. I'm glad to hear they don't.

Mozilla DeepSpeech trained on the Common Voice dataset for English. You can get pretrained models too. They have a nice matrix channel where you can get help, and pretty good documentation. It is also actively developed by several engineers. http://voice.mozilla.org/en/datasets and http://github.com/mozilla/DeepSpeech/

I had limited luck using the provided language model, but very good results when providing my own. So if you feel the results are poor, try building your own language model.

AFAIK DeepSpeech works by using the neural net to detect characters from speech, and then the language model is used to try to make a sentence out of the character stream, by doing a kind of graph search. Thus if the language model doesn't contain the words you want it to recognize, it'll have a hard time giving good output.

Anyway, I used the following tutorial[1] as a base to build the language model. For the kenlm tools I used Ubuntu WSL, and the generate_trie executable was part of the DeepSpeech native tools package for Windows.

[1]: https://discourse.mozilla.org/t/tutorial-how-i-trained-a-spe...

For a software engineer with no experience in machine learning / AI, what does it mean to build your own language model? Does it require coding? Hundreds of hours of audio data from your own voice? A significant amount of computing power?

The available tools mean you only need to provide a list of normal sentences, which should include the words you'd like it to know about.

For my case I just wanted to train it on about 30 different sentences; that took less than a second. But for a general assistant à la Google Home you'll want a large number of sentences, and I hear it can take a while (an hour or a few?).

Since it uses probabilities, it will match words from sentences other than the ones you give it, but from my understanding it will be partial to the ones you feed it if DeepSpeech misclassifies a character or two.

Here's an example I did using a custom LM with DeepSpeech - the description links back to the forum with the steps for producing it.


This was on a slightly earlier version, and they've made improvements in speed and quality of recognition since then.

> Hundreds of hours of audio data from your own voice?

I should clarify this. As I mentioned, training the neural net part requires tons of audio and the corresponding text (and people should totally contribute[1], the resulting data sets are released to the public). The neural net in DeepSpeech is then used on an audio stream and outputs a stream of characters.

Turning that stream of characters into sentences is what the language model is for.

Training the neural net is very data and compute intensive, but fortunately Mozilla provides pre-trained models.

Generating the language model is relatively cheap. And if your target language shares sounds with English, you may get away with using the English-trained neural net but with a non-English language model.

[1]: https://voice.mozilla.org/

The models (`deepspeech-0.6.1-models.tar.gz`) weigh 1.14 GB, if anyone's interested.

Here is a paper from a German university, where they tried to adapt DeepSpeech to German:


The upshot is that adapting works, but as of the paper's writing some months ago, the results were not very good yet. So other languages need better-trained models; English seems to be quite good.

What surprised me is that it works offline very fast, even on a Raspberry Pi!


Interesting. Are there any projects actively using this today?

I haven't tried it directly myself but there is this project, Dragonfire, which looks quite reasonable using DeepSpeech:


There's a minimal demo app I put together here too: https://github.com/nmstoker/SimpleSpeechLoop

I've been using DeepSpeech to learn to build voice controls for all sorts of things in JavaScript. And I've got a way to connect it to the web, so you'll be able to write speech-recognition-enabled web pages using client-side JavaScript.


Are you searching for a complete solution including NLP and an engine to perform actions? Some of these are already posted, like Home Assistant, and Mycroft.

Sphinx is just for the automatic speech recognition (ASR) part. But there are better solutions for that:

Kaldi (https://kaldi-asr.org/) is probably the most comprehensive ASR solution, which yields very competitive state-of-the-art results.

RASR (https://www-i6.informatik.rwth-aachen.de/rwth-asr/) is for non-commercial use only, but otherwise similar to Kaldi.

If you want to use a simpler ASR system, nowadays end-to-end models perform quite well. There are quite a number of projects which support these:

RETURNN (https://github.com/rwth-i6/returnn) is non-commercial, TF-based. (Disclaimer: I'm one of the main authors.)

Lingvo (https://github.com/tensorflow/lingvo), from Google, TF-based.

ESPnet (https://github.com/espnet/espnet), PyTorch/Chainer.


Already got the action engine: all the lights, HVAC, TV, calculator, computers, etc. are all controllable. None require internet now, or any kind of location services for that matter.

I really just want the speech-to-text. Ideally I'd also like it to recognise who's talking. But that's a bonus.

I'll second the recommendation for Kaldi. It's more complicated to get running than pocketsphinx, but in my experience Kaldi has better accuracy and lower latency in general cases (assuming the caveats below).

https://github.com/gooofy/zamia-speech/ has been training good [acoustic] models which are worth looking at (including training with robustness against noise). They've also got lots of code and docker images and documentation.

pocketsphinx isn't actually that bad to use with their latest acoustic models and small vocabularies (so its utility depends on your exact use case). But it's not generally good with far-field mics/DSP-processed audio, not really good with noise, and in my experiments not quite as fast as Kaldi.

Better/larger language models in my experience make a world of difference (esp. in the general-vocab case) for improving accuracy with either Kaldi or pocketsphinx. Nobody really seems to talk about this(?), since everyone always uses the news corpus from like the 80s as the default language model.

I haven't really ever gotten the various ~deepspeech systems working, so I can't speak to them.

I'm happy to feed it plenty of voice logs as well as a training corpus as necessary. Sounds like an interesting journey.

Would love to know more about this and how you've done it

I develop Kaldi Active Grammar [1], which is mainly intended for use with strict command grammars. Compared to normal language models, these can provide much better accuracy, assuming you can describe (and speak) your command structure exactly. (This is probably more acceptable for a voice assistant aimed at a more technical audience.) The grammar can be specified by an FST, or you can use KaldiAG through Dragonfly, which allows you to specify the grammars (and their resultant actions) in Python. However, KaldiAG can also do simple plain dictation if you want.

KaldiAG has an English model available, but other models could be trained. Although you can't just drop in and use a standard Kaldi model with KaldiAG, the modifications required are fairly minimal and don't require any training or modification of its acoustic model. All recognition is performed locally and offline by default, but you can also selectively choose to do some recognition in the cloud.

Kaldi generally performs at the state of the art. As a hybrid engine, although training can be more complicated, it generally requires far less training data to achieve high accuracy, compared to "end to end" engines.

[1] https://github.com/daanzu/kaldi-active-grammar

Too late to edit, but I should probably have noted that KaldiAG also would make it easy to define "contexts" when (groups of) commands are active for recognition. For example, if the TV is on, you could have commands for adjusting the volume/etc. But if it is off, those commands are disabled, so they can't be recognized, and further, the engine knows this and can therefore better recognize the other commands that remain active.

Could Home Assistant use such commands by running it in a Docker container?

Also, the video demo is rather impressive in how accurately (and predictably) it recognises.

I don't know much about Home Assistant, but that certainly should be possible to set up. The KaldiAG API is pretty low level, but basically: you define a set of rules, and send in audio data, along with a bit mask of which rules are active at the beginning of each utterance, and receive back the recognized rule and text. The easy solution is probably to go through Dragonfly, which makes it easy to define the rules, contexts, and actions. It might be a little hacky to do, but you should be able to wire it up with Home Assistant somehow.

Although I mainly use it for computer control as demonstrated in the video, I do have many commands akin to home automation, like adjusting the lights, HVAC, etc.
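The active-rules bitmask described above can be sketched in a few lines. This is a toy illustration with made-up rule names, not the real KaldiAG API:

```python
# Toy sketch of context-dependent command rules: each rule has a bit in a
# mask, and only rules whose bit is set are eligible for recognition.
# Names and matching logic are illustrative, not KaldiAG's actual code.

rules = ["volume_up", "volume_down", "lights_on", "lights_off"]

def recognize(text, active_mask):
    """Return the first active rule whose phrase appears in the utterance."""
    for i, rule in enumerate(rules):
        if active_mask & (1 << i) and rule.replace("_", " ") in text:
            return rule
    return None

# When the TV is on, only the volume rules are active
tv_on_mask = 0b0011
print(recognize("volume up please", tv_on_mask))  # prints "volume_up"
print(recognize("lights on", tv_on_mask))         # prints "None" (rule disabled)
```

Disabling inactive rules shrinks the search space, which is why the engine can then recognize the remaining active commands more reliably.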

Disclaimer: I am the founder of Hiome, a smart home startup focused on private-by-design, local-only products.

What actions are you looking to handle with the assistant?

Reason I ask is because a voice assistant is a command line interface with no auto-complete or visual feedback. It doesn’t scale well as you add more devices or commands to your home, because it becomes impossible to remember all the phrases you programmed. We’ve found the person who sets up the voice assistant will use it for simple tasks like “turn off all lights” but nobody else benefits and it gets little use beyond timers and music. They are certainly nice to have, but they don’t significantly improve the smart home experience.

If you’re looking to control individual devices, I suggest taking a look at actual occupancy sensors like Hiome (https://hiome.com), which can let you automate your home with zero interaction so it just works for everyone without learning anything (like in a sci-fi movie). Even if you’re the only user, it’s much nicer to never think about your devices again.

Happy to answer any questions about Hiome or what we’ve learned helping people with smart homes in general! -> neil@hiome.com

Sure. We've got a house with multiple buildings, including sheds, halls etc.

Around 100 people need separate profiles; each should be able to set alarms, timers, reminders, etc. If they want a routine that creates any of those, or tells them the time, date or temperature, they should be able to do that from any of the voice assistants in any room. They might only want such a routine in a particular room. They should be able to define a home device and a current device. The home device would usually be a bedroom, for those of us that need them, etc.

I definitely don't want to have to create any of those routines etc. for any of them. Nothing about these should be fixed in stone. They have to be able to self-serve. We can assume they can navigate the iOS Amazon app as a baseline level of knowledge.

Room settings include temperature, lighting, curtains, TV on/off, channel, and volume, to name a few. The voice assistant in some rooms should be able to show web pages on-screen, or even YouTube etc., including the laptop someone plugged in on HDMI1.

...the coffee machine automation is also a requirement. It's controlled by a Flask app. The voice control should be able to let you order a coffee, strong, black. Or a Dave#5.

We'd also like device detection to trigger when people's phones appear in certain locations.

What kinds of options exist for this?

That's definitely a new situation we haven't seen before!

Are the 100 people using all of the different rooms, or do people mostly stick to their own rooms (like a hotel/dorm)?

I'd love to see what you've built so far, and better understand the problems you're trying to fix. For example, does each room have its own coffee machine, or is it a communal coffee machine? Are the people living here permanently or rotating regularly? What is the goal for device detection (e.g., do you want to use that for presence detection or as a security system or something else)?

We have a prototype for a machine learning system that learns how you use your devices and then automates them by itself, so you don't have to set anything up. Our focus is lights because that's what most people have, but it can also control other on/off things like curtains or TVs right now. It sounds like it could be a good fit for a situation like this, and I'd be happy to chat with you more about whether it makes sense to try out!

Can you send me an email (neil@hiome.com)?

Have you considered the Almond integration for Home Assistant? (https://www.home-assistant.io/integrations/almond/)

Alternatively, you could just fork the Almond project directly and take it from there: https://github.com/stanford-oval/almond-cloud

No, I hadn't. This was why I asked: I know there are alternatives out there, but there's so much noise.


Almond has been posted on HN previously[1]. User voltagex_ commented[2] that self-hosting, while possible, is not recommended in the official installation instructions[3], as it is considered significantly more challenging to manage. This may or may not affect your decision to go forward with Almond.

[1] https://news.ycombinator.com/item?id=17532003

[2] https://news.ycombinator.com/item?id=17534793

[3] https://github.com/stanford-oval/almond-cloud/blob/master/do...

Is there any product where self-hosting _isn't_ more difficult? That seems like a generic warning that could apply to pretty much any product in this space. It seems more like a warning to non-technical users who might not have the experience or know-how to successfully set up a server.

If it were simply an issue of being "more difficult", then it wouldn't be worth pointing out.

However, the words "significantly more challenging to manage" are straight from their documentation that I linked which I think makes it worth pointing out.

Whether or not it is too challenging is for each individual to decide for themselves.

It also gives an indication of what the preferred method of deployment is (from the authors' point of view). In this case I read it as a warning that it might stop being supported in the future.

Pertinent quote from [3] regarding "more challenging":

"You must also deploy custom NLP models, as the official ones will not be compatible. Use this setup only if you absolutely need custom Thingpedia interfaces, and cannot provide these interfaces on Thingpedia."

If you don't mind getting your hands dirty a bit, I think Nvidia's model [Jasper](https://arxiv.org/pdf/1904.03288.pdf) is near SOTA, and they have [pretrained models](https://ngc.nvidia.com/catalog/models/nvidia:jaspernet10x5dr) and [tutorials / scripts](https://nvidia.github.io/NeMo/asr/tutorial.html) freely available. The first is in their library "nemo", but they also have it available in [vanilla Pytorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/P...) as well.

Do you have any experience/opinions on those?

You are welcome to try Vosk


Advantages are:

1) Supports 7 languages - English, German, French, Spanish, Portuguese, Chinese, Russian

2) Works offline even on lightweight devices - Raspberry Pi, Android, iOS

3) Install it with simple `pip install vosk`

4) Model size per language is just 50 MB

5) Provides a streaming API for the best user experience (unlike the popular speech_recognition Python package)

6) There are APIs for different languages too - java/csharp etc.

7) Allows quick reconfiguration of vocabulary for best accuracy.

8) Supports speaker identification besides simple speech recognition

Sounds great, thanks


Haven't used it, but seems very nice.


Rhasspy author here in case you have any questions :)

If you're looking for something for the command-line, check out https://voice2json.org

I think I'll dig in first. Cheers.

It can use https://github.com/cmusphinx/pocketsphinx (offline), Kaldi (offline), or Google (online) for the speech-to-text part.

The description from a superficial read looks good. Thanks!

I’m currently assembling an offline home assistant setup using Node-RED and voice2json, all running on Raspberry Pis:



Requires a little customization and/or coding, but it’s quite elegant, and all voice recognition happens on-device. Part of what makes the recognition much more accurate (subjectively, 99%ish) is the constrained vocabulary; the grammars are compiled from a simple user-defined markup language, and then parsed into JSON intents, containing both the full text string and appropriate keywords/variables split out into slots.
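For a flavor of that markup, a sentences file might look something like this. A sketch based on my reading of voice2json's template format; the names and exact syntax are illustrative, so check the docs:

```ini
[LightState]
light_name = (living room lamp | kitchen light){name}
turn (on | off){state} [the] <light_name>
```

Saying "turn on the kitchen light" would then come back as a `LightState` JSON intent with the `state` and `name` slots filled in, ready to route in Node-RED.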

Just finished a similar rig in my car, acting as a voice-controlled MP3 player, with thousands of artists and albums compiled into intents from the iTunes XML database. Works great, and it feels awesome to have a little 3-watt baby computer doing a job normally delegated to massive corporate server farms. ;)

Hi Lukifer, thanks for chiming in! I had a setup using Snips that I'm looking to replace. Please do document your setup and your little helper scripts in a blog post or such, and ping me/us :)

I was not able to find the same article online, but Volume 72 of Make Magazine has a great overview of different non-cloud voice recognition platforms. Here is a preview: https://www.mydigitalpublication.com/publication/?m=38377&i=...

Today they published an online version which covers many of the platforms listed in these comments:


important question

I think there's a group of highly technical people who feel increasingly left behind by 'convenience tech' because of what they have to give up in order to use it

Well, I've got all the gadgets controllable now without internet as a requirement. Only the voice part requires it now, for us at least. Google Home/Amazon Echo devices and phone apps can communicate with the house and surrounds without issue.

Loss of internet access is not an excuse for ignoring basic voice commands in my opinion.

Privacy is also an important factor but not the primary driver for us.

I've read good things about Mycroft [1], though I haven't tried it myself. It ticks all the boxes.

[1] https://mycroft.ai/

Mycroft can only kinda sorta be run offline according to the project/company: https://mycroft-ai.gitbook.io/docs/about-mycroft-ai/faq

I wish you luck with this, and more importantly, hope that it inspires many people to start building similar projects.

I know virtually nothing about voice recognition, but my spidey sense tells me that it should be possible with the hardware you specify.

A Commodore 64 with a Covox VoiceMaster could recognize voice commands and trigger X-10 switches around a house. (Usually. My setup had about a 70% success rate, but pretty good for the time!) Surely a 16 core, 128GB RAM machine should be able to do far more.

It's beginning to take shape. I've already got a bunch of good candidates for experimentation next week based on the answers so far.

My company develops SDKs for on-device speech recognition on Android/iOS: https://keenresearch.com/keenasr-docs (Raspberry Pi is an option too; we'll have a GA release in Q2)

We license this on a commercial basis but would be open to indie-developer-friendly licensing. We offer a trial SDK that makes testing/evaluation super easy (it works for 15 minutes at a time).



> Currently, the SDK supports English and Spanish out of the box. Additional ASR Bundles for most major spoken languages can be provided upon request within 6-8 weeks.

Does this mean you have a standing offer to train a new language on demand?

Yes, for a number of European and some Asian languages we can do this; it's mainly a question of business opportunity.

That's very cool. Did you somehow solve auto-generating a training corpus for any new language, as long as it's popular enough? Based on my impressions from working with other people in the space, coming up with the training data seems like the big bottleneck, as the available engines are already pretty good at learning new languages.

We have relevant training data for a number of languages.

I've had some good experience with https://snips.ai . Works as advertised, easy to implement. The hardest thing was getting the microphone and the Pi to get along.

Did you visit the website lately? Doesn't seem to be an option anymore :-/

Their code is open source: https://github.com/snipsco

Though there are about 100 repositories there. I am not sure if it is easy to put it all together.

They only seem to do audio equipment. Did they once do something more general?

They were acquired by Sonos. A few months ago they were offering tools to build your own speech recognition agents.

Before they were bought, you could build (with the help of a WebApp hosted by them) an offline-usable speech recognition module that could comfortably run on a Pi and would output parsed sentences in JSON format onto MQTT. Easy to integrate with everything in IoT. I loved it. Now I'm also looking for an alternative for the speech-to-text(-to-JSON) part, like you.

That sucks.

Like Pebble.

I was one of the maintainers of the Persephone project, an automated phonetic transcription tool. It came about from a research project that required a non-cloud solution. The project is open source and can be found on GitHub:


This may be a little too low-level for what you want, as there's no language model, but maybe it's helpful as part of your system.

I was in the process of planning my multi-room voice-AI setup based on SnipsAI (to be integrated with Home Assistant) when it was announced they were bought by Sonos, which killed their open source project. Since then I have been left trying various projects that meet my needs.

Among those, I tried Mycroft, which still requires a cloud account to configure various things, and it doesn't support a multi-room setup at this time.

I've since switched to Rhasspy, which offers a larger array of config options and engines, and also multi-room (I'm yet to config multi-room tho)

In the long-term I plan to "train" the voice-AI for various additions, including a custom wake word - No, I'm not calling it `Jarvis` ;)

I'm running each of these voice-AI's on a Raspberry Pi 4 (4GB model), though I'm considering switching them to Pi 3's. I'm using the `ReSpeaker 2mic Pi-Hat` on each pi for the mic input. I'm planning to configure all the satellite nodes (voice-AI in each room) to PXE boot, that way they don't require an sd-card and I can easily update their images/configs from a central location.

Is it possible to make your config available, e.g., GitHub?

I'm just starting to get going with Rhasspy, integrating with Home Assistant, and the docs miss just enough that I hit walls every time I try.

Thanks for the info you've already provided though, sounds like I want exactly what you do.

Google has papers on on-device speech recognition; these are used in the keyboard and for Live Caption on Pixel devices.

This article from the Google AI blog about the Gboard speech recognition is really interesting: https://ai.googleblog.com/2019/03/an-all-neural-on-device-sp...

They are trained on a ton of non-public data though, and I’m not sure if pre-trained models are around.

Nope, they aren't available. Closed-captioned YouTube videos or radio broadcasts + transcripts could prove helpful for multiple languages, as well as for creating a multilingual ASR.

I tried to use Julius for this. I may have misconfigured it, but it would always match something to what it was hearing. I encoded some sounds in my grammar as error terms that it would detect in quiet noise (like 'aa' and 'hh'), but it would still occasionally match words when nothing was going on.

Later I worked on the Microsoft Kinect with its 4-microphone array. With only a single microphone, it's so much harder to filter out background noise. If you don't find a system based on multiple microphones, I don't believe you can be successful if there's any ongoing noise (dishwasher, loud fans, etc.), but a system that works only in quiet conditions is possible.
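To illustrate why a mic array helps: a classic trick is delay-and-sum beamforming, where each mic's signal is shifted by its known arrival delay and averaged, so sound from the target direction adds up coherently while uncorrelated noise partially cancels. A toy sketch of the idea (illustrative only, not the Kinect's actual DSP):

```python
# Toy delay-and-sum beamformer: align each mic's samples by its known
# per-mic arrival delay (in samples), then average across mics.

def delay_and_sum(signals, delays):
    """signals: equal-length sample lists, one per mic; delays: samples."""
    n = len(signals[0])
    out = []
    for t in range(n):
        acc = 0.0
        for sig, d in zip(signals, delays):
            if t + d < n:          # skip samples shifted past the buffer end
                acc += sig[t + d]
        out.append(acc / len(signals))
    return out

# mic 1 hears the same pulse one sample after mic 0
mic0 = [0.0, 1.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 1.0, 0.0]
aligned = delay_and_sum([mic0, mic1], delays=[0, 1])
print(aligned)  # the pulse adds coherently at t=1: [0.0, 1.0, 0.0, 0.0]
```

Real arrays estimate the delays from the sound's direction of arrival, which is also how they steer toward a speaker and away from the dishwasher.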

HomeSeer automation software has this built in, with client listening apps for different platforms. I haven't used the voice recognition beyond testing, but I've been very happy with the software overall. It's relatively expensive, but goes on sale for about half price once or twice a year. There's a free 30-day trial.

I think there are two ID phrases per sub-device by default, but using virtual devices vastly expands the software's capability, especially for mapping virtual switches to multiple devices.

They also have Z-Wave devices that are, for the most part, much better than most.


>I'm happy if they talk back via wifi to a server in my office for whatever real processing. The server might have 16 cores and 128Gb ram.

Pretty sure Mycroft is capable of that, in theory; you'll need to configure it manually. The standard Raspberry Pi route isn't powerful enough for local processing.

Check out ReSpeaker for a Raspberry Pi microphone. You'll want one of the more expensive ones for range, though at like 40 bucks they're not that wildly expensive.

Make sure it's a Raspberry Pi 4, since the wake word is processed locally. And you probably don't need 128 GB of RAM. No idea what they use, but I doubt it's that much.

As discussed elsewhere in this thread by others, Mycroft can't do offline processing, according to their FAQ at least.

128 GB is the minimum I use for general-purpose servers, so this machine would be a repurposed machine rather than something specially ordered or built.

>mycroft can't do offline processing, according to their faq at least.

Not sure what FAQ you're reading there. There is a whole section of STT engines you can plug into it.


Including a local one. It's gonna be a pain in the ass, but as I said, it's possible.

Precise [1], Snowboy [2], and Porcupine [3] are all designed to work offline.

[1] https://github.com/MycroftAI/mycroft-precise

[2] https://github.com/kitt-ai/snowboy

[3] https://github.com/Picovoice/porcupine

All good projects (I’m using Snowboy), but these are all just wake-word engines, which are very different beasts from full speech-to-text.

Is there a voice dialer for Android that doesn't use Google? There used to be, but it disappeared in the "upgrade" which has the mothership listening all the time.

The most recent edition of Make magazine had a pretty good overview of some different options. Doesn't go too much into depth but provides a good starting point.

Hi, I'm the dev behind https://talonvoice.com

I've been working with Facebook's wav2letter project and the results (speed on CPU, command accuracy) are extremely good in my experience. They also hold the "state of the art" for librispeech (a common benchmark) on wer_are_we [1]. Granted, that's with a 2GB model that doesn't run very well on CPU, but I think most of the fully "state of the art" models are computationally expensive and expected to run on GPU. Wav2letter has other models that are very fast on CPU and still extremely accurate.

You can run their "Streaming ConvNets" model on CPU to transcribe multiple live audio streams in parallel; see their wav2letter@anywhere post for more info [2].

I am getting very good accuracy on the in-progress model I am training for command recognition (3.7% word error rate on librispeech clean, about 8% WER on librispeech other, 20% WER on common voice, 3% WER on "speech commands"). I plan to release it alongside my other models here [5] once I'm done working on it.

There's a simple WER comparison between some of the command engines here [3]. Between this and wer_are_we [1], it should give you a general idea of what to expect when talking about Word Error Rate (WER). (Note the wav2letter-talonweb entry in [3] is a rather old model I trained, known to have worse accuracy; it's not even the same NN architecture.)
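For anyone unfamiliar with the metric: WER is just word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference transcript. A minimal Python sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, rolling a single row.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion (ref word dropped)
                      d[j - 1] + 1,      # insertion (extra hyp word)
                      prev + (r != h))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```

So a 4-word command with one substituted word scores 25% WER, which is why small command vocabularies report much lower error rates than open dictation.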


As far as constraining the vocabulary, you can train a KenLM language model for Kaldi, DeepSpeech, or wav2letter by grabbing KenLM and piping normalized text (probably lowercased, with everything but ASCII letters and quotes removed) into lmplz:

    cat corpus.txt | kenlm/build/bin/lmplz -o 4 > model.arpa
And you can turn it into a compressed binary model for wav2letter like this:

    kenlm/build/bin/build_binary -a 22 -q 8 -b 8 trie model.arpa model.bin
There are other options, like using a "strict command grammar", but I don't have enough context as to how you want to program this to guide you there.
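To give a rough idea of what a strict command grammar looks like in practice: you validate the recognizer's output against a fixed set of phrase templates and reject everything else. The commands and names below are made up for illustration.

```python
import re

# Hypothetical command grammar: each pattern is one accepted phrase shape.
COMMAND_GRAMMAR = [
    re.compile(r"^turn (on|off) the (kitchen|office|bedroom) light$"),
    re.compile(r"^set temperature to (\d{1,2}) degrees$"),
]

def parse_command(transcript: str):
    """Return the captured slots if the transcript fits the grammar, else None."""
    text = transcript.strip().lower()
    for pattern in COMMAND_GRAMMAR:
        m = pattern.match(text)
        if m:
            return m.groups()
    return None
```

The upside is that anything outside the grammar (TV chatter, misrecognitions) is simply dropped instead of triggering a wrong action; the downside is you have to enumerate every phrasing you want to accept.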

I also have tooling I wrote around wav2letter, such as wav2train [4] which builds wav2letter training and runtime data files for you.

I'm generally happy to talk more and answer any questions.


[1] https://github.com/syhw/wer_are_we

[2] https://ai.facebook.com/blog/online-speech-recognition-with-...

[3] https://github.com/daanzu/kaldi-active-grammar/blob/master/d...

[4] https://github.com/talonvoice/wav2train

[5] https://talonvoice.com/research/

Looks interesting. I've got a few GPUs I could use if the CPU is too much of a bottleneck.

We've identified a dictionary of the types of commands and words we use and have a recording of all our amazon and other commands. Training wave files are not an issue.

Have you had any issues with recognising multiple languages?


Are you talking about recognizing multiple languages at once (e.g. you don't know which language the user will speak and you want it to react appropriately to all of them)? You're going to have a harder time doing that with any system, but it is possible. You would likely need to run multiple models and pick between their output, or train a specialized separate model that can classify the language, then use it to pick the model.
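The "run multiple models and pick between their output" approach can be sketched very simply; the per-language recognizers below are stand-ins (a real setup would call, e.g., one wav2letter decoder per language and compare normalized confidence scores, which are not directly comparable across models without calibration).

```python
# Sketch: run one recognizer per candidate language, keep the most
# confident result. Each recognizer is a fn(audio) -> (transcript, score).
def pick_language(audio, recognizers):
    results = {lang: rec(audio) for lang, rec in recognizers.items()}
    best_lang = max(results, key=lambda lang: results[lang][1])
    return best_lang, results[best_lang][0]
```

This is the brute-force option; a dedicated language-ID classifier in front is cheaper, since you only run one full decoder per utterance instead of all of them.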

I haven't personally tested wav2letter with other languages yet. I know zamia-speech trained a german model, and some users have been talking about training for other languages. I've been helping someone who is training several other languages and they've reported great success as well.

If you want to make a new model from scratch in any language, you'll probably want a couple hundred hours of transcribed speech for it, but it doesn't need to be your own speech. Common Voice is a good data source for that.

I'd expected some training to be required for each language separately, so we've got a decent collection of voice examples. Some of us mix languages within a sentence, but that's a bad habit anyway, so unsupported. I don't see figuring out the language as being a problem, as my proof-of-concept already handles that well enough with around 90% accuracy. Once it's selected a language, the appropriate model can be used. To make it fast we'd likely just keep the whole thing in RAM. Might need more, however.

The issue I see with Talon is it's currently Mac-only. That would still help one of us who lives on wheels, though (got a 16" MacBook IIRC and a Mac mini as well). Different set of use-cases, so things would be more relaxed.

I see some hints about a Linux version, however. I've got Windows/Linux VMs on the server but no other Macs. GPUs will be installed soon when I decommission some old gaming rigs.

Plenty to think about.

The Talon beta is on Windows/Linux/Mac. I was recommending wav2letter directly instead of Talon specifically because you mentioned thin clients, and I'm not really targeting something like headless Raspberry Pis yet.

I mostly mentioned wav2letter@anywhere because it could handle a bunch of audio streams centrally, so you can stream from 16 pis to a central box, and it's very accurate.

My "workaround" for using offline recognition in several languages in `snips.ai` was configuring a different wake-word per language, and then running several wake-word-detectors on the same microphone input.
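The routing logic for that workaround is tiny: each wake-word detector that fires selects the language model used to decode the rest of the utterance. The wake words and languages below are illustrative, not snips.ai's actual configuration.

```python
# Sketch: several wake-word detectors share one microphone stream;
# whichever one fires picks the language for the follow-on decoding.
WAKE_WORD_LANGUAGE = {
    "computer": "en",
    "ordinateur": "fr",
    "rechner": "de",
}

def route_utterance(detected_wake_word: str) -> str:
    """Map the fired wake word to a language, defaulting to English."""
    return WAKE_WORD_LANGUAGE.get(detected_wake_word.lower(), "en")
```

The nice property is that language ID piggybacks on the wake-word stage you already need, so there's no extra model to run.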

Exactly how mine works. Each language uses a different trigger. Simple.

Even then it can misinterpret, so there's plenty of room for improvement. My quick POC is around 90% accurate for language detection based on the trigger word.

Are you willing to list the languages you'd like to recognize?

Jasper is another project that uses CMU Sphinx (https://cmusphinx.github.io/wiki/faq/) under the hood.

Apple platforms offer an API (SFSpeechRecognizer) which for some languages supports on-device recognition. Trivial to set up, super easy to use, and pretty reasonable accuracy.

Disclaimer: Working for Apple, not directly on this API but in related subjects.

I guess not always.

"The speech recognition process involves capturing audio of the user's voice and sending that data to Apple's servers for processing. The audio you capture constitutes sensitive user data, and you must make every effort to protect it. You must also obtain the user's permission before sending that data across the network to Apple's servers. You request authorization using the APIs of the Speech framework."


Which languages are processed on device and not sent to Apple's servers?

> I guess not always.

Yeah, I suppose I should have formulated that more clearly. The API offers cloud speech recognition for a set of languages, and on-device speech recognition for a subset of these.

> Which languages are processed on device and not sent to Apple's servers?

It's not a static set, because (1) availability tends to expand over time and (2) when you start using a new language, the on-device model needs to be downloaded first.

So what you need to do is create a SFSpeechRecognizer and then test the supportsOnDeviceRecognition property. If that is set, you can set requiresOnDeviceRecognition on the SFSpeechRecognitionRequest.

As an aside, it seems you're interested in speech recognition, or speech to text, not voice recognition. Voice recognition is a different problem, where the particular speaker needs to be recognized from voice.

I used to work in the speech reco space for many years. This battle was lost long, long ago. Virtually no one outside of the space really cares, and people use speech and voice recognition interchangeably. It's really more the case that voice recognition is just an ambiguous term.

It’s pretty hard to blame lay people when speech reco products like Dragon are widely marketed as “voice recognition”.

If you want to be clear and not just pedantic just call it speaker recognition.

I'd primarily like speech-to-text and the ability to know who is speaking. I have low expectations of the identification of speaker however.

If you're wanting a lot of people to use your solution as you described, recognition of who is speaking could add a lot of extra possibilities.

I've been waiting for Mycroft to release something new, https://mycroft.ai/

Caveat: they keep having delays and may never release v2, IMO.

Some modern web browsers support the Web Speech API (https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...), which may or may not involve a cloud service.

Here is the Google Chrome Web Speech API demo page: https://www.google.com/intl/en/chrome/demos/speech.html

In which browser doesn't it involve a cloud service?

Use Firefox.

Google Chrome does use a cloud service.

Edit: Firefox does not support the Web Speech API at this time. There are not currently any offline versions of this API as far as I can tell.

According to MDN, Firefox doesn't support it?


EDIT: According to other documentation, it is behind a config flag. If you enable it, it will send the data to Google's API through a Mozilla-operated proxy: https://wiki.mozilla.org/index.php?title=Web_Speech_API_-_Sp...

You're right.

It looks like there is no offline version :(

Android has a local speech recognizer; maybe give that a go? You'll have to make an Android app, though.

If cloud services are such an issue (as they would be for me), then it's worth considering the security of local logs and recordings as well. Maybe limit their lifetime, or even use full-disk encryption (FDE).

This feels like an area where a Dropbox-style self-hosted solution will emerge.

Question: what aspect of your product restricts the software architecture from using an "off-site cloud"?

If you've got an Amazon Echo or equivalent, try this experiment: disconnect your internet. Now issue a voice command to it. Why is your light not switching on? Why is Alexa not answering your questions?

Now you know why I don't want an "off-site cloud" involved for particular things. Not everything. Just what matters. A few things here and there.

There's nothing wrong with the cloud, and nothing stops me from using its benefits, except one minor detail: my use-cases are likely different from others'. I wish to keep cloud access where it's needed, but keep just a little bit more under my own control.

Don't bother.

The cloud based solutions are so vastly superior to the current non-cloud solutions that unless you're something of an expert in ASR you're just going to get frustrated. If you're worried about privacy, Google lets you pay a little extra to immediately delete the audio after you send it to their servers.

I think I will bother; the use cases are valid and worthwhile to pursue for multiple reasons, many of which aren't even on my radar. Bonus: I can have my cake AND eat it too.

But no one can be sure that Google doesn't keep a copy of the data.

Cloud = giving out your data

And some of us still have those odd privacy concerns...

But seriously, it sometimes feels like we are a very small minority.

The poster specifically asked for non-cloud.
