Hacker News
Ask HN: Non-cloud voice recognition for home use?
440 points by rs23296008n1 on March 14, 2020 | 127 comments
I'd like home-based voice recognition without an off-site cloud.

I'd like a kind of Echo Dot-like thing running on a set of Raspberry Pi devices, each with a microphone and speaker. Ideally they'd be all over the house. I'm happy if they talk back via wifi to a server in my office for whatever real processing. The server might have 16 cores and 128 GB RAM. I might even have two of these if required.

What options do I have? What limits? I'd really prefer answers from people who have experiences with the various options.

If it helps I'm happy to reduce vocabulary to a dictionary of words as long as I can add more words as necessary. Training is also ok. I've already analysed my voice conversations with an echo dot and the vocabulary isn't that large.

Please remember: home use, no off-site clouds. I'm not interested in options involving even a free voice speech-to-text cloud. This eliminates google voice recognition, amazon etc. They are great but out of scope.

So far I've identified CMU Sphinx as a candidate but I'm sure there are others.


TL;DR: Win 10 IoT for RasPi does it.


Windows 10 IoT for Raspberry Pi comes with offline speech recognition API.

At a hackathon, it was not hard to slap together some code that turns on a light when someone says "banana".

Sounds like exactly what you need.

>If it helps I'm happy to reduce vocabulary to a dictionary of words

You can do it with an XML grammar file for offline recognition[4].
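For a rough idea, a grammar file in the W3C SRGS XML format (which the Windows speech recognizer consumes) constraining recognition to a couple of commands might look like this. An illustrative sketch, not taken from a real project:

```xml
<?xml version="1.0" encoding="utf-8"?>
<grammar version="1.0" xml:lang="en-US" root="lightCommand"
         xmlns="http://www.w3.org/2001/06/grammar">
  <!-- Only "turn on the light" / "turn off the light" can be recognized -->
  <rule id="lightCommand">
    <item>turn</item>
    <one-of>
      <item>on</item>
      <item>off</item>
    </one-of>
    <item>the light</item>
  </rule>
</grammar>
```

Constraining the recognizer to a small grammar like this is a big part of why small-vocabulary offline recognition can be accurate.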



Someone's demo project:



The Microsoft offline speech recognizer is pretty good. I did some work with it many years ago [0]. The only problem we had was with accents: My French co-worker had to use his most obnoxiously over-the-top American accent for reasonable accuracy. ISTR that we could switch to Australian English for the Aussies and Kiwis.

[0] https://github.com/spc-ofp/ObserverLengthSampler

This is really interesting, but I have a few questions:

- The setup guide shows a Windows system creating a Windows IoT image. Can't I just download an ISO and flash it to an SD card with dd? Does it need a license?

- The demo projects show C#, and while I can develop in MonoDevelop, I don't have a Windows machine to compile it on. Is a C# compiler included in Windows IoT's .NET distribution, or are there also cross-platform (interpreted) languages that run on Windows IoT (e.g. Python 3)?

So, recently I played with Windows 10 IoT.

Win10 IoT is written to the SD card and left to first-boot inside the Pi (this bit takes AGES). While it's doing that, you install the Win IoT dashboard toolkit onto your PC (Windows only). The Dashboard will find your Win10 IoT Pi on the network; there are a few demo apps pre-installed you can play with. It's a free OS, but you need to pay for a dev licence if you want it not to reboot every 24 hours. (There's also an on-screen non-production warning.)

Now you fire up Visual Studio, which has gained the ability to build C# apps on ARM. You write a small app, including using the visual form designer, and you debug using the PC-based Win IoT emulator, or deploy it directly (using VS) to the Pi.

Once you are happy with your app, you have to bake it into a new OS image that gets written to the SD card(s) for proper deployment.

Win10 IoT can only run one app in the foreground. It does not have a classic desktop, which should be fine for embedded or kiosk type applications.

Personally, I found it clunky and slow, even on a fast Pi. There are also a fair number of restrictions applied to your app (think the same restrictions as an Android or iOS app), so if you are used to your C# app having full (read) access to the machine it's running on, you won't get that on Win IoT.

If you want to develop GUI rich apps on the Pi, there are far better alternatives (Mono, Python/GTK etc. on Raspbian).

This is just my take on Win10 IoT. I'm a Windows guy by profession so I don't have an anti-MS bias here.

You might be able to use .NET Core which is opensource and can run on Linux/Mac.

Sounds good to me. RasPis are solid performers for us. I'm assuming the XML would need to be updated as the dictionary changes. Sounds easy enough. The loading of languages might get fussy/impossible if I want multiple; a stretch goal is to support multiple languages from the same device.

I'd be hoping I can also load in text-to-speech, either separately or as part of the same application. From what I've read, the Windows approach to the Pi is more like an appliance: your application takes over the whole device. This is fine as long as I can load more functionality into that application.

I need to read more about this.

Thanks for the pointers.

Does the IoT version track everything you do and cram ads down your throat like the regular version of Win 10?

I don't think this question is in good faith, but the answer is no.

I worded it in an adversarial way, but it was a serious question. I'm glad to hear they don't.

Mozilla DeepSpeech trained on the Common Voice dataset for English. You can get pretrained models too. They have a nice matrix channel where you can get help, and pretty good documentation. It is also actively developed by several engineers. http://voice.mozilla.org/en/datasets and http://github.com/mozilla/DeepSpeech/

I had limited luck using the provided language model, but very good results when providing my own. So if you feel the results are poor, try building your own language model.

AFAIK DeepSpeech works by using the neural net to detect characters from speech, and then the language model is used to try to make a sentence out of the character stream, by doing a kind of graph search. Thus if the language model doesn't contain the words you want it to recognize, it'll have a hard time giving good output.

Anyway, I used the following tutorial[1] as a base to build the language model. For the kenlm tools I used Ubuntu WSL, and the generate_trie executable was part of the DeepSpeech native tools package for Windows.

[1]: https://discourse.mozilla.org/t/tutorial-how-i-trained-a-spe...

For a software engineer with no experience in machine learning / AI, what does it mean to build your own language model? Does it require coding? Hundreds of hours of audio data from your own voice? A significant amount of computing power?

The available tools mean you only need to provide a list of normal sentences, which should include the words you'd like it to know about.

For my case I just wanted to train it on about 30 different sentences; that took less than a second. But for a general assistant à la Google Home you'll want a large number of sentences, and I hear it can take a while (an hour or a few?).

Since it uses probabilities, it will match words from sentences other than the ones you give it, but from my understanding it will be partial to the ones you feed it if DeepSpeech misclassifies a character or two.

Here's an example I did using a custom LM with DeepSpeech - the description links back to the forum with the steps for producing it.


This was on a slightly earlier version, and they've made improvements in speed and quality of recognition since then.

> Hundreds of hours of audio data from your own voice?

I should clarify this. As I mentioned, training the neural net part requires tons of audio and the corresponding text (and people should totally contribute[1], the resulting data sets are released to the public). The neural net in DeepSpeech is then used on an audio stream and outputs a stream of characters.

Turning that stream of characters into sentences is what the language model is for.

Training the neural net is very data and compute intensive, but fortunately Mozilla provides pre-trained models.

Generating the language model is relatively cheap. And if your target language shares sounds with English, you may get away with using the English-trained neural net but with a non-English language model.

[1]: https://voice.mozilla.org/

The models (`deepspeech-0.6.1-models.tar.gz`) weigh 1.14 GB, if anyone's interested.

Here is a paper from a German university, where they tried to adapt DeepSpeech to German:


The upshot is that adapting works, but as of the paper's writing some months ago, the results were not very good yet. So other languages need better-trained models; English seems to be quite good.

What surprised me is that it works offline very fast, even on a Raspberry Pi!


Interesting. Are there any projects actively using this today?

I haven't tried it directly myself but there is this project, Dragonfire, which looks quite reasonable using DeepSpeech:


There's a minimal demo app I put together here too: https://github.com/nmstoker/SimpleSpeechLoop

I've been using DeepSpeech to learn to build voice controls for all sorts of things in JavaScript. And I've got a way to connect it to the web, so you'll be able to write speech-recognition-enabled web pages using client-side JavaScript.


Are you searching for a complete solution including NLP and an engine to perform actions? Some of these are already posted, like Home Assistant, and Mycroft.

Sphinx is just for the automatic speech recognition (ASR) part. But there are better solutions for that:

Kaldi (https://kaldi-asr.org/) is probably the most comprehensive ASR solution, which yields very competitive state-of-the-art results.

RASR (https://www-i6.informatik.rwth-aachen.de/rwth-asr/) is for non-commercial use only, but otherwise similar to Kaldi.

If you want to use a simpler ASR system, nowadays end-to-end models perform quite well. There are quite a number of projects which support these:

RETURNN (https://github.com/rwth-i6/returnn) is non-commercial, TF-based. (Disclaimer: I'm one of the main authors.)

Lingvo (https://github.com/tensorflow/lingvo), from Google, TF-based.

ESPnet (https://github.com/espnet/espnet), PyTorch/Chainer.


Already got the action engine: all the lights, HVAC, TV, calculator, computers, etc. are all controllable. None require internet now, or any kind of location services for that matter.

I really just want the speech-to-text. Ideally I'd also like it to recognise who's talking. But that's a bonus.

I'll second the recommendation for Kaldi. It's more complicated to get running than pocketsphinx, but in my experience Kaldi has better accuracy and lower latency in general cases (assuming the caveats below).

https://github.com/gooofy/zamia-speech/ has been training good [acoustic] models which are worth looking at (including training with robustness against noise). They've also got lots of code and docker images and documentation.

pocketsphinx isn't actually that bad to use with their latest acoustic models and small vocabularies (so its utility depends on your exact use case). But it's not generally good with far-field mics/DSP-processed audio, not really good with noise, and in my experiments not quite as fast as Kaldi.

Better/larger language models in my experience make a world of difference (esp. in the general-vocab case) for improving accuracy with either Kaldi or pocketsphinx. Nobody really seems to talk about this(?), since everyone always uses the news corpus from like the 80s as the default language model.

I haven't really ever gotten the various ~deepspeech systems working, so I can't speak to them.

I'm happy to feed it plenty of voice logs as well as a training corpus as necessary. Sounds like an interesting journey.

Would love to know more about this and how you've done it

I develop Kaldi Active Grammar [1], which is mainly intended for use with strict command grammars. Compared to normal language models, these can provide much better accuracy, assuming you can describe (and speak) your command structure exactly. (This is probably more acceptable for a voice assistant aimed at a more technical audience.) The grammar can be specified by an FST, or you can use KaldiAG through Dragonfly, which allows you to specify the grammars (and their resultant actions) in Python. However, KaldiAG can also do simple plain dictation if you want.

KaldiAG has an English model available, but other models could be trained. Although you can't just drop in and use a standard Kaldi model with KaldiAG, the modifications required are fairly minimal and don't require any training or modification of its acoustic model. All recognition is performed locally and offline by default, but you can also selectively choose to do some recognition in the cloud.

Kaldi generally performs at the state of the art. As a hybrid engine, although training can be more complicated, it generally requires far less training data to achieve high accuracy, compared to "end to end" engines.

[1] https://github.com/daanzu/kaldi-active-grammar

Too late to edit, but I should probably have noted that KaldiAG also would make it easy to define "contexts" when (groups of) commands are active for recognition. For example, if the TV is on, you could have commands for adjusting the volume/etc. But if it is off, those commands are disabled, so they can't be recognized, and further, the engine knows this and can therefore better recognize the other commands that remain active.

Could Home Assistant use such commands by running it in a Docker container?

Also, the video demo is rather impressive in how accurately (and predictably) it recognises.

I don't know much about Home Assistant, but that certainly should be possible to set up. The KaldiAG API is pretty low level, but basically: you define a set of rules, and send in audio data, along with a bit mask of which rules are active at the beginning of each utterance, and receive back the recognized rule and text. The easy solution is probably to go through Dragonfly, which makes it easy to define the rules, contexts, and actions. It might be a little hacky to do, but you should be able to wire it up with Home Assistant somehow.

Although I mainly use it for computer control as demonstrated in the video, I do have many commands akin to home automation, like adjusting the lights, HVAC, etc.
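The active-rules bitmask described above can be sketched in a few lines. This is a toy illustration with made-up rule names, not the real KaldiAG API:

```python
# Toy sketch of context-dependent command rules: each rule has a bit in a
# mask, and only rules whose bit is set are eligible for recognition.
# Names and matching logic are illustrative, not KaldiAG's actual code.

rules = ["volume_up", "volume_down", "lights_on", "lights_off"]

def recognize(text, active_mask):
    """Return the first active rule whose phrase appears in the utterance."""
    for i, rule in enumerate(rules):
        if active_mask & (1 << i) and rule.replace("_", " ") in text:
            return rule
    return None

# When the TV is on, only the volume rules are active
tv_on_mask = 0b0011
print(recognize("volume up please", tv_on_mask))  # prints "volume_up"
print(recognize("lights on", tv_on_mask))         # prints "None" (rule disabled)
```

Disabling inactive rules shrinks the search space, which is why the engine can then recognize the remaining active commands more reliably.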

Disclaimer: I am the founder of Hiome, a smart home startup focused on private-by-design, local-only products.

What actions are you looking to handle with the assistant?

Reason I ask is because a voice assistant is a command line interface with no auto-complete or visual feedback. It doesn’t scale well as you add more devices or commands to your home, because it becomes impossible to remember all the phrases you programmed. We’ve found the person who sets up the voice assistant will use it for simple tasks like “turn off all lights” but nobody else benefits and it gets little use beyond timers and music. They are certainly nice to have, but they don’t significantly improve the smart home experience.

If you’re looking to control individual devices, I suggest taking a look at actual occupancy sensors like Hiome (https://hiome.com), which can let you automate your home with zero interaction so it just works for everyone without learning anything (like in a sci-fi movie). Even if you’re the only user, it’s much nicer to never think about your devices again.

Happy to answer any questions about Hiome or what we’ve learned helping people with smart homes in general! -> neil@hiome.com

Sure. We've got a house with multiple buildings, including sheds, halls etc.

Around 100 people need separate profiles; each should be able to set alarms, timers, reminders, etc. If they want a routine that creates any of those, or tells them the time, date or temperature, they should be able to do that from any of the voice assistants in any room. They might only want such a routine in a particular room. They should be able to define a home device and a current device. The home device would usually be a bedroom, for those of us that need them, etc.

I definitely don't want to have to create any of those routines etc. for any of them. Nothing about these should be fixed in stone. They have to be able to self-serve. We can assume they can navigate the iOS Amazon app as a baseline level of knowledge.

Room settings include temperature, lighting, curtains, TV on/off, channel, and volume, to name a few. The voice assistant in some rooms should be able to show web pages on-screen, or even YouTube etc., including the laptop someone plugged in on HDMI1.

...the coffee machine automation is also a requirement. It's controlled by a Flask app. The voice control should be able to let you order a coffee, strong, black. Or a Dave#5.

We'd also like device detection to trigger when people's phones appear in certain locations.

What kinds of options exist for this?

That's definitely a new situation we haven't seen before!

Are the 100 people using all of the different rooms, or do people mostly stick to their own rooms (like a hotel/dorm)?

I'd love to see what you've built so far, and better understand the problems you're trying to fix. For example, does each room have its own coffee machine, or is it a communal coffee machine? Are the people living here permanently or rotating regularly? What is the goal for device detection (e.g., do you want to use that for presence detection or as a security system or something else)?

We have a prototype for a machine learning system that learns how you use your devices and then automates them by itself, so you don't have to set anything up. Our focus is lights because that's what most people have, but it can also control other on/off things like curtains or TVs right now. It sounds like it could be a good fit for a situation like this, and I'd be happy to chat with you more about whether it makes sense to try out!

Can you send me an email (neil@hiome.com)?

Have you considered the Almond integration for Home Assistant? (https://www.home-assistant.io/integrations/almond/)

Alternatively, you could just fork the Almond project directly and take it from there: https://github.com/stanford-oval/almond-cloud

No, I hadn't. This was why I asked: I know there are alternatives out there, but there's so much noise.


Almond has been posted on HN previously[1]. User voltagex_ commented[2] that self-hosting, while possible, is not recommended in the official installation instructions[3], as it is considered significantly more challenging to manage. This may or may not affect your decision to go forward with Almond.

[1] https://news.ycombinator.com/item?id=17532003

[2] https://news.ycombinator.com/item?id=17534793

[3] https://github.com/stanford-oval/almond-cloud/blob/master/do...

Is there any product where self-hosting _isn't_ more difficult? That seems like a generic warning that could apply to pretty much any product in this space. It seems more like a warning to non-technical users who might not have the experience or know-how to successfully set up a server.

If it were simply an issue of being "more difficult", then it wouldn't be worth pointing out.

However, the words "significantly more challenging to manage" are straight from their documentation that I linked which I think makes it worth pointing out.

Whether or not it is too challenging is for each individual to decide for themselves.

It also gives an indication of what the preferred method of deployment is (from the authors' point of view). In this case I read it as a warning that it might stop being supported in the future.

Pertinent quote from [3] regarding "more challenging":

"You must also deploy custom NLP models, as the official ones will not be compatible. Use this setup only if you absolutely need custom Thingpedia interfaces, and cannot provide these interfaces on Thingpedia."

If you don't mind getting your hands dirty a bit, I think Nvidia's model [Jasper](https://arxiv.org/pdf/1904.03288.pdf) is near SOTA, and they have [pretrained models](https://ngc.nvidia.com/catalog/models/nvidia:jaspernet10x5dr) and [tutorials / scripts](https://nvidia.github.io/NeMo/asr/tutorial.html) freely available. The first is in their library "nemo", but they also have it available in [vanilla Pytorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/P...) as well.

Do you have any experience/opinions on those?

You are welcome to try Vosk


Advantages are:

1) Supports 7 languages - English, German, French, Spanish, Portuguese, Chinese, Russian

2) Works offline even on lightweight devices - Raspberry Pi, Android, iOS

3) Install it with simple `pip install vosk`

4) Model size per language is just 50 MB

5) Provides a streaming API for the best user experience (unlike the popular speech_recognition Python package)

6) There are APIs for different languages too - java/csharp etc.

7) Allows quick reconfiguration of vocabulary for best accuracy.

8) Supports speaker identification besides simple speech recognition

Sounds great, thanks


Haven't used it, but seems very nice.


Rhasspy author here in case you have any questions :)

If you're looking for something for the command-line, check out https://voice2json.org

I think I'll dig in first. Cheers.

It can use https://github.com/cmusphinx/pocketsphinx (offline), Kaldi (offline), or Google (online) for the speech-to-text part.

The description from a superficial read looks good. Thanks!

I’m currently assembling an offline home assistant setup using Node-RED and voice2json, all running on Raspberry Pis:



Requires a little customization and/or coding, but it’s quite elegant, and all voice recognition happens on-device. Part of what makes the recognition much more accurate (subjectively, 99%ish) is the constrained vocabulary; the grammars are compiled from a simple user-defined markup language, and then parsed into JSON intents, containing both the full text string and appropriate keywords/variables split out into slots.
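For a flavor of that markup, a sentences file might look something like this. A sketch based on my reading of voice2json's template format; the names and exact syntax are illustrative, so check the docs:

```ini
[LightState]
light_name = (living room lamp | kitchen light){name}
turn (on | off){state} [the] <light_name>
```

Saying "turn on the kitchen light" would then come back as a `LightState` JSON intent with the `state` and `name` slots filled in, ready to route in Node-RED.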

Just finished a similar rig in my car, acting as a voice-controlled MP3 player, with thousands of artists and albums compiled into intents from the iTunes XML database. Works great, and it feels awesome to have a little 3-watt baby computer doing a job normally delegated to massive corporate server farms. ;)

Hi Lukifer, thanks for chiming in! I had a setup using Snips that I'm looking to replace. Please do document your setup and your little helper scripts in a blog post or such, and ping me/us :)

I was not able to find the same article online, but Volume 72 of Make Magazine has a great overview of different non-cloud voice recognition platforms. Here is a preview: https://www.mydigitalpublication.com/publication/?m=38377&i=...

Today they published an online version which covers many of the platforms listed in these comments:


important question

I think there's a group of highly technical people who feel increasingly left behind by 'convenience tech' because of what they have to give up in order to use it

Well, I've got all the gadgets controllable now without internet as a requirement. Only the voice part requires it now, for us at least. Google Home/Amazon Echo devices and phone apps can communicate with the house and surrounds without issue.

Loss of internet access is not an excuse for ignoring basic voice commands in my opinion.

Privacy is also an important factor but not the primary driver for us.

I've read good things about Mycroft [1], though I haven't tried it myself. It ticks all the boxes.

[1] https://mycroft.ai/

Mycroft can only kinda sorta be run offline according to the project/company: https://mycroft-ai.gitbook.io/docs/about-mycroft-ai/faq

I wish you luck with this, and more importantly, hope that it inspires many people to start building similar projects.

I know virtually nothing about voice recognition, but my spidey sense tells me that it should be possible with the hardware you specify.

A Commodore 64 with a Covox VoiceMaster could recognize voice commands and trigger X-10 switches around a house. (Usually. My setup had about a 70% success rate, but pretty good for the time!) Surely a 16 core, 128GB RAM machine should be able to do far more.

It's beginning to take shape. I've already got a bunch of good candidates for experimentation next week based on the answers so far.

My company develops SDKs for on-device speech recognition on Android/iOS: https://keenresearch.com/keenasr-docs (Raspberry Pi is an option too; we'll have a GA release in Q2)

We license this on a commercial basis but would be open to indie-developer-friendly licensing. We offer a trial SDK that makes testing/evaluation super easy (it works for 15 minutes at a time).



> Currently, the SDK supports English and Spanish out of the box. Additional ASR Bundles for most major spoken languages can be provided upon request within 6-8 weeks.

Does this mean you have a standing offer to train a new language on demand?

Yes, for a number of European and some Asian languages we can do this; it's mainly a question of business opportunity.

That's very cool. Did you somehow solve auto-generating a training corpus for any new language, as long as it's popular enough? Based on my impressions from working with other people in the space, coming up with the training data seems like the big bottleneck, as the available engines are already pretty good at learning new languages.

We have relevant training data for a number of languages.

I've had some good experience with https://snips.ai . Works as advertised, easy to implement. The hardest thing was getting the microphone and the Pi to get along.

Did you visit the website lately? Doesn't seem to be an option anymore :-/

Their code is open source: https://github.com/snipsco

Though there are about 100 repositories there. I am not sure if it is easy to put it all together.

They only seem to do audio equipment. Did they once do something more general?

They were acquired by Sonos. A few months ago they were offering tools to build your own speech recognition agents.

Before they were bought, you could build (with the help of a WebApp hosted by them) an offline-usable speech recognition module that could comfortably run on a Pi and would output parsed sentences in JSON format onto MQTT. Easy to integrate with everything in IoT. I loved it. Now I'm also looking for an alternative for the speech-to-text(-to-JSON) part, like you.

That sucks.

Like Pebble.

I was one of the maintainers of the Persephone project, an automated phonetic transcription tool. It came about from a research project that required a non-cloud solution. The project is open source and can be found on GitHub:


This may be a little too low-level for what you want, as there's no language model, but maybe it's helpful as part of your system.

I was in the process of planning my multi-room voice-AI setup based on SnipsAI (to be integrated with Home Assistant) when it was announced they were bought by Sonos, which killed their open source project. Since then I have been left trying various projects that meet my needs.

Among those, I tried Mycroft, which still requires a cloud account to configure various things, and it doesn't support a multi-room setup at this time.

I've since switched to Rhasspy, which offers a larger array of config options and engines, and also multi-room (I'm yet to config multi-room tho)

In the long-term I plan to "train" the voice-AI for various additions, including a custom wake word - No, I'm not calling it `Jarvis` ;)

I'm running each of these voice-AI's on a Raspberry Pi 4 (4GB model), though I'm considering switching them to Pi 3's. I'm using the `ReSpeaker 2mic Pi-Hat` on each pi for the mic input. I'm planning to configure all the satellite nodes (voice-AI in each room) to PXE boot, that way they don't require an sd-card and I can easily update their images/configs from a central location.

Is it possible to make your config available, e.g., GitHub?

I'm just starting to get going with Rhasspy, integrating with Home Assistant, and the docs miss just enough that I hit walls every time I try.

Thanks for the info you've already provided though, sounds like I want exactly what you do.

Google has papers on on-device speech recognition; these are used in the keyboard and for Live Caption on Pixel devices.

This article from the Google AI blog about the Gboard speech recognition is really interesting: https://ai.googleblog.com/2019/03/an-all-neural-on-device-sp...

They are trained on a ton of non-public data though, and I’m not sure if pre-trained models are around.

Nope, they aren't available. Closed-captioned YouTube videos or radio broadcasts + transcripts could prove helpful for multiple languages, as well as for creating a multilingual ASR.

I tried to use Julius for this. I may have misconfigured it, but it would always match something to what it was hearing. I encoded some sounds in my grammar as error terms that it would detect in quiet noise (like 'aa' and 'hh'), but it would still occasionally match words when nothing was going on.

Later I worked on the Microsoft Kinect with its 4-microphone array. With only a single microphone, it's so much harder to filter out background noise. If you don't find a system based on multiple microphones, I don't believe you can be successful if there's any ongoing noise (dishwasher, loud fans, etc.), but a system that works only in quiet conditions is possible.
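To illustrate why a mic array helps: a classic trick is delay-and-sum beamforming, where each mic's signal is shifted by its known arrival delay and averaged, so sound from the target direction adds up coherently while uncorrelated noise partially cancels. A toy sketch of the idea (illustrative only, not the Kinect's actual DSP):

```python
# Toy delay-and-sum beamformer: align each mic's samples by its known
# per-mic arrival delay (in samples), then average across mics.

def delay_and_sum(signals, delays):
    """signals: equal-length sample lists, one per mic; delays: samples."""
    n = len(signals[0])
    out = []
    for t in range(n):
        acc = 0.0
        for sig, d in zip(signals, delays):
            if t + d < n:          # skip samples shifted past the buffer end
                acc += sig[t + d]
        out.append(acc / len(signals))
    return out

# mic 1 hears the same pulse one sample after mic 0
mic0 = [0.0, 1.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 1.0, 0.0]
aligned = delay_and_sum([mic0, mic1], delays=[0, 1])
print(aligned)  # the pulse adds coherently at t=1: [0.0, 1.0, 0.0, 0.0]
```

Real arrays estimate the delays from the sound's direction of arrival, which is also how they steer toward a speaker and away from the dishwasher.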

HomeSeer automation software has this built in, with client listening apps for different platforms. I haven't used the voice recognition beyond testing, but I've been very happy with the software overall. It's relatively expensive, but goes on sale for about half price once or twice a year. There's a free 30-day trial.

I think there are two ID phrases per sub-device by default, but using virtual devices vastly expands the software's capability, especially for mapping virtual switches to multiple devices.

They also have Z-Wave devices that are, for the most part, much better than most.


>I'm happy if they talk back via wifi to a server in my office for whatever real processing. The server might have 16 cores and 128Gb ram.

Pretty sure Mycroft is capable of that, in theory; you'll need to configure it manually. The standard Raspberry Pi route isn't powerful enough for local processing.

Check out ReSpeaker for a Raspberry Pi microphone. You'll want one of the more expensive ones for range, though at like 40 bucks they're not that wildly expensive.

Make sure it's a Raspberry Pi 4, since the wake word is processed locally. And you probably don't need 128 GB of RAM. No idea what they use, but I doubt it's that much.

As discussed elsewhere in this thread by others, Mycroft can't do offline processing, according to their FAQ at least.

128 GB is the minimum I use for general-purpose servers, so this machine would be a repurposed machine rather than something specially ordered or built.

>mycroft can't do offline processing, according to their faq at least.

Not sure what FAQ you're reading there. There is a whole section of STT engines you can plug into it.


Including a local one. It's gonna be a pain in the ass, but as I said, it's possible.

Precise [1], Snowboy [2], and Porcupine [3] are all designed to work offline.

[1] https://github.com/MycroftAI/mycroft-precise

[2] https://github.com/kitt-ai/snowboy

[3] https://github.com/Picovoice/porcupine

All good projects (I’m using Snowboy), but these are all just wake-word engines, which are very different beasts from full speech-to-text.

Is there a voice dialer for Android that doesn't use Google? There used to be, but it disappeared in the "upgrade" which has the mothership listening all the time.

The most recent edition of Make magazine had a pretty good overview of some different options. Doesn't go too much into depth but provides a good starting point.

Hi, I'm the dev behind https://talonvoice.com

I've been working with Facebook's wav2letter project and the results (speed on CPU, command accuracy) are extremely good in my experience. They also hold the "state of the art" for librispeech (a common benchmark) on wer_are_we [1]. Granted, that's with a 2GB model that doesn't run very well on CPU, but I think most of the fully "state of the art" models are computationally expensive and expected to run on GPU. Wav2letter has other models that are very fast on CPU and still extremely accurate.

You can run their "Streaming ConvNets" model on CPU to transcribe multiple live audio streams in parallel; see their wav2letter@anywhere post for more info [2].

I am getting very good accuracy on the in-progress model I am training for command recognition (3.7% word error rate on librispeech clean, about 8% WER on librispeech other, 20% WER on common voice, 3% WER on "speech commands"). I plan to release it alongside my other models here [5] once I'm done working on it.

There's a simple WER comparison between some of the command engines here [3]. Between this and wer_are_we [1], it should give you a general idea of what to expect when talking about Word Error Rate (WER). (Note the wav2letter-talonweb entry in [3] is a rather old model I trained, known to have worse accuracy; it's not even the same NN architecture.)
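For anyone unfamiliar with the metric: WER is just word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference transcript. A minimal Python sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, rolling a single row.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion (ref word dropped)
                      d[j - 1] + 1,      # insertion (extra hyp word)
                      prev + (r != h))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```

So a 4-word command with one substituted word scores 25% WER, which is why small command vocabularies report much lower error rates than open dictation.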


As far as constraining the vocabulary, you can train a KenLM language model for Kaldi, DeepSpeech, or wav2letter by grabbing KenLM and piping normalized text (probably lowercased, with everything but ASCII letters and quotes removed) into lmplz:

    cat corpus.txt | kenlm/build/bin/lmplz -o 4 > model.arpa
And you can turn it into a compressed binary model for wav2letter like this:

    kenlm/build/bin/build_binary -a 22 -q 8 -b 8 trie model.arpa model.bin
There are other options, like using a "strict command grammar", but I don't have enough context as to how you want to program this to guide you there.
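To give a rough idea of what a strict command grammar looks like in practice: you validate the recognizer's output against a fixed set of phrase templates and reject everything else. The commands and names below are made up for illustration.

```python
import re

# Hypothetical command grammar: each pattern is one accepted phrase shape.
COMMAND_GRAMMAR = [
    re.compile(r"^turn (on|off) the (kitchen|office|bedroom) light$"),
    re.compile(r"^set temperature to (\d{1,2}) degrees$"),
]

def parse_command(transcript: str):
    """Return the captured slots if the transcript fits the grammar, else None."""
    text = transcript.strip().lower()
    for pattern in COMMAND_GRAMMAR:
        m = pattern.match(text)
        if m:
            return m.groups()
    return None
```

The upside is that anything outside the grammar (TV chatter, misrecognitions) is simply dropped instead of triggering a wrong action; the downside is you have to enumerate every phrasing you want to accept.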

I also have tooling I wrote around wav2letter, such as wav2train [4] which builds wav2letter training and runtime data files for you.

I'm generally happy to talk more and answer any questions.


[1] https://github.com/syhw/wer_are_we

[2] https://ai.facebook.com/blog/online-speech-recognition-with-...

[3] https://github.com/daanzu/kaldi-active-grammar/blob/master/d...

[4] https://github.com/talonvoice/wav2train

[5] https://talonvoice.com/research/

Looks interesting. I've got a few GPUs I could use if the CPU is too much of a bottleneck.

We've identified a dictionary of the types of commands and words we use and have a recording of all our amazon and other commands. Training wave files are not an issue.

Have you had any issues with recognising multiple languages?


Are you talking about recognizing multiple languages at once (e.g. you don't know which language the user will speak and you want it to react appropriately to all of them)? You're going to have a harder time doing that with any system, but it is possible. You would likely need to run multiple models and pick between their output, or train a specialized separate model that can classify the language, then use it to pick the model.
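The "run multiple models and pick between their output" approach can be sketched very simply; the per-language recognizers below are stand-ins (a real setup would call, e.g., one wav2letter decoder per language and compare normalized confidence scores, which are not directly comparable across models without calibration).

```python
# Sketch: run one recognizer per candidate language, keep the most
# confident result. Each recognizer is a fn(audio) -> (transcript, score).
def pick_language(audio, recognizers):
    results = {lang: rec(audio) for lang, rec in recognizers.items()}
    best_lang = max(results, key=lambda lang: results[lang][1])
    return best_lang, results[best_lang][0]
```

This is the brute-force option; a dedicated language-ID classifier in front is cheaper, since you only run one full decoder per utterance instead of all of them.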

I haven't personally tested wav2letter with other languages yet. I know zamia-speech trained a german model, and some users have been talking about training for other languages. I've been helping someone who is training several other languages and they've reported great success as well.

If you want to make a new model from scratch in any language, you'll probably want a couple hundred hours of transcribed speech for it, but it doesn't need to be your own speech. Common Voice is a good data source for that.

I'd expected some training to be required for each language separately, so we've got a decent collection of voice examples. Some of us mix languages within a sentence, but that's a bad habit anyway, so unsupported. I don't see figuring out the language as being a problem, as my proof-of-concept already handles that well enough with around 90% accuracy. Once it's selected a language, the appropriate model can be used. To make it fast we'd likely just keep the whole thing in RAM. Might need more, however.

The issue I see with Talon is it's currently Mac-only. That would still help one of us who lives on wheels, though (got a 16" MacBook IIRC and a Mac mini as well). Different set of use-cases, so things would be more relaxed.

I see some hints about a Linux version, however. I've got Windows/Linux VMs on the server but no other Macs. GPUs will be installed soon when I decommission some old gaming rigs.

Plenty to think about.

The Talon beta is on Windows/Linux/Mac. I was recommending wav2letter directly instead of Talon specifically because you mentioned thin clients, and I'm not really targeting something like headless Raspberry Pis yet.

I mostly mentioned wav2letter@anywhere because it could handle a bunch of audio streams centrally, so you can stream from 16 pis to a central box, and it's very accurate.

My "workaround" for using offline recognition in several languages in `snips.ai` was configuring a different wake-word per language, and then running several wake-word-detectors on the same microphone input.
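The routing logic for that workaround is tiny: each wake-word detector that fires selects the language model used to decode the rest of the utterance. The wake words and languages below are illustrative, not snips.ai's actual configuration.

```python
# Sketch: several wake-word detectors share one microphone stream;
# whichever one fires picks the language for the follow-on decoding.
WAKE_WORD_LANGUAGE = {
    "computer": "en",
    "ordinateur": "fr",
    "rechner": "de",
}

def route_utterance(detected_wake_word: str) -> str:
    """Map the fired wake word to a language, defaulting to English."""
    return WAKE_WORD_LANGUAGE.get(detected_wake_word.lower(), "en")
```

The nice property is that language ID piggybacks on the wake-word stage you already need, so there's no extra model to run.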

Exactly how mine works. Each language uses a different trigger. Simple.

Even then it can misinterpret, so there's plenty of room for improvement. My quick POC is around 90% accurate for language detection based on the trigger word.

Are you willing to list the languages you'd like to recognize?

Jasper is another project that uses CMU Sphinx (https://cmusphinx.github.io/wiki/faq/) under the hood.

Apple platforms offer an API (SFSpeechRecognizer) which for some languages supports on-device recognition. Trivial to set up, super easy to use, and pretty reasonable accuracy.

Disclaimer: Working for Apple, not directly on this API but in related subjects.

I guess not always.

"The speech recognition process involves capturing audio of the user's voice and sending that data to Apple's servers for processing. The audio you capture constitutes sensitive user data, and you must make every effort to protect it. You must also obtain the user's permission before sending that data across the network to Apple's servers. You request authorization using the APIs of the Speech framework."


Which languages are processed on device and not sent to Apple's servers?

> I guess not always.

Yeah, I suppose I should have formulated that more clearly. The API offers cloud speech recognition for a set of languages, and on-device speech recognition for a subset of these.

> Which languages are processed on device and not sent to Apple's servers?

It's not a static set, because (1) availability tends to expand over time and (2) when you start using a new language, the on-device model needs to be downloaded first.

So what you need to do is create a SFSpeechRecognizer and then test the supportsOnDeviceRecognition property. If that is set, you can set requiresOnDeviceRecognition on the SFSpeechRecognitionRequest.

As an aside, it seems you're interested in speech recognition, or speech to text, not voice recognition. Voice recognition is a different problem, where the particular speaker needs to be recognized from voice.

I used to work in the speech reco space for many years. This battle was lost long, long ago. Virtually no one outside of the space really cares, and people use speech and voice recognition interchangeably. It's really more the case that voice recognition is just an ambiguous term.

It’s pretty hard to blame lay people when speech reco products like Dragon are widely marketed as “voice recognition”.

If you want to be clear and not just pedantic just call it speaker recognition.

I'd primarily like speech-to-text and the ability to know who is speaking. I have low expectations of the identification of speaker however.

If you're wanting a lot of people to use your solution as you described, recognition of who is speaking could add a lot of extra possibilities.

I've been waiting for Mycroft to release something new, https://mycroft.ai/

Caveat: they keep having delays and may never release v2, IMO.

Some modern web browsers support the Web Speech API (https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...), which may or may not involve a cloud service.

Here is the Google Chrome Web Speech API demo page: https://www.google.com/intl/en/chrome/demos/speech.html

In which browser doesn't it involve a cloud service?

Use Firefox.

Google Chrome does use a cloud service.

Edit: Firefox does not support the Web Speech API at this time. There are not currently any offline versions of this API as far as I can tell.

According to MDN, Firefox doesn't support it?


EDIT: According to other documentation, it is behind a config flag. If you enable it, it will send the data to Google's API through a Mozilla-operated proxy: https://wiki.mozilla.org/index.php?title=Web_Speech_API_-_Sp...

You're right.

It looks like there is no offline version :(

Android has a local speech recognizer; maybe give that a go? You'll have to make an Android app, though.

If cloud services are such an issue (as they would be for me), then it's worth considering the security of local logs and recordings as well. Maybe limit their lifetime, or even use full-disk encryption (FDE).

This feels like an area where a Dropbox-style self-hosted solution will emerge.

Question: what aspect of your product restricts the software architecture from using an "off-site cloud"?

If you've got an Amazon Echo or equivalent, try this experiment: disconnect your internet. Now issue a voice command to it. Why is your light not switching on? Why is Alexa not answering your questions?

Now you know why I don't want an "off-site cloud" involved for particular things. Not everything. Just what matters. A few things here and there.

There's nothing wrong with the cloud, and nothing stops me from using its benefits, except one minor detail: my use-cases are likely different from others'. I wish to keep cloud access where it's needed, but keep just a little bit more under my own control.

Don't bother.

The cloud based solutions are so vastly superior to the current non-cloud solutions that unless you're something of an expert in ASR you're just going to get frustrated. If you're worried about privacy, Google lets you pay a little extra to immediately delete the audio after you send it to their servers.

I think I will bother; the use cases are valid and worthwhile to pursue for multiple reasons, many of which aren't even on my radar. Bonus: I can have my cake AND eat it too.

But no one can be sure that Google doesn't keep a copy of the data.

Cloud = giving out your data

And some of us still have those odd privacy concerns...

But seriously, it sometimes feels like we are a very small minority.

The poster specifically asked for non-cloud.
