
A 2019 Guide for Automatic Speech Recognition - mwitiderrick
https://heartbeat.fritz.ai/a-2019-guide-for-automatic-speech-recognition-f1e1129a141c
======
skykooler
So how can one actually use one of these systems? I'm familiar with
pocketsphinx, where you download it and then run pocketsphinx_continuous and
it prints a transcription of the microphone. Or Julius, where you write a
grammar and run it with that and it prints the output. The process for
DeepSpeech seems to be along the lines of "get access to a machine with ten
GPUs, find a huge dataset, train it, and then run that somehow" - are there
compiled models that can actually be shipped as part of an application?

~~~
jamesonthecrow
Obviously the big cloud players offer their own APIs and SDKs (for a price),
but there are a few other solutions worth looking at.

Facebook has open sourced some pre-trained models:
[https://github.com/facebookresearch/wav2letter](https://github.com/facebookresearch/wav2letter)

Picovoice has some smaller, more efficient models capable of running on edge
devices: [https://github.com/Picovoice](https://github.com/Picovoice)

Full ASR does require quite large models and datasets, but you don't need
nearly that much power or data to fine-tune a model for your own domain.

~~~
GordonS
Wasn't aware of Picovoice. Just tried the live demo they have on their
website... wow, it's... not great! Even if I spoke as precisely as possible
and/or put on an American accent, it was _way_ off the mark.

------
gok
I'm a little confused about the title because the first paper is from 2014.

It's also too bad this doesn't mention any traditional HMM-based ASR
techniques, as HMMs continue to be used on many SOTA systems, particularly
those that can be reproduced publicly:
[https://github.com/syhw/wer_are_we](https://github.com/syhw/wer_are_we)
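
For anyone unfamiliar with the metric that repo tracks: word error rate (WER) is usually defined as the word-level edit distance between the reference transcript and the recognizer's output, divided by the number of reference words. A minimal sketch, assuming plain whitespace tokenization and no text normalization (real benchmarks typically normalize case and punctuation first); the function name is just illustrative:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing = i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the output contains many inserted words.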

~~~
priansh
This.

The article quotes DeepSpeech, wavenets, LSTMs of all sorts; essentially, all
the neural networks that scale terribly. DeepSpeech for example is pretty
heavy and requires a decent GPU to get anywhere near a realtime factor of 1.

Meanwhile ASR through HMMs consistently hits realtime factors sub-1 and can
run on small CPUs. e.g. the default models Kaldi ships with outperform
DeepSpeech on a lot of modern examples, and are exponentially faster.

Moreover, training these HMMs is something that is feasible for a normal
developer. Training newer models requires data of scale and quality (iirc
Mozilla's models are trained on Common Voice, which is an enormous
crowdsourced dataset, and Google's wavenet models use an internal dataset of
very high quality and quantity).

Until the models get more practically achievable, ASR for average people will
probably continue to be dominated by Kaldi, Sphinx, etc.

~~~
lunixbochs
I'm hitting full-encode/decode realtime-factor in the ballpark of <0.01x on a
quad core CPU with a tuned wav2letter
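
For readers comparing the numbers in this subthread: realtime factor (RTF) is commonly computed as decoding time divided by the duration of the audio, so anything below 1.0x is faster than realtime. A rough sketch of how one might measure it; `decode` is a hypothetical zero-argument callable wrapping whatever recognizer is being benchmarked, not an API from wav2letter, Kaldi or DeepSpeech:

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio that was decoded."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(decode: Callable[[], object], audio_seconds: float) -> float:
    """Time a single call to `decode` and return the resulting RTF."""
    start = time.perf_counter()
    decode()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


if __name__ == "__main__":
    # Stand-in for a real recognizer: sleeps 50 ms "decoding" a 10 s clip.
    rtf = measure_rtf(lambda: time.sleep(0.05), audio_seconds=10.0)
    print(f"RTF: {rtf:.3f}x")
```

A single timed call is noisy; averaging over many clips, and excluding model load time, gives numbers that are easier to compare across machines.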

~~~
priansh
wav2letter is pretty fast, but we haven't been able to break 1.1x on a
t2.medium in any of our benchmarks -- what's your setup here?

I definitely think it's a big step in the right direction; it's easily 100x
faster than DeepSpeech for us.

If I could have anything I wanted for xmas, I'd ask for a speech to text
system that is fast enough to work in browser thru wasm or something.

~~~
lunixbochs
My setup is a work in progress, see the sibling thread. My 2015 2-core MacBook
is probably not faster than a t2.medium, so you should be able to hit the same
sort of 0.05x ballpark numbers easily with the same sort of setup.

Is there a SIMD.js / WASM equivalent optimized convolution / GEMM? That's
pretty much all we'd need to port this to web... well, that and maybe a
language model that isn't 1GB. The wav2letter acoustic model I'm using is
based on the librispeech conv_glu, which is almost entirely served by conv1d
layers.

I've honestly already been considering a demo for my main project (which is
mixed English / command decoding) that runs entirely in a web page. If you
have engineering time to throw at your Christmas wish, we should talk :P

------
GordonS
Are there any good quality OSS speech recognition libraries that are easy to
get started with, or is it still so complex/expensive that this is a fantasy?

I really love the idea of hacking something together for development so I
don't need to use my arms and hands so much, or could lessen my mouse use (for
accessibility reasons), but I don't know how realistic that is.

Edit: I looked into cloud-based ASR, such as that provided by Azure and AWS,
but that would mean network latency on top of the recognition latency, and
that would drive me _nuts_!

~~~
wes-k
I dream of building a competitor to Siri and Google and I’d probably use
[https://snips.ai/](https://snips.ai/). I think it gains recognition accuracy
by having a limited skill set. Looks good though and has functionality for
defining skills.

~~~
GordonS
This looks interesting, as I don't need "conversation-level" ASR, I'd just
need it to work with a limited grammar.

But I'm unsure of what Snips actually is - I had a look at the website, but I
don't know if this is OSS, commercial software, a library or what?

~~~
ragebol
I've been trying out Snips, it's pretty cool and works reasonably well. A lot
of the overall system is open source and runs offline, but the training
happens on their servers and is closed source AFAIK. You download the trained
model and can run it on a Raspberry Pi etc. offline. But they claim that what
Snips offers for free isn't nearly as good as what the commercial offering
does. My priorities unfortunately shifted away from finding out about the
quality difference. IIRC Snips's business model is to create custom voice
agents that run offline for other companies and services around that, e.g.
custom hotwords etc.

------
konz
Related guide from last week: A 2019 Guide to Speech Synthesis with Deep
Learning
([https://news.ycombinator.com/item?id=20819672](https://news.ycombinator.com/item?id=20819672))

------
m4tthumphrey
Off topic: Why does that site not show the scrollbar? I use it to work out how
long a post is...

~~~
nwalker85
It's a design decision that is gaining a lot of traction lately.

