
Speech-to-Text Benchmark: Framework for benchmarking speech-to-text engines - kenarsa
https://github.com/Picovoice/stt-benchmark
======
smt88
I'm surprised no one has questioned that this benchmark is published by the
creators of Cheetah. The conflict of interest is extreme.

Is anyone enough of a STT expert to weigh in? Why should I trust this?

~~~
kenarsa
I agree with your point. That is why we open-sourced the benchmark so you can
verify it yourself. You just need an Ubuntu machine with Python 3.6. I
appreciate the question.

~~~
smt88
It's the methodology, not the results or the code, that I'm suspicious of. For
a highly variable task like STT, I'm sure an expert could contrive a test that
gives far better results for either of the other programs tested.

That's why it would help to know why this test is comprehensive and
representative enough to be considered unbiased or otherwise where its biases
are. I don't have that expertise myself.

~~~
p1esk
I think the biases are pretty obvious, but the most serious shortcoming of
this benchmark is that their result (30% WER on CV) is not reproducible: it's
not clear what they trained their model on, and the model itself is not
available, so you just have to take their word for it.

~~~
kenarsa
Thanks for the comment. Just wanted to quickly clarify that the model is
available here: [https://github.com/Picovoice/stt-
benchmark/tree/master/resou...](https://github.com/Picovoice/stt-
benchmark/tree/master/resources/cheetah)

~~~
p1esk
No, "reproducible" means that I can train it on the same data you used, and
get the claimed result.

Anything other than that is taking your word for it.

~~~
neikos
As someone who has no idea about ML, why is knowing how it was trained
important?

I would guess to anticipate possible problems, but I don't really know.

~~~
lern_too_spel
To make sure it wasn't trained on the test set, even if by accident.

~~~
canada_dry
Kinda like creating your acceptance test using the same scripts as your unit
test... the result would look good, but in fact be less than reliable.

~~~
DougMerritt
More than kind of like; you nailed it: exactly like. This has been an infamous
issue with machine learning for decades, where unwary researchers/developers
can do this quite accidentally if they're not careful.

The thing is that training data is very often hard to come by due to monetary
or other cost, so it's extremely tempting to share some data between training
and testing -- yet it's a cardinal sin for the reason you said.

Historically there have been a number of cases where the best machine learning
packages (OCR, speech recognition, and more) have been best because of the
quality of their data (including separating training from test) more than
because of the underlying algorithms themselves.

Anyway it's important to note that developers can and do fall into this trap
naively, not only when they're _trying_ to cheat -- they can have good
intentions and still get it wrong.

Therefore discussions of methodology, as above, are pretty much always in
order, and not inherently some kind of challenge to the honesty and honor of
the devs.
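
A minimal sketch of the kind of sanity check this implies, assuming each dataset split ships as a CSV manifest with a column naming the audio clip (the file names and column name below are just illustrative):

    import csv

    def load_clip_ids(manifest_path, key="path"):
        """Collect the audio-clip identifiers listed in a dataset manifest."""
        with open(manifest_path, newline="") as f:
            return {row[key] for row in csv.DictReader(f)}

    # Any overlap between the training and test manifests means the test set
    # is contaminated and the WER measured on it cannot be trusted.
    train_ids = load_clip_ids("train-manifest.csv")
    test_ids = load_clip_ids("test-manifest.csv")
    overlap = train_ids & test_ids
    assert not overlap, f"{len(overlap)} test clips also appear in training"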

------
el5r
Looks like Kaldi gets 4.27% WER on CV: [https://github.com/kaldi-
asr/kaldi/blob/master/egs/commonvoi...](https://github.com/kaldi-
asr/kaldi/blob/master/egs/commonvoice/s5/RESULTS#L19)

But that's likely trained exclusively on CV audio and using a language model
derived from the CV training data so that's also not a fair comparison unless
the other engines were trained the same way.

Comparing systems trained on different datasets (with different language
models) like this is like comparing apples to oranges. Mixing wildly different
CPU and memory requirements into the benchmarks just makes it worse.

It should be fairly straightforward to do an unbiased comparison by training
and evaluating with the standard Librispeech split and language model. It
might be interesting to see how accuracy improves as the models scale up until
they match the resource requirements of the other engines.
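
For reference, WER itself is just word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words; a minimal sketch:

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level edit distance over the reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming table for Levenshtein distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ~0.17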

That said, the speed and memory usage are impressive and I like the focus on
very low resource environments. Seems like it has a lot of potential even if
it may not be SOTA.

~~~
kenarsa
Thanks for the link. I'll be sure to look into it. It is impressive.

A disclaimer is that we use the valid-train portion of CV as part of our
training set, but it is less than 10% of our training set (in terms of hours).
Also, we do not employ an LM, mainly because the systems we are targeting do
not have enough storage for a strong LM (usually the storage on them maxes out
at 64 MB). Cheetah is an end-to-end acoustic model. For later versions, we
might be able to add a well-pruned LM for specific domains with limited
vocabulary to boost accuracy within the limited storage available.

I fully agree with your points. I am taking notes here, as I think we should
follow up on a couple of your suggestions. Scaling up to DeepSpeech model size
can be a bit tricky as it would require much more compute (GPU), but it should
be quite doable with time and budget.

Thanks again for your comments and suggestions. As you correctly pointed out,
our main focus is very low-resource (CPU/memory) embedded systems.

------
DonHopkins
It would be great to have a service that let you personally audition a bunch
of speech recognition engines to measure and compare how well they understand
YOUR voice.

~~~
sonnyblarney
Yeah, but it'd be like getting a blood test and trying to interpret it
yourself :).

There are so, so many variables - and many engines could be optimized for
'your voice and the things you're going to say right now given CPU/memory,
quality of microphone, background noise, etc.'

The game of NLP is inherently about dealing with 'noisy channels' (in the
academic sense) in which there is kind of a probabilistic guarantee of
imperfection. So then it comes down to creating the best products in a given
context, which is almost always 'less than optimized' for any individual.

So there's the model size, cpu/ram, quality of signal (microphone, network)
just to start.

Optimizing for standard English probably means reducing quality for people
with accents. Maybe in a specific context you could go from 80% accuracy to
85% for 'most of us' - but then you go from 60% to 40% for anyone with an
accent.

And if you reduce the accepted vocabulary, you can get way better results. Of
course, we all might want to say words like 'obvolute' and 'abalienate' every
so often.

Kind of thing.

It's really fun from an R&D perspective, but it's a product manager's
nightmare. Consumer expectations with these technologies are really
challenging: because of the inherent ambiguity in such a system, people kind
of want perfection. And there are always corner cases where it would _seem_
like things should be easy, but they're not - because the word you're saying
is common, and you're saying it 'perfectly clearly' ... but little do you know
there are 2 or 3 other very rare words that sound 'just like that', ergo ...
problems.

From a product perspective, it basically always feels 'broken', which is such
a terrible feeling :).

But it can be fun if you like really hard product challenges which have less
to do with tech and more to do with pure user experience, expectations,
behaviours etc.

~~~
glup
I think this is a great summary of where we are now, but stronger
broad-coverage language models (= expectations for what people say, better
generative models of speakers) are feasible (for a couple of tens to hundreds
of millions in R&D) and could bring ASR up to parity with people. It's pretty
clear we are getting close to the limits of what acoustics can offer, and the
language model is the next frontier both in terms of accuracy and real-time
performance.

~~~
sonnyblarney
Agreed. We are getting close to a nice reality as all of the pieces are
getting better.

------
CommanderData
Can this be used to transcribe voice data in real time?

I am building a Docker image which will eventually accept in-browser audio via
WebSockets and output transcription in real time, without needing Google
WebSpeech.

[https://github.com/ashwan1/django-deepspeech-
server](https://github.com/ashwan1/django-deepspeech-server)

I planned to use DeepSpeech, but this looks promising given its low resource
requirements.
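
A rough sketch of this kind of server, using the asyncio `websockets` package; `transcribe()` is a placeholder for whichever engine (DeepSpeech, Cheetah, ...) ends up behind it:

    import asyncio
    import websockets  # pip install websockets

    def transcribe(pcm_bytes: bytes) -> str:
        """Placeholder: hand raw 16 kHz, 16-bit mono PCM to the STT engine."""
        raise NotImplementedError

    async def handler(websocket, path):
        # The browser streams binary audio chunks; reply with the latest transcript.
        audio = b""
        async for chunk in websocket:
            audio += chunk
            await websocket.send(transcribe(audio))

    asyncio.get_event_loop().run_until_complete(
        websockets.serve(handler, "0.0.0.0", 8765))
    asyncio.get_event_loop().run_forever()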

~~~
themarkn
Could you tell me some more about this? I have used the Web Speech recognition
API in Chrome (coming soon to Firefox also) to create a free project for
real-time, editable transcriptions in the browser that can be projected on a
screen or subscribed to on a person's own device.

I am frustrated by:

\- lack of cross-browser support on whatever device is generating the
transcript. It will work from any Android phone but not from iOS

\- No offline functionality, because the work is actually not done locally

\- Poor accuracy... we can correct this on the fly with the live editing, but
it feels like it is not using the latest & greatest Google has to offer in
terms of STT

I don't know much about Docker or how to "use" a Docker image. Not asking you
to teach me, but do you think your project would be useful for what I'm
talking about? Currently using Firebase for the backend but it's really just
exploratory at this point.

~~~
CommanderData
My goal is to create a Docker image with all necessary dependencies,
deployable with a WebSocket endpoint exposed - similar to the Watson S2T demo
([https://speech-to-text-demo.mybluemix.net](https://speech-to-text-demo.mybluemix.net)),
which uses WebSockets.

This is after I struggled to find anything in usable form on GitHub, and
Google Cloud pricing is prohibitive for my free projects. Perhaps there has
been progress?

 _web speech recognition API in Chrome_

I looked at this last year (see my comments 8-12 months back) and it seems
either Google or the Chromium dev team removed recognition.serviceURI. It was
there in Chrome but was later removed. It would have been really helpful to
just switch out the provider from the browser's default to an alternative,
perhaps cheaper or free, option. I don't know enough about the decision behind
this, or whether it was actually working in the first place. To be fair, the
whole "SpeechRecognition" spec is still in draft. See:
[https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...](https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition/serviceURI)

The end goal is to easily provide ASR/S2T through WebSockets. The uses and
possibilities I have thought of just by myself are many, and I am sure it will
be of use to others.

> I don't know much about Docker or how to "use" a Docker image [....] but do
> you think your project would be useful for what I'm talking about? [....]

Yes. But accuracy is probably something Google is winning on here. If your app
is browser-based then, unless my assumption is incorrect, you do not need an
API key and therefore are not bound by quota limits/costs.

~~~
themarkn
Thanks for expanding! Yes, with the spec being in draft, who knows what will
change. I'm really interested in seeing what the Firefox implementation turns
out to be. Hopefully it will be close enough to Chrome's that everything still
works.

Are you on Twitter or is there some other way for me to know about your
progress?

------
gok
If Mozilla’s DeepSpeech is getting a 30% WER on this test set with a >2GB
model... something is very wrong.

~~~
kenarsa
This is a bit of a surprise to me as well. That being said, the dataset is a
tough one. I am an ESL speaker, but some of the accented examples are hard for
me to understand. Also, the recordings are not near-field and there is
sometimes some background noise.

~~~
p1esk
What is the WER of your model on LibriSpeech?

~~~
kenarsa
We used the entire LibriSpeech corpus as part of our training set.

~~~
p1esk
Until you publish a LibriSpeech WER for your model, it's pointless to discuss
any advances in efficiency. I have no clue whether your results on Common
Voice are good or bad.

------
oulipo
Hi, I'm a co-founder of [https://snips.ai](https://snips.ai). We are building
a 100% on-device, private-by-design, open-source voice AI platform with our
own ASR, which works on Raspberry Pi 3, Linux, iOS, and Android.

We already have a community of more than 14,000 developers on the platform,
and you can get your own assistants running on a Raspberry Pi in less than 1
hour. Here are a few tutorials:

\- [https://medium.com/snips-ai/voice-controlled-lights-with-a-r...](https://medium.com/snips-ai/voice-controlled-lights-with-a-raspberry-pi-and-snips-822e53d7ede6)

\- [https://medium.com/snips-ai/an-introduction-to-snips-nlu-the...](https://medium.com/snips-ai/an-introduction-to-snips-nlu-the-open-source-library-behind-snips-embedded-voice-platform-b12b1a60a41a)

~~~
mcjiggerlog
I can't be the only one that sees "Token Sale" and immediately loses all
interest.

------
nathell
I wonder how these engines compare to cloud services described in [1].

[1]: [https://blog.rebased.pl/2016/12/08/speech-
recognition-1.html](https://blog.rebased.pl/2016/12/08/speech-
recognition-1.html)

~~~
kenarsa
Great question. I believe that someone has already performed this measurement
on a variety of cloud APIs (Google, Amazon, MS, etc.). I remember seeing it on
GitHub a while ago. I can't find it right now with a simple Google search, but
I will spend more time later in the evening and comment here when I have it.

The comments are absolutely correct. Cloud services can do better simply
because they have access to more compute resources and also data (what is sent
to them can be/is used for training later). There are situations where
on-device is preferred due to privacy, latency, cost, or lack of an internet
connection.

------
malceore
This basically seems like marketing paraded as research.

------
berbec
If I'm reading this correctly, the resources used by Cheetah would allow your
laptop to do STT for around 30 streams?

~~~
kenarsa
If you use the CPU that we used for testing, yes. It's an Intel CPU; the
details are in the README. You can do 30 streams per core, so if you have 2
cores you can do 60.
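
The back-of-the-envelope version of that arithmetic, assuming "30 streams per core" comes from a real-time factor of roughly 1/30 (about 33 ms of CPU time per second of audio):

    real_time_factor = 1 / 30       # assumed: CPU-seconds needed per second of audio
    cores = 2
    concurrent_streams = cores / real_time_factor
    print(int(concurrent_streams))  # 60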

~~~
berbec
With such amazing efficiency, I would suggest forking off a second branch that
massively increases complexity for gains in WER. CPUs are just going to keep
getting cheaper and faster, and being able to leverage the extra cycles on a
platform that has them would allow you to dominate from embedded up to the
Xeon space.

~~~
kenarsa
That is a good point. Definitely a valid roadmap and something we should
consider. Thank you for the suggestion.

------
mhei
I wonder if they actually trained Cheetah on a different or the same dataset
they are benchmarking on.

~~~
kenarsa
Thanks for the comment. We don't train on the dataset being tested on, and I
am fairly certain the other engines don't either.

------
tootie
What would you folks use for a much narrower use case more like an IVR system?
Like if I could give a very restricted set of inputs to recognize and don't
need much if any learning to happen.

~~~
aviv
IBM Watson with
[https://freeswitch.org/confluence/plugins/servlet/mobile#con...](https://freeswitch.org/confluence/plugins/servlet/mobile#content/view/16352039)

Or stream to Google Cloud.

Both are used in production successfully already. Latency is on par with what
you would experience calling FedEx and talking to their ASR bot.

------
ausjke
Just checked it out and tried to play with it; it turns out this is actually a
binary-only (closed-source) release. I'm on a MIPS platform, so there is no
way I can test it there.

~~~
kenarsa
Unfortunately, we only offer Ubuntu x86_64 at the moment. We are planning to
add more platforms (Mac and Windows). I'll make a note of your request and add
it to our to-do list. Thank you.

------
btashton
I'm curious about the background of the team. They don't seem to have any
linguistics experts from what I could tell.

~~~
glup
If that quote attributed to Fred Jelinek ("Every time I fire a linguist, the
performance of the speech recognizer goes up") is right, they should have
negative linguists.

~~~
DougMerritt
> If that quote attributed to Fred Jelinek ("Every time I fire a linguist,
> the performance of the speech recognizer goes up") is right, they should
> have negative linguists.

That's a really, really, really stupid quote that pushes the increasingly
popular world view that experts only drag the world down on every subject.

Linguists' contribution is unrelated to performance, positive or negative.
Performance is the job of computer scientists designing architecture and of
skilled programmers doing smart implementations of architecture.

The performance would probably go up if the company fired all presidents/vice
presidents/CFOs etc., too (giving implementors free rein, unconstrained by
costs and schedules), but the company would probably go under.

Every specialist has their own role to play. Thinking that subject domain
expertise is a negative is indefensible.

~~~
glup
I think you are misinterpreting the quote. He meant that experts
(computational linguists and statisticians) had a firmer grasp of the task of
inference in language than linguists, who tend to care more about formal
structure (i.e. Chomsky's competence/performance distinction) and are less
aware of problems like overfitting. If anything, it's stressing the importance
of domain expertise, just suggesting that who those experts are may be
nonobvious.

~~~
DougMerritt
Thanks for explaining. I would then say that a quote that actually means the
opposite of what it appears to mean is a stupid quote for other reasons.

Even with the explanation, it still comes across more as blaming the linguists
than the people who assigned the wrong people to the wrong tasks.

(Why are they fired rather than reassigned? Why were they ever hired if
they're the wrong specialty and aren't needed elsewhere?)

