
A Comparison of Automatic Speech Recognition Systems - aberoham
https://blog.timbunce.org/2018/05/15/a-comparison-of-automatic-speech-recognition-asr-systems/
======
bluGill
I wonder how mozilla deepspeech compares
[https://github.com/mozilla/DeepSpeech](https://github.com/mozilla/DeepSpeech)

I'm sure it is not better than Google, but it seems to do fairly well in my
limited tests.
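
For anyone curious, running it with one of the pretrained models looks
roughly like this (a minimal sketch assuming the newer 0.7+-era Python
bindings and a 16 kHz mono WAV; the file names are placeholders, and
earlier releases had a different Model constructor):

    # Minimal sketch: Mozilla DeepSpeech on a 16 kHz mono WAV.
    import wave
    import numpy as np
    from deepspeech import Model

    model = Model("deepspeech-model.pbmm")  # placeholder pretrained model

    with wave.open("audio.wav", "rb") as w:
        assert w.getframerate() == 16000 and w.getnchannels() == 1
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    print(model.stt(audio))  # prints the transcript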

~~~
gok
Note that Mozilla's "Project DeepSpeech" is just a training and recognition
engine, really an implementation of one specific design for large-vocabulary
continuous speech recognition (LVCSR). A full recognition system also
requires statistical models, generally trained from very large sources of
data, which you have to bring yourself.

~~~
singularity2001
Don't they provide (pre)trained models? Someone on GitHub does.

~~~
stephensonsco
Pretrained models are very nice for spinning up a system and starting to use
it. You need them because training the model is so hard. But the pretrained
models are by no means good general models. They are trained with narrow
parameters (like the number and size of the datasets), so their ability to
generalize is very low. It's not uncommon for a system like that to hit 5%
WER on Switchboard (think MS/IBM), then have that same system perform at 40%
WER on other audio.
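
For concreteness, WER is just word-level edit distance (substitutions +
insertions + deletions) divided by the number of reference words; a
minimal sketch:

    # Minimal WER sketch: Levenshtein distance over word tokens,
    # divided by the reference length. 0.05 means 5 errors per 100 words.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                d[i][j] = min(
                    d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # sub/match
                    d[i - 1][j] + 1,                               # deletion
                    d[i][j - 1] + 1,                               # insertion
                )
        return d[len(ref)][len(hyp)] / len(ref)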

------
raisedbyninjas
Could this corpus be tested against Cortana?
[https://blogs.microsoft.com/ai/historic-achievement-microsof...](https://blogs.microsoft.com/ai/historic-achievement-microsoft-researchers-reach-human-parity-conversational-speech-recognition/)

~~~
gok
There is Microsoft's Azure speech-to-text API:
[https://azure.microsoft.com/en-us/services/cognitive-service...](https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/)

But MSR's research system from that paper is probably quite different from
what's deployed in Cortana/Azure, and Switchboard is a very different task
from podcast transcription.
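
For reference, the short-audio REST endpoint can also be hit directly (a
minimal sketch; the region, key, and file name are placeholders, and it
assumes 16 kHz mono PCM WAV):

    # Minimal sketch: Azure Speech-to-Text short-audio REST endpoint.
    import requests

    region = "westus"              # placeholder
    key = "YOUR_SUBSCRIPTION_KEY"  # placeholder

    url = (f"https://{region}.stt.speech.microsoft.com/speech/recognition/"
           "conversation/cognitiveservices/v1?language=en-US")

    with open("audio.wav", "rb") as f:
        resp = requests.post(
            url,
            headers={
                "Ocp-Apim-Subscription-Key": key,
                "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
            },
            data=f,
        )

    print(resp.json().get("DisplayText"))  # simple-format response field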

------
gok
This article seems to use "Google Text-to-Speech" in several places where it
means to say "Speech-to-Text".

~~~
timbunce
I could only see one. I’ve fixed it. Thanks.

~~~
gok
Think it’s still wrong in the table?

~~~
timbunce
Ah, yes. I’m away and just using my phone at the moment making it tricky to
read. Fixed now. Thanks again!

~~~
gok
Thanks for fixing! Any chance you could post your test set, BTW?

------
stephensonsco
You're doing awesome (arduous) work. The text normalization especially is a
total bear. I feel your pain. Limiting your text to one file is good in many
ways because it allows you to scope down the amount of work needed to do a
comparison (though it's a big systematic risk; but hey, there are only so
many hours in the day).
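
To give a flavor of why the normalization is such a bear: even a naive
cleanup pass like the toy sketch below still leaves numbers, currency,
contractions, etc. mismatched between reference and hypothesis.

    # Toy normalization before scoring: lowercase, strip punctuation,
    # collapse whitespace. Real pipelines also need number/date expansion,
    # contraction handling, etc., which is where the pain lives.
    import re

    def normalize(text: str) -> str:
        text = text.lower()
        text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation, keep apostrophes
        return " ".join(text.split())

    # normalize("Dr. Smith spent $2,000.") -> "dr smith spent 2 000"
    # ...which still won't match "doctor smith spent two thousand dollars".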

Your previous blog post helps in understanding how much work needs to go into
comparing speech services. It's super common to undervalue just how much
processing a human is doing innately while listening to audio: hearing words,
feeling out ideas, resolving ambiguities, etc. So it's awesome to see deep
work on it (besides the speech teams working on these problems, like at
Google, Baidu, Microsoft, Deepgram [btw, I'm a founder of Deepgram]).

I wouldn't be so quick to attribute the differences in WER to how 'modern'
the system is. It's more about the areas they play in: what audio type they
care about, what training datasets they use, what post-processing they do,
and what language models they choose to apply. (Speed/turnaround time gives
you a much better indication of how modern a system is.)

Many speech transcription systems focus on specific types of audio as their
target market. There are ~4 main types: phone (customer support/sales),
broadcast (news/podcasts/videos), command and control (Siri/Google
Assistant), and ambient (meetings/lectures/security).

Google's video model is perfect for what you are doing (broadcast/podcast, 2
dudes talking into probably pretty good mics).

In other instances the results will be very different (if you compared phone
calls, for example). It won't be different just in accuracy, but also speed
(throughput _and_ latency), price, and reliability.

It's awesome to see an in depth comparison being discussed broadly. Speech
interfacing and understanding is _just_ getting started. We're still at the
tip of the Intelligence Revolution and there's still a long way to go. The
scale of compute and data is huge, even to bring just one language up to
snuff.

Aside: It's a dirty little secret that there actually aren't 20 different
speech recognition companies in the world using 20 different systems. There
are only a handful (many use Google and tweak the outputs). They are mostly
doing one of five things: using old and aged tech; using old but well-oiled
tech (like Google; this takes a ton of manpower and no other company spends
the money to do it); using an open source spinoff (like Kaldi or Mozilla);
building their own from scratch (like Deepgram); or reselling someone else's.

If you care about current times, this is a reasonably good finger in the wind
in Sept. 2018:

Use Google if you are doing command and control or broadcast audio, do not use
Google if you are doing meetings or phone calls or you need a reliable system
(it's unreliable at scale). DO use Google in all cases (even phone/meeting)
for audio that is in a language other than English (no other company is even
close).

Use Google to prototype systems and teach yourself about how to use a speech
recognition API and what results to expect as a baseline.

Do not use Google if you need scale and speed and reliability and
affordability.

Do not use Google if you need to use your own vocabulary, or if your audio
has repetitive things said in it with accents or jargon (like call centers).
In that case, use a company that can do a true custom acoustic model and
vocabulary for that (like Deepgram). There are only a few companies that will
consider doing this (and Google is not one of them).

Expect that many more things are going to be addressed.

Think of it like: what can a human do?

A human can jump into a conversation and quickly tell you: there are 3
people; they're speaking about rebuilding a feature in the main code; two are
male, one is female; male1 and female1 do most of the talking in the
beginning, then it's the two dudes at the end; it sounds like the recording
is of a meeting they're having; they never came to a definitive conclusion or
next steps; they spent 80 minutes in the meeting. All of that (and I'm sure
more) will be done by machine in the future.

~~~
taf2
I disagree that Google is unreliable at scale, and I've also found that with
the enhanced phone model it's much better for phone calls than anyone else...

------
sam1r
Is your code for your testing hosted anywhere? Would love to contribute

~~~
timbunce
Thanks for the offer. For the evaluation I’ve just written a few tiny Perl
scripts. I’ve used the services manually, e.g. via curl or the website. For
transcripts in JSON I ran jq to extract the text. I can put the scripts in a
repo but there’s not much to it.
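
For Google's JSON, for example, that's essentially pulling out
`.results[].alternatives[0].transcript`; a minimal Python equivalent
(assuming the Cloud Speech v1 "recognize" response shape; the file name is a
placeholder):

    # Minimal sketch: extract the transcript text from a Google Cloud
    # Speech v1 "recognize" JSON response.
    import json

    def extract_transcript(path: str) -> str:
        with open(path) as f:
            response = json.load(f)
        # Each result carries ranked alternatives; take the top one.
        return " ".join(
            result["alternatives"][0]["transcript"]
            for result in response.get("results", [])
        )

    print(extract_transcript("google-response.json"))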

------
singularity2001
So many unknown new players, almost all better than Nuance (and thus Siri).
Hard to believe, though, that those new companies are approaching Google's
accuracy. As much as Google is worth shunning for eroding privacy, their
speech recognition always feels miles ahead of anything else.

------
xerosanyam
So happy to see Spext at #3 :)

------
rorrr
What ever happened to the seemingly amazing MS Research system that was demoed
in 2012?

[https://www.youtube.com/watch?v=Nu-nlQqFCKg](https://www.youtube.com/watch?v=Nu-nlQqFCKg)

