
Conversational Speech Recognition System [pdf] - dsr12
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/08/ms_swbd17-2.pdf
======
hiddencost
So, for people who aren't aware:

There are a couple of big shops that publish papers like this every year. They do it solely so they can tell their managers and potential professional-services customers that they have "the best ASR system". It's BS.

IBM and Microsoft are among the most guilty.

If you're seriously interested in state-of-the-art ASR performance, "systems"
papers like this are largely nonsense. The authors usually spend most of their
time tuning hyperparameters on small, publicly available datasets, to the
point that the model often won't generalize to real-world settings because of
how aggressively they've overfit the benchmark set.

~~~
yorwba
I do find it strange that their test WER is uniformly lower than their devset
WER. I would expect validating on the devset to lead to the opposite effect,
where the model overfits to the devset and performance degrades during test.

Or do you mean by "tuning hyperparameters to small publicly available
datasets" that the datasets do not have enough real-world variability and can
be fit too easily? Are there no large, realistic datasets or are those just
not available to the public?
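
(For reference: WER is just word-level edit distance, i.e. substitutions +
insertions + deletions, divided by the number of reference words. A minimal
sketch in Python, with made-up example strings:)

    # word error rate = word-level Levenshtein distance / reference length
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i  # i deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j  # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        return dp[-1][-1] / len(ref)

    print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words = 0.333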

~~~
hiddencost
Even though they'll use the dev set to tune hyperparameters, they'll end up
making an "is this a good model for the paper" decision based on the test set.
So, e.g., they'll try different architectures, tune hyperparameters on the dev
set, then decide whether to keep that architecture by looking at the test set.
So that's part of it. AKA "gradient descent via grad student" (toy sketch
below).

The other part is, as you say, relatively constrained data sets.
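
To make the first point concrete, here's a toy simulation (all numbers
hypothetical): even if every candidate architecture has the same true WER,
reporting the one that happens to score best on the test set biases the
headline number downward.

    import random

    random.seed(0)
    TRUE_WER = 0.060  # assume every architecture is equally good
    NOISE = 0.003     # eval noise from a finite test set

    def measured_test_wer():
        # each "architecture" gets one noisy test-set evaluation
        return TRUE_WER + random.gauss(0, NOISE)

    # keep whichever architecture happens to look best on the test set
    reported = min(measured_test_wer() for _ in range(10))
    print(f"true WER {TRUE_WER:.3f}, reported WER {reported:.3f}")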

------
melling
We’re so close to perfecting voice recognition, but we can’t seem to get over
the final hurdles for everyone to use it.

Is it simply a matter of collecting more data?

Google is collecting more voice data in Docs.

[http://www.pcworld.com/article/3038200/data-center-cloud/how-to-use-voice-dictation-in-google-docs.html](http://www.pcworld.com/article/3038200/data-center-cloud/how-to-use-voice-dictation-in-google-docs.html)

And Mozilla is doing Common Voice:

[https://voice.mozilla.org](https://voice.mozilla.org)

~~~
ximeng
I dislike the way Google collects AI training data through reCAPTCHA, to the
point that I often won't bother logging in to sites like Stack Overflow that
use it. Common Voice sounds like a much more open approach, and one I'd
actually be interested in helping with.

From this paper, Microsoft are hitting accuracy similar to professional
transcribers, so it may not be accuracy that's holding back adoption.

There also seems to be a trend away from voice interaction these days: I send
messages far more than I talk to people by phone, and I don't think I'm alone
in preferring it that way in many cases. Automatic transcription of voice
messages might be useful, but I suspect privacy issues would keep it from
becoming widespread without more trust in the companies providing these
services.

~~~
melling
I send more text messages these days too. However, I try to dictate them.
Basic editing by voice is definitely needed.

------
unlikelymordant
This is from Oct 2016. Previous discussion:
[https://news.ycombinator.com/item?id=12736409](https://news.ycombinator.com/item?id=12736409)

Are you sure you didn't mean to link to this one?
[https://www.microsoft.com/en-us/research/wp-content/uploads/2017/08/ms_swbd17-2.pdf](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/08/ms_swbd17-2.pdf)

It just came out and is an update of the original 2016 system.

~~~
dang
Ok, we'll change to that. Thanks!

From [https://arxiv.org/abs/1610.05256](https://arxiv.org/abs/1610.05256) via
[https://blogs.microsoft.com/ai/2016/10/18/historic-achievement-microsoft-researchers-reach-human-parity-conversational-speech-recognition/](https://blogs.microsoft.com/ai/2016/10/18/historic-achievement-microsoft-researchers-reach-human-parity-conversational-speech-recognition/).

