
Updates to Cloud Speech-to-Text and general availability of Cloud Text-to-Speech - rayshan
https://cloud.google.com/blog/products/ai-machine-learning/announcing-updates-to-cloud-speech-to-text-and-general-availability-of-cloud-text-to-speech
======
danShumway
Google's speech-to-text is powerful, but I'd be pretty skeptical about tying a
project to it given how services like Maps have been handled recently. There
are companies like Mozilla trying to build more open solutions, but to the
best of my knowledge (please correct me if I'm wrong) any pre-trained services
Mozilla offers will also still involve you connecting to their servers.

Maybe I'm just paranoid, but I just can't imagine using a speech-to-text
system for anything serious that I can't self-host. It feels like we've seen
example after example of why this is a bad idea -- to the
point that when I hear a company like Google talk about a locked-down cloud
platform as "making AI accessible to everyone" it feels almost dishonest.

Especially once we start talking about text-to-speech. We can already do a lot
of that locally - we should be pretty hesitant about coupling new text-to-
speech techniques to strategies that require us to move logic away from local
devices onto the cloud.

~~~
moconnor
Mozilla’s DeepSpeech is available as an offline pre-trained system for
English.

The quality is far below Google’s speech API, as the model is somewhat out of
date and, more importantly, the training data set is much smaller and less
general.

The best pretrained speech-to-text model I’ve seen is from Baidu’s DeepSpeech
2 repository. They provide pretrained models for English and Chinese based on
their internal data. The quality is astonishingly good!

Edit: both of these models can be comfortably run in real-time on a desktop.
At Arm I recently worked on a project to run <5% word error rate models in
real-time on a mobile phone.
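
For readers unfamiliar with the metric mentioned above: word error rate is
usually computed as the word-level edit distance (substitutions, deletions,
insertions) between a reference transcript and the recognizer's output,
divided by the number of reference words. A minimal sketch, assuming plain
whitespace tokenization and no text normalization; the function name is
illustrative and not taken from any of the projects discussed here:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Real evaluations typically lowercase and strip punctuation before scoring,
so numbers from this sketch are not directly comparable to published ones.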

~~~
dan0-
Any links to Baidu's DeepSpeech 2 pretrained model? I'm likely looking right
past it. I spent the past few days playing around with Mozilla DeepSpeech and
their pretrained model, building a simple little API and webpage to feed it
files:
[https://github.com/AccelerateNetworks/DeepSpeech_Frontend](https://github.com/AccelerateNetworks/DeepSpeech_Frontend)

I really want to put DeepSpeech 2 with Baidu's model up against Mozilla's
model and see which is better. Seems like it could be quite interesting!

~~~
aseipp
[https://github.com/PaddlePaddle/DeepSpeech#released-models](https://github.com/PaddlePaddle/DeepSpeech#released-models)

~~~
dan0-
Any chance there is a mirror of the BaiduEN8k Model that isn't in China? I'm
getting about 20KB/s when trying to download it, and using a DNS override to
180.76.189.142 for cloud.dlnel.org gets me what looks to be a partial file at
slightly less than 200MB in size.

------
oulipo
If you want to build open-source, 100% on-device and private-by-design voice
assistants which can run on a Raspberry Pi, you can take a look at what we are
building at [https://snips.ai](https://snips.ai) (disclaimer: I'm a
co-founder)

We want to make it possible to have embedded assistants in all your objects
which preserve people's privacy, and do this with open source:
[https://medium.com/snips-ai/an-introduction-to-snips-nlu-the...](https://medium.com/snips-ai/an-introduction-to-snips-nlu-the-open-source-library-behind-snips-embedded-voice-platform-b12b1a60a41a)

Take a look at our blog to get started in 1h:
[https://medium.com/snips-ai/voice-controlled-lights-with-a-r...](https://medium.com/snips-ai/voice-controlled-lights-with-a-raspberry-pi-and-snips-822e53d7ede6)

It also integrates with popular home automation platforms like Home Assistant
and Jeedom.

~~~
bhishmaraj
Hey, I’ve been working on a chatbot project lately and I came across semantic
parsing
([http://nbviewer.jupyter.org/github/wcmac/sippycup/blob/maste...](http://nbviewer.jupyter.org/github/wcmac/sippycup/blob/master/sippycup-unit-0.ipynb)).
Even though we didn’t use it, I found it to be more robust and capable of
handling very complicated utterances. The main disadvantage of a grammar-based
approach is that it’s hard to extend the grammar to incorporate new intents,
and creating a grammar is also time-consuming.

But the approach used in various NLU services such as Snips and RASA is much
simpler. This can work fine for easy queries, but once we start asking
complicated questions using conjunctions and disjunctions these systems start
becoming brittle. If they try to capture all the possible logical forms
through intents, they’ll need an exponential number of intents and also a huge
dataset to cover all of them.
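
To make the growth concrete, here is a rough sketch of the counting argument
above. It assumes a deliberately simplified model in which every non-empty
combination of atomic constraints, joined by a single connective, needs its
own flat intent. The constraint names and the function are hypothetical, and
this is not how Snips or RASA represent intents internally:

```python
from itertools import combinations


def enumerate_flat_intents(constraints, connectives=("and", "or")):
    """List one flat intent name per combination of constraints.

    Single constraints yield one intent each; every larger combination
    yields one intent per connective (e.g. "a_and_b" and "a_or_b").
    """
    intents = []
    for size in range(1, len(constraints) + 1):
        for combo in combinations(constraints, size):
            if size == 1:
                intents.append(combo[0])
            else:
                for connective in connectives:
                    intents.append(f"_{connective}_".join(combo))
    return intents


if __name__ == "__main__":
    # Hypothetical constraints for a restaurant-search bot.
    small = ["cheap", "nearby", "vegetarian"]
    print(len(enumerate_flat_intents(small)))  # 3 singles + 4 combos * 2

    # The count grows as n + 2 * (2**n - 1 - n) for n constraints.
    for n in (5, 10, 15):
        names = [f"c{i}" for i in range(n)]
        print(n, len(enumerate_flat_intents(names)))
```

Even this understates the problem, since it ignores nesting and mixed
connectives, which a compositional grammar handles with a fixed set of rules.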

I would like to know your take on using grammar-based semantic parsing.

------
zawerf
Anyone know how this relates to the Web Speech API[1]?

Will they ship it with Chrome to replace the existing speech synthesis API[2]?
(I believe right now it just uses whatever voices are available to the device
or OS, but Chrome can fall back to a server-side voice.)

[1] [https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...](https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API)

[2] [https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynth...](https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesis)

~~~
singularity2001
pure guess: Mozilla will use their DeepSpeech model for that soon

------
andrewstuart
On my machine the demo page doesn't work at
[https://cloud.google.com/text-to-speech/](https://cloud.google.com/text-to-speech/)

I tried to get Google to fix this a long time ago and it seemed to work for a
while after being offline for weeks.

~~~
mkoryak
Works for me. What's your machine like?

~~~
andrewstuart
Sadly it is like this:

OSX 10.13

Model Name: Mac mini
Model Identifier: Macmini7,1
Processor Name: Intel Core i5
Processor Speed: 2.6 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 3 MB
Memory: 16 GB

------
TheChaplain
Anyone know how this compares to Dragon NaturallySpeaking?

~~~
singularity2001
Google is orders of magnitude better on trendy words and phrases etc. Just
try Siri vs Google.

~~~
taeric
Is Siri done by Nuance?

~~~
singularity2001
Yes, much to the pain of Apple

~~~
taeric
You have any links to where that has caused issues? I've heard more people
personally like Siri than Google Assistant. Of course, I have my own love/hate
with Alexa. Haven't used the others.

------
pasta
A friend is working for a newspaper. He records interviews.

We tried all the software we could find to turn the recordings (in Dutch) into
text, but nothing gives a helpful result.

I know that recording-to-text is different from speech-to-text, but even when
I use OK Google the results are horrible most of the time.

So after all those years I am still a little skeptical.

~~~
dvfjsdhgfv
That's why all the hype about DL is so irritating. Yes, it allowed us to make
enormous progress in certain fields. No, it's nowhere near usable in others.
So why pretend it is?

Just one example from the preface to Chollet's "Deep Learning with Python":

> If you’ve picked up this book, you’re probably aware of the extraordinary
> progress that deep learning has represented for the field of artificial
> intelligence in the recent past. In a mere five years, we’ve gone from near-
> unusable image recognition and speech transcription, to superhuman
> performance on these tasks.

Come on, speech to text is still far from usable except in very limited
scenarios. Why pretend it's different?

~~~
singularity2001
In most languages Google ASR works shockingly well.

------
ezoe
I really want a free software implementation of Text-To-Speech and
Speech-To-Text that runs on a local computer without a network connection.

I don't trust those cloud-based solutions.

------
joshmn
Speech-to-text is great. I'm using it to transcribe voicemails in a product
I'm building.

------
_wmd
I'd love to read what this page has to say, but somehow it managed to load
with some click-grabbing Gawker-type theme? Half expecting a "100 Surprising
Cloud Facts, And Number 12 Will Shock You" link to appear in that inexcusable
waste of space along the bottom.
[https://i.imgur.com/Uk1udNo.jpg](https://i.imgur.com/Uk1udNo.jpg)

