Updates to Cloud Speech-to-Text and general availability of Cloud Text-to-Speech (cloud.google.com)
76 points by rayshan on Aug 31, 2018 | 38 comments


Google's speech-to-text is powerful, but I'd be pretty skeptical about tying a project to it given how services like Maps have been handled recently. There are companies like Mozilla trying to build more open solutions, but to the best of my knowledge (please correct me if I'm wrong) any pre-trained services Mozilla offers will also still involve you connecting to their servers.

Maybe I'm paranoid, but I just can't imagine using a speech-to-text system for anything serious that I can't self-host. It feels like we've seen example after example of why this is a bad idea -- to the point that when I hear a company like Google talk about a locked-down cloud platform as "making AI accessible to everyone", it feels almost dishonest.

Especially once we start talking about text-to-speech. We can already do a lot of that locally - we should be pretty hesitant about coupling new text-to-speech techniques to strategies that require us to move logic away from local devices onto the cloud.


Mozilla’s DeepSpeech is available as an offline pre-trained system for English.

The quality is far below Google's speech API: the model is somewhat out of date and, more importantly, the training data set is much smaller and less general.

The best pretrained speech to text model I’ve seen is from Baidu’s DeepSpeech 2 repository. They provide pretrained models for English and Chinese based on their internal data. The quality is astonishingly good!

Edit: both of these models can be comfortably run in real-time on a desktop. At Arm I recently worked on a project to run <5% word error rate models in real-time on a mobile phone.
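
For reference, here's roughly what running Mozilla's pretrained model looks like with the 0.1.x-era Python bindings (a minimal sketch, not the official client; constructor arguments and file names changed between releases, and 'sample.wav' stands in for any 16 kHz mono 16-bit recording):

    import time
    import scipy.io.wavfile as wav
    from deepspeech.model import Model

    # Hyperparameters matching the released 0.1.x model.
    N_FEATURES = 26   # MFCC features per frame
    N_CONTEXT = 9     # context frames on each side
    BEAM_WIDTH = 500

    ds = Model('output_graph.pb', N_FEATURES, N_CONTEXT,
               'alphabet.txt', BEAM_WIDTH)

    fs, audio = wav.read('sample.wav')
    clip_seconds = len(audio) / float(fs)

    start = time.time()
    print(ds.stt(audio, fs))
    elapsed = time.time() - start

    # A real-time factor below 1.0 means faster than real time.
    print('real-time factor: %.2f' % (elapsed / clip_seconds))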


Any links to Baidu's DeepSpeech 2 pretrained model? I'm probably looking right past it. I've spent the past few days playing around with Mozilla DeepSpeech and their pretrained model, building a simple little API and webpage to feed it files: https://github.com/AccelerateNetworks/DeepSpeech_Frontend

I really want to put DeepSpeech 2 with Baidu's model up against Mozilla's model and see which is better; seems like it could be quite interesting!
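
For comparison, the core of a frontend like that can be tiny. A minimal sketch of the idea (hypothetical file names; the real repo linked above handles format conversion and error handling properly):

    import scipy.io.wavfile as wav
    from flask import Flask, request, jsonify
    from deepspeech.model import Model

    app = Flask(__name__)
    # Load the pretrained model once at startup (0.1.x-era arguments).
    ds = Model('output_graph.pb', 26, 9, 'alphabet.txt', 500)

    @app.route('/transcribe', methods=['POST'])
    def transcribe():
        # Expects a 16 kHz mono 16-bit WAV in the 'audio' form field.
        fs, audio = wav.read(request.files['audio'])
        return jsonify({'transcript': ds.stt(audio, fs)})

    if __name__ == '__main__':
        app.run()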



Any chance there is a mirror of the BaiduEN8k Model that isn't in China? I'm getting about 20KB/s when trying to download it, and using a DNS override to 180.76.189.142 for cloud.dlnel.org gets me what looks to be a partial file at slightly less than 200MB in size.


I'm happy to be wrong about that; I was under the impression that DeepSpeech was available as a data set for training, but not as a trained system in and of itself.

My impression is that a lot of the hardware cost of modern AI comes from training and updating the model.

So (speaking as a non-expert) it doesn't seem like there's any technological problem with running a model locally on a private network -- you just need access to the model and you need someone else to generate it. And that's exactly what Google isn't providing. It's like they found a way to turn modern AI into even more of a black box.

At the point where I'm OK consuming someone else's model and not being able to control how it gets updated, then I'm probably also OK with using a compressed, static model and needing to occasionally download and deploy new versions.


DeepSpeech is not a voice corpus; that would be Mozilla's Common Voice project, which Mycroft users are also contributing to.

DeepSpeech offers trained models that are about 70% accurate, but none of them use the Common Voice corpus yet. I think that, plus recent changes to the codebase, should produce much better transcriptions, but sadly I don't have the GPU resources to train a model myself. Hopefully Mozilla will release another trained model soon!

I am working on a basic web frontend and API for DeepSpeech: https://github.com/AccelerateNetworks/DeepSpeech_Frontend

Also, here is Common Voice: https://voice.mozilla.org/en


Keep up the good work ++


> can be comfortably run in real-time on a desktop

That's progress. How long did it take you to get these things running (including downloads and dependency installation)?


The downloads total a bit above 2GB for Mozilla DeepSpeech 0.1.1, but besides that DeepSpeech is quick to set up and pretty performant on my i5-4200U (about half real-time transcription), and it's even better on my Ryzen box. Going to do some GPU testing tomorrow; I also need to clean up the docs on the little web frontend/API I've been trying to get ready for production use: https://github.com/AccelerateNetworks/DeepSpeech_Frontend


I was more interested in the 'astonishingly good' Baidu solution. Will that install Chinese spyware though? Sorry, I'm very cautious about Baidu et al.


Not sure, but the results and info OP linked to are quite interesting: https://github.com/PaddlePaddle/DeepSpeech#released-models


Ah, I just asked a similar question of the person you initially replied to. The Baidu DeepSpeech 2 model seems appealing, but from my brief look on GitHub I haven't found the model files...


Not sure why it would be a worthwhile endeavor to roll your own text-to-speech system when basically all the major cloud providers offer it as a service, unless the cost makes up a significant fraction of your total costs. If it doesn't, why not use the service until the day Google raises prices, and decide what to do about it then? It's still just an API call, after all.

While I agree that the Google Maps price increase was a bit much, I'm sure Google realizes they can't offer services to businesses and viably keep doing this kind of 180 too often, especially for something that's part of Google Cloud. So it's not very likely they'll pull the rug out from under this service again.
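
For what it's worth, the whole integration really is only a few lines with the Python client library (a sketch against the client as of this release; check the docs for current signatures):

    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    synthesis_input = texttospeech.types.SynthesisInput(text='Hello, world!')
    voice = texttospeech.types.VoiceSelectionParams(
        language_code='en-US',
        ssml_gender=texttospeech.enums.SsmlVoiceGender.NEUTRAL)
    audio_config = texttospeech.types.AudioConfig(
        audio_encoding=texttospeech.enums.AudioEncoding.MP3)

    response = client.synthesize_speech(synthesis_input, voice, audio_config)

    with open('output.mp3', 'wb') as out:
        out.write(response.audio_content)

If prices ever did jump, swapping those few lines for another provider's SDK is a much smaller job than replacing a self-hosted pipeline.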


It is also extremely expensive ($1.44 per hour of audio), which really limits its potential. I looked at it for a project but it just didn't make sense.


I can't imagine many use cases where $1.44/hr is a significant cost.

If you sell audiobooks, then a one-off cost of a few dollars to convert the book to audio form is tiny compared to the cost of authoring the text.

If you are doing something like turn-by-turn navigation, clips are typically only a few seconds long, so very cheap; and again, many of your users will need the same clip, so there's no need to pay for it twice.
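
Back-of-the-envelope, using the $1.44/hr figure quoted above (a toy calculation, not official pricing math):

    # Rough cost estimates at $1.44 per hour of synthesized audio.
    RATE_PER_HOUR = 1.44

    audiobook_hours = 10   # assume a typical full-length audiobook
    nav_clip_seconds = 5   # one turn-by-turn prompt

    print('audiobook: $%.2f' % (audiobook_hours * RATE_PER_HOUR))
    # audiobook: $14.40
    print('nav clip:  $%.4f' % (nav_clip_seconds / 3600.0 * RATE_PER_HOUR))
    # nav clip:  $0.0020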


I've been using it for one of my projects, but I only need each piece of audio generated once: I use the text-to-speech to create audio files, which I then recycle. Overall, it's more cost-effective than recording my own voice or hiring a voice-over artist.


If you want to build open-source, 100% on-device, private-by-design voice assistants that can run on a Raspberry Pi, take a look at what we are building at https://snips.ai (disclaimer: I'm a co-founder).

We want to make it possible to have embedded assistants in all your devices that preserve people's privacy, and to do it with open source: https://medium.com/snips-ai/an-introduction-to-snips-nlu-the...

Take a look at our blog to get started in about an hour: https://medium.com/snips-ai/voice-controlled-lights-with-a-r...

It also integrates with popular home automation platforms like Home Assistant and Jeedom.
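
Once a dataset is defined, the open-sourced NLU part looks roughly like this (a sketch using the snips-nlu Python library; 'dataset.json' is a placeholder for a dataset in the Snips JSON format):

    import io
    import json
    from snips_nlu import SnipsNLUEngine

    # Train the intent parser on a dataset in the Snips JSON format.
    with io.open('dataset.json') as f:
        dataset = json.load(f)

    engine = SnipsNLUEngine()
    engine.fit(dataset)

    # Parsing runs entirely on-device: no network call is made here.
    parsing = engine.parse('Turn on the lights in the kitchen')
    print(json.dumps(parsing, indent=2))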


Hey, I've been working on a chatbot project lately and came across semantic parsing (http://nbviewer.jupyter.org/github/wcmac/sippycup/blob/maste...). Even though we didn't use it, I found it to be more robust and capable of handling very complicated utterances. The main disadvantage of a grammar-based approach is that it's hard to extend the grammar to incorporate new intents, and creating a grammar is time-consuming.

But the approach used in various NLU services such as Snips and Rasa is much simpler. It can work fine for easy queries, but once we start asking complicated questions using conjunctions and disjunctions, these systems become brittle. If they tried to capture all the possible logical forms through intents, they'd need an exponential number of intents and a huge dataset to cover them all.

I would like to know your take on grammar-based semantic parsing.
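
To make the contrast concrete, here's a toy illustration (nothing to do with SippyCup's actual API) of why a grammar composes where flat intents multiply: one recursive 'and' rule covers every conjunction of commands, whereas an intent classifier would need a separate intent, with its own training data, for each combination:

    import re

    def parse(utterance):
        # One grammar rule handles arbitrarily many conjoined commands.
        parts = re.split(r'\band\b', utterance)
        return [parse_simple(p.strip()) for p in parts]

    def parse_simple(utterance):
        # A single hand-written rule mapping a command to a logical form.
        m = re.match(r'turn (on|off) the (\w+)', utterance)
        if m:
            return ('set_power', m.group(2), m.group(1))
        return ('unknown', utterance)

    print(parse('turn on the lights and turn off the heater'))
    # [('set_power', 'lights', 'on'), ('set_power', 'heater', 'off')]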


Does Snips support using an existing MQTT broker yet? Haven't checked up on development since late 2017.


Anyone know how this relates to the Web Speech API[1]?

Will they ship it with Chrome to replace the existing speech synthesis API[2]? (I believe right now it just uses whatever voices are available on the device or OS, but Chrome can fall back to a server-side voice.)

[1] https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...

[2] https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynth...


Pure guess: Mozilla will use their DeepSpeech model for that soon.


On my machine, the demo page at https://cloud.google.com/text-to-speech/ doesn't work.

I tried to get Google to fix this a long time ago; it seemed to work for a while after having been broken for weeks.


It works very intermittently for me (Chrome on Windows, with plenty of RAM and CPU).

When I click the "speak" button, it makes a network request (as shown in the Chrome devtools) and immediately gets back a response that looks big enough to be an audio blob.

But most of the time, instead of actually playing the audio, it just sits there with a spinner for anywhere between 30 seconds and 5 minutes before giving up, with no errors in the console.


Works for me. What's your machine like?


Sadly it is like this:

OSX 10.13

Model Name: Mac mini
Model Identifier: Macmini7,1
Processor Name: Intel Core i5
Processor Speed: 2.6 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 3 MB
Memory: 16 GB


Anyone know how this compares to Dragon NaturallySpeaking?


Google is orders of magnitude better on trendy words, phrases, etc. Just try Siri vs Google.


Is Siri done by Nuance?


Yes, much to the pain of Apple


You have any links to where that has caused issues? I've heard more people personally like Siri than Google Assistant. Of course, I have my own love/hate with Alexa. Haven't used the others.


A friend of mine works for a newspaper. He records interviews.

We tried all the software we could find to turn the recordings (Dutch) into text, but nothing gives a helpful result.

I know that transcribing a recording is different from live speech-to-text, but even when I use OK Google, the results are horrible most of the time.

So after all those years I am still a little skeptical.


That's why all the hype about DL is so irritating. Yes, it allowed us to make enormous progress in certain fields. No, it's nowhere near usable in others. So why pretend it is?

Just one example from the preface to Chollet's "Deep Learning with Python":

> If you’ve picked up this book, you’re probably aware of the extraordinary progress that deep learning has represented for the field of artificial intelligence in the recent past. In a mere five years, we’ve gone from near-unusable image recognition and speech transcription, to superhuman performance on these tasks.

Come on, speech-to-text is still far from usable except in very limited scenarios. Why pretend it's different?


In most languages, Google ASR works shockingly well.


Dutch isn't yet a language the new machine learning models are used for: there isn't enough Dutch training data, nor enough Dutch speakers, to make the engineering and compute investment worthwhile.

Just wait - it'll come in a few years though.


I really want a free-software implementation of text-to-speech and speech-to-text that runs on a local computer without a network connection.

I don't trust those cloud-based solutions.
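
On the text-to-speech side, that more or less exists already. For example, the pyttsx3 library drives whichever engine the OS ships (eSpeak on Linux, SAPI5 on Windows, NSSpeechSynthesizer on macOS), so nothing leaves the machine. A minimal sketch:

    import pyttsx3

    # Initializes the platform's local TTS engine; no network access.
    engine = pyttsx3.init()
    engine.setProperty('rate', 160)  # speaking rate in words per minute
    engine.say('All of this synthesis happens on the local machine.')
    engine.runAndWait()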


Speech-to-text is great. I'm using it to transcribe voicemails in a product I'm building.
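
For anyone curious, the enhanced phone-call model announced here is exposed through the v1p1beta1 Python client. A sketch (the bucket URI is a placeholder, and field names are per the docs at the time of this release):

    from google.cloud import speech_v1p1beta1 as speech

    client = speech.SpeechClient()

    # Voicemail audio previously uploaded to Cloud Storage.
    audio = speech.types.RecognitionAudio(uri='gs://my-bucket/voicemail.wav')
    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,   # typical telephony sample rate
        language_code='en-US',
        use_enhanced=True,        # opt in to the enhanced model
        model='phone_call')

    response = client.recognize(config, audio)
    for result in response.results:
        print(result.alternatives[0].transcript)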


I'd love to read what this page has to say, but somehow it managed to load with some click-grabbing Gawker-type theme? Half expecting a "100 Surprising Cloud Facts, And Number 12 Will Shock You" link to appear in that inexcusable waste of space along the bottom. https://i.imgur.com/Uk1udNo.jpg



