
Toward better phone call and video transcription with new Cloud Speech-to-Text - stanzheng
https://cloudplatform.googleblog.com/2018/04/toward-better-phone-call-and-video-transcription-with-new-Cloud-Speech-to-Text.html
======
andrewstuart
Google can't even get its demo text to speech to work - it's been offline for
weeks: [https://cloud.google.com/text-to-speech/](https://cloud.google.com/text-to-speech/)

~~~
wpietri
Their user focus is shockingly poor.

I followed the link from the blog post that said "check out the demo on our
product website". Then there's a big button that says "TRY IT FREE". Good, I
say. That leads me through a signup process that involves credit cards and
whatnot, and then dumps me out on what I guess is the equivalent of the AWS
console, not some nice audio test page.

So then I root around in the console, finally find the text to speech stuff,
and screw around with various interfaces. None of them seems to be the right
thing. Eventually I decide I must have missed something, go back to the
product website, and scroll down further to find the "convert your speech to
text right now". Great, say I.

The blog post explicitly talks about video. I want to see if it can transcribe
a talk I did, so I tried uploading a file; nothing appears to happen on
Firefox. I try a couple more times. I sigh heavily and switch to Chrome.

It does appear to work on Chrome, but it's entirely infuriating. I tried
uploading a video file, which was over 50MB, so it refused. I then figured out
how to extract the audio alone and uploaded that, at which point it complained
it was over a minute. Then I find another incantation to chop my audio to a
minute (which they just should have done for me, and which anyway should be
explained in the interface).

Finally, I upload 60 seconds of audio. And nothing fucking happens. After all
that, the thing just doesn't work. No error messages, no anything.

This is my first impression of the Google Cloud Platform, and all I hear is
the squeaking of clown shoes. I'm sure the rest of it can't be this bad, but
if they can't make a simple demo work, I'm unlikely to find out.

~~~
wpietri
Update: I decided to try through the Google console, and also try Amazon's
speech recognition through the AWS console.

AWS just let me transcribe my MP3 in a pretty straightforward way once I'd
uploaded it to an S3 bucket. The transcript is done in 2-3x real time, and the
quality seems decent. It comes as a complex JSON file with confidence numbers
and timestamps for every word, with alternate words when it knows it isn't
sure. It's pretty neat.

Google made me use a sort of query builder interface to construct an API
request. The query builder did not actually match the features announced in
the blog post, so I just tried going with what was there. When I eventually
got a valid-looking request, it blew up because it turns out it can't parse
MP3s. So then I reencoded to FLAC and uploaded that. I tried a variety of
queries, but none of them worked. The one that got closest complained about a
bad value for a field the query builder apparently would not let me add.

I gave up. Squeak, squeak, squeak!

And I should add that the people I know at Google are all perfectly smart, so
I don't want anybody to think I'm saying that the individual engineers who
made this are dumb or bad. This seems like a giant organizational failure,
where what gets built is deeply disconnected from user need and the lived user
experience.

Normally when I get insight on a place where this happens, the priority is not
actually delivering value, but making managers look good according to easily
measured but harmful metrics, like, "Are we at competitive parity at a feature
checklist level?" or "Did we launch by some made-up deadline so that a manager
could claim success?"

If anybody at Google wants to send me their horror stories, please do email or
DM me on Twitter. I'd love to know what the hell happened here, and I promise
to keep things as confidential as you like.

~~~
jgh
I keep harping on this about google, but this is so typical of google. It's
the same kinda crap with WebRTC, QUIC, VP9/WebM. The one complete WebRTC lib
is dug out of Chromium. QUIC, last I checked, is buried in Chromium. VP9/WebM
doesn't actually support transparency but Google went and added a custom
extension to support it (in Chromium) and so anybody that wants to support
alpha with VP9 needs to do it Google's nonstandard way (including adding
Google code to FFmpeg to do it).

They just throw stuff that would otherwise be useful to the world out there in
the least user-friendly way possible. And then they make a big PR push for a
while talking about how great the new thing is and then they forget about it
and the project languishes.

------
xbmcuser
How long before we can get a Kodi plugin that transcribes the text and
translates it to the subtitle language you have chosen? I would really be
interested in this for Japanese, Korean and Chinese shows, where I sometimes
have to wait months or years before fansubs are available. Though because of
Netflix, English subs are becoming available a lot quicker than previously for
many of these shows.

~~~
make3
.. there's nothing stopping you from writing it, it's just a few calls to
their Google cloud api. it's not an afternoon's work in the scripting language
of your choice. the real issue is that I suspect translation might be a bit
off at times

------
tudorconstantin
I wonder how long it will take until countries require Telecom companies
to transcribe and store all the phone calls for a "limited time period" of,
let's say, 6 months, for "our security".

And then run algorithms on these texts to classify the conversations into
"potentially crime related discussions" classes.

~~~
aviv
Companies are already asking for this themselves. There's huge demand for
all-calls voice transcription, and companies are willing to pay for it.

~~~
gregsadetsky
Do you know of specific demand? Would you mind contacting me (gs @ my HN
username .com) to chat about this? Thanks!

------
diminish
I'm still waiting for 10x-100x drop in their prices - I believe that's going
to enable a lot of new startups.

------
UperSpaceGuru
Met Dan at an AI conference & having worked with the API, I think it's really
cool that your average dev has access to this level of transcription, which is
a non-trivial problem (been working on Speech Recognition since early '00s).

I agree with some of the comments regarding Google being a big co & having big
co issues. But at the core of it, the team, the offering & attention to what
matters is solid.

It's certainly going to open up a whole new realm of possibilities.

------
gok
> Cloud Speech-to-Text (formerly known as Cloud Speech API)

Interesting name change. It’s certainly more precise, but was “Speech API”
really confusing people?

~~~
rahimnathwani
Speech API sounds like it generates speech, i.e. TTS

------
monkeydust
Going through a Udemy course at the moment and thought it would be great if
the whole thing was transcribed for easy referencing later.

------
6841iam
the number of voice based startups that have built business logic on top of
this fundamental api is staggering. some names: voicera (automated meeting
minutes), voiceops (call center call analysis), chorus.ai (phone call
analytics)

the focus on improving call center performance is where the money is. plenty
more vendors will enter this market.

------
adorable
Does anybody know what the existing open-source / free alternatives are? (that
you could run on your own servers)?

~~~
woodson
The Kaldi toolkit is state of the art, but you have to know quite a bit about
speech and natural language processing to create a comparable service that
works well (or invest the time to learn it). Definitely not plug and play,
though.

Then there are implementations of Baidu’s DeepSpeech (PaddlePaddle:
[https://github.com/PaddlePaddle/DeepSpeech](https://github.com/PaddlePaddle/DeepSpeech),
or Mozilla’s version).

------
walterbell
For offline use, a (paid) alternative is Nuance’s Dragon,
[https://www.nuance.com/dragon.html](https://www.nuance.com/dragon.html)

~~~
flarg
I use this on a daily basis and it's pretty good but can't cope with fast
speech or poor sound quality. For the price I don't expect more but it's not
amazing.

------
infocollector
What is the best Speech-To-Text program currently that is free and does not
require an internet connection?

------
gaius
So what happens to these transcripts? Google keeps a copy, right? What do they
do with it after?

------
Animats
How long before Google Voice listens to your phone calls for ad tracking
purposes?

~~~
lozf
The tracking part will be one step away - i.e., it'll "read the transcript"
while missing all the subtle inflections and other subtleties: contextual
clues from pitch or tone, e.g. sarcasm. That will lead to some hilarious
mistargeting, until LEO (or other Authorities) use a similar system and
something dreadful happens.

