
Amazon Transcribe Streaming Now Supports WebSockets - WalterSobchak
https://aws.amazon.com/blogs/aws/amazon-transcribe-streaming-now-supports-websockets/
======
iandanforth
Anyone have experience with the accuracy of Amazon Transcribe vs the Google
offerings? (Google Live Transcribe for Android currently tops my list of
impressive transcription offerings.)

~~~
rococode
I built a project where we transcribed speech into chat bubbles for an AR app
(Magic Leap). We started with Google Speech-to-Text and swapped to Amazon
Transcribe.

Performance-wise they seemed mostly similar - both can pick up text fairly
accurately in somewhat noisy environments. We did notice that Google's
offering seemed to be slightly more accurate. But the difference was marginal
and only noticeable when we compared side-by-side on intentionally poorly
pronounced words. There were two big differences that made it impossible for
us to use Google's product:

1\. Google has a 1 minute limit on streaming speech-to-text. It just closes
the connection at 1 minute. It doesn't even send you a "final result", so if
speech was being recorded at the time the connection dropped, that
transcription is lost. Speaking of which...

2\. Google doesn't provide incremental updates. So if someone speaks for a
while, you only get an update at the end of it.

Note that this is their API - my impression is the product they use in their
apps is superior in functionality to the product available on Google Cloud.

Amazon Transcribe, on the other hand, has a 4 hour limit and sends incremental
updates, so in a longer sentence, like this one I'm currently writing, you
would get a message every couple words, which is essential when the goal is to
show a live transcription.

~~~
deadmutex
>2\. Google doesn't provide incremental updates.

Can you explain this comment? From my recollection, this statement is not
true. Also, their site says:

"Returns text transcription in real time for short-form or long-form audio.
Cloud Speech-to-Text can stream text results, immediately returning text as
it’s recognized from streaming audio or as the user is speaking.
Alternatively, Cloud Speech-to-Text can return recognized text from audio
stored in a file. It’s capable of analyzing short-form and long-form audio."

-[https://cloud.google.com/speech-to-text/](https://cloud.google.com/speech-to-text/)

~~~
rococode
Perhaps something has changed since I used it (about 2 months ago), or we
were using it incorrectly, but in our experience it was "real-time"
transcription, but not truly incremental. (Since you and another person have
mentioned it's possible, I'm leaning towards us not having read the
documentation carefully enough. We didn't spend much time trying to figure
it out because the time limit was the bigger dealbreaker.)

Probably best explained as an example. Consider someone saying this:

"How are you doing? I'm doing well. Do you plan to go to the park with Alice
and Bob tomorrow to see the fireworks show?"

When we were using it, Google would give:

1\. "How are you doing?"

2\. "I'm doing well."

3\. "Do you plan to go to the park with Alice and Bob tomorrow to see the
fireworks show?"

as three separate messages.

So it's certainly not waiting until the end of the entire streaming to give
you a result - it sends you those three messages as you speak. In that sense
it is "returning text as it's recognized", because once it recognizes you've
finished a sentence it computes the words and gives them back to you. But the
issue for us was that we could only get results after full sentences or long
pauses.

Amazon, on the other hand, would give for the last sentence something like:

1\. "Do you plan"

2\. "Do you plan to go"

3\. "Do you plan to go to the park"

So for our purposes (showing a chat bubble above a person's head), the latter
version was much more useful. The reason is that in spoken language sentences
often ramble, so we wanted to be able to show some incremental updates to the
user as a long sentence was spoken so they wouldn't have to read 30 words at
once.
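
A minimal sketch of how the incremental messages above could drive a chat
bubble. The `TranscriptEvent` type, its `is_partial` flag, and the
`render_bubbles` helper are hypothetical names made up for illustration; they
are not either vendor's actual SDK types. The idea is only that each partial
message replaces the previous partial text, and a final message gets locked
in.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TranscriptEvent:
    """Hypothetical event shape: the text so far for the current sentence."""
    text: str
    is_partial: bool  # True while the service may still revise this text


def render_bubbles(events: Iterable[TranscriptEvent]) -> List[str]:
    """Return the bubble text that would be shown after each event."""
    finished: List[str] = []  # sentences the service has marked as final
    frames: List[str] = []    # what the bubble displays after each event
    for event in events:
        # Show the finished sentences plus the latest text for the current one.
        frames.append(" ".join(finished + [event.text]))
        if not event.is_partial:
            # A final result will not change again, so keep it permanently.
            finished.append(event.text)
    return frames


if __name__ == "__main__":
    stream = [
        TranscriptEvent("Do you plan", True),
        TranscriptEvent("Do you plan to go", True),
        TranscriptEvent("Do you plan to go to the park", False),
    ]
    for frame in render_bubbles(stream):
        print(frame)
```

With sentence-at-a-time results, the same loop would only ever see final
events, so the bubble would jump from empty to the full sentence in one step.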

~~~
deadmutex
From my memory (from over a year ago), Google's service would give real-time
results, likely meaning incremental word-by-word results. Did you use the
gRPC interface or JSON? I think the streaming service is only available via
gRPC.

Disclaimer: My comment reflects my own views, and not those of my employer,
etc.

------
dankohn1
I help organize KubeCon + CloudNativeCon, and we are planning to add live
transcription (also known as open captioning) to future events. We also want
to offer simultaneous translation to and from Chinese for our Shanghai event.
I'd love to see a comparison of the major offerings if anyone has done it. If
not, I guess we'll need to.

~~~
zachruss92
Disclaimer: I am an organizer of a GDG, but these opinions are my own.

Google has a Cloud Speech-to-Text API that supports streaming audio. Google
blows Amazon out of the water here with accuracy, speed, and features at the
same price or cheaper. They also have translation APIs.

I'm more than happy to help out if needed!

[https://cloud.google.com/speech-to-text/](https://cloud.google.com/speech-to-text/)

------
dmix
This sounds great. I hope to see this being adopted by blogs and news sites!

The browser highlight->speech plugins have always been a bit iffy.

------
emilfihlman
Listening to that voiced transcript, those "humanlike" sounds of breathing and
so on are actually very, very annoying and bring nothing of value to the
table. They are actually taking away from the experience, a lot. They are
distracting and not natural.

------
amelius
> I love services like Amazon Transcribe. They are the kind of
> just-futuristic-enough technology that excites my imagination the same way
> that magic does. It’s incredible that we have accurate, automatic speech
> recognition for a variety of languages and accents, in real-time.

I personally _hate_ it when I have to use a service for something that could
be done locally on my computer or smartphone. And I don't get that fuzzy
magical feeling, but instead I think of a (very nearby) dystopian future where
a single company knows what all citizens say or do in real time.

Needless to say, I didn't read the rest of the article.

------
pjmlp
> It’s incredible that we have accurate, automatic speech recognition for a
> variety of languages and accents, in real-time.

Then a few paragraphs later:

> For real-time transcription, Amazon Transcribe currently supports British
> English (en-GB), US English (en-US), French (fr-FR), Canadian French
> (fr-CA), and US Spanish (es-US).

So basically it boils down to two English variants, two French variants, and
one US Spanish variant.

And then one wonders why such projects never pick up steam around the world.

~~~
philliphaydon
So you’re saying that Amazon should release this with support for over 100
languages on day 1?

~~~
pjmlp
Being in a language group that gets routinely ignored, in spite of being the
6th most spoken one, is what triggers this kind of remark from me.

Actually, I must admit that at least they did look beyond North America for the
2nd English and French variants, rather than the usual American English first,
with the rest coming whenever it comes, if ever.

