Anyone have experience with the accuracy of Amazon Transcribe vs the Google offerings? (Google Live Transcribe for Android currently tops my list of impressive transcription offerings.)
I built a project where we transcribed speech into chat bubbles for an AR app (Magic Leap). We started with Google Speech-to-Text and swapped to Amazon Transcribe.
Performance-wise they seemed mostly similar - both can pick up text fairly accurately in somewhat noisy environments. We did notice that Google's offering seemed to be slightly more accurate. But the difference was marginal and only noticeable when we compared side-by-side on intentionally poorly pronounced words. There were two big differences that made it impossible for us to use Google's product:
1. Google has a 1-minute limit on streaming speech-to-text. It just closes the connection at the 1-minute mark. It doesn't even send you a "final result", so if speech was being recorded when the connection dropped, that transcription is lost. Speaking of which...
2. Google doesn't provide incremental updates. So if someone speaks for a while, you only get an update at the end of it.
Note that this is their API - my impression is the product they use in their apps is superior in functionality to the product available on Google Cloud.
Amazon Transcribe, on the other hand, has a 4-hour limit and sends incremental updates, so for a longer sentence, like this one I'm currently writing, you would get a message every couple of words, which is essential when the goal is to show a live transcription.
Can you explain this comment? From my recollection, this statement is not true. Also, their site says:
"Returns text transcription in real time for short-form or long-form audio
Cloud Speech-to-Text can stream text results, immediately returning text as it’s recognized from streaming audio or as the user is speaking. Alternatively, Cloud Speech-to-Text can return recognized text from audio stored in a file. It’s capable of analyzing short-form and long-form audio."
Perhaps something has changed since I used it - about 2 months ago - or we were using it incorrectly, but in our experience it was "real-time" transcription, not truly incremental transcription. (Since you and another person have mentioned it's possible, I'm leaning towards us not having looked through the documentation carefully enough. We didn't spend much time trying to figure it out because the time limit was the bigger dealbreaker.)
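For reference, the Cloud Speech-to-Text streaming config does expose an interim_results flag, which sounds like exactly what we were missing. A rough config sketch with the google-cloud-speech Python client (untested here, and exact names may have changed across client versions):

```python
# Sketch: the streaming config knob we apparently missed. Assumes the
# google-cloud-speech Python client; names may differ across versions.
from google.cloud import speech

streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    ),
    # With interim_results=True the service returns partial hypotheses
    # (is_final=False) while you speak, not just completed sentences.
    interim_results=True,
)
# This config is then sent along with the audio chunks to the client's
# streaming_recognize call; each result carries an is_final flag.
```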
Probably best explained as an example. Consider someone saying this:
"How are you doing? I'm doing well. Do you plan to go to the park with Alice and Bob tomorrow to see the fireworks show?"
When we were using it, Google would give:
1. "How are you doing?"
2. "I'm doing well."
3. "Do you plan to go to the park with Alice and Bob tomorrow to see the fireworks show?"
as three separate messages.
So it's certainly not waiting until the end of the entire stream to give you a result - it sends you those three messages as you speak. In that sense it is "returning text as it's recognized": once it recognizes you've finished a sentence, it computes the words and gives them back to you. But the issue for us was that we could only get results after full sentences or long pauses.
Amazon, on the other hand, would give for the last sentence something like:
1. "Do you plan"
2. "Do you plan to go"
3. "Do you plan to go to the park"
So for our purposes (showing a chat bubble above a person's head), the latter version was much more useful. The reason is that spoken sentences often ramble, so we wanted to show incremental updates to the user as a long sentence was being spoken, so they wouldn't have to read 30 words at once.
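The practical difference is easy to sketch in code: with incremental results you overwrite the bubble's text on every partial message and commit it when a final one arrives. A toy simulation of the two message patterns above (no real API involved; the is_final flags are my own assumption about how the results are tagged):

```python
# Toy sketch of the two update patterns: a "chat bubble" that is
# overwritten on each partial result and committed on the final one.
def render_bubbles(messages):
    """messages: list of (text, is_final) tuples from a streaming recognizer."""
    bubbles = []   # committed (final) bubbles
    current = ""   # bubble currently being updated in place
    for text, is_final in messages:
        current = text            # overwrite, don't append: partials repeat
        if is_final:
            bubbles.append(current)
            current = ""
    return bubbles, current

# Sentence-at-a-time results (what we saw from Google): whole sentences only.
google_style = [("How are you doing?", True),
                ("I'm doing well.", True)]

# Incremental results (Amazon Transcribe): the bubble grows every few words.
amazon_style = [("Do you plan", False),
                ("Do you plan to go", False),
                ("Do you plan to go to the park", False),
                ("Do you plan to go to the park with Alice and Bob", True)]

print(render_bubbles(google_style))
print(render_bubbles(amazon_style))
```

With the sentence-at-a-time pattern the bubble stays empty until a whole sentence lands; with the incremental pattern the reader sees it grow a few words at a time.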
From my memory (which is over a year ago), Google's service would give real-time results, likely meaning incremental word by word results. Did you use the gRPC interface or JSON? I think the streaming service is only available via gRPC.
Disclaimer: My comment reflects my own views, and not those of my employer, etc.
I can really recommend Otter.ai for English transcription, and I'm pretty happy with the accuracy. You can tag and listen to each part of the transcribed text. I use it for consuming lengthy YouTube conference talks. They also have a very generous free plan of 10 hours per month. Big fan here :-)
I'm always intrigued by smaller players in this domain and how they handle user data versus their larger competitors, so I checked out their website. Can we please make this style of privacy policy mandatory [1]? While there are some scrolling issues on Android/Chrome for me, I just love how fine-grained, clear and understandable the items are, listed in non-legalese terms.
I've been playing around with Otter all day and it's great. Really slick interface on both mobile and web. Has tons of great UX touches and overall feels polished. Thanks for the recommendation!
Yes, we have implemented both, plus Watson, and at least IMO Amazon's transcription was the best, followed by Watson and then Google. Which was honestly a big surprise to me, since I was expecting something more like Google, Amazon, big gap, and then Watson.
Also, one thing I liked (IIRC) is that Amazon was the only service at the time that offered punctuation (to mark the end of sentences) in English, which is very useful in some cases.
Yes. The segmentation of sentences is critical for a lot of downstream integrations you might want to use (translation works best within an AWS environment when called sentence by sentence, and any sort of sentiment analysis or entity detection is also affected by sentence breaks).
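To illustrate, a minimal sketch of splitting a punctuated transcript into sentences before handing each one to a downstream service (the regex split is my own simplification, not what any of these services do internally):

```python
import re

def split_sentences(transcript):
    """Split a punctuated transcript on sentence-ending marks (., ?, !)."""
    parts = re.split(r'(?<=[.!?])\s+', transcript.strip())
    return [p for p in parts if p]

transcript = ("How are you doing? I'm doing well. Do you plan to go to "
              "the park with Alice and Bob tomorrow to see the fireworks show?")

for sentence in split_sentences(transcript):
    # each sentence would be sent separately to e.g. a translation
    # or sentiment-analysis API, rather than the whole transcript at once
    print(sentence)
```

This only works at all because the transcription service emits punctuation; without it there is nothing to split on, which is why the punctuation support mentioned above matters.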
I help organize KubeCon + CloudNativeCon, and we are planning to add live transcription (also known as open captioning) to future events. We also want to offer simultaneous translation to and from Chinese for our Shanghai event. I'd love to see a comparison of the major offerings if anyone has done it. If not, I guess we'll need to.
Disclaimer: I am an organizer of a GDG, but these opinions are my own.
Google has a Cloud Speech-to-Text API that supports streaming audio. Google blows Amazon out of the water here with accuracy, speed, and features at the same price or cheaper. They also have translation APIs.
Listening to that voiced transcript, those "humanlike" sounds of breathing and so on are very, very annoying and bring nothing of value to the table. They actually take away from the experience, a lot. They are distracting and not natural.
> I love services like Amazon Transcribe. They are the kind of just-futuristic-enough technology that excites my imagination the same way that magic does. It’s incredible that we have accurate, automatic speech recognition for a variety of languages and accents, in real-time.
I personally hate it when I have to use a service for something that could be done locally on my computer or smartphone. And I don't get that fuzzy magical feeling, but instead I think of a (very nearby) dystopian future where a single company knows what all citizens say or do in real time.
Needless to say, I didn't read the rest of the article.
> It’s incredible that we have accurate, automatic speech recognition for a variety of languages and accents, in real-time.
Then a few paragraphs later:
> For real-time transcription, Amazon Transcribe currently supports British English (en-GB), US English (en-US), French (fr-FR), Canadian French (fr-CA), and US Spanish (es-US).
So basically it boils down to two English variants, two French variants, and one Spanish variant.
And then one wonders why such projects never pick up steam around the world.
Being in a language group that gets routinely ignored, in spite of being the 6th most spoken one, is what triggers this kind of remark from me.
Actually, I must admit that at least they did look beyond North America for the second English and French variants, instead of the usual pattern of American English first and the rest coming when it comes, if ever.