Anyone have experience with the accuracy of Amazon Transcribe vs the Google offerings? (Google Live Transcribe for Android currently tops my list of impressive transcription offerings.)
I built a project where we transcribed speech into chat bubbles for an AR app (Magic Leap). We started with Google Speech-to-Text and swapped to Amazon Transcribe.
Performance-wise they seemed mostly similar - both can pick up text fairly accurately in somewhat noisy environments. We did notice that Google's offering seemed to be slightly more accurate. But the difference was marginal and only noticeable when we compared side-by-side on intentionally poorly pronounced words. There were two big differences that made it impossible for us to use Google's product:
1. Google has a 1-minute limit on streaming speech-to-text. It just closes the connection at the 1-minute mark. It doesn't even send you a "final result", so if speech was being recorded when the connection dropped, that transcription is lost. Speaking of which...
2. Google doesn't provide incremental updates. So if someone speaks for a while, you only get an update at the end of it.
Note that this is their API - my impression is the product they use in their apps is superior in functionality to the product available on Google Cloud.
Amazon Transcribe, on the other hand, has a 4-hour limit and sends incremental updates, so for a longer sentence, like this one I'm currently writing, you would get a message every couple of words, which is essential when the goal is to show a live transcription.
Can you explain this comment? From my recollection, this statement is not true. Also, their site says:
"Returns text transcription in real time for short-form or long-form audio
Cloud Speech-to-Text can stream text results, immediately returning text as it’s recognized from streaming audio or as the user is speaking. Alternatively, Cloud Speech-to-Text can return recognized text from audio stored in a file. It’s capable of analyzing short-form and long-form audio."
Perhaps something has changed since I used it - about 2 months ago - or we were using it incorrectly, but in our experience it was "real-time" transcription, not truly incremental transcription. (Since you and another person have mentioned it's possible, I'm leaning towards us not having looked through the documentation carefully enough. We didn't spend much time trying to figure it out because the time limit was the bigger dealbreaker.)
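For reference, the Cloud Speech-to-Text streaming config does expose an interim_results flag, which sounds like exactly what we were missing. A rough config sketch with the google-cloud-speech Python client (untested here, and exact names may have changed across client versions):

```python
# Sketch: the streaming config knob we apparently missed. Assumes the
# google-cloud-speech Python client; names may differ across versions.
from google.cloud import speech

streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    ),
    # With interim_results=True the service returns partial hypotheses
    # (is_final=False) while you speak, not just completed sentences.
    interim_results=True,
)
# This config is then sent along with the audio chunks to the client's
# streaming_recognize call; each result carries an is_final flag.
```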
Probably best explained as an example. Consider someone saying this:
"How are you doing? I'm doing well. Do you plan to go to the park with Alice and Bob tomorrow to see the fireworks show?"
When we were using it, Google would give:
1. "How are you doing?"
2. "I'm doing well."
3. "Do you plan to go to the park with Alice and Bob tomorrow to see the fireworks show?"
as three separate messages.
So it's certainly not waiting until the end of the entire stream to give you a result - it sends you those three messages as you speak. In that sense it is "returning text as it's recognized": once it recognizes you've finished a sentence, it computes the words and gives them back to you. But the issue for us was that we could only get results after full sentences or long pauses.
Amazon, on the other hand, would give for the last sentence something like:
1. "Do you plan"
2. "Do you plan to go"
3. "Do you plan to go to the park"
So for our purposes (showing a chat bubble above a person's head), the latter version was much more useful. The reason is that spoken sentences often ramble, so we wanted to show incremental updates to the user as a long sentence was being spoken, so they wouldn't have to read 30 words at once.
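The practical difference is easy to sketch in code: with incremental results you overwrite the bubble's text on every partial message and commit it when a final one arrives. A toy simulation of the two message patterns above (no real API involved; the is_final flags are my own assumption about how the results are tagged):

```python
# Toy sketch of the two update patterns: a "chat bubble" that is
# overwritten on each partial result and committed on the final one.
def render_bubbles(messages):
    """messages: list of (text, is_final) tuples from a streaming recognizer."""
    bubbles = []   # committed (final) bubbles
    current = ""   # bubble currently being updated in place
    for text, is_final in messages:
        current = text            # overwrite, don't append: partials repeat
        if is_final:
            bubbles.append(current)
            current = ""
    return bubbles, current

# Sentence-at-a-time results (what we saw from Google): whole sentences only.
google_style = [("How are you doing?", True),
                ("I'm doing well.", True)]

# Incremental results (Amazon Transcribe): the bubble grows every few words.
amazon_style = [("Do you plan", False),
                ("Do you plan to go", False),
                ("Do you plan to go to the park", False),
                ("Do you plan to go to the park with Alice and Bob", True)]

print(render_bubbles(google_style))
print(render_bubbles(amazon_style))
```

With the sentence-at-a-time pattern the bubble stays empty until a whole sentence lands; with the incremental pattern the reader sees it grow a few words at a time.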
From my memory (which is over a year ago), Google's service would give real-time results, likely meaning incremental word by word results. Did you use the gRPC interface or JSON? I think the streaming service is only available via gRPC.
Disclaimer: My comment reflects my own views, and not those of my employer, etc.
I can really recommend Otter.ai for English transcription, and I'm pretty happy with the accuracy. You can tag and listen to each part of the transcribed text. I use it for consuming lengthy YouTube conference talks. They also have a very generous free plan of 10 hours per month. Big fan here :-)
I'm always intrigued by smaller players in this domain and how they handle user data versus their larger competitors, so I checked out their website. Can we please make this style of privacy policy mandatory [1]? While there are some scrolling issues on Android/Chrome for me, I just love how fine-grained, clear and understandable the items are, listed in non-legalese terms.
I've been playing around with Otter all day and it's great. Really slick interface on both mobile and web. Has tons of great UX touches and overall feels polished. Thanks for the recommendation!
Yes, we have implemented both, plus Watson, and at least IMO Amazon's transcription was the best, followed by Watson and then Google. Which was honestly a big surprise to me, since I was expecting something more like Google, Amazon, big gap, and then Watson.
Also, one thing I liked (IIRC) is that Amazon was the only service at the time that offered punctuation (to mark the end of sentences) in English, which is very useful in some cases.
Yes. The segmentation of sentences is critical for a lot of downstream integrations you might want to use (translation works best within an AWS environment when called sentence by sentence, and any sort of sentiment analysis or entity detection is also affected by sentence breaks).
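To illustrate, a minimal sketch of splitting a punctuated transcript into sentences before handing each one to a downstream service (the regex split is my own simplification, not what any of these services do internally):

```python
import re

def split_sentences(transcript):
    """Split a punctuated transcript on sentence-ending marks (., ?, !)."""
    parts = re.split(r'(?<=[.!?])\s+', transcript.strip())
    return [p for p in parts if p]

transcript = ("How are you doing? I'm doing well. Do you plan to go to "
              "the park with Alice and Bob tomorrow to see the fireworks show?")

for sentence in split_sentences(transcript):
    # each sentence would be sent separately to e.g. a translation
    # or sentiment-analysis API, rather than the whole transcript at once
    print(sentence)
```

This only works at all because the transcription service emits punctuation; without it there is nothing to split on, which is why the punctuation support mentioned above matters.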
I help organize KubeCon + CloudNativeCon, and we are planning to add live transcription (also known as open captioning) to future events. We also want to offer simultaneous translation to and from Chinese for our Shanghai event. I'd love to see a comparison of the major offerings if anyone has done it. If not, I guess we'll need to.
Disclaimer: I am an organizer of a GDG, but these opinions are my own.
Google has a Cloud Speech-to-Text API that supports streaming audio. Google blows Amazon out of the water here with accuracy, speed, and features at the same price or cheaper. They also have translation APIs.
Listening to that voiced transcript, those "humanlike" sounds of breathing and so on are very, very annoying and bring nothing of value to the table. They actually take away from the experience, a lot. They are distracting and not natural.
> I love services like Amazon Transcribe. They are the kind of just-futuristic-enough technology that excites my imagination the same way that magic does. It’s incredible that we have accurate, automatic speech recognition for a variety of languages and accents, in real-time.
I personally hate it when I have to use a service for something that could be done locally on my computer or smartphone. And I don't get that fuzzy magical feeling, but instead I think of a (very nearby) dystopian future where a single company knows what all citizens say or do in real time.
Needless to say, I didn't read the rest of the article.
> It’s incredible that we have accurate, automatic speech recognition for a variety of languages and accents, in real-time.
Then a few paragraphs later:
> For real-time transcription, Amazon Transcribe currently supports British English (en-GB), US English (en-US), French (fr-FR), Canadian French (fr-CA), and US Spanish (es-US).
So basically it boils down to two English variants, two French variants, and one Spanish variant.
And then one wonders why such projects never pick up steam around the world.
Being in a language group that gets routinely ignored, in spite of being the 6th most spoken one, is what triggers this kind of remark from me.
Actually, I must admit that at least they did look beyond North America for the second English and French variants, instead of the usual pattern of American English first and the rest coming when it comes, if ever.