Show HN: Gemini LLM corrects ASR YouTube transcripts

wood_spirit · 2024-11-25T19:58:22 1732564702

As an aside, has anyone else had some big hallucinations with the Gemini meet summaries? Have been using it a week or so and loving the quality of the grammar of the summary etc, but noticed two recurring problems: omitting what was actually the most important point raised, and hallucinating things like “person x suggested y do z” when, really, that is absolutely the last thing x would really suggest!

leetharris · 2024-11-25T20:19:33 1732565973

The Google ASR is one of the worst on the internet. We run benchmarks of the entire industry regularly and the only hyperscaler with a good ASR is Azure. They acquired Nuance for $20b a while ago and they have a solid lead in the cloud space.

And to run it on a "free" product they probably use a very tiny, heavily quantized version of their already weak ASR.

There's lots and lots of better meeting bots if you don't mind paying or have low usage that works for a free tier. At Rev we give away something like 300 minutes a month.

jll29 · 2024-11-26T00:44:54 1732581894

Interesting. Do you have any peer reviewed scientific publications or technical reports regarding this work?

We also compared Amazon, Google, Microsoft Azure as well as a bunch of smaller players (from Edinburgh and Cambridge) and - consistent with what you reported - we also found Google ranked worst - but that was a one-off study from 2019 (unpublished) on financial news.

Word Error Rate (WER), the standard metric for the task, is not everything. For some applications, the ability to upload custom lexicons is paramount (ASR systems that are word-based (almost all), as opposed to phoneme-based, require each word to be defined ahead of time before they can recognize it).
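For anyone unfamiliar with the metric: WER is usually computed as the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal sketch, assuming plain whitespace tokenization and lowercasing (real benchmarks typically apply more careful text normalization, and the function name here is just illustrative):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Note that every word error counts the same here, which is part of why WER alone says little about whether the rare, domain-specific words came out right.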

baxtr · 2024-11-25T21:08:08 1732568888

Very interesting. Thanks for sharing.

Since you have experience in this, I’d like to hear your thoughts on a common assumption.

It goes like this: don’t build anything that would be a feature for a hyperscaler, because ultimately they win.

I guess a lot of it is a question of timing?

leetharris · 2024-11-25T22:22:12 1732573332

I think it really depends on whether or not you can offer a competitive solution and what your end goals are. Do you want an indie hacker business, do you want a lifestyle business, do you want a big exit, do you want to go public, etc?

It is hard to compete with these hyperscalers because they use pseudo anti-competitive tactics that honestly should be illegal.

For example, I know some ASR providers have lost deals to GCP or AWS because those providers will basically throw in ASR for free if you sign up for X amount of EC2 or Y amount of S3, services that have absurd margins for the cloud providers.

Still, stuff like Supabase, Twilio, etc show there is a market. But it's likely shrinking as consolidation continues, exits slow, and the DOJ turns a blind eye to all of this.

hackernewds · 2024-11-26T00:33:54 1732581234

Counter argument: Zoom, DocuSign

But you do have to be next to amazing at execution

aftbit · 2024-11-25T22:36:00 1732574160

Are there any self-hosted options that are even remotely competitive? I have tried Whisper2 a fair bit, and it seems to work okay in very clean situations, like adding subtitles to movie dialog, but not so well when dealing with multiple speakers or poor audio quality.

albertzeyer · 2024-11-25T22:43:56 1732574636

K2/Kaldi is using more traditional ASR technology. It's probably more difficult to set up, but you will get more reliable outputs (no hallucinations or so).

hunter2_ · 2024-11-25T20:16:26 1732565786

It can simultaneously be [the last thing x would suggest] and [a conclusion that an uninvolved person tasked with summarizing might mistakenly draw, with slightly higher probability of making this mistake than not making it] and theoretically an LLM attempts to output the latter. The same exact principle applies to missing the most important point.

jazzyjackson · 2024-11-25T19:57:51 1732564671

Thinking about that time Berkeley delisted thousands of recordings of course content as a result of a lawsuit complaining that they could not be utilized by deaf individuals. Can this be resolved with current technology? Google's auto captioning has been abysmal up to this point, I've often wondered what the cost would be for google to run modern tech over the entire backlog of youtube. At least then they might have a new source of training data.

https://news.berkeley.edu/2017/02/24/faq-on-legacy-public-co...

Discussed at the time (2017) https://news.ycombinator.com/item?id=13768856

andai · 2024-11-25T20:02:49 1732564969

Didn't YouTube have auto-captions at the time this was discussed? Yeah they're a bit dodgy but I often watch videos in public with sound muted and 90% of the time you can guess what word it was meant to be from context. (And indeed more recent models do way, way, way better on accuracy.)

zehaeva · 2024-11-25T20:41:09 1732567269

I have a few Deaf/Hard of Hearing friends who find the auto-captions to be basically useless.

Anything that's even remotely domain specific becomes a garbled mess. Even the captions on documentaries about light engineering/archeology/history subjects are hilariously bad. Names of historical places and people are randomly correct and almost never consistent.

The second anyone has a bit of an accent then it's completely useless.

I keep them on partially because I'm of the "everything needs to have subtitles else I can't hear the words they're saying" cohort. So I can figure out what they really mean, but if you couldn't hear anything I can see it being hugely distracting/distressing/confusing/frustrating.

creato · 2024-11-26T01:46:07 1732585567

I use youtube closed captions all the time when I don't want to have audio. The captions are almost always fine. I definitely am not watching videos that would have had professional/human edited captions either.

There may be mistakes like the ones you mentioned (getting names wrong/inconsistent), but if I know what was intended, it's pretty easy to ignore that. I think expecting "textual" correctness is unreasonable. Usually when there are mistakes, they are "phonetic", i.e. if you spoke the caption out loud, it would sound pretty similar to what was spoken in the video.

hunter2_ · 2024-11-25T22:31:07 1732573867

With this context, it seems as though correction-by-LLM might be a net win among your Deaf/HoH friends even if it would be a net loss for you, since you're able to correct on the fly better than an LLM probably would, while the opposite is more often true for them, due to differences in experience with phonetics?

Soundex [0] is a prevailing method of codifying phonetic similarity, but unfortunately it's focused on names exclusively. Any correction-by-LLM really ought to generate substitution probabilities weighted heavily on something like that, I would think.

[0] https://en.wikipedia.org/wiki/Soundex
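To make the idea concrete, here is a minimal sketch of classic American Soundex, assuming ASCII input only (the function name and the choice to drop non A-Z characters are my own assumptions, not anything a captioning system is known to use). A correction step could compare codes like these to check whether a proposed substitution is at least phonetically plausible:

```python
# Consonant groups that share a Soundex digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code, or '' if no letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels and Y map to ""
        if code and code != prev:
            result.append(code)
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code counts again
    return "".join(result).ljust(4, "0")


print(soundex("their"), soundex("there"))  # both T600
print(soundex("frog"), soundex("dog"))     # F620 and D200
```

The second pair shows one limitation: because the first letter is kept verbatim, words that differ only at the start never share a code, so something beyond plain Soundex would be needed for general phonetic similarity.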

schrodinger · 2024-11-26T02:08:48 1732586928

I'd assume Soundex is too basic and English-centric to be a practical solution for an international company like Google. I was taught it and implemented it in a freshman level CS course in 2004, it can't be nearly state of the art!

jonas21 · 2024-11-25T21:05:27 1732568727

Yes, but the DOJ determined that the auto-generated captions were "inaccurate and incomplete, making the content inaccessible to individuals with hearing disabilities." [1]

If the automatically-generated captions are now of a similar quality as human-generated ones, then that changes things.

[1] https://news.berkeley.edu/wp-content/uploads/2016/09/2016-08...

jazzyjackson · 2024-11-25T20:11:48 1732565508

Definitely depends on audio quality and how closely a speaker's dialect matches the mid-atlantic accent, if you catch my drift.

IME youtube transcripts are completely devoid of meaningful information, especially when domain-specific vocabulary is used.

hackernewds · 2024-11-26T00:32:24 1732581144

What a silly requirement? Since 1% cannot benefit, let's remove it for the 99%

delusional · 2024-11-25T20:05:07 1732565107

That's a legal issue. If humans wanted that content to be up, we just could have agreed to keep it up. Legal issues don't get solved by technology.

jazzyjackson · 2024-11-25T20:09:29 1732565369

Well. The legal complaint was that transcripts don't exist. The issue was that it was prohibitively expensive to resolve the complaint. Now that transcription is 0.1% of the cost it was 8 years ago, maybe the complaint could have been resolved.

Is building a ramp to meet ADA requirements not using technology to solve a legal issue?

delusional · 2024-11-25T20:22:44 1732566164

Nowhere on the linked page at least does it say that it was due to cost. It would seem more likely to me that it was a question of nobody wanting to bother standing up for the videos. If nobody wants to take the fight, the default judgement becomes to take it down.

Building a ramp solves a problem. Pointing at a ramp 5 blocks away 7 years later and asking "doesn't this solve this issue" doesn't.

pests · 2024-11-25T21:10:38 1732569038

Yet this feels very harrison bergeron to me. To handicap those with ability so we all can be at the same level.

tombh · 2024-11-25T20:56:06 1732568166

ASR: Automatic Speech Recognition

joshdavham · 2024-11-25T22:27:32 1732573652

I was too afraid to ask!

throwaway106382 · 2024-11-25T22:40:22 1732574422

Not to be confused with "Autonomous Sensory Meridian Response" (ASMR) - a popular category of video on Youtube.

hackernewds · 2024-11-26T00:31:35 1732581095

How would they be confused?

xanth · 2024-11-26T01:09:01 1732583341

This was a clever jape; a good example of ironic anti-humor. But I don't think you were confused by that either ;)

djmips · 2024-11-26T02:43:25 1732589005

clever japes are not desired on HN - there's Reddit for that my friend.

alsetmusic · 2024-11-25T19:07:57 1732561677

Seems like one of the places where LLMs make a lot of sense. I see some boneheaded transcriptions in videos pretty regularly. Comparing them against "more-likely" words or phrases seems like an ideal use case.

leetharris · 2024-11-25T20:17:05 1732565825

A few problems with this approach:

1. It brings everything back to the "average." Any outliers get discarded. For example, someone who is a circus performer plays fetch with their frog. An LLM would think this is an obvious error and correct it to "dog."

2. LLMs want to format everything as internet text which does not align well to natural human speech.

3. Hallucinations still happen at scale, regardless of model quality.

We've done a lot of experiments on this at Rev and it's still useful for the right scenario, but not as reliable as you may think.

falcor84 · 2024-11-25T22:15:45 1732572945

Regarding the frog, I would assume that the way to address this would be to feed the LLM screenshots from the video, if the budget allows.

leetharris · 2024-11-25T22:26:25 1732573585

Generally yes. That being said, sometimes multimodal LLMs show decreased performance with extra modalities.

The extra dimensions of analysis cause increased hallucination at times. So maybe it solves the frog problem, but now it's hallucinating in another section because it got confused by another frame's tokens.

One thing we've wanted to explore lately has been video based diarization. If I have a video to accompany some audio, can I help with cross talk and sound separation by matching lips with audio and assign the correct speaker more accurately? There's likely something there.

orion138 · 2024-11-25T23:00:03 1732575603

Google published Looking to Listen a while back.

https://research.google/blog/looking-to-listen-audio-visual-...

dylan604 · 2024-11-25T21:45:54 1732571154

What about the cases where the human speaking is actually using nonsense words during a meandering off topic bit of "weaving"? Replacing those nonsense words would be a disservice as it would totally change the tone of the speech.

devmor · 2024-11-25T20:18:10 1732565890

Those transcriptions are already done by LLMs in the first place - in fact, audio transcription was one of the very first large scale commercial uses of the technology in its current iteration.

This is just like playing a game of markov telephone where the step in OP's solution is likely higher compute cost than the step YT uses, because YT is interested in minimizing costs.

albertzeyer · 2024-11-25T22:46:41 1732574801

Probably just "regular" LMs, not large LMs, I assume. I assume some LM with 10-100M params or so, which is cheap to use (and very standard for ASR).

petesergeant · 2024-11-25T20:10:35 1732565435

Also useful I think for checking human-entered transcriptions, which even on expensively produced shows, can often be garbage or just wrong. One human + two separate LLMs, and something to tie-break, and we could possibly finally get decent subtitles for stuff.
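A rough sketch of what that voting step could look like, assuming the three transcripts are already split into aligned subtitle cues (alignment is the hard part and is not shown). The function name and the `tie_break` callback are hypothetical; the callback stands in for whatever does the final arbitration, e.g. a human reviewer or a third model:

```python
from collections import Counter
from typing import Callable, Sequence


def vote_on_cue(candidates: Sequence[str],
                tie_break: Callable[[Sequence[str]], str]) -> str:
    """Pick the text a strict majority agrees on, else defer to tie_break."""
    # Compare case-insensitively and ignore whitespace differences.
    normalized = [" ".join(c.lower().split()) for c in candidates]
    winner, votes = Counter(normalized).most_common(1)[0]
    if votes * 2 > len(candidates):
        # Return the first original candidate that matches the winner.
        return candidates[normalized.index(winner)]
    return tie_break(candidates)


cue = ["The frog fetched.", "the frog  fetched.", "The dog fetched."]
print(vote_on_cue(cue, tie_break=lambda c: c[0]))
```

Exact-match voting like this only catches outright disagreements; near-matches (punctuation, spelling variants) would need fuzzier comparison.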

Timwi · 2024-11-25T23:56:43 1732579003

Can I use this to generate subtitles for my own videos? I would love to have subtitles on them but I can't be bothered to do all the timing synchronization by hand. Surely there must be a way to automate that?

geor9e · 2024-11-26T00:06:06 1732579566

That's called Youtube Automatic Speech Recognition (captioning), and is what this tool uses as input. You can turn those on in youtube studio.

sorenjan · 2024-11-25T21:20:19 1732569619

Using an LLM to correct text is a good idea, but the text transcript doesn't have information about how confident the speech to text conversion is. Whisper can output confidence for each word, this would probably make for a better pipeline. It would surprise me if Google doesn't do something like this soon, although maybe a good speech to text model is too computationally expensive for Youtube at the moment.
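A minimal sketch of that pipeline idea, assuming the ASR step has already produced per-word confidence scores. The `Word` type, the 0.6 threshold, and the one-word context window are all illustrative assumptions, not Whisper's actual output format. The point is to hand only the uncertain spans to the LLM and leave confident words untouched:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    confidence: float  # assumed to be in [0, 1]


def low_confidence_spans(words: List[Word],
                         threshold: float = 0.6,
                         context: int = 1) -> List[Tuple[int, int]]:
    """Return merged (start, end) index ranges, end exclusive, that cover
    each low-confidence word plus `context` words on either side."""
    spans: List[Tuple[int, int]] = []
    for i, word in enumerate(words):
        if word.confidence >= threshold:
            continue
        start = max(0, i - context)
        end = min(len(words), i + context + 1)
        if spans and start <= spans[-1][1]:
            # Overlaps or touches the previous span: extend it.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans


words = [Word("plays", 0.98), Word("fetch", 0.95), Word("with", 0.99),
         Word("their", 0.97), Word("frog", 0.41)]
for start, end in low_confidence_spans(words):
    print(" ".join(w.text for w in words[start:end]))  # candidate span for review
```

Anything outside the returned spans would be passed through verbatim, which limits how much of the transcript the LLM is allowed to rewrite.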

dylan604 · 2024-11-25T21:41:35 1732570895

Depends on your purpose of the transcript. If you are expecting the exact form of the words spoken in written form, then any deviation from that is no longer a transcription. At that point it is text loosely based on the spoken content.

Once you accept it's okay for the LLM to just replace words in a transcript, you might as well just let it make up a story based on character names you've provided.

falcor84 · 2024-11-25T22:25:22 1732573522

> any deviation from that is no longer a transcription

That's a wild exaggeration. Professional transcripts often have small (and not so small) mistakes, caused by typos, mishearing or lack of familiarity with the subject matter. Depending on the case, these are then manually proofread, but even after proofreading, some mistakes often remain, and occasionally new ones are even introduced.

dylan604 · 2024-11-25T22:55:54 1732575354

Maybe, but typos are not even the same thing as an LLM thinking of a better next choice of words rather than actually just transcribing what was heard.

icelancer · 2024-11-25T20:51:05 1732567865

Nice use of an LLM - we use Groq 70b models for this in our pipelines at work. (After using WhisperX ASR on meeting files and such)

One of the better reasons to use Cerebras/Groq that I've found is that you can return huge amounts of clean text back fast for processing in other ways.

kelvinjps · 2024-11-25T21:40:32 1732570832

Google should have the needed tech for good AI transcription. Why don't they integrate it into their auto-captioning, instead of offering those crappy auto subtitles?

briga · 2024-11-25T21:50:05 1732571405

Are they crappy though? Most of the time it gets things right, even if they aren't as accurate as a human. And sure, they probably have better techniques for this, but are they cost-effective to run at YouTube-scale? I think their current solution is good enough for most purposes, even if it isn't perfect

InsideOutSanta · 2024-11-25T22:10:48 1732572648

I'm watching YouTube videos with subtitles for my wife, who doesn't speak English. For videos on basic topics where people speak clear, unaccented English, they work fine (i.e. you usually get what people are saying). If the topic is in any way unusual, the recording quality is poor, or people have accents, the results very quickly turn into a garbled mess that is incomprehensible at best, and misleading (i.e. the subtitles seem coherent, but are wrong) at worst.

wahnfrieden · 2024-11-25T22:39:36 1732574376

Japanese auto captions suck

summerlight · 2024-11-26T00:38:58 1732581538

YT is using USM, which is supposed to be their SOTA ASR model. Gemini has much better linguistic knowledge, but it's likely prohibitively expensive to use on all the YT videos uploaded every day. But this "correction" approach seems to be a nice cost-effective way to apply an LLM indeed.

leetharris · 2024-11-25T20:13:23 1732565603

The main challenge with using LLMs pretrained on internet text for transcript correction is that you reduce verbatimicity due to the nature of an LLM wanting to format every transcript as internet text.

Talking has a lot of nuances to it. Just try to read a Donald Trump transcript. A professional author would never write a book's dialogue like that.

Using a generic LLM on transcripts almost always reduces accuracy as a whole. We have endless benchmark data to demonstrate this at RevAI. It does, however, help with custom vocabulary, rare words, proper nouns, and some people prefer the "readability" of an LLM-formatted transcript. It will read more like a wikipedia page or a book as opposed to the true nature of a transcript, which can be ugly, messy, and hard to parse at times.

dylan604 · 2024-11-25T21:51:53 1732571513

> A professional author would never write a book's dialogue like that.

That's a bit too far. Ever read Huck Finn?

dr_dshiv · 2024-11-25T19:38:30 1732563510

The first time I used Gemini, I gave it a youtube link and asked for a transcript. It told me how I could transcribe it myself. Honestly, I haven't used it since. Was that unfair of me?

robrenaud · 2024-11-25T19:47:42 1732564062

Gemini is much worse as a product than 4o or Claude. I recommend using it from Google AI studio rather than the official consumer facing interface. But for tasks with large audio/visual input, it's better than 4o or Claude.

Whether you want to deal with it being annoying is your call.

andai · 2024-11-25T20:04:17 1732565057

GPT told me the same thing when I asked it to make an API call, or do an image search, or download a transcript of a YouTube video, or...

Spooky23 · 2024-11-25T19:51:37 1732564297

The consumer Gemini is very prudish and optimized against risk to Google.