Show HN: LLM Aided Transcription Improvement (github.com/dicklesworthstone)
11 points by eigenvalue 28 days ago | 4 comments
I recently submitted another project that uses LLMs to correct errors and improve formatting in OCRed documents, and it was well received. The low cost and the high quality and speed of the latest "value tier" models from OpenAI and Anthropic make it possible to get compelling results at a very reasonable price in that application.

It occurred to me that the same approach taken there (splitting documents into chunks and sending each chunk through a chain of LLM prompts, where each prompt takes the output of the previous one and applies an additional layer of processing) could easily be applied to a related problem: "improving" raw transcripts of spoken-word content to make them more coherent, correct speech errors, turn utterances into polished sentences with full punctuation, add markdown formatting, and so on.
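
To make the pipeline idea concrete, here is a rough sketch of the chunk-and-chain approach in Python. It is not the repo's actual code; the model name, prompts, and helper functions are illustrative placeholders.

    # Minimal sketch of the chunk -> prompt-chain idea (illustrative only; the
    # real project's prompts, stages, and chunking logic differ in detail).
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROMPT_STAGES = [
        "Fix obvious transcription errors and false starts in this transcript chunk:",
        "Rewrite the corrected text as polished sentences with full punctuation:",
        "Add light markdown formatting (headings, lists) where it aids readability:",
    ]

    def improve_chunk(chunk: str) -> str:
        text = chunk
        for stage_prompt in PROMPT_STAGES:
            # Each stage consumes the previous stage's output.
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # any low-cost "value tier" model
                messages=[{"role": "user", "content": f"{stage_prompt}\n\n{text}"}],
            )
            text = resp.choices[0].message.content
        return text

    def improve_transcript(transcript: str, chunk_size: int = 4000) -> str:
        chunks = [transcript[i:i + chunk_size]
                  for i in range(0, len(transcript), chunk_size)]
        return "\n\n".join(improve_chunk(c) for c in chunks)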

Note that this is very different from taking a raw transcript and trying to make it look like a formal transcript from, say, a magazine article, with proper speaker diarization and formatting. There are several other projects that seek to do that, and it's not really possible to get great results with the raw transcript data alone (you also need to look at the audio for really robust speaker identification, for instance).

Where this project is more useful is for people like YouTubers who have made a video on a subject, but the video is a bit informal and rambling: the kind of thing that sounds fine when you listen to it, but that wouldn't feel polished enough if you read an exact transcript of it. This project lets you easily end up with something that can stand on its own in written form. Because of this different goal, it takes far more liberties with changing and transforming the original content, so in that sense it's quite different from the OCR correction project.

The best way to see this is to just look at a sample. In this case, it's a YouTube video a friend of mine made about music theory:

Original Transcript JSON File (Output from Whisper): https://github.com/Dicklesworthstone/llm_aided_transcription...

Final LLM Generated Markdown: https://github.com/Dicklesworthstone/llm_aided_transcription...

As you can see from the example, although the essence of the original content has been preserved, the form it takes is really quite different.

This new project pairs well with another past project I submitted to HN a while back, which makes it easy to generate transcripts of a whole playlist of YouTube videos (or just a single video) using Whisper:

https://github.com/Dicklesworthstone/bulk_transcribe_youtube...

Someone with a lot of recorded content (YouTube videos, podcasts, etc.) can crank it all through this code in a few minutes and end up with a pile of written material to use for blog posts, handouts, etc. It's the kind of thing that would take days or weeks to do by hand, and which I think this latest crop of low-cost LLMs is quite effective at doing in an automated way, for a couple of bucks in API calls at most.

As always, you'll want to read over the output to ensure it isn't hallucinating things that were never in the original! Future work here will likely include optional "modules" (enabled with an option flag) for generating related ancillary content alongside the improved primary "transcript" output, such as multiple-choice questions, "top takeaways," PowerPoint slides, etc.

Hope you like it!




I record long, rambling voice memos in noisy environments which Whisper struggles to parse. Perhaps this can rescue me from the tedium of hand-stitching the fragmented results together. GIGO, of course, but there's an equilibrium here that might be struck.


Give it a try and let me know how well it works for you! You might improve results by slightly tweaking the first stage of the prompting to mention that these are your own voice memos to yourself to give it better context.
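
Concretely (and purely as an illustration, since the actual prompt text lives in the repo's code), the tweak could be as small as prepending a line of framing to the first-stage prompt:

    # Illustrative only: add personal-memo context to the first processing stage.
    FIRST_STAGE_PROMPT = (
        "These are my own informal voice memos to myself, recorded in a noisy "
        "environment, so expect fragmented phrases and mis-heard words.\n\n"
        "Fix obvious transcription errors and false starts in this transcript chunk:"
    )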


I'm curious about the chunk splitting approach you mentioned. How do you determine the optimal chunk size for processing? Seems like there could be a tradeoff between context preservation and processing efficiency. Have you experimented with different chunk sizes and their impact on the quality of the final output? This could be really important for handling things like long-range dependencies in the text.


I just tried messing around with the chunk size until the results looked best to my eye. One important thing to note is that each chunk includes a portion of the previous chunk as context. But yes, it’s not going to have context that spans the entire document. For that you would need a final step that takes the entire output. One thing I learned from this is that you can’t ask too much at once from these value tier models. If you keep your requests modest and focused, they do really well. When you try to cram too much complexity and too many rules/requests at once, they start messing up and leaving stuff out and hallucinating.
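
For anyone curious what the overlap looks like, here is a rough sketch of overlapping chunking, assuming simple character-based splitting; the sizes are made up, and the real code may well split on sentence or token boundaries instead.

    # Sketch of overlapping chunking (sizes and splitting strategy are assumptions).
    def split_with_overlap(text: str, chunk_size: int = 4000, overlap: int = 500) -> list[str]:
        chunks = []
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            chunks.append(text[start:end])
            if end == len(text):
                break
            start = end - overlap  # carry a tail of the previous chunk as context
        return chunks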



