I recently submitted another project that uses LLMs to correct errors and improve the formatting of OCRed documents, which was well received. The low cost and high quality/speed of the latest "value tier" models from OpenAI and Anthropic have made it possible to get compelling results at a very reasonable price in that application.
It occurred to me that the same approach taken there (namely, splitting documents into chunks and sending each chunk through a chain of LLM prompts, where each prompt takes the output of the previous one and applies an additional layer of processing) could easily be applied to a related problem: "improving" raw transcripts of spoken-word content to make them more coherent, to correct speech errors, to turn utterances into polished sentences with full punctuation, to add markdown formatting, etc.
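To make that concrete, here is a minimal sketch of the chunk-and-chain idea in Python. It assumes the OpenAI Python client (openai>=1.0), and the prompt wording, chunk size, and model name are all illustrative placeholders rather than what the actual project uses:

```python
# A minimal sketch of the chunk-and-chain pattern, assuming the OpenAI
# Python client (openai>=1.0). The prompt wording, chunk size, and model
# name are illustrative placeholders, not what the project actually uses.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each stage applies one additional layer of processing to the output
# of the previous stage.
STAGES = [
    "Correct obvious speech errors and transcription mistakes in this text:",
    "Rewrite these utterances as polished, fully punctuated sentences:",
    "Add appropriate markdown formatting (headings, emphasis, lists):",
]

def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split the transcript into chunks small enough for a single prompt."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def process_chunk(chunk: str) -> str:
    """Send one chunk through the full chain of prompts, stage by stage."""
    result = chunk
    for instruction in STAGES:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # stand-in for any low-cost "value tier" model
            messages=[{"role": "user", "content": f"{instruction}\n\n{result}"}],
        )
        result = response.choices[0].message.content
    return result

def improve_transcript(raw_transcript: str) -> str:
    """Chunk the raw transcript, chain-process each chunk, and reassemble."""
    return "\n\n".join(process_chunk(c) for c in chunk_text(raw_transcript))
```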
Note that this is very different from taking a raw transcript and trying to make it look like a formal transcript from, say, a magazine article, with proper speaker diarization and formatting. Several other projects aim to do that, and it's not really possible to get great results from the raw transcript data alone (robust speaker identification, for instance, requires looking at the audio as well).
Where this project is more useful is for people like YouTubers who have made a video on a subject, but where the video might be a bit informal and rambling: the kind of thing that sounds fine when you listen to it, but that wouldn't feel polished enough if you read an exact transcript of it. This project lets you easily end up with something that can stand on its own in written form. Because of this different goal, it takes far more liberties in changing and transforming the original content, so in that sense it's quite different from the OCR correction project.
The best way to see this is to just look at a sample. In this case, it's a YouTube video a friend of mine made about music theory:
Original Transcript JSON File (Output from Whisper): https://github.com/Dicklesworthstone/llm_aided_transcription...
Final LLM Generated Markdown: https://github.com/Dicklesworthstone/llm_aided_transcription...
As you can see from the example, although the essence of the original content has been preserved, the form it takes is really quite different.
This new project pairs well with another project I submitted to HN a while back, which makes it easy to generate transcripts of a whole playlist of YouTube videos (or just a single video) using Whisper:
https://github.com/Dicklesworthstone/bulk_transcribe_youtube...
Someone with a lot of recorded content (YouTube videos, podcasts, etc.) can just crank it all through this code in a few minutes and end up with a bunch of written material they could use for blog posts, handouts, etc. It's the kind of thing that would take days or weeks to do by hand, and which I think this latest crop of low-cost LLMs is quite effective at doing in an automated way, for a couple of bucks of API calls at most.
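As a rough illustration of how the two projects fit together, here's a hypothetical bit of glue code that reads Whisper-style JSON transcripts (assuming Whisper's standard output layout with a "segments" list) and runs each one through the improve_transcript() sketch from earlier; the directory and file names are placeholders:

```python
# Hypothetical glue between the two projects: read Whisper-style JSON
# transcripts and feed each one through the improve_transcript() sketch
# above. Assumes Whisper's standard output layout ("segments" entries,
# each with a "text" field); paths are placeholders.
import json
from pathlib import Path

def whisper_json_to_text(path: Path) -> str:
    """Flatten a Whisper JSON transcript into one plain-text string."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return " ".join(seg["text"].strip() for seg in data["segments"])

for transcript_file in Path("transcripts").glob("*.json"):
    raw = whisper_json_to_text(transcript_file)
    markdown = improve_transcript(raw)  # from the earlier sketch
    transcript_file.with_suffix(".md").write_text(markdown, encoding="utf-8")
```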
As always, you'll want to read over the output to make sure it isn't hallucinating things that were never in the original! Future work here will likely include optional "modules" (which you could enable with an option flag) for generating related ancillary content alongside the improved primary "transcript" output, such as multiple-choice questions, "top takeaways," PowerPoint presentation slides, etc.
Hope you like it!