Looks great. I made a similar app called Scribe, where you can highlight passages of the transcript.
It works on the web and also as an iOS app.
https://www.appblit.com/scribe
To work around the server IP sometimes being blocked by YouTube, the app fetches the transcripts in the browser instead.
The transcript isn't uploaded anywhere: the client calls the Gemini API directly from your browser.
But I understand that can be difficult to trust: that's why the project is on GitHub, so you can run it on your own machine and see exactly how the key is used.
I will try to offer a version that doesn’t require any key.
Although Gemini accepts a very long input context, I found that sending more than ~512 words at a time to the LLM for "cleaning up the text" yields hallucinations. That's why I chunk the raw transcript into 512-word chunks, roughly like the sketch below.
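A minimal sketch of that word-based chunking (the function name and the lack of overlap between chunks are illustrative, not Scribe's actual code):

```python
def chunk_words(text: str, max_words: int = 512) -> list[str]:
    """Split a raw transcript into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Each chunk is sent to the LLM separately, then the cleaned chunks
# are concatenated back together.
```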
Are you saying it works with 70B models on Groq? Mixtral, Llama? Others?
Yeah, I've had no issues sending input right up to the context limit. I cap it with a 10% buffer, but that's just to make sure I don't run into token-count mismatches between tiktoken and whatever tokenizer my actual LLM uses. Something like the sketch below.
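A minimal sketch of that safety margin, assuming tiktoken's cl100k_base is a close-enough proxy for the target model's tokenizer:

```python
import tiktoken  # pip install tiktoken

def cap_to_context(text: str, context_limit: int, buffer: float = 0.10) -> str:
    """Truncate text to ~90% of the model's context window, counted with
    tiktoken; the buffer absorbs miscounts against the model's real tokenizer."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: proxy encoding
    budget = int(context_limit * (1 - buffer))
    tokens = enc.encode(text)
    return text if len(tokens) <= budget else enc.decode(tokens[:budget])
```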
I have had little success with Gemini and long videos. My pipeline is video -> ffmpeg strip audio -> whisperX ASR -> groq (L3-70b-specdec) -> gpt-4o/sonnet-3.5 for summarization. Works great.
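For reference, the first two stages look something like this (paths, model size, and flags are typical defaults, not necessarily exactly what the parent runs):

```python
import subprocess
import whisperx  # pip install whisperx

VIDEO, AUDIO = "talk.mp4", "talk.wav"  # hypothetical file names

# 1. Strip the audio track with ffmpeg, downmixed to 16 kHz mono for ASR.
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO, "-vn", "-ac", "1", "-ar", "16000", AUDIO],
    check=True,
)

# 2. Transcribe with whisperX (batched inference on GPU).
model = whisperx.load_model("large-v2", device="cuda", compute_type="float16")
audio = whisperx.load_audio(AUDIO)
result = model.transcribe(audio, batch_size=16)
transcript = " ".join(seg["text"].strip() for seg in result["segments"])

# 3. transcript then goes to Groq for cleanup and to gpt-4o / Sonnet 3.5
#    for summarization (omitted here).
```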
Do you have a write-up of your experiments? Genuinely interested. Perhaps the prompt could tell the LLM it's specifically handling transcribed human speech, not written text?
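Something along these lines, maybe (a hypothetical prompt, wording untested):

```python
# Hypothetical system prompt for the cleanup step; wording is illustrative only.
CLEANUP_PROMPT = """You are cleaning up an automatic transcript of spoken,
unscripted human speech, not written text. Expect filler words, false starts,
and run-on sentences. Fix punctuation and obvious ASR errors and drop fillers,
but do not paraphrase, summarize, or remove content."""
```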