ldenoue's comments

Here’s one video over an hour long, and it works with Scribe: https://www.appblit.com/scribe?v=FQUo2r-ow-k


Looks great. I made a similar app called Scribe where you can highlight passages of the transcript. It works on the web and also as an iOS app: https://www.appblit.com/scribe

To work around YouTube sometimes blocking the server's IP, the app fetches the transcripts in the browser.


Same method as my open-source lib (https://github.com/ldenoue/readabletranscripts), but several folks asked for a hosted version, so here you go.

200 free minutes on signup so you can try for free.

LLM-corrected transcripts are really good, and you can highlight text, which I find super useful for studying and sharing quotes.


This thing is amazing. Can you add recording?

Perhaps some samples you or visitors create?

Then add a little sampler for beats and it's a fantastic tool.


This is the version of my open-source library that many people liked: https://news.ycombinator.com/item?id=42238890

This will give you 120 free minutes to try out this LLM-powered YouTube transcript correction.


LLM-corrected transcript (using Gemini Flash 8B over the raw YouTube transcript): https://www.appblit.com/scribe?v=YD-9NG1Ke5Y#0


How do you prevent Gemini from just swallowing text after some time?

Audio transcript correction is one area where I struggle to get good results from any LLM unless I chunk the input into no more than one or two pages at a time.

Or did you use any tool?


Did you see this using the API or the online Gemini product?


The online product, I haven’t tried the API.


It’s not uploaded anywhere: the client calls Gemini's servers directly from your browser.

But I understand it can be difficult to trust: that's why the project is on GitHub, so you can run it on your own machine and see how the key is used.

I will try to offer a version that doesn’t require any key.


Although Gemini accepts a very long input context, I found that sending more than 512 or so words at a time to the LLM for "cleaning up the text" yields hallucinations. That's why I chunk the raw transcript into 512-word chunks.
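
A minimal sketch of that chunking step, assuming the google-generativeai Python SDK and the gemini-1.5-flash-8b model id; the prompt wording is illustrative, not the exact one the app uses:

    import google.generativeai as genai

    genai.configure(api_key="YOUR_GEMINI_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash-8b")

    PROMPT = ("Clean up this raw YouTube transcript chunk: fix punctuation, "
              "casing and obvious ASR mistakes, but do not drop or rephrase words.\n\n")

    def clean_transcript(raw: str, chunk_words: int = 512) -> str:
        words = raw.split()
        cleaned = []
        # Keep each request to ~512 words; longer chunks tended to hallucinate.
        for i in range(0, len(words), chunk_words):
            chunk = " ".join(words[i:i + chunk_words])
            cleaned.append(model.generate_content(PROMPT + chunk).text)
        return "\n".join(cleaned)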

Are you saying it works with 70B models on Groq? Mixtral, Llama? Other?


When you did this, I assume you cut the audio off at around 5 minutes?

https://github.com/google-gemini/generative-ai-js/issues/269...


Yeah, I've had no issues sending tokens up to the context limit. I cut it off with a 10% buffer, but that's just to ensure I don't run into tokenization mismatches between tiktoken and whatever tokenizer my actual LLM uses.
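
A rough sketch of that 10% buffer (cl100k_base is assumed here; the margin exists precisely because the serving model's tokenizer may count differently):

    import tiktoken

    def truncate_with_buffer(text: str, context_limit: int, margin: float = 0.10) -> str:
        # Count with cl100k_base; the actual model may tokenize differently,
        # so only fill 90% of the advertised context window.
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        budget = int(context_limit * (1 - margin))
        return enc.decode(tokens[:budget]) if len(tokens) > budget else text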

I have had little success with Gemini on long videos. My pipeline is video -> ffmpeg (strip audio) -> whisperX ASR -> Groq (L3-70b-specdec) -> GPT-4o / Sonnet 3.5 for summarization. Works great.
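
A condensed sketch of that pipeline, assuming ffmpeg and the whisperX CLI are installed and using Groq's OpenAI-compatible endpoint; the model ids and CLI flags are assumptions, and the separate cleanup and summarization steps are folded into a single call:

    import subprocess
    from pathlib import Path
    from openai import OpenAI

    def strip_audio(video: str, audio: str = "audio.wav") -> str:
        # 16 kHz mono WAV, the usual input for Whisper-family models.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video, "-vn", "-ac", "1", "-ar", "16000", audio],
            check=True,
        )
        return audio

    def transcribe(audio: str) -> str:
        # Shell out to the whisperX CLI; model and flags are assumptions.
        subprocess.run(
            ["whisperx", audio, "--model", "large-v2", "--output_format", "txt"],
            check=True,
        )
        return Path(audio).with_suffix(".txt").read_text()

    def summarize(transcript: str) -> str:
        # Any OpenAI-compatible endpoint works here (Groq shown); the model id
        # is an assumption, and cleanup + summarization are merged into one call.
        client = OpenAI(base_url="https://api.groq.com/openai/v1",
                        api_key="YOUR_GROQ_API_KEY")
        resp = client.chat.completions.create(
            model="llama3-70b-8192",
            messages=[{"role": "user",
                       "content": "Summarize this transcript:\n\n" + transcript}],
        )
        return resp.choices[0].message.content

    print(summarize(transcribe(strip_audio("video.mp4"))))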


Do you have anything to read about your study or experiments? Genuinely interested. Perhaps the prompts could tell the LLM that it's specifically handling human speech, not written text?

