
llama3.2 1b & 3b are really useful for quick tasks like creating quick scripts from some text and pasting them to execute. It's super fast and replaces a lot of temporary automation needs. If you don't feel like investing time into automation, sometimes you can just feed the text into an LLM.

This is one of the reasons why I recently added a floating chat to https://recurse.chat/ for quick access to a local LLM.

Here's a demo: https://x.com/recursechat/status/1846309980091330815






Looks very nice, saved it for later. Last week I worked on implementing always-on speech-to-text for automating tasks. I've made significant progress and achieved decent accuracy, but I set myself the constraint of implementing certain parts from scratch so I can ship a single deployable binary, which means I still have work to do (audio processing is new territory for me). Still, I'm optimistic about its potential.

That said, a more straightforward approach would be to use an existing library like https://github.com/collabora/WhisperLive/ inside a Docker container. That way you can call it via WebSocket and integrate it with your LLM, which could also make a nice feature in your product.


Thanks! Let me know when/if you want to give it a spin; the free trial hasn't been updated with the latest build, but I'll try to do that this week.

I've actually been playing around with speech-to-text recently. Thanks for the pointer. Docker is a bit too heavy to deploy for a desktop app use case, but it's good to know about the repo. Building binaries with PyInstaller could be an option, though.

Real-time transcription seems a bit complicated as it involves VAD (voice activity detection), so a feasible path for me is to first ship simple transcription with whisper.cpp. large-v3-turbo looks fast enough :D
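For context, VAD just means deciding which audio frames contain speech before handing them to the transcriber. A minimal energy-threshold sketch of the idea (frame size and threshold here are made-up values; real-time pipelines typically use trained VAD models rather than anything this naive):

```python
def frame_energy(samples):
    """Mean squared amplitude of one frame of PCM samples."""
    return sum(s * s for s in samples) / len(samples)

def detect_speech(samples, frame_size=160, threshold=0.01):
    """Return (start, end) sample indices of contiguous speech-like regions.

    A region is "speech-like" when its frames exceed the energy threshold.
    """
    segments = []
    active_start = None
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        energetic = frame_energy(samples[i:i + frame_size]) > threshold
        if energetic and active_start is None:
            active_start = i
        elif not energetic and active_start is not None:
            segments.append((active_start, i))
            active_start = None
    if active_start is not None:
        segments.append((active_start, len(samples)))
    return segments
```

Batch transcription sidesteps all of this, which is why shipping simple file-based transcription first is the easier path.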


Yes, it's fast enough, especially if you don't need something live.

Can you list some real temporary automation needs you've fulfilled? The demo shows asking for facts about space. Low-parameter models don't seem great as raw chat models, so I'm interested in what they're doing well for you in this context.

Things like grabbing some markdown text and asking for a pip/npm install one-liner, or quick JS scripts to paste into the console (when I didn't bother opening an editor). A fun use case was randomly drawing lucky winners for the app giveaway from reddit usernames. Mostly it's converting unstructured text into short/one-liner executable scripts that don't require much intelligence. For more complex automation/scripts that I'll save for later, I resort to providers (Cursor with Sonnet 3.5, mostly).
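The giveaway draw is exactly the kind of throwaway script a small local model can spit out. A hand-written sketch of what such a script might look like (not the model's actual output; names and the seed parameter are illustrative):

```python
import random

def draw_winners(usernames, k, seed=None):
    """Randomly draw k unique winners from a list of reddit usernames.

    Duplicates are removed first so nobody gets two entries; the sort
    makes the draw reproducible for a given seed.
    """
    rng = random.Random(seed)
    unique = sorted(set(usernames))
    return rng.sample(unique, k)

# e.g. draw_winners(["alice", "bob", "carol", "dave"], 2)
```

Passing a seed lets you re-run the draw publicly and get the same result, which is handy for giveaway transparency.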


