
Thanks for this; I actually appreciate the honesty. It's always difficult for me to judge the actual quality of things I don't have intimate experience with.

Can I ask another question? If I wanted to hack around with STT and TTS (inference only) on a Pi (4B+), is there anything approximately suitable that can run on-device? (I could process on my main machine, but I'd love to do it on the Pi, even with a decent delay.)




For STT, take a look at Wenet: https://github.com/wenet-e2e/wenet

They support running on a Raspberry Pi, and it runs in real time. I've tried the desktop version, and the quality is good enough when the audio is clean.
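
If it helps, here's a minimal sketch using the wenetruntime Python binding the project publishes. The Decoder/decode_wav names and the 'en' language tag are from memory of the repo's README, so double-check them against the current docs before relying on this:

    # pip install wenetruntime   (prebuilt runtime binding; check the repo for ARM/Pi wheels)
    import sys

    import wenetruntime as wenet

    # Load the pretrained English recognizer; as far as I recall, the binding
    # downloads the model automatically on first use.
    decoder = wenet.Decoder(lang='en')

    # Decode a 16 kHz, 16-bit mono WAV file and print the recognition result.
    wav_file = sys.argv[1]
    result = decoder.decode_wav(wav_file)
    print(result)

WeNet is built around streaming (U2) models, so chunk-by-chunk decoding is also supported in the runtime, but the exact Python API for that has shifted between releases, so check the docs rather than trusting my memory.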


No problem!

There are other ML TTS models that are both lightweight and able to run on a CPU. Check out Glow-TTS for something that will probably work.

Also, swap out the HiFi-GAN vocoder for MelGAN or MB-MelGAN (multi-band MelGAN), as these will better support your use case.

I ran this exact setup on cheap DigitalOcean droplets (without GPUs) and it ran faster than real time. It should work on a Pi.

Unfortunately, I'm not aware of STT models that operate under these same hardware constraints, but you should be good to go for TTS. With a little poking around, I'm sure you can find a solution for STT too.
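
For what it's worth, here's roughly what that combination looks like in code. I'm assuming the Coqui TTS package, since it ships released Glow-TTS and MB-MelGAN models; the model names below come from Coqui's model zoo and may differ from whatever was actually used here:

    # pip install TTS   (Coqui TTS; a CPU-only install is fine)
    from TTS.api import TTS

    # Glow-TTS trained on LJSpeech. I believe the default vocoder Coqui pulls
    # for this model is multi-band MelGAN, which keeps CPU inference fast,
    # but verify with "tts --list_models" or the downloaded model config.
    tts = TTS(model_name="tts_models/en/ljspeech/glow-tts",
              progress_bar=False, gpu=False)

    # Synthesize a sentence straight to a WAV file.
    tts.tts_to_file(text="Testing Glow-TTS on the Pi.", file_path="out.wav")

If you need to force a specific vocoder, the tts CLI takes a --vocoder_name flag, and the Python constructor accepts vocoder_path / vocoder_config_path for local checkpoints.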




