Thanks for this; I actually appreciate the honesty. It is always difficult for me to parse the actual quality of things I don't have intimate experience with.
Can I ask another question? If I wanted to hack around with STT and TTS (inference only) on a Pi (4B+), is there anything approximately appropriate that can be done on-device? (I could process on my main machine, but I'd love to do it on the Pi even with a decent delay.)
They provide support for running on a Raspberry Pi and it runs in real time. I have tried the desktop version and the quality is good enough when the audio is clean.
There are other ML TTS models that are lightweight and can run on a CPU. Check out Glow-TTS for something that will probably work.
Also, swap out the HiFi-GAN vocoder for MelGAN or MB-MelGAN, as they're faster for CPU-only inference and will better suit your use case.
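If the library in question is Coqui TTS (which is where those Glow-TTS / MB-MelGAN names come from), here's a rough sketch of pairing the two for CPU-only inference. The model names and the ModelManager/Synthesizer calls are assumptions based on the released model zoo, so double-check them against whatever version you actually install:

```python
# Sketch: Glow-TTS spectrogram model + MB-MelGAN vocoder, CPU only.
# Assumes the Coqui TTS package (`pip install TTS`) and its released
# LJSpeech checkpoints; verify model names with `tts --list_models`.
from pathlib import Path

import TTS
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

# The package ships a .models.json index of downloadable checkpoints.
manager = ModelManager(Path(TTS.__file__).parent / ".models.json")

# Download (or reuse cached) checkpoints for the model and the vocoder.
tts_ckpt, tts_config, _ = manager.download_model("tts_models/en/ljspeech/glow-tts")
voc_ckpt, voc_config, _ = manager.download_model("vocoder_models/en/ljspeech/multiband-melgan")

# Wire them together; use_cuda=False keeps everything on the CPU.
synth = Synthesizer(
    tts_checkpoint=tts_ckpt,
    tts_config_path=tts_config,
    vocoder_checkpoint=voc_ckpt,
    vocoder_config=voc_config,
    use_cuda=False,
)

wav = synth.tts("Testing Glow-TTS with a multi-band MelGAN vocoder on CPU.")
synth.save_wav(wav, "out.wav")
```

If I remember right, the `tts` CLI can do the same pairing via `--model_name` and `--vocoder_name` if you'd rather not write any Python.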
I ran this exact setup on cheap DigitalOcean droplets (no GPU) and it ran faster than real time, so it should work on a Pi.
Unfortunately I'm not aware of STT models that operate under these same hardware constraints, but you should be good to go for TTS. With a little bit of poking around, I'm sure you can find a solution for STT too.