Hacker News

I worked on this for a couple years during a previous startup attempt.

I built a custom STT model with Kaldi [0] and served it using a modified version of this server [1]. I deployed it to a 4 GB EC2 instance with configurable Docker layers (one for core utils, one for speech utils, one for the model) so we could spin up as many servers as we needed for each language.

I would recommend the WebRTC or GStreamer approach, but I wouldn't recommend trying to build your own model. It's really hard. Google's Cloud API [2] works well across lots of accents, and the price is honestly about the same as running your own server. If you want to host your own STT (for privacy or whatever), I'd recommend Coqui [3] (from the team behind Mozilla's DeepSpeech project). Note that it will likely be much, much worse on accents than Google's model.
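Whether you go through WebRTC or GStreamer, the streaming side boils down to the same idea: the client captures PCM audio and ships it to the server in small chunks so transcription can start before the recording ends. Here is a minimal stdlib-only sketch of the client-side chunking (the 100 ms chunk size and 16 kHz mono 16-bit format are illustrative assumptions, not requirements of any of these servers):

```python
import io
import math
import struct
import wave

def chunk_wav(wav_bytes, chunk_ms=100):
    """Split a WAV file's PCM payload into fixed-duration chunks,
    the way a client would feed a streaming STT server."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        frames_per_chunk = w.getframerate() * chunk_ms // 1000
        while True:
            data = w.readframes(frames_per_chunk)
            if not data:
                break
            yield data

def make_test_wav(seconds=1, rate=16000):
    """Generate a mono 16-bit 440 Hz sine-wave WAV in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        samples = (int(32767 * 0.3 * math.sin(2 * math.pi * 440 * i / rate))
                   for i in range(rate * seconds))
        w.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return buf.getvalue()

# 1 s of 16 kHz audio in 100 ms chunks -> 10 chunks of 1600 frames each
chunks = list(chunk_wav(make_test_wav()))
```

As I recall, kaldi-gstreamer-server accepts raw chunks like these over a websocket; WebRTC layers codecs and transport on top, but the shape of the problem is the same.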

[0]: https://kaldi-asr.org/

[1]: https://github.com/alumae/kaldi-gstreamer-server

[2]: https://cloud.google.com/speech-to-text

[3]: https://coqui.ai/code

Edit: Forgot to mention that there's also a YC company called Deepgram that provides ASR/STT as a service; you could give them a shot: https://deepgram.com/




In my experience, Google's API completely fails when any slightly unusual vocabulary is involved (e.g., in this instance, grandparents talking about their past jobs) and tends to just silently skip over things. Amazon's wasn't much better with vocabulary, but at least it didn't leave things out, so you could see the problems. I don't have experience with any of the others, but I suspect that for my purposes (subtitles for maths education videos) no one will have made an appropriate model yet.
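One mitigation worth knowing about: Google's API supports speech adaptation ("phrase hints") that can bias recognition toward domain vocabulary, which may help with terms like these. A sketch of a v1 REST `speech:recognize` request body (the phrase list and field values here are illustrative assumptions; check the current docs before relying on them):

```python
import base64
import json

def build_recognize_request(audio_bytes, phrases):
    """Build a JSON body for Google's speech:recognize v1 REST endpoint,
    biasing recognition toward a list of domain phrases."""
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": "en-US",
            # speechContexts nudges the recognizer toward these phrases
            "speechContexts": [{"phrases": phrases}],
        },
        # audio content is base64-encoded raw bytes in the REST API
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }

body = build_recognize_request(b"\x00\x00" * 160,
                               ["eigenvalue", "isomorphism"])
payload = json.dumps(body)
```

It won't fix silent skipping entirely, but in general phrase hints are the supported lever for out-of-vocabulary terms short of training a custom model.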



