Hi HN,
I'm sharing Monika, an open-source AI voice assistant I've built. The main focus was running the speech components locally, both to improve privacy and to make interactions sound more natural.
Key components:
* Speech-to-Text: Uses OpenAI's Whisper running locally.
* Text-to-Speech: Uses RealtimeTTS with the Orpheus model for emotional expression, also running locally.
* Language model: Uses Google Gemini on the backend.
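To make the architecture concrete, here's a hedged sketch of how one conversational turn flows through those three components. The function names (`transcribe`, `generate_reply`, `speak`) are stand-ins I invented for this illustration, not Monika's actual API; in the real project they would wrap Whisper, Gemini, and RealtimeTTS respectively.

```python
# Illustrative turn loop: speech in -> text -> reply -> synthesized speech.
# All three functions are stand-ins for the real libraries named in the post.

def transcribe(audio: bytes) -> str:
    """Stand-in for local Whisper speech-to-text."""
    return "hello there"

def generate_reply(text: str) -> str:
    """Stand-in for the Gemini backend call."""
    return f"You said: {text}"

def speak(text: str) -> str:
    """Stand-in for RealtimeTTS/Orpheus audio playback."""
    return f"[audio] {text}"

def handle_turn(audio: bytes) -> str:
    """One conversational turn: user speech in, assistant speech out."""
    user_text = transcribe(audio)
    reply = generate_reply(user_text)
    return speak(reply)

print(handle_turn(b"\x00\x01"))  # [audio] You said: hello there
```

The point of the sketch is just the data flow: only the middle step (the language model) leaves the machine; transcription and synthesis stay local.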
It includes Voice Activity Detection (VAD) and a basic web interface using Flask. The idea was to see how well local STT and expressive local TTS could work together for a conversational agent.
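For anyone curious what the VAD step does: below is a minimal energy-threshold sketch in pure Python. This is only an illustration of the idea; the names `frame_energy`/`is_speech` and the fixed threshold are my assumptions, and the project's actual detector may well be model-based rather than a simple energy gate.

```python
# Minimal energy-based voice activity detection, assuming frames of
# signed 16-bit PCM samples as plain Python ints. Illustrative only.

def frame_energy(samples):
    """Root-mean-square energy of one audio frame."""
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def is_speech(samples, threshold=500.0):
    """Classify a frame as speech when its RMS energy exceeds the threshold.

    The threshold is a tuning parameter; real systems typically adapt it
    to the ambient noise floor instead of hard-coding a value.
    """
    return frame_energy(samples) > threshold

# A loud frame passes the gate; near-silence does not.
loud = [4000, -3800, 4200, -4100] * 100
quiet = [10, -12, 8, -9] * 100
print(is_speech(loud), is_speech(quiet))  # True False
```

In a pipeline like this, frames gated out as silence are simply dropped, so the STT model only ever sees audio that plausibly contains speech.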
Tech stack: Python, Flask, Whisper, Gemini, RealtimeTTS.
Video demo: https://www.youtube.com/watch?v=_vdlT1uJq2k
The project is MIT licensed. I'd appreciate any feedback, thoughts on the approach, or suggestions you might have!