TLDR: We created a personalised Andrej Karpathy tutor that can respond to questions about his YouTube videos with sub-1-second voice-to-voice latency. We do this using a voice-enabled RAG agent. See later in the post for the demo link, GitHub repo and blog write-up.
A few weeks ago we released the world's fastest voice bot, achieving 500ms voice-to-voice response times, including a 200ms delay waiting for a user to stop speaking.
After reaching the front page of HN, we thought about how we could take this a step further based on feedback we were getting from the community. Many companies were looking for a way to implement function calling and RAG with voice interfaces while keeping latency low enough. We couldn’t find many resources online on how to do this in a way that:
1. Allowed us to achieve sub-second voice-to-voice latency
2. Was more flexible than existing solutions. Vapi, Retell and [Bland.ai](http://Bland.ai) are too opinionated, and since they just orchestrate APIs, they incur network latency at every step (see requirement 1)
3. Had unit economics that actually work at scale
So we decided to create an implementation of our own.
Process:
As we mentioned in our previous release, if you want to achieve response times this low you need to make everything as local as possible. This was our setup:
- Local STT: Deepgram model
- Local Embedding model: Nomic v1.5
- Local VectorDB: Turso
- Local LLM: Llama 3B
- Local TTS: Deepgram model
Compared to our previous example, the only new components were:
- Local embedding model: We chose the Nomic Embed Text v1.5 model, which gave a processing time of roughly 200ms.
- Local VectorDB: Turso offers local embedded replicas combined with edge databases, which meant we were able to achieve 0.01 second read times. Pinecone also gave us good times of 0.043 seconds.
The above changes led us to achieve sub-1-second voice-to-voice response times.
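To make the arithmetic behind that claim explicit, here is a rough latency budget using only the figures quoted in this post. It assumes the new stages run strictly one after another on top of the previous 500ms pipeline, which is a simplification; the function and constant names are ours for illustration.

```python
# Rough latency budget in milliseconds, using only figures quoted in this post.
BASELINE_VOICE_TO_VOICE_MS = 500  # previous bot, includes the 200ms end-of-speech wait
EMBEDDING_MS = 200                # Nomic Embed Text v1.5, approximate
VECTOR_READ_MS = {"turso": 10, "pinecone": 43}  # 0.01s and 0.043s read times


def estimated_latency_ms(vector_db: str) -> int:
    """Sum the stages as if they ran sequentially (an assumption, not a measurement)."""
    return BASELINE_VOICE_TO_VOICE_MS + EMBEDDING_MS + VECTOR_READ_MS[vector_db]


for name in VECTOR_READ_MS:
    total = estimated_latency_ms(name)
    print(f"{name}: ~{total}ms, under 1s: {total < 1000}")
```

Either vector store leaves a couple of hundred milliseconds of headroom under the 1 second target in this estimate.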
Application:
With Andrej Karpathy’s announcement of [Eureka Labs](https://eurekalabs.ai/), a new AI+Education company, we thought we would create our very own personalised Andrej tutor.
Listen to any one of his YouTube lectures. As soon as you start speaking, the video will pause and he will reply. Once your question has been answered, you can tell him to continue with the lecture and the video will automatically start playing again.
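The pause and resume behaviour can be thought of as a small state machine. The sketch below is only an illustration of that idea, not the code in our repo; the state names, event names and the `next_state` helper are all hypothetical, and it assumes something upstream has already turned speech into events such as "user_said_continue".

```python
from enum import Enum


class State(Enum):
    PLAYING = "playing"      # lecture video is running
    ANSWERING = "answering"  # video paused, tutor is listening or replying
    PAUSED = "paused"        # answer finished, waiting for the user


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.PLAYING, "user_started_speaking"): State.ANSWERING,
    (State.PAUSED, "user_started_speaking"): State.ANSWERING,
    (State.ANSWERING, "answer_finished"): State.PAUSED,
    (State.ANSWERING, "user_said_continue"): State.PLAYING,
    (State.PAUSED, "user_said_continue"): State.PLAYING,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, ignoring events that do not apply."""
    return TRANSITIONS.get((state, event), state)


def video_should_play(state: State) -> bool:
    return state is State.PLAYING


state = State.PLAYING
for event in ["user_started_speaking", "answer_finished", "user_said_continue"]:
    state = next_state(state, event)
    print(event, "->", state.value, "| video playing:", video_should_play(state))
```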
Demo: https://educationbot.cerebrium.ai/
Blog: https://www.cerebrium.ai/blog/creating-a-realtime-rag-voice-...
Github Repo: https://github.com/CerebriumAI/examples/tree/master/19-voice...
For demo purposes:
- We used OpenAI for GPT-4-mini and embeddings (it's cheaper to run on a CPU than on GPUs when running demos at scale). These changes add about 1 second to the response time.
- We used ElevenLabs to clone his voice to make replies sound more realistic. This adds about 300ms to the response time.
The improvements we would like the community to contribute are:
- Embed the video screens as well, so that when you ask certain questions it can show you the relevant lecture slide for the same chunk that it used as context for the answer.
- Insert the timestamps into the vectorDB, so that if a question will be answered later in the lecture he can let you know.
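As a starting point for the timestamp idea, here is a minimal in-memory sketch. It is not how the repo works today: the `Chunk` type, the `covered_later` helper, the toy 2-dimensional embeddings and the example timestamps are all made up for illustration, and a real version would store `start_s` next to each embedding in the vectorDB.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_s: float          # where this chunk starts in the video, in seconds
    embedding: list[float]  # toy vector; a real one would come from the embedding model


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def covered_later(query_embedding: list[float], chunks: list[Chunk], playback_s: float):
    """Return (True, chunk) if the best-matching chunk starts after the playback position."""
    hit = max(chunks, key=lambda c: cosine(query_embedding, c.embedding))
    return hit.start_s > playback_s, hit


chunks = [
    Chunk("intro to tokenization", 60.0, [1.0, 0.0]),
    Chunk("attention explained", 1800.0, [0.0, 1.0]),
]
later, hit = covered_later([0.1, 0.9], chunks, playback_s=300.0)
if later:
    print(f"That is covered later, around {hit.start_s / 60:.0f} minutes in: {hit.text}")
```

The same metadata row could also hold a reference to the slide image for that chunk, which would cover the first improvement in the list.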
This unlocks so many use cases in education, employee training, sales, etc. It would be great to see what the community builds!