Hacker News
Agora's Conversational AI Engine (medium.com/agora-io)
14 points by iamhermes 4 months ago | 5 comments



This seems really cool! From an architectural standpoint, how does the Conversational AI Engine handle concurrent voice streams and manage real-time speech-to-text and text-to-speech processing at scale? I ask especially since I’ve seen other implementations struggle with latency and reliability under heavy loads.


The article doesn't go into that detail; it's covered in the documentation.

RE: Concurrent users, the API reference shows that the `remote_rtc_uids` field lets developers set the list of users that can interact with the AI.

> remote_rtc_uids: array[string] - The list of user IDs that the agent subscribes to in the channel. Only subscribed users can interact with the agent. "*" means that the agent subscribes to all users in the channel.
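
For illustration, that field sits in the agent's join config, something like this (only `remote_rtc_uids` is quoted from the docs; the surrounding field names are my assumption):

    # Sketch of the agent join properties; only remote_rtc_uids is from the docs above.
    properties = {
        "channel": "demo-channel",            # assumed field name for the RTC channel
        "remote_rtc_uids": ["1001", "1002"],  # agent only hears and responds to these users
        # "remote_rtc_uids": ["*"],           # or: subscribe to every user in the channel
    }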

RE: Real-time speech-to-text, the audio is streamed over Agora's low-latency SD-RTN; the Conversational AI Engine joins the stream, handles the STT, and passes the text as input to the LLM of the developer's choosing (using either the OpenAI standard or Gemini).
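
Roughly, choosing the LLM is just a block in that same config pointing at any OpenAI-compatible endpoint; a sketch (field names here are my best guess from the docs, not from the article):

    # Hypothetical llm block; any OpenAI-style chat-completions endpoint should work.
    llm = {
        "url": "https://api.openai.com/v1/chat/completions",  # or a Gemini-compatible URL
        "api_key": "<LLM_API_KEY>",
        "system_messages": [
            {"role": "system", "content": "You are a helpful voice assistant."}
        ],
        "params": {"model": "gpt-4o-mini"},  # model choice is up to the developer
    }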

RE: Text-to-speech, that's a setting when initializing the Conversational AI Engine. It streams the LLM's output text to the developer's TTS provider and pipes the resulting audio back into the Agora stream.
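
The TTS setting is a similar vendor block in the same request (the vendor name and params below are illustrative, not from the article):

    # Hypothetical tts block; params are provider-specific.
    tts = {
        "vendor": "elevenlabs",        # whichever TTS provider the developer uses
        "params": {
            "key": "<TTS_API_KEY>",
            "voice_id": "<VOICE_ID>",  # provider-specific voice selection
        },
    }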

The latency is very low, and the Voice Activity Detection is configurable (even the defaults are on point).
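
The VAD knobs look something like this (the parameter names are my guess at the shape; the docs list the exact ones):

    # Hypothetical vad block; tunes how turn-taking is detected.
    vad = {
        "silence_duration_ms": 640,  # pause length that ends the user's turn
        "threshold": 0.5,            # speech-detection sensitivity
    }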


Interesting. Does it matter which LLM provider we use?


It makes it easy for devs to use any model and TTS solution to create natural human-AI interaction.


Yes, very easy... dare I say effortless. No complex infrastructure to deploy; it's as simple as setting some variables in a POST request.
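
To make that concrete, a rough end-to-end sketch in Python (the endpoint path and basic-auth scheme are from Agora's REST docs as I remember them; verify against the API reference before relying on it):

    import requests
    from requests.auth import HTTPBasicAuth

    APP_ID = "<AGORA_APP_ID>"

    # Assumed endpoint shape for starting an agent; check the current API reference.
    url = f"https://api.agora.io/api/conversational-ai-agent/v2/projects/{APP_ID}/join"

    payload = {
        "name": "my-voice-agent",  # unique name for this agent instance
        "properties": {
            "channel": "demo-channel",
            "token": "<RTC_TOKEN>",    # token the agent uses to join the channel
            "agent_rtc_uid": "9000",
            "remote_rtc_uids": ["*"],  # from the docs: subscribe to all users
            "llm": {
                "url": "https://api.openai.com/v1/chat/completions",
                "api_key": "<LLM_API_KEY>",
                "params": {"model": "gpt-4o-mini"},
            },
            "tts": {
                "vendor": "<TTS_VENDOR>",
                "params": {"key": "<TTS_API_KEY>"},
            },
        },
    }

    resp = requests.post(
        url,
        json=payload,
        # Agora's RESTful APIs authenticate with a customer key/secret via basic auth.
        auth=HTTPBasicAuth("<CUSTOMER_KEY>", "<CUSTOMER_SECRET>"),
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())  # response includes the agent id, used later to stop the agent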



