As someone not super familiar with deployment, but familiar enough to know that GPUs are difficult to work with because they're costly and sometimes hard to allocate: apart from optimizing the models themselves, what's the trick for handling cloud GPU resources at scale to serve something like this, supporting many real-time connections with low latency? Do you just allocate a GPU per WebSocket connection? That would mean keeping a pool of GPU instances allocated in case someone connects (otherwise cold-start time would be bad), but isn't that super expensive? I feel like I'm missing some trick in the cloud space that makes this kind of thing possible and affordable.
We're partnering with GPU infrastructure providers like Replicate. In addition, we've done some engineering to bring down our stack's cold and warm boot times. With sufficient caches on disk, and potentially a running process/memory snapshot, we can bring these cold/warm boot times down to under 5 seconds. Of course, we're making progress every week on this, and it's getting better all the time.
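For anyone curious what the "keep things warm" part looks like mechanically, here's a minimal Python sketch of the general warm-pool pattern (not their actual stack; load_model() and the timings are made up for illustration): pay the model load once per worker process before any user connects, then hand incoming sessions to already-warm workers.

    # Rough sketch of a warm pool: the expensive load happens at pool-fill
    # time, not when a user connects. load_model() is a made-up placeholder.
    import multiprocessing as mp
    import time

    POOL_SIZE = 2  # warm workers held in reserve (this is the part that costs money)

    def load_model():
        time.sleep(3)              # stands in for weight download / GPU init
        return "model handle"

    def worker(jobs):
        model = load_model()       # paid once per worker, before any traffic
        while True:
            session = jobs.get()
            if session is None:    # sentinel: shut the worker down
                break
            print(f"serving {session} on a warm worker ({model})")

    if __name__ == "__main__":
        jobs = mp.Queue()
        pool = [mp.Process(target=worker, args=(jobs,)) for _ in range(POOL_SIZE)]
        for p in pool:
            p.start()              # warm the pool before anyone connects
        for s in ("sess-1", "sess-2", "sess-3"):
            jobs.put(s)            # connecting users skip the 3-second load
        for _ in pool:
            jobs.put(None)
        for p in pool:
            p.join()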
You can do parallel rendering jobs on a GPU. (Think of how each GPU-accelerated window on a desktop OS has its own context for rendering resources.)
So if the rendering is lightweight enough, you can potentially multiplex lots of simultaneous jobs onto a smaller pool of beefy GPU server instances.
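A toy illustration of that multiplexing, assuming each job only needs the GPU for a short burst (run_on_gpu(), the pool size, and the timings are all invented for the example): many sessions share a few GPU slots, and each session only holds a slot while it's actually doing work.

    # Sketch: 40 sessions multiplexed onto 4 GPU slots via a simple queue.
    import asyncio

    GPU_WORKERS = 4          # a few beefy instances...
    SESSIONS = 40            # ...serving many concurrent connections

    async def run_on_gpu(gpu_id, session):
        await asyncio.sleep(0.05)      # pretend this is a short render/inference step
        return f"frame for session {session} rendered on gpu {gpu_id}"

    async def session_task(session, gpus):
        gpu_id = await gpus.get()      # borrow a GPU slot only while actually working
        try:
            print(await run_on_gpu(gpu_id, session))
        finally:
            gpus.put_nowait(gpu_id)    # return it so idle sessions don't hold hardware

    async def main():
        gpus = asyncio.Queue()
        for gpu_id in range(GPU_WORKERS):
            gpus.put_nowait(gpu_id)
        await asyncio.gather(*(session_task(s, gpus) for s in range(SESSIONS)))

    asyncio.run(main())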
Still, all these GPU-backed cloud services are expensive to run. Right now that's paid for by VC money, just like Uber used to be substantially cheaper than taxis when they were starting out. Similarly, everybody in consumer AI hopes to be the winner who can eventually jack up prices after burning billions acquiring customers.
(Not the author but I work in real-time voice.) WebSockets don't really translate to actual GPU load, since they spend a ton of time idling. So strictly speaking, you don't need a GPU per WebSocket assuming your GPU infra is sufficiently decoupled from your user-facing API code.
That said, a GPU per generation (for some operational definition of "generation") isn't uncommon, but there's a standard bag of tricks, like GPU partitioning and batching, that you can use to maximize throughput.
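Rough shape of the decouple-and-batch trick, in made-up Python (gpu_infer_batch() is a placeholder, not any real library's API): connection handlers only enqueue requests, and a single loop drains the queue in small batches for the GPU, so idle sockets never hold GPU resources.

    # Sketch: sockets enqueue work; one batcher feeds the GPU.
    import asyncio

    MAX_BATCH = 8
    MAX_WAIT_S = 0.01      # how long to wait for more requests before firing a batch

    async def gpu_infer_batch(items):
        await asyncio.sleep(0.02)                  # one GPU pass for the whole batch
        return [f"reply to {item}" for item in items]

    async def batcher(queue):
        loop = asyncio.get_running_loop()
        while True:
            item, fut = await queue.get()          # wait for at least one request
            batch, futs = [item], [fut]
            deadline = loop.time() + MAX_WAIT_S
            while len(batch) < MAX_BATCH:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
                futs.append(fut)
            for fut, reply in zip(futs, await gpu_infer_batch(batch)):
                fut.set_result(reply)              # each connection gets its own answer

    async def connection_handler(queue, name):
        fut = asyncio.get_running_loop().create_future()
        await queue.put((f"audio chunk from {name}", fut))
        print(await fut)                           # the socket itself never touches a GPU

    async def main():
        queue = asyncio.Queue()
        batch_task = asyncio.create_task(batcher(queue))
        await asyncio.gather(*(connection_handler(queue, f"conn-{i}") for i in range(20)))

    asyncio.run(main())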
While sometimes degrading the experience, a little or by a lot, thanks to possible "noisy neighbors". Worth keeping in mind that most things are trade-offs somehow :) Mostly a concern for "real-time" rather than batched/async stuff, of course.