
As someone not super familiar with deployment, but familiar enough to know that GPUs are difficult to work with because they're costly and sometimes hard to allocate: apart from optimizing the models themselves, what's the trick for handling cloud GPU resources at scale to serve something like this, supporting many realtime connections with low latency? Do you just allocate a GPU per WebSocket connection? That would mean keeping a pool of GPU instances allocated in case someone connects, since otherwise cold start time would be bad... but isn't that super expensive? I feel like I'm missing some trick in the cloud space that makes this kind of thing possible and affordable.



We're partnering with GPU infrastructure providers like Replicate. In addition, we've done some engineering to bring down our stack's cold and warm boot times: with sufficient caches on disk, and potentially a running process/memory snapshot, we can bring cold/warm boots down to under 5 seconds. Of course, we're making progress every week on this, and it's getting better all the time.
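For illustration only (the URL and cache path below are placeholders, and a real boot path involves much more than weight fetching), the disk-cache part of that idea boils down to something like:

    import os
    import time
    import urllib.request

    CACHE_DIR = "/var/cache/model-weights"            # assumed cache location
    WEIGHTS_URL = "https://example.com/model.bin"     # placeholder artifact URL

    def ensure_cached(url: str = WEIGHTS_URL, cache_dir: str = CACHE_DIR) -> str:
        """Fetch model weights only if they aren't already on local disk.

        A true cold boot pays for the network download; warm boots just
        return the cached path, which is what keeps restart times low.
        """
        os.makedirs(cache_dir, exist_ok=True)
        path = os.path.join(cache_dir, os.path.basename(url))
        if not os.path.exists(path):
            start = time.time()
            urllib.request.urlretrieve(url, path)     # cold path: slow fetch
            print(f"cold fetch took {time.time() - start:.1f}s")
        return path                                   # warm path: near-instant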


Not the author, but their description implies that they are running more than one stream per GPU.

So you can basically spin up a few GPUs as a baseline, allocate streams to them, then boot up a new GPU when the existing ones get overwhelmed.

It doesn't look very different from standard cloud compute management. I’m not saying it’s easy, but it's definitely not rocket science either.
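To make that concrete, a toy version of the "baseline pool plus overflow" idea might look like this (the capacity number is made up, and a real scheduler would also handle stream teardown, health checks, etc.):

    from dataclasses import dataclass, field

    STREAMS_PER_GPU = 4        # assumed per-GPU capacity; depends on the model

    @dataclass
    class GpuWorker:
        name: str
        streams: set = field(default_factory=set)

        def has_capacity(self) -> bool:
            return len(self.streams) < STREAMS_PER_GPU

    class Pool:
        """Keep a baseline of warm GPUs; only boot another when all are full."""

        def __init__(self, baseline: int = 2):
            self.workers = [GpuWorker(f"gpu-{i}") for i in range(baseline)]

        def assign(self, stream_id: str) -> GpuWorker:
            for w in self.workers:
                if w.has_capacity():
                    w.streams.add(stream_id)
                    return w
            # every existing GPU is saturated: this is where the cold-start
            # cost of booting a fresh instance gets paid
            w = GpuWorker(f"gpu-{len(self.workers)}")
            self.workers.append(w)
            w.streams.add(stream_id)
            return w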


You can do parallel rendering jobs on a GPU. (Think of how each GPU-accelerated window on a desktop OS has its own context for rendering resources.)

So if the rendering is lightweight enough, you can multiplex potentially lots of simultaneous jobs onto a smaller pool of beefy GPU server instances.
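As one concrete example (not this product's actual stack, just a common way to share a device), PyTorch's CUDA streams let independent lightweight jobs overlap on a single GPU:

    import torch

    def render_frame(x: torch.Tensor) -> torch.Tensor:
        # stand-in for a real rendering/inference kernel
        return torch.relu(x @ x.T)

    def multiplex(jobs, n_streams: int = 4):
        """Run independent jobs on one GPU using separate CUDA streams so that
        light kernels from different sessions can overlap on the same device."""
        streams = [torch.cuda.Stream() for _ in range(n_streams)]
        outputs = []
        for i, x in enumerate(jobs):
            with torch.cuda.stream(streams[i % n_streams]):
                outputs.append(render_frame(x.cuda(non_blocking=True)))
        torch.cuda.synchronize()   # wait for all streams before reading results
        return outputs

    # outputs = multiplex([torch.randn(256, 256) for _ in range(16)])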

Still, all these GPU-backed cloud services are expensive to run. Right now it’s paid by VC money — just like Uber used to be substantially cheaper than taxis when they were starting out. Similarly everybody in consumer AI hopes to be the winner who can eventually jack up prices after burning billions getting the customers.


(Not the author but I work in real-time voice.) WebSockets don't really translate to actual GPU load, since they spend a ton of time idling. So strictly speaking, you don't need a GPU per WebSocket assuming your GPU infra is sufficiently decoupled from your user-facing API code.

That said, a GPU per generation (for some operational definition of "generation") isn't uncommon, but there's a standard bag of tricks, like GPU partitioning and batching, that you can use to maximize throughput.
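A bare-bones sketch of that decoupling plus dynamic batching (names like run_model and the batch-size/latency numbers are made up, not any particular vendor's API):

    import asyncio

    REQUESTS: asyncio.Queue = asyncio.Queue()   # fed by the WebSocket handlers
    MAX_BATCH = 8
    MAX_WAIT_S = 0.02                           # assumed batching latency budget

    async def handle_message(payload: str) -> str:
        """Runs per WebSocket connection; it never touches the GPU directly."""
        fut = asyncio.get_running_loop().create_future()
        await REQUESTS.put((payload, fut))
        return await fut                        # resolved by the batch worker

    async def gpu_batch_worker():
        """Drain the queue into small batches so one GPU serves many connections."""
        while True:
            batch = [await REQUESTS.get()]                    # block until work
            deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
            while len(batch) < MAX_BATCH:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(REQUESTS.get(), timeout))
                except asyncio.TimeoutError:
                    break
            results = run_model([p for p, _ in batch])        # single GPU call
            for (_, fut), res in zip(batch, results):
                fut.set_result(res)

    def run_model(payloads):
        # placeholder for the actual batched inference
        return [p.upper() for p in payloads]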


> that you can use to maximize throughput

While sometimes degrading the experience, by a little or by a lot, thanks to possible "noisy neighbors". Worth keeping in mind that most things are trade-offs somehow :) Mostly important for "real-time" rather than batched/async stuff, of course.


It is expensive. They charge in 6-second increments, but I haven't found anywhere that says how much each 6 seconds of stream costs.

Okay, found it: $0.24 per minute, at the bottom of the pricing page.

At $0.24/minute that works out to $14.40/hour of revenue, which means they can spend roughly $14/hour on GPU and still break even. So I believe that leaves a bit of room for profit.


Scroll down the page and the per-minute pricing is there: https://www.tavus.io/pricing

We bill in 6-second increments, so you only pay for what you use, in 6-second bins.


Oh sorry, I didn't see that. Got it: $0.24 per minute.



