Yep, we're actively working on getting this down. We can meet SLAs with tuning for the real-time vision workloads, but eliminating this compromise entirely is our next big development task.
For consumers, we want to just pass on the price-to-performance ratio. For enthusiasts and companies, we do see that people want their own models and the ability to use the massive amounts of data they have.
Also curious about this. We have a 30-day content retention policy, and we need access to your fine-tuned model/LoRA if we're deploying it. If there's anything we can change, happy to hear it out.
We usually charge by GPU hour for those finetunes, around $8-10 depending on GPU type and volume! This is similar to Modal, but since the engine is fully ours, you don't wait ~1 min for cold starts. Ideally we'll make onboarding super frictionless and self-serve, but we're onboarding people manually for now.
Haha sorry for the typo! Your F500 use case is exactly who we want to target, especially as they start serving finetunes on their own data. Thanks for the feedback!
Our SLA is actually higher, and we're lower priced. We're also using this as a step toward serving finetuned models for much cheaper than Fireworks/Together, without the horrible cold starts of Modal. We're essentially trying to prove that our engine can hang with the best providers while multiplexing models.