Hacker News new | past | comments | ask | show | jobs | submit login

The problem to solve is GPU utilization rate. Training a “huge model” on n GPUs will have a cyclical pattern of idle GPUs and GPUs are expensive. This approach is likely scheduling GPU access across t training jobs to insure GPU is always in use. Maximizing GPU utility is a key requirement for training as a service.

The biggest issue is that the failure mode of this approach (one fault hoses the entire pipeline of T jobs) is exactly opposing the goals of its most likely users: Training SaaS providers.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: