The problem to solve is GPU utilization. Training a "huge model" on N GPUs produces a cyclical pattern of idle GPUs, and GPUs are expensive. This approach is most likely about scheduling GPU access across T training jobs to ensure the GPUs are always in use. Maximizing GPU utilization is a key requirement for training as a service.
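To make that scheduling idea concrete, here is a minimal, hypothetical sketch: a round-robin scheduler hands the GPU to the next job's training step whenever the current job would otherwise leave it idle. The `TrainingJob` class, the `schedule` function, and the step counts are illustrative assumptions, not anything from the approach being discussed.

```python
# Hypothetical sketch: interleave steps from several training jobs so the GPU
# is never idle while one job waits on I/O or communication.
from collections import deque


class TrainingJob:
    def __init__(self, name, total_steps):
        self.name = name
        self.steps_left = total_steps

    def run_one_step(self):
        # Placeholder for a real forward/backward pass on the GPU.
        self.steps_left -= 1
        print(f"{self.name}: step done, {self.steps_left} remaining")


def schedule(jobs):
    """Round-robin the GPU across jobs: whenever one job would leave the
    GPU idle (e.g., waiting on a data loader), hand it to the next job."""
    queue = deque(jobs)
    while queue:
        job = queue.popleft()
        job.run_one_step()
        if job.steps_left > 0:
            queue.append(job)  # job re-enters the queue; GPU stays busy


if __name__ == "__main__":
    schedule([TrainingJob("job_a", 3), TrainingJob("job_b", 2)])
```

The trade-off this illustrates is the one raised next: once T jobs share the same pipeline, a fault in the scheduler affects all of them, not just one.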
The biggest issue is that the failure mode of this approach (one fault hoses the entire pipeline of T jobs) directly opposes the goals of its most likely users: training-SaaS providers.