Why would they create a bespoke abstraction layer instead of just relying on k8s?
There is only pain on the path of recreating it, it will end up almost as complex as k8s and it will be hell to hire and train for. Best to just use something battle-tested that works with a large pool of people trained for it, even better: their own LLM has gobbled up all the content possible about k8s to help their engineers. K8s complexity came to be for reasons discovered during growing the stack which anyone doing a bespoke similar system might run into, and it's pretty modular since you can pick-and-choose the parts you actually need for your cluster.
Wasting manpower to recreate a bespoke Kubernetes doesn't sound great for a company burning billions per quarter, it's just more waste.
I tried to expressly say that Kubernetes could be used.
And, given their unusual needs and scale, there will probably be some kind of bespoke abstraction, whether it's an SDK, or a document, that says this is the subset of things you should be using, and how to use them, so that we can practically deploy our very unusual setup with different providers and customer facilities.
OpenAI has the resources to define that abstraction, and make it work well across multiple providers.
There is only pain on the path of recreating it, it will end up almost as complex as k8s and it will be hell to hire and train for. Best to just use something battle-tested that works with a large pool of people trained for it, even better: their own LLM has gobbled up all the content possible about k8s to help their engineers. K8s complexity came to be for reasons discovered during growing the stack which anyone doing a bespoke similar system might run into, and it's pretty modular since you can pick-and-choose the parts you actually need for your cluster.
Wasting manpower to recreate a bespoke Kubernetes doesn't sound great for a company burning billions per quarter, it's just more waste.