Are you colocating your storage plane and GPUs? What’s ingress/egress to a node and are those links near saturated (with comfy room for returning model output, but I’m assuming moving the models dwarfs I/O from customer workloads)? Do you see high reusability across workloads? Have you explored chunking/hashing your workloads IPFS style (do these models radically change, or is there a high chance that two models that share an ancestor also share 50% of their bits)? If you’re chunking your models and colocating the storage plane with GPUs, can you distribute chunks to increase the hit-rate of a chunk being on-node? Is your scheduler aware of the existing distribution of chunks across nodes? Given the workload patterns you see, and the shared bits between models, is it even practical to try and chase a local cache hit rate to reduce bits-over-wire? If you have a cache miss, what’s the path to getting those bits to the node with the GPU? How does the cost of that path compare to the cost of the scheduler making a decision?
Can only publicly answer one of these:
- reusability of workloads: yes, introducing the community templates feature (https://banana.dev/templates) for common models has dramatically cut back on storage requirements and transfers. We're still majority custom code, but it's helped prevent us from exploding storage over people running the same "model of the week"
As for caching / chunking, sounds like you're thinking on our wavelength, perhaps even ahead of us, so maybe I should take you up on the offer to chat! Will reach out.
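For anyone following along, here is a rough sketch of the chunking/hashing idea from the question above. It is not how Banana does it (that wasn't disclosed); every name here (`chunk_hashes`, `shared_fraction`, `pick_node`, the `gpu-1`/`gpu-2` node labels) is made up for illustration, and the 4-byte chunk size is only there to keep the toy readable. It uses fixed-size chunks, which only deduplicate when bytes stay aligned between versions.

```python
import hashlib

CHUNK_SIZE = 4  # toy value (assumption); a real system would use far larger chunks


def chunk_hashes(blob: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split a blob into fixed-size chunks and return the SHA-256 hex digest of each."""
    return [
        hashlib.sha256(blob[i:i + chunk_size]).hexdigest()
        for i in range(0, len(blob), chunk_size)
    ]


def shared_fraction(cached: list[str], wanted: list[str]) -> float:
    """Fraction of the wanted model's chunks that already exist in the cached model."""
    if not wanted:
        return 0.0
    have = set(cached)
    return sum(1 for h in wanted if h in have) / len(wanted)


def pick_node(wanted: list[str], node_caches: dict[str, set[str]]) -> tuple[str, int]:
    """Chunk-aware placement: choose the node that has to fetch the fewest chunks."""
    needed = set(wanted)
    best = min(node_caches, key=lambda node: len(needed - node_caches[node]))
    return best, len(needed - node_caches[best])


# Hypothetical data: a "base" model and a descendant that differs in one chunk.
base = chunk_hashes(b"AAAABBBBCCCCDDDD")
descendant = chunk_hashes(b"AAAABBBBXXXXDDDD")

# Hypothetical cluster: gpu-1 already holds the base model, gpu-2 is cold.
caches = {"gpu-1": set(base), "gpu-2": set()}

overlap = shared_fraction(base, descendant)
node, to_fetch = pick_node(descendant, caches)
```

Whether this is worth doing depends on the questions above: how much real models actually overlap at the chunk level, and whether fetching the missing chunks costs more than the scheduler's extra bookkeeping.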