HPC and LLM workloads both tend to get optimized right up to the limits of their hardware. When you train models with over 405B parameters, process about 2 million tokens per second, and compute gradients for all of those parameters every few seconds, you end up at the boundary of latency and bandwidth at every scale (host to host, host to device, and the multiple transfer rates within each device). Typical LLM training at these scales combines three or more types of parallelism to avoid leaving devices idle, and of course it also has to deal with redundancy and the frequent failures of this erratic hardware (if a single H100 fails once every five years, 100K of them would see more than two failures per hour).
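The failure arithmetic is easy to check. Here is a rough sketch, assuming failures are independent and occur at a constant rate (a simplification, real fleets are messier); the function name `fleet_failures_per_hour` is just made up for illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def fleet_failures_per_hour(num_devices, years_between_failures):
    """Expected failures per hour across the fleet.

    Assumes each device fails independently at a constant rate of
    one failure per `years_between_failures` years.
    """
    per_device_rate = 1.0 / (years_between_failures * HOURS_PER_YEAR)
    return num_devices * per_device_rate


# Numbers from the comment above: 100K devices, one failure per device every 5 years.
rate = fleet_failures_per_hour(100_000, 5.0)
minutes_between = 60.0 / rate
print(f"{rate:.2f} failures/hour, one every {minutes_between:.1f} minutes")
# prints "2.28 failures/hour, one every 26.3 minutes"
```

So even with fairly reliable individual devices, a 100K-device job under these assumptions should expect to be interrupted roughly every half hour, which is why checkpointing and redundancy are not optional at this scale.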