Could you use some sort of RAID array of GPUs to compensate...?

nonplus · 2024-03-31T14:16:57 1711894617

nvidia-smi exposes all cards, so you could run the same workload on multiple cards. That (likely) won't help with failure modes that are intrinsic to the work being done or to the compute environment. I would speculate that some of those aggressive failure modes would show up across all the hardware.
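
To make the "same workload on multiple cards" idea concrete, here is a rough sketch of redundant execution with a majority vote. Everything in it is an assumption for illustration: `run_redundantly`, `majority_vote`, and `fake_workload` are hypothetical names, and the workload is a plain Python stand-in rather than real GPU code. In practice each replica would probably be a separate process pinned to one card (for example via the `CUDA_VISIBLE_DEVICES` environment variable).

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority of replicas, or raise."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among replicas: {list(results)}")
    return value


def run_redundantly(
    workload: Callable[[int], Hashable], device_ids: Sequence[int]
) -> Hashable:
    """Run the same workload once per device and vote on the results."""
    results = [workload(device_id) for device_id in device_ids]
    return majority_vote(results)


def fake_workload(device_id: int) -> int:
    # Hypothetical stand-in for a GPU job: device 1 simulates a
    # hardware fault by returning a corrupted result.
    return 41 if device_id == 1 else 42


if __name__ == "__main__":
    # Two healthy replicas outvote the single faulty one.
    print(run_redundantly(fake_workload, [0, 1, 2]))
```

Note that voting only masks faults that are independent per card. If the failure comes from the workload or the environment, every replica returns the same wrong answer and the vote happily agrees with it.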

Maybe someone could run workloads across CUDA and ZLUDA (Nvidia and other hardware), but really we just might need better reliability before we can efficiently and reliably run a file system across disparate GPU hardware.