It's not, really. And 8x7B is not a 7B model: it's a MoE with roughly 47B total parameters that all have to be kept in memory, and it routes each token through 2 experts, so it runs at roughly 13B-model speeds.
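The memory-vs-speed gap comes from the MoE structure: every expert has to sit in VRAM, but each token only passes through the top-k routed experts. A rough sketch (the shared/expert split below is an illustrative assumption, not Mixtral's exact published breakdown):

```python
# Rough MoE sizing: all experts live in memory, but each token
# only computes through the top-k experts the router picks.
def moe_params(shared_b, expert_b, n_experts, top_k):
    total = shared_b + n_experts * expert_b   # parameters that must be resident
    active = shared_b + top_k * expert_b      # parameters used per token
    return total, active

# Assumed split for an 8x7B-style model: ~1.9B shared (attention etc.)
# and ~5.6B per expert FFN stack -- hypothetical numbers for illustration.
total, active = moe_params(1.9, 5.6, n_experts=8, top_k=2)
print(f"total ~= {total:.1f}B, active ~= {active:.1f}B")
```

So the VRAM cost looks like a ~47B dense model while the per-token compute looks like a ~13B one, which is exactly the trade-off being argued about here.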
All of the current frameworks support MoE and sharding across GPUs, so I don't see what the issue is.