* for synchronous model-based distributed training to scale linearly, the time required to broadcast the model must be much smaller than the time required for a worker (GPU) to process a batch (see the sketch after this list)
* the batch sizes are very large (e.g. 8192)
* the network is low latency
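
A minimal back-of-the-envelope sketch of that first condition, assuming a simple cost model where each step pays the full broadcast cost with no compute/communication overlap; the timings below are made-up illustrative values, not measurements:

```python
# Hypothetical cost model: per-step time = compute time + broadcast time.

def scaling_efficiency(compute_s: float, broadcast_s: float) -> float:
    """Fraction of the ideal (linear) speedup retained per step."""
    return compute_s / (compute_s + broadcast_s)

# A worker needs ~300 ms to process its batch (assumed value).
compute_s = 0.300

# Fast interconnect: broadcasting the model takes ~10 ms per step.
print(scaling_efficiency(compute_s, 0.010))  # ~0.97 -> close to linear scaling

# Slow interconnect: broadcast takes as long as the compute itself.
print(scaling_efficiency(compute_s, 0.300))  # 0.50 -> half the ideal speedup
```

When the broadcast time is a small fraction of the per-batch compute time, adding workers gives close to a proportional speedup; when the two are comparable, communication dominates and scaling falls well short of linear.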