
PyTorch Distributed Training - keyboardman
https://leimao.github.io/blog/PyTorch-Distributed-Training/
======
smeeth
Very useful! Thanks!

> Although PyTorch has offered a series of tutorials on distributed training,
> I found it insufficient or overwhelming to help the beginners to do state-
> of-the-art PyTorch distributed training

Can confirm. Am relatively junior, spent way too long trying to understand the
documentation.
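
For anyone else in the same spot, the core single-node pattern is small once you see it in one place. A minimal sketch, assuming one process per GPU on one machine; the address, port, and toy model are placeholders:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def run(rank, world_size):
        # One process per GPU; the rendezvous address/port are placeholders.
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = nn.Linear(10, 10).to(rank)
        ddp_model = DDP(model, device_ids=[rank])
        opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        for _ in range(10):
            opt.zero_grad()
            out = ddp_model(torch.randn(32, 10, device=rank))
            out.sum().backward()  # backward() triggers the gradient all-reduce
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        mp.spawn(run, args=(world_size,), nprocs=world_size)

In practice you'd swap in a real model, a DistributedSampler for the data, and a proper launcher, but that's the skeleton the tutorials spread across several pages.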

A tangentially related gripe: distributed training is poorly documented, and
distributed memory management is barely documented and not fully supported.

If your models are too large to fit in the memory of a single GPU (like really
big word embeddings, which are not exotic by any stretch) things get wacky.
Distributing model parts across different GPUs is supported but poorly
documented and a bit hacky. If your models are too large for all of your GPUs
you're stuck with an outside package like SpeedTorch
(https://github.com/Santosh-Gupta/SpeedTorch) to pin everything in regular
memory.
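
To make the "hacky" part concrete, here's roughly what manual model parallelism looks like; a minimal sketch assuming two visible GPUs, with made-up layer sizes, plus the pinned-host-memory idea SpeedTorch is built around:

    import torch
    import torch.nn as nn

    class TwoGPUModel(nn.Module):
        # Manual model parallelism: each half of the network lives on
        # its own GPU, and forward() copies activations across devices.
        def __init__(self):
            super().__init__()
            self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
            self.part2 = nn.Linear(4096, 10).to("cuda:1")

        def forward(self, x):
            x = self.part1(x.to("cuda:0"))
            return self.part2(x.to("cuda:1"))  # explicit device hop

    model = TwoGPUModel()
    loss = model(torch.randn(8, 1024)).sum()
    loss.backward()  # autograd replays the cross-device copies in reverse

    # Oversized embedding kept in page-locked (pinned) host memory;
    # rows are gathered on the CPU and shipped to the GPU per batch.
    # Sizes are illustrative.
    emb = torch.randn(1_000_000, 300).pin_memory()
    idx = torch.randint(0, 1_000_000, (4096,))
    rows = emb[idx].to("cuda:0")

The catches: nothing overlaps the device hops for you (that's what pipeline approaches address), and the pinned-memory path still pays a host-to-device copy every step.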

------
minimaxir
See also pytorch-lightning
(https://pytorch-lightning.readthedocs.io/en/latest/), which has had cluster
support for a while, coincidentally just added support for Horovod, and will
add support for Ray in the next release.
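
The draw is that the distributed plumbing collapses to Trainer flags. A rough sketch, assuming the argument and backend names from the releases current at the time (they may shift between versions):

    import torch
    import torch.nn as nn
    import pytorch_lightning as pl
    from torch.utils.data import DataLoader, TensorDataset

    class ToyModel(pl.LightningModule):
        # Usual LightningModule boilerplate; the model itself is a stand-in.
        def __init__(self):
            super().__init__()
            self.net = nn.Linear(10, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = nn.functional.mse_loss(self.net(x), y)
            return {"loss": loss}

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)

    data = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)),
                      batch_size=32)

    # Backend names as of the then-current releases (an assumption):
    trainer = pl.Trainer(gpus=2, num_nodes=4, distributed_backend="ddp")
    # or, with the newly added Horovod integration (one GPU per process):
    # trainer = pl.Trainer(gpus=1, distributed_backend="horovod")
    trainer.fit(ToyModel(), data)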

------
jackallis
What happened at Duke?

