The PipeDream-2BW paper [1] and the ZeRO-Offload paper [2] both show that a 1-step delayed asynchronous gradient update does not hurt convergence (or perplexity), while improving training efficiency by a large margin (by filling the bubbles in pipeline parallelism).
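To make concrete what I mean by "1-step delayed", here is a toy sketch (my own illustration, not code from either paper): the gradient applied at step t is computed against the weights from step t-1, so the forward/backward of the next microbatches can overlap with the optimizer step instead of waiting for it.

```python
import numpy as np

rng = np.random.default_rng(0)
lr = 0.1

def grad(w):
    # toy quadratic loss L(w) = 0.5 * ||w||^2, so grad(w) = w
    return w

w0 = rng.normal(size=4)

# synchronous SGD: gradient at step t uses the weights at step t
w_sync = w0.copy()
for t in range(100):
    w_sync = w_sync - lr * grad(w_sync)

# 1-step delayed SGD: gradient at step t uses the weights from step t-1
# (staleness 1, as in PipeDream-2BW's double-buffered updates / DPU)
w_async = w0.copy()
w_prev = w_async.copy()
for t in range(100):
    g = grad(w_prev)          # gradient computed on the stale weights
    w_prev = w_async.copy()   # snapshot before the update lands
    w_async = w_async - lr * g

print("sync  :", np.linalg.norm(w_sync))
print("1-step:", np.linalg.norm(w_async))
```

On this toy convex loss both runs converge; the papers' claim, as I read it, is that the same holds empirically for large-scale pretraining.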
However, neither Megatron-LM [3] nor DeepSpeed [4] uses PipeDream-2BW scheduling. Could anyone share some insights into why such an efficient scheduling scheme hasn't become popular in the LLM pretraining community? Does it suffer from convergence/accuracy issues in practice? Or are there other concerns blocking it from becoming the default / most popular pipeline-parallelism schedule?
[1]: https://arxiv.org/abs/2006.09503
[2]: https://arxiv.org/abs/2101.06840
[3]: https://github.com/nvidia/Megatron-LM
[4]: https://github.com/microsoft/DeepSpeed/issues/1110