Why hasn't async gradient update become popular in the LLM community? (github.com/sighingnow)
3 points by sighingnow on Oct 10, 2023 | 1 comment



The PipeDream-2BW paper [1] and the ZeRO-Offload paper [2] both show that a 1-step delayed asynchronous gradient update does not hurt convergence (or final perplexity), while improving training efficiency by a large margin (by filling the bubbles in pipeline parallelism).

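To make concrete what "1-step delayed" means, here is a toy sketch of my own (not code from the papers or from Megatron-LM): the gradient computed against the previous weight version is applied one step later, so in a real pipeline the optimizer step can overlap with the next micro-batch's forward/backward. As I understand it, PipeDream-2BW additionally keeps two weight versions per stage so each micro-batch's backward uses the same weights as its forward; the sketch omits that bookkeeping.

    # Toy sketch of a 1-step delayed gradient update on a single tensor.
    # Not code from [1]-[4]; just an illustration of the staleness pattern.
    import torch

    torch.manual_seed(0)
    w = torch.randn(4, requires_grad=True)   # current weights
    opt = torch.optim.SGD([w], lr=0.1)
    pending_grad = None                      # gradient computed in the previous step

    def loss_fn(weights, x):
        return ((weights * x).sum() - 1.0) ** 2

    for step, x in enumerate(torch.randn(8, 4)):
        # 1) Apply the gradient computed one step earlier (if any).
        #    In a real pipeline this optimizer step overlaps with other work.
        if pending_grad is not None:
            w.grad = pending_grad
            opt.step()
            opt.zero_grad()

        # 2) Compute this step's gradient; it will only be applied at step t+1,
        #    which is exactly the 1-step staleness the papers analyze.
        loss = loss_fn(w, x)
        loss.backward()
        pending_grad = w.grad.detach().clone()
        w.grad = None

        print(f"step {step}: loss = {loss.item():.4f}")
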
However, neither Megatron-LM [3] nor DeepSpeed [4] uses PipeDream-2BW scheduling. Could anyone share some insights into why such an efficient scheduling scheme hasn't become popular in the LLM pretraining community? Does it suffer from convergence/accuracy issues in practice? Or are there other concerns that block it from becoming the default / most popular pipeline-parallelism schedule?

[1]: https://arxiv.org/abs/2006.09503

[2]: https://arxiv.org/abs/2101.06840

[3]: https://github.com/nvidia/Megatron-LM

[4]: https://github.com/microsoft/DeepSpeed/issues/1110



