
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding - lawrenceyan
https://www.youtube.com/watch?v=1VdEw_mGjFk
======
lawrenceyan
For easier viewing:

0:00 - Intro & Overview

4:10 - Main Results

5:10 - Mixture-of-Experts

16:00 - Difference to Scaling Classic Transformers

18:50 - Backpropagation in Mixture-of-Experts

20:05 - MoE Routing Algorithm in GShard

38:20 - GShard Einsum Examples

47:40 - Massively Multilingual Translation

56:00 - Results

1:11:30 - Conclusion & Comments

