Hacker News new | past | comments | ask | show | jobs | submit login

They trained a >600B parameter translation model (GPT-3 is 175B parameters).

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (https://arxiv.org/abs/2006.16668)




Applications are open for YC Winter 2021

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: