bitL on Jan 18, 2023 | on: Let's build GPT: from scratch, in code, spelled ou...

In theory it should be linear. However, the parallelization is not perfect, and some overlapping parts of the gradients are computed on multiple GPUs at the same time, so expect some constant-factor slowdown on average.
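
To put rough numbers on that, here is a minimal sketch, assuming the slowdown can be modeled as a single constant `overhead_factor` applied to ideal linear scaling. The function name and the 1.25 default are made up for illustration, not measured values.

```python
def effective_speedup(num_gpus: int, overhead_factor: float = 1.25) -> float:
    """Estimate speedup over one GPU under a constant-factor slowdown.

    overhead_factor is a hypothetical constant (>= 1.0) standing in for
    imperfect parallelization and redundant gradient work across GPUs.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if overhead_factor < 1.0:
        raise ValueError("overhead_factor must be at least 1.0")
    if num_gpus == 1:
        # A single GPU pays no parallelization overhead.
        return 1.0
    # Ideal linear scaling, divided by the constant overhead.
    return num_gpus / overhead_factor


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} GPUs: ideal {n}x, estimated {effective_speedup(n):.2f}x")
```

Under this model the curve is still a straight line in the number of GPUs, just with a shallower slope than the ideal one.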