
Is this sensible? Transformers are memory bandwidth bound. Schlepping activations around your home network (which is liable to be lossy) seems like it would result in atrocious TPS.



"Transformers are memory bandwidth bound" - this is the precise reason why this makes sense. If a model doesn't fit into memory on a single device, it needs to be incrementally loaded into memory (offloading), which is bottlenecked by memory bandwidth. Splitting the model over multiple devices avoids this, instead trading off for latency of communicating between nodes. The network bandwidth requirements are minimal since only the activations (intermediary embeddings) are passed between devices. For Llama-3-8B these are ~10KB, for Llama-3-70B these are ~32KB.


It's worth noting that the number you're quoting is the embedding passed at each layer boundary. If you split your model across 5 nodes, you need to send those ~32 KB 5 times. It's also per token: processing 1K tokens adds up to 32 MB of data, and 1M tokens to 32 GB...
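
Putting that arithmetic into a quick script (the ~32 KB per-token figure and the 5-node split are taken from the comments above; counting one equally-sized hop per boundary is an assumption):

    # Aggregate transfer = per-token activation bytes x tokens x boundaries.
    ACTIVATION_BYTES = 32_000   # ~32 KB per token (Llama-3-70B figure above)
    BOUNDARIES = 5              # model split across 5 nodes

    for tokens in (1_000, 1_000_000):
        per_hop_mb = ACTIVATION_BYTES * tokens / 1e6
        total_mb = per_hop_mb * BOUNDARIES
        print(f"{tokens:>9} tokens: {per_hop_mb:,.0f} MB per hop, "
              f"{total_mb:,.0f} MB across {BOUNDARIES} hops")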



