
Is this sensible? Transformers are memory bandwidth bound. Schlepping activations around your home network (which is liable to be lossy) seems like it would result in atrocious TPS.



"Transformers are memory bandwidth bound" - this is the precise reason why this makes sense. If a model doesn't fit into memory on a single device, it needs to be incrementally loaded into memory (offloading), which is bottlenecked by memory bandwidth. Splitting the model over multiple devices avoids this, instead trading off for latency of communicating between nodes. The network bandwidth requirements are minimal since only the activations (intermediary embeddings) are passed between devices. For Llama-3-8B these are ~10KB, for Llama-3-70B these are ~32KB.


It's worth noting that the number you're quoting is the embedding passed at each layer boundary. If you split your model across 5 nodes, you need to send those ~32 KB 5 times. It's also per token: processing 1K tokens adds up to 32 MB of data, and 1M tokens to 32 GB...
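
Putting that arithmetic into a quick script (the ~32 KB per-token figure and the 5-node split are taken from the comments above; counting one equally-sized hop per boundary is an assumption):

    # Aggregate transfer = per-token activation bytes x tokens x boundaries.
    ACTIVATION_BYTES = 32_000   # ~32 KB per token (Llama-3-70B figure above)
    BOUNDARIES = 5              # model split across 5 nodes

    for tokens in (1_000, 1_000_000):
        per_hop_mb = ACTIVATION_BYTES * tokens / 1e6
        total_mb = per_hop_mb * BOUNDARIES
        print(f"{tokens:>9} tokens: {per_hop_mb:,.0f} MB per hop, "
              f"{total_mb:,.0f} MB across {BOUNDARIES} hops")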



