For Llama 3 8B (the smaller model) you need 8 GB of memory minimum, 16 GB recommended;
for Llama 3 70B (the larger model) you need 64 GB minimum, 96 GB recommended.
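For intuition, here is a minimal back-of-the-envelope sketch (my addition, not from the original comment) of where capacity figures like these come from; the quantization levels and the ~20% runtime overhead factor are assumptions:

    def model_memory_gb(params_billion: float, bits_per_param: int, overhead: float = 1.2) -> float:
        """Rough capacity estimate: quantized weights plus ~20% for KV cache,
        activations and runtime buffers (the 20% is an assumed figure)."""
        weight_gb = params_billion * bits_per_param / 8  # 1B params at 8 bits ~ 1 GB
        return weight_gb * overhead

    # Llama 3 8B: ~4.8 GB at 4-bit, ~9.6 GB at 8-bit -> fits the 8 GB / 16 GB guidance
    print(model_memory_gb(8, 4), model_memory_gb(8, 8))
    # Llama 3 70B: ~42 GB at 4-bit, ~84 GB at 8-bit -> fits the 64 GB / 96 GB guidance
    print(model_memory_gb(70, 4), model_memory_gb(70, 8))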
What kind of memory? There are three types:
GPU memory: medium capacity, high speed.
Apple unified memory: high capacity, high speed.
CPU memory: high capacity, low speed.

RTX 3090: 24 GB of memory at 935 GB/s memory bandwidth.
RTX 4060 Ti: 16 GB of memory at 288 GB/s memory bandwidth.
M2 Mac mini: 8 GB / 16 GB of memory at 100 GB/s memory bandwidth.
M3 MacBook Pro: 32-128 GB of memory at up to 400 GB/s memory bandwidth.
New x86 consumer CPUs: up to 256 GB of DDR5 at around 100 GB/s memory bandwidth.
Older CPUs with DDR4 or DDR3 are much slower.
Remember, there is a simple formula for tokens/s: divide memory bandwidth by model size. For example:
a Mac mini runs Llama 3 8B at 18.8 tokens/s using 5 GB of memory (100 GB/s / 5 GB = 20): https://twitter.com/awnihannun/status/1781345824611680596
an RTX 3090 Ti runs Llama 2 7B at 179 tokens/s, also using around 5 GB (1000 GB/s / 5 GB = 200): https://github.com/turboderp/exllamav2
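Here is that formula as a minimal code sketch (my addition); it gives a theoretical upper bound, which is why the measured numbers above land slightly below it:

    def estimated_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
        """Upper-bound decode speed: every generated token streams all model
        weights from memory once, so bandwidth / weight size ~ tokens/s."""
        return bandwidth_gb_s / model_size_gb

    # The two data points quoted above (roughly 5 GB of quantized weights)
    print(estimated_tokens_per_second(100, 5))    # M2 Mac mini: ~20 tokens/s (measured 18.8)
    print(estimated_tokens_per_second(1000, 5))   # RTX 3090 Ti: ~200 tokens/s (measured 179)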
Because of memory bandwidth. An H100 has 3350 GB/s of bandwidth; more GPUs will give you more memory, but not more bandwidth.
If you load 175B parameters in 8-bit, then theoretically you can get 3350/175 ≈ 19 tokens/second.
In an MoE model you only need to process one expert at a time, so a sparse 8x220B model would be only slightly slower than a dense 220B model.
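A hedged sketch of that comparison (my own illustration, not the commenter's math); the ~20B of shared, always-read weights assumed for a hypothetical 8x220B MoE is a guess:

    H100_BANDWIDTH_GB_S = 3350  # H100 SXM HBM3 bandwidth

    def tokens_per_second(active_params_billion: float, bytes_per_param: float = 1.0) -> float:
        """Rough upper bound at 8-bit (1 byte per parameter): bandwidth divided
        by the bytes of weights that must be read per generated token."""
        return H100_BANDWIDTH_GB_S / (active_params_billion * bytes_per_param)

    print(tokens_per_second(175))        # dense 175B: ~19 tokens/s
    print(tokens_per_second(220))        # dense 220B: ~15 tokens/s
    print(tokens_per_second(220 + 20))   # hypothetical 8x220B MoE: one 220B expert routed per token,
                                         # plus an assumed ~20B of shared weights -> ~14 tokens/s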
Okay, memory bandwidth certainly matters, but 19 tokens a second is not some fundamental lower limit on the speed of a language model, so this doesn't really explain why the limit would be 220B rather than, say, 440B or 800B.
It's not a fundamental limit; Google's PaLM was a dense model with 540B parameters. But it is a practical limit, because models with over 1T parameters would be extremely slow even on the newest GPUs. Even now, OpenAI has a limit of 25 messages.
You can read more here: https://bounded-regret.ghost.io/how-fast-can-we-perform-a-fo...
I'm not trying to say memory bandwidth isn't a bottleneck for very large models; I'm wondering why he picked 220B, which is weirdly specific. (To be honest, although I completely agree the costs would be very high, I think there are people who would pay for, and wait for, answers at seconds or even minutes per token if they were good enough, so I'm not completely sure I even agree it's a practical limit.)