
For llama 3 8B (the smaller model) you need 8GB of memory minimum, 16GB recommended. For llama 3 70B (the larger model) you need 64GB minimum, 96GB recommended.

What kind of memory? There are 3 types:

- GPU memory: medium capacity, high speed
- Apple unified memory: high capacity, high speed
- CPU memory: high capacity, low speed

For example, an RTX 3090 has 24GB of memory at 935GB/s memory bandwidth, while an RTX 4060 Ti has 16GB at 288GB/s.

An M2 Mac mini has 8GB or 16GB of memory with 100GB/s memory bandwidth; an M3 MacBook Pro has 32-128GB with up to 400GB/s.

New x86 consumer CPUs support up to 256GB of DDR5 at around 100GB/s memory bandwidth; older CPUs with DDR4 and DDR3 are much slower.

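A minimal sketch of where figures like these come from (my own rough approximation, not from the comment above; the helper name and the 1.2x overhead factor for KV cache and runtime are assumptions): weight memory is roughly parameter count times bytes per weight, plus some headroom.

    # Rough estimate of memory needed to run a model at a given quantization.
    def estimate_memory_gb(params_billions, bits_per_weight=4, overhead=1.2):
        weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8-bit ~= 1GB
        return weight_gb * overhead  # headroom for KV cache and runtime

    print(estimate_memory_gb(8))      # llama 3 8B at 4-bit  -> ~4.8GB, fits in 8GB
    print(estimate_memory_gb(70))     # llama 3 70B at 4-bit -> ~42GB, fits in 64GB
    print(estimate_memory_gb(70, 8))  # llama 3 70B at 8-bit -> ~84GB, fits in 96GB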

Remember, there is a simple formula for tokens/s: you need to divide memory bandwidth by model size, so for example:

An M2 Mac mini runs llama 3 8B at 18.8 tps using about 5GB of memory (theoretical: 100GB/s / 5GB = 20): https://twitter.com/awnihannun/status/1781345824611680596

An RTX 3090 Ti runs llama 2 7B at 179 tps, also using around 5GB (theoretical: 1000GB/s / 5GB = 200): https://github.com/turboderp/exllamav2
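A tiny sketch of this rule of thumb (the function name is mine; real throughput lands somewhat below the theoretical number because of compute and overhead):

    # Each generated token has to read (roughly) all the weights once,
    # so tokens/s is bounded by memory bandwidth / model size in memory.
    def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
        return bandwidth_gb_s / model_size_gb

    print(max_tokens_per_second(100, 5))   # M2 mac mini, ~5GB model  -> 20
    print(max_tokens_per_second(1000, 5))  # RTX 3090 Ti, ~5GB model  -> 200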


Because of memory bandwidth. An H100 has 3350GB/s of bandwidth; more GPUs give you more memory but not more bandwidth. If you load 175B parameters in 8-bit then you can theoretically get 3350/175 = 19 tokens/second. In a MoE you only need to process one expert at a time, so a sparse 8x220B model would be only slightly slower than a dense 220B model.
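A quick sketch of that arithmetic (assuming 8-bit weights, i.e. roughly 1 byte per parameter; the constant and function names are mine):

    H100_BANDWIDTH_GB_S = 3350

    # Dense decoding reads all weights per token; a MoE only reads the active expert(s).
    def theoretical_tps(active_params_b, bytes_per_param=1):
        return H100_BANDWIDTH_GB_S / (active_params_b * bytes_per_param)

    print(theoretical_tps(175))  # dense 175B in 8-bit -> ~19 tokens/s
    print(theoretical_tps(220))  # dense 220B, or one 220B expert of an 8x220B MoE -> ~15 tokens/s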


Okay, memory bandwidth certainly matters, but 19 tokens a second is not some fundamental minimum speed a language model has to run at, so this doesn't really explain why the limit would be 220B rather than, say, 440B or 800B?


It's not a fundamental limit. Google PaLM had 540B parameters as a dense model. But it's a practical limit, because models with over 1T parameters would be extremely slow even on the newest GPUs. Even now, OpenAI has a limit of 25 messages. You can read more here: https://bounded-regret.ghost.io/how-fast-can-we-perform-a-fo...


I'm not trying to say memory bandwidth isn't a bottleneck for very large models. I'm wondering why he picked 220B, which is weirdly specific. (To be honest, although I completely agree the costs would be very high, I think there are people who would pay for, and wait for, answers at seconds or even minutes per token if they were good enough, so I'm not completely sure I even agree it's a practical limit.)

