Apparently Mac purchasers like to talk about tokens per second without talking about the Mac's atrocious time to first token. They also like to enthusiastically quote tokens per second for a 200-token question rather than for a longer prompt.
I'm not sure what the impact is on a 70b model but it seems there's a lot of exaggeration going on in this space by Mac fans.
For those interested, a few months ago someone posted benchmarks from their MBP 14 w/ an M3 Max [1] (128GB, 40 CU, theoretical: 28.4 FP16 TFLOPS, 400GB/s MBW).
The result for Llama 2 70B Q4_0 (39GB) was 8.5 tok/s for text generation (you'd expect a theoretical max of a bit over 10 tok/s based on the theoretical MBW) and a prompt processing speed of 19 tok/s. On a 4K-context conversation, that means you'd be waiting about 3.5 minutes between turns before tokens started outputting.
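Back-of-the-envelope, using the numbers from that post (and assuming text generation is purely memory-bandwidth bound, i.e. every generated token reads all the weights once):

```python
# Rough math for the MBP 14 / M3 Max benchmark above.
weights_gb = 39          # Llama 2 70B Q4_0
mbw_gbps = 400           # M3 Max memory bandwidth
theoretical_tg = mbw_gbps / weights_gb
print(f"theoretical max generation: {theoretical_tg:.1f} tok/s")   # ~10.3 tok/s

prompt_tokens = 4096     # a 4K-context conversation
pp_speed = 19            # measured prompt processing, tok/s
ttft_s = prompt_tokens / pp_speed
print(f"wait before first token: {ttft_s / 60:.1f} min")           # ~3.6 min
```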
Sadly, I doubt that Strix Halo will perform much better. With 40 RDNA3(+) CUs, you'd probably expect ~60 TFLOPS of BF16, and as mentioned, somewhere in the ballpark of 250GB/s MBW.
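Where that ~60 TFLOPS guess comes from (the ~512 FP16/BF16 FLOPS per CU per clock is the usual RDNA3 dual-issue/WMMA figure; the ~2.9 GHz clock is an assumption, not a spec):

```python
# Very rough peak-compute estimate for a 40 CU RDNA3(+) GPU.
cus = 40
flops_per_cu_per_clk = 512   # packed FP16/BF16, dual-issue
clock_ghz = 2.9              # assumed boost clock
peak_tflops = cus * flops_per_cu_per_clk * clock_ghz * 1e9 / 1e12
print(f"~{peak_tflops:.0f} BF16 TFLOPS")   # ~59
```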
Having lots of GPU memory even w/ weaker compute/MBW would be good for a few things though:
* MoE models - you'd need something like 192GB of VRAM to run DeepSeek V2.5 (21B active, but 236B in weights) at a decent quant - a Q4_0 would be about 134GB just to load the weights, but with far fewer active parameters per token, you'd still be able to inference at ~20 tok/s (see the sizing sketch after this list). Still, even with “just” 96GB you should be able to fit a Mixtral 8x22B, or easily fit one of the new Microsoft MoEs (GRIN/Phi).
* Long context - even with kvcache quantization, you need lots of memory for these new big context windows, so having extra memory is still pretty necessary even for much smaller models (the sketch after this list includes a rough kvcache estimate). Especially if you want to use any of the new CoT/reasoning techniques, you'll need all the tokens you can get.
* Multiple models - Having multiple models preloaded that you can mix and match depending on the use case would be pretty useful as well. Some of the smaller Qwen2.5 models look like they might do code as well as much bigger models, and you might also want a model that's specifically tuned for function calling, a VLM, STT/TTS, etc. While you might be able to swap adapters for some of this stuff eventually, for now, being able to keep multiple models loaded locally would still be pretty convenient.
* Batched/offline inference - being able to load up big models would still be really useful if you have any tasks that you could queue up and process overnight. I think these types of tools are relatively underexplored atm, but they have just as many use cases as real-time inferencing.
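To put rough numbers on the MoE and long-context points above (the ~4.5 bits/weight average for Q4_0 comes from its 4-bit values plus one fp16 scale per 32-weight block; the kvcache dimensions are assumed, typical-70B-class GQA values, not any specific model's):

```python
def q4_0_gb(params_b):
    # Q4_0 averages ~4.5 bits/weight
    return params_b * 1e9 * 4.5 / 8 / 1e9

print(f"DeepSeek V2.5 (236B) weights: ~{q4_0_gb(236):.0f} GB")    # ~133 GB
active_gb = q4_0_gb(21)                                           # ~12 GB read per token
print(f"gen speed at 250 GB/s MBW: ~{250 / active_gb:.0f} tok/s") # ~21 tok/s

# fp16 kvcache: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes, per token
layers, kv_heads, head_dim = 80, 8, 128
kv_bytes_per_tok = 2 * layers * kv_heads * head_dim * 2
print(f"kvcache at 32K context: ~{kv_bytes_per_tok * 32768 / 1e9:.1f} GB")  # ~10.7 GB
```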
One other thing to note is that on the Mac side, you're mainly relegated to llama.cpp and MLX. With ROCm, while there are a few CUDA-specific libs missing, you still have more options - Triton, PyTorch, ExLlamaV2, vLLM, etc.
Yes, for single-user multi-turn use, kvcache reuse could help a lot. vLLM supports this via Automatic Prefix Caching (APC), so you'd be able to take advantage of it w/ Strix Halo now. llama.cpp has had a “prompt-cache” option, but when I last looked it was a bit weird (it only works for non-interactive use, and saves/loads the cache to disk), so it might not help much on the Mac side.
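A minimal sketch of APC in vLLM (the model name and prompts are just placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=128)

history = "System prompt + earlier turns of the conversation...\n"
# The shared prefix gets cached on the first call, so later turns only
# have to prefill the newly appended tokens instead of the whole history.
llm.generate(history + "User: first question\n", params)
llm.generate(history + "User: first question\nAssistant: ...\nUser: follow-up\n", params)
```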
You can get better performance using a good CPU + a 4090 and offloading layers to the GPU. However, one is a laptop and the other is a desktop...
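For reference, partial offload with llama-cpp-python looks something like this (the model path and layer count are placeholders you'd tune to whatever fits in 24GB of VRAM):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-70b.Q4_0.gguf",
    n_gpu_layers=48,   # offload as many of the 80 layers as fit on the 4090
    n_ctx=4096,
)
out = llm("Q: Why is the sky blue? A:", max_tokens=128)
print(out["choices"][0]["text"])
```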