Compute? Inference doesn't only need memory bandwidth. You need to actually do work on the data you're loading, which takes compute power. That takes more electricity, which takes more cooling, which isn't practical for something as thin as an MBP.
MacBook Pro has plenty of compute for local LLMs to be usable. I'm getting up to ~150 tokens/s with Deepseek-v4-Flash on a MacBook M5 Max. It's quite capable for coding assistant usage.
In general, LLM token generation (decode) is bottlenecked by memory bandwidth rather than raw compute power.
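For anyone who wants to sanity-check that claim, here's a roofline-style sketch. It assumes each weight is read once per generated token and contributes one multiply-add (2 FLOPs), and it ignores dequantization overhead and KV-cache reads. The hardware numbers are illustrative placeholders, not specs for any real chip, and all function names are made up for this example.

```python
def decode_intensity(bits_per_weight: float, batch_size: int = 1) -> float:
    """FLOPs done per byte of weights read in one decode step."""
    flops_per_weight = 2 * batch_size      # one multiply-add per weight, per sequence
    bytes_per_weight = bits_per_weight / 8
    return flops_per_weight / bytes_per_weight


def machine_balance(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs the hardware can do in the time it takes to read one byte."""
    return peak_flops / bandwidth_bytes_per_s


def is_memory_bound(intensity: float, balance: float) -> bool:
    """Below the balance point, the chip waits on memory rather than on math."""
    return intensity < balance


# Placeholder hardware numbers (assumptions, not specs of any real chip).
HYPOTHETICAL_PEAK_FLOPS = 30e12   # 30 TFLOP/s
HYPOTHETICAL_BANDWIDTH = 500e9    # 500 GB/s

intensity = decode_intensity(bits_per_weight=4)
balance = machine_balance(HYPOTHETICAL_PEAK_FLOPS, HYPOTHETICAL_BANDWIDTH)
print(f"intensity={intensity:.0f} FLOPs/byte, balance={balance:.0f} FLOPs/byte")
print("memory-bound:", is_memory_bound(intensity, balance))
```

Under these assumptions, single-stream 4-bit decode sits far below the balance point. Raising `batch_size` (which is roughly what prefill does by processing many tokens at once) pushes the intensity up, and that is where compute starts to matter.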
Yes, it's quantized (4 bit). Sure, it's... not quite as good as what's on offer via API. And sure, "up to" does a lot of work (I don't have an average/median for you but it feels fast to me).
But it's usable, fully local, fully private, and has no subscriptions and no operating costs other than electricity.
I mean, it'd take minutes of research to realize people are successfully and efficiently running 4-bit quantized GLM 5.2 on 512GB M3 Ultra Mac Studios at over 60 tok/s. K2 2.7 is quite literally designed for 4-bit quantization and runs even better.
The integrated GPU. Not enough compute onboard to handle prefill for 100GB+ models, and decode is constrained by memory bandwidth that's lower than most dGPUs at that price.
Apple would be in a much stronger spot right now if they didn't pretend like eGPUs were inconceivable black magic that Macs are incompatible with.
I'm not sure I follow - 614 GB/sec is pretty squarely in dGPU territory (~5070 level). External GPUs can definitely exceed that on the very high end, but it seems pretty competitive, no?
Competitive for 16-24GB dGPUs, but for 100GB+ inference workloads it's going to be a decode bottleneck. For smaller models it'd be fine, but the same goes for the smaller GPUs.
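Back-of-envelope version of the decode ceiling, as a rough sketch: tokens/s can't exceed memory bandwidth divided by the bytes that must be read per token. The 614 GB/s and 100 GB figures come from this thread. The 10% active fraction for a mixture-of-experts model is a made-up illustration, not a property of any specific model, and the function name is hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed if weight reads are the only cost."""
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 614.0      # figure quoted in the thread
MODEL_SIZE_GB = 100.0       # "100GB+" model from the thread, after quantization

# Dense model: every weight is read for every token.
dense = decode_ceiling_tok_s(BANDWIDTH_GB_S, MODEL_SIZE_GB)

# Mixture-of-experts: only the active experts are read per token.
ASSUMED_ACTIVE_FRACTION = 0.10   # illustrative assumption only
moe = decode_ceiling_tok_s(BANDWIDTH_GB_S, MODEL_SIZE_GB * ASSUMED_ACTIVE_FRACTION)

print(f"dense ceiling: {dense:.1f} tok/s")
print(f"MoE ceiling:   {moe:.1f} tok/s")
```

These are ceilings, not predictions: real throughput also pays for KV-cache reads, dequantization, and kernel overhead. It does suggest why a dense 100GB model and a sparse one of the same size can feel very different on the same machine.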
In particular though, the fatal bottleneck is the weakness of the iGPU. Filling a KV cache on a 100GB+ model could take a few minutes, or even hours if you're trying to restore a 256K-to-1M-token session.
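A rough sketch of where a "minutes" figure for prefill can come from, assuming prefill is compute-bound and costs about 2 FLOPs per active parameter per prompt token. This ignores the attention term, which grows with context length and makes long sessions worse than the estimate. The parameter count and sustained FLOP/s below are hypothetical placeholders, not measurements of any model or chip.

```python
def prefill_seconds(active_params: float, prompt_tokens: int,
                    sustained_flops: float) -> float:
    """Estimated time to process a prompt, weights-only FLOPs, no attention term."""
    total_flops = 2 * active_params * prompt_tokens
    return total_flops / sustained_flops


# All three numbers are illustrative assumptions.
ASSUMED_ACTIVE_PARAMS = 30e9       # parameters touched per token
ASSUMED_SUSTAINED_FLOPS = 20e12    # FLOP/s actually achieved, not peak
PROMPT_TOKENS = 256_000            # low end of the session size mentioned above

seconds = prefill_seconds(ASSUMED_ACTIVE_PARAMS, PROMPT_TOKENS, ASSUMED_SUSTAINED_FLOPS)
print(f"estimated prefill: {seconds / 60:.1f} minutes")
```

Plug in your own numbers; the estimate scales linearly with prompt length and active parameters, and inversely with whatever throughput the GPU actually sustains.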