I doubt you can run a model that requires hundreds of GB of RAM at an acceptable...

aroman · 2026-06-25T20:14:10 1782418450

What would be the bottleneck?

Rohansi · 2026-06-26T02:38:31 1782441511

Compute? Inference doesn't only need memory bandwidth. You need to actually do work with the data you're loading, which needs compute power. That needs more electricity, which needs more cooling, which isn't practical for something as thin as an MBP.

reverius42 · 2026-06-26T07:36:33 1782459393

MacBook Pro has plenty of compute for local LLMs to be usable. I'm getting up to ~150 tokens/s with Deepseek-v4-Flash on a MacBook M5 Max. It's quite capable for coding assistant usage.

In general LLMs are bottlenecked by memory bandwidth rather than raw compute power.

bel8 · 2026-06-26T11:32:39 1782473559

But it's quantized, right? It's not the same almost-free DeepSeek you get from the API.

And once the context gets large, it slows down.

"Up to" 150 is doing a lot of work there.

reverius42 · 2026-06-26T11:49:20 1782474560

Yes, it's quantized (4 bit). Sure, it's... not quite as good as what's on offer via API. And sure, "up to" does a lot of work (I don't have an average/median for you but it feels fast to me).

But it's usable, fully local, fully private, and has no subscriptions and no operating costs other than electricity.

cududa · 2026-06-26T03:44:59 1782445499

I mean it’d take minutes of research to realize people are successfully and efficiently running 4-bit quantized GLM 5.2 on 512GB M3 Ultra Mac Studios at over 60 tok/s. K2 2.7 is quite literally designed for 4-bit quantization and runs even better.

This is already a thing.

bigyabai · 2026-06-25T23:22:19 1782429739

The integrated GPU. Not enough compute onboard to handle prefill for 100GB+ models, and the decode is constrained by memory bandwidth that's lower than most dGPUs at that price.

Apple would be in a much stronger spot right now if they didn't pretend like eGPUs were inconceivable black magic that Macs are incompatible with.

aroman · 2026-06-26T00:07:07 1782432427

I'm not sure I follow - 614 GB/sec is pretty squarely in dGPU territory (~5070 level). External GPUs can definitely exceed that on the very high end, but it seems pretty competitive, no?

bigyabai · 2026-06-26T00:27:11 1782433631

Competitive for 16-24GB dGPUs, but for 100GB+ inference workloads it's going to be a decode bottleneck. For smaller models it'd be fine, but the same goes for the smaller GPUs.
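
A rough back-of-envelope for that decode ceiling, as a sketch only: it assumes decode is purely bandwidth-bound and that each generated token streams the active weights from memory exactly once. The function name and the 10 GB "active weights" figure for a mixture-of-experts model are hypothetical, picked for illustration; the 614 GB/s and 100 GB figures are the ones from this thread.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on decode speed if each token reads the active weights once.

    Ignores KV-cache reads, compute time, and any overlap or caching effects,
    so real throughput will be lower than this number.
    """
    if active_weight_gb <= 0:
        raise ValueError("active_weight_gb must be positive")
    return bandwidth_gb_s / active_weight_gb


# Dense 100 GB model at 614 GB/s: every weight is read for every token.
dense_bound = decode_tokens_per_second(614, 100)

# Hypothetical MoE model: 100 GB on disk, but only ~10 GB of weights active per token.
moe_bound = decode_tokens_per_second(614, 10)

print(f"dense ceiling: {dense_bound:.1f} tok/s")
print(f"MoE ceiling:   {moe_bound:.1f} tok/s")
```

Under those assumptions a dense 100GB model tops out around 6 tok/s, while a sparse model of the same total size could plausibly land much higher, which may be why the numbers people report in this thread vary so much.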

In particular though, the fatal bottleneck is the weakness of the iGPU. Filling a KV cache on a 100GB+ model could take a few minutes, or even hours if you're trying to restore a 256k-to-1M-token session.
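
To put a rough number on the prefill side, here is a sketch using the common approximation of about 2 FLOPs per active parameter per prompt token. It deliberately ignores attention cost, which grows with context length and makes long sessions worse than this estimate. The 40B active-parameter count and the 20 TFLOP/s sustained throughput are hypothetical placeholders, not measurements of any particular Mac or model.

```python
def prefill_seconds(prompt_tokens: int, active_params: float, sustained_flops: float) -> float:
    """Estimate prefill time as (2 FLOPs per active parameter per token) / throughput.

    Leaves out attention FLOPs and memory stalls, so treat it as a lower bound.
    """
    if sustained_flops <= 0:
        raise ValueError("sustained_flops must be positive")
    return 2 * active_params * prompt_tokens / sustained_flops


# Hypothetical: restoring a 256k-token session on a model with 40B active
# parameters, on hardware sustaining 20 TFLOP/s during prefill.
seconds = prefill_seconds(256_000, 40e9, 20e12)
print(f"{seconds:.0f} s (~{seconds / 60:.0f} min)")
```

With those placeholder numbers the estimate comes out to roughly 17 minutes before the first new token, and it scales linearly with both prompt length and active parameter count.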