Consider this. One of the smallest Qwen models (4B parameters) powers my home au...

marci · 2026-06-26T06:54:09 1782456849

But with Apple's AFM 3 architecture, we might end up with huge, SOTA-adjacent models on devices with limited RAM.

They use a technique where you only load between 1B and 4B parameters of a 20B dense model for an entire prompt run, not token by token like an MoE, and run mostly on the low-power ANE instead of the GPU cores.

Now, imagine if/when they scale up to 100B or more? On a chip using 2W?

kinnth · 2026-06-26T08:02:11 1782460931

I think we're also ignoring a potential innovative move in how models work.

If someone could splinter or fragment the models into more specific tasks, e.g. a "spellchecker AI", and get these working as well as Sonnet 4.6-4.8 on those tasks on a personal laptop, you would then question the $100 a month fee.

Bear in mind these laptops are likely to be $5000 or so because of the memory, HDD and M7 chip they likely need.

It feels to me like the beginning of the inflection point, but software updates, not hardware updates, will be the accelerant.

marci · 2026-06-26T09:57:17 1782467837

  "That’s where EMO comes in.

  We show that EMO – a 1B-active, 14B-total-parameter (8-expert active, 128-expert total) MoE trained on 1 trillion tokens – supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of total experts) while retaining near full-model performance."

https://allenai.org/blog/emo
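
A back-of-envelope check on the numbers in that quote, as a minimal sketch. The constants come from the quoted text; the helper name `experts_kept` is made up for illustration, and the arithmetic is not taken from the blog post itself.

```python
# Numbers copied from the quoted EMO description.
TOTAL_EXPERTS = 128            # "128-expert total"
ACTIVE_EXPERTS_PER_TOKEN = 8   # "8-expert active"
SUBSET_FRACTION = 0.125        # "just 12.5% of total experts"


def experts_kept(total: int, fraction: float) -> int:
    """Hypothetical helper: how many experts a task-specific subset retains."""
    return round(total * fraction)


kept = experts_kept(TOTAL_EXPERTS, SUBSET_FRACTION)
active_share = ACTIVE_EXPERTS_PER_TOKEN / TOTAL_EXPERTS

print(kept)          # 16 experts kept for a given task or domain
print(active_share)  # 0.0625, i.e. 8 of 128 experts active per token
```

So the selective subset works out to 16 of the 128 experts. How many parameters that corresponds to depends on how much of the model is shared outside the experts, which the quote does not say.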

dainiusse · 2026-06-26T06:25:45 1782455145

Curious, what exactly does it do for you? I've had bad luck getting these small models to do anything useful, tbh.

drnick1 · 2026-06-26T17:05:34 1782493534

It's a voice assistant that can respond to commands such as "turn on the light," or explain things (within the abilities of the small model).