Consider this. One of the smallest Qwen models (4B parameters) powers my home automation voice assistant, and runs on CPU alone at >20 tok/s. It is enough for that use case, and could be made even better/faster with a modest GPU. It isn't as smart as some cloud-connected thingamajig, but I would never allow a literal Google or Amazon bug in my home. Huge SOTA models aren't relevant everywhere. Most people use LLMs for rather trivial tasks such as finding typos or drafting text.
But with Apple's AFM 3 architecture, we might end up with huge, SOTA-adjacent models on devices with limited RAM.
They use a technique where only 1B to 4B parameters of a 20B dense model are loaded for an entire prompt run, rather than being selected token by token as in a MoE, and inference runs mostly on the low-power ANE instead of the GPU cores.
Now, imagine if/when they scale up to 100B or more? On a chip using 2W?
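To put rough numbers on why loading only a slice matters on a RAM-limited device, here is a back-of-the-envelope sketch. The 4-bit weight precision is my assumption (no quantization figure is given above), and the function name is just illustrative. It counts weight storage only and ignores the KV cache, activations and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 10^9 bytes), weights only."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


BITS = 4  # assumption: 4-bit quantized weights

# Sizes taken from the discussion: 20B total, 1B to 4B loaded per prompt,
# and the speculative 100B scale-up.
sizes = [
    ("full 20B model", 20.0),
    ("1B slice", 1.0),
    ("4B slice", 4.0),
    ("hypothetical 100B model", 100.0),
]

for label, params in sizes:
    print(f"{label}: ~{weight_memory_gb(params, BITS):.1f} GB of weights")
```

Under that assumption the resident slice is a few GB at most, while the full set of weights can stay on storage until a different slice is needed.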
I think we're also ignoring a potential innovative move in how models work.
If someone could splinter or fragment the models into more specific tasks, e.g. a "spellchecker AI", and get these working as well as Sonnet 4.6-4.8 on those tasks on a personal laptop, you would then question the $100 a month fee.
Bear in mind these laptops are likely to be $5000 or so because of the memory, HDD and M7 chip they likely need.
It feels to me like the beginning of the inflection point, but software updates, not hardware updates, will be the accelerant.
"That’s where EMO comes in.
We show that EMO – a 1B-active, 14B-total-parameter (8-expert active, 128-expert total) MoE trained on 1 trillion tokens – supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of total experts) while retaining near full-model performance."
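For scale, the expert counts in that quote work out as below. This is only arithmetic on the quoted figures. It does not estimate how many parameters the subset holds, since the quote does not say how much of the 14B sits in shared (non-expert) layers. The variable names are mine.

```python
TOTAL_EXPERTS = 128       # from the quote
ACTIVE_PER_TOKEN = 8      # from the quote
SUBSET_FRACTION = 0.125   # "just 12.5% of total experts"

# Number of experts kept for a given task or domain.
subset_experts = round(TOTAL_EXPERTS * SUBSET_FRACTION)

# Share of all experts that fire for any single token.
active_fraction = ACTIVE_PER_TOKEN / TOTAL_EXPERTS

# Experts that would not need to be kept in memory for that task.
dropped_experts = TOTAL_EXPERTS - subset_experts

print(f"experts kept per task: {subset_experts}")
print(f"experts active per token: {ACTIVE_PER_TOKEN} ({active_fraction:.2%} of all experts)")
print(f"experts that can stay unloaded: {dropped_experts}")
```

So a task-specific deployment would keep 16 of the 128 experts and still route each token to 8 of them.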