And almost by happenstance, Apple. Turns out they have a great platform for inference and torched almost nothing, comparatively, on Siri. The Apple/Gemini deal is interesting; Google keeps demonstrating their willingness to degrade their experience on Apple devices to try and force people to switch.
If you do the math (I did), in two years open-source models that you can run on a future MacBook Pro will be as capable as the frontier cloud models are today. Memory bandwidth is growing rapidly, as is the die area dedicated to the neural cores. And all the while, the silicon keeps getting more power-efficient and more dense (as it always does). These hardware improvements are arriving just as the open-source models improve through research advances.

And while the cloud models will always be better (because they can use as much power as they want, up in the cloud), what matters to most of us is whether a model can do a meaningful share of our knowledge work. At the same time, energy consumption for cloud infrastructure is outpacing the creation of new energy supply, which is not an easily solved problem. I believe energy scarcity will increasingly push the frontier labs toward power efficiency, which necessarily means the performance gap between cloud and local execution will narrow.
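For the curious, here's the rough shape of that math as a toy model. Every rate below is an assumption of mine for illustration (the comment above doesn't state its inputs), not measured data:

```python
# Toy projection: how long until a laptop-runnable open model matches
# today's frontier cloud models? All three numbers are assumptions.
capability_gap = 6.0    # assumed: frontier is ~6x "more capable" today
efficiency_gain = 2.0   # assumed: yearly open-model capability-per-GB gain
bandwidth_gain = 1.25   # assumed: yearly laptop memory-bandwidth gain

years = 0
while capability_gap > 1.0:
    capability_gap /= efficiency_gain * bandwidth_gain
    years += 1

print(years)  # -> 2 years under these (generous) assumptions
```

Tweak the assumed rates and the answer moves a lot, which is really the whole debate in this thread.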
People had this "why you probably can't run a GPT-4 (or even GPT-3.5) class model on your MBP anytime soon" conversation before.
Today's LLMs pack far more capability into fewer parameters than 2023's models did. We might still be in the very rudimentary phase of this technology, with low-hanging efficiency gains left and right. These models consume many orders of magnitude more energy than a human brain; that alone suggests plenty of room for improvement.
The right question: is there a law in information theory that fundamentally prevents a 70B model of any architecture from being as smart as Opus 4.7?
The OP said "as capable as the frontier cloud models are today", which might assume model improvements that do more with less. Opus 4.7/GPT-5.5 performance might be achievable with a fraction of the parameters.
Exactly. I also feel like being able to choose a model per use case would be worth exploring. Instead of trying to squeeze all kinds of knowledge into a single model, even if it's MoE, just focus models on use cases. I bet double-digit-billion-parameter models would be enough for that, with the same or even better performance.
Opus and GPT are generic LLMs with knowledge on all sorts of topics. For specific use cases you probably don't need all the parameters. Suppose you want to generate code with opencode: what part of the generic LLM is needed and what parts can be removed?
As far as I can tell Minimax M2.7 is better than anything available a year ago, but it runs on an ordinary PC. Will that continue? Not sure, but the trend has continued for the last two years and I don't know of any fundamental limits the models are approaching.
I wish more people were aware of this. So much of the current optimism is based on "it doesn't matter if companies are raising prices, since I'm just going to run the model locally" - and that doesn't fly.
> A Opus 4.7/Gpt5.5 class model is 5 trillion parameters.
Or so they say.
If it's true, then that just shows how far the cloud providers are lagging while wasting investor money.
(There's a huge amount of diminishing returns in increasing parameter counts and the intelligent AI company should be hard at work figuring out the optimal count without overfitting.)
Doing that will only be possible with something like better 3D NAND flash memory; it needs new hardware. People are already trying to bring that to market. I contemplated taking a compiler position at such a company.
HBF (high-bandwidth flash) is a non-starter: it runs way too hot compared to DRAM (which only pays for refresh at idle) for the same memory traffic. It only helps for extremely sparse MoE models - probably sparser than we're seeing today.
> A Opus 4.7/Gpt5.5 class model is 5 trillion parameters[1].
You could run it on a cluster of nodes that each do some mix of fetching parameters from disk and caching them in RAM. Use pipeline parallelism to minimize network bandwidth requirements given the huge model size. Time to first token may be a bit slow, but sustained inference should achieve enough throughput for a single user. That's a costly setup, of course, but it doesn't cost $900k.
True, but a cluster built on pipeline parallelism can naturally stream from multiple SSDs in parallel. That probably makes offload somewhat more effective. And RAM caching is also naturally available.
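A back-of-envelope sketch of what such a cluster could sustain. Every figure here is an illustrative assumption (node count, striped-NVMe bandwidth, MoE sparsity), not a spec of any real system:

```python
# Sustained decode throughput for a pipeline-parallel cluster that
# streams a sparse MoE's active experts from local NVMe at each stage.
total_params_t = 5.0      # assumed: 5T-parameter MoE (per the quote above)
active_fraction = 0.02    # assumed: ~2% of weights active per token
bytes_per_param = 0.5     # 4-bit quantization
nodes = 8                 # assumed: one pipeline stage per node
node_stream_gb_s = 12.0   # assumed: a few striped NVMe drives per node

active_gb = total_params_t * 1e3 * active_fraction * bytes_per_param  # 50 GB
stage_gb = active_gb / nodes               # ~6.3 GB streamed per stage/token
stage_s = stage_gb / node_stream_gb_s      # ~0.52 s per stage

# With several tokens in flight, stages overlap: sustained throughput is
# bounded by one stage, while time-to-first-token pays for all of them.
print(f"{1 / stage_s:.1f} tok/s sustained")        # ~1.9 tok/s
print(f"{nodes * stage_s:.1f} s to first token")   # ~4.2 s
```

RAM caching of hot experts would raise both numbers; the point is that even pure SSD streaming lands in the low-single-digit tok/s range mentioned elsewhere in the thread.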
I did this calculation a while ago and don't think frontier models are just a few MacBook Pro generations away. Yes, numbers reliably go up in tech in general, but semiconductors and standards specifically have long lead times and published roadmaps, so we can be fairly confident about what we're getting even 3-4 years out, in terms of both transistor density and RAM speeds.
In mid-2028 we get N2E/N2P, with around 15% greater transistor density than today's N3P, and by EOY 2028 we'll likely have A14, with a roughly 35-40% density improvement.
Meanwhile, we'll be on LPDDR6 by that point, which takes M-series Pro chips from 307GB/s -> ~400GB/s, and Max chips from 614GB/s -> ~800GB/s.
Model improvements will obviously help, but on the raw hardware front these numbers aren't in the ballpark of frontier-model serving hardware. An H100 has 3TB/s of memory bandwidth, fwiw.
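For reference, those bandwidth figures fall out of bus width times transfer rate; the LPDDR6 data rate below is my assumption from published roadmaps, not a confirmed Apple spec:

```python
# Memory bandwidth = (bus width in bytes) x (transfer rate).
def mem_bw_gb_s(bus_bits: int, mt_per_s: int) -> float:
    return bus_bits / 8 * mt_per_s / 1000

print(mem_bw_gb_s(256, 9600))    # ~307 GB/s: Pro-class, LPDDR5X-9600
print(mem_bw_gb_s(512, 9600))    # ~614 GB/s: Max-class
print(mem_bw_gb_s(256, 12800))   # ~410 GB/s: Pro on LPDDR6 (assumed rate)
print(mem_bw_gb_s(512, 12800))   # ~819 GB/s: Max on LPDDR6 (assumed rate)
```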
What do you need 3 TB/s of memory bandwidth for in a single-user context? DeepSeek V4 pro (the latest near-SOTA model) has about 25 GB of active parameters (it uses an FP4 format for most layers), which gives 12 tok/s on a 307 GB/s platform with memory bandwidth as the bottleneck - maybe a bit less once you account for KV cache reads. That's not great, but it's not terrible either for a pro-quality model. Of course, that totally ignores RAM limits, which are the real issue at present: limited RAM forces you to fetch at least some fraction of the parameters from storage, which, while relatively fast, is nowhere near RAM speed, so your real tok/s are far lower (about 2 for a broadly similar model on a top-end M5 Pro laptop).
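The decode arithmetic in that comment, spelled out (this ignores KV-cache reads and compute, as noted; the 800 GB/s figure is the LPDDR6 projection from upthread):

```python
# Decode is memory-bandwidth bound: each generated token streams the
# active weights from RAM once.
def decode_toks_per_s(mem_bw_gb_s: float, active_weights_gb: float) -> float:
    return mem_bw_gb_s / active_weights_gb

print(decode_toks_per_s(307, 25))  # ~12 tok/s: today's Pro-class laptop
print(decode_toks_per_s(800, 25))  # ~32 tok/s: a projected LPDDR6 Max
```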
So long as you don't require deep search grounding - massive web indexes or document stores that are hard to reproduce locally. You can do local agentic things that get close, or even do better depending on search strategy, but in theory a massive cloud service with huge data stores at hand should produce better results.
In practice, unless you're doing some kind of deep-research thing with the cloud, it'll optimize mostly for time and give you a good-enough answer rather than spending an hour or two. An hour of cloud searching with huge data stores is presumably not equivalent to an hour of local agentic searching.
I think that problem will improve a little in the coming years as optimized data curation matures, but the information world will keep growing, so the advantage will likely remain with the centralized services - as long as they actually offer their complete potential rather than a fraction of it.
Also, the cloud models don't all have to be the best frontier models, and you don't need to hit the benchmark of shrinking Opus 4.7 onto a single MBP to make significant improvements. If an Opus 4.7-equivalent model can run on $250k of datacenter capex (with correspondingly reduced opex for power and cooling), that's a massive cost improvement that makes the cloud models cheaper. And for most consumers that'll probably be good enough. You don't need to run on a $5k laptop to make a big difference.
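To make that concrete, here's a toy amortization model. Only the two capex figures come from the thread; every other input (lifetime, power draw, electricity price, throughput) is a placeholder assumption:

```python
# Toy serving-cost model: capex amortization plus power, per million tokens.
def usd_per_m_tokens(capex_usd: float, amortize_years: float, kw: float,
                     usd_per_kwh: float, toks_per_s: float) -> float:
    hours = amortize_years * 365 * 24
    usd_per_hour = capex_usd / hours + kw * usd_per_kwh
    return usd_per_hour / (toks_per_s * 3600) * 1e6

# Hypothetical frontier-class node vs. a shrunk equivalent:
print(usd_per_m_tokens(900_000, 3, 10, 0.10, 500))  # ~$19.6 / M tokens
print(usd_per_m_tokens(250_000, 3, 4, 0.10, 500))   # ~$5.5 / M tokens
```

Even with identical throughput, the cheaper node cuts cost per token by roughly 3.5x under these assumptions, which is the "good enough for most consumers" argument in numbers.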
Apple is basically in the same boat as AMD and Intel. They have a weak, raster-focused GPU architecture that doesn't scale to 100B+ inference workloads and especially struggles with large-context prefill. TPUs smoke them on inference, and Nvidia hardware is far and away more efficient for training.
The GPUs are bottom-of-the-barrel for compute-focused industries. It's mobile-grade hardware that arguably can't even scale to prior Mac Pro workloads.
> The GPU is monstrously good. Depending on the workload, the M1 series GPU using 120W could beat an RTX 3090 using 420W.
You're just listing the max TDP of both chips. If you limited a 3090 to 120W, it would still run laps around an M1 Max in several workloads, despite being an 8nm GPU versus a 5nm one.
> It is kind of sad Apple neglects helping developers optimize games for the M-series
Apple directly advocated internally for ports like Death Stranding, Cyberpunk 2077, and Resident Evil. Advocacy and optimization are not the issue; Apple's obsession with reinventing the wheel with Metal is what puts the Steam Deck ahead.
Edit (response to matthewmacleod):
> Bold of them to reinvent something that hadn't been invented yet.
Vulkan was not the first open graphics API, as most Mac developers will happily inform you.
I'm confused that anyone ever thought the NPU was a good idea. The GPU is almost always underutilized on a Mac and could have done the brunt of the inference work if the architecture had embraced GPGPU principles from the start. Creating a dedicated hardware block to alleviate a theoretical congestion issue is... bewildering. That goes for most NPUs I've seen.
Apple had the technology to scale down a GPGPU-focused architecture just like Nvidia did. They had the money to take that risk, and the chip-design chops to take a serious stab at it. On paper, they could even have extended it down to iPhone-class edge silicon, similar to what Nvidia did with the Jetson and Tegra SoCs.
I think they built the NPU with whatever models they needed to run on the iPhone in mind vs trying to build a general purpose chip, and then got lucky it was also useful for LLMs.
(Like “I want to do object detection for cutting people into stickers on device without blowing a hole in the battery, make me a chip for that”.)
I'm not sure even Apple thought that, given that they don't officially provide access to the ANE internals under macOS (barring unsupported hacks). But if that were fixed, it could be useful for improving the power efficiency of prefill, where the CPU/GPU hardware is quite weak (especially prior to the M5's Neural Accelerators).
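For context, the closest a third party gets today is asking Core ML nicely. A minimal sketch, assuming you've already converted some model to an .mlpackage (the model path and input name here are hypothetical):

```python
import numpy as np
import coremltools as ct

# There is no public ANE API: you load a Core ML model and *request*
# the Neural Engine; the framework decides which layers actually run there.
model = ct.models.MLModel(
    "prefill_block.mlpackage",                # hypothetical converted model
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer CPU + Neural Engine
)
out = model.predict({"tokens": np.zeros((1, 4096), dtype=np.int32)})
```

You can only request the ANE, not require it, which is exactly the "no official access to internals" complaint above.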
I very recently ran the numbers on these GPUs for an upcoming blog post. The token generation performance is bad, but the prefill performance is _really_ bad.
For a Qwen 3.6 35B / 3B MoE at 4-bit quant:
- parsing a 4k-token prompt on an M4 MacBook Air takes 17 seconds before a single token is generated.
- on an M4 Max Mac Studio, it's 2.3 seconds.
- on an RTX 5090, it's 142ms.
The RTX 5090 uses more power than an M4 Max Mac Studio, but not 16x more.
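The gap makes sense if you treat prefill as compute-bound, at roughly 2 FLOPs per active parameter per prompt token. Backing out the effective throughput each machine achieved on the measurements above (the 2x-params FLOP estimate is a standard approximation, assumed here):

```python
# Effective TFLOPS implied by the prefill times measured above,
# assuming ~2 * active_params FLOPs per prompt token (3B-active MoE).
def effective_tflops(prompt_tokens: int, active_params_b: float,
                     seconds: float) -> float:
    return 2 * active_params_b * 1e9 * prompt_tokens / seconds / 1e12

print(effective_tflops(4096, 3, 17.0))   # M4 MacBook Air: ~1.4 TFLOPS
print(effective_tflops(4096, 3, 2.3))    # M4 Max:         ~10.7 TFLOPS
print(effective_tflops(4096, 3, 0.142))  # RTX 5090:       ~173 TFLOPS
```

So prefill tracks raw compute rather than memory bandwidth, which is why the 5090's lead looks so much more lopsided here than in decode benchmarks.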
Apple has always been able to sell their stuff as somehow magic. Remember the megahertz myth? Apple hertzes and Apple bytes are much better than PC hertzes and bytes, because they are made by virgin elves during a full moon.
> Apple hertzes and apple bytes are much better than PC hertzes and bytes because they are made by virgin elves during a full moon.
The thing Apple has always been excellent at is efficiency - even during the Intel era, MacBooks outclassed their Windows peers. Same CPU, same RAM, same disks, so it definitely wasn't the hardware. It was the software that allowed Apple to pull much more real-world performance out of the same clock cycles and power usage.
Windows itself, and especially third-party drivers, is disastrous when it comes to code quality, and it has to be much more generic (and thus inefficient) compared to Apple with its very small number of SKUs. Apple insisted on writing all the drivers, and IIRC even most of the firmware for embedded modules, themselves to achieve that tight control... which was (in addition to the 2010-ish lead-free Soldergate) why they fired NVIDIA from making GPUs for Macs - NV no longer wanted to give Apple the specs to write drivers.
> NV no longer wanted to give Apple the specs to write drivers.
I think that was a reasonable stance, considering Nvidia's growing commitment to CUDA and other GPGPU paradigms. Apple, backing OpenCL, had every reason to break Nvidia's code and ship half-baked drivers. They did it with AMD's GPUs later down the line, pretending like Vulkan couldn't be implemented so they could promote Metal.
Apple wouldn't have made GeForce more efficient with their own firmware; they would have hung a Sword of Damocles over Nvidia's head.
> They did it with AMD's GPUs later down the line, pretending like Vulkan couldn't be implemented so they could promote Metal.
It was even worse than that: they just stopped updating OpenGL for years before either Vulkan or Metal existed at all. Taking a MacBook and using Boot Camp would instantly raise the GPU feature level by several generations, just because Apple's GPU drivers were so fucking old and outdated.
On Geekbench 5, the M1 hits 483 FPS and the RTX 3090 hits 504 FPS.
There are other workloads where the M1 actually beats the 3090.
Apple does plenty of hyping but it's always cute when irrational haters like you put them down. The M1 was (well, is) a marvel and absolutely smokes a 3090 in perf per watt.
Which Geekbench 5 FPS are you talking about? Geekbench only has OpenCL and Vulkan scores for the 3090 as far as I can tell, and the M1 Ultra gets less than half the 3090's OpenCL score. And the M1 Ultra was significantly more expensive.
Find or link these workloads you think exist, please
> The M1 was (well, is) a marvel and absolutely smokes a 3090 in perf per watt.
The GTX 1660 also smokes the 3090 in perf per watt. Being more efficient while being dramatically slower is not exactly an achievement; it's pretty typical power-consumption scaling, in fact. Perf per watt is only meaningful if you can also match the perf itself - that's what actually made the M1 CPU notable. The M-series GPUs (not just the M1, but even the latest) haven't managed to match or even come close, so being more efficient is no different from, say, Nvidia, AMD, or Intel mobile GPU offerings. Nice for laptops, insignificant otherwise.
Here you go[0]: 'Aztec Ruins offscreen'. Although I misremembered the exact FPS - the 3090 is at 506 FPS.
Also note how the M1 Ultra pushes 2/3 the FPS of the 3090 despite 1/3 the power budget, with the benchmark itself being poorly optimized for the M-series architecture.
And here[1] you have it smoking an Intel i9 12900K + RTX 3090. The difference doesn't look too impressive until you realize the power envelope for that build is 700-800W.
Also, the GTX 1660 (technically the same generation as the RTX 2000 series, but whatever) is about 26% less efficient than a 3090[2].
> Being more efficient while being dramatically slower
That's my whole point and what you're refusing to see. The M1 is not dramatically slower than an i9 or 3090 despite having dramatically lower power use.
The proof will really start to come once Qualcomm and MediaTek get a handle on their PC ARM chips and Valve decides they're good enough for a Steam Deck 2 or 3. You'll get 2-3x the battery life along with a modest performance increase.
> Here you go[0]: 'Aztec Ruins offscreen'. Although I misremembered the exact FPS - the 3090 is at 506 FPS.
Oh, GFXBench, not Geekbench.
Realistically, that 506 FPS result is probably CPU-bottlenecked - not that Aztec Ruins is all that relevant. It's a very old benchmark, released in 2018 and designed for mobile GPUs, so realistically it's using a 2010-ish GPU feature set.
If that's your use case, great. But it's not significant at all.
> And here[1] you have it smoking an Intel i9 12900K + RTX 3090.
It's not using the GPU, so it's irrelevant. It's also not drawing 700-800W.
> Also, the GTX 1660 (technically the same generation as the RTX 2000 series, but whatever) is about 26% less efficient than a 3090[2].
"bestvaluegpu" I've never heard of but holy AI slop nonsense batman. Taking 3dmark score and dividing it by TDP is easily one of the worst ways to compare possible.
Apple is in a much better boat than AMD or Intel. They have a gigantic war chest and can just snap up whoever looks like a leader coming out of the bubble burst.
It's becoming increasingly clear that there is no moat in models. The winners will be the ones with existing products and ecosystems they can tie AI into. You will pay Adobe for credits because that will be the only AI that works in Photoshop; you will pay Microsoft because only theirs will work in your Microsoft cloud apps.
OpenAI has nothing. Their tech will rapidly be devalued by free models the moment they stop lighting stacks of cash on fire.
I kind of agree with you at this point. When ChatGPT was rapidly gaining popularity, I thought they would eventually replace search (especially for shopping), which would have given them huge ad revenue. Maybe they could even have tried social networking, e.g., helping you sort through the huge flow of information that today's social networks are and get to the important/rewarding/whatever posts. But now ChatGPT is getting commoditized. I would even dare say Gemini feels a bit better to me now, so the search route for ChatGPT is clearly gone.
The parent post was arguing that they can do this now because they are lighting stacks of cash on fire. And once they stop doing that, their LLM lead will be gone in a hurry. They appear to not have a moat, like other more established players do.
Counterpoint: Apple's opportunity to invest in GPGPU architectures was ~2017, when Apple Silicon was in its design stages. Apple always knew they were surrendering a large market segment by deprecating CUDA and OpenCL; they just never knew how big that segment would get. Their liquid cash is wasted mid-bubble, and waiting for the pop is no longer a realistic timeline.
Arguably, Apple isn't even in a boat right now. At least AMD and Intel both ship hardware that slots into CUDA-centric infrastructure - Apple jumped off that ship, and their hardware doesn't even come up in infrastructure discussions where AMD, Intel, and Nvidia are taken for granted.
They also degrade their own direct services with little warning or thought put into change management, so, to be fair, Apple may be getting the same quality of service as the rest of us.
I think that's just how Google is by nature. They don't intentionally degrade their services; they just aren't a customer-centric company. They run on numbers. As a corporation, that doesn't really encourage support and maintenance work either.
Indeed. I'm wondering if Apple "missing the train" on AI ended up being a blessing for them. Not only the Google deal, but there are also a lot of people doing interesting stuff locally.
That feels like it might create perverse incentives if it were formally tracked. On the other hand, as a low-level grunt, I would enjoy having some quantitative metrics on the topic. Who's going to act like this is a game and try to get the high score?
Depending on how it was implemented, it might also make it possible to unblind people's salaries. A big no-no for a big corporation where there might be laughable pay disparity.
Sure, if you can afford to pass on a chunk of customers. I've canceled a purchase at checkout more than once when I saw there was no PayPal available, if the website was unknown or looked a little shady and I didn't desperately need the item. There are people who won't buy at all if there's no PayPal, just because it's less convenient.
This. Also, remember that from the consumer standpoint, PayPal was the first ever trusted payment processor that didn't pass your payment account info (bank, CC#, debit card info) along to the vendor. Granted, they passed along your email+shipping address. But the vendor would have had that info anyhow if you were purchasing some physical item from them.
So there's a large swath of the consumer population that views PayPal positively and will skip a purchase if there's no PayPal option.