Does anyone know or have a guess on the size of this latest thinking models and what hardware they use to run inference? As in how much memory and what quantization it uses and if it's "theoretically" possible to run it on something like Mac Studio M3 Ultra with 512GB RAM. Just curious from theoretical perspective.
To generate one token, all active parameters must pass from memory to the processor (disregarding tricks like speculative decoding)
Multiplying 1748 tokens per second with the 5.1B parameters and 4 bits per parameter gives us a memory bandwidth of 4457 GB/sec (probably more, since small models are more difficult to optimize).
If we divide the memory bandwidth by the 57.37 tokens per second for Claude Opus 4.5, we get about 80 GB of active parameters.
With speculative decoding, the numbers might change by maybe a factor of two or so. One could test this by measuring whether it is faster to generate predictable text.
Of course, this does not tell us anything about the number of total parameters. The ratio of total parameters to active parameters can vary wildly from around 10 to over 30:
120 : 5.1 for gpt-oss-120b
30 : 3 for Qwen3-30B-A3B
1000 : 32 for Kimi K2
671 : 37 for DeepSeek V3
Even with the lower bound of 10, you'd have about 800 GB of total parameters, which does not fit into the 512 GB RAM of the M3 Ultra (you could chain multiple, at the cost of buying multiple).
Thanks! That's a great way to analyze it by comparing to open source models. Though I wonder if they use the same hardware for gpt-oss-120b and Claude Opus.
That all depends on what you consider to be reasonably running it. Huge RAM isn’t required to run them, that just makes them faster. I imagine technically all you'd need is a few hundred megabytes for the framework and housekeeping, but you’d have to wait for the some/most/all of the model to be read off the disk for each token it processes.
None of the closed providers talk about size, but for a reference point of the scale: Kimi K2 Thinking can spar in the big leagues with GPT-5 and such…if you compare benchmarks that use words and phrasing with very little in common with how people actually interact with them…and at FP16 you’ll need 2.9TB of memory @ 256,000 context. It seems it was recently retrained it at INT4 (not just quantized apparently) and now:
“
The smallest deployment unit for Kimi-K2-Thinking INT4 weights with 256k seqlen on mainstream H200 platform is a cluster with 8 GPUs with Tensor Parallel (TP).
(https://huggingface.co/moonshotai/Kimi-K2-Thinking)
“
-or-
“
62× RTX 4090 (24GB) or 16× H100 (80GB) or 13× M3 Max (128GB)
“
But again, that’s for speed. You can run them more-or-less straight off the disk, but (~1TB / SSD_read_speed + computation_time_per_chunk_in_RAM) = a few minutes per ~word or punctuation.
> (~1TB / SSD_read_speed + computation_time_per_chunk_in_RAM) = a few minutes per ~word or punctuation.
You have to divide SSD read speed by the size of the active parameters (~16GB at 4 bit quantization) instead of the entire model size. If you are lucky, you might get around one token per second with speculative decoding, but I agree with the general point that it will be very slow.
Yeah thanks for calling that out. I kind of panicked when I reached that part of the explanation and was stuck on whether or not I should go into dense models vs MoE. The question was about ‘big stuff like that’, which they most certainly use MoE, then I even chose an MoE as an example, but then there are giant dense models like Llama, but that’s not what was asked, although it wasn’t not asked because ‘also big league stuff’…anyway, I basically thought “you’re welcome” and “no problem”, then said “you’re problem”.