Since no one specifically answered your question yet: yes, you should be able to get usable performance. A Q4_K_M GGUF of DeepSeek-R1 is 404GB. This is a 671B-parameter MoE that "only" has 37B active parameters per pass. You'd probably expect something in the ballpark of 20-30 tok/s for text generation (depending on how much of the memory bandwidth can actually be utilized).
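Rough sketch of that napkin math (the ~819 GB/s peak MBW figure and ~0.6 bytes/weight for Q4_K_M are assumptions; token generation at bs=1 is treated as purely bandwidth bound):

    # Napkin math, not a benchmark.
    model_size_gb   = 404        # Q4_K_M GGUF of DeepSeek-R1
    total_params_b  = 671        # total parameters (billions)
    active_params_b = 37         # parameters touched per token (MoE, billions)
    bytes_per_param = model_size_gb / total_params_b        # ~0.60 bytes at Q4_K_M
    gb_read_per_token = active_params_b * bytes_per_param   # ~22 GB read per token

    peak_mbw_gb_s = 819          # Apple's quoted M3 Ultra memory bandwidth
    for eff in (0.5, 0.7, 0.9):  # fraction of peak MBW actually achieved
        print(f"{eff:.0%} of peak -> ~{eff * peak_mbw_gb_s / gb_read_per_token:.0f} tok/s")
    # prints roughly 18, 26, 33 tok/s, hence the 20-30 tok/s ballpark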
From my napkin math, the M3 Ultra's TFLOPS figure is still relatively low (around 43 FP16 TFLOPS?), but it should be more than enough to handle bs=1 token generation (which should be way under 10 FLOPs/byte). Now, as far as its prefill/prompt processing speed goes... well, that's another matter.
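Sketch of that arithmetic-intensity argument (rule-of-thumb numbers, not measurements: ~2 FLOPs per active parameter per token, ~0.6 bytes/parameter of weight reads at Q4, and the 43 TFLOPS / 819 GB/s figures from above):

    active_params   = 37e9
    flops_per_token = 2 * active_params            # ~74 GFLOP per generated token
    bytes_per_token = 0.6 * active_params          # ~22 GB of weight reads per token
    intensity = flops_per_token / bytes_per_token  # ~3.3 FLOPs/byte

    fp16_tflops = 43                               # ballpark GPU FP16 throughput
    mbw_gb_s    = 819
    machine_balance = fp16_tflops * 1e12 / (mbw_gb_s * 1e9)   # ~52 FLOPs/byte
    # intensity (~3.3) << machine balance (~52): bs=1 decode is bandwidth bound,
    # not compute bound. Prefill batches many tokens at once, so it flips to
    # compute bound, which is why prompt processing is the weak spot.
    print(intensity, machine_balance)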
I actually think it’s not a coincidence and they specifically built this M3 Ultra for DeepSeek R1 4-bit. They also highlight in their press release that they tested it with 600B class LLMs (DeepSeek R1 without referring to it by name). And they specifically did not stop at 256 GB RAM to make this happen. Maybe I’m reading too much into it.
Pretty sure this has absolutely nothing to do with Deepseek, or even local LLM at large, which has been a thing for a while and has been an obvious use case since the original Llama leak and llama.cpp came around.
Fact is, Mac Pros in the Intel days supported 1.5TB of RAM in some configurations[1], and that was the expectation of their high-end customer base 6 years ago. They needed to address the gap for those customers, so they would have shipped such a product regardless. Local LLM is a cherry on top. Deepseek in particular almost certainly had nothing to do with it. They will still need to double the RAM supported by their SoCs to get there. Perhaps in a Mac Pro or a different quad-Max-glued chip.
I understand why they are excited about it; just pointing out that it is a happy coincidence. They would have, and should have, made such a product to address the needs of plain RAM users, not VRAM users in particular, before they could credibly cut macOS releases for Intel.
RIP Jay Miner who watched his unified memory daughters Agnus, Denise and Paula be slowly murdered by Jack Tramiel's vengeance against Irving Gould. [Why couldn't the shareholders have stormed their boardroom 180 days before the company ran out of cash, installed interim management who, in turn, would have brought back the megalomaniac Founder that would, until his dying breath, keep spreading their cash to the super brilliant geniuses that made all the magic chips happen and then turn the resulting empire over to ops people to make their workplace so uncomfortable they all retire early and live happily ever after on tropical islands and snowy mountain tops?]
Yep! Though one could argue the Amiga wasn't true unified memory due to the chip RAM limitations. Depending on the Agnus revision, you'd be limited to 512K, 1 MB, or 2 MB max of RAM addressable by the custom chips ("chip RAM").
fun fact: M-series that are configured to use more than 75% of shared memory for GPU can make the system go boom...something to do with assumptions macOS makes that can be fixed by someone with a "private key" to access kernel mode (maybe not a hardware limit).
That or it's the luckiest coincidence! In all seriousness, Apple is fairly consistent about not pushing specs that don't matter and >256GB is just unnecessary for most other common workloads. Factors like memory bandwidth, core count and consumption/heat would have higher impact.
That said, I doubt it was explicitly for R1, but rather based on where the industry was a few years ago, when GPT-3's 175B was SOTA but everyone was still looking larger. "As much memory as possible" is the name of the game for AI in a way that's not true for other workloads. It may not be true for AI forever either.
The high end Intel Macs supported over a TB of RAM, over 5 years ago. It's kinda crazy Apple's own high end chips didn't support more RAM. Also, the LLM use case isn't new... Though DeepSeek itself may be. RAM requirements always go up.
Just to clarify. There is an important difference between unified memory, meaning accessible by both CPU and GPU, and regular RAM that is only accessible by CPU.
As mentioned elsewhere in this thread, unified memory has existed long before Apple released the M1 CPU, and in fact many Intel processors that Apple used before supported it (though the Mac pros that supported 1.5TB of RAM did not, as they did not have integrated graphics).
The presence of unified memory does not necessarily make a system better. It’s a trade off: the M-series systems have high memory bandwidth thanks to the large number of memory channels, and the integrated GPUs are faster than most others. But you can’t swap in a faster GPU, and when using large LLMs even a Mac Studio is quite slow compared to using discrete GPUs.
Design work on the Ultra would have started 2-3 years ago, and specs for memory at least 18 months ago. I’m not sure they had that kind of inside knowledge for what Deepseek specifically was doing that far in advance. Did Deepseek even know that long ago?
Don't they build these Macs just-in-time? The bandwidth doesn't change with the RAM, so surely it couldn't have been that hard to just... use higher capacity RAM modules?
"No chance?" But it has been reported that the next generation of Apple Silicon started production a few weeks ago. Those deliveries may enable Apple to release its remaining M3 Ultra SKUs for sale to the public (because it has something Better for its internal PCC build-out).
It also may point to other devices ᯅ depending upon such new Apple Silicon arriving sooner, rather than later. (Hey, I should start a YouTube channel or religion or something. /s)
It's not completely out of the question that the 512gb version of M3 Ultra was built for their internal Apple silicon servers powering Private Compute Cloud, but not intended for consumer release, until a compelling use case suddenly arrived.
I don't _think_ this is what happened, but I wouldn't go as far as to call it impossible.
The scenario is that the 512gb M3 Ultra was validated for the Mac Studio, and in volume production for their servers, but a business decision was made to not offer more than a 256gb SKU for Mac Studio.
I don't think this happened, but it's absolutely not "literally impossible". Engineering takes time, artificial segmentation can be changed much more quickly.
This change is mostly just using higher density ICs on the assembly line and printing different box art with a SKU change. It does not take much time, especially if they had planned it as a possible product just in case management changed its mind.
That's absurd. Fabbing custom silicon is not something anybody does for a few thousand internal servers. The unit economics simply don't work. Plus, Apple is using OpenAI to provide its larger models anyway, so the need never even existed.
My thoughts too. This product was in the pipeline maybe 2-3 years ago. Maybe with LLMs getting popular a year ago they tried to fit more memory, but it's almost impossible to do that so close to a launch. Especially when the memory is fused on, not just a module you can swap.
Your conclusion is correct but to be clear the memory is not "fused." It's soldered close to the main processor. Not even a Package-on-Package (two story) configuration.
I think by "fused" I meant it's stuck onto the SoC module, not part of the SoC die as I may have worded it. While you could maybe still add DRAM packages later in the manufacturing process, it's probably not easy, especially if you need more packages and a larger module, which might cause more design problems. The DRAM sits close because the memory controller is in the SoC, so the memory controller would probably also change with higher memory sizes, which means this couldn't have been a last-minute change.
An M3 Ultra is two M3 Max dies connected via Apple's UltraFusion fabric, so physics.
Did not mean to shit on anyone's parade, but it's a trap for novices, with the caveat that you reportedly can't buy a GB10 until "May 2025" and the expectation that it will be severely supply constrained. For some (overfunded startups running on AI monkey code? Youtube Influencers?), that timing is an unacceptable risk, so I do expect these things to fly off the shelves and then hit eBay this Summer.
Any ideas on power consumption? I wonder how much power that would use. It looks like it would be more efficient than everything else that currently exists.
I would be curious what context window size could be expected when generating a ballpark 20 to 30 tokens per second using DeepSeek-R1 Q4 on this hardware?
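For the capacity side of that question (not the speed side), here's a rough sketch assuming an MLA-style compressed KV cache as described in the DeepSeek-V3 paper (512-dim latent + 64-dim RoPE keys, 61 layers, 2 bytes/element); llama.cpp's actual cache layout may differ and can be considerably larger:

    kv_bytes_per_token = (512 + 64) * 61 * 2      # ~70 KB per token of context
    free_ram_gb = 512 - 404                       # RAM left after the Q4_K_M weights
    max_context = free_ram_gb * 1e9 / kv_bytes_per_token
    print(f"~{kv_bytes_per_token/1024:.0f} KB/token -> ~{max_context/1000:.0f}K tokens of headroom")
    # capacity-wise well past 128K; how fast it stays at long context is a
    # separate (prefill/attention) question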
I think this framing isn't quite right either. DeepSeek's R1 isn't very different from what OpenAI has already been doing with o1 (and that other groups have been doing as well). As for distilling: the R1 "distilled" models they released aren't even proper (logit) distillations, just SFTs, so not fundamentally new at all. But it's great that they published their full recipes, and it's also great to see that it's effective. In fact, we've now seen with LIMO and s1/s1.1 that even as few as 1K reasoning traces can get most LLMs to near-SOTA on math benchmarks. This mirrors the "Alpaca" moment in a lot of ways (and you could even directly mirror, say, LIMO w/ LIMA).
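To make the "SFT, not proper distillation" point concrete, here's a generic PyTorch sketch of the distinction (not DeepSeek's actual training code):

    import torch
    import torch.nn.functional as F

    # Plain SFT: cross-entropy on the teacher's sampled tokens. This is what the
    # released R1 "distilled" models reportedly are: supervised fine-tuning on
    # reasoning traces, with no access to teacher logits.
    def sft_loss(student_logits, target_ids):
        return F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                               target_ids.view(-1))

    # Proper logit distillation: match the teacher's full token distribution via
    # a temperature-softened KL divergence.
    def logit_distill_loss(student_logits, teacher_logits, T=2.0):
        s = F.log_softmax(student_logits / T, dim=-1)
        t = F.softmax(teacher_logits / T, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * (T * T)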
I think the main takeaway of GPT-4.5 (Orion) is that it basically puts the "hit a wall" talk from the end of last year into perspective. Here we have a model that has been trained on, by many accounts, 10-100X the compute of GPT-4 and is likely several times larger in parameter count, but is only... subtly better, certainly not super-intelligent. I've been playing around w/ it a lot the past few days, both with several million tokens' worth of non-standard benchmarks and by talking to it, and it is better than previous GPTs (in particular, it makes a big jump in humor), but I think it's clear that the "easy" gains in the near future are going to come from figuring out how as many domains as possible can be approximately verified/RL'd.
As for the release? I suppose they could just have kept it internally for distillation/knowledge transfer, so I'm actually happy that they released it, even if it ends up not being a really "useful" model.
It's great to see vLLM getting faster/better for DeepSeek. I tested vLLM vs SGLang a couple weeks ago and SGLang's DeepSeek support was much better/faster (on 2 x p5 H100 nodes). It's great that no one's standing still, I saw this recent AMD article that reported SGLang perf on MI300X has increased by 4X over the past couple weeks: https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR...
(w/ the extra memory V3/R1 fits on a single MI300X or H200 node)
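Quick capacity check on that, using public per-GPU memory specs (weights only; KV cache and activation overhead ignored, V3/R1 at 671B params served in FP8):

    weights_gb = 671
    nodes = {
        "8x H100 80GB":    8 * 80,    # 640 GB  -> too small, hence 2 x p5 nodes
        "8x H200 141GB":   8 * 141,   # 1128 GB -> fits
        "8x MI300X 192GB": 8 * 192,   # 1536 GB -> fits comfortably
    }
    for name, cap_gb in nodes.items():
        print(f"{name}: {cap_gb} GB -> {'fits' if cap_gb > weights_gb else 'does not fit'}")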
It'll be interesting to see if either project can take advantage/get any benefits from this FlashMLA implementation.
I've been using GenSpark.ai for the past month to do research (its agents usually run for ~20 minutes, but I've seen them go up to almost 2 hours on a task). It uses a Mixture-of-Agents approach with GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, and searches hundreds of sources.
I reran some of these searches and I've so far found OpenAI Deep Research to be superior for technical tasks. Here's one example:
I've been giving Deep Research a good workout, although I'm still mystified as to whether switching between the different base models matters, besides o1 pro always seeming to fail to execute the Deep Research tool.
Yeah, it seems to not be able to execute the tool calling properly. Maybe it's a bad interaction w/ its own async calling ability or something else (eg, how search and code interpreter can't seem to run at the same time for 4o).
Last fall I built a new workstation with an EPYC 9274F (24C Zen4 4.1-4.3GHz, $2400), 384GB 12 x 32GB DDR5-4800 RDIMM ($1600), and a Gigabyte MZ33-AR0 motherboard. I'm slowly populating with GPUs (including using C-Payne MCIO gen5 adapters), not focused on memory, but I did spend some time recently poking at it.
I spent extra on the 9274F because of some published benchmarks [1] showing the 9274F hitting STREAM TRIAD results of 395 GB/s (on 460.8 GB/s of theoretical peak memory bandwidth). Sadly, my results have been nowhere near that. I did testing with LIKWID, Sysbench, and llama-bench, and even w/ an updated BIOS and NUMA tweaks, I was getting <1/2 the Fujitsu benchmark numbers:
Assuming that you populated the channels correctly, which I believe you did, I can only think that this issue could be related to the motherboard itself or RAM. I think you could start by measuring the single-core RAM bandwidth and latency.
Since the CPU is clocked quite high, the figures you should be getting are, I'd guess, around ~100 ns (probably a bit less) of latency and 40-ish GB/s of BW. If those figures don't match, then it could be a motherboard (HW) issue, a BIOS (SW) issue, or a RAM stick issue.
If those figures closely match then it's not a RAM issue but a motherboard (BIOS or HW) and you could continue debugging by adding more and more cores to the experiment to understand at which point you hit the saturation point for the bandwidth. It could be a power issue with the mobo.
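If you want a second opinion on the single-core number without mlc/likwid, a crude sketch along these lines works (numpy's copy is a single-threaded memcpy; it counts read + write traffic and ignores write-allocate, so treat the result as a ballpark, not a STREAM figure):

    import time
    import numpy as np

    n = 1 << 27                       # 128M float64 = 1 GiB, well past any cache
    src = np.ones(n)
    dst = np.empty_like(src)

    best = float("inf")
    for _ in range(5):
        t0 = time.perf_counter()
        np.copyto(dst, src)           # single-threaded memcpy
        best = min(best, time.perf_counter() - t0)

    gb_moved = 2 * src.nbytes / 1e9   # read src + write dst
    print(f"~{gb_moved / best:.1f} GB/s single-core copy bandwidth")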
Yeah, the channels are populated correctly. As you can see from the mlc-results.txt, the latency looks fine:
mlc --idle_latency
Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --idle_latency
Using buffer size of 1800.000MiB
Each iteration took 424.8 base frequency clocks ( 104.9 ns)
As does the per-channel --bandwidth_matrix results:
I've tried various NUMA configurations (from 1 domain to a per-CCD config) and it doesn't seem to make much difference.
Updating from the board-delivered F14 to the latest 9004 F31 BIOS (the F33 releases bricked the board and required using a BIOS flasher for manual recovery) gave a marginal (5-10%) improvement, but nothing major.
Since I'm not so concerned with CPU inference, I feel like the debugging/testing I've done is... the amount I'm going to do, which is enough to at least characterize, if not fix, the performance.
I might write up a more step-by-step guide at some point to help others but for now the testing scripts are there - I think most people who are looking at theoretical MBW should probably do their own real-world testing as it seems to vary a lot more than GPU bandwidth.
w/ likwid-bench S0:5GB:8:1:2: 129136.28 MB/s. At S0:5GB:16:1:2: 184734.43 MB/s (roughly the plateau; S0:5GB:12:1:2 is 186228.62 MB/s and S0:5GB:48:1:2 is 183598.29 MB/s). According to lstopo, my 9274F has 8 dies with 3 cores each (currently each die is set to its own NUMA domain, i.e. the L3 strategy). In any case, I also gave `numactl --interleave=all likwid-bench -t load -w S0:5GB:48:1:2 -i 100` a spin and topped out in about the same place: 184986.45 MB/s.
Yes, you're correct that your CPU has 8 CCDs, but the bandwidth with 8 threads is already too low. Those 8 cores should be able to get you to roughly half of the theoretical bandwidth; for comparison, 8x Zen 5 cores can reach the ~230 GB/s mark.
Can you repeat the same likwid experiment but with 1, 2, and 4 threads? I'm wondering at what point it begins to deteriorate quickly.
Maybe also worth doing: repeat the 8-thread run but force likwid to pick every third physical core, so that you get a 1-thread-per-CCD experiment setting.
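Something like this would automate the thread-count sweep (a sketch; it assumes likwid-bench is on PATH, that the S0:5GB:&lt;N&gt;:1:2 workgroup string from the runs above is the right one for this box, and that the "MByte/s:" line in the output is parseable this way):

    import re
    import subprocess

    for threads in (1, 2, 4, 8, 12, 16, 24):
        cmd = ["likwid-bench", "-t", "load", "-i", "100",
               "-w", f"S0:5GB:{threads}:1:2"]
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        m = re.search(r"MByte/s:\s+([\d.]+)", out)
        bw_gb_s = float(m.group(1)) / 1000 if m else float("nan")
        print(f"{threads:>2} threads: {bw_gb_s:8.1f} GB/s")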
With `likwid-bench -i 100 -t load -w M0:5GB:1 -w M1:5GB:1 -w M2:5GB:1 -w M3:5GB:1 -w M4:5GB:1 -w M5:5GB:1 -w M6:5GB:1 -w M7:5GB:1` we get 187976.60 MB/s.
Obviously there's a bottleneck somewhere - at 33.5 GB/s per channel you'd expect to get close to 400 GB/s aggregate, but in reality it doesn't even reach half of that. Bad memory controller? A bottleneck on the motherboard? Hard to tell; without swapping hardware, I'm not sure there's much more that can be done to diagnose it.
I see. I'm out of other ideas besides playing with BIOS tweaks for memory and CPU. I can see that there are plenty of them, for better or for worse.
At a quick glance, some of them look interesting, such as "Workload tuning" where you can pick different profiles; there is a "memory throughput intensive" profile. You could also try explicitly disabling the DIMM slots that are not in use, given that you populate only half of them. I wouldn't hold my breath that any of these will make a big difference, but you can give it a try.
Another idea: AFAICS there have been a few memory-bw zen-related bugs reported to likwid and, in particular, https://github.com/RRZE-HPC/likwid/issues/535 may suggest that you could be hitting a similar bug but with another CPU series.
The bug report used AMDuProf to confirm that the bandwidth was actually ~2x what likwid reported. You could try the same.
I have read the R1 paper. My observation is that there is no information whatsoever about how they overcame the limitations of the H800 compared to the H100, which is what the parent article is about. That's the piece I'm curious about.
I will concede that I have not read all their papers or looked through their code, but that's why I asked the question: I hoped someone here might be able to point me to specific places in specific papers instead of an arXiv search.
Give Section 3 of the DeepSeek-V3 paper a read. They discuss their HAI-LLM framework and give a pretty in-depth description of their DualPipe algorithm, including how its pipeline bubbles compare to other pipeline-parallel schedules. They also describe how they work around NVLink limits and tons of other optimizations in extreme depth. The section is 10 pages long, and it's relatively dense, not fluff!
One might argue he's had a pattern for even longer. While he did do some early hypervisor glitching, even his PS3 root key release was basically just applying fail0verflow's ECDSA exploit (fail0verflow didn't release the keys specifically because they didn't want to get sued ... so that was a pretty dick move [1]).
For his projects, I think it's important to look at what he's done that's cool (eg, reversing 7900XTX [2], creating a user-space driver that completely bypasses AMD drivers for compute [3]) and separating it from his (super cringe) social media postings/self-hype.
Still, at the end of the day, here's hoping that someone at AMD realizes that having terrible consumer and workstation support will continue to be a huge albatross/handicap - it cuts them off from basically all academic/research development (almost every ML library and technique you can name or use in production is CUDA-first because of this) and from the non-hyperscaler enterprise market as well. Any dev can get a PO for a $500 Nvidia GPU (or has one in their workstation laptop already). What's the pathway for ROCm? (Honestly, if I were in charge, my #1 priority would be to make sure ROCm is installed and works w/ every single APU shipped, even the 2CU ones.)
Fireworks, Together, and Hyperbolic all offer DeepSeek V3 API access at reasonable prices (and the full 128K context), and none of them will retain/train on user-submitted data. Hyperbolic's pricing is $0.25/M tokens, which is actually pretty competitive with even DeepSeek's "discount" API pricing.
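All three expose OpenAI-compatible endpoints, so trying one is mostly a base-URL change. A hedged sketch (the base URL and model id below are assumptions - check the provider's docs; Hyperbolic shown here):

    from openai import OpenAI

    client = OpenAI(base_url="https://api.hyperbolic.xyz/v1",  # assumed endpoint
                    api_key="YOUR_API_KEY")
    resp = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",                       # assumed model id
        messages=[{"role": "user", "content": "Explain MLA in two sentences."}],
        max_tokens=256,
    )
    print(resp.choices[0].message.content)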
I've done some testing, and if you're inferencing on your own system (2xH100 node, 1xH200 node, or 1xMI300X node), sglang performs significantly better than vLLM on deepseek-v3 (also, vLLM had a stop-token issue for me - not sure if that's been fixed - while sglang did not have output oddities).
But you pay treble damages if you knowingly (vs unknowingly) violate a patent (35 U.S.C. § 284). Of course, everything is patented, so engineers are just told not to read patents.