
Well, that’s disappointing, since the Mac Studio with 128GB is $3,499. If Apple happens to launch a Mac Mini with 128GB of RAM, it would eat the Nvidia Spark's lunch every day.




Only if it runs CUDA; MLX/Metal isn't comparable as an ecosystem.

People who keep pushing Apple gear tend to forget that Apple has decided that what the industry considers industry standards, proprietary or not, isn't made available on its hardware.

Even if Metal is actually a cool API to program for.


It depends what you're doing. I can get valuable work done with the subset of Torch supported on MPS and I'm grateful for the speed and RAM of modern Mac systems. JAX support is worse but hopefully both continue to develop.

CUDA is equally proprietary and not an industry standard though, unless you were thinking of Vulkan/OpenCL, which doesn't bring much in this situation.

Yes, it is an industry standard; there is even a technical term for it.

It is called a de facto standard, which you can check in your favourite dictionary.


CUDA isn't the industry standard? What is then?

Agreed. I also wonder why they chose to test against a Mac Studio with only 64GB instead of 128GB.

Hi, author here. I crowd-sourced the devices for benchmarking from my friends. It just happened that one of my friends has this device.

FYI you should have used llama.cpp to do the benchmarks. It performs almost 20x faster than ollama for the gpt-oss-120b model. Here are some sample results on my Spark:

  ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
  | model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |          pp4096 |       3564.31 ± 9.91 |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |            tg32 |         53.93 ± 1.71 |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |          pp4096 |      1792.32 ± 34.74 |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |            tg32 |         38.54 ± 3.10 |
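
These numbers are from llama-bench; pp4096 is prompt processing (input tokens) and tg32 is token generation (output tokens). A rough sketch of the invocation matching the columns above, with the model path as a placeholder:

  # minimal llama-bench sketch for the ngl/n_ubatch/fa/pp4096/tg32 configuration shown above;
  # the GGUF path is a placeholder for your local copy of the model
  llama-bench -m gpt-oss-120b-mxfp4.gguf \
    -ngl 99 -ub 2048 -fa 1 \
    -p 4096 -n 32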

Is this the full-weight model or a quantized version? The GGUFs distributed on Hugging Face labeled as MXFP4 quantization have layers that are quantized to int8 (q8_0) instead of the bf16 suggested by OpenAI.

For example, looking at blk.0.attn_k.weight, it's q8_0, among other layers:

https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main?s...

For example, the same weight on Ollama is BF16:

https://ollama.com/library/gpt-oss:20b/blobs/e7b273f96360
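
If anyone wants to check a local file themselves, the gguf Python package ships a gguf-dump script that lists each tensor with its quantization type; something like this (the file name is a placeholder for whatever GGUF you downloaded) should show whether blk.0.attn_k.weight is Q8_0 or BF16:

  # list tensor types from a local GGUF and filter for the attention K weights;
  # "gpt-oss-20b.gguf" is a placeholder path
  pip install gguf
  gguf-dump gpt-oss-20b.gguf | grep "attn_k.weight"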


I see! Do you know what's causing the slowdown for ollama? They should be using the same backend...

Dude, ggerganov is the creator of llama.cpp. Kind of a legend. And of course he is right, you should've used llama.cpp.

Or you can just ask the ollama people about the ollama problems. Ollama is (or was) just a Go wrapper around llama.cpp.


Was. They've been diverging.

Now this looks much more interesting! Is the top one input tokens and the second one output tokens?

So 38.54 t/s on 120B? Have you tested filling the context too?



Makes sense you have one of the boxes. What's your take on it? [Respecting any NDAs/etc/etc of course]

Curious how this compares to running on a Mac.

TTFT on a Mac is terrible and only gets worse as the context grows; that's why many are selling their M3 Ultra 512GB.

So, so many… an eBay search shows only 15 results, 6 of them being ads for new systems…

https://www.ebay.com/sch/i.html?_nkw=mac+studio+m3+ultra+512...


Just don't try to run NCCL.

Wouldn't you be able to test NCCL if you had two of these?

What kind of NCCL testing are you thinking about? Always curious what’s hardest to validate in people’s setups.

Not with Mac Studio(s), but yes: multi-host NCCL over RoCE with two DGX Sparks, or over PCIe with one.
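
For the two-node case, the standard nccl-tests binaries over MPI are the usual starting point; a rough sketch, with hostnames and the NIC name as placeholders for your setup:

  # run an NCCL all-reduce sweep across two DGX Sparks over the RoCE link;
  # spark-1/spark-2 and enp1s0f0 are placeholders for your hosts and interface
  mpirun -np 2 -H spark-1:1,spark-2:1 \
    -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=enp1s0f0 \
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1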


