> you can reach ~75% of peak performance for same matrix config Not on the same ...

shihab · 2024-11-12T15:37:53 1731425873

Yes, that's why I was focusing on percentage of peak hardware performance, not actual flops.

ladyanita22 · 2024-11-12T07:47:27 1731397647

That's exactly what I was wondering. That cannot be.

stephencanon · 2024-11-12T13:12:26 1731417146

Probably more relevant here is that a single CPU core on that computer exceeds 1 tflop/s on gemm with plenty of margin using a single lib call, and leaves the rest of the CPU cores and all of the GPU free to do other work.

adrian_b · 2024-11-12T16:41:56 1731429716

Nope, no Apple CPU core has such performance.

That single lib call must have used the AMX accelerator, which is separate from the cores and shared by a group of cores.

So that AMX accelerator performance may be greater than of all CPU cores together. AFAIK, some Apple CPUs have one AMX accelerator for the big cores and another AMX accelerator for the smaller cores, but in any case there is no chance to hope that if you have obtained 1 TFLOP/s when running the program on 1 core you will get much more when running it on multiple cores, because all cores of the same type will use the same shared accelerator.

Const-me · 2024-11-12T17:07:04 1731431224

One interesting thing about these newfangled matrix/AI/ML accelerators that’s very rarely mentioned on the internets, they only deliver that many TFLOP because they operate in very low precision.

nVidia tensor cores support int8, couple versions of FP16 (BF16 and the standard IEEE one) and FP19 which they call TensorFloat-32. I think Intel AMX only supports int8 and BF16.

None of them supports FP32 let alone FP64 input numbers, which makes them completely useless for traditional GEMM stuff.

wtallis · 2024-11-12T17:38:14 1731433094

https://github.com/corsix/amx indicates that Apple's AMX supports up to 64-bit FP, but I don't see any performance metrics. They also have the ANE, which is the low-precision ML-focused accelerator.

Const-me · 2024-11-12T17:56:46 1731434206

I wasn’t aware there’re two completely different things from different companies both called AMX. I assumed that AMX: https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions

The Apple’s version is indeed interesting. I wonder why haven’t Apple exposed it to programmers, or implemented a BLAS library on top of that thing?

wtallis · 2024-11-12T18:19:32 1731435572

> I wonder why haven’t Apple exposed it to programmers, or implemented a BLAS library on top of that thing?

Using the Accelerate framework (which includes Apple's BLAS) is the only supported way for programmers to access the AMX. Reverse engineering the instruction set to access it directly is discouraged, because it's not a documented stable interface.

sgerenser · 2024-11-13T00:31:57 1731457917

It’s because Apple’s AMX was an early, in-house built version of what was eventually released by Arm as SME. Apple adopted SME in the M4 and dropped AMX, but as long as you were using their Accelerate framework instead of directly writing AMX code (which they told people not to do), you wouldn’t notice.

Now that they’re using “standard” SME, it shouldn’t be a problem to write SME assembly opcodes directly, although I suspect Apple themselves is still probably sparse on the documentation. I’m not aware if there’s any way to use intrinsics or something slightly higher level than inline-ASM, but lower level than the Accelerate framework.

stephencanon · 2024-11-12T20:57:52 1731445072

Apple's matrix unit supports FP16, 32, and 64 sources and accumulators.

stephencanon · 2024-11-12T20:44:43 1731444283

Right. So, like I said, using one CPU core, you can exceed 1 TFLOP/s, leaving all the other CPU cores and the GPU free for other work.

adrian_b · 2024-11-13T09:15:22 1731489322

Your initial claim was ambiguous.

It sounded like you claimed that using only one core you already reach 1 TFLOP/s, implying that you could reach more than that by using more cores, which is false.

Now you have clarified that you actually claim that it is good that when using a single core you can reach the maximum throughput of the shared matrix operation accelerator.

This is correct, but there is no essential difference between this and a Zen 5 CPU that reaches this throughput by using only half of the cores, while having the other half of the cores free to do any other tasks.

stephencanon · 2024-11-14T04:08:04 1731557284

What’s the power draw of however many zen 5 cores you have to tie up to hit, say, 1.5tflop/s on sgemm?

(Also, that’s a M2 number, since that’s what OP was talking about. Someone will presumably post M4 benchmarks for BLAS sometime soon, if they haven’t already.)

menaerus · 2024-11-13T09:03:25 1731488605

Top of the line AMD zen5 core can sustain ~80GFLOPS@FP64 and ~160GFLOPS@FP32 using AVX-512, 2x FMA units and ~5Ghz of clock frequency.

This is way way lower than what you claim M2 Pro is capable of and since I'm comparing it against the state-of-the-art datacenter CPU I'm curious how did you get to this number?

M2 Pro core runs at much lower frequency, what it seems to be around ~3.4GHz. And I couldn't find any information about SVE vector widths supported nor number of FMAs.

Const-me · 2024-11-12T13:51:17 1731419477

OP implemented that stuff with WebGPU. In runtime, that weird language compiles into a compute shader which computes that stuff on GPU.

saagarjha · 2024-11-12T15:15:09 1731424509

I believe the point being made was that this could be done in the CPU faster than was achieved here.

Const-me · 2024-11-12T15:53:29 1731426809

Yeah, but not on a single core.

In my desktop computer, I have Ryzen 7 8700G CPU, which has 8 Zen 4 cores, 4.2 GHz base frequency, 65W TDP. Theoretically, when doing FP32 FMA, each CPU core can do 32 FLOP/cycle. At the base frequency, this translates into 134 GFlops per core. You gonna need all 8 cores to achieve 1 theoretical TFlops.

BTW, integrated GPU inside the same 8700G processor can theoretically do 8.2 TFlops FP32.

menaerus · 2024-11-13T09:29:29 1731490169

> Theoretically, when doing FP32 FMA, each CPU core can do 32 FLOP/cycle. At the base frequency, this translates into 134 GFlops per core.

Isn't it that zen4 doesn't have "native" support for AVX-512 but "mimics" it through 2x 256-bit FMA units?

Because of this, a single AVX-512 instruction will occupy both FMA units and therefore I think that the theoretical limit for a single zen4 core should be half of the 134 GFLOPS number?

Const-me · 2024-11-13T13:16:12 1731503772

One FMA counts as two floating-point operations: one multiplication and one addition.

According to uops.info, Zen 4 cores can do two 8-wide FMA instructions per cycle, or one 16-wide FMA per cycle. See VFMADD132PS (YMM, YMM, YMM) and VFMADD132PS (ZMM, ZMM, ZMM) respectively, the throughput column is labelled TP. That’s where 32 FLOP/cycle number comes from.

> doesn't have "native" support for AVX-512 but "mimics" it through 2x 256-bit FMA units

That’s correct, AVX512 doesn’t deliver more FLOPs on that CPU. The throughput of 32-byte FMA and 64-byte FMA is the same, 32 FLOP/cycle for FP32 numbers.

menaerus · 2024-11-13T19:27:08 1731526028

> One FMA counts as two floating-point operations: one multiplication and one addition.

Right. This is where the discrepancy comes from. I counted FMA as a single FLOP.

Const-me · 2024-11-14T18:15:13 1731608113

BTW, it’s the same for GPUs. In DXBC shader byte code, mad instruction does FMA. When reporting theoretical FLOPs, GPU vendors count that as 2 float operations.

For example, I have GeForce 4070 Ti Super in my desktop. The chip has 8448 execution units; nVidia calls them CUDA cores but I don’t like the name, the correct number is 66 cores where each core can do 4 wavefronts of 32 threads each. Anyway, these EUs can do one FP32 FMA each cycle, and the boost clock frequency is 2.61 GHz. Multiplying these two numbers results in 22.04928E+12 cycles*EU/second, and nVidia reports 44E+12 FLOPs peak FP32 performance of the GPU.

saagarjha · 2024-11-13T05:33:37 1731476017

I am told the numbers above require the core to have a matrix multiply unit (such as SME)