Fast Multidimensional Matrix Multiplication on CPU from Scratch (2022) (siboehm.com)
74 points by georgehill on July 31, 2024 | hide | past | favorite | 23 comments


Related: I created a CUDA kernel that is typically much faster than the cuBLAS kernels when multiplying large square float32 matrices. Tested mostly on a 4090 GPU so far.

Source code: https://github.com/arekpaterek/Faster_SGEMM_CUDA

  size    tflops_cublas  tflops_my  diff      gpu
  4096²   50.8-50.9      61.8       +21%      4090
  6144²   55.3           59.8       +8%       4090
  8192²   56.3-56.5      67.1       +19%      4090
  12288²  53.7           66.7       +24%      4090
  16384²  53.6           66.7       +24%      4090
  4096²   28.7-28.8      32.5       +13%      4070ts
  4096²   3.8-4.3        6.7        +56-76%   T4


Tested on more GPUs. The biggest improvement found so far over the standard matrix multiplication from the cuBLAS library is +43% when multiplying matrices of size 12288² on an A100 GPU.

  size    tflops_cublas  tflops_my  diff      gpu
  12288²  51.4           56.3       +9%       h100
  8192²   50.5           56.1       +11%      h100
  4096²   43.8           53.9       +23%      h100
  12288²  18.9           27.0       +43%      a100
  8192²   19.0           26.3       +38%      a100
  4096²   17.5           19.8       +13%      a100
  12288²  28.8           34.5       +20%      3090ti
Edit: The values for the A100 are fishy. They should not exceed 19.5 TFLOPS, which is the maximum from the spec. I have to take a closer look at that.


Your `sum` array has only 64 elements, but you're indexing it out of bounds, which is UB; the compiler knows this at compile time, so it's skipping a bunch of work.

E.g. consider the line:

  sum[y2*8 + x2] += ...
In the final loop iteration when y2=15 and x2=7, the index is 127.


That's the trick. This is intended. The point is that the compiler does not notice the OOB access in the first stage but notices it in the later stages, and compiles the code to a correctly working kernel. The result is correct, as checked by the function verify_matrix().


Sorry I'm a bit lost here, could you explain the reasoning behind this and why it works?


I want to have 512 threads per block, each thread simultaneously calculating 128 values. That's 65536 values per block. I can't accumulate each of these values in registers, because the GPU has a limit of 65536 registers per block, and some additional registers are needed in the kernel. But if I find a way to trick the first stages of the compiler into thinking it has a sufficient number of free registers, then sometimes, as in the case of this kernel, the later stages of the compiler are smart enough to give me what I want: 512 threads per block, each calculating 128 values.


I hate to say it but that simply doesn't work: you can't write out of bounds to trick the compiler, it'll just ignore your out of bounds work.

You can look at the generated sass on godbolt: https://cuda.godbolt.org/z/19excTxM3

Note that there are 1024 FFMA instructions in the loop but you would expect 16*8*BK = 2048. This would suggest half the operations are skipped, which lines up with the half of writes that are out of bounds being omitted.

After the compute loop when you're calculating the final result and storing it, you can see that the FFMAs referencing out of bounds indices write QNAN instead of any real results.

Is it possible that the NaNs are what's messing with your tests? They're notoriously hard to deal with correctly; you should assert that the result doesn't contain any NaNs whatsoever.


You are right. The function verify_matrix() from the original SGEMM_CUDA repository did not check for NaNs. I deleted the repository. It was the 13th CUDA kernel I have written in my life, and the whole endeavor taught me a lot. I appreciate the feedback.


Glad it was a learning experience for you and I apologize if I came off argumentative at all! I was mainly so incredulous because this is my day job haha, so I have a bit more experience than most in the area.

It definitely sucks to be led astray and have time wasted by a bug inherited from the original repo though, sorry to hear that :/


For those interested in going deeper I think the classic reference in this area is the GotoBLAS paper: https://www.cs.utexas.edu/~pingali/CS378/2008sp/papers/gotoP...


https://github.com/flame/blis#citations is the current set of references for various aspects of the approach.

The article, or something similar, was posted recently with discussion.


The quoted assembly looks L1D$ bandwidth bound; on most common and vaguely recent architectures you have to use register tiling to saturate the FMA units. A system that can't do more than one vector load and one vector store per cycle can never fully saturate even a single FMA unit on GEMM; with 2 FMA units, even 2 vector loads and a vector store per cycle won't be enough without register tiling.


Quasi-related: Do BLAS libraries ever actually implement Strassen's Algorithm?


Typically no, although BLAS software engineers occasionally write HPC Strassen implementations and papers about them. In my opinion there are a few reasons why they're not common in BLAS libraries:

- They're a bit less numerically stable, which used to matter more for common BLAS use cases. Less so these days.

- The memory access patterns and algorithm parallelism make it much harder to reach as high a fraction of peak performance as standard GEMM. Matrix size restrictions for recursive multiplication algorithms are also an issue.


One such paper is https://jianyuhuang.com/papers/sc16.pdf; I don't know if there's anything more recent/relevant.


I honestly didn't realize how performant the decade-old 2013 Haswell architecture is on vector workloads.

250 GFLOP/core is no joke - he also cross-compared to an M1 Pro, which, when not using the secret matrix coprocessor, achieves effectively the same vector throughput a decade later...


250 is nonsense. 2x 256-bit FMA per cycle = 32 FP32 flops per cycle; at ~4.5 GHz that's 32*4.5 = ~144 GFLOP/s per core.

Beating cuBLAS is unlikely. You probably made a mistake. Last I tested it, it was even better than MKL in efficiency.


Yes, the figure was 250 GFLOP for 4 cores rather than per core; I misread. Still impressive, but more reasonable.


The floating-point FMA throughput per desktop CPU socket and per clock cycle has doubled every few years, in the sequence: Athlon 64 (2003) => Athlon 64 X2 (2005) => Core 2 (2006) => Nehalem (2008) => Sandy Bridge (2011) => Haswell (2013) => Coffee Lake Refresh (2018) => Ryzen 9 3950X (2019) => Ryzen 9 9950X (2024), going from 1 FP64 FMA/socket/clock cycle to 256 FP64 FMA/socket/clock cycle, with twice those numbers for FP32 FMA (1 FMA/s is counted as 2 Flop/s).


I wish memory bandwidth had also doubled that often on desktops. Instead of 256x (even more, given the 2-3x higher core frequency), we got only a 14x increase: DDR-400 at 6.4 GB/s => DDR5-5600 at 89.6 GB/s. The machine balance keeps falling ever further.

While flash memory has become so fast in recent years, I haven't heard of any breakthrough technology prototypes that would bring significant progress to RAM. Let alone RAM latency, which has remained constant (give or take a few ns) through all the years.


You are right, which is why in modern CPUs the maximum computational throughput can be reached only when a large part of the operands can be reused, so they can be taken from the L1 or from the L2 cache memories.

Unlike the bandwidth of the main memory or of the shared L3 cache, the bandwidth of the non-shared L1 and L2 caches has increased in exactly the same ratio as the FMA throughput. Almost all CPUs have always been able to do exactly as many loads from the L1 cache per clock cycle as FMA operations per clock cycle (along with, typically, half that number of stores to the L1 cache per clock cycle).

Had this not been true, the computational execution units of the cores would have become useless.

Fortunately, the solution of systems of linear equations and the multiplication of matrices are very frequent operations and these reuse most of their operands, so they can reach the maximum computational throughput.


The i7-6700 is Skylake, not Haswell.


True, but it has the same floating-point throughput per socket and per clock cycle as Haswell.

Until Haswell, for many years Intel had the target of doubling the FP throughput per desktop socket every 2 years, in order to avoid the risk of AMD catching up again with them.

After launching Haswell, Intel relaxed and did not increase the throughput again for about half a decade, until they had to start doing so once more under threat from AMD Zen, when, for lack of better means, they increased the number of cores per socket every year, from Kaby Lake (4 cores in 2016) to Comet Lake (10 cores in 2019).



