Related: I wrote a CUDA kernel that is typically much faster than the cuBLAS kernels when multiplying large square float32 matrices. Tested mostly on a 4090 GPU so far.
Tested on more GPUs. The biggest improvement found so far over the standard matrix multiplication from the cuBLAS library is +43% when multiplying matrices of size 12288² on an A100 GPU.
Your `sum` array is only 64 elements, but you're indexing it out of bounds. That's undefined behavior, and the compiler can prove it at compile time, so it skips a chunk of the work.
E.g. consider the line:
sum[y2*8 + x2] += ...
In the final loop iteration when y2=15 and x2=7, the index is 127.
That's the trick. It is intended. The point is that the early stages of the compiler don't notice the out-of-bounds access, while the later stages do, and they compile the code into a correctly working kernel. The result is correct, as checked by the function verify_matrix().
I want 512 threads per block, with each thread accumulating 128 values simultaneously. That's 65536 values per block. I can't keep each of these values in registers, because the GPU has a limit of 65536 registers per block, and the kernel needs some additional registers besides the accumulators.
But if I find a way to trick the early stages of the compiler into thinking it has a sufficient number of free registers, then sometimes, as with this kernel, the later stages are smart enough to give me what I want: 512 threads per block, each calculating 128 values.
Note that there are 1024 FFMA instructions in the loop, but you would expect 16*8*BK = 2048. This suggests half the operations are skipped, which lines up with the half of the writes that are out of bounds being omitted.
After the compute loop, when you're calculating the final result and storing it, you can see that the FFMAs referencing out-of-bounds indices write QNAN instead of any real results.
Is it possible that the NaNs are what's messing with your tests? NaNs are notoriously hard to handle correctly; you should assert that the result doesn't contain any NaNs at all.
You are right. The function verify_matrix() from the original SGEMM_CUDA repository did not check for NaNs. I deleted the repository. It was the 13th CUDA kernel I wrote in my life, and the whole endeavor taught me a lot. I appreciate the feedback.
Glad it was a learning experience for you and I apologize if I came off argumentative at all! I was mainly so incredulous because this is my day job haha, so I have a bit more experience than most in the area.
It definitely sucks to be led astray and have time wasted by a bug inherited from the original repo though, sorry to hear that :/
The quoted assembly looks L1D$-bandwidth bound. On most common and vaguely recent architectures you have to use register tiling to saturate the FMA units: a system that can do at most one vector load and one vector store per cycle can never fully saturate even a single FMA unit on GEMM, and with 2 FMA units, even 2 vector loads and a vector store per cycle aren't enough without register tiling.
Typically no, although BLAS software engineers occasionally write HPC Strassen implementations and papers about them. In my opinion there are a few reasons why they're not common in BLAS libraries:
- They're a bit less numerically stable, which used to matter more for common BLAS use cases. Less so these days.
- The memory access patterns and the algorithm's parallelism make it much harder to reach as high a fraction of peak performance as standard GEMM. Matrix size restrictions for recursive multiplication algorithms are also an issue.
I honestly didn't realize how performant the decade-old 2013 Haswell architecture is on vector workloads.
250 GFLOP/core is no joke - he also cross-compared to an M1 Pro, which, when not using the secret matrix coprocessor, achieves effectively the same vector throughput, a decade later...
The floating-point FMA throughput per desktop CPU socket per clock cycle has doubled every few years in the sequence: Athlon 64 (2003) => Athlon 64 X2 (2005) => Core 2 (2006) => Nehalem (2008) => Sandy Bridge (2011) => Haswell (2013) => Coffee Lake Refresh (2018) => Ryzen 9 3950X (2019) => Ryzen 9 9950X (2024), going from 1 FP64 FMA/socket/clock cycle to 256 FP64 FMA/socket/clock cycle, with double those numbers for FP32 FMA (1 FMA/s is counted as 2 Flop/s).
I wish memory bandwidth had been doubled as often on desktops. Instead of 256x (even more, given the 2-3x higher core frequency), only a 14x increase: DDR-400 at 6.4 GB/s => DDR5-5600 at 89.6 GB/s. The machine balance keeps getting worse.
While flash memory has become very fast in recent years, I haven't heard of any breakthrough technology prototypes that would bring comparable progress to RAM. Let alone RAM latency, which has remained constant (+/- a few ns) through all these years.
You are right, which is why on modern CPUs the maximum computational throughput can be reached only when a large part of the operands can be reused, so they can be served from the L1 or L2 caches.
Unlike the bandwidth of the main memory or of the shared L3 cache, the bandwidth of the non-shared L1 and L2 caches has been increased in exactly the same ratio as the FMA throughput. Almost all CPUs have always been able to do the same number of FMA operations per clock cycle as loads from the L1 cache per clock cycle (simultaneously with, typically, half that number of stores to the L1 cache per clock cycle).
Had this not been true, the computational execution units of the cores would have become useless.
Fortunately, solving systems of linear equations and multiplying matrices are very frequent operations, and they reuse most of their operands, so they can reach the maximum computational throughput.
True, but it has the same floating-point throughput per socket and per clock cycle as Haswell.
Until Haswell, Intel had for many years targeted doubling the FP throughput per desktop socket every 2 years, to avoid the risk of AMD catching up with them again.
After launching Haswell, Intel relaxed and did not increase the throughput again for about half a decade, until the threat of AMD Zen forced them to start again. For lack of better means, they then increased the number of cores per socket every year, from Kaby Lake (4 cores in 2016) to Comet Lake (10 cores in 2019).
Source code: https://github.com/arekpaterek/Faster_SGEMM_CUDA