Show HN: FP32 matmul of large matrices up to 24% faster than cuBLAS on a 4090 (github.com/arekpaterek)
4 points by ap4 4 months ago | 5 comments
I decided to share a CUDA kernel I wrote over 5 months ago. Nvidia's hardware and software may surprise you.



Wow, this is a surprising result. Does this reproduce on other GPUs, or just the 4090?


Also on other GPUs.

So far I have tested on the following (a rough timing sketch follows the table):

  size    tflops_cublas  tflops_my  diff      gpu
  4096²   28.7-28.8      32.5       +13%      4070ts
  8192²   27.7-28.2      33.5       +19-21%   4070ts
  4096²   9.9-10.0       10.1-10.2  +1-2%     1080ti
  4096²   3.8-4.3        6.7        +56-76%   T4
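
A minimal sketch of how such a timing can be taken, assuming an N×N SGEMM with device buffers dA, dB, dC already allocated and filled; CUDA events are used here, though this is not necessarily the exact harness behind the numbers above:

  // Hedged sketch: time one cuBLAS SGEMM with CUDA events and convert to TFLOPS.
  // Assumes dA, dB, dC are device buffers of N*N floats, already initialized.
  #include <cublas_v2.h>
  #include <cuda_runtime.h>

  double time_sgemm_tflops(cublasHandle_t handle, int N,
                           const float* dA, const float* dB, float* dC) {
      const float alpha = 1.0f, beta = 0.0f;

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      // Warm-up call so the measurement excludes one-time setup costs.
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                  &alpha, dA, N, dB, N, &beta, dC, N);

      cudaEventRecord(start);
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                  &alpha, dA, N, dB, N, &beta, dC, N);
      cudaEventRecord(stop);
      cudaEventSynchronize(stop);

      float ms = 0.0f;
      cudaEventElapsedTime(&ms, start, stop);
      cudaEventDestroy(start);
      cudaEventDestroy(stop);

      // 2*N^3 floating-point operations (each FMA counted as two) per square matmul.
      double flops = 2.0 * (double)N * (double)N * (double)N;
      return flops / (ms * 1e-3) / 1e12;  // TFLOPS
  }

A custom kernel would be timed the same way, with the cublasSgemm call swapped for the kernel launch.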


I did more tests on various GPUs. The TFLOPS values for the A100 exceed the maximum in the GPU's spec sheet, so perhaps the spec calculates TFLOPS in a different way (a note on the usual formula follows the table).

  size    tflops_cublas  tflops_my  diff      gpu
  12288²  51.4           56.3       +9%       h100
  8192²   50.5           56.1       +11%      h100
  4096²   43.8           53.9       +23%      h100
  12288²  18.9           27.0       +43%      a100
  8192²   19.0           26.3       +38%      a100
  4096²   17.5           19.8       +13%      a100
  16384²  28.8           34.9       +21%      3090ti
  12288²  28.8           34.5       +20%      3090ti
  8192²   29.3           33.3       +14%      3090ti
  4096²   27.9           26.7       -4%       3090ti
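
For context, the usual convention for matmul TFLOPS, and the one assumed here, is 2 * N^3 floating-point operations per N×N multiply (each FMA counted as two):

  TFLOPS = 2 * N^3 / (seconds * 1e12)

  e.g. the 8192² a100 row: 2 * 8192^3 ≈ 1.1e12 FLOPs,
  so 26.3 TFLOPS corresponds to roughly 42 ms per matmul.

If the spec sheet counts operations differently, measured TFLOPS and the quoted peak won't line up exactly.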


The GitHub repo doesn't seem to be accessible anymore; it's giving a 404 error.


I have deleted it. The function verify_matrix() from the original SGEMM_CUDA repository did not check for NaNs, and the kernel was returning NaNs.
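
A NaN-aware check is a small change. A minimal sketch (not the original SGEMM_CUDA code; the function name and tolerance are illustrative), assuming a host-side comparison of the kernel output against a reference result:

  // Hedged sketch: host-side verification that treats NaNs as failures.
  // A plain fabs(ref - out) > eps test passes silently on NaNs, because
  // every comparison involving NaN evaluates to false.
  #include <cmath>
  #include <cstdio>

  bool verify_matrix_nan_aware(const float* ref, const float* out,
                               long long n, float eps = 1e-2f) {
      for (long long i = 0; i < n; i++) {
          if (std::isnan(out[i]) || std::fabs(ref[i] - out[i]) > eps) {
              std::printf("mismatch at %lld: ref=%f out=%f\n", i, ref[i], out[i]);
              return false;
          }
      }
      return true;
  }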

I have no way to delete this submission.



