So far I have tested on:
size    tflops_cublas  tflops_my  diff      gpu
4096²   28.7-28.8      32.5       +13%      4070ts
8192²   27.7-28.2      33.5       +19-21%   4070ts
4096²   9.9-10.0       10.1-10.2  +1-2%     1080ti
4096²   3.8-4.3        6.7        +56-76%   T4

size    tflops_cublas  tflops_my  diff      gpu
12288²  51.4           56.3       +9%       h100
8192²   50.5           56.1       +11%      h100
4096²   43.8           53.9       +23%      h100
12288²  18.9           27.0       +43%      a100
8192²   19.0           26.3       +38%      a100
4096²   17.5           19.8       +13%      a100
16384²  28.8           34.9       +21%      3090ti
12288²  28.8           34.5       +20%      3090ti
8192²   29.3           33.3       +14%      3090ti
4096²   27.9           26.7       -4%       3090ti
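
For anyone checking the numbers: the diff column looks like the relative throughput gain over cuBLAS, i.e. tflops_my / tflops_cublas - 1. Below is a minimal sketch of that arithmetic. The function name `speedup_percent` is made up for illustration, and the formula is an assumption inferred from the rows, not something stated in the post. Some rows may differ by a point depending on how the underlying unrounded values were rounded.

```python
def speedup_percent(tflops_cublas: float, tflops_my: float) -> float:
    """Relative throughput gain over cuBLAS, in percent.

    Assumed formula (inferred from the table, not confirmed by the author):
    (tflops_my / tflops_cublas - 1) * 100
    """
    return (tflops_my / tflops_cublas - 1.0) * 100.0


# 4096² on h100: 43.8 vs 53.9 TFLOPS
print(round(speedup_percent(43.8, 53.9)))  # 23

# 4096² on 3090ti: 27.9 vs 26.7 TFLOPS (the one regression)
print(round(speedup_percent(27.9, 26.7)))  # -4
```
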
I have no way to delete this submission.