
Nvidia Ampere GA102 GPU Architecture [pdf]
https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf
======
dragontamer
The A100 whitepaper "spoiled" a lot of these factoids already.
(https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf)

The new bit seems to be the doubling of FP32 "CUDA cores" (I really hate that
term: when Intel or AMD double their CPU pipelines it doesn't mean that
they're selling more cores, it means their cores got wider... anyway). A100
didn't have this feature (I assume A100 was 16 FP32 + 16 INT32 "CUDA cores"
per SM partition, like Turing. Correct me if I'm wrong)

You don't need to read the whitepaper to understand that NVidia has really
improved performance/cost here. The 3rd party benchmarks are out and the
improved performance is well documented at this point.

The FP32 doubling is one of the most important bits here. But fortunately for
programmers, it doesn't really change how you write your code. The compiler /
PTX assembler will schedule your code at compile time to best take advantage
of the extra pipelines.
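
To make that concrete, here's a toy kernel of my own (nothing from the
whitepaper) with two independent FMA chains per thread. The source is
ordinary CUDA; ptxas just has twice the FP32 lanes to co-issue the
independent ops onto:

    #include <cstdio>

    // Toy kernel: two independent FP32 FMA chains per thread. Nothing
    // Ampere-specific in the source; the scheduler finds the parallelism.
    __global__ void fma_chains(const float *x, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float a = x[i], b = x[i] * 0.5f;
        for (int k = 0; k < 64; k++) {
            a = fmaf(a, 1.0001f, 0.25f);  // chain 1
            b = fmaf(b, 0.9999f, 0.50f);  // chain 2, independent of chain 1
        }
        out[i] = a + b;
    }

    int main() {
        const int n = 1 << 20;
        float *x, *out;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; i++) x[i] = 1.0f;
        fma_chains<<<(n + 255) / 256, 256>>>(x, out, n);
        cudaDeviceSynchronize();
        printf("out[0] = %f\n", out[0]);
        cudaFree(x); cudaFree(out);
        return 0;
    }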

The other bit, the larger combined L1 / shared memory of 128KB per SM, does
affect programmers. GPU programmers have tight control over shared memory,
which is very useful for optimization purposes.
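
A minimal sketch of why that control matters (a toy 1-D blur; the names and
tile size are mine, not from the whitepaper): stage a tile of global memory
into __shared__ once, then every thread in the block reuses its neighbors'
loads instead of re-reading DRAM:

    #include <cstdio>

    #define TILE 256

    __global__ void blur3(const float *in, float *out, int n) {
        __shared__ float tile[TILE + 2];   // +2 halo elements
        int g = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;
        tile[t] = (g < n) ? in[g] : 0.f;
        if (threadIdx.x == 0)
            tile[0] = (g > 0) ? in[g - 1] : 0.f;
        if (threadIdx.x == blockDim.x - 1)
            tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.f;
        __syncthreads();                   // tile now shared by the block
        if (g < n)
            out[g] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; i++) in[i] = (float)i;
        blur3<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("out[1] = %f\n", out[1]);   // expect 1.0
        cudaFree(in); cudaFree(out);
        return 0;
    }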

----------

GDDR6X's improved memory bandwidth is also big. "Feeding the beast" with
faster RAM is always a laudable goal, and sending 2 bits per pin per symbol
through PAM4's four voltage levels is a nifty trick.
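
Back-of-the-envelope, assuming I have the published specs right:

    19.5 Gbps/pin x 384-bit bus / 8 = 936 GB/s  (RTX 3090)
    19.0 Gbps/pin x 320-bit bus / 8 = 760 GB/s  (RTX 3080)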

Sparse Tensor Cores were already implemented in A100, and don't seem to be
new. If you haven't heard of the tech before, it's cool: basically hardware-
accelerated sparse-matrix computations. The constraint is 2:4 structured
sparsity: every group of 4 consecutive values must hold at least 2 zeros. A
4x4xFP16 matrix uses 32 bytes under normal conditions, but can be
"compressed" into 16 bytes of nonzero values (plus a few index bits) when it
meets that constraint. NVidia Ampere supports hardware-accelerated matrix
multiplications of these compressed 4x4xFP16 matrices.
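
Here's my reading of the compression scheme as a host-side sketch (the
struct layout and names are mine, for illustration; NVidia's actual
in-memory format may differ). uint16_t stands in for the raw FP16 bits:

    #include <cstdint>
    #include <cstdio>

    // 2:4 structured sparsity: each group of 4 values keeps at most 2
    // nonzeros, stored with a 2-bit position index per survivor.
    struct CompressedGroup {
        uint16_t vals[2];  // the (up to) 2 nonzero values
        uint8_t  idx;      // two 2-bit positions packed into one byte
    };

    bool compress_2_4(const uint16_t group[4], CompressedGroup *out) {
        int kept = 0;
        out->idx = 0;
        for (int i = 0; i < 4; i++) {
            if (group[i] == 0) continue;   // FP16 +0.0 is all-zero bits
            if (kept == 2) return false;   // >2 nonzeros: not 2:4 sparse
            out->vals[kept] = group[i];
            out->idx |= (uint8_t)(i << (2 * kept));
            kept++;
        }
        while (kept < 2) out->vals[kept++] = 0;  // pad if >2 zeros
        return true;
    }

    int main() {
        // A 4x4 FP16 tile is 16 values = 32 bytes; 2:4-sparse it shrinks
        // to 8 values (16 bytes) plus 4 x 4 bits of indices.
        uint16_t row[4] = {0, 0x3C00 /* 1.0 */, 0, 0x4000 /* 2.0 */};
        CompressedGroup cg;
        if (compress_2_4(row, &cg))
            printf("kept 0x%04x at %d, 0x%04x at %d\n",
                   cg.vals[0], cg.idx & 3, cg.vals[1], (cg.idx >> 2) & 3);
        return 0;
    }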

I swear that RTX I/O existed before in some other form. This isn't the first
time I've heard about offloading storage I/O onto the GPU's side of the PCIe
bus: GPUDirect Storage on the datacenter side looks like the same idea. It's
niche and I don't expect video games to use it (are M.2 NVMe SSDs popular
enough to be assumed on the PC / Laptop market yet?). But CUDA coders can
probably control their hardware more carefully and benefit from such a
feature.
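
On the CUDA side, the existing analogue is GPUDirect Storage's cuFile API,
which DMAs file contents straight into GPU memory without bouncing through a
CPU buffer. A rough sketch from the GDS docs (the API is still in beta, so
treat the details as approximate; "weights.bin" is a made-up example file):

    #define _GNU_SOURCE              // for O_DIRECT
    #include <fcntl.h>
    #include <unistd.h>
    #include <cuda_runtime.h>
    #include "cufile.h"

    int main() {
        cuFileDriverOpen();

        int fd = open("weights.bin", O_RDONLY | O_DIRECT);

        CUfileDescr_t descr = {};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        CUfileHandle_t handle;
        cuFileHandleRegister(&handle, &descr);

        const size_t size = 1 << 20;
        void *devPtr;
        cudaMalloc(&devPtr, size);
        cuFileBufRegister(devPtr, size, 0);   // pin buffer for DMA

        // SSD -> GPU memory directly, no intermediate host copy
        ssize_t n = cuFileRead(handle, devPtr, size,
                               /*file_offset=*/0, /*devPtr_offset=*/0);

        cuFileBufDeregister(devPtr);
        cudaFree(devPtr);
        cuFileHandleDeregister(handle);
        close(fd);
        cuFileDriverClose();
        return n < 0;
    }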

