
Nvidia Ampere GA102 GPU Architecture [pdf] - pjmlp
https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf
======
dragontamer
I posted this a day or two ago:

The A100 whitepaper "spoiled" a lot of these factoids already
([https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent)...). The
new bit seems to be the doubling of FP32 "CUDA cores" (I really hate that
term: when Intel or AMD double their CPU pipelines, it doesn't mean that
they're selling more cores, it means their cores got wider... anyway). A100
didn't have this feature; I assume A100 was 16 floating-point + 16 integer
"CUDA cores" per CU, like Turing. Correct me if I'm wrong.

You don't need to read the whitepaper to understand that NVidia has really
improved performance/cost here. The 3rd party benchmarks are out and the
improved performance is well documented at this point.

The FP32 doubling is one of the most important bits here. But fortunately for
programmers, it doesn't really change how you write your code: the compiler /
PTX assembler will schedule your code at compile time to take best advantage
of it.

The other bit, the larger combined L1 / shared memory of 128 KB per CU, does
affect programmers. GPU programmers have tight control over shared memory,
which is very useful for optimization.
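
A minimal sketch of that control (hypothetical kernel, assuming 256-thread
blocks and n a multiple of 256; the point is that the tile lives in the fast
per-CU scratchpad rather than being re-fetched from DRAM):

    // 3-tap moving average: each input element is read by up to 3 threads,
    // so stage it once in shared memory and serve the reuse from there.
    __global__ void blur1d(const float *in, float *out, int n) {
        __shared__ float tile[256 + 2];                  // block + 1-element halo
        int i = blockIdx.x * 256 + threadIdx.x;
        int t = threadIdx.x + 1;
        tile[t] = in[i];                                 // one global load per thread
        if (threadIdx.x == 0)   tile[0]   = (i == 0)     ? 0.0f : in[i - 1];
        if (threadIdx.x == 255) tile[257] = (i == n - 1) ? 0.0f : in[i + 1];
        __syncthreads();                                 // tile fully populated
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
    }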

----------

GDDR6X's improved memory bandwidth is also big. "Feeding the beast" with
faster RAM is always a laudable goal, and sending 2 bits per pin per clock
cycle via PAM4 signaling is a nifty trick.
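
Back-of-envelope, assuming the 3080's advertised 19 Gbps per pin and a
320-bit bus:

    19 Gb/s per pin * 320 pins / 8 bits-per-byte = 760 GB/s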

Sparse Tensor Cores were already implemented in A100 and don't seem to be
new. If you haven't heard of the tech before, it's cool: basically hardware-
accelerated sparse-matrix computation. A 4x4 FP16 matrix uses 32 bytes under
normal conditions, but can be "compressed" into 16 bytes if at least two of
every four values are zero (2:4 structured sparsity). NVidia Ampere supports
hardware-accelerated matrix multiplication of these 16-byte "virtual" 4x4
FP16 matrices.
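
A sketch of the compression as I understand it (plain C; a hypothetical
function, not NVIDIA's actual bit layout): every group of four values keeps
exactly two, plus 2-bit position indices, so 16 FP16 values (32 bytes) shrink
to 8 values (16 bytes) and a small metadata sidecar.

    #include <stdint.h>
    #include <stddef.h>

    // Compress a 2:4 structured-sparse array of FP16 bit patterns.
    // vals gets n/2 kept values; meta gets one byte per group of 4,
    // holding two 2-bit positions. Returns -1 if a group has >2 nonzeros.
    int compress_2_4(const uint16_t *dense, size_t n, /* n % 4 == 0 */
                     uint16_t *vals, uint8_t *meta) {
        for (size_t g = 0; g < n / 4; g++) {
            int kept = 0;
            uint8_t m = 0;
            for (int i = 0; i < 4; i++) {
                if (dense[4 * g + i] == 0) continue;
                if (kept == 2) return -1;           // 3rd nonzero: not 2:4 sparse
                vals[2 * g + kept] = dense[4 * g + i];
                m |= (uint8_t)(i << (2 * kept));
                kept++;
            }
            for (int i = 0; kept < 2; i++)          // pad sparse groups with zeros
                if (dense[4 * g + i] == 0) {
                    vals[2 * g + kept] = 0;
                    m |= (uint8_t)(i << (2 * kept));
                    kept++;
                }
            meta[g] = m;
        }
        return 0;
    }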

I swear that RTX I/O existed before in some other form; this isn't the first
time I've heard about offloading PCIe transfers to the GPU. It's niche and I
don't expect video games to use it (are M.2 SSDs popular enough to be assumed
on the PC / laptop market yet?). But CUDA coders can probably control their
hardware more carefully and benefit from such a feature.

~~~
jagger27
> The FP32 doubling, is one of the most important bits here. But fortunately
> for programmers, this doesn't really change how you do your code.

Early benchmarks are showing games under-performing quite a bit in the worst
cases. The crux of the issue is that it's not /exactly/ a no-compromise
doubling of FP32. Each SM partition can issue either 2xFP32 or 1xFP32 +
1xINT32 per clock cycle: one of its two datapaths is FP32-only, the other
handles either FP32 or INT32. So if your game or application has any
significant INT32 work scheduled, all of a sudden you're back to the number
of FP32 cores you had last generation, though you do get the benefit of
parallel INT32 execution.
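
Back-of-envelope (my model, ignoring stalls): if a fraction f of the issued
instructions is INT32 and the INT32-capable path absorbs it first, then per
partition per clock:

    f = 0.0  ->  2.0x FP32                   (the full doubling)
    f = 0.2  ->  1.6x FP32 + 0.4x INT32
    f = 0.5  ->  1.0x FP32 + 1.0x INT32      (last generation's FP32 rate)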

It's a pretty cool architecture overall though.

~~~
PaulKeeble
It's not uncommon for GPU workloads in games to max out at about 20% INT32
calculations, but alas that's enough to drop the FP32 performance quite a
bit. I suspect that next time Nvidia will separate out the INT32 and 2x FP32
units and gradually move towards a ratio of hardware that better suits the
usual workload split.

~~~
_kbh_
Given the low share of INT32 in game loads, as you stated, I don't think that
separating the INT32 and FP32 hardware makes a lot of sense: you can share a
substantial amount of the hardware between the two, which overall leads to
space savings.

~~~
peterhj
On the contrary, "dark silicon" suggests that separating the fp32 and int32
(now, in GA102/104, fp32 and int32/fp32) data paths at the cost of more die
space currently makes excellent sense. (See also: tensor cores, ray tracing
cores.) Jensen Huang very briefly alluded to this during the GA102/104
announcement, when he mentioned the end of Dennard scaling.

~~~
_kbh_
But the GA102/GA104 doesn't have separate execution units for INT32 and FP32,
because the INT32 path also does FP32. So I don't see how that shows that
separating FP32 and INT32 hardware makes sense.

------
amelius
These GPU architectures are too complicated for my taste. It reminds me of the
days of segmented memory, and makes me feel like having to jump through hoops
all the time. I'm curious if people have tried to invent something more
elegant, and where these approaches have failed.

Also, the fact that NVidia has probably patented this particular architecture
makes it less interesting for me to really dig into.

~~~
modeless
It's a good question actually. Intel tried to make a GPU called Larrabee that
was mostly a bunch of small x86 cores with giant vector units. Turns out that
it couldn't compete in rendering performance on existing games (in 2010)
without the fixed function units that GPUs have, so they canceled it as a GPU.
It did result in the AVX-512 instruction set though.

I think the idea still has promise but there's a chicken and egg issue where
you'd really need to rearchitect game engines and content pipelines to take
full advantage of the flexibility before you'd see a benefit. It's possible
that it would work better today, and it's also possible that Intel just gave
up too early. In some cases we're already seeing people bypassing the fixed
function rasterizer in GPUs and doing rasterization manually in compute
shaders [1] [2].

[1] Doom Eternal:
[http://advances.realtimerendering.com/s2020/RenderingDoomEte...](http://advances.realtimerendering.com/s2020/RenderingDoomEternal.pdf)

[2] Epic Nanite:
[https://twitter.com/briankaris/status/1261098487279579136](https://twitter.com/briankaris/status/1261098487279579136)
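
For flavor, the core of such a compute rasterizer is just edge functions. A
heavily simplified sketch in CUDA (hypothetical kernel; [1] and [2] add
binning, hierarchical culling, and packed depth+payload atomics on top):

    // One thread per pixel, one hard-coded CCW triangle, barycentric shading.
    __device__ float edge(float2 a, float2 b, float px, float py) {
        // Signed area: >= 0 means the sample is on the inside of edge a->b.
        return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
    }

    __global__ void rasterTri(uchar4 *fb, int w, int h,
                              float2 v0, float2 v1, float2 v2) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;
        float px = x + 0.5f, py = y + 0.5f;          // sample at pixel center
        float w0 = edge(v1, v2, px, py);
        float w1 = edge(v2, v0, px, py);
        float w2 = edge(v0, v1, px, py);
        if (w0 >= 0.f && w1 >= 0.f && w2 >= 0.f) {   // inside all three edges
            float area = w0 + w1 + w2;               // normalizes barycentrics
            fb[y * w + x] = make_uchar4((unsigned char)(255.f * w0 / area),
                                        (unsigned char)(255.f * w1 / area),
                                        (unsigned char)(255.f * w2 / area), 255);
        }
    }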

~~~
dragontamer
Fujitsu's A64FX ARM chip is proof that 512-bit SIMD can work on a CPU-based
platform.

~~~
formerly_proven
For graphics use, GPUs perform a very significant amount of work in fixed-
function hardware, rasterization and texture interpolation being the two most
computationally intensive stages (probably followed by the ROPs, which blend
pixel shader output into the framebuffer). You can easily calculate that the
ALU bandwidth of the TMUs is about the same order of magnitude as all the
shader cores combined. That gives GPUs a huge efficiency lead over anything
done with programmable hardware only.
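
Back-of-envelope for the TMU claim (all numbers are my assumptions: ~4 TMUs
per SM on a 68-SM 3080 at ~1.7 GHz, one bilinear sample = 3 lerps x 4
channels x 2 flops = 24 flops):

    272 TMUs * 24 flops * 1.7e9 Hz ~= 11 TFLOP/s of filtering math
    vs. ~30 TFLOP/s peak FP32 from the shader cores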

~~~
dr_zoidberg
Michael Abrash had a great series of articles in Dr. Dobb's detailing how he
came to work for Intel (which spun up Larrabee) after talking with some of
their people at a game conference to ask them for a lerp (linear
interpolation) instruction in the x86 extensions[0] :)

Oh, and Larrabee gave us more than AVX-512: it also gave us the Xeon Phis,
which were accelerators (much akin to the GPGPU of NVidia GPUs?) aimed at
scientific code, under the promise that "since it's x86, you don't need to
change your code that much!". However:

> An empirical performance and programmability study has been performed by
> researchers, in which the authors claim that achieving high performance with
> Xeon Phi still needs help from programmers and that merely relying on
> compilers with traditional programming models is still far from reality.
> However, research in various domains, such as life sciences, and deep
> learning demonstrated that exploiting both the thread- and SIMD-parallelism
> of Xeon Phi achieves significant speed-ups.

(from Wikipedia[1])

So pretty much the same as a GPU. It's a bit unfortunate, because in theory
good OpenCL support could have made the same code run on 2/4/8-core CPUs
(with or without SMT) or on the thread-beasts that are/were the Phis. But
that would've probably required OpenCL to be a bit more mature, and Intel
skipped that train too.

[0] [https://www.drdobbs.com/parallel/a-first-look-at-the-larrabe...](https://www.drdobbs.com/parallel/a-first-look-at-the-larrabee-new-instruc/216402188)

[1] [https://en.wikipedia.org/wiki/Xeon_Phi](https://en.wikipedia.org/wiki/Xeon_Phi)

~~~
dragontamer
OpenCL would have been a bad fit for Xeon Phi.

OpenCL is very specifically tailored for GPUs (though FPGAs may benefit). The
concept of "constant memory", "shared memory", and "global memory" is very
GPU-centric, and doesn't benefit Xeon Phi at all.
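
For context, the CUDA analogs of those OpenCL spaces (a hypothetical kernel;
OpenCL "constant"/"local"/"global" map to __constant__/__shared__/plain
device pointers). On a GPU these are distinct physical memories; on a Xeon
Phi they would all just be cached DRAM, so the model buys you nothing there:

    __constant__ float scale;                        // OpenCL "constant" memory

    __global__ void reverseBlocks(const float *in, float *out) {
        __shared__ float tile[256];                  // OpenCL "local" memory
        int i = blockIdx.x * 256 + threadIdx.x;      // in/out: "global" memory
        tile[threadIdx.x] = in[i] * scale;           // stage through shared
        __syncthreads();
        out[i] = tile[255 - threadIdx.x];            // swap within the block
    }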

I'd assume that any OpenCL program would simply function better on a GPU, even
compared to a 60-core in-order 512-bit SIMD-based processor like Xeon Phi.

---------------

Xeon Phi's main advantage really was running "like any other x86 processor",
with 60 cores / 240 threads. But you still needed to AVX512 up your code to
really benefit.
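
"AVX512-ing up your code" in practice meant intrinsics (or trusting the
auto-vectorizer). A minimal hand-vectorized sketch, assuming AVX-512F:

    #include <immintrin.h>

    // y = a*x + y, 16 floats per 512-bit vector, scalar tail for the rest.
    void saxpy(float a, const float *x, float *y, int n) {
        __m512 va = _mm512_set1_ps(a);
        int i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512 vx = _mm512_loadu_ps(x + i);
            __m512 vy = _mm512_loadu_ps(y + i);
            _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
        }
        for (; i < n; i++) y[i] += a * x[i];
    }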

Honestly, I think Xeon Phi just needed a few more revisions to figure itself
out. It was on the market for less than 5 years. But I guess it wasn't
growing as fast as NVidia or CUDA.

~~~
dr_zoidberg
Maybe I was mixing up names in my head, but I remember from 5-10 years back
an Open[Something] (I thought it was OpenCL) that in theory could
transparently handle multithreaded code across single/dual/quad-core CPUs[0]
or GPGPU (either NVidia or AMD).

This is what I had in mind when I wrote "if Intel had given it good OpenCL
support". Again, maybe I'm mixing things up in my head, since my career never
took me down the lane of writing massively parallel code (though I am a user
of it, indirectly, through deep learning frameworks).

[0] back then this was as big as CPUs got

~~~
dragontamer
There's a version of OpenCL that compiles to Intel CPUs, but I'm not very
familiar with it.

I remember reading things like:
[https://software.intel.com/content/www/us/en/develop/documen...](https://software.intel.com/content/www/us/en/develop/documentation/iocl-tec-opg/top/coding-for-the-intel-architecture-processors/using-vector-data-types.html)

There, you'd have to use float8 types to be assured of SIMD benefits in CPU
code. As such, it's probably more useful to rely upon auto-vectorizers in C++
code (such as #pragma omp simd), and maybe intrinsics for the complicated
cases.
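
A minimal sketch of the #pragma omp simd route (plain C++, compiled with
-fopenmp-simd or equivalent); the pragma asserts the loop is safe to
vectorize and lets the compiler pick SSE/AVX/AVX-512 as available:

    void scale(float *out, const float *in, float k, int n) {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            out[i] = k * in[i];      // vectorized without hand-written intrinsics
    }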

Intel does seem to have some level of OpenCL -> AVX tech:
[http://llvm.org/devmtg/2011-11/Rotem_IntelOpenCLSDKVectorize...](http://llvm.org/devmtg/2011-11/Rotem_IntelOpenCLSDKVectorizer.pdf)

------
lajawfe
Can the FP32 and Tensor core modules compute at the same time, or are they an
abstraction over the same underlying silicon, such that we can only utilize
one at a time?

~~~
_kbh_
Pages 16 and 17 in the PDF have graphs that show them being scheduled
concurrently.
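
At source level, concurrency comes from different warps of one block feeding
different units. A hedged sketch using CUDA's WMMA API (hypothetical kernel;
whether the FP32 and tensor-core work actually overlap is up to the warp
scheduler):

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void mixed(const half *A, const half *B, float *C,
                          const float *x, float *y, int n) {
        if (threadIdx.x < 32) {
            // Warp 0: one 16x16x16 FP16 multiply-accumulate on the tensor cores.
            wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
            wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
            wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
            wmma::fill_fragment(c, 0.0f);
            wmma::load_matrix_sync(a, A, 16);
            wmma::load_matrix_sync(b, B, 16);
            wmma::mma_sync(c, a, b, c);
            wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
        } else {
            // Remaining warps: ordinary FP32 FMAs on the CUDA cores.
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) y[i] = 2.0f * x[i] + y[i];
        }
    }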

------
dannyw
Ampere seems like a great upgrade for gaming. How much of the FP32
improvements will transfer to hobbyist ML researchers training models on a
3080?

