Nvidia Ampere GA102 GPU Architecture [pdf] (nvidia.com)
90 points by pjmlp 37 days ago | hide | past | favorite | 49 comments

I posted this a day or two ago:

The A100 whitepaper "spoiled" a lot of these factoids already. (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...) The new bit seems to be the doubling of FP32 "CUDA cores" (I really hate that word: when Intel or AMD double their CPU pipelines it doesn't mean that they're selling more cores, it means their cores got wider... anyway). A100 didn't have this feature (I assume A100 was 16 Floating point + 16 Integer "Cuda cores" per CU like Turing. Correct me if I'm wrong)

You don't need to read the whitepaper to understand that NVidia has really improved performance/cost here. The 3rd party benchmarks are out and the improved performance is well documented at this point.

The FP32 doubling is one of the most important bits here. But fortunately for programmers, it doesn't really change how you write your code. The compiler / PTX assembler will schedule your code at compile time to best take advantage of it.

The other bit, the larger L1 / shared memory of 128kB per CU, does affect programmers. GPU programmers have tight control over shared memory, and it is very useful for optimization purposes.


GDDR6X's improved memory bandwidth is also big. "Feeding the beast" with faster RAM is always a laudable goal, and sending 2 bits per pin per clock cycle through PAM4 is a nifty trick.

Sparse Tensor Cores were already implemented in A100, and don't seem to be new. If you haven't heard of the tech before, it's cool: basically hardware-accelerated sparse-matrix computations. A 4x4xFP16 matrix uses 32 bytes under normal conditions, but can be "compressed" into 16 bytes if half or more of its values are 0. NVidia Ampere supports hardware-accelerated matrix multiplications of these 16-byte "virtual" 4x4xFP16 matrices.

I swear that RTX I/O existed before in some other form. This isn't the first time I heard about offloading PCIe to the GPU. It's niche and I don't expect video games to use it (are M.2 SSDs popular enough to be assumed on the PC / Laptop market yet?). But CUDA coders probably can control their hardware more carefully and benefit from such a feature.

GPU accelerated direct storage access was previously part of Tesla/DGX-2 as a feature named "GPUDirect storage" https://developer.nvidia.com/blog/gpudirect-storage/.

It's a feature the new consoles are doing so it'll be widely supported.

RTX I/O is going to be a big feature, and games are likely some of the first consumer-facing software that will use it because it is a standard feature for the next console generation. And AAA devs already support multiple performance profiles, feature support fallbacks, etc. There's no reason they couldn't have the engine take advantage of RTX I/O when it exists, but otherwise fall back on an emulation layer of sorts.

In addition, I suspect the slice of the video game market that has a GPU with RTX I/O capability will also have an NVMe SSD. Now, this is niche, but with that slice of the market also being the top-end performance tier, they're still going to be catered to by AAA devs.

Even without an NVMe drive you're better off with this just by skipping system RAM altogether. But you're not going to be able to use it to stream game content back and forth at the snap of a finger (well, maybe that's a bit hyperbolic) as the console makers have been saying they will.

> The FP32 doubling, is one of the most important bits here. But fortunately for programmers, this doesn't really change how you do your code.

Early benchmarks are showing games under-performing quite a bit in the worst cases. The crux of the issue is that it's not /exactly/ a no-compromise doubling of FP32. Each data path per SM can either do 2xFP32 or 1xINT32/1xFP32 per clock cycle. So if your game or application has any significant INT32 operations scheduled, all of a sudden you're back to the number of FP32 cores you had last generation, though you get the benefit of parallel INT32 execution.

It's a pretty cool architecture overall though.

> though you get the benefit of parallel INT32 execution

Parallel INT32 was added with the last generation, in Turing. See page 13 of https://www.nvidia.com/content/dam/en-zz/Solutions/design-vi...

So Nvidia split INT32 out from FP32 last gen to make them independent, then this gen added FP32 capability back onto the INT32 path while keeping them as two datapaths.

It's not uncommon for GPU workloads in games to run about 20% INT32 calculations, but alas that's enough to drop the FP32 performance quite a bit. I suspect Nvidia next time will probably separate out the INT32 and 2x FP32 units and gradually move towards a ratio of hardware that better suits the usual workload split.

Given the lower amount of INT32 in game loads, as you stated, I don't think that separating INT32 and FP32 hardware makes a lot of sense, because you can share a substantial amount of the hardware between the two, leading overall to space savings.

On the contrary, "dark silicon" instead suggests that separating fp32 and int32 (now in GA102/104, fp32 and int32/fp32) datapaths at the cost of more die space currently makes excellent sense. (See also: tensor cores, ray tracing cores.) Jensen Huang very briefly alluded to this during the GA102/104 announcement when he mentioned the end of Dennard scaling.

But the GA102/GA104 doesn't have separate execution units for INT and FP32, because the INT path also does FP32. So I don't see how that shows that separating FP32 and INT hardware makes sense.

I've been thinking that's why we are seeing the true doubling in fully ray-traced titles like Quake and Minecraft, but not in more traditional rendering engines.

From my understanding, int is often used for lookups, and I'd presume a lot of that is some sort of environment mapping, which adds some contention since int is more limited and "steals" from the doubling of FP.

I think "parallel execution of fp32/int32" is kind of vaguely defined by them... Do they mean fp32/int32 instructions from the same thread (aka warp/wavefront) or from different threads? If it's the latter I'm pretty sure AMD GPUs have been doing it too.

Everyone I know that plays games on a PC uses either SATA M.2 or NVMe (most use the latter).

Some PC builds still include an HDD for installing games alongside a small SSD. IMO that's a waste.

RTX I/O is a parallel to a key feature in both of the new consoles (XBox Series and PS5) so I suspect cross-platform titles will have support for it.

These GPU architectures are too complicated for my taste. It reminds me of the days of segmented memory, and makes me feel like having to jump through hoops all the time. I'm curious if people have tried to invent something more elegant, and where these approaches have failed.

Also, the fact that NVidia has probably patented this particular architecture makes it less interesting for me to really dig into.

It's a good question actually. Intel tried to make a GPU called Larrabee that was mostly a bunch of small x86 cores with giant vector units. Turns out that it couldn't compete in rendering performance on existing games (in 2010) without the fixed function units that GPUs have, so they canceled it as a GPU. It did result in the AVX-512 instruction set though.

I think the idea still has promise but there's a chicken and egg issue where you'd really need to rearchitect game engines and content pipelines to take full advantage of the flexibility before you'd see a benefit. It's possible that it would work better today, and it's also possible that Intel just gave up too early. In some cases we're already seeing people bypassing the fixed function rasterizer in GPUs and doing rasterization manually in compute shaders [1] [2].

[1] Doom Eternal: http://advances.realtimerendering.com/s2020/RenderingDoomEte...

[2] Epic Nanite: https://twitter.com/briankaris/status/1261098487279579136

Although I don't think the programming model with Larrabee would have really been any simpler. You still face many of the same issues that you do with GPUs, although being SIMD instead of SIMT would actually make it slightly harder to work with.

The actual hard part with GPUs is ensuring you can divide up the work and that it doesn't branch within a given chunk size. You have those same issues when trying to leverage a many-core CPU with AVX-512. You still want to keep those AVX-512 units loaded, which means work units of 16 FP32's must all take the same "branch" - not really any different from feeding warps on a GPU. And you've still got to scale across dozens if not hundreds of CPU cores.

AVX-512 has an execution mask and usually should be programmed with an SIMT-like model (e.g. with ispc). Writing SIMT kernels or chunking up the work is not the hard part.

The actual actual hard part with GPUs is writing portable code in the face of a million edge cases due to different proprietary hardware architectures and buggy drivers, which you can't test without actually buying and maintaining whole rooms full of hardware. Reducing fixed function parts of the hardware and using a documented ISA, as Larrabee tried, would help with that.

If I'm reading that right, Doom Eternal only uses compute shader rasterization for writing a lighting acceleration structure where they need to make some fine-grained/coarse-grained decisions depending on depth complexity. The scene is still using rasterization hardware.

Nanite uses compute shader rasterization partly because of the quad overdraw problem since they are targeting near 1 triangle per pixel. But they also say they are using traditional rasterization with recent hardware's addition of mesh shaders when it is faster (which remove a different set of fixed function stuff though, for transform, so still makes the same point).

Fujitsu's A64FX ARM is proof that 512 bit SIMD can work on a CPU based platform.

For graphics use, GPUs perform a very significant amount of work in fixed-function hardware (rasterization and texture interpolation being the two computationally most intensive, probably followed by ROP, which blends pixel shader output into the framebuffer; you can easily calculate that the ALU bandwidth of the TMUs is about the same order of magnitude as all the shader cores), which gives them a huge efficiency lead over anything done purely with programmable hardware.

Michael Abrash had a great series of articles in Dr. Dobb's detailing how he came to work for Intel (which spun up Larrabee) after talking with some of their people at a game conference to ask them for a lerp (linear interpolation) instruction in x86 extensions[0] :)

Oh, and Larrabee gave us more than AVX-512: it also gave us the Xeon Phis, which were accelerators (much akin to the GPGPU of Nvidia GPUs?) aimed at scientific code under the promise that "since it's x86, you don't need to change your code that much!". However:

> An empirical performance and programmability study has been performed by researchers, in which the authors claim that achieving high performance with Xeon Phi still needs help from programmers and that merely relying on compilers with traditional programming models is still far from reality. However, research in various domains, such as life sciences, and deep learning demonstrated that exploiting both the thread- and SIMD-parallelism of Xeon Phi achieves significant speed-ups.

(from Wikipedia[1])

So pretty much the same as a GPU. It is a bit unfortunate: in theory, good OpenCL support could have made the same code run on 2/4/8-core CPUs (with or without SMT) or on the thread-beasts that are/were the Phis. But that would've probably required OpenCL to be a bit more mature, and Intel skipped that train too.

[0] https://www.drdobbs.com/parallel/a-first-look-at-the-larrabe...

[1] https://en.wikipedia.org/wiki/Xeon_Phi

OpenCL would have been a bad fit for Xeon Phi.

OpenCL is very specifically tailored for GPUs (though FPGAs may benefit). The concept of "constant memory", "shared memory", and "global memory" is very GPU-centric, and doesn't benefit Xeon Phi at all.

I'd assume that any OpenCL program would simply function better on a GPU, even compared to a 60-core in-order 512-bit SIMD-based processor like Xeon Phi.


Xeon Phi's main advantage really was running "like any other x86 processor", with 60 cores / 240 threads. But you still needed to AVX512 up your code to really benefit.

Honestly, I think Xeon Phi just needed a few more revisions to figure out itself more. It was on the market for less than 5 years. But I guess it wasn't growing as fast as NVidia or CUDA.

Maybe I was mixing up names in my head, but I remember from 5~10 years back an Open[Something] (thought it was OpenCL) that in theory could transparently handle multithreaded code across single/dual/quad[0] core or GPGPU (either nvidia or AMD).

This is what I had in mind when I wrote "if Intel had given it good OpenCL support". Again, maybe I'm mixing things up in my head, since my career never took me down the lane of writing massively parallel code (though I am a user of it, indirectly, through deep learning frameworks).

[0] back then this was as big a CPU would get

There's a version of OpenCL that compiled to Intel, but I'm not very familiar with it.

I remember reading things like: https://software.intel.com/content/www/us/en/develop/documen...

Where you'd have to use float8 types to be assured of SIMD benefits in CPU code. As such, it's probably more useful to rely upon auto-vectorizers in C++ code (such as #pragma omp simd) and maybe intrinsics for the complicated cases.

Intel does seem to have some level of OpenCL -> AVX tech: http://llvm.org/devmtg/2011-11/Rotem_IntelOpenCLSDKVectorize...

Sure, but the question is can it render graphics competitively with traditional GPUs.

Probably not, since it's optimized for scientific workloads (it was designed specifically for the K computer's replacement, so it doesn't have texture units, ROPs, etc.; you'd have to do too much in software to make it actually render things). However, I think the overall design is really good and has enormous potential, if not for graphics then at the very least for ML.

The vector architectures with extremely high memory bandwidth coming out of Japan recently (NEC SX-Aurora Tsubasa, Fujitsu A64FX) are pretty fascinating.

Modern x86 processors are far from simple though and so it's arguable that it's not significantly less complicated.

Though to be fair, I'm not sure it's really all that complex relative to modern high-end processors. Most of a GPU is just the same unit repeated.

For mind-boggling complexity, my mind goes to the manufacturing process undertaken by the likes of TSMC.

The Larrabee cores were intentionally simpler than even most 2010 CPUs.

Yeah, modern semiconductor fabrication is pretty much the pinnacle of human achievement. My favorite video on the subject: https://www.youtube.com/watch?v=NGFhc8R_uO4

They did actually make it, and it was not supposed to be a GPU.

They pivoted to HPC when the GPU thing didn't work out, and it has since been discontinued.

Sure it was; I was at the GDCE 2009 session on Larrabee.

They showed quake ray tracing demos with it and other people assumed it was supposed to be sold as a GPU instead of listening to what they were actually saying.

No one thought a collection of atom CPUs with AVX512 SIMD was going to be able to compete head to head on rasterization of games with the best Nvidia cards.

Why would they put texture units on it if it wasn't intended to be sold as a GPU? Consumer gaming GPUs were explicitly planned. Initial released versions even had DirectX drivers. Here is Intel's SIGGRAPH paper featuring benchmarks of Half-Life 2 Episode 2, Gears of War, and F.E.A.R. http://download-software.intel.com/sites/default/files/m/9/4...

When you run up against the limits of Moore's law and the end of Dennard Scaling you have to get dirty and do what it takes to get more performance.

Personally, I think CPU architecture became too complicated for my taste after the 68k. So what?

As someone with the most basic understanding of CPUs and assembly, why do you say that?

Advances in semiconductor fabrication are no longer increasing single-threaded performance of general purpose CPU code very much. It's still possible to increase performance by orders of magnitude, but it requires giving up flexibility by using dedicated hardware acceleration blocks such as tensor cores, plus going massively parallel with various combinations of SIMD and threading. That all makes software immensely more complex and less portable.

You're posting this from a 68k?

Doubt it, but it is possible. I’ve seen a few videos on YouTube of people who’ve gotten old computers (such as the OG Macintosh 128k which runs on 68k) to connect to the internet. I’ve even seen people use slightly “less powerful” machines running on a 6502 (such as the Apple II and Commodore 64).

Each of the internal units is reasonably simple compared to a modern superscalar x86. There are just huge numbers of units. The complexity is in the software needed to keep all those units busy and pumping around data between them. What you're doing down at the bottom of rendering or machine learning are usually very simple computations done a huge number of times. Somewhere above that is the problem of parceling out work to all those hardware resources in a somewhat optimal way. That's the hard problem.

There are roughly equal levels of complexity in any modern, high-performance, general-purpose/programmable chip.

But it's not equally exposed to the programmer. The branch predictor in an Intel CPU may be insanely complex but when writing code you usually don't need to care. On the other hand if you want to use a GPU you have to care about a huge amount of complexity right from the start.

Except you really do need to care about this complexity on CPUs. Things like cache locality & predictable access patterns are critical to achieving good CPU performance. This is why there's things like data-oriented design, SoA vs. AoS, and Z-order curves. It's also why linked-lists are so incredibly awful in practice, despite having superb algorithmic performance in theory.

A big reason programming for CPUs doesn't seem as complex is because the vast, vast majority of the time nobody actually cares about CPU performance. We all just prefer to pretend a runtime or JIT or compiler managed to magically make a language that's god-awful horrendous on modern CPUs run fast. They didn't; we all just look the other way.

The difference between CPUs & GPUs is when people reach for GPUs, such as for games or HPC, those are also the people that care a lot about performance. And guides like this are for them.

I don't think we're fundamentally in disagreement. But I will say that there's a huge amount of value in CPUs not forcing you to care about the complexity when it doesn't matter, and it doesn't always matter.

I agree. It's a huge issue for software dev companies, as it takes non-negligible effort to keep your software working efficiently with a large array of GPUs. One potential workaround is libraries such as KOKKOS [0]. Of course it's never going to be as fast as pure optimized CUDA, but if it gets you half the way there, you'll still be faster than using the CPU. Particularly when considering that some of the new CPUs seem to have scaling issues themselves [1].

[0] https://github.com/kokkos/kokkos

[1] https://www.pugetsystems.com/labs/hpc/Threadripper-3990x-64-...

Can the FP32 and Tensor core modules compute at the same time, or are they an abstraction over the underlying silicon, and we can only utilize one at a time?

Pages 16 and 17 in the PDF have graphs that show them being scheduled concurrently.

Ampere seems like a great upgrade for gaming. How much of the FP32 improvements will transfer to hobbyist ML researchers training models on a 3080?
