The A100 whitepaper "spoiled" a lot of these factoids already. (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...)
The new bit seems to be the doubling of FP32 "CUDA cores" (I really hate that term: when Intel or AMD double their CPU pipelines it doesn't mean they're selling more cores, it means their cores got wider... anyway). A100 didn't have this feature (I assume A100 was 16 floating-point + 16 integer "CUDA cores" per CU, like Turing. Correct me if I'm wrong.)
You don't need to read the whitepaper to understand that NVidia has really improved performance/cost here. The 3rd party benchmarks are out and the improved performance is well documented at this point.
The FP32 doubling is one of the most important bits here. But fortunately for programmers, this doesn't really change how you write your code. The compiler / PTX assembler will schedule your code at compile time to best take advantage of it.
The other bit, the larger L1 / shared memory of 128kB per CU, does affect programmers. GPU programmers have tight control over shared memory, which is very useful for optimization purposes.
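To make that concrete, here's a rough sketch of the usual pattern (a toy kernel of my own, not anything from the whitepaper): a block stages a tile of data in __shared__ memory so neighboring threads don't each re-fetch it from DRAM. A bigger per-SM shared-memory budget lets you use larger tiles, or keep more blocks resident at once.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Toy 1D blur: each block stages a tile of the input (plus a halo) in
    // shared memory so neighboring threads don't each re-read global memory.
    // TILE is a made-up size; a bigger shared-memory budget per SM lets you
    // raise it (or keep more blocks resident) without spilling.
    constexpr int TILE = 256;

    __global__ void blur3(const float* in, float* out, int n) {
        __shared__ float tile[TILE + 2];          // +2 for the halo elements

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;                // offset past the left halo

        if (gid < n) tile[lid] = in[gid];
        if (threadIdx.x == 0)              tile[0]        = (gid > 0)     ? in[gid - 1] : 0.f;
        if (threadIdx.x == blockDim.x - 1) tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.f;
        __syncthreads();                          // tile is fully populated

        if (gid < n)
            out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.f;
    }

    int main() {
        const int n = 1 << 20;                    // divides evenly by TILE
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = float(i % 7);

        blur3<<<n / TILE, TILE>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("out[10] = %f\n", out[10]);

        cudaFree(in); cudaFree(out);
        return 0;
    }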
GDDR6X's improved memory bandwidth is also big. "Feeding the beast" with faster RAM is always a laudable goal, and sending 2 bits per pin per transfer through PAM4 signaling is a nifty trick.
Sparse Tensor Cores were already implemented in A100 and don't seem to be new. If you haven't heard of the tech before, it's cool: basically hardware-accelerated sparse-matrix computations. A 4x4xFP16 matrix uses 32 bytes under normal conditions, but can be "compressed" into 16 bytes of values (plus a small amount of index metadata) when at least two out of every four consecutive values are zero. NVidia Ampere supports hardware-accelerated matrix multiplications of these 16-byte "virtual" 4x4xFP16 matrices.
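A back-of-the-envelope sketch of the compression idea (the struct and function here are mine, not NVidia's actual storage format): keep only the surviving values plus a tiny index saying which slots they came from.

    #include <cstdint>
    #include <cstdio>

    // Toy illustration of 2:4 structured sparsity. For every group of 4 FP16
    // values, at most 2 may be nonzero; we store just those 2 values plus
    // 2-bit indices recording which of the 4 slots they occupied.
    //
    // A 4x4 FP16 tile = 16 values * 2 bytes = 32 bytes dense.
    // Compressed: 8 values * 2 bytes = 16 bytes, plus 8 * 2 bits = 2 bytes of
    // metadata -- roughly the halving described above.
    struct CompressedGroup {
        uint16_t vals[2];   // the (up to) 2 surviving FP16 payloads
        uint8_t  idx[2];    // which of the 4 original positions each came from
    };

    bool compress_2_4(const uint16_t dense[4], CompressedGroup* out) {
        int kept = 0;
        for (int i = 0; i < 4; ++i) {
            if (dense[i] == 0) continue;          // treat all-zero bits as zero (toy rule)
            if (kept == 2) return false;          // more than 2 nonzeros: not 2:4 sparse
            out->vals[kept] = dense[i];
            out->idx[kept]  = (uint8_t)i;
            ++kept;
        }
        for (; kept < 2; ++kept) { out->vals[kept] = 0; out->idx[kept] = 0; }
        return true;
    }

    int main() {
        // One group of 4: only slots 1 and 3 are nonzero, so it compresses.
        uint16_t dense[4] = {0, 0x3C00 /* 1.0 in FP16 */, 0, 0x4000 /* 2.0 */};
        CompressedGroup g;
        if (compress_2_4(dense, &g))
            printf("kept %#06x at slot %d and %#06x at slot %d\n",
                   (unsigned)g.vals[0], (int)g.idx[0],
                   (unsigned)g.vals[1], (int)g.idx[1]);
        return 0;
    }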
I swear that RTX I/O existed before in some other form. This isn't the first time I've heard about offloading PCIe transfers to the GPU. It's niche and I don't expect video games to use it (are M.2 SSDs popular enough to be assumed on the PC / laptop market yet?). But CUDA coders can probably control their hardware more carefully and benefit from such a feature.
It's a feature the new consoles are doing so it'll be widely supported.
In addition, I suspect the slice of the video game market that has a GPU with RTX I/O capability will also have an NVMe SSD. Now, this is niche, but with that slice of the market also being the top-end performance tier, they're still going to be catered to by AAA devs.
Early benchmarks are showing games under-performing quite a bit in the worst cases. The crux of the issue is that it's not /exactly/ a no-compromise doubling of FP32. Each SM partition can issue either 2xFP32 or 1xFP32 + 1xINT32 per clock cycle: one datapath only does FP32, while the other does either FP32 or INT32. So if your game or application has any significant amount of INT32 work scheduled, all of a sudden you're back to the FP32 throughput you had last generation, though you get the benefit of parallel INT32 execution.
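To illustrate the trade-off, here are two contrived toy kernels of my own (purely illustrative, the names are made up). In the first, nearly every issued instruction is an FP32 FMA, so both datapaths can stay busy doing float math. In the second, the integer hashing for lookups occupies the datapath that could otherwise be doing FP32.

    #include <cuda_runtime.h>

    // Nearly pure FP32: the "2x FP32 per clock" case.
    __global__ void fp_only(float* out, int iters) {
        float x = threadIdx.x * 0.001f;
        for (int i = 0; i < iters; ++i)
            x = x * 1.0001f + 0.5f;              // FP32 FMA, no integer work
        out[blockIdx.x * blockDim.x + threadIdx.x] = x;
    }

    // Lookup-style code: the INT32 hashing issues on the shared datapath, so
    // peak FP32 throughput drops back toward the last-gen rate, but the
    // integer work runs alongside the remaining FP32.
    __global__ void int_heavy(const float* table, float* out, int iters) {
        unsigned h = threadIdx.x;
        float x = 0.f;
        for (int i = 0; i < iters; ++i) {
            h = h * 2654435761u + 1u;            // INT32 multiply/add
            x += table[h & 1023] * 0.5f;         // FP32 work fed by the integer index
        }
        out[blockIdx.x * blockDim.x + threadIdx.x] = x;
    }

    int main() {
        const int threads = 256, blocks = 1024, iters = 10000;
        float *table, *out;
        cudaMallocManaged(&table, 1024 * sizeof(float));
        cudaMallocManaged(&out, blocks * threads * sizeof(float));
        for (int i = 0; i < 1024; ++i) table[i] = 1.0f;

        fp_only<<<blocks, threads>>>(out, iters);
        int_heavy<<<blocks, threads>>>(table, out, iters);
        cudaDeviceSynchronize();   // time the two with Nsight/nvprof to see the gap
        cudaFree(table); cudaFree(out);
        return 0;
    }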
It's a pretty cool architecture overall though.
Parallel INT32 was added with the last generation, in Turing. See page 13 of https://www.nvidia.com/content/dam/en-zz/Solutions/design-vi...
So NVidia split the INT32 units out from the FP32 units last gen to make them independent, then this gen added FP32 capability back to the INT32 datapath while keeping the two datapaths separate.
From my understanding, int is often used for lookups, and I'd presume a lot of that is some sort of environment mapping. That adds contention: integer work can only run on the shared datapath, so it "steals" from the doubling of FP32.
Also, the fact that NVidia has probably patented this particular architecture makes it less interesting for me to really dig into.
I think the idea still has promise but there's a chicken-and-egg issue where you'd really need to rearchitect game engines and content pipelines to take full advantage of the flexibility before you'd see a benefit. It's possible that it would work better today, and it's also possible that Intel just gave up too early. In some cases we're already seeing people bypassing the fixed-function rasterizer in GPUs and doing rasterization manually in compute shaders:
 Doom Eternal: http://advances.realtimerendering.com/s2020/RenderingDoomEte...
 Epic Nanite: https://twitter.com/briankaris/status/1261098487279579136
The actual hard part with GPUs is ensuring you can divide up the work and that it doesn't branch within a given chunk size. You have those same issues when trying to leverage a many-core CPU with AVX-512. You still want to keep those AVX-512 units loaded, which means work units of 16 FP32 lanes must all take the same "branch" - not really any different from feeding warps on a GPU. And you've still got to scale across dozens if not hundreds of CPU cores.
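A minimal CUDA sketch of that "same branch per chunk" point (my own toy kernels, not from any vendor material): when threads in a warp disagree on a branch, both sides execute serially with lanes masked off; the same reasoning applies to keeping all 16 lanes of an AVX-512 register on one path.

    #include <cuda_runtime.h>

    // Lanes may disagree on the condition, so the warp runs both paths
    // one after the other.
    __global__ void divergent(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (in[i] > 0.f)
            out[i] = sqrtf(in[i]);
        else
            out[i] = expf(in[i]);
    }

    // One branch-free alternative: compute both values and select. Every lane
    // does the same work, so the warp (or the 16 lanes of an AVX-512 register
    // on a CPU) stays on a single path.
    __global__ void branch_free(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float a = sqrtf(fmaxf(in[i], 0.f));   // clamp so the unused path stays NaN-free
        float b = expf(in[i]);
        out[i] = (in[i] > 0.f) ? a : b;       // predicated select, no divergent branch
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = (i % 2) ? 1.f : -1.f;  // worst case: lanes alternate

        divergent<<<n / 256, 256>>>(in, out, n);
        branch_free<<<n / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }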
The actual actual hard part with GPUs is writing portable code in the face of a million edge cases due to different proprietary hardware architectures and buggy drivers, which you can't test without actually buying and maintaining whole rooms full of hardware. Reducing fixed function parts of the hardware and using a documented ISA, as Larrabee tried, would help with that.
Nanite uses compute-shader rasterization partly because of the quad overdraw problem, since they are targeting nearly 1 triangle per pixel. But they also say they use traditional rasterization when it is faster, via recent hardware's mesh shaders (which remove a different set of fixed-function stages, for geometry transform, so it still makes the same point).
Oh, and Larrabee gave us more than AVX-512; it also gave us the Xeon Phis, which were accelerators (much akin to the GPGPU use of NVidia GPUs?) aimed at scientific code, under the promise that "since it's x86, you don't need to change your code that much!". However:
> An empirical performance and programmability study has been performed by researchers, in which the authors claim that achieving high performance with Xeon Phi still needs help from programmers and that merely relying on compilers with traditional programming models is still far from reality. However, research in various domains, such as life sciences, and deep learning demonstrated that exploiting both the thread- and SIMD-parallelism of Xeon Phi achieves significant speed-ups.
So pretty much the same as a GPU. It is a bit unfortunate: in theory, good OpenCL support could have made the same code run on 2/4/8-core CPUs (with or without SMT) or on the thread-beasts that are/were the Phis. But that would've probably required OpenCL to be a bit more mature, and Intel skipped that train too.
OpenCL is very specifically tailored for GPUs (though FPGAs may benefit). The concept of "constant memory", "shared memory" ("local memory" in OpenCL terms), and "global memory" is very GPU-centric, and doesn't benefit Xeon Phi at all.
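To make that concrete, here's roughly what those address spaces look like in a kernel (CUDA spelling rather than OpenCL, and the kernel itself is just a toy of mine). On a GPU each space maps to distinct hardware; on a cache-based CPU like Xeon Phi they all just end up as ordinary cached DRAM, so the distinction buys you nothing.

    #include <cuda_runtime.h>

    __constant__ float coeffs[4];                 // constant memory: small, read-only, broadcast-friendly

    __global__ void poly(const float* in,         // in/out live in global memory (device DRAM)
                         float* out, int n) {
        __shared__ float tile[256];               // shared memory: on-chip, per-block scratchpad

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        tile[threadIdx.x] = in[i];                // stage the value on-chip (purely to show the qualifier)
        __syncthreads();

        float x = tile[threadIdx.x];
        out[i] = ((coeffs[3] * x + coeffs[2]) * x + coeffs[1]) * x + coeffs[0];
    }

    int main() {
        const int n = 1 << 16;
        float h_coeffs[4] = {1.f, 0.5f, 0.25f, 0.125f};
        cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = 0.01f * i;

        poly<<<n / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }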
I'd assume that any OpenCL program would simply function better on a GPU, even compared to a 60-core in-order 512-bit SIMD-based processor like Xeon Phi.
Xeon Phi's main advantage really was running "like any other x86 processor", with 60 cores / 240 threads. But you still needed to AVX512 up your code to really benefit.
Honestly, I think Xeon Phi just needed a few more revisions to figure out itself more. It was on the market for less than 5 years. But I guess it wasn't growing as fast as NVidia or CUDA.
This is what I had in mind when I wrote "if Intel had given it good OpenCL support". Again, maybe I'm mixing things up in my head since my career never took me down the lane of writing massively parallel code (though I am a user of it, indirectly, through deep learning frameworks).
 back then this was as big as a CPU would get
I remember reading things like: https://software.intel.com/content/www/us/en/develop/documen...
Where you'd have to use float8 types to be assured of SIMD benefits in CPU code. As such, it's probably more useful to rely upon auto-vectorizers in C++ code (such as #pragma omp simd) and maybe intrinsics for the complicated cases.
Intel does seem to have some level of OpenCL -> AVX tech: http://llvm.org/devmtg/2011-11/Rotem_IntelOpenCLSDKVectorize...
The vector architectures with extremely high memory bandwidth coming out of Japan recently (NEC SX-Aurora Tsubasa, Fujitsu A64FX) are pretty fascinating.
Though to be fair, I'm not sure it's really all that complex relative to modern high-end processors. Most of a GPU is just the same unit repeated.
For mind-boggling complexity, to my mind, look at the manufacturing process undertaken by the likes of TSMC.
Yeah, modern semiconductor fabrication is pretty much the pinnacle of human achievement. My favorite video on the subject: https://www.youtube.com/watch?v=NGFhc8R_uO4
No one thought a collection of Atom CPUs with AVX-512 SIMD was going to be able to compete head to head with the best Nvidia cards on rasterizing games.
Personally, I think CPU architecture became too complicated for my taste after the 68k. So what?
A big reason programming for CPUs doesn't seem as complex is that, the vast, vast majority of the time, nobody actually cares about CPU performance. We all just prefer to pretend a runtime or JIT or compiler managed to magically make a language that's god-awful horrendous on modern CPUs run fast. They didn't; we all just look the other way.
The difference between CPUs & GPUs is when people reach for GPUs, such as for games or HPC, those are also the people that care a lot about performance. And guides like this are for them.