
I agree that the cost difference is substantial, but something seems off here. I think you can buy nVidia cards that beat anything AMD has for ~$1500-2000. Those are gaming focused, but it makes me extremely skeptical of the overall numbers.

If the AMD chip really is better for your job, that's not that crazy a claim, but it makes no sense that nVidia wouldn't have something comparable for 1.5-2x as much. That seems to be the going rate currently.

If you're really getting comparable speed for 1/9 the cost, I don't think the more expensive product would survive for very long. I've been looking at graphics cards, and nVidia clearly beats AMD in every category at around a 50-100% markup.



Gamers don’t care about FP64 performance, and it seems nVidia is using that for market segmentation. The FP64 performance of the RTX 4090 is 1.142 TFlops; for the RTX 3090 Ti it’s 0.524 TFlops. AMD doesn’t do that: FP64 performance is consistently better there, and it has been this way for quite a few years. For example, the figure for the 3090 Ti (a $2000 card from 2022) is similar to that of the Radeon Vega 56, a $400 card from 2017, which can do 0.518 TFlops.
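
Put in price-per-FP64-TFlops terms, using those same figures: $2000 / 0.524 ≈ $3,800 per TFlops for the 3090 Ti versus $400 / 0.518 ≈ $770 per TFlops for the Vega 56, roughly a 5x gap in favor of the five-years-older AMD card.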

And another thing: nVidia forbids usage of GeForce cards in data centers, while AMD allows it. I don’t know how exactly they define "data center", whether it’s enforceable, or whether it’s been tested in the courts of various jurisdictions. I just don’t want to find out the answers to these questions at the legal expense of my employer. I believe they would prefer not to cut corners like that.

I think nVidia only beats AMD due to the ecosystem: for GPGPU that’s CUDA (and especially the included first-party libraries for BLAS, FFT, DNN and others), plus the support in popular frameworks like TensorFlow. However, it’s not that hard to ignore the ecosystem and instead write some compute shaders in HLSL. Here’s a non-trivial open-source project unrelated to CAE where I managed to do just that with decent results: https://github.com/Const-me/Whisper That software even works on Linux, probably due to Valve’s work on DXVK 2.0 (a compatibility layer which implements D3D11 on top of Vulkan).
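
For a concrete picture of what the host side of that looks like, here's a minimal sketch (not the actual code from the project above) of dispatching a precompiled HLSL compute shader through D3D11. The shader file name, buffer size, and thread group size are assumptions for illustration, and error handling is omitted.

    // Minimal D3D11 compute dispatch sketch (C++). Assumes shader.cso was compiled
    // offline from HLSL with [numthreads(64, 1, 1)] and an RWStructuredBuffer<float> at u0.
    // Link against d3d11.lib. Error handling omitted for brevity.
    #include <d3d11.h>
    #include <wrl/client.h>
    #include <fstream>
    #include <iterator>
    #include <vector>
    using Microsoft::WRL::ComPtr;

    int main() {
        ComPtr<ID3D11Device> dev;
        ComPtr<ID3D11DeviceContext> ctx;
        D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                          nullptr, 0, D3D11_SDK_VERSION, &dev, nullptr, &ctx);

        // Load precompiled shader bytecode.
        std::ifstream f("shader.cso", std::ios::binary);
        std::vector<char> cso((std::istreambuf_iterator<char>(f)), std::istreambuf_iterator<char>());
        ComPtr<ID3D11ComputeShader> cs;
        dev->CreateComputeShader(cso.data(), cso.size(), nullptr, &cs);

        // A structured buffer of 1024 floats, writable by the shader through a UAV.
        D3D11_BUFFER_DESC bd = {};
        bd.ByteWidth = 1024 * sizeof(float);
        bd.Usage = D3D11_USAGE_DEFAULT;
        bd.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
        bd.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
        bd.StructureByteStride = sizeof(float);
        ComPtr<ID3D11Buffer> buf;
        dev->CreateBuffer(&bd, nullptr, &buf);
        ComPtr<ID3D11UnorderedAccessView> uav;
        dev->CreateUnorderedAccessView(buf.Get(), nullptr, &uav);

        // Bind the shader and the UAV, then dispatch 1024 / 64 = 16 thread groups.
        ctx->CSSetShader(cs.Get(), nullptr, 0);
        ctx->CSSetUnorderedAccessViews(0, 1, uav.GetAddressOf(), nullptr);
        ctx->Dispatch(16, 1, 1);
        return 0;
    }

Nothing in that path is vendor-specific; the same code runs on whichever GPU the D3D11 device was created on.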


> However, it’s not that hard to ignore the ecosystem

I'd say this only works if you have stable results or models to target, like Whisper; in that case "going your own way" can also help improve portability, and Llama.cpp is another good example. But a lot of the software demand is not driven by that; it's driven by continuously evolving models and needs, and a lot of the bloat, or whatever you want to call it, is a result of that.

Besides that, the programming models are moving on. The open-source Nvidia Linux kernel driver now enables fully heterogeneous memory management on x86 across the CPU and GPU. This means the programmer no longer has to enforce memory coherency, perform device-specific allocations, or copy memory; migrations, page table/TLB flushes, etc. all work out of the box with no modifications to userspace software. So now your io_uring asynchronous loop can write training data to memory that is implicitly available to the GPU, no matter what memory allocator you're using. It basically means arbitrary CPU compute and arbitrary GPU compute are now composable, using the memory substrate (and the OS kernel) as a coherent transport/storage layer. On x86/Nvidia this works at the granularity of a page, but on the Grace Hopper superchip it is going to happen at the level of a cache line. Multiple Grace Hopper superchips can be linked with NVLink, and nodes connected over InfiniBand, so this works across a cluster. You can drive an entire rack of systems this way and it works.
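
To make the "no explicit copies" point concrete, here's a hedged sketch of that idea with the CUDA runtime. With HMM on the open kernel driver, a plain malloc() pointer can be dereferenced from a kernel on supported systems; the portable fallback shown here uses cudaMallocManaged. The kernel and sizes are placeholders.

    // One allocation visible to both CPU and GPU; pages migrate on demand.
    // Build with nvcc. With an HMM-capable driver and hardware, the cudaMallocManaged
    // call below could be replaced with plain malloc() and the kernel would still see the data.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float* data, int n, float k) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= k;
    }

    int main() {
        const int n = 1 << 20;
        float* data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));

        for (int i = 0; i < n; ++i) data[i] = 1.0f;     // CPU writes the data...

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // ...GPU reads/writes the same pointer
        cudaDeviceSynchronize();

        printf("%f\n", data[0]);  // CPU reads the result; no cudaMemcpy anywhere
        cudaFree(data);
        return 0;
    }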

For people actually doing a lot of GPU-specific programming, or deploying models on servers (e.g. for API usage), this is going to be a big deal in the long run, and it started way back when they first introduced unified virtual memory. AMD is moving this way too for their compute stacks, I assume. The compute shader model just isn't evolving for these kinds of needs, and it isn't clear that it will anytime soon.


I’m not sure heterogeneous memory is such a huge deal, because of the performance numbers. PCI Express is relatively slow in terms of bandwidth, and especially latency. To compensate, GPUs have a dedicated piece of hardware to asynchronously copy blocks of data over PCIe. In modern low-level APIs that hardware is even exposed to programmers directly, as the transfer queue in Vulkan and the copy command queue in D3D12.

Manually moving data with APIs like cudaMemcpy or ID3D11DeviceContext.CopyResource complicates the code, but it's much faster than unified memory, especially if you do it correctly with pipelining, so the GPU computes something else (like the previous batch of work) while the new data is being copied.
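
As an illustration of that pipelining pattern (a sketch, assuming CUDA, with made-up batch sizes and a placeholder kernel): copy batch N+1 over PCIe with the async copy engine while the GPU computes on batch N.

    // Double-buffered upload + compute overlap using two CUDA streams. Build with nvcc.
    #include <cuda_runtime.h>
    #include <vector>

    __global__ void process(float* buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = buf[i] * buf[i];
    }

    int main() {
        const int batch = 1 << 20, batches = 8;
        std::vector<float> host((size_t)batch * batches, 1.0f);

        // Pinned host memory lets the DMA engine overlap copies with kernels.
        cudaHostRegister(host.data(), host.size() * sizeof(float), cudaHostRegisterDefault);

        float* dev[2];
        cudaStream_t stream[2];
        for (int s = 0; s < 2; ++s) {
            cudaMalloc(&dev[s], batch * sizeof(float));
            cudaStreamCreate(&stream[s]);
        }

        for (int b = 0; b < batches; ++b) {
            int s = b % 2;  // ping-pong between two device buffers / streams
            cudaMemcpyAsync(dev[s], host.data() + (size_t)b * batch,
                            batch * sizeof(float), cudaMemcpyHostToDevice, stream[s]);
            process<<<(batch + 255) / 256, 256, 0, stream[s]>>>(dev[s], batch);
            // While stream s computes batch b, the next iteration's copy runs on the other stream.
        }
        cudaDeviceSynchronize();

        cudaHostUnregister(host.data());
        cudaFree(dev[0]); cudaFree(dev[1]);
        for (int s = 0; s < 2; ++s) cudaStreamDestroy(stream[s]);
        return 0;
    }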

Speaking of new features, I would rather expect GPGPU users to be interested in the DirectStorage technology, which lets GPUs load data from SSDs efficiently. It's currently Windows-only but supported by all three GPU vendors. Because it was implemented primarily for video games, it works just fine with compute shaders.



