The GPU doesn't have DMA access to the disk or RAM; it has to go through the CPU to access them, which causes quite a delay.
Even if you read from RAM during normal operation you'll have a very low framerate unless you are using pretty damn good latency masking. A good example is texture streaming, where you hold low-res textures in GPU memory and load the higher-res ones from RAM or disk; even on very high-end systems it causes a lot of texture pop-in, which people find rather annoying.
Latency hiding is indeed required, but it's a solvable problem. There's no way that this would work synchronously.
See my other comment above. I implemented a technique like this (on a <10 watt mobile device!) and had no issues whatsoever with framerate, even with my suboptimal technique (using OpenGL sparse textures, which are quite restrictive) that might stall. These stalls can be avoided with Vulkan/D3D12 techniques (using aliased sparse pages, i.e. one physical page mapped into several textures), so maintaining a stable framerate simply isn't an issue.
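Roughly, the GL side of committing a page looks like this (a minimal sketch, not my actual demo code; mipLevels, level, pageX/pageY, pageW/pageH and pixels are placeholder names, and the page size would really come from glGetInternalformativ with GL_VIRTUAL_PAGE_SIZE_X/Y_ARB):

    // Create a sparse texture: only virtual address space, no physical pages yet.
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SPARSE_ARB, GL_TRUE);
    glTexParameteri(GL_TEXTURE_2D, GL_VIRTUAL_PAGE_SIZE_INDEX_ARB, 0);
    glTexStorage2D(GL_TEXTURE_2D, mipLevels, GL_RGBA8, 16384, 16384);

    // Later, when the streaming system decides a page is needed:
    glTexPageCommitmentARB(GL_TEXTURE_2D, level,
                           pageX * pageW, pageY * pageH, 0,  // page-aligned offset
                           pageW, pageH, 1,                  // one page
                           GL_TRUE);                         // commit physical memory
    glTexSubImage2D(GL_TEXTURE_2D, level, pageX * pageW, pageY * pageH,
                    pageW, pageH, GL_RGBA, GL_UNSIGNED_BYTE, pixels);  // upload the data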
This was done with the CPU orchestrating the whole deal. The GPU isn't required to have access to disk or initiate the DMA transfer.
Latency, on the other hand, is an issue but it doesn't seem to be that bad in practice (warning: anecdotal evidence). In my simple demo, practically every page uploaded was resident on the GPU by the next frame after it was required. A small number of pages (less than 1%) had 2 frames of latency. None had 3 or more.
Hiding the visual artifacts from texture popping is a real issue too, but can be mitigated by speculatively uploading pages and applying filtering between mip map levels.
All of this can be done today using Vulkan or D3D12. What Carmack is suggesting (if I understood him correctly) is to make the kernel driver on the CPU initiate asynchronous upload without userspace intervention, which would improve latency and bandwidth.
What if the GPU had its own private SSD with textures installed to it when you install the game? A 100GB SSD is about 20% of the price of a high end video card.
Inb4 the next GTX titan comes with 24GB of HBM and an M.2 slot.
I don't think it's necessary; for applications that require that much memory you can probably develop some proprietary tech to access storage (NVIDIA does it).
JC isn't talking about compute though, he's talking about gaming. He really likes megatextures, but he's in a minority, and I honestly can't claim to have enough expertise to judge whether he's right or not.
GPU memory management is a mixed bag: there are many cards with different RAM speeds and different bandwidths, all doing magic voodoo in the background regarding low-level memory access and compression, way beyond what you are exposed to at the driver and API layers (e.g. NVIDIA's on-card memory compression, which is applied regardless of how you compress or load your textures in the first place).
WDDM allows lower-end systems to take advantage of more video memory to improve performance in less demanding applications (which is most apps today), while still enabling all the nice graphics we've come to expect from our window manager and applications.
I'm not sure that allowing the GPU to read directly from the SSD or RAM (outside the current scope of virtualized GPU memory and asset loading) actually has enough benefit to justify both the engineering cost and the potential security pitfalls when multiple applications suddenly get DMA access to your RAM and local storage. This is even more the case for gaming, considering the amount of RAM that graphics cards ship with today, which is only increasing; I don't know if anyone but JC actually wants/needs this.
I'm wondering how consoles handle this. IIRC from playing a bit with the UDK (UE3) at the time, it supported streaming textures directly from the optical media on the Xbox 360, so if anyone knows how it works under the hood and is willing to share, I think it would add to this topic.
That could happen with Intel's Optane SSDs, but I don't think it's necessary. The latency between the CPU and GPU is not measured in any sort of human-perceivable time, so it doesn't matter unless the GPU is making millions of CPU memory accesses, and that isn't how games or GPUs work, for obvious reasons.
The bandwidth is also not a problem; the entire memory of a 16GB GPU can be filled in under a second.
The latency between the GPU and CPU is a very big problem for gaming; if it weren't, we wouldn't be having this discussion. A bad CPU call can easily add 50ms of latency to your frame, which suddenly drops you from 60fps to 20.
You are talking about two different things. One is the latency of data from main memory to the gpu memory and the other is whatever video game metric you are using.
To get 60fps your video card needs to push out a frame every 16ms.
If, say, 6ms of that is actual on-card processing and 10ms is CPU/API overhead, and you add another CPU call plus a system memory access that adds 30-35ms of latency, you are now at ~50ms per frame, which means you can only output 20 frames per second.
GPUDirect is very specific: it requires implementation on the PCIe switch (some modification to the IOMMU), in UEFI/BIOS, and on the card, as well as in the OS kernel and the display drivers.
Because it requires both very specific software and hardware configurations only some features work on some systems.
Also, calling this DMA is kind of misleading; it requires the CUDA software layer to work, and for the most part it's a lot of hacks coupled together to form a feature set.
So I would still stand by what I said: there currently isn't a standard and generic way for GPUs to have DMA access.
GPU can just queue page faults and raise an IRQ. Those faults can then be handled in any size of group on the CPU side. By using a queue, latency can be effectively hidden.
Perhaps disk I/O DMA can even be directly connected to GPU DMA, bypassing system RAM entirely. Even currently, ethernet and disk DMA can bypass memory and go directly to CPU L3 cache. With existing flexibility like that, it's not far-fetched to be able to connect DMA between two devices.
Pretty much the same happens when you touch a non-present page from user mode. The CPU will get interrupted and queue disk I/O to fetch the faulting page.
Even disks themselves have their internal queues (like SATA NCQ). Disks also have CPUs to serve queued I/O requests.
> GPU can just raise IRQ and have a queue for the page faults.
Handling GPU IRQ's is possible but has unnecessarily high latency and pipeline stalling problems. However, the kernel mode drivers do a bit of this behind the scenes (mainly when switching from app to app).
But using sparse textures (ie. GL_EXT_sparse_texture2), the GPU can detect if a fault would occur and react to it. Instead of IRQ'ing for every missed page, a list of all missed pages can be extracted at the end of a frame.
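One way to do that end-of-frame extraction, sketched on the CPU side (hedged, not my exact implementation; it assumes the fragment shader appends the IDs of non-resident pages into a storage buffer via an atomic counter, and missCounterBuf, missBuf and queue_page_upload are placeholder names):

    // Read back how many pages the shaders reported missing this frame.
    // In practice you'd double-buffer this so the readback doesn't stall the GPU.
    GLuint missCount = 0;
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, missCounterBuf);
    glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(missCount), &missCount);

    if (missCount > 0) {
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, missBuf);
        GLuint *missedPages = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0,
                                               missCount * sizeof(GLuint),
                                               GL_MAP_READ_BIT);
        for (GLuint i = 0; i < missCount; ++i)
            queue_page_upload(missedPages[i]);  // placeholder: hand off to the streaming system
        glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
    }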
> Perhaps disk I/O DMA can even be connected to GPU DMA, bypassing system RAM entirely.
Afaik Nvidia's NVLink does this but it's aimed at super computers, not graphics.
> Handling GPU IRQ's is possible but has unnecessarily high latency and pipeline stalling problems. However, the kernel mode drivers do a bit of this behind the scenes (mainly when switching from app to app).
Time it takes for GPU to access data present in its local internal GDDR RAM: about 1 microsecond. Time it takes from GPU asserting (pulls up) [1] IRQ until CPU handles IRQ: 5-50 microseconds. Time it takes for CPU to insert I/O request into OS internal queue: unknown, but it's almost certainly a small fraction of a microsecond. Hundreds of faults can be handled with one IRQ request.
So there's not really that much more latency than in a normal CPU page fault. Actually probably less on average, because these faults can be grouped. GPU memory access latency is quite high anyway, and GPUs are already built to tolerate high latency with a massive number of hardware threads.
IRQ can for example be edge triggered by the first GPU page fault. Each GPU page fault doesn't need to trigger CPU side interrupt. That'd be pointless, it'd just cause high CPU load for no benefit.
I think some sort of page fault FIFO is simpler hardware-wise than waiting for some arbitrary event like "end of a frame".
GPU can just keep pushing new faults in a CPU-visible FIFO. That way it's possible to amortize latency and to avoid IRQ storm (=excessive number of IRQ requests).
The CPU side can just pull faults from the FIFO at whatever rate the I/O system can support.
Fault data from GPU could also include priority, so that more visually important data can be fetched first even if it wasn't encountered first. For example geometry or textures covering major parts of screen could have high priority.
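In rough C, the kind of structure I mean (purely illustrative, not any vendor's real interface; queue_page_io is a placeholder for the actual I/O submission path):

    #include <stdint.h>

    /* One entry per GPU page fault, pushed by the GPU into a CPU-visible FIFO. */
    struct gpu_fault_entry {
        uint64_t gpu_virtual_page;  /* faulting GPU virtual page number */
        uint32_t priority;          /* e.g. derived from screen coverage / mip level */
        uint32_t context_id;        /* which GPU context faulted */
    };

    struct gpu_fault_fifo {
        struct gpu_fault_entry entries[4096];
        volatile uint32_t head;     /* advanced by the GPU as it pushes faults */
        volatile uint32_t tail;     /* advanced by the CPU/driver as it drains them */
    };

    /* Driver side: one edge-triggered IRQ wakes this up, then it drains every
     * fault that has accumulated since the last pass, so hundreds of faults
     * can be amortized over a single interrupt. */
    static void drain_gpu_faults(struct gpu_fault_fifo *fifo)
    {
        while (fifo->tail != fifo->head) {
            struct gpu_fault_entry *e = &fifo->entries[fifo->tail % 4096];
            queue_page_io(e->gpu_virtual_page, e->priority);  /* placeholder I/O call */
            fifo->tail++;
        }
    }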
(I've co-designed a HW mechanism for avoiding an excessive number of IRQs without sacrificing latency and designed and implemented a kernel mode driver to support it. Of course GPUs are quite a bit more complicated than that simple case.)
How would the GPU or GPU drivers know how to issue a DMA transfer, and know when DMA is ready, on another piece of hardware? Those things are specific to each piece of hardware.
Besides there's a small detail called IOMMU that prevents PCI(-e) devices from writing or reading wherever they please.
You can already pull assets from RAM somewhat transparently (if you want to do it well you need to optimize it further, but to some extent NVIDIA and AMD do quite a bit of optimization in the driver too), and a shared page table between the CPU and GPU is also already in place under WDDM.
So adding a disk based page file to this won't be so hard, it could more or less work the same way as the page file currently works.
I'm still not sure if this is actually needed; GPUs already come with stupid amounts of RAM today. I have 24GB of GDDR5 in my system, and even mid-range cards today won't come with less than 6GB of RAM.
> You can already pull assets from RAM somewhat transparently (if you want to do it well you need to optimize it further, but to some extent NVIDIA and AMD do quite a bit of optimization in the driver too), and a shared page table between the CPU and GPU is also already in place under WDDM.
Can GPU page table entry point to non-present page(s)? Or does it only work for "pinned pages" [1] that cannot be paged out of RAM?
If it doesn't require pinning, then what prevents mmapping assets on disk even today?
If, however, it does require pinning, then John Carmack has a damn good point.
IIRC the driver handles the pinning. There are basically three types of video memory under WDDM: Dedicated Video Memory (on-card), Dedicated System Memory (memory assigned to the GPU only, usually via BIOS configuration, and off-limits to the OS), and Shared System Memory (a part of virtual system memory allocated for the GPU). So in theory, if you only use Dedicated Video Memory and Shared System Memory, you could be pulling data from the on-disk page file, but it's not like you can map a specific asset (say a texture file) on disk to your video memory directly.
WDDM also limits the volume and commit sizes based on some "arbitrary" limits that MSFT set out (IIRC it's something like system memory / 2, or something silly like that). There's a bit more silliness too: for example, if the max memory available for graphics is 2GB you can't commit 3GB in one go, but you can do 3x1GB just fine.
I'm pretty sure that at the moment WDDM and the vendor display drivers pin all pages to RAM so they won't end up in the page file. TBH I haven't had a page file on my system for a long, long time; Windows doesn't use SSDs for paging unless it really has to, and considering I haven't used a system with less than 32GB of RAM for the past 5 years, I've never had issues with it.
P.S.
Apologies if I used any terminology incorrectly; this is stretching both my knowledge and recollection of the subject, and it's also more or less limited to how GPU/graphics are handled within Windows.
I think it can only work through pinning, because it seems to rely on GPU bus mastering for accessing pages that are dedicated for graphics.
So you have 32 GB of RAM and, say, 8 GB of GPU RAM. Imagine you have a program with 100 GB of graphics assets and no built-in mechanism to guess which subset of assets might be required for the current scene. The program needs about 50 MB in any given frame; in other words, it has a 50 MB working set.
As an operating system, which assets are you going to keep in memory?
Now imagine you have multiple programs running concurrently, each having 100 GB of assets. Each app has that same 50 MB working set.
How can the system handle this situation efficiently (or at all!) if the pages need to be pinned to RAM?
If you can have true on-demand GPU paging, each of these apps only needs to swap in its current working set of data. The user would not perceive any delay when switching from app to app, and could even display all of them at the same time without any issues.
"In the event that video memory allocation is required, and both video memory and system memory are full, the WDDM and the overall virtual memory system will then turn to disk for video memory surfaces. This is an extremely unusual case, and the performance would suffer dearly in that case, but the point is that the system is sufficiently robust to allow this to occur and for the application to reliably continue."
The only issue here is that, as far as I can understand, you can't really choose how this is done very specifically.
When you allocate memory the driver pretty much takes over; if you want granular control over memory allocation you pretty much have to go through the GPUMMU path, which means each process has a separate GPU and CPU address space. So while you can control what you store in GPU memory and what you store in system memory, I still don't see a way to map an asset to disk specifically, other than as an edge case of running out of system memory, which results in the page file being used.
I guess it depends on if the IOMMU has a fault interrupt that can be acted upon. If it does, then it can probably be done. However, I'm not sure if the OS will handle this or not.
When the OS runs out of RAM, it can swap pages to disk. If this page also has a mapping in an IOMMU, it can invalidate the mapping there as well.
Then, when the device attempts to touch the page, the IOMMU faults and the CPU would swap the page back in.
I'm not sure if this is possible, but this is the route I'd expect something like this to follow.
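In pseudo-C, roughly the route I mean (all of these helper names are hypothetical, not real kernel APIs; the device-retry step would need something like PRI support on the bus):

    /* Hypothetical sketch of the swap-out / fault-back flow described above. */
    void swap_out_page(struct page *pg)
    {
        write_page_to_swap(pg);              /* normal swap path */
        if (page_has_iommu_mapping(pg))
            iommu_invalidate_mapping(pg);    /* device will now fault on access */
    }

    /* Called when the IOMMU reports a device fault on this I/O virtual address. */
    void handle_iommu_fault(unsigned long io_virt_addr)
    {
        struct page *pg = swap_in_page(io_virt_addr);  /* bring the page back */
        iommu_restore_mapping(io_virt_addr, pg);       /* re-establish the mapping */
        notify_device_retry(io_virt_addr);             /* tell the device to retry the access */
    }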
> I guess it depends on if the IOMMU has a fault interrupt that can be acted upon. If it does, then it can probably be done. However, I'm not sure if the OS will handle this or not.
Interesting idea. I'd also like to know if IOMMU faults can be acted upon. It might require protocol support between the bus and hardware device (GPU) [1], to tell it the page is not currently present. And a way to tell GPU once the page is available.
As far as I can see this kind of mechanism would require one CPU interrupt per GPU fault. That might be too inefficient.
[1]: Edit: It's indeed possible to handle, if the device supports "PCI-SIG PCIe Address Translation Services (ATS) Page Request Interface (PRI) extension".
> Can GPU page table entry point to non-present page(s)?
Yes, new GPUs allow you to do this. This feature is called sparse textures / buffers in OpenGL (GL_ARB_sparse_texture and GL_EXT_sparse_texture2) and Vulkan (optional feature in Vulkan 1.0 core) or tiled resources in D3D.
This allows you to leave textures (or buffers) partially non-resident (accesses to non-resident regions are safe but the results are undefined) and to detect when you access a non-resident region (EXT_sparse_texture2), so that you can write a fallback path in the shader (look up a lower mip level and/or somehow tell the CPU that the page will be required for the next frame).
The OpenGL extensions are a bit restrictive, but Vulkan/D3D12 allow much greater control (such as sharing pages between textures or repeating the same page inside a texture).
Hardware support for this is not ubiquitous at the moment, but should improve as time goes on.
This feature is somewhat orthogonal to pinned pages and WDDM residency magic (which is afaik more oriented to switching between processes), hopefully it will get more unified in the future.
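For a rough idea of the Vulkan side, binding one physical page into a sparse image looks something like this (a minimal sketch; image, pageMemory, queue and pageX/pageY are assumed to exist, the image must be created with the sparse binding/residency/aliased flags, and the 128x128 block size depends on the format and should really be queried via vkGetPhysicalDeviceSparseImageFormatProperties):

    // Bind one physical page (pageMemory at offset 0) into a sparse image.
    VkSparseImageMemoryBind pageBind = {0};
    pageBind.subresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    pageBind.subresource.mipLevel   = 0;
    pageBind.subresource.arrayLayer = 0;
    pageBind.offset = (VkOffset3D){ pageX * 128, pageY * 128, 0 };  // block-aligned offset
    pageBind.extent = (VkExtent3D){ 128, 128, 1 };                  // one sparse block
    pageBind.memory       = pageMemory;  // the same memory/offset can be bound into
    pageBind.memoryOffset = 0;           // several images => aliased/shared pages

    VkSparseImageMemoryBindInfo imageBindInfo = {0};
    imageBindInfo.image     = image;
    imageBindInfo.bindCount = 1;
    imageBindInfo.pBinds    = &pageBind;

    VkBindSparseInfo bindInfo = {0};
    bindInfo.sType          = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO;
    bindInfo.imageBindCount = 1;
    bindInfo.pImageBinds    = &imageBindInfo;

    vkQueueBindSparse(queue, 1, &bindInfo, VK_NULL_HANDLE);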
Of course, something like this would need driver support.
What I was getting at was the fact that there should be nothing physically preventing them from implementing DMA support, so I was wondering why they didn't already support it. I’m not familiar with GPUs, so I assumed that this is how all large transfers between system RAM and the GPU worked.
I can only speak for the Linux context, but the IOMMU isn’t an issue. You can just allocate memory with the DMA API (dma_alloc_coherent) which will automatically populate the IOMMU tables (if required), pin the pages, and return you a PCI bus address as well as a kernel virtual address which both correspond to the same chunk of physical memory. Or, you can map an existing buffer in page by page using the dma_map routines (I forget the names).
Now, you have a shared pool of memory which can be accessed by both devices at the same time. The coherency fabric (if one exists) will handle all synchronization automatically, though this can be a bottleneck sometimes. If the CPU isn’t cache coherent, then the pages get marked as no-cache in the kernel PTEs so that any read from the CPU side pulls straight from memory.
Passing “messages” can be accomplished by an external notification like an interrupt or something.
You can then even map this buffer into a user space program.
I’m sure there are some security concerns with this approach though.
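A trimmed sketch of that allocation path (pdev and the 1 MiB size are just placeholders, error handling minimal):

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>
    #include <linux/sizes.h>

    static int setup_shared_buffer(struct pci_dev *pdev, void **cpu_addr,
                                   dma_addr_t *bus_addr)
    {
        /* The DMA API populates the IOMMU tables (if needed), keeps the memory
         * resident, and returns both a bus address and a kernel virtual address
         * for the same chunk of physical memory. */
        *cpu_addr = dma_alloc_coherent(&pdev->dev, SZ_1M, bus_addr, GFP_KERNEL);
        if (!*cpu_addr)
            return -ENOMEM;

        /* Hand *bus_addr to the device (command ring, register write, ...);
         * the CPU writes through *cpu_addr and the device DMAs to/from *bus_addr. */
        return 0;
    }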
More complicated things like device to device transfers (GPU to and from disk) would have to be arbitrated by the CPU, but I see no reason that the CPU would actually have to do the copy itself. Why couldn’t the CPU just provide the GPU with the PCI bus address of the disk controller which should be written to?
If the GPU wanted to write to the disk, the CPU would initiate a transfer to disk, but before writing the actual data, you pass the destination PCI address to the GPU and let it write the data. Then, the CPU can resume doing whatever it has to do while this happens in the background.