Given that PCIe allows data to be piped directly from one device to another without going through the host CPU[1][2], I guess it might make sense to just have the GPU read blocks straight from the NVMe drive (or even NVMe-oF[3]) rather than having the CPU do a lot of work.
edit: blind as a bat, says so right in the paper of course:
> PMem is mapped directly to the GPU, and NVMe memory is accessed via Peer to Peer-DMA (P2PDMA)
I'm not sure they're actually doing NVMe yet; using Optane PMem is a bit of a cheat so that accessing storage is just plain memory reads and writes over PCIe. Implementing an NVMe device driver to set up and interact with command queues would be an extra layer of complexity that I think they left for future work.
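To illustrate what "just plain memory reads and writes" means in practice, here's a rough host-side sketch (not from the paper; the path is a placeholder, and whether the driver accepts registering a file-backed DAX mapping is platform dependent): mmap an fsdax-mounted PMem file and register it with CUDA so kernels can dereference it directly over PCIe.

  // Hedged sketch: map a DAX-backed PMem file and expose it to the GPU.
  // Not the paper's code; /mnt/pmem0/fsimage is a made-up path.
  #include <cstdio>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <cuda_runtime.h>

  int main() {
      const size_t len = 1ull << 30;                    // 1 GiB region
      int fd = open("/mnt/pmem0/fsimage", O_RDWR);      // fsdax mount (placeholder)
      if (fd < 0) { perror("open"); return 1; }

      void *pmem = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (pmem == MAP_FAILED) { perror("mmap"); return 1; }

      // Make the mapping visible to the GPU; whether a file-backed DAX
      // mapping can be pinned this way depends on the driver/platform.
      cudaError_t err = cudaHostRegister(pmem, len, cudaHostRegisterMapped);
      if (err != cudaSuccess) {
          fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));
          return 1;
      }
      void *dev_ptr = nullptr;
      cudaHostGetDevicePointer(&dev_ptr, pmem, 0);
      // A kernel can now load/store through dev_ptr; every access goes
      // straight over PCIe to the PMem, with no block layer involved.
      printf("PMem mapped for GPU access at %p\n", dev_ptr);

      cudaHostUnregister(pmem);
      munmap(pmem, len);
      close(fd);
      return 0;
  }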
But this work used Optane DC Persistent Memory DIMMs, which only work with certain Intel server CPUs. I'm not sure what people typically paid for those, but it probably wasn't actually more expensive than DRAM.
For GPUs where Nvidia has turned off P2P, can RAM or NVMe drives be used to emulate P2P? Let's assume you have a RAID AIC with 4 or 8 high-speed SSDs. Could you make three 3090s work as well as three RTX A5000s for training a model?
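(To picture the RAM half of that question: without peer access, a transfer between two GPUs has to be staged through a host bounce buffer, i.e. two DMA hops instead of one. A rough CUDA sketch, with made-up sizes and device IDs, and saying nothing about whether the result keeps up with cards that do have P2P:)

  // Hedged sketch: "emulated P2P" by staging GPU0 -> pinned host RAM -> GPU1.
  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      const size_t len = 256ull << 20;   // 256 MiB chunk (placeholder size)
      void *src = nullptr, *dst = nullptr, *bounce = nullptr;

      cudaSetDevice(0); cudaMalloc(&src, len);
      cudaSetDevice(1); cudaMalloc(&dst, len);
      cudaMallocHost(&bounce, len);      // pinned host memory for DMA

      // Two hops over PCIe instead of one peer-to-peer hop.
      cudaSetDevice(0);
      cudaMemcpy(bounce, src, len, cudaMemcpyDeviceToHost);
      cudaSetDevice(1);
      cudaMemcpy(dst, bounce, len, cudaMemcpyHostToDevice);

      printf("staged %zu bytes through host RAM\n", len);

      cudaFreeHost(bounce);
      cudaFree(src);
      cudaFree(dst);
      return 0;
  }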
A friend of mine used to work for a GPU database startup as an integration engineer. He got frustrated because GPU drivers (not just AMD's but also Nvidia's) are intrinsically unstable and not designed for long flawless runs. If a few bits have the wrong value in a deep neural network, or a pixel is wrong in a game, it does not matter much. In databases (or file systems, for that matter) it means everything!
It is hard to believe at first, but his former company now offers solutions without GPU acceleration that simply work; of course, they also lost their USP.
Yeah, I've had a lot of Nvidia GPUs suddenly disappear mid-training, to the point that even nvidia-smi couldn't find them; this happened on different (Linux) systems, and only a reboot fixed it.
You don't want this kind of thing happening when it is running a filesystem.
Strange. I've never had any problem with Nvidia GPUs, but I've only ever used data center GPUs like the V100 (and don't set them up myself). There are a lot of things that can go wrong, but at least my Nvidia GPUs have always worked.
nvidia-smi exposes all cards, so you could run the same workload on multiple cards. This (likely) won't solve the problem of certain failure modes being intrinsic to the workload or the compute environment; I would speculate that some of those aggressive failure modes would present themselves across all of the hardware.
Maybe someone could run workloads across CUDA and ZLUDA (Nvidia and other hardware), but really we might just need more driver reliability to efficiently and reliably run a file system across disparate GPU hardware.
If the game or your training run crashes, though, it matters a lot. What sort of bugs give you wrong values without crashing, especially driver bugs? Something is strange here.
According to this paper, GPU4FS is a file system that runs on the GPU and can be accessed by applications there. Since GPUs cannot make system calls, GPU4FS uses shared video memory (VRAM) and a parallel queue implementation. Applications running on the GPU can use GPU4FS after modifying their code, eliminating the need for a CPU-side file system when accessing storage. The experiments are done on Optane memory.
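For a feel of what "shared VRAM plus a parallel queue" could look like, here's a toy CUDA sketch (the struct layout, opcodes, and kernel names are invented for illustration and are not taken from the paper): application threads enqueue requests into a ring buffer in device memory, and a second kernel standing in for the GPU4FS service drains it. A real implementation would need a persistent service kernel and proper memory fences rather than two back-to-back launches.

  #include <cstdio>
  #include <cuda_runtime.h>

  struct FsRequest {            // hypothetical request record
      int  opcode;              // e.g. 0 = read, 1 = write
      long offset;
      long length;
  };

  struct RequestQueue {
      FsRequest slots[1024];
      unsigned  tail;           // next free slot (producer side)
  };

  __global__ void app_kernel(RequestQueue *q) {
      // Each "application" thread publishes one request.
      unsigned slot = atomicAdd(&q->tail, 1u) % 1024;
      q->slots[slot] = FsRequest{0, (long)threadIdx.x * 4096, 4096};
  }

  __global__ void fs_service_kernel(RequestQueue *q, unsigned count) {
      // Stand-in for the GPU-side file system draining the queue.
      unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < count) {
          FsRequest r = q->slots[i];
          // ...here GPU4FS would walk its on-PMem structures and do the I/O...
          printf("req %u: op=%d off=%ld len=%ld\n", i, r.opcode, r.offset, r.length);
      }
  }

  int main() {
      RequestQueue *q;
      cudaMalloc((void **)&q, sizeof(RequestQueue));
      cudaMemset(q, 0, sizeof(RequestQueue));
      app_kernel<<<1, 32>>>(q);             // "application" enqueues 32 requests
      fs_service_kernel<<<1, 32>>>(q, 32);  // "file system" services them
      cudaDeviceSynchronize();
      cudaFree(q);
      return 0;
  }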
It would be interesting to know if this approach could optimize the performance of training and inference for large models.
GPUs seem to have a lot of memory these days. From my limited knowledge, games and other graphics-intensive applications will use too much of it for this approach to be particularly useful, but do other applications have a similar level of utilization?
I tried and tested it on my 5700 XT; in CrystalDiskMark I got (5 repeated runs on 1 GiB):
              Read (MB/s)   Write (MB/s)
  SEQ1M Q8T1     2339           2620
  SEQ1M Q1T1     2205           2190
  RND Q32       41.31          38.77
  RND Q1T1      34.70          32.80
To be honest I didn't know what to expect, aside from very high read and write speeds. I was a bit disappointed to see that random reads and writes were so slow. The only use I could think of would be keeping photo sets or similar data over there and then saving the session to an SSD when closing the program, but that is just as easily handled by a newer NVMe SSD.
I didn't fully read the paper, but a few questions come to mind.
1) How does this work differ from Mark Silberstein's GPUfs from 2014 [1]?
2) Does this work assume the storage device is only accessed by the GPU? Otherwise, how do you guarantee consistency when multiple processes can map, read and write the same files? You mention POSIX. POSIX has MAP_SHARED. How is this situation handled?
3) Related to (2), on the device level, how do you sync CPU (on an SMP, multiple cores) and GPU accesses?
> 1) How does this work differ from Mark Silberstein's GPUfs from 2014 [1]?
Just quoting the paper:
> Using GPUfs, Silberstein et al. [24] demonstrate that offering a library interface to CPU FS eases access to storage for GPU programmers, but GPUfs only calls a CPU-side file system. GPU4FS offers a similar interface to GPUfs, but runs the file system on the GPU.
Nope. This is an implementation of one of several things that people often imagine Microsoft's DirectStorage to be, but the real DirectStorage is a lot more mundane.
DirectStorage is mostly an API for CPU code to asynchronously issue high-level storage requests such as asking for a file to be read from storage and the contents placed in a particular GPU buffer. Behind the scenes, the file contents could in theory be transferred from an SSD to the GPU using P2P DMA, because the OS now has enough of a big-picture view of what's going on to set up that kind of transfer when it's possible. But everything about parsing the filesystem data structures to locate the requested file data and issue commands to the SSD is still done on the CPU by the OS, and the application originating those high-level requests is a process running on the CPU and making system calls.
Making the requests asynchronous and issuing lots of requests in parallel is what makes it possible to get good performance out of flash-based storage; P2P DMA would be a relatively minor optimization on top of that. DirectStorage isn't the only way to asynchronously issue batches of storage requests; Windows has long had IOCP and more recently cloned io_uring from Linux.
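For anyone who hasn't used the async interfaces, this is roughly what the "queue a batch, then reap completions" pattern looks like with io_uring (host-side sketch assuming liburing; the file name and sizes are placeholders). DirectStorage requests targeting GPU buffers follow the same shape, just through a different API.

  #include <fcntl.h>
  #include <liburing.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NREQ 8
  #define BLK  (1 << 20)   /* 1 MiB per request */

  int main(void) {
      struct io_uring ring;
      io_uring_queue_init(NREQ, &ring, 0);

      int fd = open("big_asset.bin", O_RDONLY);   /* placeholder file */
      if (fd < 0) { perror("open"); return 1; }

      static char bufs[NREQ][BLK];

      /* Queue all reads up front, then submit them with one syscall. */
      for (int i = 0; i < NREQ; i++) {
          struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          io_uring_prep_read(sqe, fd, bufs[i], BLK, (long long)i * BLK);
          io_uring_sqe_set_data(sqe, (void *)(long)i);
      }
      io_uring_submit(&ring);

      /* Reap completions as they arrive; they may finish out of order. */
      for (int i = 0; i < NREQ; i++) {
          struct io_uring_cqe *cqe;
          io_uring_wait_cqe(&ring, &cqe);
          printf("request %ld finished: %d bytes\n",
                 (long)io_uring_cqe_get_data(cqe), cqe->res);
          io_uring_cqe_seen(&ring, cqe);
      }

      io_uring_queue_exit(&ring);
      close(fd);
      return 0;
  }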
DirectStorage 1.1 introduced an optional feature for GPU decompression, so that data which is stored on disk in a (the) supported compressed format can be streamed to the GPU and decompressed there instead of needing a round-trip through the CPU and its RAM for decompression. This could help make the P2P DMA option more widely usable by reducing the cases which need to fall back to the CPU, but decompressing on the GPU is nothing that applications couldn't already implement for themselves; DirectStorage just provides a convenient standardized API for this so that GPU vendors can provide a well-optimized decompression implementation. When P2P DMA isn't available, you can still get some computation offloaded from the CPU to the GPU after the compressed data makes a trip through the CPU's RAM.
(Note: official docs about DirectStorage don't really say anything about P2P DMA, but it's clearly being designed to allow for it in the future.)
The GPU4FS described here is a project to implement the filesystem entirely on the GPU: the code to e.g. walk the directory hierarchy and locate which address actually holds the file contents runs not on the CPU but on the GPU. This approach means the application running on the GPU needs exclusive ownership of the device holding the filesystem. For now, they're using persistent memory as the backing store, but in the future they could implement NVMe and have storage requests originate from the GPU and be delivered directly to the SSD with no CPU or OS involvement.
I'm glad that research papers don't start with "we've analyzed the Linux kernel 2.6.18 sources (because this is what we had on our lab machines), determined that ext3 is the best filesystem for our research purposes, and now present you with the novel idea of using a high-tech device on it". The paper acknowledges modern features and takes design cues from other filesystems (BTRFS and tree structures are mentioned). Overall the idea is interesting and promising.
Interesting they would discuss system call overhead of opening a file, reading from it and closing it. Seems like in almost all cases the open and close calls would be overwhelmed by the other operations.
There are plenty of cases where you can't just change the file layout. And the GPU filesystem is being implemented by someone else, so the choice is: migrate your data to another filesystem OR fix the data-in-files layout, even though the files may come from a completely different source than your application, the layout may be a standard or other applications may depend on it, or you can't easily change it for some other reason.
If you can get the data into the GPU-native filesystem, you can change the data layout at least as easily. The point is there is some sort of data ingestion pipeline involved.
In systems performance I would advise never thinking of any workload as unidimensional (i.e. assuming any file system optimization can either improve IO latency or be useless).
Issuing individual truncates of 1B files can be just as much of a CPU problem as an IO one, for example.
Now this is all fun, but has anyone managed to make these mechanisms work with Multicast PCIe? I really need GPUDirect and GPUDirect Storage to support this, until PCIe catches up to today's (or Blackwell's) NVLink ... around PCIe 12?
[1]: https://nvmexpress.org/wp-content/uploads/Enabling-the-NVMe-...
[2]: https://lwn.net/Articles/767281/
[3]: https://www.nvmexpress.org/wp-content/uploads/NVMe_Over_Fabr...