Given that PCIe allows data to be piped directly from one device to another without going through the host CPU[1][2], I guess it might make sense to just have the GPU read blocks straight from the NVMe drive (or even NVMe-oF[3]) rather than having the CPU do a lot of work.
edit: blind as a bat, says so right in the paper of course:
> PMem is mapped directly to the GPU, and NVMe memory is accessed via Peer to Peer-DMA (P2PDMA)
I'm not sure they're actually doing NVMe yet; using Optane PMem is a bit of a cheat so that accessing storage is just plain memory reads and writes over PCIe. Implementing an NVMe device driver to set up and interact with command queues would be an extra layer of complexity that I think they left for future work.
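To illustrate what "just plain memory reads and writes" means in practice, here's a rough host-side sketch (not from the paper; the path is a placeholder, and whether the driver accepts registering a file-backed DAX mapping is platform dependent): mmap an fsdax-mounted PMem file and register it with CUDA so kernels can dereference it directly over PCIe.

  // Hedged sketch: map a DAX-backed PMem file and expose it to the GPU.
  // Not the paper's code; /mnt/pmem0/fsimage is a made-up path.
  #include <cstdio>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <cuda_runtime.h>

  int main() {
      const size_t len = 1ull << 30;                    // 1 GiB region
      int fd = open("/mnt/pmem0/fsimage", O_RDWR);      // fsdax mount (placeholder)
      if (fd < 0) { perror("open"); return 1; }

      void *pmem = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (pmem == MAP_FAILED) { perror("mmap"); return 1; }

      // Make the mapping visible to the GPU; whether a file-backed DAX
      // mapping can be pinned this way depends on the driver/platform.
      cudaError_t err = cudaHostRegister(pmem, len, cudaHostRegisterMapped);
      if (err != cudaSuccess) {
          fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));
          return 1;
      }
      void *dev_ptr = nullptr;
      cudaHostGetDevicePointer(&dev_ptr, pmem, 0);
      // A kernel can now load/store through dev_ptr; every access goes
      // straight over PCIe to the PMem, with no block layer involved.
      printf("PMem mapped for GPU access at %p\n", dev_ptr);

      cudaHostUnregister(pmem);
      munmap(pmem, len);
      close(fd);
      return 0;
  }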
But this work used Optane DC Persistent Memory DIMMs, which only work with certain Intel server CPUs. I'm not sure what people typically paid for those, but it probably wasn't actually more expensive than DRAM.
For GPUs where Nvidia has turned off P2P, can RAM or NVMe drives be used to emulate P2P? Let's assume you have a RAID AIC with 4 or 8 high-speed SSDs. Could you make three 3090s work as well as three RTX A5000s for training a model?
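(To picture the RAM half of that question: without peer access, a transfer between two GPUs has to be staged through a host bounce buffer, i.e. two DMA hops instead of one. A rough CUDA sketch, with made-up sizes and device IDs, and saying nothing about whether the result keeps up with cards that do have P2P:)

  // Hedged sketch: "emulated P2P" by staging GPU0 -> pinned host RAM -> GPU1.
  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      const size_t len = 256ull << 20;   // 256 MiB chunk (placeholder size)
      void *src = nullptr, *dst = nullptr, *bounce = nullptr;

      cudaSetDevice(0); cudaMalloc(&src, len);
      cudaSetDevice(1); cudaMalloc(&dst, len);
      cudaMallocHost(&bounce, len);      // pinned host memory for DMA

      // Two hops over PCIe instead of one peer-to-peer hop.
      cudaSetDevice(0);
      cudaMemcpy(bounce, src, len, cudaMemcpyDeviceToHost);
      cudaSetDevice(1);
      cudaMemcpy(dst, bounce, len, cudaMemcpyHostToDevice);

      printf("staged %zu bytes through host RAM\n", len);

      cudaFreeHost(bounce);
      cudaFree(src);
      cudaFree(dst);
      return 0;
  }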
A friend of mine used to work for a GPU database startup as an integration engineer. He got frustrated because GPU drivers (not just AMD's but also Nvidia's) are intrinsically unstable and not designed for long flawless runs. If a few bits have the wrong value in a deep neural network, or a pixel is wrong in a game, it does not matter much. In databases (or file systems, for that matter) it means everything!
It is hard to believe at first, but his former company now offers solutions without GPU acceleration that simply work; of course, they also lost their USP.
Yeah, I've had a lot of Nvidia GPUs suddenly disappear mid-training, to the point that even nvidia-smi couldn't find them; this happened on different (Linux) systems, and only a reboot fixed it.
You don't want this kind of thing happening when it is running a filesystem.
Strange. I've never had any problem with Nvidia GPUs, but I've only ever used data center GPUs like the V100 (and don't set them up myself). There are a lot of things that can go wrong, but at least my Nvidia GPUs have always worked.
nvidia-smi exposes all cards, so you could run the same workload on multiple cards. This (likely) won't solve the problem of certain failure modes being intrinsic to the workload or the compute environment; I would speculate that some of those aggressive failure modes would present themselves across all of the hardware.
Maybe someone could run workloads across CUDA and ZLUDA (Nvidia and other hardware), but really we might just need more driver reliability to efficiently and reliably run a file system across disparate GPU hardware.
If the game or your training run crashes, though, it matters a lot. What sort of bugs give you wrong values without crashing, especially driver bugs? Something is strange here.
According to this paper, GPU4FS is a file system that runs on the GPU and can be accessed by applications there. Since GPUs cannot make system calls, GPU4FS uses shared video memory (VRAM) and a parallel queue implementation. Applications running on the GPU can use GPU4FS after modifying their code, eliminating the need for a CPU-side file system when accessing storage. The experiments are done on Optane memory.
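For a feel of what "shared VRAM plus a parallel queue" could look like, here's a toy CUDA sketch (the struct layout, opcodes, and kernel names are invented for illustration and are not taken from the paper): application threads enqueue requests into a ring buffer in device memory, and a second kernel standing in for the GPU4FS service drains it. A real implementation would need a persistent service kernel and proper memory fences rather than two back-to-back launches.

  #include <cstdio>
  #include <cuda_runtime.h>

  struct FsRequest {            // hypothetical request record
      int  opcode;              // e.g. 0 = read, 1 = write
      long offset;
      long length;
  };

  struct RequestQueue {
      FsRequest slots[1024];
      unsigned  tail;           // next free slot (producer side)
  };

  __global__ void app_kernel(RequestQueue *q) {
      // Each "application" thread publishes one request.
      unsigned slot = atomicAdd(&q->tail, 1u) % 1024;
      q->slots[slot] = FsRequest{0, (long)threadIdx.x * 4096, 4096};
  }

  __global__ void fs_service_kernel(RequestQueue *q, unsigned count) {
      // Stand-in for the GPU-side file system draining the queue.
      unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < count) {
          FsRequest r = q->slots[i];
          // ...here GPU4FS would walk its on-PMem structures and do the I/O...
          printf("req %u: op=%d off=%ld len=%ld\n", i, r.opcode, r.offset, r.length);
      }
  }

  int main() {
      RequestQueue *q;
      cudaMalloc((void **)&q, sizeof(RequestQueue));
      cudaMemset(q, 0, sizeof(RequestQueue));
      app_kernel<<<1, 32>>>(q);             // "application" enqueues 32 requests
      fs_service_kernel<<<1, 32>>>(q, 32);  // "file system" services them
      cudaDeviceSynchronize();
      cudaFree(q);
      return 0;
  }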
It would be interesting to know if this approach could optimize the performance of training and inference for large models.
GPUs seem to have a lot of memory these days. From my limited knowledge, games and other graphics-intensive applications will use too much of it for this approach to be particularly useful, but do other applications have a similar level of utilization?
I tried and tested it on my 5700 XT; in CrystalDiskMark I got (5 repeated runs on 1 GiB):
              Read (MB/s)   Write (MB/s)
  SEQ1M Q8T1     2339           2620
  SEQ1M Q1T1     2205           2190
  RND Q32       41.31          38.77
  RND Q1T1      34.70          32.80
To be honest I didn't know what to expect, aside from very high read and write speeds. I was a bit disappointed to see that random reads and writes were so slow. The only use I could think of would be keeping photo sets or similar data over there and then saving the session to an SSD when closing the program, but that is just as easily handled by a newer NVMe SSD.
I didn't fully read the paper, but a few questions come to mind.
1) How does this work differ from Mark Silberstein's GPUfs from 2014 [1]?
2) Does this work assume the storage device is only accessed by the GPU? Otherwise, how do you guarantee consistency when multiple processes can map, read and write the same files? You mention POSIX. POSIX has MAP_SHARED. How is this situation handled?
3) Related to (2), on the device level, how do you sync CPU (on an SMP, multiple cores) and GPU accesses?
> 1) How does this work differ from Mark Silberstein's GPUfs from 2014 [1]?
Just quoting the paper:
> Using GPUfs, Silberstein et al. [24] demonstrate that offering a library interface to CPU FS eases access to storage for GPU programmers, but GPUfs only calls a CPU-side file system. GPU4FS offers a similar interface to GPUfs, but runs the file system on the GPU.
Nope. This is an implementation of one of several things that people often imagine Microsoft's DirectStorage to be, but the real DirectStorage is a lot more mundane.
DirectStorage is mostly an API for CPU code to asynchronously issue high-level storage requests such as asking for a file to be read from storage and the contents placed in a particular GPU buffer. Behind the scenes, the file contents could in theory be transferred from an SSD to the GPU using P2P DMA, because the OS now has enough of a big-picture view of what's going on to set up that kind of transfer when it's possible. But everything about parsing the filesystem data structures to locate the requested file data and issue commands to the SSD is still done on the CPU by the OS, and the application originating those high-level requests is a process running on the CPU and making system calls.
Making the requests asynchronous and issuing lots of requests in parallel is what makes it possible to get good performance out of flash-based storage; P2P DMA would be a relatively minor optimization on top of that. DirectStorage isn't the only way to asynchronously issue batches of storage requests; Windows has long had IOCP and more recently cloned io_uring from Linux.
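For anyone who hasn't used the async interfaces, this is roughly what the "queue a batch, then reap completions" pattern looks like with io_uring (host-side sketch assuming liburing; the file name and sizes are placeholders). DirectStorage requests targeting GPU buffers follow the same shape, just through a different API.

  #include <fcntl.h>
  #include <liburing.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NREQ 8
  #define BLK  (1 << 20)   /* 1 MiB per request */

  int main(void) {
      struct io_uring ring;
      io_uring_queue_init(NREQ, &ring, 0);

      int fd = open("big_asset.bin", O_RDONLY);   /* placeholder file */
      if (fd < 0) { perror("open"); return 1; }

      static char bufs[NREQ][BLK];

      /* Queue all reads up front, then submit them with one syscall. */
      for (int i = 0; i < NREQ; i++) {
          struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          io_uring_prep_read(sqe, fd, bufs[i], BLK, (long long)i * BLK);
          io_uring_sqe_set_data(sqe, (void *)(long)i);
      }
      io_uring_submit(&ring);

      /* Reap completions as they arrive; they may finish out of order. */
      for (int i = 0; i < NREQ; i++) {
          struct io_uring_cqe *cqe;
          io_uring_wait_cqe(&ring, &cqe);
          printf("request %ld finished: %d bytes\n",
                 (long)io_uring_cqe_get_data(cqe), cqe->res);
          io_uring_cqe_seen(&ring, cqe);
      }

      io_uring_queue_exit(&ring);
      close(fd);
      return 0;
  }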
DirectStorage 1.1 introduced an optional feature for GPU decompression, so that data which is stored on disk in a (the) supported compressed format can be streamed to the GPU and decompressed there instead of needing a round-trip through the CPU and its RAM for decompression. This could help make the P2P DMA option more widely usable by reducing the cases which need to fall back to the CPU, but decompressing on the GPU is nothing that applications couldn't already implement for themselves; DirectStorage just provides a convenient standardized API for this so that GPU vendors can provide a well-optimized decompression implementation. When P2P DMA isn't available, you can still get some computation offloaded from the CPU to the GPU after the compressed data makes a trip through the CPU's RAM.
(Note: official docs about DirectStorage don't really say anything about P2P DMA, but it's clearly being designed to allow for it in the future.)
The GPU4FS described here is a project to implement the filesystem entirely on the GPU: the code to e.g. walk the directory hierarchy and locate which address actually holds the file contents runs not on the CPU but on the GPU. This approach means the application running on the GPU needs exclusive ownership of the device holding the filesystem. For now, they're using persistent memory as the backing store, but in the future they could implement NVMe and have storage requests originate from the GPU and be delivered directly to the SSD with no CPU or OS involvement.
I'm glad that research papers don't start with "we've analyzed the Linux kernel 2.6.18 sources (because this is what we had on our lab machines), determined that ext3 is the best filesystem for our research purposes, and now present you with the novel idea of using a high-tech device on it". The paper acknowledges modern features and takes design cues from other filesystems (BTRFS and tree structures are mentioned). Overall the idea is interesting and promising.
Interesting they would discuss system call overhead of opening a file, reading from it and closing it. Seems like in almost all cases the open and close calls would be overwhelmed by the other operations.
There are plenty of cases where you can't just change the file layout. And the GPU filesystem is being implemented by someone else, so the choice is: migrate your data to another filesystem OR fix the data-in-files layout, even though the files may come from a completely different source than your application, the layout may be a standard or other applications may depend on it, or you can't easily change it for some other reason.
If you can get the data into the GPU-native filesystem, you can change the data layout at least as easily. The point is there is some sort of data ingestion pipeline involved.
In systems performance I would advise never thinking of any workload as unidimensional (i.e. assuming any file system optimization can either improve IO latency or be useless).
Issuing individual truncates of 1B files can be just as much of a CPU problem as an IO one, for example.
Now this is all fun, but has anyone managed to make these mechanisms work with Multicast PCIe? I really need GPUDirect and GPUDirect Storage to support this, until PCIe catches up to today's (or Blackwell's) NVLink ... around PCIe 12?
[1]: https://nvmexpress.org/wp-content/uploads/Enabling-the-NVMe-...
[2]: https://lwn.net/Articles/767281/
[3]: https://www.nvmexpress.org/wp-content/uploads/NVMe_Over_Fabr...