
GPUDirect Storage: A Direct Path Between Storage and GPU Memory - rrss
https://devblogs.nvidia.com/gpudirect-storage/
======
choppaface
I wish NVidia would focus on integrating their software (like this DMA
support) into more widely adopted frameworks like Tensorflow, Pandas, or
Pytorch. Like TensorRT or nvidia-docker, they always have to release their own
library that either breaks in weird ways or has utterly awful support
(TensorRT).

Note that the 16-Volta contraption is basically only for use with NVidia’s own
software. The two Xeons it has have only 80 PCI-e lanes between them, while 16
Voltas would consume 256 lanes, which is why each Volta gets its own flash. (Be
careful to shuffle your data before putting it on disk!) For that money, you’re
better off building your own rack, balanced for your real application.

One way to avoid the PCI-e bottleneck for vision is to use NVidia’s own JPEG
decompressor. Then only the file bytes have to cross the link, and the image
tensor (which is probably 10x the memory) never leaves GPU RAM.
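
Quick back-of-the-envelope on that "probably 10x" claim (the image dimensions
and compression ratio below are illustrative assumptions, not measurements):

```python
# Decoded image tensor vs. JPEG file size, for a hypothetical 1080p RGB frame.
width, height, channels = 1920, 1080, 3

raw_bytes = width * height * channels      # uncompressed RGB8 bitmap on the GPU
jpeg_bytes = raw_bytes // 12               # assume ~12:1 JPEG compression

print(f"decoded tensor: {raw_bytes / 1e6:.1f} MB")   # ~6.2 MB
print(f"jpeg on disk:   {jpeg_bytes / 1e6:.1f} MB")  # ~0.5 MB
print(f"PCIe traffic saved: {raw_bytes / jpeg_bytes:.0f}x")
```

So decoding on the GPU means roughly an order of magnitude less data crossing
the link, give or take the compression ratio of your dataset.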

~~~
ericd
Whoa, awesome, thanks for the tip about nvJPEG. Do you know if there are any
good tools for data augmentation
(rotation/contrast/flip/zoom/blur/crop/letterbox) for those JPEGs once they're
on-gpu tensors?

edit: Looks like DALI might handle this. Now to see how it integrates with
pytorch...

~~~
IfOnlyYouKnew
All those operations are fairly standard vector math once you're working with
bitmaps.
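
For instance, flip and crop are just indexing, and contrast is a scale around
mid-gray. A toy numpy version (on-GPU you'd run the same array ops on device
tensors instead):

```python
import numpy as np

# A small HxWxC uint8 "bitmap" stand-in.
img = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

flipped = img[:, ::-1, :]       # horizontal flip: reverse the column axis
cropped = img[1:3, 2:5, :]      # crop: plain slicing

# contrast: scale distance from mid-gray (128), then clip back to [0, 255]
contrast = np.clip((img.astype(np.float32) - 128) * 1.5 + 128,
                   0, 255).astype(np.uint8)

print(flipped.shape, cropped.shape, contrast.dtype)
```

Rotation and blur are similarly index arithmetic plus a convolution, which is
why GPU frameworks can fuse them into the decode pipeline.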

~~~
ericd
Yep, I know, but I’d rather not reimplement them if I don’t have to.

------
magnat
So, would a vulnerability in WebGL/WebGPU mean a shader in a 3D web app can
read all my disks? Neat.

------
hagreet
If this catches on there will be exciting new security problems.

------
tutanchamun
Interesting, this seems more flexible than the solution AMD previously used in
their Radeon Pro SSG [0], in which they connected a 2 TB NVMe SSD RAID directly
to the GPU by placing it on the GPU PCB (the application sees it as 2 TB of
GPU memory, since HBCC combines the SSD storage with the 16 GB of HBM).

[0] https://www.amd.com/de/products/professional-graphics/radeon-pro-ssg

------
rrss
(This is entirely unrelated to the content of this submission, but I'm
curious:

I'm pretty sure I submitted this more than 24 hours before the time that HN
now shows, and it didn't get much attention. But then it somehow got
resurrected with the timestamp of the submission (and the few existing
comments) adjusted.

Does HN do this a lot?)

~~~
simonebrunozzi
Sometimes they do it for posts that they deem interesting. It happened to me a
few times, and I was notified via email.

------
gumby
So, DMA?

~~~
snops
Yes, but using the DMA engine in the NVMe drive to write directly into GPU
RAM, rather than going through system RAM first. This wasn't previously
possible, as DMA is almost always to/from main RAM.

From the article [emphasis added]:

> If a DMA engine in an NVMe drive or elsewhere near storage can be used to
> move data instead of the GPU’s DMA engine, then there’s no interference in
> the path between the CPU and GPU. _Our use of DMA engines on local NVMe
> drives vs. the GPU’s DMA engines_ increased I/O bandwidth to 13.3 GB/s,
> which yielded around a 10% performance improvement relative to the CPU to
> GPU memory transfer rate of 12.0 GB/s shown in Table 1 below.

They don't seem to have revealed many more technical details about how they
did it, just a lot of benchmarks. I presume this only works where there is a
shared PCI-E switch between the GPU and the NVMe drive, so that traffic can be
routed around the CPU?
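
The quoted numbers do check out, for what it's worth:

```python
# Sanity check on the article's figures: 13.3 GB/s via the NVMe drives' DMA
# engines vs. 12.0 GB/s via the GPU's DMA engine.
p2p, baseline = 13.3, 12.0
improvement = (p2p - baseline) / baseline * 100
print(f"{improvement:.1f}% improvement")  # ~10.8%, matching "around a 10%"
```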

------
scohesc
With very minimal knowledge of hardware, I wonder if we'll eventually have
storage cheap enough and fast enough that we could just remove RAM from a
computer altogether, with HDD and RAM in the same physical component. We spend
so much time optimizing communication between different technologies, removing
"middle-man" components while still requiring them as part of a whole, that
it's surprising (to me) that I haven't seen anything like that in the
news/media/etc.

I assume it's from an obvious technical/price limitation, but again IDK a lot
about hardware so...

~~~
proverbialbunny
Latency is more important than bandwidth when it comes to reading and writing
data, except for specialized workloads.

I don't know the figures for current-gen graphics cards, but when CUDA first
came out in '07, vram had roughly a 10x latency penalty versus system ram.

vram would work great as virtual memory, where chunks of ram can be paged in
and out when the system's main memory runs out, but using vram as system ram
today isn't ideal. Likewise, running an ssd as system ram wouldn't be ideal
either.
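
A toy model shows why latency dominates for small reads (the latency and
bandwidth figures below are rough orders of magnitude I'm assuming for
illustration, not benchmarks):

```python
# transfer time = latency + size / bandwidth
# Assumed figures: DRAM ~0.1 us latency, ~50 GB/s; NVMe SSD ~100 us, ~5 GB/s.
def transfer_us(size_bytes, latency_us, gb_per_s):
    return latency_us + size_bytes / (gb_per_s * 1e3)  # GB/s == 1e3 bytes/us

for size in (64, 4096, 1 << 20):  # cache line, page, 1 MiB
    dram = transfer_us(size, 0.1, 50)
    ssd = transfer_us(size, 100, 5)
    print(f"{size:>8} B: dram {dram:8.2f} us, ssd {ssd:8.2f} us, "
          f"ssd/dram {ssd / dram:6.0f}x")
```

For a 64-byte cache line the ssd is roughly a thousand times slower, almost
entirely because of latency; only for large streaming reads does the gap
shrink toward the raw bandwidth ratio.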

This is one of the reasons cache sizes have grown so much over the years. It's
not just more cores: the cpu is speculating further ahead, pulling in data
that will hypothetically be needed in a few ms. With vram you'd need the cpu
to prefetch hundreds of ms ahead of time, which would add a lot of complexity
to the cpu.

AMD announced a cpu with a 256MB L3 cache. Imagine, in 10 years a cpu with 1GB
of cache.

~~~
kbumsik
> Latency is more important than bandwidth when it comes to reading and
> writing data, except for specialized workloads.

It's also worth noting that there's a pretty big latency penalty for going
over PCIe between the CPU and the device.

