GPUDirect Storage: A Direct Path Between Storage and GPU Memory (nvidia.com)
69 points by rrss 67 days ago | 25 comments

I wish NVidia would focus on integrating their software (like this DMA support) into more widely adopted frameworks like Tensorflow, Pandas, or Pytorch. Like TensorRT or nvidia-docker, they always have to release their own library that either breaks in weird ways or has utterly awful support (TensorRT).

Note that the 16-Volta contraption is basically only for use with NVidia’s own software. The two Xeons it has only have 80 PCI-e lanes, and 16 Voltas will consume 256 lanes, which is why each Volta gets its own flash. (Careful to shuffle your data before putting it on disk!). For that money, you’re better off building your own rack with a balance for your real application.
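The lane arithmetic above can be checked with a short sketch. It assumes each GPU gets a full x16 link (the comment does not state the link width) and takes the 80-lane figure as given; the function names are made up for illustration.

```python
# Lane budget for a 16-GPU, dual-Xeon box.
# Assumption: each GPU uses a full x16 PCIe link.
LANES_PER_GPU = 16
NUM_GPUS = 16
CPU_LANES_TOTAL = 80  # figure quoted in the comment for both Xeons combined


def lanes_required(num_gpus: int, lanes_per_gpu: int = LANES_PER_GPU) -> int:
    """Total lanes needed if every GPU had a dedicated full-width link."""
    return num_gpus * lanes_per_gpu


def oversubscription(required: int, available: int) -> float:
    """How many times the demand exceeds the lanes the CPUs provide."""
    return required / available


required = lanes_required(NUM_GPUS)
ratio = oversubscription(required, CPU_LANES_TOTAL)
print(f"required={required} available={CPU_LANES_TOTAL} oversubscription={ratio:.1f}x")
```

Under these assumptions the GPUs alone ask for more than three times the lanes the CPUs expose, before counting NICs or NVMe drives.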

One way to avoid the PCI-e bottleneck for vision is to use NVidia’s own JPEG decompressor. Then only the file bytes have to cross the link, and the image tensor (which is probably 10x the memory) never leaves GPU RAM.
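A rough sketch of why decoding on the GPU saves link traffic. The image size is hypothetical and the 10x compression ratio is the estimate from the comment, not a measurement.

```python
def decoded_bytes(height: int, width: int, channels: int = 3,
                  bytes_per_value: int = 1) -> int:
    """Size of the decoded image tensor in bytes."""
    return height * width * channels * bytes_per_value


def link_traffic_ratio(decoded: int, compressed: int) -> float:
    """How much more data crosses PCIe if decoding happens on the CPU."""
    return decoded / compressed


# Hypothetical 1920x1080 RGB image, 8 bits per channel.
raw_uint8 = decoded_bytes(1080, 1920)
# Same image after conversion to float32, as many training pipelines do.
raw_float32 = decoded_bytes(1080, 1920, bytes_per_value=4)
# Assumed JPEG size, using the ~10x ratio from the comment.
jpeg_size = raw_uint8 // 10

print(f"uint8 tensor:   {raw_uint8} bytes")
print(f"float32 tensor: {raw_float32} bytes")
print(f"JPEG file:      {jpeg_size} bytes (assumed)")
print(f"uint8 vs JPEG:  {link_traffic_ratio(raw_uint8, jpeg_size):.0f}x")
```

If the pipeline also converts to float32 on the CPU before the copy, the gap is four times larger again.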

The lane issue is why there is this push for GPUDirect: the bulk memory transfer goes through the NIC that is closest to the GPU and straight to the GPU DRAM, without needing to cross the CPU's limited lanes. In my experience, this works rather nicely.

(NVIDIAn here but speaking for myself) - We still have a ways to go, but we're working on it! Really trying to integrate with open data/ML standards, though sometimes an existing codebase doesn't exactly match the GPU approach, so the best thing we can do is to match their APIs...

TensorRT is now integrated with Tensorflow: https://github.com/tensorflow/tensorrt and it speaks ONNX, which is PyTorch's and MXNet's inference format.

cuDF (open source) supports the Pandas API on GPUs, using the Arrow standard for interop https://github.com/rapidsai/cudf and cuML (also open source) matches the scikit-learn interface: https://github.com/rapidsai/cuml

PyTorch and a bunch of other DL libraries can interop with cuDF/cuML (as in, do your ETL in those libs and get zero-copy, on-GPU transfer to the DL framework) via the dlpack standard. This is a general standard from DMLC that any similar ML library can adopt.

Also, if you're doing nvJPEG decompression on GPU, check out DALI, which implements super-fast data loading and augmentation steps all on GPU. For this one, there wasn't an existing standard, so we had to build our own approach, but it integrates with PyTorch, TF, and MXNet pipelines nicely: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guid... Hopefully it becomes like the cuDNN of data prep - integrated into all the frameworks well enough that you don't even have to think about it.

Whoa, awesome, thanks for the tip about nvJPEG. Do you know if there are any good tools for data augmentation (rotation/contrast/flip/zoom/blur/crop/letterbox) for those JPEGs once they're on-gpu tensors?

edit: Looks like DALI might handle this. Now to see how it integrates with pytorch...

All those operations are fairly standard vector math once you're working with bitmaps.
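To illustrate the point, here is a minimal CPU-side sketch of three of those operations on a grayscale bitmap stored as nested lists. It is plain Python for readability, not a GPU implementation; on a GPU the same index and arithmetic operations would be applied to tensors. The function names are made up for this example.

```python
from typing import List

Image = List[List[int]]  # grayscale bitmap: rows of pixel values in 0..255


def flip_horizontal(img: Image) -> Image:
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Keep the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def adjust_contrast(img: Image, factor: float) -> Image:
    """Scale pixel values around mid-gray (128) and clamp to 0..255."""
    return [
        [max(0, min(255, round(128 + factor * (p - 128)))) for p in row]
        for row in img
    ]


img = [[0, 64, 128],
       [192, 255, 32]]
print(flip_horizontal(img))
print(crop(img, 0, 1, 1, 2))
print(adjust_contrast(img, 2.0))
```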

Yep, I know, but I’d rather not reimplement them if I don’t have to.

The PCI-E bottleneck is why Nvidia bought Mellanox Technologies this year. It's likely that they will replace PCI-E with an InfiniBand derivative in the future.

How would Infiniband fix this?

So, would a vulnerability in WebGL/WebGPU mean a shader in a 3D web app can read all my disks? Neat.

If this catches on there will be exciting new security problems.

Interesting, this seems more flexible than the solution AMD previously used in their Radeon Pro SSG [0], in which they connected a 2TB NVMe SSD RAID directly to the GPU by placing it on the GPU PCB (which the application sees as 2 TB GPU memory since HBCC combines the SSD storage with the 16 GB HBM).

[0] https://www.amd.com/de/products/professional-graphics/radeon...

(This is entirely unrelated to the content of this submission, but I'm curious:

I'm pretty sure I submitted this more than 24 hours before the time that HN now shows, and it didn't get much attention. But then it somehow got resurrected with the timestamp of the submission (and the few existing comments) adjusted.

Does HN do this a lot?)

Sometimes they do it for posts that they deem interesting. It happened to me a few times, and I was notified via email.

So, DMA?

Yes, but using the DMA engine in the NVMe drive to write directly into GPU RAM, rather than go through system RAM first. This was not previously possible, as DMA is almost always to/from main RAM.

From the article [emphasis added]:

> If a DMA engine in an NVMe drive or elsewhere near storage can be used to move data instead of the GPU’s DMA engine, then there’s no interference in the path between the CPU and GPU. Our use of DMA engines on local NVMe drives vs. the GPU’s DMA engines increased I/O bandwidth to 13.3 GB/s, which yielded around a 10% performance improvement relative to the CPU to GPU memory transfer rate of 12.0 GB/s shown in Table 1 below.

They don't seem to have revealed many more technical details about how they did it, just a lot of benchmarks. I presume this only works where there is a shared PCI-E switch in between the GPU and NVMe drive, and so traffic can be routed past the CPU?
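The "around 10%" figure in the quote follows from the two bandwidths it gives. A quick check, with a hypothetical 1 TB dataset added only to show what the difference means in wall-clock time:

```python
def relative_gain(new: float, base: float) -> float:
    """Fractional improvement of `new` over `base`."""
    return (new - base) / base


def read_time_s(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Seconds to stream `size_gb` at a sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s


NVME_DMA_GBPS = 13.3    # from the quoted article
CPU_TO_GPU_GBPS = 12.0  # from the quoted article
DATASET_GB = 1000.0     # hypothetical dataset size, not from the article

gain = relative_gain(NVME_DMA_GBPS, CPU_TO_GPU_GBPS)
print(f"gain: {gain:.1%}")
print(f"via CPU memory: {read_time_s(DATASET_GB, CPU_TO_GPU_GBPS):.1f} s")
print(f"via NVMe DMA:   {read_time_s(DATASET_GB, NVME_DMA_GBPS):.1f} s")
```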

With very minimal knowledge of hardware, I wonder if we'll eventually have storage cheap enough and fast enough so we could just remove RAM from a computer altogether and we'd have HDD and RAM in the same physical component. We spend so much time optimizing communication between different technologies, removing "middle-men" components but still requiring them as part of a whole that it's surprising (to me) that I haven't seen anything like that in the news/media/etc.

I assume it's from an obvious technical/price limitation, but again IDK a lot about hardware so...

Latency is more important than bandwidth when it comes to reading and writing data, except for specialized work loads.

I don't know the delay for current gen graphics cards, but when CUDA first came out in '07 vram had roughly a 10x delay compared to system ram.

vram would work great as virtual memory, where chunks of ram can be loaded and unloaded when the system's main memory runs out, but using vram as system ram today isn't ideal. Likewise running an ssd as system ram wouldn't be ideal either.

This is one of the reasons why cache sizes have increased so much throughout the years. It's not just more cores. The cpu is branching out further pulling in data that will be hypothetically necessary in a few ms. Using vram you'd need the cpu to prefetch hundreds of ms ahead of time. This would add a lot of complexity to the cpu.

AMD announced a cpu with a 256MB L3 cache. Imagine, in 10 years a cpu with 1GB of cache.
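The latency-versus-bandwidth argument can be sketched with a simple cost model: transfer time = fixed latency + size / bandwidth. The device numbers below are illustrative assumptions chosen only to show the shape of the trade-off, not measurements of any real hardware.

```python
from typing import Tuple

Device = Tuple[float, float]  # (latency in seconds, bandwidth in bytes/second)

# Hypothetical device parameters, for illustration only.
DRAM: Device = (100e-9, 20e9)  # 100 ns, 20 GB/s
SSD: Device = (100e-6, 3e9)    # 100 us, 3 GB/s


def transfer_time_s(size_bytes: int, latency_s: float,
                    bandwidth_bytes_per_s: float) -> float:
    """Fixed access latency plus streaming time for the payload."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def slowdown(size_bytes: int, slow: Device, fast: Device) -> float:
    """How many times slower `slow` is than `fast` for one access."""
    return transfer_time_s(size_bytes, *slow) / transfer_time_s(size_bytes, *fast)


small_ratio = slowdown(64, SSD, DRAM)       # one cache line
large_ratio = slowdown(1 << 30, SSD, DRAM)  # 1 GiB sequential read

print(f"64 B access:  SSD is {small_ratio:.0f}x slower")
print(f"1 GiB stream: SSD is {large_ratio:.1f}x slower")
```

With these assumed numbers, small random accesses are dominated by latency and the gap is hundreds of times, while a large sequential read is dominated by bandwidth and the gap shrinks to single digits. That is the sense in which bulk streaming workloads are the exception.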

> Latency is more important than bandwidth when it comes to reading and writing data, except for specialized work loads.

It is also worth noting that there is a pretty big delay between CPU and PCIe.

That's the dream of Intel's "Optane DC Persistent Memory". As with most new technology though, it's still very expensive and currently targeted exclusively at enterprise solutions.

Edit: 'exclusively' is a bit heavy handed, but for the sizes that a typical consumer would think of (250gb-2tb), it still isn't available outside of a simple cache drive. This link has a good breakdown: https://www.pcworld.com/article/3388135/everything-you-need-...

>... but for the sizes that a typical consumer would think of (250gb-2tb), it still isn't available outside of a simple cache drive.

The 900P and 905P series are Optane devices targeted towards prosumers presumably; they offer 280GB-1.5TB of storage. That said, they're still very expensive.

I really wish Optane would get cheaper, or at least refresh faster. It's so utterly superior to NAND devices in terms of random access latency and endurance. All of my systems are Optane and it's amazing, but I feel the capacity squeeze since I can't afford the higher capacity drives.

Intel's decision to offer small capacity cache devices was a mistake in my opinion. It was too confusing for the average consumer, and too complicated on a vendor support level.

Their prosumer 900P series was initially PCI-E x4 AIC form factor, which was also a mistake in my opinion since they later followed up with proper U.2 and M.2 form factors.

Likewise their enterprise offerings aren't really great in terms of availability.

It's too many SKUs that represent poor vision and execution. In hindsight they should've just directly competed with Samsung NAND in M.2 form factor somehow. Sell at a loss for all I care; the tech is so superior to NAND that it deserves a much larger foothold than it currently has.

I haven't had the benefit of experiencing optane first hand, but the specs look insane to me. And I completely agree, Intel bumbled the whole strategy for it. And on top of the strategy, the marketing is abysmal, not just for optane but even their processor lines. This video laid out their new 10nm processor naming scheme (https://youtu.be/hzXKDQR_1d4) and at this point it seems malicious. Like they're trying to make it so consumers accidentally get the highest margin chip on the stack by getting a low power 'i7'. Or maybe hope consumers just stop trying to compare between processor skus, or head to head benchmarks against AMD, and instead focus on the brand that is 'intel inside'.

> I wonder if we'll eventually have storage cheap enough and fast enough

At a fundamental level, RAM is faster than flash/hdd for two basic reasons: it's allowed to be volatile, and your working set is usually much smaller than your total storage requirements, so RAM is allowed to be more expensive on a byte-for-byte basis than persistent storage.

Unless something weird happens with relative research paces, you can basically expect RAM to be faster than persistent storage for those two reasons.

The question then becomes whether there is a point where even persistent storage is fast enough that your CPU/GPU/*PU can't make use of any further bandwidth/latency improvements. Looking at the relative performance of L1, L2, L3, and RAM sets some milestones you need to hit before this even becomes a long-term goal.

Ideally we'd get rid of the discrete CPU altogether for systems like this. Mellanox attempted to do this with their BlueField box, which essentially put ARM cores on the NICs and let them control the transfers to storage. There's really no reason to have a dual Xeon system when you're just RDMA'ing to storage. Now, if only we could have that for GPUs, but CUDA no longer supports arm outside of tegra.

> CUDA no longer supports arm outside of tegra.

ARM support is coming (back), apparently: https://www.nextplatform.com/2019/06/17/nvidia-makes-arm-a-p...

This can only happen if someone develops a storage technology that is simultaneously the lowest cost, lowest latency, highest bandwidth, highest MTBF, lowest power, smallest footprint, and includes every extra feature (ECC, encryption, etc). Until that unicorn exists, engineers need to select a storage technology that meets requirements for the use case at hand. If these requirements result in a mix of storage technologies, then at some point it's likely to require moving data from one storage device to another.

HPE believes (believed?) in your vision and made a lot of noise about "The Machine", a memristor-based system which incorporated a lot of ideas around the sorts of hardware and software architecture changes that would come from a radical change in the underlying storage technology.
