Note that the 16-Volta contraption is basically only for use with NVidia's own software. Its two Xeons have only 80 PCI-e lanes between them, while 16 Voltas at x16 apiece would want 256 lanes, which is why each Volta gets its own flash. (Careful to shuffle your data before putting it on disk!) For that money, you're better off building your own rack with a balance of parts suited to your actual application.
One way to avoid the PCI-e bottleneck for vision is to use NVidia's own JPEG decompressor (nvJPEG). Then only the compressed file bytes have to cross the link, and the decoded image tensor (which is probably 10x the size) never leaves GPU RAM.
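For a rough sense of what that looks like from user code, a minimal sketch (it assumes a recent torchvision build with CUDA support, and "sample.jpg" is just a placeholder):

    import torch
    from torchvision.io import read_file, decode_jpeg

    # Read only the compressed bytes on the CPU (a small uint8 tensor)...
    file_bytes = read_file("sample.jpg")  # placeholder path

    # ...and decode straight into GPU memory via nvJPEG, so the full
    # uncompressed image tensor never crosses the PCIe link.
    image = decode_jpeg(file_bytes, device="cuda")
    print(image.shape, image.device)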
TensorRT is now integrated with TensorFlow: https://github.com/tensorflow/tensorrt
and it speaks ONNX, the interchange format that PyTorch and MXNet can export models to for inference.
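As a sketch of the ONNX half of that (the model and input shape here are just stand-ins), exporting from PyTorch is essentially one call:

    import torch
    import torchvision

    # Any traceable model works; resnet18 is just a stand-in.
    model = torchvision.models.resnet18(pretrained=True).eval()
    dummy_input = torch.randn(1, 3, 224, 224)

    # The resulting .onnx file can then be handed to TensorRT
    # (e.g. through its ONNX parser or the trtexec tool).
    torch.onnx.export(model, dummy_input, "resnet18.onnx", opset_version=11)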
cuDF (open source) supports the Pandas API on GPUs, using the Arrow standard for interop https://github.com/rapidsai/cudf
and cuML (also open source) matches the scikit-learn interface: https://github.com/rapidsai/cuml
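To make the "same API" point concrete, a minimal sketch (the file and column names are made up):

    import cudf
    from cuml.cluster import KMeans

    # Reads straight into GPU memory; the API mirrors pandas.read_csv.
    df = cudf.read_csv("events.csv")      # placeholder file
    df = df[df["value"] > 0]              # placeholder column
    features = df[["x", "y"]]

    # Same constructor/fit/predict shape as sklearn.cluster.KMeans,
    # except everything runs on the GPU.
    km = KMeans(n_clusters=8).fit(features)
    labels = km.predict(features)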
PyTorch and a bunch of other DL libraries can interop with cuDF/cuML (as in, do your ETL in those libs and get zero-copy, on-GPU transfer to the DL framework) via the dlpack standard. This is a general standard from DMLC that any similar ML library can adopt.
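Roughly like this (a sketch; I'm going from memory on the exact method names, so treat them as approximate):

    import cudf
    import torch
    from torch.utils import dlpack

    gdf = cudf.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

    # Hand the GPU-resident columns to PyTorch as a DLPack capsule;
    # no trip through host memory, no copy of the device buffer.
    capsule = gdf.to_dlpack()
    tensor = dlpack.from_dlpack(capsule)
    print(tensor.device)  # cuda:0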
Also, if you're doing nvJPEG decompression on GPU, check out DALI, which implements super-fast data loading and augmentation steps all on GPU. For this one there wasn't an existing standard, so we had to build our own approach, but it integrates with PyTorch, TF, and MXNet pipelines nicely: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guid...
Hopefully it becomes like the cuDNN of data prep - integrated into all the frameworks well enough that you don't even have to think about it.
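For anyone curious what that looks like in practice, here's a minimal sketch of a DALI pipeline using the pipeline_def API (the data path, batch size, and output size are placeholders):

    from nvidia.dali import pipeline_def, fn
    from nvidia.dali.plugin.pytorch import DALIGenericIterator

    @pipeline_def(batch_size=32, num_threads=4, device_id=0)
    def jpeg_pipeline():
        # Read raw JPEG bytes from disk...
        files, labels = fn.readers.file(file_root="/data/images")  # placeholder
        # ...decode on the GPU with nvJPEG ("mixed" = host read, device decode)...
        images = fn.decoders.image(files, device="mixed")
        # ...and keep the augmentation on the GPU as well.
        images = fn.resize(images, resize_x=224, resize_y=224)
        return images, labels

    pipe = jpeg_pipeline()
    pipe.build()
    loader = DALIGenericIterator(pipe, ["images", "labels"])
    for batch in loader:
        pass  # batch[0]["images"] is already a CUDA tensor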
edit: Looks like DALI might handle this. Now to see how it integrates with pytorch...
I'm pretty sure I submitted this more than 24 hours before the time that HN now shows, and it didn't get much attention. But then it somehow got resurrected with the timestamp of the submission (and the few existing comments) adjusted.
Does HN do this a lot?
From the article [emphasis added]:
> If a DMA engine in an NVMe drive or elsewhere near storage can be used to move data instead of the GPU’s DMA engine, then there’s no interference in the path between the CPU and GPU. Our use of DMA engines on local NVMe drives vs. the GPU’s DMA engines increased I/O bandwidth to 13.3 GB/s, which yielded around a 10% performance improvement relative to the CPU to GPU memory transfer rate of 12.0 GB/s shown in Table 1 below.
They don't seem to have revealed many more technical details about how they did it, just a lot of benchmarks. I presume this only works where there is a shared PCI-E switch in between the GPU and NVMe drive, and so traffic can be routed past the CPU?
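They haven't published the mechanism, but for a sense of what "let the storage DMA straight into GPU memory" looks like from user code, NVIDIA's GPUDirect Storage path is roughly this via the kvikio Python bindings (a sketch from memory, not what the article used; it still needs a supported NVMe/PCIe-switch topology underneath, and the path and size are placeholders):

    import cupy
    import kvikio

    # Allocate the destination buffer in GPU memory up front.
    buf = cupy.empty(64 * 1024 * 1024, dtype=cupy.uint8)  # 64 MiB placeholder

    # Open through cuFile/GPUDirect Storage; the read is DMA'd from the
    # NVMe drive into the GPU buffer without a bounce through host memory.
    f = kvikio.CuFile("/mnt/nvme/shard-000.bin", "r")  # placeholder path
    nbytes = f.read(buf)
    f.close()
    print(nbytes)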
I assume it's from an obvious technical/price limitation, but again IDK a lot about hardware so...
I don't know the latency for current-gen graphics cards, but when CUDA first came out in '07, VRAM had roughly a 10x latency penalty relative to system RAM.
VRAM would work great as virtual memory, where chunks of RAM can be paged in and out when the system's main memory runs out, but using VRAM as system RAM today isn't ideal. Likewise, running an SSD as system RAM wouldn't be ideal either.
This is one of the reasons why cache sizes have increased so much over the years. It's not just more cores. The CPU is reaching out further, pulling in data that will hypothetically be needed a few ms from now. With VRAM you'd need the CPU to prefetch hundreds of ms ahead of time, which would add a lot of complexity to the CPU.
AMD announced a CPU with a 256MB L3 cache. Imagine a CPU with 1GB of cache in 10 years.
It is also worth noting that there is a pretty big latency hit between the CPU and devices on the PCIe bus.
Edit: 'exclusively' is a bit heavy-handed, but for the sizes a typical consumer would think of (250GB-2TB), it still isn't available outside of a simple cache drive. This link has a good breakdown:
The 900P and 905P series are Optane devices presumably targeted at prosumers; they offer 280GB-1.5TB of storage. That said, they're still very expensive.
I really wish Optane would get cheaper, or at least refresh faster. It's so utterly superior to NAND devices in terms of random access latency and endurance. All of my systems are Optane and it's amazing, but I feel the capacity squeeze since I can't afford the higher capacity drives.
Intel's decision to offer small capacity cache devices was a mistake in my opinion. It was too confusing for the average consumer, and too complicated on a vendor support level.
Their prosumer 900P series was initially PCI-E x4 AIC form factor, which was also a mistake in my opinion since they later followed up with proper U.2 and M.2 form factors.
Likewise their enterprise offerings aren't really great in terms of availability.
It's too many SKUs, and they reflect poor vision and execution. In hindsight they should've just directly competed with Samsung NAND in the M.2 form factor somehow. Sell at a loss for all I care; the tech is so superior to NAND that it deserves a much larger foothold than it currently has.
At a fundamental level, RAM is faster than flash/hdd for two basic reasons: it's allowed to be volatile, and your working set is usually much smaller than your total storage requirements, so RAM is allowed to be more expensive on a byte-for-byte basis than persistent storage.
Unless something weird happens with relative research paces, you can basically expect RAM to be faster than persistent storage for those two reasons.
The question then becomes whether there is a point where even persistent storage is fast enough that your CPU/GPU/*PU can't make use of any further bandwidth/latency improvements. Looking at the relative performance of L1, L2, L3, and RAM sets some milestones you need to hit before this even becomes a long-term goal.
ARM support is coming (back), apparently: https://www.nextplatform.com/2019/06/17/nvidia-makes-arm-a-p...
HPE believes (believed?) in your vision and made a lot of noise about "The Machine", a memristor-based system which incorporated a lot of ideas around the sorts of hardware and software architecture changes that would come from a radical change in the underlying storage technology.