
GpuScan and SSD-To-GPU Direct DMA - matsuu
http://kaigai.hatenablog.com/entry/2016/09/08/003556
======
exDM69
There is no explanation of how it works. Does it work on top of existing APIs in
user space, or is there a custom kernel driver bypassing user space?

I've done some high-throughput streaming from HDD/SSD to GPU before, and it's
pretty easy to beat the naive solution, but getting the most out of it would
require kernel-space code.

I was doing random-access streaming of textures, using memory-mapped files for
input and copying into persistently/coherently mapped pixel buffers on the CPU
with memcpy on background threads. This was meant to take advantage of the OS
buffer cache (it works great when a page is reused) and was designed for random
access. If I had been working on a sequential/full-file upload, my solution
would have been entirely different.
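
Roughly, the approach looked like this (a minimal sketch, not my actual code;
it assumes OpenGL 4.4's glBufferStorage and POSIX mmap, with error handling
omitted):

```c
#include <GL/glew.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static GLuint pbo;
static void *pbo_ptr;   /* persistent CPU-visible mapping of the PBO */

/* Create a pixel buffer that stays mapped for the app's lifetime. */
void setup_pbo(GLsizeiptr size)
{
    const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                             GL_MAP_COHERENT_BIT;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferStorage(GL_PIXEL_UNPACK_BUFFER, size, NULL, flags);
    pbo_ptr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, size, flags);
}

/* Runs on a background thread: copy one tile of a memory-mapped file into
 * the mapped buffer. Re-reads hit the OS buffer cache, which is what makes
 * random access cheap. The GL thread must still fence before sampling from
 * the region (glFenceSync / glClientWaitSync). */
void stream_tile(const char *path, off_t src_off, size_t len, size_t dst_off)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    const char *src = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    memcpy((char *)pbo_ptr + dst_off, src + src_off, len);
    munmap((void *)src, st.st_size);
    close(fd);
}
```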

Edit: here's the source:
[https://github.com/kaigai/ssd2gpu](https://github.com/kaigai/ssd2gpu)

It has a custom kernel module.

~~~
kaigai
Its kernel module provides some special APIs, and the userspace application
(PostgreSQL) is enhanced to use them. From the user's point of view, SQL is
still the interface for accessing the data.
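
The general shape is an ioctl that names a file region and a pinned GPU
buffer. Purely as a hypothetical sketch - the device node name, ioctl number,
and struct below are invented for illustration; see the repository for the
real interface:

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <stdint.h>

struct ssd2gpu_dma_req {        /* invented for illustration */
    int      file_fd;           /* source file on the NVMe SSD */
    uint64_t file_offset;       /* byte offset within the file */
    uint64_t length;            /* bytes to transfer */
    uint64_t gpu_vaddr;         /* CUDA device pointer, pre-pinned for P2P */
};
#define SSD2GPU_IOCTL_DMA _IOW('s', 1, struct ssd2gpu_dma_req)  /* invented */

/* Ask the kernel module to DMA a file region straight into GPU memory. */
int ssd2gpu_read(int devfd, int file_fd, uint64_t off, uint64_t len,
                 uint64_t gpu_vaddr)
{
    struct ssd2gpu_dma_req req = {
        .file_fd = file_fd, .file_offset = off,
        .length = len, .gpu_vaddr = gpu_vaddr,
    };
    return ioctl(devfd, SSD2GPU_IOCTL_DMA, &req);
}
```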

------
zokier
This is very interesting in light of AMD's recent announcement of their
"Solid State Graphics", i.e. a GPU with an SSD duct-taped on:
[http://www.anandtech.com/show/10518/amd-announces-radeon-pro...](http://www.anandtech.com/show/10518/amd-announces-radeon-pro-ssg-fiji-with-m2-ssds-onboard)

------
foobar2020
This would be incredibly useful for distributed machine learning - imagine a
TensorFlow implementation that almost entirely bypasses the CPU.

~~~
Eridrus
I was thinking the same thing, but is SSD-to-GPU actually faster than
RAM-to-GPU? In many (not all) cases you buy a tonne of RAM, load your entire
dataset into memory once, and then iterate over it as necessary.

You also lose the flexibility to do any sort of data modification or
augmentation. One domain where your data usually doesn't fit in RAM is image
recognition, but there you often want to apply random flips and crops and
change hues before training, to make the neural net less sensitive to those
variations - which you can't really do with this.

~~~
foobar2020
An SSD is probably not as fast as RAM, but it's much, much cheaper - on the
order of 10x cheaper per gigabyte. With an SSD-GPU bridge you can have fast
access to a multi-TiB training set on a single machine.

Data pre-processing is indeed an issue, but hue adjustment/flipping/cropping
could be implemented as TensorFlow operations on the GPU (see the sketch
below). Similarly with input decompression - it would either have to be done
on the GPU, or the data would have to be stored uncompressed.
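
For instance, a horizontal flip is trivial as a GPU kernel once the data is
resident - a CUDA sketch standing in for the equivalent TensorFlow GPU op,
assuming an 8-bit RGB image already in device memory:

```cuda
#include <cuda_runtime.h>

/* One augmentation op (horizontal flip of an 8-bit RGB image) as a CUDA
 * kernel. Each thread mirrors one pixel across the vertical axis. */
__global__ void hflip_rgb8(const unsigned char *src, unsigned char *dst,
                           int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;
    int s = (y * width + x) * 3;                /* source pixel */
    int d = (y * width + (width - 1 - x)) * 3;  /* mirrored destination */
    dst[d + 0] = src[s + 0];
    dst[d + 1] = src[s + 1];
    dst[d + 2] = src[s + 2];
}

/* launch: hflip_rgb8<<<dim3((w + 15) / 16, (h + 15) / 16),
 *                      dim3(16, 16)>>>(d_src, d_dst, w, h); */
```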

~~~
emn13
As long as the average bandwidth isn't a bottleneck, it's not going to matter
- at worst, you're just going to need to prefetch (and due to SSD latency,
that's likely optimal regardless).
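
The conventional, non-P2P version of that prefetch is plain double buffering -
a sketch assuming pinned host staging buffers and the CUDA runtime, with error
handling omitted:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define CHUNK (8 << 20)  /* 8 MiB staging chunks */

/* Double-buffered prefetch: read the next chunk from the file while the
 * previous one is still being copied host-to-device on a CUDA stream. */
void stream_file_to_gpu(FILE *f, void *d_dst, size_t total_capacity)
{
    char *h_buf[2];
    cudaEvent_t ev[2];
    cudaStream_t stream;
    size_t done = 0, n;
    int cur = 0;

    cudaStreamCreate(&stream);
    for (int i = 0; i < 2; i++) {
        cudaHostAlloc((void **)&h_buf[i], CHUNK, cudaHostAllocDefault);
        cudaEventCreate(&ev[i]);
    }

    n = fread(h_buf[cur], 1, CHUNK, f);        /* prefetch first chunk */
    while (n > 0 && done + n <= total_capacity) {
        cudaMemcpyAsync((char *)d_dst + done, h_buf[cur], n,
                        cudaMemcpyHostToDevice, stream);
        cudaEventRecord(ev[cur], stream);      /* marks buffer as busy */
        done += n;
        cur ^= 1;
        cudaEventSynchronize(ev[cur]);         /* wait until reusable */
        n = fread(h_buf[cur], 1, CHUNK, f);    /* overlaps the H2D copy */
    }
    cudaStreamSynchronize(stream);
    for (int i = 0; i < 2; i++) {
        cudaFreeHost(h_buf[i]);
        cudaEventDestroy(ev[i]);
    }
    cudaStreamDestroy(stream);
}
```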

------
witty_username
So, if I understand correctly, data is loaded directly from the SSD to the
GPU and then filtered by the GPU before the CPU handles the more complex
parts of the query.

Neat.

------
justinclift
This is very awesome. If it's developed further and made into a feasible
option for PostgreSQL, it has the potential to do interesting things to TPC
benchmarks. :)

------
nl
See also
[https://developer.nvidia.com/gpudirect](https://developer.nvidia.com/gpudirect)
and to some extent
[https://en.wikipedia.org/wiki/NVLink](https://en.wikipedia.org/wiki/NVLink).

NVLink is in the Power9 servers Google is using.

~~~
monocasa
Do we have any data showing that Google is actually using their POWER9 boxes?
I've always read that as investing in anything non-Intel (see also RISC-V),
just to be in a better negotiating position with Intel for the vast number of
CPUs they buy.

~~~
nl
Allegedly they have Power8 servers in their data centres:

 _Maire Mahoney, engineering manager at Google and now a director of the
OpenPower Foundation, confirmed to The Next Platform that Google does indeed
have custom Power8 machines running in its datacenters and that developers can
deploy key Google applications onto these platforms if they see fit. Mahoney
was not at liberty to say how many Power-based machines are running in
Google’s datacenters or what particular workloads were running in production
(if any)._ [1]

It's pretty unclear what that actually means, though.

[1] [http://www.nextplatform.com/2016/04/06/inside-future-google-...](http://www.nextplatform.com/2016/04/06/inside-future-google-rackspace-power9-system/)

------
carbocation
I'm really hoping that Optane delivers on the hype, in which case our durable
storage could be just 10x slower than RAM. At least, I imagine that it would
be really helpful for speeding up even this approach.

------
Razengan
I hope this brings us closer to widespread external GPUs, where you could use
a slower-than-PCIe bus like Thunderbolt 3 or USB 3.1 to upload all assets to
the eGPU's SSD during a one-time loading screen.

------
musha68k
Amazing results! We need more of that kind of thinking - GPU/SSD accelerate
all the things!

------
MrBuddyCasino
Who is providing the DMA engine in this case? Does the GPU have access to
PCIe device memory?

~~~
kaigai
The NVMe SSD acts as the DMA controller (DMAC) in this case. All the GPU does
is map its own device memory into the PCI BAR area.
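
The kernel-side flow is roughly the following - a compressed sketch based on
NVIDIA's public GPUDirect RDMA interface (nv-p2p.h), not the exact ssd2gpu
code; building and submitting the NVMe command itself is elided:

```c
#include <linux/kernel.h>
#include <nv-p2p.h>   /* NVIDIA's public GPUDirect RDMA interface */

static void p2p_free_cb(void *data)
{
    /* invoked if the GPU mapping is torn down underneath the module */
}

/* Pin a range of GPU device memory (64 KiB aligned) and expose its bus
 * addresses. Those addresses go into the PRP/SGL entries of an NVMe read
 * command, so the SSD's DMA engine writes straight into the GPU's BAR. */
int pin_gpu_buffer(uint64_t gpu_vaddr, uint64_t len,
                   struct nvidia_p2p_page_table **page_table)
{
    int rc = nvidia_p2p_get_pages(0, 0, gpu_vaddr, len, page_table,
                                  p2p_free_cb, NULL);
    if (rc)
        return rc;
    /* (*page_table)->pages[i]->physical_address now holds the addresses
     * to place in the NVMe command; submission itself is elided here. */
    return 0;
}
```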

------
foobarbecue
Direct Direct Memory Access? That's pretty direct.

~~~
flamedoge
Redundancy makes it pretty indirect

~~~
kaigai
My headache is painful. It might be called "SSD-to-GPU P2P DMA".

