
Data processing on modern hardware - zdw
https://lemire.me/blog/2018/06/26/data-processing-on-modern-hardware/
======
lmeyerov
As a team working a _lot_ on modern GPUs in the cloud for pretty normal (non-
just-deep-learning) compute, we have a different experience:

1\. Big GPU memories, sufficiently flexible compute units, breakthroughs in
irregular algorithms & tooling => we found end-to-end GPU computing quite
viable for most of our data processing. GOAI (link below) is becoming the next
big leap here.

2\. Getting data on device isn't the issue: the slow leg is data warehouse ->
device (3 GB/s), not cpu->gpu (30-300 GB/s) - see the copy-bandwidth sketch
after this list. No significant advantage to the CPU here, and given the
slight CPU tax, I expect network->GPU to become cloud-standard.

3\. Multi-GPU & GPU memory is insane: The exciting thing for us is (a) all
cloud providers have GPUs and are updating them regularly and (b) Nvidia is
providing multi-GPU designs with a lot of fast shared memory (ex: DGX2/HGX2).
It's a world where for most datasets, there is more GPU RAM than workload
data, and 100K+ threads means a node can destroy a dataset in a few steps. For
the bigger datasets that require streaming, GPU memory bandwidth & thread
count is a monster.
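
To put a number on the cpu->gpu leg from point 2: a minimal sketch using the
standard CUDA runtime (not our tooling; the 1 GiB buffer size is arbitrary,
and pinned host memory is what lets the copy run at full PCIe/NVLink speed):

    // Minimal sketch: time a host->device copy of pinned (page-locked) memory.
    // cudaMallocHost gives DMA-friendly memory, so the copy runs at link speed.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = size_t(1) << 30;    // 1 GiB test buffer (arbitrary)

        void *h_buf, *d_buf;
        cudaMallocHost(&h_buf, bytes);           // pinned host memory
        cudaMalloc(&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("host->device: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }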

1+3 => We're playing with interactive (<100ms) tasks on dataset sizes that, as
interactive tool devs, we regularly get surprised by.

Ex: For some of our high-level interactive visual analytics tasks that render
in the browser, we'll run a GPU compute pipeline end-to-end (5+ GPU kernels),
where most subtasks take 0-1ms (so many threads!), maybe repeatedly for
simulations, and within 20ms get back to the CPU and then send down to the
browser. Amdahl's Law hits us... at the network compression & transfer layer.
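
Not our actual pipeline, but a minimal CUDA sketch of the shape I mean - a few
sub-millisecond kernels chained on one stream over data that stays resident on
the GPU, with only a small summary copied back to the host at the end (kernel
names, sizes, and the final reduction are purely illustrative):

    // Not our real pipeline - just the shape: a few cheap kernels chained on one
    // stream over GPU-resident data, then a small device->host copy of the result.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n, float k) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= k;
    }

    __global__ void shift(float *x, int n, float b) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += b;
    }

    __global__ void block_sum(const float *x, float *out, int n) {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? x[i] : 0.f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride) tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
    }

    int main() {
        const int n = 1 << 24;                   // ~16M floats stay resident on the GPU
        const int threads = 256, blocks = (n + threads - 1) / threads;

        float *d_x, *d_partial;
        cudaMalloc((void **)&d_x, n * sizeof(float));
        cudaMalloc((void **)&d_partial, blocks * sizeof(float));
        cudaMemset(d_x, 0, n * sizeof(float));

        cudaStream_t s;
        cudaStreamCreate(&s);

        // Several sub-millisecond kernels back to back; data never leaves the device.
        scale<<<blocks, threads, 0, s>>>(d_x, n, 2.0f);
        shift<<<blocks, threads, 0, s>>>(d_x, n, 1.0f);
        block_sum<<<blocks, threads, 0, s>>>(d_x, d_partial, n);

        // Only a small summary comes back to the CPU at the end.
        float *h_partial = (float *)malloc(blocks * sizeof(float));
        cudaMemcpyAsync(h_partial, d_partial, blocks * sizeof(float),
                        cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);

        double total = 0;
        for (int i = 0; i < blocks; ++i) total += h_partial[i];
        printf("sum = %f\n", total);

        free(h_partial);
        cudaFree(d_x);
        cudaFree(d_partial);
        cudaStreamDestroy(s);
        return 0;
    }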

For a sense of where the GOAI (GPU Open Analytics Initiative) community is
going, from the perspective of web devs, we wrote up some planning notes on
our path to making regular Node.js apps plug into all this:
[https://www.graphistry.com/blog/js-gpus-ml-arrow-goai](https://www.graphistry.com/blog/js-gpus-ml-arrow-goai).

------
abainbridge
> networking folks like something called DMA (direct memory access). As I
> understand it, it allows one machine to access “directly” the memory of
> another machine, without impacting the remote CPU. Thus it should allow you
> to take several distinct nodes and make them “appear” like a multi-CPU
> machine. Building software on top of that requires some tooling and it is
> not clear to me how good it is right now.

In the old days, CPUs couldn't saturate DRAM bandwidth but DMA could, so DMA
was important. Today a single CPU core can read or write about 100
GByte/sec, about 4x more than the bandwidth of a single DRAM module
([https://zsmith.co/images/XeonE5-2630v4.png](https://zsmith.co/images/XeonE5-2630v4.png)
and
[https://en.wikipedia.org/wiki/DDR4_SDRAM#Modules](https://en.wikipedia.org/wiki/DDR4_SDRAM#Modules)).
Also PCs tend to have more CPU cores than DRAM channels.
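
A rough way to sanity-check the single-core number on your own box is a plain
streaming-read loop - only a sketch; the result depends heavily on compiler
vectorization, the hardware prefetchers, and the DRAM configuration:

    // Rough single-threaded streaming-read benchmark. Build with optimizations
    // (e.g. -O2); the loop just sums 1 GiB of sequential data from DRAM.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t n = size_t(1) << 27;        // 128M x 8 bytes = 1 GiB
        std::vector<uint64_t> buf(n, 1);

        uint64_t sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < n; ++i) sum += buf[i];   // sequential read
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        printf("read %.1f GB/s (checksum %llu)\n",
               n * sizeof(uint64_t) / 1e9 / secs, (unsigned long long)sum);
        return 0;
    }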

If you use something like DPDK to get fast network performance, then I'm not
sure what more DMA brings.

If you are already using all your CPU power and not saturating your DRAM
channels, then DMA is a win. But instead of buying the fancy NIC with the DMA
support, you could have bought more CPU cores.

I can see that the latency and power consumption of DMA might be lower than if
the CPU was involved because when a NIC supports DMA, the data path looks like
this:

NIC <-> PCIe <-> Northbridge <-> DRAM

If the CPU is involved, it looks like this:

NIC <-> PCIe <-> Northbridge <-> CPU <-> Northbridge <-> DRAM

And you have to jump through hoops to prevent the DPDK NIC polling thread from
being descheduled occasionally, particularly if you are running in a cloud VM.
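
The hoops usually amount to something like this on Linux - pin the polling
thread to a core and put it in a real-time scheduling class - though a cloud
hypervisor can still preempt the whole vCPU underneath you (sketch only; the
core number and priority are arbitrary):

    // Pin a busy-polling thread to one core and give it SCHED_FIFO priority so
    // the kernel rarely deschedules it. Needs root or CAP_SYS_NICE; a cloud
    // hypervisor can still steal the underlying vCPU.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE                     // for pthread_setaffinity_np
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    void *poll_loop(void *) {
        for (;;) {
            // busy-poll the NIC rx queue here (e.g. a DPDK rx burst)
        }
        return nullptr;
    }

    int main() {
        pthread_t t;
        pthread_create(&t, nullptr, poll_loop, nullptr);

        cpu_set_t cpus;                     // pin to core 2 so the thread never migrates
        CPU_ZERO(&cpus);
        CPU_SET(2, &cpus);
        pthread_setaffinity_np(t, sizeof(cpus), &cpus);

        sched_param sp{};                   // real-time FIFO: only higher-priority
        sp.sched_priority = 80;             // RT tasks can preempt it
        if (pthread_setschedparam(t, SCHED_FIFO, &sp) != 0)
            perror("pthread_setschedparam");

        pthread_join(t, nullptr);
        return 0;
    }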

Another interesting thing is that modern machines allow the NIC to write
traffic into CPU cache directly, bypassing DRAM
([https://www.intel.co.uk/content/www/uk/en/io/data-direct-i-o...](https://www.intel.co.uk/content/www/uk/en/io/data-direct-i-o-technology.html)).
If the data you are trying to read (via DMA) is in CPU cache, then I'm
guessing it'll be slower than if you asked the CPU for it.

And another thing: the memory controller in the Northbridge maintains a table
of outstanding memory requests that CPU cores or PCIe peripherals have
demanded. I don't know what the contention policy is. That could alter things
drastically.

~~~
bogomipz
There hasn't been a Northbridge for quite some time. It was replaced by
QPI/HyperTransport and an integrated memory controller.

~~~
abainbridge
OK, perhaps I'm showing my age by clinging to these archaic terms. By
Northbridge I just meant the functional block that contains the memory
controller.

~~~
bogomipz
I totally understand, it was around forever :)

I thought you made good observations; I was just trying to add to them. Cheers.

------
drej
I love reading about modern CPUs, SIMD, cache efficiency, etc. But then people
serialise the hell out of their data and send it over the network (sometimes
across regions), and all the benefit is gone. I hope people start processing
data on single nodes more often; I see too much distributed processing where a
single node would do.

------
collyw
> We have cheap and usable machine learning; it can run on neat hardware if
> you need it to. Meanwhile, databases are hard to tune. It seems that we
> could combine the two to tune automagically database systems. People are
> working on this. It sounds a bit crazy, but I am actually excited about it
> and I plan to do some of my own work in this direction.

Aren't databases well understood in comparison to machine learning? I know it
would take me a lot less time to look up how to tune a database than it would
to get a machine learning system up and running.

~~~
contingencies
This was the only point that struck me as odd as well. I believe the author
was referring to self-tuning in the general case (i.e. as part of the DB
software to be deployed against many new databases) rather than one-off
ML-based tuning (i.e. as part of a manual optimization effort for an existing
database).

Of course, the real optimization should be for legibility and
maintainability... human time being far more important than machine time. But
that wouldn't allow fetish discussions diving into low-level efficiencies,
now would it?

 _Any sufficiently complex system acts as a black box when it becomes easier
to experiment with than to understand. Hence, black-box optimization has
become increasingly important as systems become more complex._ \- Google
Vizier: A Service for Black-Box Optimization, Golovin et al., KDD '17

------
jeffreyrogers
There's a lot of research on this topic. Peloton[0] is one example. It's an
in-memory SQL DB designed for OLTP and OLAP workloads that tunes itself. Andy
Pavlo, one of the researchers, has an awesome series of lectures on YouTube
that describes a lot of the modern research on database implementation[1].

[0]: [https://pelotondb.io/](https://pelotondb.io/)

[1]:
[https://www.youtube.com/playlist?list=PLSE8ODhjZXjYplQRUlrgQ...](https://www.youtube.com/playlist?list=PLSE8ODhjZXjYplQRUlrgQKwIAV3es0U6t)

------
bitL
> There is an argument that says that if you have the GPU anyhow, you might as
> well use it even if it is inefficient. I do not like this argument.

Why not? People use CPUs all the time where GPUs would be much more efficient,
so why not the other way round as well? If you have a GPU, why keep it idle?

~~~
snorrah
I think he touches on why not in the comments about blockchain. It's probably
about “wasting” power and energy through inefficiency.

~~~
bitL
Alright, with proof of work you do indeed waste energy, but it's still only
tangentially related - if you do it on a CPU, you are probably wasting even
more. However, I see e.g. my Core-M laptop as vastly underused: if I run Deep
Learning there, it's on the CPU, which is inefficient, but since it only has
Intel HD graphics with OpenCL, for which I don't have a working
TensorFlow/Keras, it probably runs 100x slower than it could. I still use the
CPU for simpler Deep Learning models when I need them "on the road", though I
wish I could use the iGPU. OTOH, when I have a problem that requires AVX-512,
it's much more efficient to compute it on the latest Xeon - but if I only have
a stack of single-precision GPUs, I might use those even knowing the results
will be less precise and take longer to compute.
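
For concreteness, the kind of AVX-512 double-precision kernel I have in mind
looks roughly like this (illustrative only - it needs an AVX-512 capable CPU
and something like -mavx512f to build):

    // Illustrative AVX-512 double-precision kernel: y = a*x + y, 8 doubles per FMA.
    // Build with something like: g++ -O2 -mavx512f daxpy.cpp
    #include <immintrin.h>
    #include <cstddef>
    #include <cstdio>

    void daxpy_avx512(double a, const double *x, double *y, size_t n) {
        __m512d va = _mm512_set1_pd(a);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m512d vx = _mm512_loadu_pd(x + i);
            __m512d vy = _mm512_loadu_pd(y + i);
            _mm512_storeu_pd(y + i, _mm512_fmadd_pd(va, vx, vy));  // y = a*x + y
        }
        for (; i < n; ++i) y[i] = a * x[i] + y[i];   // scalar tail
    }

    int main() {
        const size_t n = 1024;
        double x[n], y[n];
        for (size_t i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }
        daxpy_avx512(3.0, x, y, n);
        printf("y[0] = %f\n", y[0]);                 // expect 5.0
        return 0;
    }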

------
supergirl
what's the point of making a post to dismiss everything you heard without much
argument...

