
“Hiding” network latency for fast memory in data centers - rbanffy
https://news.engin.umich.edu/2020/07/hiding-network-latency-for-fast-memory-in-data-centers/
======
MaxBarraclough
Forgive a semi-relevant ramble:

From a GPU's point of view, this isn't latency hiding; it's latency
reduction.

In GPU terminology, using pre-fetching and caching to reduce the latency for a
read is called _latency reduction_. This approach is used heavily in CPUs.
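To make that concrete, here's a minimal CPU-side sketch of latency reduction
via software prefetching, using the GCC/Clang __builtin_prefetch intrinsic
(the prefetch distance of 8 is a made-up tuning parameter, not a
recommendation):

    // Latency reduction: request the cache line a fixed distance ahead
    // of use, so the load has (at least partly) completed by the time
    // the loop reaches it.
    #include <cstddef>
    #include <cstdio>

    long long sum(const long long *data, size_t n) {
        long long s = 0;
        for (size_t i = 0; i < n; ++i) {
            if (i + 8 < n)
                __builtin_prefetch(&data[i + 8]);  // GCC/Clang builtin
            s += data[i];
        }
        return s;
    }

    int main() {
        long long a[1000];
        for (int i = 0; i < 1000; ++i) a[i] = i;
        printf("%lld\n", sum(a, 1000));
    }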

This is in contrast to the main approach used in GPUs, where you just switch
threads if you're waiting on a read, dubbed _latency hiding_. GPUs are able to
do this terribly quickly, so there's little worry about thread-switching
overhead. It's analogous to 'simultaneous multithreading'/HyperThreading in
CPUs.
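By contrast, here's a minimal CUDA sketch of latency hiding. Note that
nothing in the kernel prefetches; the latency only disappears because the
launch oversubscribes the hardware with warps to switch between:

    // Latency hiding: a memory-bound kernel launched with far more
    // threads than the GPU has execution units. While one warp stalls
    // on its global-memory load, the scheduler runs another warp, at
    // essentially zero switching cost.
    __global__ void scale(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];  // each warp stalls on in[i]; others run
    }

    // Launched with enough blocks to oversubscribe every SM, e.g.:
    //   scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);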

Latency-hiding is vital in GPUs, which handle massively parallel workloads and
use graphics RAM, which has high throughput but poor latency. If you give a
GPU a single-threaded workload, you can expect it to stall on reads far more
than if you had a CPU handle the equivalent workload.

(Disclaimer: my GPU knowledge is rather stale.)

~~~
jacquesm
About your last paragraph: the RAM used in GPUs has pretty good latency when
compared to that used in typical CPU applications; it's just that between
that RAM and a CPU there are multiple layers of caching to make it appear as
though the RAM is faster than it really is. CPUs that blow out their caches
slow to a crawl, whereas GPUs crunching on datasets that are carefully laid
out to maximize RAM bandwidth can process vast amounts of data, much more
than a CPU could with the equivalent workload.
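A minimal CUDA sketch of that "carefully laid out" point, assuming the usual
32-thread warps (the stride parameter is made up for illustration):

    // In the coalesced kernel the 32 threads of a warp touch 32
    // adjacent floats, which the memory controller can service with a
    // few wide GDDR bursts. In the strided kernel each lane's load
    // lands in a different burst, so the same arithmetic runs many
    // times slower.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];        // warp reads one contiguous span
    }

    __global__ void copy_strided(const float *in, float *out,
                                 int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(long long)i * stride % n];  // scattered reads
    }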

Giving a GPU a single-threaded workload would be like starting up a sausage
factory to make a single sausage: it's an inappropriate use of the tool and
will always give poor results.

~~~
MaxBarraclough
> RAM used in GPUs has pretty good latency when compared to that used in
> typical CPU applications

I see [0] that what I should have said is this (taken from page 7):

 _High throughput is obtained through the use of wide memory buses and
specialized GDDR (graphics double data rate) memories that operate most
efficiently when memory access granularities are large. Thus, GPU memory
controllers must buffer, reorder, and then coalesce large numbers of memory
requests to synthesize large operations that make efficient use of the memory
system._

Which roughly translates to higher latency from the point of view of any
particular wavefront/warp, compared to the cache-heavy CPU approach. Of
course, as the GPU's cores can easily switch between wavefronts, the latency
from the wavefront's point of view isn't all that important; it's the overall
throughput that counts.

> Giving a GPU a single-threaded workload would be like starting up a
> sausage factory to make a single sausage: it's an inappropriate use of the
> tool and will always give poor results.

I don't like the analogy, as the single-threaded job could be enormous.

Offloading a very small job onto a GPU is a bad idea because of orchestration
overheads. Offloading a large single-threaded job onto a GPU is a bad idea
because of poor core utilisation (you're using just one of the lanes of just
one of the cores). Offloading a fixed-point-intensive job onto a GPU may be a
bad idea because of an ill-suited instruction set. There's also the matter of
wavefront-coherent flow control: if the lanes of a wavefront take different
branches, the hardware has to execute both paths. And if the job has chaotic
memory-access patterns, we're back to our original topic, and you'll get
bitten by the wide memory bus.
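A minimal CUDA sketch of that flow-control point:

    // Wavefront-incoherent flow control: even and odd lanes of the
    // same warp take different branches, so the hardware runs both
    // branch bodies with half the warp masked off each time.
    __global__ void divergent(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            x[i] *= 2.0f;      // half the warp idles here...
        else
            x[i] += 1.0f;      // ...and the other half idles here
    }

    // A coherent variant branches on something uniform per warp, e.g.
    // (i / 32) % 2 == 0, so every lane takes the same path.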

[0]
[https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15869-f...](https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15869-f11/www/readings/fatahalian08_cacm.pdf)

~~~
jacquesm
> I don't like the analogy, as the single-threaded job could be enormous.

Anything that isn't an embarrassingly parallel workload is a bad match for a
GPU. Use tools appropriately.

~~~
MaxBarraclough
That's a necessary condition, but not a sufficient one, for GPU-friendliness.

An embarrassingly parallel workload on 64-bit integers, without wavefront
flow-control coherence, and with highly chaotic memory-access patterns, might
not fare well on a GPU.

~~~
jacquesm
This is what makes GPU programming hard. Finding ways to shoehorn a sizeable
fraction of a problem into a pattern that matches GPU capabilities is quite a
bit of work. Very small code improvements can have a huge effect on
throughput, and very small mistakes can just as quickly tank your
performance. Great care is required, along with a lot of knowledge of how
data-access patterns translate into hits on the memory system.

I actually like that kind of cycle squeezing, much more than I like web
development or other 'high level' plumbing.

~~~
rbanffy
> I actually like that kind of cycle squeezing, much more than I like web
> development or other 'high level' plumbing.

Fun problems are always the rarest.

------
baybal2
It still won't do anything about random access, the main use case of RDMA in
HPC. Sequential access over the network is fast enough without RDMA.

Writing programs to take advantage of RDMA, and of advanced cases like
scatter-gather across multiple RDMA servers, almost always means building
your compute logic around the data-access pattern you have.

------
soamv
direct link to paper (pdf):
[https://www.usenix.org/system/files/atc20-maruf.pdf](https://www.usenix.org/system/files/atc20-maruf.pdf)

------
jtsiskin
A lot of work here is done on predictive prefetching.

What if the application had a richer way of communicating with the IO layer
instead of a series of reads?

For example, when reading and processing data in a loop, the application
could first say that it will need the data at address a1, then spend x time
processing, then need a2, then x time again, then a3, and so on. The IO layer
learns what x is and finds the best access pattern given all its current and
upcoming requests. Although this doesn't work when your next read depends on
the last read, it still unlocks many more usage patterns than sequential
prefetching does.
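A hypothetical sketch of such an interface (none of these names are a real
API; the actual transport call is left abstract):

    #include <chrono>
    #include <cstdint>
    #include <deque>

    struct Hint { uint64_t addr; };            // "I will need addr next"

    class HintedPrefetcher {
        std::deque<Hint> upcoming;             // a1, a2, a3, ...
        std::chrono::nanoseconds x{0};         // learned processing gap
        std::chrono::steady_clock::time_point last =
            std::chrono::steady_clock::now();

      public:
        void will_need(uint64_t addr) { upcoming.push_back({addr}); }

        // Called at each read in the app's loop; measures x between
        // calls and kicks off the next hinted fetch so it overlaps with
        // the ~x of processing about to happen.
        void on_read(uint64_t /*addr*/) {
            auto now = std::chrono::steady_clock::now();
            x = std::chrono::duration_cast<std::chrono::nanoseconds>(
                now - last);                   // crude running estimate
            last = now;
            if (!upcoming.empty()) {
                issue_fetch(upcoming.front().addr);
                upcoming.pop_front();
            }
        }

      private:
        void issue_fetch(uint64_t addr) {
            (void)addr;  // hand off to the transport, e.g. an async read
        }
    };

In a fuller version, x would decide how many hints ahead the transport should
run, and the JIT-like variant would derive the will_need() calls
automatically instead of asking the developer for them.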

Is this just too much extra work for the application developer? What if
instead a process similar to JIT was used?

Bottom line: I think incorporating application-level knowledge could improve
the prefetching.

------
centimeter
Tl;dr version: use a prefetcher and optimize the network stack.

