
Hardware Graph Prefetchers - luu
http://www-dyn.cl.cam.ac.uk/~tmj32/wordpress/hardware-graph-prefetchers/
======
akkartik
Oddly enough, my PhD thesis was on the same subject, and the solutions also
seem to have something in common. Look for example at figure 5.1 on page 95 of
[http://akkartik.name/akkartik-phd07.pdf](http://akkartik.name/akkartik-phd07.pdf).

I haven't fully internalized OP's approach yet (and this was many years ago,
so I'm no longer very knowledgeable), but they seem to be making more
assumptions about what the data structure looks like, where I had a compiler
distilling arbitrary pointer traversals out of large programs.

------
jhallenworld
How about rearranging the graph to take advantage of the CPU's existing
prefetch logic? The problem is that even with pointer-chasing prefetch logic
there are other device limitations: for example, DRAM delivers higher
performance with sequential access.

~~~
richdougherty
Here's a paper from Facebook about how they rearrange their data to improve
locality. The paper is about distributed systems, but it's interesting and
maybe relevant to graph locality on a single machine…

[https://blog.acolyer.org/2016/05/25/socialhash-an-assignment...](https://blog.acolyer.org/2016/05/25/socialhash-an-assignment-framework-for-optimizing-distributed-systems-operations-on-social-networks/)

------
abc_lisper
Nice. I am sure this can be extended to object graphs too.

Now, if only we (or the compiler) had a way of hinting to the CPU how to
prefetch the objects, it would give us awesome performance.

~~~
gcc_programmer
Most modern architectures have prefetch instructions. I'm not surprised
compilers don't use them (they optimise for the average program), but I am
surprised that specialised libraries like Boost aren't doing this.

~~~
wsxcde
Prefetch instructions are difficult to get right from a compiler perspective.
If you prefetch too early, you might kick useful data out of the cache. If you
do it too late, it's obviously useless, plus you just wasted an entire
instruction: I-cache space, one fetch slot, one issue slot, one ROB entry,
etc.

Plus, they're not portable: even if you get the timing right now, it won't
work on the next generation of the microprocessor or on a slightly different
memory subsystem. You might be able to tune it just right on your desktop, but
it probably won't work as well on your laptop, and it certainly won't work as
well on a cluster.

IMO prefetch instructions are a useful hack when you have a very specific
machine on which you're hand-tuning assembly to get the most performance.
There aren't too many scenarios where this is useful - maybe scientific
workloads on huge supercomputers are one - but they're otherwise kinda
useless.

~~~
abc_lisper
> Plus, they're not portable, even if you get the timing right now, it won't
> work on the next generation of the microprocessor or on a slightly different
> memory subsystem. You might be able to tune it just right on your desktop,
> but it probably won't work as well on your laptop, and it certainly won't
> work as well on a cluster

I think using Java or any other JIT system will solve this problem.

------
rjurney
Graph analytics are my business but hardware is not. What is the tldr?

~~~
zitterbewegung
If you give the processor a cache that is tailored to graphs, you can get a
significant speedup because the processor knows what to cache.

