
Is software prefetching (__builtin_prefetch) useful for performance? - ingve
https://lemire.me/blog/2018/04/30/is-software-prefetching-__builtin_prefetch-useful-for-performance/
======
yvdriess
It is useful in cases where the hardware prefetcher is not helpful: indirect
memory access (e.g. graph traversal), or constantly accessing different pages
(e.g. column-wise access of a matrix whose rows are larger than 4 kB).
However, finding the right prefetch distance is key. Too early and you wasted
an instruction. Too late and your cache line might already be evicted,
degrading performance. When prefetch instructions were introduced, a prefetch
distance of one was commonly fine. Now you should probably do a sweep to find
an ideal value.

Prefer giving your optimising compiler a prefetch hint; it will be better at
guessing the distance and be less architecture-dependent.
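
A minimal sketch of what such a sweep looks like in practice (illustrative
only; the access pattern and the prefetch distance parameter are made up, and
the right distance depends on the micro-architecture):

    #include <cstddef>

    // Indirect reads the hardware prefetcher can't follow; pf_dist is the
    // prefetch distance to sweep (0 disables software prefetch).
    double sum_indirect(const double *values, const int *idx, std::size_t n,
                        std::size_t pf_dist)
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (pf_dist != 0 && i + pf_dist < n)
                __builtin_prefetch(&values[idx[i + pf_dist]]);
            sum += values[idx[i]];
        }
        return sum;
    }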

~~~
evincarofautumn
Yeah, speaking of graph traversal: when I was working on the Mono runtime
performance team at Xamarin, adding some manual prefetching to object-graph
scanning in the garbage collector (SGen) helped considerably, but it took a
fair amount of benchmarking to find the sweet spot.
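
The shape of it was roughly this (an illustrative sketch, not SGen code; the
distance is invented and has to be found by benchmarking):

    #include <cstddef>
    #include <vector>

    struct Object {
        std::vector<Object *> fields;
        bool marked = false;
    };

    // Mark objects from a work list, prefetching entries a few slots ahead
    // of the one currently being scanned.
    void mark(std::vector<Object *> &worklist)
    {
        const std::size_t ahead = 4;  // assumed distance, tune per CPU
        while (!worklist.empty()) {
            std::size_t last = worklist.size() - 1;
            if (last >= ahead)
                __builtin_prefetch(worklist[last - ahead]);
            Object *obj = worklist.back();
            worklist.pop_back();
            if (obj->marked)
                continue;
            obj->marked = true;
            for (Object *child : obj->fields)
                worklist.push_back(child);
        }
    }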

------
throwaway84742
I was able to cheaply and measurably boost the perf of some heavily optimized
deep learning inference stuff through judicious use of prefetch. The CPU can
only do the job for you if it notices coalesced reads. It won’t do anything
about that next buffer you could be pulling in now from a different place
while it is busy processing data it already has in the cache.
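
The pattern was roughly this (hypothetical sketch, not the actual inference
code; process() and the cache-line stride are assumptions):

    #include <cstddef>

    void process(const float *buf);  // assumed compute-heavy work

    // While the CPU chews on the current buffer, start pulling in the next
    // one from a different allocation the hardware prefetcher knows nothing
    // about.
    void process_buffers(const float *const *buffers, std::size_t count,
                         std::size_t buf_bytes)
    {
        for (std::size_t b = 0; b < count; ++b) {
            if (b + 1 < count) {
                const char *next =
                    reinterpret_cast<const char *>(buffers[b + 1]);
                for (std::size_t off = 0; off < buf_bytes; off += 64)
                    __builtin_prefetch(next + off);
            }
            process(buffers[b]);
        }
    }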

Benchmarking on realistic workloads is crucial for this kind of work. If you
can’t measure the benefit, leave it out, because what is harmless on one arch
can be measurably harmful on another. As if that wasn’t bad enough, results
can and do differ between compilers, too, even on the same arch.

------
int0x80
Analysis of why (software) prefetch is problematic and generally not a win
(especially interesting are the results obtained by I. Molnar, linked in the
article):

[https://lwn.net/Articles/444336/](https://lwn.net/Articles/444336/)

------
gpderetta
Anecdote: a very long time ago, I optimized (as in 40% speed up) the inner
loop of a computation heavy application (a custom document clustering
application) by using __builtin_prefetch [1]. The code was simply a series of
sparse-vector/sparse-vector multiplications, with each vector kept in a linked
list [2]. The big win was in overlapping the prefetching of the next list node
and the contained vector while the multiplication was in progress.
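
Roughly the idea (illustrative sketch, not the original code; all names are
invented):

    #include <vector>

    struct SparseVec { std::vector<int> idx; std::vector<double> val; };

    double sparse_dot(const SparseVec &, const SparseVec &);  // assumed kernel

    struct Node {
        Node      *next;  // linked so vectors can be spliced cheaply
        SparseVec *vec;
    };

    // Kick off loads for the next node and its vector; they are independent
    // of the current multiplication, so they overlap with it.
    double accumulate(const Node *head, const SparseVec &query)
    {
        double score = 0.0;
        for (const Node *n = head; n != nullptr; n = n->next) {
            if (n->next != nullptr) {
                __builtin_prefetch(n->next);
                __builtin_prefetch(n->next->vec);
            }
            score += sparse_dot(*n->vec, query);  // the expensive part
        }
        return score;
    }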

This optimization worked fine on the Core 2 architecture. After moving to
Westmere, the optimization didn't have any significant effect (I doubt the
hardware was doing list prefetching, so some other bottleneck was preventing
it from being effective). On the other hand, enabling hyper-threading gave a
near 100% speedup (the application was trivially parallelizable and already
parallelized via OpenMP).

[1] Actually for portability, I ended up doing explicit throwaway reads in the
released code instead of using prefetch instructions.

[2] The linked list was not optimal but the vectors were shuffled around a lot
so a fast splice was useful. Prefetching the vector itself would still have
been a win though.

------
a_t48
Anecdote - at my previous job, we had a container that was essentially a
vector of (pooled?) owned pointers - it had some other stuff, like being able
to go from item->index in constant time, and it ended up being very convenient
for certain types of operations. We used prefetching for iteration, and it was
supposedly faster for containers larger than a dozen or so items.
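
Roughly this pattern (hypothetical sketch, not the actual container): the
pointer array itself streams nicely, but the pointed-to objects don't, so
prefetch those a few slots ahead of the one being visited.

    #include <cstddef>
    #include <memory>
    #include <vector>

    template <typename T, typename Fn>
    void for_each_prefetched(const std::vector<std::unique_ptr<T>> &items,
                             Fn fn)
    {
        const std::size_t ahead = 8;  // assumed distance, needs tuning
        for (std::size_t i = 0; i < items.size(); ++i) {
            if (i + ahead < items.size())
                __builtin_prefetch(items[i + ahead].get());
            fn(*items[i]);
        }
    }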

------
lukego
I would want to control for CPU microarchitecture when doing benchmarks like
this. It bugs me to see prefetch calls going upstream into various projects
based on one microbenchmark on one (unspecified) CPU. This is a moving target.

------
gnufx
It's useful in linear algebra implementations (if not necessarily using
__builtin_prefetch). You can study the (micro-)architecture-dependent uses in
e.g. [https://github.com/flame/blis](https://github.com/flame/blis)
[https://github.com/xianyi/OpenBLAS](https://github.com/xianyi/OpenBLAS) and
[https://github.com/hfp/libxsmm](https://github.com/hfp/libxsmm) and it may be
discussed in the papers describing them. libxsmm provides a view of the
complexity outside "large" dimension BLAS.
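
The flavour of it, heavily simplified (illustrative sketch only, not code
from any of those libraries; the block sizes and PF_DIST are invented and
would be tuned per micro-architecture):

    // 4x4 micro-kernel over packed panels of A and B: prefetch the panel
    // data a few k-iterations ahead while accumulating the current updates.
    constexpr int MR = 4, NR = 4, PF_DIST = 8;

    void micro_kernel(int kc, const double *a, const double *b,
                      double *c, int ldc)
    {
        double acc[MR][NR] = {};
        for (int k = 0; k < kc; ++k) {
            __builtin_prefetch(a + (k + PF_DIST) * MR);
            __builtin_prefetch(b + (k + PF_DIST) * NR);
            for (int i = 0; i < MR; ++i)
                for (int j = 0; j < NR; ++j)
                    acc[i][j] += a[k * MR + i] * b[k * NR + j];
        }
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                c[i * ldc + j] += acc[i][j];
    }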

------
voidmain
Software prefetch can be a big win when doing a traversal that lets you
predict random reads far enough in advance, which is generally not the case
for linked list or tree traversals. (Since it's microarchitecture- as well as
data-dependent, you'll need to do a lot of profiling.)

A better trick for hiding memory latency in these cases is to interleave
multiple searches (perhaps of different instances of the data structure). For
example you can write binary_tree.find_both(key1, key2) that's almost as fast
(on large trees) as finding a single key.
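
A sketch of the interleaving idea (hypothetical find_both, not from any
particular library): advance both lookups in lock-step, so their cache misses
are in flight at the same time instead of being served one after the other.

    #include <utility>

    struct Node {
        int   key;
        Node *left, *right;
    };

    std::pair<const Node *, const Node *>
    find_both(const Node *root, int key1, int key2)
    {
        const Node *a = root, *b = root;
        while ((a && a->key != key1) || (b && b->key != key2)) {
            if (a && a->key != key1)
                a = (key1 < a->key) ? a->left : a->right;  // independent load
            if (b && b->key != key2)
                b = (key2 < b->key) ? b->left : b->right;  // independent load
        }
        return {a, b};  // each is the found node or nullptr
    }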

------
bhouston
I think it helped by 10% in a tight loop on one project for us. It generally
isn't worth it unless you are writing something like a video codec or, in our
case, a CFD algorithm.

------
fanf2
Sometimes, explicit prefetching can be a win: I found an unusual case where it
significantly sped up traversing a tree structure, because the structure
luckily allowed overlapping the computation of which element in the child to
visit with prefetching the child:

[https://dotat.at/prog/qp/blog-2015-10-11.html](https://dotat.at/prog/qp/blog-2015-10-11.html)
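
The pattern generalises to any tree where deciding which child to descend
into takes real work: prefetch the child array as soon as its address is
known, then do that work while the line is on its way. An illustrative sketch
of the idea (not the code from the post; the node layout is simplified):

    #include <cstdint>

    struct Trie {
        std::uint64_t bitmap;  // which children exist
        Trie         *twigs;   // packed array of the existing children
    };

    const Trie *descend(const Trie *t, unsigned nibble)
    {
        __builtin_prefetch(t->twigs);  // start the fetch early
        if (!(t->bitmap & (1ull << nibble)))
            return nullptr;            // no such child
        // popcount of the bits below `nibble` gives the child's slot.
        unsigned idx =
            __builtin_popcountll(t->bitmap & ((1ull << nibble) - 1));
        return &t->twigs[idx];         // hopefully already in cache
    }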

