

Efficient transformation of Adjacency Matrix using cache optimized C++ code - karimouda
https://medium.com/@karim_ouda/efficient-transformation-of-adjacency-matrix-using-cache-optimized-c-code-e41327d88618

======
shepik
I believe an efficient algorithm should combine the recursive approach
[http://stackoverflow.com/a/28027063](http://stackoverflow.com/a/28027063)
with an SSE-based small-matrix transposer
[http://stackoverflow.com/a/16743203](http://stackoverflow.com/a/16743203)

Also, a good reference point for comparing performance is math libraries like
Eigen, MTL, or IBM ESSL.
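
Combining the two, a minimal scalar sketch of the recursive (cache-oblivious) half might look like the following; `kCutoff` is an assumed tuning parameter, and the SSE kernel from the second link would replace the innermost scalar loop:

```cpp
#include <cstdint>
#include <cstddef>

// Recursively split the matrix until a block fits in cache, then
// transpose the small block directly. N is the full matrix width;
// [r0,r1) x [c0,c1) is the sub-block being transposed.
constexpr std::size_t kCutoff = 16;  // assumed small-block size

void transpose_rec(const uint8_t* in, uint8_t* out, std::size_t N,
                   std::size_t r0, std::size_t r1,
                   std::size_t c0, std::size_t c1) {
    if (r1 - r0 <= kCutoff && c1 - c0 <= kCutoff) {
        // Base case: direct transpose of a cache-sized tile.
        for (std::size_t i = r0; i < r1; ++i)
            for (std::size_t j = c0; j < c1; ++j)
                out[j * N + i] = in[i * N + j];
    } else if (r1 - r0 >= c1 - c0) {
        // Split the longer dimension to keep blocks roughly square.
        std::size_t rm = (r0 + r1) / 2;
        transpose_rec(in, out, N, r0, rm, c0, c1);
        transpose_rec(in, out, N, rm, r1, c0, c1);
    } else {
        std::size_t cm = (c0 + c1) / 2;
        transpose_rec(in, out, N, r0, r1, c0, cm);
        transpose_rec(in, out, N, r0, r1, cm, c1);
    }
}
```

Splitting the longer dimension keeps both the reads and the writes within a cache-sized tile at every level of the recursion.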

------
nkurz
It's an interesting sample problem. I might try to play with it tonight to see
what I can come up with. If you haven't done it yet, you should profile your
current solution using the hardware performance counters. On Linux, you can
just run 'perf record ./program'; parallel tools exist on other systems.

A first note is that testing against a -O0 base case is cheating. Although the
manual says "no optimization", what it really means is something like "do
really inefficient things that might make debugging easier". For a fair
comparison, your base case should probably use the same optimization level as
your final version.

It's hard to talk about the efficiency of a C++ routine without considering
the target machine, the compiler, the size of the input, and the distribution
of the input. Compilers are sufficiently smart that the 'base case' might not
be doing what you expect it to. Switching the order of loops is fair game for
an optimizer, so one reason you might not have seen a gain from the simple
approaches is that the compiler was already doing them.
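
For reference, the naive base case under discussion is presumably something like this, assuming (from the article) that the transformation mirrors the upper triangle of a 0/1 adjacency matrix into the lower one; the function name is mine:

```cpp
#include <cstdint>
#include <cstddef>

// Naive version: row-major scan of the upper triangle, with a
// random-access write into the lower triangle whenever the input
// bit is set.
void mirror_naive(uint8_t* m, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            if (m[i * n + j])
                m[j * n + i] = 1;
}
```

Swapping the i/j loops here is exactly the kind of rewrite the optimizer may already have done for you at -O2.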

It would be interesting to test your current solution with different degrees
of sparsity. Currently it avoids writing to memory if the input is 0. This is
great, unless the branch for this check is unpredictable. At 50% fill, you'll
mispredict half of your checks. In the naive case, this doesn't matter much,
since the main cost is the random access for the writes. But now that you are
blocking, it might become more relevant. You might want to see how the timing
changes with different degrees of sparsity.

For another approach, it's possible that you'd benefit by making two passes.
Instead of an if statement to decide whether you should write the output,
first make a vectorized branch-free pass that writes a vector of addresses to
update, and then make a second pass that writes ones to these addresses.
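
A scalar sketch of that two-pass idea, before any vectorization (the names and the mirror-the-upper-triangle transformation are assumed): the offset store is unconditional, and only the cursor advance depends on the input bit, which compiles to an add rather than a branch.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Pass 1 records, branch-free, the lower-triangle offsets that need
// a 1; pass 2 writes them. Invalid offsets are overwritten by the
// next iteration because the cursor only advances on hits.
void mirror_two_pass(uint8_t* m, std::size_t n) {
    std::vector<std::size_t> offsets(n * n);
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {
            offsets[count] = j * n + i;       // always written
            count += (m[i * n + j] != 0);     // cursor moves only on hits
        }
    for (std::size_t k = 0; k < count; ++k)   // pass 2: sequential reads,
        m[offsets[k]] = 1;                    // scattered writes of ones
}
```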

The tricky part is making the first pass vectorized and branch free. You'd
probably create the output addresses with vector math, shuffle these based on
_mm256_testz_si256() on the input, write the full vector to the temp array,
and then advance the output pointer by the number of valid entries. Believe it
or not, on modern processors (especially Haswell) this can be faster than the
single pass approach.

Alternatively, you could try to make better use of prefetching. On modern
processors, you can have about 10 requests to memory outstanding at a time,
but most loops will only have a single outstanding request at a time. If you
can batch things to consider 8 writes at a time, you can reduce the
effective latency of the random access from ~100 cycles per access to ~15
cycles. It can be very tricky to do this, though, since the compiler will
usually be working against you and optimizing out your optimizations!
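
A hedged sketch of that batching idea, again assuming the mirror-the-upper-triangle transformation: gather eight input bytes first, then issue the eight dependent stores together so their cache misses can overlap.

```cpp
#include <cstdint>
#include <cstddef>

// Batch the scattered writes eight at a time: the eight input reads
// are sequential and cheap; grouping the eight stores afterwards
// gives the memory system a chance to keep several misses in flight.
void mirror_batched(uint8_t* m, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t j = i + 1;
        for (; j + 8 <= n; j += 8) {
            uint8_t v[8];
            for (int k = 0; k < 8; ++k)        // gather 8 inputs
                v[k] = m[i * n + j + k];
            for (int k = 0; k < 8; ++k)        // 8 independent stores
                if (v[k]) m[(j + k) * n + i] = 1;
        }
        for (; j < n; ++j)                     // remainder, one at a time
            if (m[i * n + j]) m[j * n + i] = 1;
    }
}
```

Whether the compiler keeps the two inner loops separate is exactly the fight described above; checking the generated assembly is the only way to be sure.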

A last thing you might consider would be looping over the output instead of
the input. Read the output (lower left) sequentially, and then unconditionally
write it back depending on the value of the corresponding input. This would
also depend heavily on prefetching, but this arrangement might have the
compiler working for you rather than against you.

Amazingly, if you arrange things correctly, processors are actually executing
multiple iterations from the same loop at the same time. The key here would be
that the writes need to be unconditional, so that you aren't mispredicting
branches and impairing the processor's ability to work out-of-order.
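
Under the same assumed transformation, that output-oriented loop might look like this: the lower triangle is written sequentially and unconditionally, and the random access moves to the read side.

```cpp
#include <cstdint>
#include <cstddef>

// Walk the lower triangle row by row (sequential writes) and OR in
// the mirrored input bit fetched by a random read. Every element is
// written, so there is no data-dependent branch to mispredict.
void mirror_by_output(uint8_t* m, std::size_t n) {
    for (std::size_t j = 1; j < n; ++j)        // rows of the lower triangle
        for (std::size_t i = 0; i < j; ++i)
            m[j * n + i] = m[j * n + i] | (uint8_t)(m[i * n + j] != 0);
}
```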

~~~
nkurz
I tried the last option I mentioned (unconditional sequential writes and
random reads), and got approximately the results I expected on Haswell. I'm
managing one read/write combo about every 8 cycles, which I think is about the
maximum throughput that can be expected for random reads from RAM by Little's
law.

Somewhat as expected, the results for Karim's approach were faster if the
input array was very sparse. Unexpectedly, a simple conditional on a basic
approach performed better for me on Haswell than Karim's. It's possible I did
something to damage the code when I sprinkled it with 'unsigned' to quiet the
warnings about signed/unsigned comparisons.

I didn't find any benefit to prefetching for code that was already fast. I
wasn't up to trying the vectorized approach. For reasons I don't understand,
performance for just about everything I tried was abysmal on Sandy Bridge. For
many cases, I was finding 3-5x better absolute performance on Haswell, both
for small and large arrays.

I put my test code up here:
[https://gist.github.com/nkurz/43bd7754155d63381758](https://gist.github.com/nkurz/43bd7754155d63381758)

