Efficient transformation of Adjacency Matrix using cache optimized C++ code (medium.com/karim_ouda)
17 points by karimouda on Jan 31, 2015 | 3 comments



I believe an efficient algorithm should combine the recursive approach http://stackoverflow.com/a/28027063 with an SSE-based small-matrix transpose http://stackoverflow.com/a/16743203
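For reference, a minimal sketch of the SSE building block from that second answer, assuming float storage for illustration (the adjacency matrix in the article may well use a different element type):

    #include <stddef.h>
    #include <xmmintrin.h>

    // Transpose one 4x4 block of floats in place. _MM_TRANSPOSE4_PS is
    // the standard SSE shuffle sequence from xmmintrin.h; a recursive
    // scheme would subdivide the matrix down to blocks of this size.
    void transpose4x4(float* m, size_t stride) {
        __m128 r0 = _mm_loadu_ps(m + 0 * stride);
        __m128 r1 = _mm_loadu_ps(m + 1 * stride);
        __m128 r2 = _mm_loadu_ps(m + 2 * stride);
        __m128 r3 = _mm_loadu_ps(m + 3 * stride);
        _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
        _mm_storeu_ps(m + 0 * stride, r0);
        _mm_storeu_ps(m + 1 * stride, r1);
        _mm_storeu_ps(m + 2 * stride, r2);
        _mm_storeu_ps(m + 3 * stride, r3);
    }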

Also, a good reference point for performance comparisons is the math libraries: Eigen, MTL, IBM ESSL, or something like that.
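For example, an Eigen baseline might look something like this (assuming the transformation reduces to a plain transpose for comparison purposes, which is a simplification):

    #include <Eigen/Dense>

    int main() {
        // Time this against the hand-rolled version as a sanity check.
        Eigen::MatrixXi m = Eigen::MatrixXi::Random(1024, 1024);
        m.transposeInPlace();
    }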


It's an interesting sample problem. I might try to play with it tonight to see what I can come up with. If you haven't done it yet, you should profile your current solution using the hardware performance counters. On Linux, you can just run with 'perf record ./program' and view the results with 'perf report'; there are parallels for other systems.

A first note is that testing against a -O0 base case is cheating. Although the manual says "no optimization", what it really means is something like "do really inefficient things that might make debugging easier". For a fair comparison, your base case should probably be compiled at the same optimization level as your final version.

It's hard to talk about the efficiency of a C++ routine without considering the target machine, the compiler, the size of the input, and the distribution of the input. Compilers are sufficiently smart that the 'base case' might not be doing what you expect it to. Switching the order of loops is fair game for an optimizer, so one reason you might not have seen gain in the simple approaches is that the compiler was already doing them.
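To illustrate, these two loop orderings compute the same thing, and an optimizer at -O2/-O3 is free to pick whichever has the friendlier access pattern:

    #include <stddef.h>

    // Row-major layout: sequential reads, writes stride by N.
    void transpose_ij(const int* a, int* b, size_t N) {
        for (size_t i = 0; i < N; ++i)
            for (size_t j = 0; j < N; ++j)
                b[j * N + i] = a[i * N + j];
    }

    // Swapped order: strided reads, sequential writes.
    void transpose_ji(const int* a, int* b, size_t N) {
        for (size_t j = 0; j < N; ++j)
            for (size_t i = 0; i < N; ++i)
                b[j * N + i] = a[i * N + j];
    }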

It would be interesting to test your current solution with different degrees of sparsity. Currently it avoids writing to memory if the input is 0. This is great, unless the branch for this check is unpredictable. At 50% fill, you'll mispredict half of your checks. In the naive case this doesn't matter much, since the main cost is the random access for the writes. But now that you are blocking, it might become more relevant, so it's worth seeing how the timing changes as you vary the fill.
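For concreteness, I'm picturing the check as something like this (my guess at the shape of the article's inner loop, not its actual code):

    #include <stddef.h>

    // Skip-zero variant: the expensive random-access write happens only
    // for nonzero entries, at the price of one branch per element whose
    // predictability depends on the fill ratio.
    void transpose_sparse(const int* a, int* b, size_t N) {
        for (size_t i = 0; i < N; ++i)
            for (size_t j = 0; j < N; ++j)
                if (a[i * N + j])
                    b[j * N + i] = 1;
    }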

For another approach, it's possible that you'd benefit by making two passes. Instead of an if statement to decide whether you should write the output, first make a vectorized branch-free pass that writes a vector of addresses to update, and then make a second pass that writes ones to these addresses.

The tricky part is making the first pass vectorized and branch free. You'd probably create the output addresses with vector math, shuffle these based on _mm256_testz_si256() on the input, write the full vector to the temp array, and then advance the output pointer by the number of valid entries. Believe it or not, on modern processors (especially Haswell) this can be faster than the single pass approach.
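Roughly, a sketch of that first pass. I've used _mm256_movemask_ps() plus a small permutation table for the left-packing rather than _mm256_testz_si256(), since that's the variant I know works; treat the details as a starting point:

    #include <stddef.h>
    #include <stdint.h>
    #include <immintrin.h>

    // Maps an 8-bit "lane is nonzero" mask to a permutation that packs
    // the selected lanes to the left. Filled once at startup.
    static uint32_t pack_table[256][8];

    static void init_pack_table() {
        for (int mask = 0; mask < 256; ++mask) {
            int k = 0;
            for (int bit = 0; bit < 8; ++bit)
                if (mask & (1 << bit))
                    pack_table[mask][k++] = bit;
            for (; k < 8; ++k)
                pack_table[mask][k] = 0;  // padding, overwritten later
        }
    }

    // First pass for one row: append the column indices of nonzero int32
    // entries to out[], branch-free. Assumes N % 8 == 0 and that out has
    // at least 7 elements of slack past the last valid entry.
    size_t collect_nonzero(const int32_t* row, size_t N, uint32_t* out) {
        size_t n = 0;
        const __m256i zero = _mm256_setzero_si256();
        const __m256i lanes = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        for (size_t j = 0; j < N; j += 8) {
            __m256i v = _mm256_loadu_si256((const __m256i*)(row + j));
            __m256i is_zero = _mm256_cmpeq_epi32(v, zero);
            unsigned mask = ~_mm256_movemask_ps(_mm256_castsi256_ps(is_zero)) & 0xFF;
            __m256i idx = _mm256_add_epi32(lanes, _mm256_set1_epi32((int)j));
            __m256i perm = _mm256_loadu_si256((const __m256i*)pack_table[mask]);
            _mm256_storeu_si256((__m256i*)(out + n),
                                _mm256_permutevar8x32_epi32(idx, perm));
            n += (size_t)__builtin_popcount(mask);
        }
        return n;
    }

The second pass then just walks these collected indices and writes ones at the transposed addresses.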

Alternatively, you could try to make better use of prefetching. On modern processors, you can have about 10 requests to memory outstanding at a time, but most loops will only have a single outstanding request in flight. If you can batch things to consider 8 writes at a time, you can reduce the effective latency of the random access from ~100 cycles per access to ~15 cycles. It can be very tricky to do this, though, since the compiler will usually be working against you and optimizing out your optimizations!
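The usual shape is a software prefetch some fixed distance ahead of the use. A sketch, where the distance is a guess that needs tuning per machine:

    #include <stddef.h>
    #include <stdint.h>
    #include <xmmintrin.h>  // _mm_prefetch

    // Write ones at n random destinations, keeping several cache misses
    // in flight by prefetching a fixed distance ahead of the writes.
    void scatter_ones(int* a, const uint32_t* idx, size_t n) {
        const size_t dist = 8;  // prefetch distance: a guess, tune it
        for (size_t k = 0; k < n; ++k) {
            if (k + dist < n)
                _mm_prefetch((const char*)&a[idx[k + dist]], _MM_HINT_T0);
            a[idx[k]] = 1;
        }
    }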

A last thing you might consider would be looping over the output instead of the input. Read the output (lower left) sequentially, and write it back unconditionally, with the written value depending on the corresponding input. This would also depend heavily on prefetching, but this arrangement might have the compiler working for you rather than against you.
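If I'm reading the problem right (mirroring the upper triangle into the lower one, which is my assumption about the article's transformation), that looks something like:

    #include <stddef.h>

    // Sequential read-modify-write over the lower triangle; the strided
    // reads from the upper triangle are the random accesses, and every
    // iteration writes unconditionally.
    void mirror_upper_to_lower(int* a, size_t N) {
        for (size_t i = 1; i < N; ++i)
            for (size_t j = 0; j < i; ++j)
                a[i * N + j] |= a[j * N + i];
    }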

Amazingly, if you arrange things correctly, processors will actually be executing multiple iterations of the same loop at the same time. The key here is that the writes need to be unconditional, so that you aren't mispredicting branches and impairing the processor's ability to work out-of-order.


I tried the last option I mentioned (unconditional sequential writes and random reads), and got approximately the results I expected on Haswell. I'm managing one read/write combo about every 8 cycles, which I think is about the maximum throughput that can be expected for random reads from RAM by Little's law: ~100 cycles of memory latency divided by ~10 requests in flight gives on the order of one access every 10 cycles.

Somewhat as expected, the results for Karim's approach were faster if the input array was very sparse. Unexpectedly, a simple conditional added to the basic approach performed better for me on Haswell than Karim's did. It's possible I did something to damage the code when I sprinkled it with 'unsigned' to quiet the warnings about signed/unsigned comparisons.

I didn't find any benefit to prefetching for code that was already fast. I wasn't up to trying the vectorized approach. For reasons I don't understand, performance for just about everything I tried was abysmal on Sandy Bridge. For many cases, I was finding 3-5x better absolute performance on Haswell, both for small and large arrays.

I put my test code up here: https://gist.github.com/nkurz/43bd7754155d63381758



