Assist threads

fmstephe · on Oct 12, 2015

This can be done now with hyper-theads. It is discussed in Ulrich Drepper's excellent 'What every programmer should know about memory' in section 6.3.4.

http://www.akkadia.org/drepper/cpumemory.pdf

To be clear I am not declaring 'been done before, not interested' I like this blog post and if your software is cpu bound then you are almost certain memory access bound. So there is a lot of value in finding better ways to speed up access.

nkurz · on Oct 12, 2015

This is an excellent area to look for optimizations, but my instinct is that a "pre-runner" approach is going to be of very limited use. If it's 1:1 and the problem is bound by memory latency, the pre-runner is going to go no faster than the working thread. If it's N:1, why not simplify the architecture by doing the computation in the pre-runner?

I think the only case it would make sense is with a heterogenous architecture where you have unlimited small cores available, but in this case there may be better solutions. There's nothing wrong with spreading this work across multiple cores, but in most cases every core will have underutilized memory operation ports. What we need are compilers (and languages) that are smart enough to make effective use of these.

I think this is much better solved by small batch processing and more effective use of software prefetch. Instead of having each core handling one "request" at a time, batch them so that it's handling about 5 at a time (which happens to be half the per core limit of outstanding memory accesses). As a batch, issue all the lookups (with prefetch or directly to registers), then as a batch process the now in-cache results of the previous requests. Magic --- by keeping the request queue full, you've increased your throughput 5x[+] at the cost of a small additional static latency while you build your batches!

[+] This would be theoretical for a memory-latency-bound pointer-chasing scenario. There are caveats once you get away from this to the real world, but 2x-4x is achievable if you can parallelize your latencies.

fmstephe · on Oct 12, 2015

Modern CPUs do exactly this right now. If you are using a powerful x86 (not an old atom), it will process your software out of order. Both your compiler and the CPU itself are aggressively trying to execute as many LOADs as possible before work can't continue.

Or were you describing something more than that?

nkurz · on Oct 12, 2015

I'm arguing that modern superscalar out-of-order hardware supports lots of parallelism, but that modern compilers (and software practices) leave a lot of performance on the table. The key to overcoming latency is concurrency, and current software rarely manages to reach its potential. This makes it look like hardware isn't getting faster, but really it's just not getting faster at running software targeted at earlier in-order CPU's.

Consider performing a series of simplified hash table lookups based a list of strings. You hash the first string, use the prefix of the hash as an array offset to look up another pointer, use the pointer to access a variable length array, and then check whether you have a matching element. Then you hash the second string, and repeat. Third, fourth, etc.

To achieve high performance, you need to overlap the lookups, so that the latencies are shared rather than sequential. But I don't know of any compilers (although I'm only familiar with GCC, Intel, and Clang) smart enough to do so. The hardware tries to get out ahead, but will almost always experience a branch prediction error before it can start working on the next iteration. And if it does manage, it will usually get at most a single iteration ahead.

By contrast, if you were to do 8 hashes in parallel (often just as fast with AVX2), and then issue 8 loads, then chase those 8 pointers in parallel, you've increased your throughput almost 8x. Little's Law is the low-hanging fruit, and you don't need to figure out how to do cycle accurate synchronization between cores.

I realize I probably sound like a quack. Here's a more formal paper describing some similar approaches with a Cuckoo Hash: http://www.cs.columbia.edu/~orestis/sigmod15.pdf

fmstephe · on Oct 13, 2015

I'm glad I commented on your comment. You certainly were describing much more that out-of-order execution. Good food for thought :)

nkurz · on Oct 15, 2015

Just came across a good paper that discusses these issues: http://arxiv.org/pdf/1509.05053v1

The approach it from a different angle, but come to similar conclusions:

Our conclusion is contrary to our expectations and to previous findings and goes against conventional wisdom, which states that accesses to RAM are slow, and should be minimized. A more accurate statement, that accounts for our findings, is that accesses to RAM have high latency and this latency needs to be mitigated.

...

Our results show that, in a processor that has speculative execution, one can obtain many of the benefits of runahead even on processors that do not specifically implement it. While this may not be news to hardware experts, to the best of our knowledge, we are unaware of any algorithms (or even implementations) that deliberately make use of it.

I would suggest this is a very fruitful direction to pursue.

fmstephe · on Oct 19, 2015

Great paper, I am building a red-black tree at the moment (well, rebuilding one I wrote earlier), so I will try test out some of the techniques described here and see if I can get any speed out of it.

If I get any clear results, I'll ping you back here. Cheers.