
Assist threads - luu
http://arrdem.com/2015/10/10/assist_threads/
======
fmstephe
This can be done now with hyper-threads. It is discussed in Ulrich Drepper's
excellent 'What every programmer should know about memory' in section 6.3.4.

[http://www.akkadia.org/drepper/cpumemory.pdf](http://www.akkadia.org/drepper/cpumemory.pdf)

To be clear, I am not declaring 'been done before, not interested'. I like this
blog post, and if your software is CPU bound then it is almost certainly memory
access bound. So there is a lot of value in finding better ways to speed up
access.

------
nkurz
This is an excellent area to look for optimizations, but my instinct is that a
"pre-runner" approach is going to be of very limited use. If it's 1:1 and the
problem is bound by memory latency, the pre-runner is going to go no faster
than the working thread. If it's N:1, why not simplify the architecture by
doing the computation in the pre-runner?

I think the only case where it would make sense is with a heterogeneous
architecture where you have unlimited small cores available, but in this case
there may be
better solutions. There's nothing wrong with spreading this work across
multiple cores, but in most cases every core will have underutilized memory
operation ports. What we need are compilers (and languages) that are smart
enough to make effective use of these.

I think this is much better solved by small batch processing and more
effective use of software prefetch. Instead of having each core handling one
"request" at a time, batch them so that it's handling about 5 at a time (which
happens to be half the per core limit of outstanding memory accesses). As a
batch, issue all the lookups (with prefetch or directly to registers), then as
a batch process the now in-cache results of the previous requests. Magic ---
by keeping the request queue full, you've increased your throughput 5x[+] at
the cost of a small additional static latency while you build your batches!

[+] This would be theoretical for a memory-latency-bound pointer-chasing
scenario. There are caveats once you get away from this to the real world, but
2x-4x is achievable if you can parallelize your latencies.
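A minimal sketch of this batching idea in C (the linked-list workload, the
`Node` layout, and the batch size are all illustrative, not from the thread):
instead of chasing one pointer chain at a time, advance several chains in
lockstep, issuing all the prefetches before consuming any node, so the cache
misses overlap.

```c
#include <stddef.h>

typedef struct Node {
    struct Node *next;
    long value;
} Node;

/* Sum the values of n independent linked lists, advancing all of them
 * together so their cache misses overlap instead of serializing. */
long sum_batched(Node *heads[], int n) {
    long total = 0;
    for (;;) {
        int active = 0;
        /* 1. Issue the loads for every chain first... */
        for (int i = 0; i < n; i++)
            if (heads[i])
                __builtin_prefetch(heads[i]->next); /* hint only; NULL is safe */
        /* 2. ...then consume the now (hopefully) in-cache nodes. */
        for (int i = 0; i < n; i++) {
            if (heads[i]) {
                total += heads[i]->value;
                heads[i] = heads[i]->next;
                active = 1;
            }
        }
        if (!active)
            break;
    }
    return total;
}
```

With n around 5, the request queue stays full as described above; `__builtin_prefetch`
is the GCC/Clang builtin, so a portable version would hide it behind a macro.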

~~~
fmstephe
Modern CPUs do exactly this right now. If you are using a powerful x86 (not an
old Atom), it will execute your software out of order. Both your compiler and
the CPU itself aggressively try to issue as many LOADs as possible before
execution stalls.

Or were you describing something more than that?

~~~
nkurz
I'm arguing that modern superscalar out-of-order hardware supports lots of
parallelism, but that modern compilers (and software practices) leave a lot of
performance on the table. The key to overcoming latency is concurrency, and
current software rarely manages to reach its potential. This makes it look
like hardware isn't getting faster, but really it's just not getting faster at
running software targeted at earlier in-order CPUs.

Consider performing a series of simplified hash table lookups based on a list
strings. You hash the first string, use the prefix of the hash as an array
offset to look up another pointer, use the pointer to access a variable length
array, and then check whether you have a matching element. Then you hash the
second string, and repeat. Third, fourth, etc.
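The dependent-load chain described here might look like this in C (the table
layout, names, and toy hash are all invented for illustration):

```c
#include <stdint.h>
#include <string.h>

#define BUCKET_BITS 16

typedef struct {
    const char *key; /* NULL marks the end of a bucket's entry array */
} Entry;

typedef struct {
    Entry *buckets[1 << BUCKET_BITS]; /* hash prefix -> entry array */
} Table;

/* Toy FNV-1a hash; stands in for whatever hash the real table uses. */
static uint64_t hash_str(const char *s) {
    uint64_t h = 1469598103934665603ULL;
    while (*s) { h ^= (uint8_t)*s++; h *= 1099511628211ULL; }
    return h;
}

/* One lookup: hash -> array offset -> pointer -> scan for a match.
 * Each step depends on the previous load, so the latencies serialize. */
const Entry *lookup(const Table *t, const char *key) {
    uint64_t h = hash_str(key);
    Entry *bucket = t->buckets[h >> (64 - BUCKET_BITS)]; /* dependent load */
    for (Entry *e = bucket; e && e->key; e++)            /* another one    */
        if (strcmp(e->key, key) == 0)
            return e;
    return NULL;
}
```

Done one string at a time, each cache miss in this chain must complete before
the next can even be issued.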

To achieve high performance, you need to overlap the lookups, so that the
latencies are shared rather than sequential. But I don't know of any compilers
(although I'm only familiar with GCC, Intel, and Clang) smart enough to do so.
The hardware tries to get out ahead, but will almost always experience a
branch prediction error before it can start working on the next iteration. And
if it does manage, it will usually get at most a single iteration ahead.

By contrast, if you were to do 8 hashes in parallel (often just as fast with
AVX2), and then issue 8 loads, then chase those 8 pointers in parallel, you've
increased your throughput almost 8x. Little's Law is the low-hanging fruit,
and you don't need to figure out how to do cycle accurate synchronization
between cores.
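A scalar sketch of the same idea (skipping the AVX2 hashing; the table layout
and hash function are hypothetical): hash the whole batch, issue all the loads,
then consume the results, so the eight misses are in flight at once.

```c
#include <stdint.h>
#include <stddef.h>

#define B 8 /* batch width: number of lookups in flight at once */

/* Toy multiplicative hash; stands in for the real (possibly AVX2) hash. */
static inline uint64_t hash_u64(uint64_t x) {
    return x * 0x9E3779B97F4A7C15ULL;
}

/* Batched probe of a power-of-two table of values: hash all keys,
 * prefetch all slots, then read them, so the misses overlap. */
void lookup_batch(const uint64_t *table, uint64_t mask,
                  const uint64_t keys[B], uint64_t out[B]) {
    size_t idx[B];
    for (int i = 0; i < B; i++) {            /* 1. hash the whole batch  */
        idx[i] = (size_t)(hash_u64(keys[i]) & mask);
        __builtin_prefetch(&table[idx[i]]);  /* 2. issue all the loads   */
    }
    for (int i = 0; i < B; i++)              /* 3. consume in-cache data */
        out[i] = table[idx[i]];
}
```

This is Little's Law in miniature: same latency per miss, but eight misses
outstanding instead of one.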

I realize I probably sound like a quack. Here's a more formal paper describing
some similar approaches with a Cuckoo Hash:
[http://www.cs.columbia.edu/~orestis/sigmod15.pdf](http://www.cs.columbia.edu/~orestis/sigmod15.pdf)

~~~
fmstephe
I'm glad I commented on your comment. You certainly were describing much more
than out-of-order execution. Good food for thought :)

~~~
nkurz
Just came across a good paper that discusses these issues:
[http://arxiv.org/pdf/1509.05053v1](http://arxiv.org/pdf/1509.05053v1)

It approaches the problem from a different angle, but comes to similar conclusions:

 _Our conclusion is contrary to our expectations and to previous findings and
goes against conventional wisdom, which states that accesses to RAM are slow,
and should be minimized. A more accurate statement, that accounts for our
findings, is that accesses to RAM have high latency and this latency needs to
be mitigated._

 _..._

 _Our results show that, in a processor that has speculative execution, one
can obtain many of the benefits of runahead even on processors that do not
specifically implement it. While this may not be news to hardware experts, to
the best of our knowledge, we are unaware of any algorithms (or even
implementations) that deliberately make use of it._

I would suggest this is a very fruitful direction to pursue.

~~~
fmstephe
Great paper. I am building a red-black tree at the moment (well, rebuilding
one I wrote earlier), so I will try to test out some of the techniques
described here and see if I can get any speed out of it.

If I get any clear results, I'll ping you back here. Cheers.

