

Packing the Instruction Pipeline - nkurz
https://github.com/knappador/pipe-packing-demo

======
nkurz
Nice article. Here are some quick thoughts about it and the code, realizing
that I don't quite understand your computational goals.

 _Padding the inputs to align the loop and changing the conditions could
improve this more._

Not sure which 'conditions', but I'm doubtful that you'll gain by aligning the
loops. Loops this small should be running out of a "loop cache" or "micro-op
cache" that doesn't care about instruction addresses.

 _You will almost never have a hint to tell you that a pipeline is waiting on
results to become available._

Probably true, although using 'perf' or 'likwid' to measure instructions
executed per cycle can answer the question if you think to ask it.

 _assert(fv.valley != NULL);_

Using 'assert' for error handling is strongly discouraged. You could argue for
it, but you are bucking tradition.

 _// three yields almost no changes, so it's not implemented_

Looks like 3 is implemented, but 4 is not?

 _This PoC demonstrates that in a very tight loop, read-dependency causes the
loop body to be so small that it doesn't even fill the entire pipeline
(12-14ish on modern x86 CPU's) and therefore expanding the loop body to do
more work will increase performance._

There are ways to interpret this that seem correct, but I don't like phrasing
the goal as 'filling the pipeline'. The goal is avoiding pipeline stalls: the
length of the loop is mostly irrelevant. Tight loops (shorter than pipeline
depth) are just fine if the loops are independent, and if they are not
independent, you may find it difficult to create the parallelism you need to
interleave.

You allude to it in the description, but I think a lot of the gain you are
seeing may be due to Superscalar Execution (instruction level parallelism)
rather than "filling the pipeline".

I find Agner Fog's articles and charts tremendously useful for trying to
understand and optimize performance at this level:
[http://www.agner.org/optimize/](http://www.agner.org/optimize/)

Related to this, you didn't mention what CPU you are using. The clock speed is
largely irrelevant to interpreting your numbers, but the 'generation' (Nehalem
vs Sandy Bridge vs Haswell) can be very significant to how instructions fit
together (their latencies and which can be executed simultaneously).

 _90% of the execution (more?) is spent in the following code sections._

If you are on Linux, start using 'perf' to answer questions like this quickly.

 _Only an object dump will tell if the actual assembly is tighter and thereby
more sensitive to speculative execution optimization._

Intel's IACA is another great tool to know about. It's buggy and awkward, but
does a great job of replacing the process of scribbling latencies on napkins.
Once you figure out how to interpret its results, it will generally answer
your questions about data dependence:

[http://software.intel.com/en-us/articles/intel-architecture-code-analyzer](http://software.intel.com/en-us/articles/intel-architecture-code-analyzer)

 _The convert instruction moves two doubles into a single?_

No, not really. Intel's docs are probably the best reference for these
questions:
[http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf)

The various abbreviations used in the instructions are useful to understand.
For example:

    ss: scalar single precision
    sd: scalar double precision
    ps: packed single precision
    pd: packed double precision

 _s_real = ((drand48() - 0.5) * 0.001) - 0.743643135;_

The conversions you are seeing are actually a problem. The floating point
literals default to double precision. As a result, you are doing the math as a
double and then converting back to a float. You probably should tack an "F" on
to the end of your constants to specify that you want single precision floats
like your variables. I'm not sure it will improve performance here, but it
will definitely make the assembly simpler to understand.

 _GCC (and probably most others) seem to have essentially zero awareness or
capability to deal with instruction-level parallelism._

Generally true, although Intel's compiler is worth comparing. They offer a
free academic license.

 _Unfortunately the compiler had reasons to rearrange certain operations..._

This is a very dangerous assumption to make. I think a better attitude is
"something I did in the code made the compiler do something strange" or even
just "compilers are dumb". Assuming that strange changes a compiler makes
improve the code is often a losing bet.

Hope this is helpful. Thanks for the write up.

~~~
knappador
Thanks for the review. Bookmarked so I can eventually improve my write-up.

