
Exploring .NET Core platform intrinsics: Part 4 – Alignment and pipelining - benaadams
https://mijailovic.net/2018/07/20/alignment-and-pipelining/
======
zvrba
Further optimization potential: the four lines

    
    
        sum = Avx2.Add(block0, sum);
        sum = Avx2.Add(block1, sum);
        sum = Avx2.Add(block2, sum);
        sum = Avx2.Add(block3, sum);
    

all have a serializing dependency on the sum variable. But (integer) addition
is associative and commutative, so you could sum the blocks in a tree-like
manner, ending up with only a single serializing dependency:

    
    
        sum01 = Avx2.Add(block0, block1);
        sum23 = Avx2.Add(block2, block3); // These two run in parallel
        sum = Avx2.Add(sum, sum01); // sum01 hopefully ready; parallel with sum23
        sum = Avx2.Add(sum, sum23); // sum23 hopefully ready
    

Here only the last line serializes with the previous one. Maybe the HW is
smart enough to rename the registers and do the same thing internally, but
it'd be interesting to benchmark it.

~~~
Metalnem
I already tried that, but was disappointed that the performance gain was only
1%, which is why I didn't include the optimization in the post.

~~~
physguy1123
You should try maintaining 4 independent sum variables and summing after the
loop so there's no serializing dependency at all. Such a transformation in
microbenchmarks is a fun trick to show the power of a proper OOO engine with
pipelined instruction units. Assuming no memory problems, one should be able
to use (issue width * instruction latency) independent sum streams without
spending more time in the hot loop.

For what it's worth, the vmovdqa only has a 4-wide issue width if it is moving
between registers; the memory load has a 2-wide issue width. Floating-point
adders themselves only have a 1-2 wide issue width, depending on your
hardware, so it doesn't really matter.
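
The four-independent-sums idea can be sketched roughly as follows. This is
only an illustration, not the code from the article: the class name
IndependentSums and the method Sum are made up for the example, the input is
assumed to be a plain int array, and the loads go through MemoryMarshal.Cast
instead of the article's aligned loads. Whether it actually beats the
single-accumulator loop would have to be benchmarked.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class IndependentSums
{
    // Hypothetical helper (not from the article): sums ints using four
    // independent vector accumulators that are only combined after the loop.
    public static int Sum(ReadOnlySpan<int> data)
    {
        int total = 0;
        int consumed = 0;

        if (Avx2.IsSupported)
        {
            // View the input as 256-bit blocks of eight ints each.
            // No particular alignment is assumed here.
            ReadOnlySpan<Vector256<int>> blocks =
                MemoryMarshal.Cast<int, Vector256<int>>(data);

            Vector256<int> sum0 = Vector256<int>.Zero;
            Vector256<int> sum1 = Vector256<int>.Zero;
            Vector256<int> sum2 = Vector256<int>.Zero;
            Vector256<int> sum3 = Vector256<int>.Zero;

            int i = 0;
            for (; i + 4 <= blocks.Length; i += 4)
            {
                // Each add depends only on its own accumulator, so the
                // four dependency chains can progress independently.
                sum0 = Avx2.Add(sum0, blocks[i]);
                sum1 = Avx2.Add(sum1, blocks[i + 1]);
                sum2 = Avx2.Add(sum2, blocks[i + 2]);
                sum3 = Avx2.Add(sum3, blocks[i + 3]);
            }

            // Up to three whole blocks may be left over.
            for (; i < blocks.Length; i++)
            {
                sum0 = Avx2.Add(sum0, blocks[i]);
            }

            // Combine the accumulators once, outside the hot loop.
            Vector256<int> sum = Avx2.Add(Avx2.Add(sum0, sum1),
                                          Avx2.Add(sum2, sum3));

            // Horizontal sum of the eight lanes.
            for (int lane = 0; lane < Vector256<int>.Count; lane++)
            {
                total += sum.GetElement(lane);
            }

            consumed = blocks.Length * Vector256<int>.Count;
        }

        // Scalar tail (or the whole input when AVX2 is unavailable).
        for (int j = consumed; j < data.Length; j++)
        {
            total += data[j];
        }

        return total;
    }
}
```

Calling IndependentSums.Sum(new[] { 1, 2, 3 }) takes only the scalar tail and
returns 6. Integer overflow wraps silently here, as it does with Avx2.Add.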

------
rossnordby
Seeing the intrinsics APIs get filled out, in the open no less, has been
pretty exciting. The fact that something like AES would be implemented
competitively in C# is not something I would have predicted even five years
ago.

It's remarkable how fast the language and runtime have evolved for
performance. It wasn't that long ago that I was manually inlining Vector3
operators to try to get a few extra cycles out of XNA on the Xbox360.

~~~
pjmlp
The Xbox 360 runtime was notoriously bad and suffered from the WinDev/DevTools
difference of opinion over how the future of Windows development should look.

Hence killing XNA when they took over Windows 8 development, WinRT and such.

It took all the reorganizations and a change of politics for the .NET Runtime
to finally start getting some additional love regarding performance.

~~~
oceanswave
> The Xbox 360 runtime was notoriously bad and suffered from the
> WinDev/DevTools difference of opinion over how the future of Windows
> development should look.

The past-tense structure makes it sound like progress has been made on this
front, while it’s still the same problem today. It’s just that those tools
in particular have been deprecated (and not replaced).

~~~
pjmlp
Well, how would you correctly phrase it in proper English then?

That was the reason why the Xbox 360 runtime was bad; the remainder of my
comment refers to the standard .NET Framework.

