
Decimating Array.Sort with AVX2 - matt_d
https://bits.houmus.org/2020-02-02/this-goes-to-eleven-pt5
======
redcalx
I frequently go down rabbit holes like this, one such journey resulted in this
port of Timsort to C#/dotnet:

[https://github.com/colgreen/Redzen/tree/master/Redzen/Sorting](https://github.com/colgreen/Redzen/tree/master/Redzen/Sorting)

Available via this nuget package:

[https://www.nuget.org/packages/Redzen/](https://www.nuget.org/packages/Redzen/)

Timsort may give much improved performance if the items in an array are
already partially sorted. I had a specific need for this, and couldn't find
any good quality implementations in the dotnet world, so ended up putting the
time in to port it and update it to the latest fixed/patched version.

As part of the work I looked at the source for Array.Sort and noticed it was
doing some redundant null checks when you call it with a secondary sort array.
I submitted a pull request to the MS github repo with some supporting
performance numbers, and they merged it in.

It looked to me like the code there was originally a single class that used
null checks to support the secondary sort array; at some point the code was
copied into a second class that handles the secondary sort only, and the first
class had that logic removed, but some of the null checks were apparently left
behind.

Next steps for my timsort will involve moving from dotnet standard to dotnet
core, and starting to leverage some of the new functionality, such as
intrinsics for x86-64 and Arm. So this article will be a good starting point
for that.

~~~
GordonS
A bit OT, but I really like looking through other people's utilities. I've
never used one of these as a whole (e.g. pulling in the NuGet), but you
occasionally find some real gems.

------
_bxg1
It took way too long for me to figure out what language this was about. Would
be good to make that clear up-front.

~~~
recursive
So which language was it about?

~~~
_bxg1
C#

------
watersb
Excellent!

"I ended up going down the rabbit hole re-implementing array sorting with AVX2
intrinsics, and there’s no reason I should go down alone."

:-)

------
vkaku
True story: I found AVX code to run faster than AVX2 despite processing half
as much vectorized input per instruction; it turns out the chips clock down
when running AVX2 instructions. Ditto with AVX-512.

Unless they improve chip processes, reduce power consumption, shrink the
process node and increase density, AVX2 will not be 100% effective for a
WHILE. This is my empirical opinion.

------
vitus
Towards the end, when the author tries to replace the inner branch with
arithmetic ("Replacing the branch with arithmetic"), one thing I noticed is
that the branchless code eliminated about 5x as many branches as it eliminated
branch mispredictions (233k vs 43k).

That suggests to me that the processor was actually doing a pretty good job of
predicting the branches for large inputs (toward the end of the sort, when
values are close together?), rather than the lack of CMOV that the author
asserts is at fault.

edit: that said, I don't think that's the full explanation for why it's
slower, given that the small inputs actually do okay compared to the overlined
version (although most of that is recouping a bit of the losses against the
Jedi baseline).
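The branch-vs-arithmetic tradeoff being debated here can be sketched in scalar C++ (a hypothetical illustration of the general technique, not the author's vectorized code): the compare result becomes a 0/1 value that moves a write cursor, so the loop body has no conditional jump to mispredict, at the cost of a data dependency and extra stores.

```cpp
#include <cassert>

// Branchless partition: every element is speculatively written to both
// ends of the output; the 0/1 compare result decides which cursor moves,
// so the loop carries no conditional jump the predictor could miss.
// Elements < pivot end up in out[0..returned), the rest in out[returned..n)
// (the >= pivot side comes out reversed, which a quicksort doesn't mind).
int partition_branchless(const int* in, int n, int pivot, int* out) {
    int left = 0, right = n - 1;
    for (int i = 0; i < n; ++i) {
        int v = in[i];
        int is_small = v < pivot;  // 0 or 1, typically SETcc in codegen
        out[left] = v;             // kept only if v < pivot
        out[right] = v;            // kept only if v >= pivot
        left += is_small;
        right -= 1 - is_small;
    }
    return left;  // count of elements below the pivot
}
```

The vectorized version in the article scales up the same idea: compare a whole vector against the pivot and use the resulting mask, rather than a branch, to advance the write pointers.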

------
gbronner
The series is pretty well-written and useful for C# practitioners, but I
assume the techniques have been known in the C++ community for a long time. So
while this is faster than an unoptimized C++ implementation, it is no faster
than a similarly optimized C++ implementation; both wind up with nearly the
same assembly instructions.

------
dmaldona
The visuals are beautiful. How does he do it?

~~~
AdamCraven
For the charts he's using Chart.js with chartjs-plugin-rough [1] to give them
the hand-drawn look, Chardin.js for the overlays [2], and it looks like the
rest of the graphics are created by hand using Sodipodi or Inkscape [3].

\----

[1] [https://nagix.github.io/chartjs-plugin-rough/](https://nagix.github.io/chartjs-plugin-rough/)

[2]
[https://heelhook.github.io/chardin.js/](https://heelhook.github.io/chardin.js/)

[3] [https://inkscape.org/](https://inkscape.org/)

------
antpls
On one hand, I found this comment from Linus about CMOV from 2007, explaining
that CMOV is generally a bad idea (though maybe that has changed in the 13
years since):
[https://yarchive.net/comp/linux/cmov.html](https://yarchive.net/comp/linux/cmov.html)

On the other hand : "The slowdown we’ve measured here is directly related to
NOT having CMOV available to us through the CoreCLR JIT."

I guess you would need to benchmark your program on Intel, AMD and ARM, with
CMOV and without, to have a definitive answer.

~~~
throwaway_pdp09
I didn't buy Linus' argument that CMOV was necessarily slow. Anyway, Agner
Fog's instruction tables
([http://www.agner.org/optimize/instruction_tables.pdf](http://www.agner.org/optimize/instruction_tables.pdf))
say that CMOV has a 1-cycle latency on Skylake.

~~~
CalChris
Linus made that rant in 2007 [1]. He was right back then for circa 2007
hardware but Intel has improved CMOV quite a bit over the 13 years since.
Intel now recommends CMOV for unpredictable branches in their _Intel® 64 and
IA-32 Architectures Optimization Reference Manual_ as Assembly/Compiler Coding
Rule 2:

"Use the SETCC and CMOV instructions to eliminate unpredictable conditional
branches where possible. Do not do this for predictable branches. Do not use
these instructions to eliminate all unpredictable conditional branches
(because using these instructions will incur execution overhead due to the
requirement for executing both paths of a conditional branch). In addition,
converting a conditional branch to SETCC or CMOV trades off control flow
dependence for data dependence and restricts the capability of the out-of-
order engine. When tuning, note that all Intel 64 and IA-32 processors usually
have very high branch prediction rates. Consistently mispredicted branches are
generally rare. Use these instructions only if the increase in computation
time is less than the expected cost of a mispredicted branch."

NB: Intel is recommending CMOV for _unpredictable branches._ For well
predictable branches, they're recommending ... branches.

[1]
[https://yarchive.net/comp/linux/cmov.html](https://yarchive.net/comp/linux/cmov.html)
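Intel's rule of thumb shows up in miniature in ordinary C++ (a sketch, not guaranteed codegen - whether CMOV is actually emitted depends on the compiler, flags, and target): a data-dependent select written as a ternary is the kind of code compilers commonly lower to CMOV at -O2 on x86-64, while the explicit if/else form may become a conditional jump.

```cpp
#include <cassert>

// Data-dependent select: for unpredictable inputs, compilers often emit
// CMOVcc for this, trading a possible branch misprediction for a data
// dependency on both operands.
int select_min(int a, int b) {
    return a < b ? a : b;  // typically CMOV, not a conditional jump
}

// The branchy equivalent the optimization manual warns about for
// unpredictable inputs: a conditional jump the predictor must guess.
int select_min_branchy(int a, int b) {
    if (a < b) return a;
    return b;
}
```

Both compute the same result; per the quoted rule, the CMOV form only pays off when the branch would be consistently mispredicted.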

~~~
throwaway_pdp09
That was a dumb oversight on my part not to think to mention this vital point.
Thanks. (also thanks for SETcc, I didn't know of this instruction)

------
beagle3
Is anyone familiar with a fast sort-permutation AVX2 implementation for ints
and/or strings? That is, a sort that doesn't permute its input data, but
rather produces as output a permutation that, when applied to the input data,
produces the same data in sorted order?

(Those are more often what one needs, rather than in-place sorts; and their
absence, or afterthought implementations, leads to requiring Schwartzian
transforms and the like.)
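I'm not aware of a canonical AVX2 one either, but for reference, the scalar version of what's being asked for (often called an argsort) is straightforward - a sketch in C++, with `sort_permutation` as a hypothetical name: the input stays untouched, and the output is an index permutation that reads it in sorted order.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sort-permutation ("argsort"): returns indices perm such that
// data[perm[0]] <= data[perm[1]] <= ... ; data itself is not modified.
std::vector<int> sort_permutation(const std::vector<int>& data) {
    std::vector<int> perm(data.size());
    std::iota(perm.begin(), perm.end(), 0);  // 0, 1, 2, ...
    std::sort(perm.begin(), perm.end(),
              [&](int a, int b) { return data[a] < data[b]; });
    return perm;
}
```

A vectorized variant would presumably sort (key, index) pairs packed into wider lanes, which is roughly what the Schwartzian transform does by hand.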

------
ncmncm
AVX512 intrinsics are more powerful and better-suited for optimal sorting.

[https://arxiv.org/abs/1704.08579](https://arxiv.org/abs/1704.08579)

Of course this presumes you have AVX512 hardware. (AMD gets it in Zen 3.) In
the meantime, AVX2 will have to do.

~~~
saagarjha
AVX512 has long been infamous for causing thermal-related performance drops.

~~~
aseipp
You have to be careful about what you mean when you talk about AVX-512
performance - it's more complicated than you might think at first (as I found
out!). First, the instruction set and the datapath width are different things:
you can use AVX-512 instructions with 128/256-bit vector registers if your CPU
supports AVX512VL, which is nice because the instruction set has a bunch of
good stuff, without incurring the harshest downclock penalties on some
architectures (a different "power license" in Intel terms, which varies per
core on newer systems).

But it's also very uarch dependent. Ice Lake has only a single FMA for
instance, while _some_ Skylake-SP cores will have 2 (there are no 2 FMA
AVX-512 chips besides Skylake/Cascade at the moment.) The power usage is
presumably lower than with 2 FMAs, but the reciprocal throughput is halved. If
you only use 512-bit vectors on one core for instance, Ice Lake downclocks
very very little, while it does not downclock at all with 256-bit vectors. So
it's maybe worth doing. On the other hand, with all cores active, there is a
penalty, but 512-bit vectors have no worse of a downclock penalty _relative to
256-bit ones_ , so you are no worse off just using 512 bit instead of 256 bit
if that's an option. This in effect will give you 2x the data throughput at
the same power usage and same transition cost, which is a net win.[1] Thus,
for any kind of batch work, where you're willing to pay for it, AVX-512 will
probably be a free upgrade on Ice Lake. Ice Lake-SP will presumably change
this up (2 FMAs at minimum) so who knows what will happen then.

However, all AVX transitions in general remain expensive (possibly millions of
cycles dependent on many factors), thus you may not want to use them for very
very short batch work, and you may want to think carefully about big vectors
on older CPUs. It's certainly possible some kinds of code will perform worse
_in practice_ in some workloads - but this was always true, e.g. with AVX2 on
Haswell[2], where you get a full-chip downclock for even a single instruction;
nobody complained as much about that, presumably because nobody wrote lots of
catchy blogs about it :)

And of course, the situation is different for AMD CPUs. Someone else can fill
in the details.

TL;DR: AVX transitions are expensive and microarchitecture dependent. Ice Lake
is pretty good at it, despite being mobile-only for now. Presumably this will
get better with uarch refinements and improved process nodes from Intel, and
also from AMD. Do your own experiments and validate your own hypotheses.

[1]
[https://twitter.com/trav_downs/status/1258226270527197190](https://twitter.com/trav_downs/status/1258226270527197190)
[2]
[https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e02d774](https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e02d774)

~~~
praseodym
Cloudflare did write a blog about getting poorer performance with AVX-512
enabled in a mixed workload:
[https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/](https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/)

~~~
janwas
Important to note that was a Xeon Silver, which is much more affected than
Gold. More in-depth discussion here:
[https://lemire.me/blog/2018/08/13/the-dangers-of-avx-512-throttling-myth-or-reality/](https://lemire.me/blog/2018/08/13/the-dangers-of-avx-512-throttling-myth-or-reality/)

