Available via this NuGet package:
Timsort may give much improved performance if the items in an array are already partially sorted. I had a specific need for this, and couldn't find any good-quality implementations in the .NET world, so I ended up putting in the time to port it and update it to the latest fixed/patched version.
As part of the work I looked at the source for Array.Sort and noticed it was doing some redundant null checks when you call it with a secondary sort array. I submitted a pull request to the Microsoft GitHub repo with some supporting performance numbers, and they merged it in.
It looked to me like the code there was originally one class that used null checks to support the optional secondary sort array. At some point the code was copied into a second class that handles the secondary sort only, and the first class had that logic removed, but some of the null checks were apparently left behind.
Next steps for my Timsort will involve moving from .NET Standard to .NET Core and starting to leverage some of the new functionality, such as intrinsics for x86-64 and Arm. So this article will be a good starting point for that.
> if you join me in this rather long journey, we’ll end up with a replacement function for Array.Sort, written in pure C# that outperforms CoreCLR’s C++ code by a factor north of 10x on most modern Intel CPUs, and north of 11x on my laptop.
You need to read the earlier parts to have any clue what on earth a "DoublePumpJedi" might refer to in this context.
Unless they improve chip processes (reduce power consumption, shrink the node, increase density), AVX2 will not be 100% effective for a WHILE. This is my empirical opinion.
"I ended up going down the rabbit hole re-implementing array sorting with AVX2 intrinsics, and there’s no reason I should go down alone."
That reads to me as though the processor was actually doing a pretty good job predicting the branches for large inputs (toward the end of the sort, when values are close together?), rather than the lack of CMOV that the author asserts is at fault.
edit: that said, I don't think that's the full explanation of why it's slower, given that the small inputs actually do okay compared to the overlined version (although most of that is recouping a bit of the losses against the Jedi baseline).
(Sorted views are more often what one needs, rather than in-place sorts; their absence, or afterthought implementation, leads to requiring Schwartzian Transforms and the like.)
On the other hand: "The slowdown we’ve measured here is directly related to NOT having CMOV available to us through the CoreCLR JIT."
I guess you would need to benchmark your program on Intel, AMD and ARM with CMOV and without to have a definitive answer.
Obviously a CPU could use a predictor for CMOVs as well, although I think the commonly used CPUs don't (but not too sure).
"Use the SETCC and CMOV instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches (because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch). In addition, converting a conditional branch to SETCC or CMOV trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine. When tuning, note that all Intel 64 and IA-32 processors usually have very high branch prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch."
NB: Intel is recommending CMOV for unpredictable branches. For well-predictable branches, they're recommending ... branches.
Of course this presumes you have AVX512 hardware. (AMD gets it in Zen 3.) In the meantime, AVX2 will have to do.
But it's also very uarch dependent. Ice Lake has only a single FMA unit, for instance, while some Skylake-SP cores have 2 (there are no 2-FMA AVX-512 chips besides Skylake/Cascade Lake at the moment). The power usage is presumably lower than with 2 FMAs, but the reciprocal throughput is halved. If you only use 512-bit vectors on one core, for instance, Ice Lake downclocks very little, and it does not downclock at all with 256-bit vectors, so it's maybe worth doing. On the other hand, with all cores active there is a penalty, but 512-bit vectors have no worse a downclock penalty than 256-bit ones, so you are no worse off using 512-bit instead of 256-bit if that's an option. That in effect gives you 2x the data throughput at the same power usage and the same transition cost, which is a net win. Thus, for any kind of batch work where you're willing to pay for it, AVX-512 will probably be a free upgrade on Ice Lake. Ice Lake-SP will presumably change this up (2 FMAs at minimum), so who knows what will happen then.
However, all AVX transitions in general remain expensive (possibly millions of cycles, depending on many factors), so you may not want to use them for very short batch work, and you may want to think carefully about big vectors on older CPUs. It's certainly possible some kinds of code will perform worse in practice in some workloads, but this was always true, e.g. with AVX2 on Haswell, where even a single instruction triggers a full-chip downclock; nobody complained as much about that, presumably because nobody wrote lots of catchy blogs about it :)
And of course, the situation is different for AMD CPUs. Someone else can fill in the details.
TL;DR: AVX transitions are expensive and microarchitecture dependent. Ice Lake is pretty good at it, despite being mobile only for now. Presumably this will get better from uarch refinements and improved process nodes at Intel, and also AMD. Do your own experiments and validate your own hypotheses.
No doubt the 5nm or 3nm chips will actually deliver the 512-bit path. It must be a crushing burden to come up with actually helpful uses for another billion transistors. And it must be frustrating knowing those AVX512 data paths usually sit idle because compilers don't know how to use them.