Decimating Array.Sort with AVX2 (houmus.org)
131 points by matt_d 1 day ago | 31 comments

I frequently go down rabbit holes like this; one such journey resulted in this port of Timsort to C#/dotnet:


Available via this nuget package:


Timsort may give much improved performance if the items in an array are already partially sorted. I had a specific need for this, and couldn't find any good quality implementations in the dotnet world, so ended up putting the time in to port it and update it to the latest fixed/patched version.

As part of the work I looked at the source for Array.Sort and noticed it was doing some redundant null checks when you call it with a secondary sort array. I submitted a pull request to the MS github repo with some supporting performance numbers, and they merged it in.

It looked to me like the code there was originally a single class that used null checks to support the optional secondary sort array. At some point the code was copied into a second class that handles the secondary-sort case only, the original class had that logic removed, and some of the null checks got left behind.
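A minimal sketch of the pattern described, with invented names (the real CoreCLR code differs): the combined routine legitimately guards an optional secondary array, while the copy specialized for the two-array case would carry the same guard as pure overhead.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical combined version: 'items' is optional, so the
// null check is legitimate here.
void swap_keys_and_optional_items(int* keys, int* items,
                                  std::size_t i, std::size_t j) {
    int t = keys[i]; keys[i] = keys[j]; keys[j] = t;
    if (items != nullptr) {  // needed: caller may pass no secondary array
        int u = items[i]; items[i] = items[j]; items[j] = u;
    }
}

// Hypothetical secondary-sort-only copy: 'items' is guaranteed
// non-null by construction, so a surviving 'if (items != nullptr)'
// guard would be redundant work on a hot path -- the kind of
// leftover check the pull request removed.
void swap_keys_and_items(int* keys, int* items,
                         std::size_t i, std::size_t j) {
    int t = keys[i]; keys[i] = keys[j]; keys[j] = t;
    int u = items[i]; items[i] = items[j]; items[j] = u;
}
```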

Next steps for my Timsort port will involve moving from .NET Standard to .NET Core and starting to leverage some of the new functionality, such as intrinsics for x86-64 and Arm. So this article will be a good starting point for that.

A bit OT, but I really like looking through other people's utilities. I've never used one of these as a whole (e.g. pulling in the NuGet), but you occasionally find some real gems.

It took way too long for me to figure out what language this was about. Would be good to make that clear up-front.

It is hard to understand before you notice it is part 5 of a series and go back to read part 1. Part 1 sets the stage far better:

> if you join me in this rather long journey, we’ll end up with a replacement function for Array.Sort, written in pure C# that outperforms CoreCLR’s C++ code by a factor north of 10x on most modern Intel CPUs, and north of 11x on my laptop.

You need to read the earlier parts to have any clue what on earth a "DoublePumpJedi" might refer to in this context.

So which language was it about?


The capitalization didn’t give it away?

Thought it might've been JavaScript with the S mistakenly capitalized, but I'm sure there are lots of other languages with an Array.Sort.

I guessed Haskell.

True story: I found AVX code to run faster than AVX2 despite processing half as much data per vector instruction. It turns out the chips clock down when running AVX2 instructions, and the same goes for AVX-512.

Unless they improve the chip process, reducing power consumption, shrinking the node, and increasing density, AVX2 will not be 100% effective for a WHILE. This is my empirical opinion.


"I ended up going down the rabbit hole re-implementing array sorting with AVX2 intrinsics, and there’s no reason I should go down alone."


Towards the end, when the author tries to replace the inner branch with arithmetic ("Replacing the branch with arithmetic"), one thing I noticed is that the branchless code removed about 5x as many branches as it removed branch mispredictions (233k fewer branches vs. 43k fewer mispredictions).

That reads to me as though the processor was actually doing a pretty good job predicting the branches for large inputs (toward the end of the sort, when values are close together?), rather than the lack of CMOV that the author asserts is at fault.

edit: that said, I don't think that's the full explanation for why it's slower, given that the small inputs actually do okay compared to the overlined version (although most of that is recouping a bit of the losses against the Jedi baseline).

The visuals are beautiful. How does he do it?

For the charts he's using Chart.js with chartjs-plugin-rough [1] to give them the hand-drawn look, Chardin.js for the overlays [2], and it looks like the rest of the graphics are created by hand using Sodipodi or Inkscape [3].


[1] https://nagix.github.io/chartjs-plugin-rough/

[2] https://heelhook.github.io/chardin.js/

[3] https://inkscape.org/

I liked the freeze/scroll UX as well. Not sure how screen-reader friendly it is, but some of it (especially the click to see axes marked) is quite nice.

The series is pretty well written and useful for C# practitioners, but I assume the techniques have been known in the C++ community for a long time. So while it is faster than an unoptimized C++ implementation, it is no faster than a similarly optimized C++ implementation; both wind up with nearly the same assembly instructions.

Is anyone familiar with a fast sort-permutation AVX2 implementation for ints and/or strings? That is, a sort that doesn't permute its input data, but rather produces as output a permutation that, when applied to the input data, produces the same data in sorted order?

(A sort permutation is more often what one needs, rather than an in-place sort; its absence, or afterthought implementation, leads to requiring Schwartzian transforms and the like.)
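For reference, the scalar version of what's being asked for is short in C++ (a sketch, not a vectorized implementation; an AVX2 variant would have to carry index lanes alongside key lanes through the sorting network):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns a permutation p such that data[p[0]], data[p[1]], ... is
// sorted; the input itself is left untouched.
std::vector<std::size_t> sort_permutation(const std::vector<int>& data) {
    std::vector<std::size_t> perm(data.size());
    std::iota(perm.begin(), perm.end(), 0);  // identity: 0, 1, 2, ...
    std::stable_sort(perm.begin(), perm.end(),
                     [&](std::size_t a, std::size_t b) {
                         return data[a] < data[b];
                     });
    return perm;
}
```

For example, sort_permutation on {3, 1, 2} yields {1, 2, 0}: applying that index order to the input reads out 1, 2, 3.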

On one hand, I found this comment from Linus about CMOV in 2007, explaining that CMOV is generally a bad idea (though that may have changed in the 13 years since): https://yarchive.net/comp/linux/cmov.html

On the other hand : "The slowdown we’ve measured here is directly related to NOT having CMOV available to us through the CoreCLR JIT."

I guess you would need to benchmark your program on Intel, AMD and ARM with CMOV and without to have a definitive answer.

If the conditional is almost always true or almost always false and the CPU has a branch predictor, then a branch is likely faster; otherwise, the CMOV will be faster.
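As a minimal illustration (not the article's code), here is the tradeoff in C++: the branchy min is cheap when the comparison is predictable, while the bitwise version has no branch to mispredict and typically compiles to CMOV-like code.

```cpp
#include <cassert>
#include <cstdint>

// Branchy: fast when (a < b) is predictable, costly on mispredictions.
int32_t min_branchy(int32_t a, int32_t b) {
    if (a < b) return a;
    return b;
}

// Branchless: -(a < b) is all-ones when the test is true and zero
// otherwise, so the mask selects a or b with no control flow --
// trading a branch for a data dependence, as Intel's manual notes.
int32_t min_branchless(int32_t a, int32_t b) {
    int32_t mask = -static_cast<int32_t>(a < b);
    return b ^ ((a ^ b) & mask);
}
```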

Obviously a CPU could use a predictor for CMOVs as well, although I think the commonly used CPUs don't (but not too sure).

I didn't buy Linus's argument that CMOV was necessarily slow. Anyway, Agner Fog's instruction tables (http://www.agner.org/optimize/instruction_tables.pdf) say that CMOV has a 1-cycle latency on Skylake.

Linus made that rant in 2007 [1]. He was right back then for circa 2007 hardware but Intel has improved CMOV quite a bit over the 13 years since. Intel now recommends CMOV for unpredictable branches in their Intel® 64 and IA-32 Architectures Optimization Reference Manual as Assembly/Compiler Coding Rule 2:

"Use the SETCC and CMOV instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches (because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch). In addition, converting a conditional branch to SETCC or CMOV trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine. When tuning, note that all Intel 64 and IA-32 processors usually have very high branch prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch."

NB: Intel is recommending CMOV for unpredictable branches. For well predictable branches, they're recommending ... branches.

[1] https://yarchive.net/comp/linux/cmov.html

That was a dumb oversight on my part, not thinking to mention this vital point. Thanks. (Also thanks for SETcc; I didn't know of that instruction.)

Linus specifically explains that for a predictable predicate a branch is preferable and why, while for an unpredictable predicate a conditional move is preferable.

AVX512 intrinsics are more powerful and better-suited for optimal sorting.


Of course this presumes you have AVX512 hardware. (AMD gets it in Zen 3.) In the meantime, AVX2 will have to do.
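Runtime dispatch is the usual workaround; with GCC or Clang, `__builtin_cpu_supports` makes the check a one-liner (a sketch assuming those compilers; MSVC would need `__cpuid` instead):

```cpp
#include <cassert>

// Pick the widest sort implementation the running CPU supports.
// (GCC/Clang builtin; each call is a cheap check of cached CPUID bits.)
const char* pick_sort_impl() {
    if (__builtin_cpu_supports("avx512f")) return "avx512";
    if (__builtin_cpu_supports("avx2"))    return "avx2";
    return "scalar";
}
```

A real dispatcher would call this once at startup and cache a function pointer to the chosen kernel, rather than re-checking per call.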

AVX512 has long been infamous for causing thermal-related performance drops.

You have to be careful what you're talking about when you talk about AVX-512 performance; it's more complicated than you might think at first (as I learned!). First, the instruction set and the datapath width are different things: you can use AVX-512 instructions with 128/256-bit vector registers if your CPU supports AVX512VL, which is nice because the instruction set has a bunch of good stuff, without incurring the harshest downclock penalties on some architectures (a different "power license" in Intel terms, which varies per core on newer systems).

But it's also very uarch dependent. Ice Lake has only a single FMA for instance, while some Skylake-SP cores will have 2 (there are no 2 FMA AVX-512 chips besides Skylake/Cascade at the moment.) The power usage is presumably lower than with 2 FMAs, but the reciprocal throughput is halved. If you only use 512-bit vectors on one core for instance, Ice Lake downclocks very very little, while it does not downclock at all with 256-bit vectors. So it's maybe worth doing. On the other hand, with all cores active, there is a penalty, but 512-bit vectors have no worse of a downclock penalty relative to 256-bit ones, so you are no worse off just using 512 bit instead of 256 bit if that's an option. This in effect will give you 2x the data throughput at the same power usage and same transition cost, which is a net win.[1] Thus, for any kind of batch work, where you're willing to pay for it, AVX-512 will probably be a free upgrade on Ice Lake. Ice Lake-SP will presumably change this up (2 FMAs at minimum) so who knows what will happen then.

However, all AVX transitions in general remain expensive (possibly millions of cycles dependent on many factors), thus you may not want to use them for very very short batch work, and you may want to think carefully about big vectors on older CPUs. It's certainly possible some kinds of code will perform worse in practice in some workloads (but this was always true e.g. with AVX2 on Haswell[2], where you do a full-chip downclock for even a single instruction, but nobody complained as much about that, presumably because nobody wrote lots of catchy blogs about it :)

And of course, the situation is different for AMD CPUs. Someone else can fill in the details.

TL;DR: AVX transitions are expensive and microarchitecture dependent. Ice Lake is pretty good at it, despite being mobile-only for now. Presumably this will get better from uarch refinements and improved process nodes at Intel, and at AMD too. Do your own experiments and validate your own hypotheses.

[1] https://twitter.com/trav_downs/status/1258226270527197190 [2] https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e0...

Cloudflare did write a blog about getting poorer performance with AVX-512 enabled in a mixed workload: https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...

Important to note that was a Xeon Silver, which is much more affected than Gold. More in-depth discussion here: https://lemire.me/blog/2018/08/13/the-dangers-of-avx-512-thr...

The value of AVX512 is more in the extra instructions it provides than in its (somewhat illusory) wider data path.

No doubt the 5nm or 3nm chips will actually deliver the 512-bit path. It must be a crushing burden to come up with actually helpful uses for another billion transistors. And it must be frustrating knowing those AVX512 data paths usually sit idle because compilers don't know how to use them.

I haven't seen any confirmation that AMD is implementing AVX-512 any time soon, not even on rumour sites. Where are you getting this info from?

I might be hallucinating.
