just to make you even more paranoid - zen2/4 (and probably e-cores and zen5) can rename memory operands too (you start to do strange things when Intel limits you to 16 registers)!
so today it needs to go through SSD or maybe the network LOL. but for real conspiracy, we should use paper and pencil
indeed, Highway is the popularity leader: it implements lots of SIMD operations and supports lots of hardware, including SVE/RVV with runtime SIMD width, although IMHO it has a somewhat longer learning curve
high-performance sorting algos do either merging or partitioning, i.e. you merge R input streams into one, or split one input stream into R (for quicksort, radix sort and samplesort).
1. For a merge sort of N elements, you have to perform log(N)/log(R) passes
2. For samplesort - C*log(N)/log(R), where C depends on the distribution of your data; there are no strict guarantees
3. For radix sort of K-byte elements - exactly K passes (256 ways is indeed optimal according to my tests), which is usually more than the previous value
While mergesort looks preferable since it uses the smallest and a fixed number of passes, the merging code itself is less efficient: there are not many independent CPU operations, making it slower than samplesort and especially radix sort on small inputs.
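The ILP problem is visible even in a branchless merge kernel (my own sketch, not from any library): each iteration's i/j update depends on the previous comparison, so iterations form a serial chain the CPU can't overlap.

```cpp
#include <cstddef>
#include <vector>

// Branchless 2-way merge. Correct, but note the loop-carried dependency:
// out[k] and the i/j increments all hang off the comparison from this
// iteration, so consecutive iterations cannot execute in parallel.
std::vector<int> merge2(const std::vector<int>& x, const std::vector<int>& y) {
    std::vector<int> out(x.size() + y.size());
    size_t i = 0, j = 0;
    for (size_t k = 0; k < out.size(); ++k) {
        // take from x if y is exhausted, or x has the smaller head element
        bool take_x = (j >= y.size()) || (i < x.size() && x[i] <= y[j]);
        out[k] = take_x ? x[i] : y[j];
        i += take_x;      // serial dependency chain through i and j
        j += !take_x;
    }
    return out;
}
```

Compare that with radix sort's scatter loop, where every element's bucket lookup is independent of the previous element's.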
So, it seems that the best strategy combines two different levels: for larger blocks you absolutely need to minimize the number of passes [over memory] and employ multithreading, while for smaller blocks you need to minimize the number of CPU operations and increase ILP.
Today, the two well-recognized algos are IPS4O for larger blocks and Google's SIMD quicksort for smaller ones.
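As a structural sketch of that two-level strategy (my own toy, with `std::partition` standing in for IPS4O's k-way partitioning and `std::sort` standing in for the SIMD small-sort; the threshold is an assumption, not a tuned value):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Two-level sort: partition large blocks with few passes over memory,
// then hand cache-sized blocks to a fast in-cache sorter.
void two_level_sort(std::vector<int>::iterator lo,
                    std::vector<int>::iterator hi) {
    const std::ptrdiff_t kSmall = 1 << 12;   // "fits in cache" threshold (assumed)
    if (hi - lo <= kSmall) {
        std::sort(lo, hi);                   // stand-in for the SIMD quicksort
        return;
    }
    int pivot = lo[(hi - lo) / 2];           // naive pivot choice, for illustration
    // 3-way split: < pivot | == pivot | > pivot (guarantees progress on duplicates)
    auto mid  = std::partition(lo, hi, [&](int x) { return x < pivot; });
    auto mid2 = std::partition(mid, hi, [&](int x) { return x == pivot; });
    two_level_sort(lo, mid);
    two_level_sort(mid2, hi);
}
```

The real systems differ mainly in the top level: IPS4O splits into many buckets per pass (so it needs far fewer passes than this 2-way toy) and does it in-place and in parallel.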
we set CX only once and then use it 10000 times. the problem is not the slow calculation of CX per se, but the slow shift once we've got CX from the renamer