

Fastest sort of fixed length 6 int array - g-garron
http://stackoverflow.com/questions/2786899/fastest-sort-of-fixed-length-6-int-array/

======
psykotic
His xor-laden branchless min/max code in the fastest version so far,
sort6_sorting_network_v4, is some seriously janky shit. It's making it harder
for the compiler to generate good platform-specific machine code. That initial
impression was confirmed after looking at the x64 assembly code generated by
gcc -O3. It was awful. I rewrote his min/max macros to use plain old ternary
operators. The generated assembly code looked a lot better, and my code beats
his by 2-2.5x on my late 2008 MacBook Pro:

    
    
        $ gcc -O3 sortnet.c -o sortnet && ./sortnet
        sort6_sorting_network_v4: time = 988 cycles
        sort6_sorting_network_per: time = 390 cycles
    

That win is without even trying. Just from eyeballing those cycle counts, I'd
expect you could wring a lot more speed out of this. My coworker Fabian
(@rygorous) immediately suggested using the old difference trick for
SORT2(x,y) rather than calculating min and max independently. That brings it
down further by about 50% in my benchmark. Fabian estimates that's still about
2.5 times off from optimal: "Best bit-magic solution is ~5 cycle dep chain,
stupid simple mov / cmp / cmovg / cmovg boils down to 2."
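
The comments don't spell the difference trick out, but one common reading of it is: compute (x - y) once and reuse its sign mask to produce both min and max, instead of two independent comparisons. A hedged sketch (this is my guess at the trick, not Fabian's actual code; it assumes 32-bit int, arithmetic right shift on signed values, and that x - y doesn't overflow):

```c
#include <assert.h>

/* Sketch of a "difference trick" SORT2: one subtraction feeds both the
   min and the max. d_ is (x - y) when x < y, else 0. */
#define SORT2(x, y) do {                          \
    int d_  = ((x) - (y)) & (((x) - (y)) >> 31);  \
    int lo_ = (y) + d_;   /* min(x, y) */         \
    (y) = (x) - d_;       /* max(x, y) */         \
    (x) = lo_;                                    \
} while (0)

/* Small wrapper so the macro is easy to exercise: sorts a[0], a[1]. */
static void sort2(int *a)
{
    SORT2(a[0], a[1]);
}
```

The dependency chain is subtract, shift, and, add, which is short enough that it's plausible as the ~50% win described above.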

His benchmarking harness also leaves a lot to be desired. But that's another
issue. He also says that this is eventually intended for use on a GPU, but
he's ranking the implementations by performance on a deeply pipelined
out-of-order CPU. Modern GPU cores have some minimal scoreboarding, but none
of them do the full dance of register renaming and decoupled execution units.

------
5hoom
This is the sort of stuff that's great to read because you can't help but
learn something.

I'd never heard of sorting networks or oblivious sorting before, and
it's good to see some answers other than "Quicksort!"

~~~
giardini
Surely by now you have read that quicksort's worst-case behavior is O(n^2)?

~~~
eru
If you use a randomized implementation the worst-case behaviour is expected to
be O(n log n).

(Worst-case in the input, averaged over your random bits.)
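
To make the point concrete, here's a minimal sketch of what "randomized implementation" means (illustrative only; `rand()` is used for brevity, a real implementation would want a better RNG): the pivot is chosen uniformly at random, so the expected running time is O(n log n) for every input, even though any single run can still degenerate.

```c
#include <stdlib.h>

/* Randomized quicksort (Hoare-style partition around a randomly chosen
   pivot value). Recurses on [lo, j] and [i, hi]. */
static void quicksort(int *a, int lo, int hi)
{
    if (lo >= hi)
        return;
    int p = a[lo + rand() % (hi - lo + 1)];  /* random pivot */
    int i = lo, j = hi;
    while (i <= j) {
        while (a[i] < p) i++;
        while (a[j] > p) j--;
        if (i <= j) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++; j--;
        }
    }
    quicksort(a, lo, j);
    quicksort(a, i, hi);
}
```

With a fixed first-element pivot, an already-sorted input triggers the O(n^2) case every time; with the random pivot, no particular input does.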

~~~
giardini
While using a randomized pivot implementation decreases the _likelihood_ of
encountering a worst case, quicksort's worst-case behavior remains O(n^2). See
the last pages of

<http://www.cs.duke.edu/~reif/courses/alglectures/skiena.lectures/lecture5.pdf>

I don't know that I've actually experienced a worst-case quicksort situation
but I do know of a particular job using quicksort that ran far, far too long -
indeed it never completed the quicksort step, because...

After 6 hours of sorting (all previous runs had completed the sort step in 40
minutes or less) we finally killed it, changed the sort implementation, fired
it up again and never had the same problem again. It was an enterprise time-
critical job running on a mainframe that simply had to complete within 24
hours of start.

~~~
eru
Depends on your definition of worst-case behaviour.

------
skimbrel
Wow, thanks for the link. I'd never heard of sorting networks before. The
first thing I thought of was "use an XOR swap to cut memory accesses"...

~~~
rpearl
If you do the naive implementation: int temp = x; x = y; y = temp;

Then a sufficiently smart compiler can decide to assign, say, %eax to x and
%ebx to y, and then just rename its notion of registers, and after the swap
just begin using %ebx for x and %eax for y. A swap with no copies at all!

(It won't do this in _every_ case, and it depends on context, but such an
optimization is possible... but not with xor swaps.)
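
For reference, the two swap idioms under discussion look like this (a sketch; the function names are mine):

```c
/* Plain temporary swap: a compiler can often eliminate it entirely by
   renaming which register holds which variable. */
static void swap_temp(int *x, int *y)
{
    int temp = *x;
    *x = *y;
    *y = temp;
}

/* Xor swap: forces a real three-instruction dependency chain, and has
   the extra hazard that it zeroes the value when x and y alias. */
static void swap_xor(int *x, int *y)
{
    *x ^= *y;
    *y ^= *x;
    *x ^= *y;
}
```

The aliasing hazard is another reason the "clever" version loses: `swap_xor(&a, &a)` sets `a` to 0, while `swap_temp(&a, &a)` is harmless.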

~~~
eru
At least xor swaps would need an even smarter compiler.

------
hxa7241
To be a little more speculative... if it is on GPU, is there some way of
using rendering operations? The 'hidden-surface problem' in graphics has been
described as basically a sorting problem: so represent each number as a
triangle Z... well, maybe it would not be very fast, but you could do a few
million at once!

~~~
pyrtsa
Seven years ago, maybe. But today's GPUs do it easily in their programmable
pipeline (shaders).

And where conditionals are costly, you can use tricks like:

    
    
        float temp = min(x, y);
        x = max(x, y);
        y = temp;
    

instead of the conditional swapping operations, to make the whole network
sorting algorithm deterministic. (Edit: notice that min() and max() here are
hardware operations.)
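
To show how that min/max trick composes into a whole network, here's a sketch in plain C of a branch-free 6-element sort built from the 12-comparator network discussed in the linked question (sort three, sort three, merge); on a GPU the ternaries would map to hardware min/max:

```c
/* Branch-free 6-element sorting network: each SWAP leaves
   d[i] = min, d[j] = max, with no data-dependent control flow. */
static void sort6_network(int d[6])
{
#define MIN_(a, b) ((a) < (b) ? (a) : (b))
#define MAX_(a, b) ((a) < (b) ? (b) : (a))
#define SWAP(i, j) do {                 \
        int lo_ = MIN_(d[i], d[j]);     \
        d[j] = MAX_(d[i], d[j]);        \
        d[i] = lo_;                     \
    } while (0)
    SWAP(1, 2); SWAP(0, 2); SWAP(0, 1);  /* sort d[0..2] */
    SWAP(4, 5); SWAP(3, 5); SWAP(3, 4);  /* sort d[3..5] */
    SWAP(0, 3); SWAP(1, 4); SWAP(2, 5);  /* merge the two halves */
    SWAP(2, 4); SWAP(1, 3); SWAP(2, 3);
#undef SWAP
#undef MAX_
#undef MIN_
}
```

The same 12 comparators run regardless of the input, which is exactly what makes the network "oblivious" and shader-friendly.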

------
drv
The premise is interesting, but why compare results running on general-purpose
CPUs when the goal is to run on a GPU? At this level of optimization,
comparing two vastly different architectures and compilers would make any
results pretty irrelevant.

------
mrcapers
Could anyone recommend some reading on theory and implementation of sorting
networks?

~~~
route66
Knuth's TAOCP covers it in vol. III of recent editions.
(<http://www-cs-staff.stanford.edu/~uno/taocp.html>)

------
pmiller2
Related: Minimum-comparison sorting
(<http://www.mimuw.edu.pl/~marpe/research/index.html#MCS>)

