Beyond 8 inputs, there isn’t a pattern to follow to construct the optimal sorting network, and there are only “best known” sorting networks (but not proven to be optimal) for networks beyond 11 inputs.
Unintuitively, the optimal sorting network for, let’s say, 16 inputs is not just an iteration on the optimal sorting network for 15 inputs. So you can’t just build upon what was optimal for a smaller sorting network to get what is optimal for a larger sorting network.
I’ve spent more time than I probably should have several years ago experimenting with various genetic algorithms to construct sorting networks for 14, 15, and 16 inputs. It was a fun challenge and learning experience though.
Here is a little utility I wrote in Python to check whether a comparison network is a sorting network and can generate diagrams for them in svg format.
Are they used anywhere in modern applications? I can't think of a lot of applications that needs sufficient amount of sorting but perhaps somewhere in GPUs or NICs?
Beyond 1024 "threads", its difficult and inefficient for communication to take place.
GPU Mergepath seems like the coolest algorithm for sorting more than 1024 elements in 1024-way parallel.
In practice, I think only the 32-way or 64-way optimal sorting network was calculated, and then its all merge-path beyond that. (Merge 64 + 64 sorted lists == 128 sorted list)
Mergepath is a very efficient means of parallelizing the "Merge" step of mergesort's algorithm. There's a little branch divergence: but even with the branch divergence you're at O(n + p*log(n)) overall work per merge (assuming constant "p", number of processors, where p is much smaller than "n"... its effectively a O(n) algorithm and therefore asymptotically work efficient with original O(n) mergesort. Parallel Bitonic sort fails in this test)
IIRC, CUDA Thrust's Mergepath algorithm is designed to scale beyond one SM (so it can compute in parallel larger than 1024 elements), because the Mergepath algorithm has very little communication that gets the job done.
In contrast, an optimal sorting network of large size (ex: size 1-million) would require arbitrary communication between elements, which is just not how modern computers / GPUs work.
So not optimal. I guess I was mistaken.