Hacker News new | past | comments | ask | show | jobs | submit login
Sorting networks and their applications (1968) [pdf] (kent.edu)
27 points by tjalfi 27 days ago | hide | past | favorite | 7 comments

I find sorting networks to be fascinating.

Beyond 8 inputs, there isn’t a pattern to follow to construct the optimal sorting network, and there are only “best known” sorting networks (but not proven to be optimal) for networks beyond 11 inputs.

Unintuitively, the optimal sorting network for, let’s say, 16 inputs is not just an iteration on the optimal sorting network for 15 inputs. So you can’t just build upon what was optimal for a smaller sorting network to get what is optimal for a larger sorting network.

I’ve spent more time than I probably should have several years ago experimenting with various genetic algorithms to construct sorting networks for 14, 15, and 16 inputs. It was a fun challenge and learning experience though.

Here is a little utility I wrote in Python to check whether a comparison network is a sorting network and can generate diagrams for them in svg format.


Sorting networks are one of those cool algorithms I learned about once and then never used again despite some of the theoretical advantages. (This is true about most sorting algorithms FWIW)

Are they used anywhere in modern applications? I can't think of a lot of applications that needs sufficient amount of sorting but perhaps somewhere in GPUs or NICs?

Yes, I think most GPU sorting is based on sorting networks as there are no data dependent branches (I mean, if you exclude the exchange)

Only within a CUDA block / OpenCL workgroup of ~1024 items maximum.

Beyond 1024 "threads", its difficult and inefficient for communication to take place.


GPU Mergepath seems like the coolest algorithm for sorting more than 1024 elements in 1024-way parallel.

In practice, I think only the 32-way or 64-way optimal sorting network was calculated, and then its all merge-path beyond that. (Merge 64 + 64 sorted lists == 128 sorted list)

Mergepath is a very efficient means of parallelizing the "Merge" step of mergesort's algorithm. There's a little branch divergence: but even with the branch divergence you're at O(n + p*log(n)) overall work per merge (assuming constant "p", number of processors, where p is much smaller than "n"... its effectively a O(n) algorithm and therefore asymptotically work efficient with original O(n) mergesort. Parallel Bitonic sort fails in this test)

IIRC, CUDA Thrust's Mergepath algorithm is designed to scale beyond one SM (so it can compute in parallel larger than 1024 elements), because the Mergepath algorithm has very little communication that gets the job done.

In contrast, an optimal sorting network of large size (ex: size 1-million) would require arbitrary communication between elements, which is just not how modern computers / GPUs work.

Please if you know where to find the optimal 64-way sorting network) or even better, selecting network, I'm yugely interested.

Seems like it's just a bitonic network sort over 32 or 64 elements at the bottom.

So not optimal. I guess I was mistaken.

Oh :-) thanks anyway

Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact