
Nibble Sort - caf
http://www.hanshq.net/nibble_sort.html
======
davidtgoldblatt
Just to emphasize the truly remarkable thing here: the winning (SIMD) solution
takes about _1ns per word sorted_ \-- just over 3 machine cycles to sort 16
nibbles. Note that even extracting a nibble from a word in the naive manner
would take about the same amount of time. The speedup from using SIMD
registers is tremendous, and should be a lesson to anyone who spends their
time micro-optimizing hot functions.

------
jcr
The hanshq.net story is just one of three blog entries that John Regehr linked
to yesterday:

[http://blog.regehr.org/archives/1242](http://blog.regehr.org/archives/1242)

The other two are:

[http://jbeekman.nl/blog/2015/01/nibble-
sort/](http://jbeekman.nl/blog/2015/01/nibble-sort/)

[http://belkadan.com/blog/2015/05/Nibblesort/](http://belkadan.com/blog/2015/05/Nibblesort/)

------
serkanyersen
This kind of blog posts are super interesting and depressing at the same time
:) makes me question my programing skills. Almost all of the things mentioned
here are over my head.

Thanks for sharing though, very interesting

~~~
nly
Sometimes programming feels like: "Interesting work, useful work. Pick one".
People who get to do both are privileged indeed, the rest of us sit around
feeling either bored or useless.

I'm thankful at least, in contrast to several friends who suffer from
depression and are not prone to nerd sniping, that, for me, these two states
are almost mutually exclusive.

~~~
stephengillie
"May you live in boring times."

------
iso8859-1
I have a short Python script to generate sorting networks (and diagrams like
the one found in the post): [https://github.com/ysangkok/python-sorting-
networks/blob/mas...](https://github.com/ysangkok/python-sorting-
networks/blob/master/b.py)

I was a bit disappointed to see no code in this blog post for the algorithms
used to generating the pairs. Is there are templated C++ library for this?

~~~
hansw2000
I don't think the network in this case was generated by an algorithm.
According to Knuth, this network was discovered by M. W. Green and it's still
the smallest (60 comparators) 16-input network known.

------
userbinator
Upon seeing the description of the problem, my first thought was "use a
sorting network" too, although I'd probably try to use fewer registers...

~~~
ggreer
For this problem, I wouldn't worry too much about register usage. The x86-64
ISA has 16 GP and 16 FP registers, but modern implementations use register
renaming[1] to give many more under the hood. For example, Haswell has 168 GP
and 168 FP entries in its register file.[2] This contest was purely about
speed, so one should favor using as many registers as possible when making any
time-memory tradeoffs.

1\.
[https://en.wikipedia.org/wiki/Register_renaming](https://en.wikipedia.org/wiki/Register_renaming)

2\. [http://www.realworldtech.com/haswell-
cpu/3/](http://www.realworldtech.com/haswell-cpu/3/)

------
amelius
How does it scale?

~~~
ploxiln
These solutions do not scale at all. Even with tweaks, they could not sort 18
nibbles (or 32 nibbles). But they are very interesting, and these kinds of
tricks do find some practical use in things like video codecs.

~~~
dragontamer
But they can sort 16 nibbles twice, and then you can use merge-sort between
the two sets of nibbles.

Which would scale at O(n*log(n)). Which is asymptotically optimal IIRC.

