Hacker News new | past | comments | ask | show | jobs | submit login
Nibble Sort (hanshq.net)
118 points by caf on July 21, 2015 | hide | past | web | favorite | 14 comments

Just to emphasize the truly remarkable thing here: the winning (SIMD) solution takes about 1ns per word sorted -- just over 3 machine cycles to sort 16 nibbles. Note that even extracting a nibble from a word in the naive manner would take about the same amount of time. The speedup from using SIMD registers is tremendous, and should be a lesson to anyone who spends their time micro-optimizing hot functions.

The hanshq.net story is just one of three blog entries that John Regehr linked to yesterday:


The other two are:



This kind of blog posts are super interesting and depressing at the same time :) makes me question my programing skills. Almost all of the things mentioned here are over my head.

Thanks for sharing though, very interesting

Sometimes programming feels like: "Interesting work, useful work. Pick one". People who get to do both are privileged indeed, the rest of us sit around feeling either bored or useless.

I'm thankful at least, in contrast to several friends who suffer from depression and are not prone to nerd sniping, that, for me, these two states are almost mutually exclusive.

"May you live in boring times."

There's an ongoing Algorithms course on Coursera right now!


edit: found another one, also happening now (can't believe I missed it): https://www.coursera.org/course/algo

I have a short Python script to generate sorting networks (and diagrams like the one found in the post): https://github.com/ysangkok/python-sorting-networks/blob/mas...

I was a bit disappointed to see no code in this blog post for the algorithms used to generating the pairs. Is there are templated C++ library for this?

I don't think the network in this case was generated by an algorithm. According to Knuth, this network was discovered by M. W. Green and it's still the smallest (60 comparators) 16-input network known.

Upon seeing the description of the problem, my first thought was "use a sorting network" too, although I'd probably try to use fewer registers...

For this problem, I wouldn't worry too much about register usage. The x86-64 ISA has 16 GP and 16 FP registers, but modern implementations use register renaming[1] to give many more under the hood. For example, Haswell has 168 GP and 168 FP entries in its register file.[2] This contest was purely about speed, so one should favor using as many registers as possible when making any time-memory tradeoffs.

1. https://en.wikipedia.org/wiki/Register_renaming

2. http://www.realworldtech.com/haswell-cpu/3/

How does it scale?

These solutions do not scale at all. Even with tweaks, they could not sort 18 nibbles (or 32 nibbles). But they are very interesting, and these kinds of tricks do find some practical use in things like video codecs.

But they can sort 16 nibbles twice, and then you can use merge-sort between the two sets of nibbles.

Which would scale at O(n*log(n)). Which is asymptotically optimal IIRC.

> How does it scale?

what do you mean ? the task asks to sort 4bit nibbles in 64bit words, so you optimize for that problem space...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact