
Inspecting C's qsort Through Animation - nayuki
http://nullprogram.com/blog/2016/09/05/
======
morecoffee
Frame Count != sort speed. Cache effects are going to dominate. How close
together the red dots are will more closely represent how fast it is.

~~~
kr7
> Cache effects are going to dominate.

I see this stated frequently, but I rarely see evidence to back it up.

I grabbed his code, removed the drawing function and tested it with perf [1].
(Using glibc's qsort).

First test was with 1 million elements in the array. That's 4 MB of data, so
it fits in L3 cache with room to spare.

Perf reports an average of 1.02 instructions/cycle with a cache miss rate of
2.829%.

Second test: 10 million elements (40 MB). Doesn't fit in cache.

Perf says: 0.95 instructions/cycle, cache miss rate of 37.352%.

Third test has 500 million elements (2 GB). Fits in RAM but not cache.

1.01 instructions/cycle and a cache miss rate of 66.555%.

If the CPU were stalling frequently, waiting for data, I would expect the
number of instructions/cycle to drop dramatically as the cache miss rate
increases. It drops slightly from 4 MB to 40 MB, but inexplicably increases
from 40 MB to 2 GB.

So I see two possibilities:

1. I'm dumb and misinterpreting the perf data.

2. The out-of-order core can work around data stalls so effectively for this
algorithm that the cache miss rate doesn't really matter.

[1]
[https://perf.wiki.kernel.org/index.php/Main_Page](https://perf.wiki.kernel.org/index.php/Main_Page)

~~~
terryf
I wonder if the speculatively executed instructions that are not used are
still counted as a part of the instructions/cycle number? Definitely
interesting results though.

------
saretired
Sedgewick did animations of quicksort and other algorithms back in the '80s.
He did his PhD on quicksort under Knuth, incidentally. Jon Bentley gave a
Google Talk about quicksort and inputs that drive it to O(n^2):
[https://www.youtube.com/watch?v=QvgYAQzg1z8](https://www.youtube.com/watch?v=QvgYAQzg1z8)
His production quicksort implementation with McIlroy is widely used. It's
in BSD and glibc (a non-recursive version--some here are saying that glibc
qsort is a mergesort, but at least in the code I've read that's not true.
Perhaps there's some confusion over the memory allocated by glibc qsort for
the pointer stack it creates to avoid recursion). The paper by Bentley and
McIlroy, "Engineering a Sort Function", fills in many details that Bentley
omits in his Google Talk.

------
mangix
Surprised that musl is the slowest. Anyone have any insight into why its qsort
was coded like this?

~~~
DSMan195276
I can wager a guess: on paper, smoothsort appears to be the better sorting
algorithm, with identical or better complexity than Quicksort and Mergesort
in basically every category. In addition, smoothsort works _better_ with data
that's close to being sorted.

Why then is it slower in this case, and not as widely used? Part of the reason
is that smoothsort is simply not very widely known, and thus not as many
people know how it works compared to Quicksort or Mergesort. Note that,
while complexity is important to take into account and looks good on paper,
complexity leaves off the 'constants', and those constants, notably heapsort's
(and thus smoothsort's) higher number of comparisons, are generally what make
it slower in practice.

All that noted, smoothsort (and heapsort) has the big advantages of being in-
place (Mergesort is not), requiring O(1) extra space in the worst case
(Quicksort can require up to O(n) extra space), and having a guaranteed upper
bound of O(n lg n) in the worst case (Quicksort can degenerate to O(n^2)
performance on bad data). This means that heapsort may not always outperform
Quicksort, but it gives better guarantees about speed and memory usage than
Quicksort (and much better than Mergesort, which requires a lot of extra
allocated memory). Smoothsort is essentially just a better heapsort, which
does even better when the list is close to already sorted.

~~~
masklinn
Couldn't the implementation use a cheap check for near-sortedness and only use
smoothsort for near-sorted sequences, or use an insertion sort to extend
sorted runs, as timsort does?
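A check like that can indeed be cheap. Here's a hypothetical sketch (not anything musl actually does): count ascending runs in one linear pass and branch on the result, with an arbitrary threshold for "nearly sorted":

```c
#include <stddef.h>

/* Count maximal ascending runs in one O(n) pass.
   A fully sorted array has 1 run; a reversed one has n. */
static size_t count_runs(const int *a, size_t n)
{
    if (n == 0)
        return 0;
    size_t runs = 1;
    for (size_t i = 1; i < n; i++)
        if (a[i] < a[i - 1])
            runs++;
    return runs;
}

/* Example policy: treat the input as nearly sorted when there are
   few runs relative to n. The n/16 threshold is an assumption for
   illustration, not a tuned value. */
static int nearly_sorted(const int *a, size_t n)
{
    return count_runs(a, n) <= n / 16 + 1;
}
```

An adaptive qsort could run this check first and dispatch to a run-merging strategy only when `nearly_sorted` returns true, paying one extra scan on all other inputs.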

------
shric
[https://www.youtube.com/user/udiprod/videos](https://www.youtube.com/user/udiprod/videos)
has some fun/pretty sorting animations that clearly illustrate comparisons and
swaps separately.

------
forrestthewoods
I'd love to see this with different implementations of C++ std::sort.

------
kylepdm
Any particular reason none of the implementations just pick a random pivot?
That's usually good enough to prevent the O(n^2) behavior that the diet libc
implementation runs into.

~~~
xyzzyz
The sort in the standard library should have predictable performance, so
"usually" is not good enough. Moreover, to pick a random pivot you need a good
source of randomness, and if you don't have one, an attacker could often
predict the sequence of pivot choices and prepare an input that ensures the
"random" choice of pivot results in quadratic performance.
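For contrast, the deterministic pivot choice most libc quicksorts actually use is median-of-three (Bentley-McIlroy's variant, found in BSD qsort, extends it to a "ninther", a median of three medians, for large arrays). A sketch of the basic selection:

```c
#include <stddef.h>

/* Return the index of the median of a[lo], a[mid], a[hi].
   Deterministic, so its worst cases can still be constructed,
   but it defuses the classic sorted/reversed killer inputs. */
static size_t median3(const int *a, size_t lo, size_t mid, size_t hi)
{
    if (a[lo] < a[mid]) {
        if (a[mid] < a[hi])
            return mid;                     /* lo < mid < hi */
        return a[lo] < a[hi] ? hi : lo;     /* mid is largest */
    }
    if (a[lo] < a[hi])
        return lo;                          /* mid <= lo < hi */
    return a[mid] < a[hi] ? hi : mid;       /* lo is largest */
}
```

The pivot index returned here would then be swapped into place before partitioning; the point is that it needs no randomness source at all.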

------
rhaps0dy
Why does the BSD implementation compare with copies of the data outside the
original array? Is there a performance benefit to doing this?

~~~
MaxLeiter
EDIT: I skimmed the article and my previous answer was completely wrong.

The BSD implementation is quicksort, which doesn't have a duplicate (AKA
auxiliary) array. Their implementation of mergesort has an auxiliary/duplicate
array for the actual merging - that's common. In-place mergesorts exist, but
the effort isn't usually worth it.

You can, however, eliminate the need for copying to the duplicate array (which
they may do) by alternating the roles of the two arrays: one pass takes input
from the actual array and puts the sorted output in the duplicate, and the
next takes input from the duplicate and puts it back into the original. (I
personally don't know the details, but an implementation can be found here:
[http://algs4.cs.princeton.edu/22mergesort/MergeX.java.html](http://algs4.cs.princeton.edu/22mergesort/MergeX.java.html))
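The alternating-buffer trick can be sketched in C (my own reading of the MergeX idea, not a translation of the linked Java code): start with the auxiliary array as an exact copy, then swap the source/destination roles at each level of recursion, so the merge output always lands where the next level up expects it and no copy-back pass is needed:

```c
#include <stdlib.h>
#include <string.h>

/* Merge sorted halves src[lo..mid) and src[mid..hi) into dst. */
static void merge(const int *src, int *dst, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid;
    for (size_t k = lo; k < hi; k++) {
        if (i < mid && (j >= hi || src[i] <= src[j]))
            dst[k] = src[i++];
        else
            dst[k] = src[j++];
    }
}

/* Sort the range [lo, hi): input is read from src, and the sorted
   result lands in dst. The recursive calls swap the two roles, so
   the sorted halves end up in src, ready to be merged into dst. */
static void msort(int *src, int *dst, size_t lo, size_t hi)
{
    if (hi - lo < 2) {
        if (hi - lo == 1)
            dst[lo] = src[lo];
        return;
    }
    size_t mid = lo + (hi - lo) / 2;
    msort(dst, src, lo, mid);
    msort(dst, src, mid, hi);
    merge(src, dst, lo, mid, hi);
}

void merge_sort(int *a, size_t n)
{
    int *aux = malloc(n * sizeof *aux);
    if (!aux)
        return;
    memcpy(aux, a, n * sizeof *a);  /* both arrays start identical */
    msort(aux, a, 0, n);            /* final result lands in a */
    free(aux);
}
```

The one-time `memcpy` is what makes the role-swapping legal: at every level, whichever array is currently "source" really does hold the data for that subrange.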

------
wfunction
Why the hell does glibc use quicksort as a fallback with no apparent
protection against the O(n^2) case? Aren't there easy in-place O(n log n)
sorts? Like heapsort?

~~~
matslina
Glibc falls back to quicksort from merge sort (its default algorithm) when
there's too little free memory available to accommodate merge sort's O(n)
requirement. Quicksort only uses O(log n) additional memory, making it a good
option in that sense. O(1) would of course be better, but even for huge inputs
that logarithm makes the memory consumption a non-issue in most cases.
Heapsort does give you O(1) memory, but also comes with poorer cache locality
and overall performance than quicksort, making it a not so compelling option.
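The heapsort being weighed here can be sketched generically (a textbook version, not glibc's code); the long jumps from index `root` to `2*root + 1` in sift-down are exactly what gives it its poor cache locality:

```c
#include <stddef.h>

/* Restore the max-heap property for the subtree rooted at start,
   within the prefix a[0..end). */
static void sift_down(int *a, size_t start, size_t end)
{
    size_t root = start;
    while (2 * root + 1 < end) {
        size_t child = 2 * root + 1;    /* far from root: cache-hostile */
        if (child + 1 < end && a[child] < a[child + 1])
            child++;                    /* pick the larger child */
        if (a[root] >= a[child])
            return;
        int tmp = a[root];
        a[root] = a[child];
        a[child] = tmp;
        root = child;
    }
}

/* In-place O(n lg n) worst case, O(1) extra space. */
void heap_sort(int *a, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = n / 2; i-- > 0; )   /* build the max-heap */
        sift_down(a, i, n);
    for (size_t i = n - 1; i > 0; i--) {
        int tmp = a[0];                 /* move current max to the end */
        a[0] = a[i];
        a[i] = tmp;
        sift_down(a, 0, i);
    }
}
```

It is exactly the guarantees in the comments (worst-case bound, constant extra space) that make it a tempting fallback, and exactly the strided access pattern that makes it slower than quicksort in practice.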

------
gabrielcsapo
This is really awesome! I love seeing code visualized!

